Master’s Thesis Presentation • Bioinformatics • Spectrum and Retention Time Prediction for N-Glycopeptides Using Deep Learning

Wednesday, August 9, 2023 9:30 am - 10:30 am EDT (GMT -04:00)

Please note: This master’s thesis presentation will take place online.

Shuyang Zhang, Master’s candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Ming Li

Sequencing proteins and glycans has important clinical applications, as glycosylation has been shown to play a significant role in cellular communication and immune response. Certain glycans are linked to the diagnosis of cancer as well as to targeted immunotherapy. Mass spectrometry is a powerful tool for gaining insight into peptide sequences and glycan structures through database search, spectral library search, or de novo sequencing. Spectrum and retention time prediction using deep learning has gained popularity in studies on non-glycosylated peptides and has been shown to improve database search results via rescoring. This thesis proposes deep learning models to predict spectra and retention times for N-glycopeptides and then discusses the applications of these models to glycopeptide sequencing.

Chapter 3 presents a graph deep learning model that predicts the fragment ion intensities of observed spectra and defines a spectrum representation for glycan fragments with up to three cleavages. The spectrum prediction model has a median cosine similarity of 0.921, which is 20% higher than previous attempts at glycopeptide spectrum prediction.
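As a rough illustration of the evaluation metric named above, the sketch below computes the cosine similarity between a predicted and an observed fragment intensity vector, then takes the median over several spectra. The function names and the toy intensity values are hypothetical and are not taken from the thesis; the thesis may preprocess or normalize intensities differently before comparing them.

```python
import math
import statistics


def cosine_similarity(predicted, observed):
    """Cosine of the angle between two equal-length intensity vectors."""
    if len(predicted) != len(observed):
        raise ValueError("intensity vectors must have the same length")
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        # Assumption: an all-zero spectrum is treated as having no similarity.
        return 0.0
    return dot / (norm_p * norm_o)


def median_cosine_similarity(pairs):
    """Median similarity over (predicted, observed) spectrum pairs."""
    return statistics.median(cosine_similarity(p, o) for p, o in pairs)


# Toy data (hypothetical): each vector holds intensities for the same
# ordered set of fragment ions.
toy_pairs = [
    ([1.0, 0.0, 0.5], [1.0, 0.0, 0.5]),  # identical spectra
    ([1.0, 0.0], [0.0, 1.0]),            # no shared fragment intensity
    ([1.0, 1.0], [1.0, 0.0]),            # partial overlap
]
print(median_cosine_similarity(toy_pairs))
```

A median is less sensitive than a mean to a small number of poorly predicted spectra, which is one plausible reason to report it as the summary statistic.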

For retention time prediction in Chapter 4, we propose a model with two parallel encoders, one for the peptide input and one for the glycan input, and apply transfer learning to the sequence encoder. The retention time prediction model has a Pearson correlation of 1.0, which is higher than the 0.98 and 0.96 reported in previous attempts. We also introduce the 95th percentile delta as an evaluation metric and discuss the interpretability of our model.
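The two retention time metrics mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's definition: here the 95th percentile delta is taken to be the 95th percentile of the absolute difference between predicted and observed retention times, computed with linear interpolation between ranks. Some works instead report a window twice this wide. The function names and the toy values are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def percentile_delta(predicted, observed, q=95.0):
    """q-th percentile of |predicted - observed| (assumed definition)."""
    errors = sorted(abs(p - o) for p, o in zip(predicted, observed))
    if not errors:
        raise ValueError("no retention times given")
    # Linear interpolation between the two closest ranks.
    rank = (q / 100.0) * (len(errors) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return errors[lower] + fraction * (errors[upper] - errors[lower])


# Toy retention times in minutes (hypothetical values).
observed_rt = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted_rt = [10.5, 19.0, 30.2, 41.5, 50.1]
print(pearson_correlation(predicted_rt, observed_rt))
print(percentile_delta(predicted_rt, observed_rt))
```

The two metrics answer different questions: correlation measures how well predictions track the overall trend, while a percentile of the absolute error describes how far off an individual prediction is likely to be, in the same units as the retention time.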

Finally, in Chapter 5, we apply our spectrum and retention time prediction models in glycopeptide sequencing pipelines, including database search and de novo sequencing. We show that our models improve identification through rescoring and have the potential to be used as a filter for false positives. We also demonstrate that they improve de novo identification when used in the scoring function.