Using machine learning to discover the structure of glycoproteins, diverse molecules that perform a myriad of functions in organisms

Glycoproteins are a diverse class of molecules in which one or more sugar molecules — what are known as glycans — are attached to protein molecules. More than half of all proteins in human cells are thought to be glycoproteins.

Glycans affect the folding and stability of protein molecules and play a vital role in regulating and fine-tuning their functions, including their transport within cells, which allows proteins to perform a much wider range of functions. Glycoproteins mediate countless cellular and physiological processes, including the ability of immune cells to find and target pathogens and cancer cells in the body.

But if glycosylation — the term for adding a glycan to an organic molecule — does not occur as it should within cells, it can have serious deleterious effects on the growth, development and health of an individual, explains Qianqiu Zhang, a PhD student at the Cheriton School of Computer Science.

Qianqiu Zhang is a doctoral candidate who studies bioinformatics at the Cheriton School of Computer Science. She is advised by University Professor Ming Li, who holds the Canada Research Chair in Bioinformatics.

“Even if two peptides — the building blocks of proteins — have exactly the same sequence of amino acids, they may not function identically if their glycan structures are different,” Qianqiu explains. “We need to determine both the sequence of amino acids in the peptide and the structure of the glycans attached to them.”

With this goal in mind, Qianqiu, her advisor at the Cheriton School of Computer Science, and colleagues at Waterloo-based Bioinformatics Solutions developed GlycanFinder. This database search and sequencing software tool identifies known glycopeptides and discovers new ones using data from mass spectrometry, an analytical technique that measures the mass and abundance of molecules in a sample.

GlycanFinder starts with mass spectrometry data from whole glycopeptides. The results from the mass spectrometer are presented as a mass spectrum, a graph that shows the abundance of the ions detected in a sample according to their mass-to-charge ratio. The goal of the software is to use the mass spectrum data to determine the identity and order of the amino acids in the peptide and the identity and arrangement of the sugar molecules in the glycan.

“GlycanFinder’s database search is relatively straightforward,” Qianqiu said. “Say I’m given mass spec data for a glycopeptide sample. GlycanFinder matches the mass spectrum of the sample with the known mass of peptides and glycans in a database, thereby deducing its structure.”

Simply put, the database search engine finds the best-matching peptides and glycans by conducting first a peptide-based search and then a glycan-based search against known proteins and glycans.
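To make the idea of matching by mass concrete, here is a minimal Python sketch. It is not GlycanFinder's search engine, which works on full spectra in two stages; it only checks whether an observed glycopeptide mass equals the mass of a known peptide plus a known glycan, within a tolerance. The residue masses are standard monoisotopic values, while the peptide names, peptide masses, observed mass and the function names are hypothetical and chosen for illustration.

```python
from itertools import product

# Monoisotopic residue masses (Da) of common monosaccharides.
RESIDUE_MASS = {
    "Hex": 162.0528,     # hexose, e.g. mannose or galactose
    "HexNAc": 203.0794,  # N-acetylhexosamine, e.g. GlcNAc
    "Fuc": 146.0579,     # fucose (a deoxyhexose)
    "NeuAc": 291.0954,   # N-acetylneuraminic acid (a sialic acid)
}


def glycan_mass(composition):
    """Sum the residue masses of a glycan given as {sugar: count}."""
    return sum(RESIDUE_MASS[sugar] * n for sugar, n in composition.items())


def match_glycopeptide(observed_mass, peptides, glycans, tol_ppm=10.0):
    """Return (peptide, glycan, error in ppm) for every pair whose combined
    mass lies within tol_ppm of the observed mass, best match first."""
    hits = []
    for (pep, pep_mass), (name, comp) in product(peptides.items(), glycans.items()):
        # Neutral peptide mass plus glycan residue masses gives the
        # neutral glycopeptide mass (one water is lost on attachment).
        theoretical = pep_mass + glycan_mass(comp)
        error_ppm = (observed_mass - theoretical) / theoretical * 1e6
        if abs(error_ppm) <= tol_ppm:
            hits.append((pep, name, round(error_ppm, 2)))
    return sorted(hits, key=lambda h: abs(h[2]))


# Hypothetical database entries, for illustration only.
peptides = {"PEPTIDE_A": 1500.7000, "PEPTIDE_B": 1800.9000}
glycans = {
    "HexNAc2Hex5": {"HexNAc": 2, "Hex": 5},
    "HexNAc4Hex5NeuAc2": {"HexNAc": 4, "Hex": 5, "NeuAc": 2},
}

observed = 2717.1228  # hypothetical neutral mass from a spectrum
hits = match_glycopeptide(observed, peptides, glycans)
print(hits)
```

A real search would also compare fragment peaks in the spectrum, because many different peptide and glycan pairs can share nearly the same total mass.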

However, discovering new glycopeptides — what’s known as de novo sequencing — is more challenging, Qianqiu said. “This is partly because glycans often are not linear like amino acids in a peptide. Structurally, glycans are like a tree with multiple branches and leaves. These tree-like structures can be attached to different amino acids of the peptide. For example, what are known as N-linked glycans are attached to the nitrogen atom in asparagine.”
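The tree picture can be written down directly as a data structure. The Python sketch below is an illustration only, not GlycanFinder's internal representation: the `Sugar` class and the helper functions are hypothetical names. It builds the five-sugar core shared by N-linked glycans, in which two N-acetylglucosamine (GlcNAc) units lead to a mannose (Man) that splits into two mannose branches.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sugar:
    """One monosaccharide in a glycan tree, with the sugars attached to it."""
    name: str
    children: List["Sugar"] = field(default_factory=list)

    def add(self, child):
        """Attach a child sugar and return it so calls can be chained."""
        self.children.append(child)
        return child


def count_residues(node):
    """Count every sugar in the tree rooted at node."""
    return 1 + sum(count_residues(child) for child in node.children)


def leaves(node):
    """List the names of the sugars at the tips of the branches."""
    if not node.children:
        return [node.name]
    return [leaf for child in node.children for leaf in leaves(child)]


# N-glycan core: the root GlcNAc is the sugar bonded to asparagine.
root = Sugar("GlcNAc")
second = root.add(Sugar("GlcNAc"))
branch_point = second.add(Sugar("Man"))
branch_point.add(Sugar("Man"))  # first branch
branch_point.add(Sugar("Man"))  # second branch

print(count_residues(root), leaves(root))
```

Unlike a peptide, which can be stored as a simple string of letters, this structure needs to record which sugar each new sugar is attached to, and that extra information is what de novo sequencing has to recover.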

To tackle this problem, GlycanFinder uses artificial intelligence, specifically a deep learning model that performs N-linked glycan sequencing on mass spectra that have not been identified using the database search.

“N-linked glycans have patterns,” Qianqiu explains. “In nature, one saccharide in a glycan is always linked to another. This is a natural pattern and the goal of de novo sequencing is to learn this pattern. We used a deep learning model that’s trained on all the known glycan structures and on the mass spectrum of a sample to learn and then predict the glycan’s structure by recurrently building a tree of glycans from the root which is attached to the peptide, to the branches and leaves.”

To do this, GlycanFinder uses what’s known as a graph transformer, a type of deep learning model that works with nodes and the connections between them, which makes it well suited to generating tree-like structures.
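The root-to-leaf construction described here can be pictured as a loop that repeatedly asks "which sugar goes next, and where?" until the tree accounts for the glycan's mass. The Python sketch below shows only that loop. It is an assumption-laden illustration, not the published model: `grow_glycan` and `scripted_model` are hypothetical names, and `scripted_model` simply replays a fixed list of decisions where a trained network would make predictions from the spectrum.

```python
RESIDUE_MASS = {"HexNAc": 203.0794, "Hex": 162.0528}  # monoisotopic, Da


def grow_glycan(target_mass, next_step, tol=0.02, max_steps=50):
    """Grow a glycan tree root-first until its mass reaches target_mass.

    next_step(tree, residual) stands in for a trained model. It returns
    (parent_index, sugar) for the next sugar to attach, or None to stop.
    """
    # Each entry is (sugar, index of parent); the root binds the peptide.
    tree = [("HexNAc", None)]
    mass = RESIDUE_MASS["HexNAc"]
    for _ in range(max_steps):
        residual = target_mass - mass
        if abs(residual) <= tol:
            break  # the tree now explains the whole glycan mass
        step = next_step(tree, residual)
        if step is None:
            break
        parent, sugar = step
        if not 0 <= parent < len(tree) or RESIDUE_MASS[sugar] > residual + tol:
            break  # reject proposals that are malformed or overshoot the mass
        tree.append((sugar, parent))
        mass += RESIDUE_MASS[sugar]
    return tree, mass


def scripted_model(tree, residual):
    """Hypothetical stand-in for the model: replays the attachment order
    of the N-glycan core instead of predicting it from a spectrum."""
    script = [(0, "HexNAc"), (1, "Hex"), (2, "Hex"), (2, "Hex")]
    i = len(tree) - 1
    return script[i] if i < len(script) else None


core_mass = 2 * RESIDUE_MASS["HexNAc"] + 3 * RESIDUE_MASS["Hex"]
tree, mass = grow_glycan(core_mass, scripted_model)
print(tree)
```

In this picture, each decision depends on the partial tree built so far, which is what building the structure "recurrently" means; swapping the scripted function for a learned one changes who makes the decisions, not the shape of the loop.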

“The graph transformer model reconstructs glycan trees from scratch, without imposing predefined rules, structures, or types of modifications,” Qianqiu said. “We learn both the sequence of amino acids in the peptide and the structure of the glycans attached to them. Such accurate glycopeptide profiling in cells and tissues is essential to develop diagnostic tests to detect disease and to develop potential treatments.”

To learn more about the research on which this feature article is based, please see Weiping Sun, Qianqiu Zhang, Xiyue Zhang, Ngoc Hieu Tran, M. Ziaur Rahman, Zheng Chen, Chao Peng, Jun Ma, Ming Li, Lei Xin and Baozhen Shan. Glycopeptide database search and de novo sequencing with PEAKS GlycanFinder enable highly sensitive glycoproteomics. Nature Communications 14, 4046 (2023).