Peptide identification is a core challenge in proteomics, the study of proteins and their structures and functions. Proteomics is far more complex than genomics, which examines an organism’s genetic information. The proteome, the complete set of proteins produced or modified by a cell or system, varies not only across cell types but also over time.
Analyzing proteomics data is crucial because the number of proteins an organism produces exceeds what can be inferred solely from genomic and transcriptomic data.
For example, alternative splicing allows a single gene to produce multiple mRNA transcripts, each of which can be translated into distinct proteins with different functions. Additionally, proteins undergo various post-translational modifications, further expanding the diversity and functions of proteins within an organism.
“DNA sequencing is relatively straightforward,” says Yonghan Yu, a PhD candidate at the Cheriton School of Computer Science. “In contrast, sequencing proteins is far more challenging. Protein molecules must first be broken down into smaller fragments known as peptides, which are then analyzed using mass spectrometry. This analytical tool separates peptides based on their mass and electrical charge, generating data that computational methods use to infer the protein’s amino acid sequence.”
[Photo: Yonghan Yu in the Davis Centre]
Yonghan Yu is a PhD candidate advised by University Professor Ming Li. He holds a bachelor’s degree in electronics and communication, with a minor in computing, from City University of Hong Kong.
The most widely used method to identify proteins from mass spectrometry data is database search, which compares experimental spectra to theoretical spectra derived from known peptide sequences. However, traditional database search engines are limited by their reliance on heuristic scoring functions, which rank candidate peptide–spectrum matches by how well theoretical peptide fragments match the peaks in an experimental mass spectrum.
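To make the idea of a heuristic scoring function concrete, here is a minimal Python sketch of one of the simplest such heuristics, a shared peak count. It is an illustration only, not the scoring used by any particular search engine: the function names, the 0.02 m/z tolerance, and the restriction to singly charged b- and y-ions are all assumptions made for the example.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton (one positive charge)


def theoretical_fragments(peptide):
    """Return singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    fragments = []
    # Each bond between residues yields one prefix (b) and one suffix (y) ion.
    for i in range(1, len(masses)):
        b_ion = sum(masses[:i]) + PROTON
        y_ion = sum(masses[i:]) + WATER + PROTON
        fragments.extend([b_ion, y_ion])
    return fragments


def shared_peak_count(observed_mz, peptide, tolerance=0.02):
    """Heuristic score: theoretical fragments matched by an observed peak."""
    return sum(
        any(abs(obs - theo) <= tolerance for obs in observed_mz)
        for theo in theoretical_fragments(peptide)
    )


def database_search(observed_mz, candidates, tolerance=0.02):
    """Rank candidate peptides by shared peak count, best first."""
    scored = [(shared_peak_count(observed_mz, pep, tolerance), pep)
              for pep in candidates]
    return sorted(scored, reverse=True)


# Toy usage: a noise-free "experimental" spectrum simulated from PEPTIDE.
observed = theoretical_fragments("PEPTIDE")
ranked = database_search(observed, ["SAMPLER", "EDITPEP", "PEPTIDE"])
print(ranked[0])  # (12, 'PEPTIDE')
```

Real spectra contain noise, missing fragments, and varying peak intensities, which is why a simple count like this is easily biased and why engines layer further corrections on top of it.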
“With traditional database search, the heuristic scoring function is suboptimal and influenced by multiple factors,” Yonghan explains. “To reduce potential biases and improve identification accuracy, probabilistic models based on statistical significance or probability estimation are typically used.”
Deep learning might seem a natural solution to this problem. Yet since University Professor Ming Li’s group first applied deep learning to de novo peptide sequencing in their 2017 Proceedings of the National Academy of Sciences paper, extending it to database search has remained an open problem for eight years, despite attempts by numerous research groups.
DeepSearch, a novel deep learning–based end-to-end database search method developed by Yonghan and University Professor Li, addresses these limitations and solves the open problem.
Unlike previous approaches, DeepSearch is a data-driven method that uses deep learning to generate high-dimensional embeddings of experimental mass spectra and peptide sequences. It calculates peptide–spectrum match scores using cross-modal cosine similarity as the scoring scheme, leading to a more robust and unbiased identification process.
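The scoring step described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeepSearch's implementation: the three-dimensional vectors stand in for the high-dimensional embeddings that trained spectrum and peptide encoders would produce, and the names `cosine_similarity` and `rank_candidates` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as uninformative
    return dot / (norm_u * norm_v)


def rank_candidates(spectrum_embedding, peptide_embeddings):
    """Score each candidate peptide against the spectrum, best first."""
    scored = [(cosine_similarity(spectrum_embedding, emb), pep)
              for pep, emb in peptide_embeddings.items()]
    return sorted(scored, reverse=True)


# Placeholder embeddings (assumed values): in practice these would come
# from two learned encoders, one for spectra and one for peptide sequences.
spectrum_embedding = [0.9, 0.1, 0.4]
peptide_embeddings = {
    "PEPTIDE": [0.8, 0.2, 0.5],
    "SAMPLER": [-0.3, 0.9, 0.1],
    "EDITPEP": [0.1, 0.1, -0.9],
}

for score, peptide in rank_candidates(spectrum_embedding, peptide_embeddings):
    print(f"{peptide}: {score:.3f}")
```

Because cosine similarity depends only on the direction of the two vectors, the score is insensitive to their overall magnitude; the quality of the ranking rests entirely on how well the encoders place matching spectra and peptides close together.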
“Computational biologists have built libraries of high-quality peptide–spectrum matches,” Yonghan says. “We use these libraries to train a deep learning model, which refines the scoring function. This allows DeepSearch to identify peptides more accurately than traditional methods.”
To evaluate its effectiveness, the researchers tested DeepSearch on protein datasets from various species, including Arabidopsis thaliana (a plant model), HEK293 (a human cell line), Caenorhabditis elegans (a nematode worm model), and Escherichia coli (a bacterial model). The results showed that DeepSearch outperformed three widely used search engines, MSFragger, MS-GF+, and MaxQuant, and achieved high accuracy and robustness across these diverse datasets.
Beyond improving peptide identification, DeepSearch brings new capabilities to proteomics. Among its key strengths is its ability to identify proteins with post-translational modifications, a task with which previous deep learning–based protein identification methods have struggled.
“DeepSearch can perform post-translational modification searches in a zero-shot manner,” Yonghan says. “This means it can identify modified proteins without being trained on post-translational modification data. Instead of requiring specialized training, the model can generalize and apply its learned patterns to new scenarios.”
As researchers continue to refine deep learning models for proteomics, DeepSearch shows the potential for such methods to revolutionize the field, leading to deeper insights into the molecular mechanisms underlying health and disease.
“Our goal is to create better proteomic tools so that we can accurately identify peptides from different sources like antibodies and cancer proteomics,” Yonghan says. “If we can do that, we are one step closer to developing personalized therapies.”
To learn more about the research on which this feature is based, please see Yonghan Yu and Ming Li. Towards highly sensitive deep learning-based end-to-end database search for tandem mass spectrometry. Nature Machine Intelligence 7, 85–95 (2025).