Computer scientists develop zero-shot algorithm for de novo sequencing of post-translationally modified peptides

An international research team led by computer scientists at the Cheriton School of Computer Science has developed a machine learning algorithm that could help researchers uncover protein changes that are difficult to detect with existing tools.

“Proteins do much of the work inside cells, and after they are made our cells can chemically modify them in many ways,” said Zeping Mao, a PhD candidate and lead author on the study.

The human genome contains about 20,000 genes that code for proteins, yet the number of proteins in the body is vastly greater. Estimates suggest the human proteome may exceed one million distinct protein variants. Much of this diversity arises from post-translational modifications, or PTMs, chemical changes that occur after proteins are produced and can alter their structure and function.


Zeping Mao is a PhD candidate at the Cheriton School of Computer Science, advised by Professor Ming Li. He has a Bachelor’s degree in Biology from Tsinghua University.

He will defend his thesis, “Deep Learning for Accurate and Reliable De Novo Peptide Sequencing: From Missing Fragmentation to Open Modification Discovery,” on July 9, 2026.

PTMs play a critical role in regulating cellular processes. When they occur abnormally, however, they can alter a protein’s properties and impair its function, contributing to the onset and progression of many diseases. Yet identifying PTMs in complex biological samples remains technically challenging.

Many existing peptide sequencing methods work best when researchers already know what they are looking for. These methods often rely on a reference protein database, a predefined list of candidate modifications, or labelled training data for the modifications the methods are expected to identify.
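The limitation of a predefined candidate list can be illustrated with a toy lookup. The sketch below is not RNovA or any real search engine; the function name, the three-entry list and the 0.01 Da tolerance are assumptions made for illustration. Only the monoisotopic mass shifts are standard values.

```python
# Toy illustration only: not RNovA and not a real search engine.
# Monoisotopic mass shifts (in daltons) for a few well-known modifications.
KNOWN_MODIFICATIONS = {
    "oxidation": 15.994915,
    "acetylation": 42.010565,
    "phosphorylation": 79.966331,
}


def match_known_modification(mass_shift, tolerance=0.01):
    """Return the listed modification whose shift is within tolerance, else None.

    Hypothetical helper: the name and the tolerance are assumptions.
    """
    for name, delta in KNOWN_MODIFICATIONS.items():
        if abs(mass_shift - delta) <= tolerance:
            return name
    return None


# A shift that is on the list is recognized.
listed = match_known_modification(79.9663)

# A shift that is not on the list (here about +3.9949 Da, the shift for
# tryptophan converted to kynurenine) goes unrecognized.
unlisted = match_known_modification(3.9949)

print(listed, unlisted)
```

In this sketch, any modification missing from the list is simply invisible to the lookup, however strong the evidence in the spectrum.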

This dependence on prior knowledge makes de novo sequencing of post-translationally modified peptides particularly difficult. Because PTMs alter a peptide’s mass and fragmentation pattern, machine learning models often need labelled examples of specific modifications during training.
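To see how a modification changes both a peptide’s mass and its fragment masses, consider the following minimal sketch. The peptide GAWSK, the helper names and the choice of singly charged b-ions are assumptions made for illustration; the study’s own processing is not described here. The residue masses are standard monoisotopic values.

```python
# Minimal sketch under stated assumptions; not the method used in the study.
# Standard monoisotopic residue masses (in daltons) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "K": 128.09496,
    "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(peptide, mods=None):
    """Neutral monoisotopic mass; mods maps 0-based position to a mass shift."""
    mods = mods or {}
    return sum(RESIDUE_MASS[aa] for aa in peptide) + sum(mods.values()) + WATER


def b_ions(peptide, mods=None):
    """Singly charged b-ion m/z values (prefix fragments) for a peptide."""
    mods = mods or {}
    ions, running = [], 0.0
    for position, aa in enumerate(peptide[:-1]):
        running += RESIDUE_MASS[aa] + mods.get(position, 0.0)
        ions.append(running + PROTON)
    return ions


# Hypothetical example peptide; position 2 is tryptophan (W).
# +3.994915 Da is the mass shift for tryptophan converted to kynurenine.
kynurenine_shift = {2: 3.994915}

unmodified = b_ions("GAWSK")
modified = b_ions("GAWSK", kynurenine_shift)

for index, (plain, shifted) in enumerate(zip(unmodified, modified), start=1):
    print(f"b{index}: {plain:.4f} -> {shifted:.4f}")
```

Fragments that end before the modified residue are unchanged, while every fragment that contains it is shifted by the same amount. That position-dependent pattern is what a sequencing method has to interpret.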

“If a modification is rare, unexpected or missing from the database, existing methods can overlook it,” Zeping explained. “It’s like trying to solve a puzzle but only being able to see a few pieces.”

The research team’s new algorithm, called RNovA, short for rotary positional embedding-enhanced de novo sequencing algorithm, analyzes mass spectrometry data to infer peptide sequences and discover candidate PTMs in a zero-shot setting. Designed as a modular framework, RNovA combines open PTM discovery with high-accuracy peptide prediction. By separating modification detection from sequence inference, it can systematically identify modified peptides without requiring user-supplied candidates or prior knowledge of specific modifications.

Being a zero-shot method means the model can detect unexpected modifications without retraining on each new PTM or relying on a predefined list of candidate modified residues. By reducing the need for labelled training data, the approach helps address one of the major challenges in identifying previously unknown, rare or poorly characterized PTMs.

In their study, RNovA achieved state-of-the-art performance on standard de novo peptide sequencing benchmarks and outperformed a widely used traditional tool on a synthetic dataset containing diverse PTMs. The team also used RNovA to identify peptides modified with kynurenine, an uncommon but biologically relevant PTM, in clinical samples from patients with rheumatoid arthritis. The findings were subsequently validated using synthesized reference peptides.

“RNovA gives scientists a way to look beyond what is already catalogued,” Zeping said. “Expanding the PTM list may help researchers find new cellular modifications and new markers for cancer and other diseases. It’s a very powerful tool that will help biologists to broaden their horizons.”

Zeping and his collaborators are also exploring whether the approach can be extended to cross-linking mass spectrometry, a technique that reveals which regions of proteins are located near one another in three-dimensional space. If successful, this work could make protein-structure measurements more cost-effective and higher in throughput, enabling the generation of larger experimental datasets for AI-for-biology models.

“The long-term goal is to make structural proteomics data abundant enough to train much more powerful AI models of biology,” Zeping said.

Ultimately, the work could support basic research and help researchers identify new disease mechanisms, discover biomarkers, and develop more targeted therapeutic treatments.

To learn more about the research on which this feature is based, please see Zeping Mao, Chao Peng, Yuling Chen, Ping Wu, Qianqiu Zhang, Yonghan Yu, Ruixue Zhang, Lei Xin, Baozhen Shan, Haiteng Deng, Ming Li. Zero-shot de novo peptide sequencing with open posttranslational modification discovery. Nature Biotechnology (2026).