CS 886: Deep Learning for Biotechnology
Winter 2022

Ming Li DC 3355, x84659 mli@uwaterloo.ca
Course time and location: Fridays 1:00pm-3:20pm, DC 2568. We are approved to remain online throughout the term. Please come to Zoom meeting id: 863 4846 7016 (no password) at 1pm Fridays. The student presentations have started.
Office hours: Due to the pandemic, let's talk online. I am available almost all the time. Just email me. If you wish to call me at 519 500 3026, please send an email first to make an appointment. We can also make Zoom appointments to meet online.
Reference Materials: Papers listed below.

Course description: Deep learning has brought truly revolutionary changes in many fields, including biotechnology. This course intends to review the recent applications of deep learning in biotechnology. Topics include deep learning applications in, but not limited to: proteomics, protein folding and protein design, (small) drug discovery, genomics, liquid biopsy, and COVID-19.

The course will be run as follows. I will give some lectures at the beginning introducing the basics and recent progress of deep learning. These include word2vec, attention, and pretraining models such as GPT and BERT. Then, during the second part of the course, each student will present one or a group of research papers on deep learning applications in one field of biotechnology. The paper you choose should represent important progress in that field, or address shortcomings of current approaches and how we can solve a fundamental problem in biotechnology. Additionally, each student will need to do one course project of their own choice and present it to the class at the end of the term. I expect students to already know the basics of deep learning, such as different types of gates, pooling, backpropagation, gradient descent methods, fully connected networks, recurrent networks such as LSTM and GRU, convolutional networks, and more specialized structures such as residual networks and Grid LSTM, recursive structures, memory networks, sequence-to-sequence structures, and generative adversarial nets (GANs). If you do not already know about these, you can read about them online or go to my lecture notes at: https://cs.uwaterloo.ca/~mli/cs898-2017.html. I might also briefly review some of these materials if needed.

GPUs: Students who need GPUs for experiments can sign up at https://www.awseducate.com/application (Amazon takes a couple of days to review the application). More information can be found at: https://aws.amazon.com/cn/education/awseducate/ Sharcnet might be another GPU resource. It is also possible to get access to a TPU from Google: https://heartbeat.fritz.ai/step-by-step-use-of-google-colab-free-tpu-75f8629492b3

Marking Scheme: Each student is evaluated according to the following three components:

Presentations will be posted on this website (the presenters should provide these materials to me) several days before class.

Course announcements and lecture notes will appear on this page. Please look at this page regularly.

    Reading materials or pointers to start in each field (these papers are only meant to serve as pointers or guides to the literature in each field):

    Deep Learning for cancer early detection (liquid biopsy)

  1. Claire Asher, Machine-learning algorithms tuned to detecting cancer DNA in the blood could pave the way for personalized cancer care. The Scientist, Jun, 2018.

    Protein folding and protein design:

  2. NH, Tran, JB Xu, M. Li: A tale of two solutions in protein science: immunopeptide sequencing and protein structure prediction. Briefings in Bioinformatics. 2022.
  3. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706-710 (2020).
  4. I. Anishchenko et al., De novo protein design by deep network hallucination. Nature 600, 547-552 (2021). (We need somebody to present this paper.)


    Proteomics:

  5. B. Wen et al., Deep learning in proteomics. Proteomics, 20, 2020.
  6. NH, Tran, JB Xu, M. Li: A tale of two solutions in protein science: immunopeptide sequencing and protein structure prediction. Briefings in Bioinformatics. 2022.

    Drug Design:

  7. J. Vamathevan et al. Applications of machine learning in drug discovery and development. Nature Reviews Drug Discovery. April 11, 2019.
  8. J. Jimenez-Luna et al, Drug discovery with explainable artificial intelligence. Oct. 13, 2020, Nature Machine Intelligence.


    Genomics:

  9. James Zou et al, A primer on deep learning in genomics. Nature Genetics 51, 12-18 (2019).
  10. B. Alipanahi et al, Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotech. 2015.


    COVID-19:

  11. C. Shorten, et al, Deep learning applications for COVID-19. Journal of Big Data 8:18 (2021).

    Deep learning for CRISPR technology:

  12. Kim et al, Deep learning improves prediction of CRISPR-Cpf1 guide RNA activity. Nature Biotech. Jan. 29, 2018.
  13. G. Chuai et al., DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biology, 2018.
  14. R. Leenay et al, Large dataset enables prediction of repair after CRISPR-Cas9 editing in primary T cells, Nature Biotechnology, Sept. 2019.

    Deep learning for global warming. (The past 7 years are the hottest years in human history. Stopping this trend is a matter of our species' survival. Deep learning can help. One might argue that this topic is not "biotechnology" in a narrow sense, but I think the problem is too important to ignore.):

  15. How to use Deep learning for global warming. (goodworklabs.com/deep-learning-for-global-warming/)
  16. MIT Technology Review: 10 ways AI could help fight climate change
  17. M. Reichstein et al. Deep learning and process understanding for data-driven earth system science. Nature, Feb. 13, 2019.

    Deep Learning Basics (these papers are not options for course presentations):

  18. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, in Advances in neural information processing systems, 2013, pp. 3111-3119.
  19. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A neural probabilistic language model, Journal of machine learning research, vol. 3, no. Feb, pp. 1137-1155, 2003.
  20. T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013.
  21. J. Pennington, R. Socher, and C. D. Manning, Glove: Global vectors for word representation. in EMNLP, vol. 14, 2014, pp. 1532-1543.
  22. X. Rong, word2vec parameter learning explained, arXiv preprint arXiv:1411.2738, 2014.
  23. Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder & Anubhav Jain, Unsupervised word embeddings capture latent knowledge from materials science literature, Nature, July 3, 2019.
  24. R. Johnson and T. Zhang, Semi-supervised convolutional neural networks for text categorization via region embedding, in Advances in neural information processing systems, 2015, pp. 919-927.
  25. R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, in Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2011, pp. 151-161.
  26. Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, Character-aware neural language models, in AAAI, 2016, pp. 2741-2749.
  27. C. D. Santos and B. Zadrozny, Learning character-level representations for part-of-speech tagging, in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
  28. Y. Ma, E. Cambria, and S. Gao, Label embedding for zero-shot fine-grained named entity typing, in COLING, Osaka, 2016, pp. 171-180.
  29. X. Chen, L. Xu, Z. Liu, M. Sun, and H. Luan, Joint learning of character and word embeddings, in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  30. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606, 2016.
  31. A. Herbelot and M. Baroni, High-risk learning: acquiring new word vectors from tiny data, arXiv preprint arXiv:1707.06556, 2017.
  32. Y. Pinter, R. Guthrie, and J. Eisenstein, Mimicking word embeddings using subword rnns, arXiv preprint arXiv:1707.06961, 2017.
  33. L. Lucy and J. Gauthier, Are distributional representations ready for the real world? evaluating word vectors for grounded perceptual meaning, arXiv preprint arXiv:1705.11168, 2017.
  34. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365, 2018 (ELMO)
  35. A. Mousa and B. Schuller, Contextual bidirectional long short-term memory recurrent neural network language models: A generative approach to sentiment analysis, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, vol. 1, 2017, pp. 1023-1032.
  36. A. Vaswani, N. Shazeer, N. Parmar, and J. Uszkoreit, Attention is all you need, arXiv preprint arXiv:1706.03762, 2017
  37. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pretraining, URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018. (OpenAI-GPT)
  38. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
  39. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  40. Mandar Joshi*, Danqi Chen*, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy. SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019.
  41. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  42. Stephen Merity, Single Headed Attention RNN: Stop Thinking With Your Head 2019.
  43. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436-444(2015).
  44. R. Socher, Y. Bengio, C. Manning, Deep learning for NLP, ACL 2012
  45. R. Sutton and A. Barto: Reinforcement Learning: an introduction. MIT Press (1998).
  46. K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: encoder-decoder approaches, Oct. 2014.
  47. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation. Jun. 2014.
  48. I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks. 2014.
  49. Alex Graves, Generating sequences with recurrent neural networks. 2013-2014 (this paper generates handwritten characters with an LSTM)

Lecture Notes:

Student Presentations / Topics:

Yonghan Yu: Protein structure prediction (AlphaFold2). Feb. 4th.

Jesse AK Elliott: COVID-19 vaccine design (presentation toward the end of the term). G. Li et al, DeepImmuno: deep learning-empowered prediction and generation of immunogenic peptides for T-cell immunity, Briefings in Bioinformatics.

Yiping Wang, multi-modal clinical report generation (image + language processing). Feb. 4.

Mehrshad Sadria: The application of deep learning in single-cell sequencing technology.

Zeping Mao: Regulatory genomics. Yongjin Park and M. Kellis, Nature Biotech, 33 (2015), p 825-826. Z. Avsec et al, Deep learning at base-resolution reveals motif syntax of the cis-regulatory code. 2019, later published in Nature Genetics (Base-resolution models of transcription-factor binding reveal soft motif syntax). Thanks to Mehrshad Sadria for proposing this topic.

Partha Chakraborty: Transformer-based approaches in representation learning of protein sequences (Based on papers: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. TALE: transformer-based protein function annotation with joint sequence-label embedding).

Shuyang Zhang, using language models to predict virus (covid-19) mutation escape. (Brian Hie et al, Learning mutational semantics, and Seq-2-Seq GAN based prediction)

Ronak Pradeep: Biomedical neural information retrieval and fact verification. (Covidex: neural ranking models and keyword search infrastructure for the covid-19 open research dataset, Scientific claim verification with VerT5erini, and Vera: Prediction techniques for reducing harmful misinformation in consumer health search.)

Kishanthan Thangrarajah: Deep learning for cancer prediction. March 4.

Haudi Ghiassi Nejad: Deep learning applications in drug design. Pushing the boundaries of molecular representation for drug discovery with graph attention mechanism. (Feb. 11, or Feb 18).

Lucas Fenaux: Deep learning and bio-privacy

Robert Wang: phylogeny by deep learning. A. Bhattacharjee, et al: Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices. E. S. Azer, et al. Tumor phylogeny topology inference via deep learning.

Aniruddhan Murali, Protein design. J. Wang ... D. Baker, Deep learning methods for designing proteins scaffolding functional sites. I. Anishchenko, ... D. Baker, De novo protein design by deep network hallucination. Nature (2021).

Gustavo Sutter, Biological sequencing embedding (presentation time, end of Feb). Gabriele Corso, Neural Distance Embeddings for Biological sequences. NeurIPS, 2021.

Negar Arabzadehghahyazi. Domain-specific language model pretraining for biomedical natural language processing. March 4th, or March 11 presentation.

Maryam Yalsavar, Deep learning for medical image analysis. L. Hou, Patch-based convolutional neural network for whole slide tissue image classification.

Chris West, Deep learning for modelling of protein-protein and protein-ligand interactions with applications in drug discovery. P. Gainza et al, Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods 2020. Presentation time: Feb 11.

Owen Chambers, deep learning for CRISPR technology.

Michael Karras, Deep learning for EEG

Xinyu Shi, Vision transformer for covid-19 diagnosis. Federated split vision transformer for covid-19 CXR using task-agnostic training. Multi-task vision transformer using low-level chest X-ray feature corpus for covid-19 diagnosis and severity quantification. Visual transformer with statistical test for covid-19 classification.

Anastasiia Livochka, Deep learning predicts tuberculosis drug resistance status from whole-genome sequencing. Drug resistance prediction using deep learning techniques on HIV-1 sequence.

Saber Malekmohammadi, Covid-19 and the vision models. (March 25 presentation)

Xueguang Ma, Protein embedding


All students, whether registered or planning to register for the course: please come to the Zoom meeting with meeting id 863 4846 7916 on Jan. 7 at 1pm. I look forward to meeting you soon!

On Friday, Jan. 14, we will again hold a Q&A session at 1pm. Please make sure you watch the lecture 2 video before class time. I will not lecture during the class; instead, we will answer questions and discuss projects, etc. The Zoom meeting id remains the same. Please join the meeting on time.

Jan. 28: If you have not chosen a presentation time yet, please do so asap. I especially need a volunteer for Feb. 4, which is next week. There are still a few students who have not chosen a presentation topic yet.

There are still 4 students not in our presentation schedule. Please contact me asap. It looks like we will have 4 people presenting each week starting from Feb. 18.

It is now decided that our course will stay online for the whole term.

The deadline for handing in the 10-page project writeup is April 15.

Presentation Schedule:

  • Feb. 4: Yonghan Yu, Yonghan's talk on protein structure prediction by deep learning; Yiping Wang, Yiping's talk on clinical report generation. Student presentation week 1 video.
  • Feb. 11: Chris West, Chris' talk on protein-protein interaction by deep learning; Shuyang Zhang, Shuyang's talk on deep learning the language of viral evolution; Aref Jafari, Aref's talk on clinical report generation. Student presentation week 2 video.
  • Feb. 18: Michael Karras, Michael's talk on EEG; Partha Chakraborty, Partha's talk on protein function annotation; Haudi Ghiassi Nejad, Haudi's talk on drug development; Zeping Mao, Zeping's talk on regulatory genomics. Student presentation week 3 video.
  • Feb. 25: Reading week.
  • Mar. 4: Ronak Pradeep, Ronak's presentation; Kishanthan Thangrarajah, Kishanthan's talk on cancer early detection; Negar Arabzadehghahyazi, Negar's talk on deep learning for medical document pretraining; Owen Chambers, Owen's talk on deep learning for CRISPR technology. Student presentation week 4 video.
  • Mar. 11: Xinyu Shi, Xinyu's presentation; Robert Wang, Robert's presentation; Aniruddhan Murali, Aniruddhan's presentation; Xueguang Ma, Xueguang's presentation. Student presentation week 5 video.
  • Mar. 18: Jesse AK Elliott, Jesse's presentation on immunogenicity; Mehrshad Sadria, Mehrshad's presentation on single cell sequencing; Saber Malekmohammadi, Saber's presentation on covid-19 image analysis; Maryam Yalsavar, Maryam's presentation on medical image. Student presentation week 6 video.
  • Mar. 25: Lucas Fenaux, Lucas' presentation; Anastasiia Livochka, Anastasiia's presentation on antibiotics resistance; Arman Hafizi; Gustavo Sutter, Gustavo's presentation. Student presentation week 7 video.
  • April 1 (class time), April 4 (4pm): Project presentations.

    Final Projects: