CS 886: Deep Learning for Biotechnology
Winter 2022
INSTRUCTOR: Ming Li, DC 3355, x84659, mli@uwaterloo.ca
Course time and location: Fridays 1:00pm-3:20pm, DC 2568. We are approved to remain online
throughout the term.
Please come to Zoom meeting id:
863 4846 7016 (no password) at 1pm Fridays.
The student presentations have started.
Office hours: Due to the pandemic, let's talk online.
I am available almost all the time. Just email me. If
you wish to call me at 519 500 3026, please send an email
first to make an appointment. We can also make Zoom
appointments to meet online.
Reference Materials: Papers listed below.
Course description: Deep learning has brought truly revolutionary changes in many fields, including
biotechnology. This course reviews recent applications of deep
learning in biotechnology. Topics include, but are not limited to, deep learning applications in
proteomics, protein folding and protein design, (small) drug discovery, genomics, liquid biopsy, and COVID-19.
The course will be run as follows.
I will give some lectures at the beginning introducing the basics and recent progress
of deep learning. These include word2vec,
attention, and pretrained models such as GPT and BERT.
Then, during the second part of the course,
each student will present one research paper, or a group of
research papers, on deep learning applications in one field of biotechnology.
The paper you choose should represent important progress in that field,
or discuss shortcomings of current approaches and how we can solve a
fundamental problem in biotechnology. Additionally, each student will need
to do one course project of their own choice and present
it to the class at the end of the term.
I expect students to already know the basics of
deep learning, such as different types of
gates, pooling, backpropagation, gradient descent methods,
fully connected networks,
recurrent networks such as LSTM and GRU,
convolutional networks, and more specialized
structures such as residual networks, Grid LSTM, recursive structures,
memory networks, sequence-to-sequence structures, and generative adversarial nets
(GANs). If you do not already know about these, you can read about
them online or go to my lecture notes at:
https://cs.uwaterloo.ca/~mli/cs898-2017.html.
I might also briefly review some of these materials if needed.
GPUs: Students who need to run experiments
can go to https://www.awseducate.com/application
to sign up. Amazon takes a couple of days to review the application.
More information can be
found at: https://aws.amazon.com/cn/education/awseducate/
SHARCNET might be another resource for GPUs.
It is also possible to apply for a TPU from Google:
https://heartbeat.fritz.ai/step-by-step-use-of-google-colab-free-tpu-75f8629492b3
Marking Scheme:
Each student is evaluated according to the following three components:
-
[30 marks] Present a paper that represents one aspect of recent progress in
biotechnology (40 minutes). You need to demonstrate a thorough understanding of
the background and relevant literature of your topic.
In addition to focusing on one problem,
presentations should include an in-depth
survey of the relevant literature and be educational. Each week, we will have
three students presenting.
-
[65 marks] Do a project on one problem in biotechnology using deep learning, and present your own
project in class at the end of the term for 20 minutes.
I will be very happy to discuss projects with you.
-
[5 marks] Class attendance and participation.
Presentations will be posted on this website
(the presenters should provide these materials to me) several days before class.
Course announcements and lecture notes will appear on this page.
Please look at this page regularly.
Reading Materials, or pointers to start in each field (these papers are only meant to serve as
pointers or guides to the literature in each field):
Deep Learning for cancer early detection (liquid biopsy)
-
Claire Asher, Machine-learning algorithms tuned to detecting cancer DNA in the blood could pave the way for personalized cancer care.
The Scientist, Jun, 2018.
Protein folding and protein design:
-
NH Tran, JB Xu, M. Li: A tale of two solutions in protein science: immunopeptide sequencing and protein structure prediction.
Briefings in Bioinformatics. 2022.
-
Senior, A. W. et al. Improved protein structure prediction using potentials from deep
learning. Nature 577, 706-710 (2020).
-
I. Anishchenko et al, De novo protein design by deep network
hallucination.
Nature 600, 547-552 (2021). (We need somebody to present this paper.)
Proteomics:
-
B. Wen et al., Deep learning in proteomics.
Proteomics, 20, 2020.
-
NH Tran, JB Xu, M. Li: A tale of two solutions in protein science: immunopeptide sequencing and protein structure prediction.
Briefings in Bioinformatics. 2022.
Drug Design:
-
J. Vamathevan et al. Applications of machine learning in drug discovery and development.
Nature Reviews Drug Discovery. April 11, 2019.
-
J. Jimenez-Luna et al, Drug discovery with explainable artificial intelligence.
Oct. 13, 2020, Nature Machine Intelligence.
Genomics:
-
James Zou et al, A primer on deep learning in genomics. Nature Genetics 51, 12-18 (2019).
-
B. Alipanahi et al, Predicting the sequence specificities of DNA- and RNA-binding proteins by
deep learning. Nature Biotech. 2015.
Covid-19:
-
C. Shorten, et al, Deep learning applications for COVID-19. Journal of Big Data 8:18 (2021).
Deep learning for CRISPR technology:
-
Kim et al, Deep learning improves prediction of CRISPR-Cpf1 guide RNA activity.
Nature Biotech. Jan. 29, 2018.
-
G. Chuai et al,
DeepCRISPR: optimized CRISPR guide RNA design by deep learning
Genome Biology, 2018.
-
R. Leenay et al, Large dataset enables prediction of repair after
CRISPR-Cas9 editing in primary T cells, Nature Biotechnology,
Sept. 2019.
Deep learning for global warming. (The past 7 years are the hottest
years in human history. Stopping this trend is a matter of our species'
survival. Deep learning can help. One might argue that this topic
is not "biotechnology" in a narrow sense, but I think the
problem is too important to ignore.):
-
How to use Deep learning for global
warming. (goodworklabs.com/deep-learning-for-global-warming/)
-
MIT Technology Review: 10 ways AI could help fight climate change
-
M. Reichstein et al. Deep learning and process understanding for
data-driven earth system science. Nature, Feb. 13, 2019.
Deep Learning Basics (These papers are not options for course presentations):
-
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
Distributed representations of words and phrases and
their compositionality, in Advances in neural information processing systems,
2013, pp. 3111-3119.
-
Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A neural probabilistic language model, Journal of machine learning research, vol. 3, no.
Feb, pp. 1137-1155, 2003.
-
T. Mikolov, K. Chen, G. Corrado, and J. Dean,
Efficient estimation of word representations in vector space, arXiv
preprint arXiv:1301.3781, 2013.
-
J. Pennington, R. Socher, and C. D. Manning, Glove: Global vectors for word representation. in EMNLP, vol. 14,
2014, pp. 1532-1543.
-
X. Rong, word2vec parameter learning explained, arXiv preprint arXiv:1411.2738, 2014.
-
Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn,
Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder & Anubhav Jain,
Unsupervised word embeddings capture latent knowledge from materials
science literature, Nature, July 3, 2019.
-
R. Johnson and T. Zhang, Semi-supervised convolutional neural networks for text categorization via region embedding,
in Advances in neural information processing systems, 2015, pp. 919-927.
-
R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, Semi-supervised recursive autoencoders
for predicting sentiment distributions, in Proceedings of the conference on empirical methods in natural language
processing. Association for Computational Linguistics, 2011, pp. 151-161.
-
Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, Character-aware neural language models, in AAAI, 2016, pp. 2741-2749.
-
C. D. Santos and B. Zadrozny, Learning character-level representations for part-of-speech tagging, in Proceedings of
the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818-1826.
-
Y. Ma, E. Cambria, and S. Gao, Label embedding for zero-shot fine-grained named entity typing, in COLING, Osaka,
2016, pp. 171-180.
-
X. Chen, L. Xu, Z. Liu, M. Sun, and H. Luan, Joint learning of character and word embeddings, in Twenty-Fourth
International Joint Conference on Artificial Intelligence, 2015.
-
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606, 2016.
-
A. Herbelot and M. Baroni, High-risk learning: acquiring new word vectors from tiny data, arXiv preprint
arXiv:1707.06556, 2017.
-
Y. Pinter, R. Guthrie, and J. Eisenstein, Mimicking word embeddings using subword rnns, arXiv preprint
arXiv:1707.06961, 2017.
-
L. Lucy and J. Gauthier, Are distributional representations ready for the real world? evaluating word vectors for grounded
perceptual meaning, arXiv preprint arXiv:1705.11168, 2017.
-
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word
representations, arXiv preprint arXiv:1802.05365, 2018 (ELMO)
-
A. Mousa and B. Schuller, Contextual bidirectional long short-term memory recurrent neural network language models:
A generative approach to sentiment analysis, in Proceedings of the 15th Conference of the European Chapter of the
Association for Computational Linguistics: Volume 1, Long Papers, vol. 1, 2017, pp. 1023-1032.
-
A. Vaswani, N. Shazeer, N. Parmar, and J. Uszkoreit,
Attention is all you need, arXiv preprint arXiv:1706.03762, 2017
-
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, Improving language understanding by generative pretraining,
URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018. (OpenAI-GPT)
-
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language
understanding, arXiv preprint arXiv:1810.04805, 2018.
-
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
Radu Soricut, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
-
Mandar Joshi*, Danqi Chen*, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer,
Omer Levy.
SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019.
-
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
RoBERTa: A Robustly Optimized BERT Pretraining Approach.
-
Stephen Merity, Single Headed Attention RNN: Stop Thinking With Your Head
2019.
-
Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436-444(2015).
-
R. Socher, Y. Bengio, C. Manning, Deep learning for NLP, ACL 2012
-
R. Sutton and A. Barto: Reinforcement Learning: an introduction.
MIT Press (1998).
-
K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio,
On the properties of neural machine translation: encoder-decoder approaches,
Oct. 2014.
-
K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk,
Y. Bengio,
Learning phrase representations using RNN encoder-decoder for statistical
machine translation. Jun. 2014
-
I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, Y. Bengio, Generative adversarial networks. 2014.
-
Alex Graves, Generating sequences with recurrent neural networks.
2013-2014 (this paper generates handwritten characters with an LSTM)
Lecture Notes:
Student Presentations / Topics:
Yonghan Yu: Protein structure prediction (AlphaFold2). Feb. 4th.
Jesse AK Elliott: covid-19 vaccine design. (presentation toward the
end of the term), G. Li et al, DeepImmuno: deep learning-empowered
prediction and generation of immunogenic peptides for T-cell
immunity, Briefings in Bioinformatics;
Yiping Wang, multi-modal clinical report generation (image +
language processing). Feb. 4.
Mehrshad Sadria: The application of deep learning in single-cell
sequencing technology.
Zeping Mao: Regulatory genomics.
Yongjin Park and M. Kellis, Nature Biotech, 33 (2015), p 825-826.
Z. Avsec et al, Deep learning at base-resolution reveals motif syntax of the
cis-regulatory code. 2019, later published in Nature Genetics
(Base-resolution models of transcription-factor binding reveal soft
motif syntax).
Thanks to Mehrshad Sadria for proposing this topic.
Partha Chakraborty: Transformer-based approaches in representation
learning of protein sequences (Based on papers: Biological structure and function
emerge from scaling unsupervised learning to 250 million protein
sequences.
TALE: transformer-based protein function annotation with joint
sequence-label embedding).
Shuyang Zhang, using language models to predict virus (covid-19)
mutation escape. (Brian Hie et al, Learning mutational semantics, and
Seq-2-Seq GAN based prediction)
Ronak Pradeep: Biomedical neural information retrieval and fact
verification.
(Covidex: neural ranking models and keyword search infrastructure
for the covid-19 open research dataset, Scientific claim
verification with VerT5erini, and Vera: Prediction techniques for
reducing harmful misinformation in consumer health search.)
Kishanthan Thangrarajah: Deep learning for cancer prediction. March 4.
Haudi Ghiassi Nejad: Deep learning applications in drug design.
Pushing the boundaries of molecular representation for drug
discovery with graph attention mechanism.
(Feb. 11, or Feb 18).
Lucas Fenaux: Deep learning and bio-privacy
Robert Wang: phylogeny by deep learning.
A. Bhattacharjee, et al:
Machine learning based imputation techniques for estimating
phylogenetic trees from incomplete distance matrices.
E. S. Azer, et al. Tumor phylogeny topology inference via deep learning.
Aniruddhan Murali, Protein design.
J. Wang ... D. Baker, Deep learning methods for designing proteins
scaffolding functional sites.
I. Anishchenko, ... D. Baker,
De novo protein design by deep network hallucination. Nature, (2021)
Gustavo Sutter, Biological sequencing embedding (presentation time,
end of Feb).
Gabriele Corso,
Neural Distance Embeddings for Biological sequences.
NeurIPS, 2021.
Negar Arabzadehghahyazi. Domain-specific language model pretraining
for biomedical natural language processing.
March 4th, or March 11 presentation.
Maryam Yalsavar, Deep learning for medical image analysis.
L. Hou, Patch-based convolutional neural network for whole slide tissue
image classification.
Chris West, Deep learning for modelling of protein-protein and
protein-ligand interactions with applications in drug discovery.
P. Gainza et al, Deciphering interaction fingerprints from protein
molecular surfaces using geometric deep learning. Nature Methods
2020. Presentation time: Feb 11.
Owen Chambers, deep learning for CRISPR technology.
Michael Karras, Deep learning for EEG
Xinyu Shi, Vision transformer for covid-19 diagnosis. Federated
split vision transformer for covid-19 CXR using task-agnostic
training. Multi-task vision transformer using low-level chest X-ray
feature corpus for covid-19 diagnosis and severity
quantification. Visual transformer with statistical test for
covid-19 classification.
Anastasiia Livochka, Deep learning predicts tuberculosis drug
resistance status from whole-genome sequencing. Drug resistance
prediction using deep learning techniques on HIV-1 sequence.
Saber Malekmohammadi, Covid-19 and the vision models.
(March 25 presentation)
Xueguang Ma, Protein embedding
Announcements:
All students, whether registered or planning
to register for the course:
please come to the Zoom meeting with meeting id 863 4846 7916 on
Jan. 7, 1pm. I look forward to meeting you soon!
On Jan. 14, Friday, we will again do a Q&A session at 1pm. Please make
sure you watch the lecture 2 video before the class time. I will not
lecture during the class; instead, we will answer questions and
discuss projects etc. The Zoom meeting id remains the same.
Please join the meeting on time.
Jan. 28: If you have not chosen a presentation time yet, please do so asap. I
especially need a volunteer for Feb. 4, the next week. There are
still a few students who have not chosen a presentation topic yet.
There are still 4 students not in our presentation schedule;
please contact me asap. It
looks like we will have 4 people presenting each week starting from
Feb. 18.
It is now decided that our course will stay online for the whole term.
Deadline for handing in the 10 page project writeup is April 15.
Presentation Schedule:
Feb. 4: Yonghan Yu, Yonghan's talk on Protein Structure Prediction by Deep Learning;
Yiping Wang, Yiping's talk on Clinical report generation.
Student presentation week 1 video.
Feb. 11: Chris West, Chris' talk on protein-protein interaction by Deep Learning;
Shuyang Zhang, Shuyang's talk on Deep Learning the language of viral evolution;
Aref Jafari, Aref's talk on clinical report generation.
Student presentation week 2 video.
Feb. 18: Michael Karras, Michael's talk on EEG;
Partha Chakraborty, Partha's talk on protein function annotation;
Haudi Ghiassi Nejad, Haudi's talk on drug development;
Zeping Mao, Zeping's talk on regulatory genomics.
Student presentation week 3 video.
Feb. 25 Reading week.
Mar. 4: Ronak Pradeep, Ronak's presentation;
Kishanthan Thangrarajah, Kishanthan's talk on Cancer early detection;
Negar Arabzadehghahyazi, Negar's talk on deep learning for Medical document pretraining;
Owen Chambers, Owen's talk on deep learning for CRISPR Technology.
Student presentation week 4 video.
Mar. 11: Xinyu Shi, Xinyu's presentation;
Robert Wang, Robert's presentation;
Aniruddhan Murali, Aniruddhan's presentation;
Xueguang Ma, Xueguang's presentation.
Student presentation week 5 video.
Mar. 18: Jesse AK Elliott, Jesse's presentation on immunogenicity;
Mehrshad Sadria, Mehrshad's presentation on single cell sequencing;
Saber Malekmohammadi, Saber's presentation on covid-19 image analysis;
Maryam Yalsavar, Maryam's presentation on medical image.
Student presentation week 6 video.
Mar. 25: Lucas Fenaux, Lucas' presentation;
Anastasiia Livochka, Anastasiia's presentation on antibiotics resistance;
Arman Hafizi;
Gustavo Sutter, Gustavo's presentation.
Student presentation week 7 video.
April 1 (class time), April 4 (4pm), Project presentations.
Final Projects:
-
Deadline for handing in the 10 page project paper is April 15.
-
April 1 final project presentations:
Part 1 video.
-
April 4, 4pm. Final project presentations:
Part 2 video.
Maintained by Ming Li