Please note: This PhD seminar will take place online.
Fatemeh Alipour, PhD candidate
David R. Cheriton School of Computer Science
Supervisors: Professors Lila Kari, Yang Lu
Traditional supervised learning methods for DNA sequence taxonomic classification are often limited by the need for labor-intensive labeling and time-consuming sequence alignments, restricting their applicability to large genomic datasets and distantly related organisms.
To address these challenges, this presentation introduces CGRclust, a novel framework that combines unsupervised twin contrastive clustering of Chaos Game Representations (CGRs) with convolutional neural networks (CNNs). Uniquely, CGRclust applies unsupervised learning to two-dimensional CGR images of DNA sequences in order to cluster the sequences themselves. By leveraging twin contrastive learning, CGRclust efficiently identifies distinctive sequence patterns without requiring sequence alignment or biological/taxonomic labels.
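For readers unfamiliar with Chaos Game Representation, the following minimal Python sketch shows how a DNA sequence can be turned into a two-dimensional image of the kind a CNN could consume. It is an illustration only, not CGRclust's implementation: the corner assignment (A, C, G, T on the unit square), the function names `cgr_points` and `fcgr`, and the resolution parameter `k` are assumptions chosen for the example.

```python
# Illustrative sketch only; not the CGRclust implementation.
# Assumed corner layout (one common convention): A bottom-left, C top-left,
# G top-right, T bottom-right of the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points.

    Starting from the centre of the unit square, each nucleotide moves the
    current point halfway towards that nucleotide's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip ambiguous symbols such as N
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


def fcgr(seq, k=3):
    """Bin the CGR points into a 2^k x 2^k grid of counts (a grayscale image)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    image = fcgr("ACGTTGCAAGGCT", k=2)
    for row in image:
        print(row)
```

Because every point is computed directly from the sequence, no alignment against other sequences is needed; the resulting grid of counts can be treated as a single-channel image.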
We demonstrate CGRclust’s capability by applying it to twenty-five diverse datasets with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. The results show CGRclust’s potential to overcome the limitations of conventional methods, offering a robust, scalable, and efficient alternative for unsupervised DNA sequence clustering.