Scientists have recently developed highly successful tools designed to probe the inner workings of a living cell. Technologies that obtain sequence information from DNA and protein molecules work in conjunction with tools such as X-ray devices and nuclear magnetic resonance machines to reveal the three-dimensional conformation of these molecules. More recent technologies, such as DNA microarrays, strive to investigate the interactions between the various proteins and genes within a cell. The net result of these investigations is the production of huge amounts of data.
The objective of bioinformatics is to store, retrieve, manipulate, visualize, analyze, integrate and interpret data from these information sources so that we can fully understand the vast array of life processes that occur in the living cell. This understanding is essential if we are to treat any of the 5,000 or so genetic diseases as well as the many communicable diseases. Even non-genetic diseases cannot be cured without an understanding of the underlying disease pathways. Applications of this knowledge include modern drug design and medical diagnostic procedures.
Current research in the Bioinformatics Research Group leads to the design, development and assessment of computational tools for the exploration of data in all these categories. In an effort to be well grounded in application areas, we collaborate with biologists to study the practical usefulness of the methods we develop. Here is a brief overview of some of our current research interests.
- Biocomputation and nanocomputation by self-assembly.
- Comparative genomics.
- Genome analysis including statistical methods for gene prediction.
- Inference of inheritance patterns of mutations (called haplotype inference).
- Knowledge inference from biomedical literature.
- Mass spectrometry data analysis.
- Protein function prediction.
- Protein structure prediction (including complete 3-d structures and binding sites).
- Software and theory of homology search and motif discovery.
At the risk of oversimplification, we may view bioinformatics data as dealing with sequence, structure and function. The Bioinformatics Research Group has made a particularly strong impact in the area of sequence analysis. This includes both theoretical studies and the development of application software for the processing of sequences. For example, the PatternHunter program [B. Ma, J. Tromp, M. Li, PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18:3(2002), 440-445] was used in the initial sequencing and comparative analyses of the mouse genome [many authors, including D. Brown, M. Li and B. Ma, Nature, Dec. 5, 2002] and the rat genome [many authors, including B. Ma]. More recent research, presented at ISMB 2005, includes the ExonHunter gene finder [B. Brejova, D. Brown, M. Li, T. Vinar, ExonHunter: A Comprehensive Approach to Gene Finding, Intelligent Systems for Molecular Biology, 2005].
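To give a flavour of the seed-based homology search mentioned above, here is a minimal sketch of finding spaced-seed hits between two DNA sequences. It is an illustration only and not the PatternHunter implementation: the function names are hypothetical, the index is a plain in-memory dictionary, and the seed shown is the weight-11, length-18 pattern commonly cited for PatternHunter, which should be treated as an assumption here. A real search tool would go on to extend each hit into a scored alignment.

```python
from collections import defaultdict

# Assumed seed pattern: '1' means the position must match,
# '0' means the position is ignored ("don't care").
SEED = "111010010100110111"


def seed_key(window, seed=SEED):
    """Keep only the characters of `window` that sit under a '1' in the seed."""
    return "".join(c for c, s in zip(window, seed) if s == "1")


def spaced_seed_hits(query, target, seed=SEED):
    """Return (query_pos, target_pos) pairs whose windows agree at every '1' position."""
    span = len(seed)

    # Index every window of the target by its spaced key.
    index = defaultdict(list)
    for j in range(len(target) - span + 1):
        index[seed_key(target[j:j + span], seed)].append(j)

    # Look up every window of the query in that index.
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(seed_key(query[i:i + span], seed), []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    target = "ACGTTGCATGCCGATAGC"
    query = "ACGATGCATGCCGATAGC"  # differs from target only at a "don't care" position
    print(spaced_seed_hits(query, target))
```

Because the mismatch in the usage example falls on a "don't care" position, the two windows still produce a hit, whereas a contiguous seed of the same span would miss it. This tolerance is the motivation for using spaced rather than contiguous seeds.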
The determination of the structure of a protein provides important information on its function and interactions, and is crucial for a full understanding of the role played by the protein within a cell.
Current research involves the development of methods for the prediction of protein structure from amino acid sequence, using a combination of computational and biochemical techniques. Two computational aspects of this problem are the development of scoring functions capable of recognizing correctly folded proteins, and the development of search algorithms for exploring the folding space of proteins. A key achievement in this area is the development of a scoring function able to accurately recognize native protein structures [McConkey et al., Proc. Natl. Acad. Sci. U.S.A., Vol. 100, no. 6, Mar. 18, 2003, pp. 3215-3220]. Another significant accomplishment has been the success of the RAPTOR program, which was ranked top among individual automatic protein 3D structure prediction servers at the recent CAFASP3 competition. Information on RAPTOR can be found in: [J. Xu and M. Li, Assessing RAPTOR's new linear programming approach for fold recognition in CAFASP3. PROTEINS: Structure, Function, and Genetics, 53(S6), Oct. 2003, pp. 579-584].
Modern health and agricultural research requires the high-throughput identification of proteins from biological samples. Mass spectrometry (MS) and tandem mass spectrometry (MS/MS) have become the standard experimental methods for protein identification. The complexity and size of mass spectrometry data exclude the possibility of manual interpretation. Using a novel algorithm [B. Ma, K. Zhang, C. Liang. An Effective Algorithm for the Peptide De Novo Sequencing from MS/MS Spectrum. JCSS 70, 2005, pp. 418-430], we developed the PEAKS software [B. Ma et al. PEAKS: Powerful Software for Peptide De Novo Sequencing by Tandem Mass Spectrometry. Rapid Communications in Mass Spectrometry 17(20), 2003, pp. 2337-2342] for peptide de novo sequencing and protein identification from tandem mass spectrometry data. The software is used worldwide in several hundred research institutes and has become the industry standard for peptide de novo sequencing.
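As a rough illustration of the data involved, the sketch below computes the theoretical singly charged b- and y-ion masses of a peptide, which are the fragment peaks an MS/MS spectrum is usually compared against. It is not the PEAKS algorithm: the function names are hypothetical, only the 20 standard amino acids are covered, and modifications, higher charge states and other ion types are ignored.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values, one pair per backbone cleavage site.

    Both lists are ordered by cleavage site from the N-terminus, so
    b_ions[k] is the prefix of k + 1 residues and y_ions[k] is the
    complementary suffix.
    """
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    b_ions, y_ions = [], []
    prefix = 0.0
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        b_ions.append(prefix + PROTON)
        y_ions.append(total - prefix + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("PEPTIDE")
    for k, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
        print(f"cleavage {k}: b = {b_mz:.4f}, y = {y_mz:.4f}")
```

The difference between consecutive b-ion masses equals the mass of one residue, which is the basic observation that de novo sequencing methods exploit. Leucine and isoleucine have identical masses, so this ladder alone cannot tell them apart.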
We have also investigated large-scale duplication in the history of the flowering plant Arabidopsis thaliana [T. Vision, D. Brown, S. Tanksley. The origins of genome duplication in Arabidopsis. Science 290, 2000, pp. 2114-2117], and in the human genome as part of the Human Genome Project, as reported in the original paper announcing the draft human genome sequence [International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 2001, pp. 860-921].