William Stafford Noble, Ph.D.
Technological advances in the second half of the twentieth century have been dominated by two fields: computer science and molecular biology. The genome sequencing projects currently underway represent the first major convergence of these two fields and have led to the birth of a new subdiscipline, bioinformatics. Twenty-four completed genomes are publicly available on the web as of October 1999. In the post-genomic era, when complete genomes are known for many species and even for many individual organisms within a species, the volume of sequence data available will require increasingly sophisticated computational analyses. The following is a list of some of our laboratory’s current and past research projects.
Gene prediction and functional annotation. A biologist with access to an array of genomic data of various types wants to answer two primary questions. First, where are the genes within the complete genomic sequence? Answering this question accurately involves not only recognizing the boundaries of the genes, but also locating their constituent elements, including exons, promoter regions, regulatory binding sites, and other key features. Given these elements, it is straightforward to infer the amino acid sequence of the protein derived from each gene. The second question involves classifying each protein according to its function. Such classification is usually accomplished by inferring homology, or evolutionary relatedness, between the protein of interest and some protein of known function.
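The step from located exons to a protein sequence can be illustrated with a minimal Python sketch. It splices exon intervals out of a genomic string and translates the result with the standard genetic code. The function names, the 0-based half-open coordinates, and the toy sequence are assumptions made for illustration, and the sketch handles only forward-strand genes.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def splice_exons(genomic, exons):
    """Concatenate exon intervals given as (start, end), 0-based half-open."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: flank, exon 1, intron (GT...AG), exon 2, flank.
genomic = "CC" + "ATGGCT" + "GTAAGTTTCAG" + "AAATGA" + "CC"
exons = [(2, 8), (19, 25)]

protein = translate(splice_exons(genomic, exons))
print(protein)
```

The hard part of gene finding is producing the exon coordinates in the first place; once they are known, the translation itself is a table lookup.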
We are working on a project that is divided into two phases, corresponding to the two tasks of gene prediction and functional annotation. In the first phase, we will develop and apply a gene finding system. This system is designed to be scalable and flexible with respect to the gene features it models, the machine learning algorithms it employs, and the range of experimental data from which it learns. Using as the core machine learning algorithm a discriminative training method based upon hidden neural networks, we will first validate the system by applying it to the complete C. elegans genome, comparing its predictions to those made by other gene finders. We will then retrain the system for the more difficult task of recognizing genes in human DNA.
The second phase of this project will consist of two parts. First, the software framework used for the gene finding system from phase one will be generalized to model families of related proteins. The resulting system will be capable of functionally classifying new proteins; the models employed, however, will be fundamentally sequence-based. In order to learn from non-sequential data such as mRNA expression data from microarray experiments, we will also develop functional classification techniques using support vector machines (SVMs). The statistics calculated by the sequence-based modeling system will function as one set of features used by the SVM system. Additional features will come from, for example, DNA microarray experiments, the upstream promoter region of each gene, similarity scores to known protein families, and three-dimensional structural information. The resulting discriminative classification system is expected to provide strong protein recognition capabilities.
The long-term goal of this research is to produce a system that not only functionally classifies genes, but also provides explanations for those classifications. The system developed here will be capable of providing functional explanations based upon the features of a single gene. Furthermore, the probabilistic nature of the models produced here will allow them to be incorporated into a future system that will be capable of producing contextual functional explanations on the level of complete pathways or even the entire cell.
Microarray expression analysis. Our laboratory has developed a new method of functionally classifying genes using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised machine learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers.
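The supervised setting described above can be sketched in a few lines of Python. This is a minimal linear SVM trained by sub-gradient descent on the regularized hinge loss. It is not the laboratory's implementation: the function names, the hyperparameters (lam, lr, epochs), and the toy expression profiles are all assumptions chosen for illustration, and a practical analysis would use an established SVM package with a kernel suited to the data.

```python
def decision_value(w, b, x):
    """Signed score of expression profile x under the hyperplane (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Minimize lam/2 * ||w||^2 + hinge loss by per-example sub-gradient steps."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1:
                # Example is inside the margin: hinge term contributes.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Example is safely classified: only the regularizer acts.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical log-ratio profiles over three conditions.
# Label +1: genes of a known functional class; label -1: other genes.
X = [
    [2.0, 1.8, -0.1], [1.9, 2.1, 0.2], [2.2, 1.7, 0.0],
    [-0.1, 0.2, 1.9], [0.1, -0.2, 2.1], [0.0, 0.1, 1.8],
]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print(predict(w, b, [2.0, 2.0, 0.0]), predict(w, b, [0.0, 0.0, 2.0]))
```

The labeled genes play the role of the prior knowledge of gene function; the trained classifier is then applied to genes whose function is unknown.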
Protein modeling. We have developed a software toolkit, called Meta-MEME, for modeling families of related proteins. The models produced by Meta-MEME provide insight into the structural and functional operation of the proteins in question, and may be used to discover functional and evolutionary relationships between protein sequences.
Meta-MEME produces hidden Markov models (HMMs) of protein families. HMMs are probabilistic models that have been used extensively in speech recognition but have only recently been applied to problems in computational biology. Meta-MEME’s models improve upon previous protein HMMs in two ways. First, Meta-MEME focuses on motif regions, small regions that are essential for the proper functioning or structural conformation of the protein. This focus on motifs makes Meta-MEME’s models much smaller than traditional protein HMMs, allowing the motif-based HMMs to be trained more efficiently and from smaller data sets. The latter concern is of particular importance, since many protein families currently contain only a few known members. Second, Meta-MEME models imply a more complex model of evolution than is implied by standard protein HMMs. A Meta-MEME model may be non-linear, thereby allowing for the repetition or shuffling of motif-sized elements within the protein family. These large-scale evolutionary events cannot be adequately modeled by linear HMMs.
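As a rough illustration of how an HMM assigns a probability to a sequence, the following Python sketch implements the forward algorithm on a toy model with one background state and a two-position motif. It is not Meta-MEME code. The state names, the probabilities, and the two-letter alphabet are assumptions made to keep the example small. The transition from the last motif state back to the background state forms a cycle, so the motif may occur repeatedly, a simple case of a non-linear topology.

```python
# Hypothetical toy model: a background state plus a two-position motif.
START = {"bg": 1.0, "m1": 0.0, "m2": 0.0}
TRANS = {
    "bg": {"bg": 0.8, "m1": 0.2},  # stay in background or enter the motif
    "m1": {"m2": 1.0},             # motif positions are traversed in order
    "m2": {"bg": 1.0},             # return to background, allowing repeats
}
EMIT = {
    "bg": {"A": 0.5, "C": 0.5},
    "m1": {"A": 0.9, "C": 0.1},
    "m2": {"A": 0.1, "C": 0.9},
}


def forward_likelihood(seq, start=START, trans=TRANS, emit=EMIT):
    """Return P(seq | model), summed over all state paths."""
    # alpha[s] = P(symbols observed so far, current state is s)
    alpha = {s: start[s] * emit[s][seq[0]] for s in start}
    for symbol in seq[1:]:
        alpha = {
            t: emit[t][symbol]
            * sum(alpha[s] * trans[s].get(t, 0.0) for s in alpha)
            for t in start
        }
    return sum(alpha.values())


print(forward_likelihood("CA"), forward_likelihood("CC"))
```

Because the model has only a handful of parameters concentrated in the motif states, it needs far fewer training sequences than a model with one state per alignment column, which is the efficiency argument made above.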
Homology detection. Sequencing projects produce large quantities of biological sequence data; however, this data is only useful insofar as the functions of individual sequences are understood. Wet lab techniques for determining the function of a protein sequence can be time-consuming and expensive. Computational methods that infer sequence homologies (i.e., evolutionary relationships) can frequently replace wet lab experiments. By combining information from multiple sequences, hidden Markov models provide a powerful homology detection method. By focusing on motif regions, Meta-MEME’s models provide improved homology detection performance compared to standard protein HMMs.
However, we have shown that a linear combination of pairwise scores from a fast, heuristic algorithm can provide even better homology detection performance than model-based techniques. We call this algorithm Family Pairwise Search. We have developed accurate statistical scores for this algorithm as well as a motif-based version of the algorithm. This algorithm is also available via the web.
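The idea of combining pairwise scores can be sketched as follows. The pairwise scoring function here is a stand-in that counts shared 3-mers; it is an assumption for illustration, whereas an actual search would use scores from a heuristic alignment tool. The equal weights, the function names, and the toy sequences are likewise hypothetical.

```python
def kmer_set(seq, k=3):
    """All distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pairwise_score(a, b, k=3):
    """Stand-in similarity: number of k-mers the two sequences share."""
    return len(kmer_set(a, k) & kmer_set(b, k))


def family_score(candidate, family, weights=None):
    """Linear combination of pairwise scores against every family member."""
    if weights is None:
        # Equal weights, i.e. the average pairwise score.
        weights = [1.0 / len(family)] * len(family)
    return sum(
        w * pairwise_score(candidate, member)
        for w, member in zip(weights, family)
    )


# Hypothetical family of known members and a small database to search.
family = ["MKTAYIAKQR", "MKTAYLAKQR"]
database = {"hit": "GGMKTAYIAKQ", "miss": "GGSPLWVNDE"}

ranked = sorted(
    database,
    key=lambda name: family_score(database[name], family),
    reverse=True,
)
print(ranked)
```

No model of the family is trained at any point: each database sequence is compared directly with every known family member, and only the resulting scores are combined.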
Multiple alignment and phylogenetic analysis. In addition to detecting functional similarities, protein models may be used to determine evolutionary relationships among proteins. An alignment of multiple sequences provides an easily understood representation of the evolutionary correspondence of various sites in a set of related proteins. Furthermore, such alignments provide a rich source of data for inferring the phylogenetic relationships among species. However, recent work indicates that even an accurate multiple alignment of a large sequence set may yield an incorrect phylogeny, and that the quality of the phylogenetic tree improves when the input consists only of the highly conserved motif regions of the alignment. Our laboratory has introduced two methods of producing multiple alignments that include only the conserved regions of the initial alignment. Our results indicate that the motif alignments produced using these techniques lead to phylogenetic trees that are significantly better than trees based on alignments of entire sequences.
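Restricting an alignment to its conserved regions can be illustrated with a generic column filter. The sketch below is not one of the two methods referred to above, which are not specified here. It simply keeps the columns in which a single residue accounts for at least a chosen fraction of the rows. The threshold, the function names, and the toy alignment are assumptions.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=0.75):
    """Indices of columns whose most common symbol is a residue (not a gap)
    and occurs in at least min_fraction of the rows."""
    keep = []
    for j in range(len(alignment[0])):
        column = [row[j] for row in alignment]
        symbol, count = Counter(column).most_common(1)[0]
        if symbol != "-" and count / len(column) >= min_fraction:
            keep.append(j)
    return keep


def restrict(alignment, columns):
    """Project every aligned sequence onto the selected columns."""
    return ["".join(row[j] for j in columns) for row in alignment]


# Hypothetical four-sequence alignment; "-" marks a gap.
alignment = [
    "MKT-AYI",
    "MKTQAYL",
    "MRTQAYI",
    "MKS-GYV",
]

columns = conserved_columns(alignment)
print(columns, restrict(alignment, columns))
```

The reduced alignment can then be passed to a tree-building program in place of the full alignment.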
Pavlidis, Paul, Darrin P. Lewis and William Stafford Noble. “Exploring gene expression data with class scores.” Proceedings of the Pacific Symposium on Biocomputing, January 2-7, 2002. To appear.
Leslie, Christina, Eleazar Eskin and William Stafford Noble. “The spectrum kernel: An SVM-string kernel for protein classification.” Proceedings of the Pacific Symposium on Biocomputing, January 2-7, 2002. To appear.
Pavlidis, Paul and William Stafford Noble. “Analysis of strain and regional variation in gene expression in mouse brain.” Genome Biology 2(10):research0042.1-0042.15, 2001.
Pavlidis, Paul, Jason Weston, Jinsong Cai and William Stafford Noble. “Learning gene functional classifications from multiple data types.” Submitted for publication.
Pavlidis, Paul, Christopher Tang and William Stafford Noble. “Classification of genes using probabilistic models of microarray expression profiles.” Proceedings of BIOKDD 2001: Workshop on Data Mining in Bioinformatics. To appear.
Muhle, Rebecca A., Paul Pavlidis, William Noble Grundy and Emmet Hirsch. “A high throughput study of gene expression in preterm labor using a subtractive microarray approach.” American Journal of Obstetrics and Gynecology. To appear.