DNA microarrays are emerging as one of the keys for deciphering the information content of the genome. The analysis of the massive amounts of gene expression data these methods generate is an enormous challenge. Working with methods from data mining, machine learning and statistics, we are developing algorithms and tools that take advantage of these new technologies. In collaboration with laboratory researchers, we are applying these methods to diverse domains including cancer, diabetes, viral infection, and nervous system function.
Enabled by advances in technology, molecular biological research is moving beyond the study of individual genes and into larger questions of how genes work together to generate the complex behaviors of cells and organisms. Among the keys to gaining this integrated understanding of the cell are new methods for measuring the expression levels (one measure of gene activity) of thousands of genes simultaneously. These technologies include cDNA microarrays and oligonucleotide arrays (DNA chips). In a typical experiment, measurements are taken from two or more metabolic conditions, genotypes, or pathological states. With such data, one can ask questions such as:
- What genes show different expression levels between the samples?
- What patterns of gene expression do the genes show?
- What kinds of genes show ‘interesting’ patterns in the data?
- What does the data say about the function of a particular gene?
- How does this data relate to other information about these genes?
These questions, and others, represent the challenges set forth by biologists for computational biologists. Particular areas we are concentrating on are:
Statistics and data mining. Typically the first question asked about an expression data set is ‘which genes show differential expression?’ For example, in a cancer study we might want to know which genes show differences in expression between tumor and normal tissue samples. To help answer this type of question, we are developing statistical methods which are robust yet simple to apply. Combined with appropriate visualization, these methods provide a powerful means of analyzing complex data sets. Our goal is to provide a set of tools which can be readily applied by biologists who wish to explore their data in a straightforward and statistically meaningful way.
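As a rough illustration of the differential-expression question above, the sketch below ranks genes by a two-sample Welch t statistic between tumor and normal samples. This is only one common choice of statistic and is not necessarily the method developed in this project; the function names, the column layout, and the toy expression values are all hypothetical.

```python
from math import copysign, sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        # No within-group spread: the statistic is undefined, so report
        # 0 for identical means and a signed infinity otherwise.
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / se


def rank_genes(expression, tumor_cols, normal_cols):
    """Return (gene, t) pairs sorted by decreasing absolute t statistic."""
    scores = []
    for gene, values in expression.items():
        tumor = [values[i] for i in tumor_cols]
        normal = [values[i] for i in normal_cols]
        scores.append((gene, welch_t(tumor, normal)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical toy data: log expression values, columns 0-2 tumor, 3-5 normal.
expression = {
    "geneA": [8.1, 7.9, 8.3, 5.0, 5.2, 4.9],
    "geneB": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],
}
ranked = rank_genes(expression, tumor_cols=[0, 1, 2], normal_cols=[3, 4, 5])
for gene, t in ranked:
    print(f"{gene}\t{t:.2f}")
```

In a real study with thousands of genes, the resulting statistics would also need a correction for multiple testing before any gene is called differentially expressed.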
Relating expression to known functions. For most genes, a newly measured expression profile does not exist in an informational vacuum. Databases of sequences, scientific literature, annotations and ontologies all can be used to help make sense of these complex data sets. We are developing methods which unite information from multiple sources to aid biologists in the interpretation of their data. The goal is to make new connections between genes, their expression patterns, and evidence from other sources to help gain an integrated view of cellular function.
Gene and sample classification. Genome sequencing has provided science with many more questions than answers. Thousands of genes that have been identified have little or nothing known about their function. Our approach to this problem is to apply machine learning techniques to gene expression data as well as other types of data such as promoter structure and phylogeny. These supervised methods provide a complementary approach to unsupervised methods such as clustering. The same methods can also be applied to the classification of the samples in an experiment.
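To make the supervised setting concrete, here is a minimal sketch of classifying an unannotated gene from its expression profile. It uses a k-nearest-neighbor vote with Pearson correlation as the similarity measure, chosen purely for brevity; the text above does not name a specific learning algorithm, and the labels, profiles, and function names here are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (norm_x * norm_y)


def knn_predict(profile, labeled, k=3):
    """Predict a label by majority vote among the k most correlated profiles."""
    neighbors = sorted(
        labeled, key=lambda item: pearson(profile, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set: genes with known functional labels.
labeled = [
    ([1, 2, 3, 4], "class_up"),
    ([2, 3, 5, 6], "class_up"),
    ([1, 3, 4, 6], "class_up"),
    ([4, 3, 2, 1], "class_down"),
    ([6, 5, 3, 2], "class_down"),
    ([5, 4, 4, 1], "class_down"),
]
print(knn_predict([0, 1, 2, 5], labeled))
```

The same function could classify samples instead of genes by passing per-sample profiles (one value per gene) with sample labels.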
Please visit Paul’s home page for more information and software to download.
Pavlidis, P., Weston, J., Cai, J., and Grundy, W.N. (Submitted) Supervised classification of genes using heterogeneous data.
Pavlidis, P., Lewis, D.P., and Noble, W.S. (2002, to appear) Exploring gene expression data with class scores. Proceedings of the Pacific Symposium on Biocomputing 7.
Pavlidis, P. and Noble, W.S. (2001) Analysis of strain and regional variation of gene expression in mouse brain. Genome Biology.
Pavlidis, P., Tang, C., and Grundy, W.N. (2001) Classification of genes using probabilistic models of microarray expression profiles. Workshop on Data Mining in Bioinformatics, held in conjunction with SIGKDD01.
Muhle, R.A., Pavlidis, P., Grundy, W.N., and Hirsch, E. (2001) A high throughput study of gene expression in preterm labor using a subtractive microarray approach. American Journal of Obstetrics and Gynecology 185, pp. 716-724.
Pavlidis, P., Weston, J., Cai, J., and Grundy, W.N. (2001) Gene functional classification from heterogeneous data. Proceedings of the Fifth Annual International Conference on Computational Biology (RECOMB), 249-255.
Pavlidis, P., Furey, T.S., Liberto, M., Haussler, D., and Grundy, W.N. (2001) Promoter region-based classification of genes. Proceedings of the Pacific Symposium on Biocomputing 6, 151-164.