
Image from ScienceNOW, July 14, 2005
Note: This page has fallen out of date. See publications for latest work.
While working in David Haussler's lab at the University of California, Santa Cruz, I developed a program called ExoniPhy (pronounced "ex-ON-i-FY") for predicting evolutionarily conserved protein-coding exons from multiply aligned genomic sequences. ExoniPhy is based on a statistical model of gene structure and evolution called a phylogenetic hidden Markov model, or "phylo-HMM", and uses several new techniques to help discriminate between coding and noncoding sequences (see paper). We have run ExoniPhy on alignments of the human, mouse, and rat genomes (see "Exoniphy" track on UCSC Genome Browser, July 2003 human version) and identified thousands of potentially "novel" human exons -- i.e., sequences that bear the evolutionary signature of protein-coding genes but that are not supported, or are only weakly supported, by mRNA and EST sequence data. In collaboration with Michael Brent's lab at Washington University, St. Louis (as part of the Mammalian Gene Collection project), and with Phil Green's lab at the University of Washington, we have tested about 300 predicted gene fragments by RT-PCR and shown that roughly two-thirds of them are expressed and spliced, strongly suggesting that they are real genes. Recently, Bruce Roe and colleagues at the University of Oklahoma have examined about 20 of these novel gene fragments in zebrafish embryos and found evidence that some belong to developmental genes, which were presumably missed by mRNA and EST sequencing projects because they are expressed only at very specific times or in specific tissues. Work is underway to refine our prediction methods and to test more genes in the lab.
We have developed another phylo-HMM-based program, called phastCons, that identifies orthologous sequences that are significantly more similar across species than would be expected if they were evolving neutrally. Such sequences are likely to have critical biological functions and to be maintained by purifying natural selection. Using phastCons, we have conducted the most extensive study to date of conserved elements in eukaryotes, including vertebrate, insect, nematode worm, and yeast genomes (see paper). We found that, as organism complexity increases (from yeasts to worms to insects to vertebrates), increasingly large fractions of conserved bases fall outside of coding regions, perhaps reflecting the increasing importance of regulatory functions. In vertebrates, the vast majority of conserved bases do not encode proteins. Some of these sequences presumably regulate transcription and splicing, some are probably RNA genes, and some may function in as yet unknown ways. We also found that certain extraordinarily conserved sequences appear to be important in post-transcriptional regulation. PhastCons is the basis of the popular conservation tracks in the UCSC Genome Browser, and is being heavily used in the ENCODE project.
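To make the idea of a two-state phylo-HMM concrete, here is a minimal Python sketch. It is not the phastCons implementation and makes several simplifying assumptions that do not come from this page: a star-shaped tree instead of a real phylogeny, the Jukes-Cantor substitution model, a "conserved" state whose branch lengths are shrunk by a hypothetical factor `rho`, and a single hypothetical self-transition probability `p_stay`. All function names, branch lengths, and parameter values are illustrative only. The sketch computes, for each alignment column, the posterior probability of the conserved state using the forward-backward algorithm.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def column_likelihood(column, branch_lengths, scale):
    """Likelihood of one alignment column on a star tree (an assumed simplification).

    Sums over the unknown root base; every branch length is multiplied by `scale`.
    """
    total = 0.0
    for root in BASES:
        p = 0.25  # uniform root distribution under Jukes-Cantor
        for base, t in zip(column, branch_lengths):
            p *= jc_prob(root, base, scale * t)
        total += p
    return total

def conservation_posteriors(columns, branch_lengths, rho=0.3, p_stay=0.9):
    """Posterior probability of the conserved state for each column.

    State 0 is "conserved" (tree scaled by the hypothetical factor rho);
    state 1 is "neutral" (tree unscaled). Uses scaled forward-backward.
    """
    scales = (rho, 1.0)
    emit = [[column_likelihood(c, branch_lengths, s) for s in scales]
            for c in columns]
    trans = [[p_stay, 1.0 - p_stay], [1.0 - p_stay, p_stay]]
    n = len(columns)

    # Forward pass, renormalized at each column to avoid underflow.
    prev = [0.5 * emit[0][0], 0.5 * emit[0][1]]
    norm = sum(prev)
    prev = [x / norm for x in prev]
    fwd = [prev]
    for i in range(1, n):
        cur = [emit[i][k] * sum(prev[j] * trans[j][k] for j in range(2))
               for k in range(2)]
        norm = sum(cur)
        prev = [x / norm for x in cur]
        fwd.append(prev)

    # Backward pass, also renormalized at each column.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[j][k] * emit[i + 1][k] * bwd[i + 1][k] for k in range(2))
               for j in range(2)]
        norm = sum(cur)
        bwd[i] = [x / norm for x in cur]

    # Combine the two passes; normalizing per column gives the posterior.
    posteriors = []
    for f, b in zip(fwd, bwd):
        w0, w1 = f[0] * b[0], f[1] * b[1]
        posteriors.append(w0 / (w0 + w1))
    return posteriors

# Toy three-species alignment: ten identical columns, then ten divergent ones.
example_columns = ["AAA"] * 10 + ["ACG"] * 10
example_branches = [0.3, 0.3, 0.3]  # made-up branch lengths (substitutions/site)
for col, post in zip(example_columns,
                     conservation_posteriors(example_columns, example_branches)):
    print(col, round(post, 3))
```

In this toy alignment the identical columns should receive high posterior probabilities of conservation and the divergent columns low ones. A realistic model would replace the star tree with the actual species phylogeny, use a richer substitution model, and estimate its parameters from data.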
PhastCons and programs like it make the simplifying assumption that each sequence is either under selection in all species or under selection in no species. However, many cases are known of sequences that have gained or lost function (and hence come under selection or begun to drift) on some branch of the phylogeny, and we would like to be able to detect these sequences as well. In addition, we would like to address the critical question of the rate of "turnover" of functional elements -- that is, how frequently are functional elements gained and lost over evolutionary time, and how does this affect our ability to detect them using cross-species comparisons? With Katie Pollard (now at UC Davis) and David Haussler, I have recently developed a new method called DLESS for detecting sequences under lineage-specific negative selection. Work is underway to extend these methods to allow for positive selection as well, and to use them to obtain good estimates of the rate of turnover of functional elements.
I am participating in a project led by David Haussler and Webb Miller to reconstruct the complete genome sequence of the last common ancestor of all boroeutherian mammals, which lived roughly 80 million years ago (see preliminary paper). My focus in this project is on adapting the phylogenetic models I have used in gene finding to improve reconstructions of protein-coding sequences. In related work, I am collaborating with Webb Miller and his student Bob Harris on using probabilistic reconstructions of ancestral genomes to improve genomic alignments.