2007 Project
From Computational Genomics Course
Project Proposal
A one-page project proposal is due Tuesday, October 30. In this proposal, you should briefly describe your general problem area (citing literature as needed) and outline in some detail what you plan to do. Comment on the algorithms or models you plan to use, how you will implement them, what data sets you will analyze, and what scientific question you will address.
Project Guidelines
A good project should be:
- challenging. The project is an opportunity to do something larger in scale, more difficult, and more open-ended than what you have done for the homeworks. You should choose something sufficiently ambitious to give you a taste of what it is like to do research in computational genomics. Recall that the project is worth 30% of your grade (equivalent to more than three homeworks); the amount of work you do for it should be commensurate.
- focused. While the project should be challenging, you have to be realistic about how much can be done in six to eight weeks. Focus on a small piece so that you have time to read the literature, design and implement a method, analyze some data, and do a nice job of writing everything up. If things go well, you can always expand the project later.
- relevant. It's good to find synergy with your other work, but the project has to be closely related to the material covered in this class. It should involve probabilistic modeling, scientific questions in genomics, genetics, or molecular biology, nontrivial algorithms, and some significant programming.
Recall that the expectations for BTRY 484 and BTRY 684 are somewhat different. A 684 project should involve some novel research. It may not be possible in the time available to make a huge leap forward, but you should try something new. Except in rare cases, you should also implement a method of your own design and apply it to real biological data. Ideally, a 484 project would also involve some implementation and data analysis, but novelty is not required -- it is okay to reimplement a published method, provided it is sufficiently challenging.
All projects will be due Friday, December 14. Please turn them in to me at my office.
Some Possible Projects
The following is a list of possible ideas for projects. Some of them are quite focused, others are more vague. Because of time limits, I have only given brief sketches here. If you find any of these possibilities interesting, come and talk to me and I may be able to provide you with some more information. Feel free to mix and match among these ideas, and to combine them with ideas of your own. You are also free, of course, to choose something not on this list.
(NOTE: these have been carried over from last year. I will try to update them with some new ideas during the next few days.)
- Cross-species motif finding. There has been substantial interest in recent years in incorporating cross-species data into motif-finding algorithms, especially in using evolutionary conservation to help in the identification of regulatory elements. However, there's a lot more that could be done, ranging from relatively simple methods for combining standard motif models with conservation scores (say, from phastCons), to full probabilistic combinations of motif models and phylogenetic models. More use could be made of insertions and deletions, more could be done to allow for gain and loss of regulatory elements over time, and more could be done to model groupings of regulatory elements into cis-regulatory modules. See, e.g., Moses et al., Sinha et al., Wang and Stormo, Siddharthan et al., Blanchette et al.
- Reconstruction of protein-coding genes. Comparative genomics gurus David Haussler and Webb Miller are leading an ambitious project to reconstruct the entire genomes of ancestral mammals, based on present-day mammalian genomes. We have been somewhat involved in this work and have agreed to use our expertise in probabilistic models of the evolution of protein-coding DNA to develop improved tools for reconstructing ancestral genes (in some respects, these are the most important parts of the genome to get right!). The goal of this project would be to take a first pass at this problem, using relatively simple codon models. It would involve using and adapting our PHAST (PHylogenetic Analysis with Space/Time models) software package. See Ma et al., Blanchette et al., and a recent news article in Wired Magazine.
- Identification of lineage-specific substitution patterns. We have pioneered new methods for detecting sequences under lineage-specific selective pressure, and are using these methods to identify lineage-specific functional elements and to measure the rate of turnover of functional elements over evolutionary time (see Pollard et al. for a recent high-profile example). There are a number of ways in which this work could be extended and generalized, for example, by considering changes in the pattern as well as the rate of substitution, and by relaxing some of the simplifying assumptions about what types of "gain" and "loss" events can occur. There are also fantastic opportunities for data analysis using improved models or the existing model. We are just now beginning to have sufficiently deep sequence data to address these questions of gain and loss in mammals genome-wide. Our current methods are described in this RECOMB 2006 paper.
- Fast alignment sampling. Steady progress is being made on the difficult problem of "statistical alignment", i.e., treating pairwise and multiple alignment as a statistical problem, using maximum likelihood or Bayesian methods for parameter estimation, and allowing consideration of not just a single alignment but a whole distribution of alignments. Some of the most promising methods use Markov chain Monte Carlo methods to sample from the posterior distribution over multiple alignments. A bottleneck in these approaches is the step that samples a single pairwise alignment, or sometimes a single three-way alignment, conditional on the rest of the multiple alignment. This project would involve experimenting with fast heuristic algorithms for proposing draws from this conditional distribution, and then using rejection sampling to "correct" for the inaccuracies in the heuristics. See, e.g., Redelings and Suchard and Lunter et al.
- RNA motif finding. In (statistical) DNA motif finding, one tries to learn a motif model from sequences that are believed to have binding sites for the same transcription factor, and then to identify the individual binding sites. An analogous, but harder, problem is to identify common "motifs" in a set of structural RNAs, perhaps defined partly by their sequences of bases and partly by their secondary structures. It is possible in principle to develop methods for RNA motif finding that are directly analogous to standard DNA methods but use stochastic context-free grammars in place of motif models. Actually making this work, however, could be a significant challenge. (I do not believe it has been tried.) The place to start would probably be with Gibbs sampling or simulated annealing approaches.
- Detection of selection using both comparative sequence data and polymorphism data. We have developed a widely used program called phastCons for identifying sequences under negative selection based on comparative sequence data. Using a phylogenetic hidden Markov model (phylo-HMM), phastCons finds sequences that are evolving more slowly than would be expected under neutral evolution, and predicts that they are under negative selection. However, another rich source of information about selection, not considered by phastCons, is polymorphism data -- sequences under selection show different SNP densities, and different frequency spectra, than sequences that are evolving neutrally. While there has been quite a bit of work on detecting selection based on polymorphism data (much of it here at Cornell), and quite a bit based on comparative sequence data, not nearly as much has been done to use both kinds of data together. The goal of this project would be to develop a simple unified probabilistic model for comparative sequence data and polymorphism data, and to use it to detect sequences under negative (or possibly positive) selection.
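To make the rejection-sampling step in the fast alignment sampling idea above a little more concrete, here is a minimal Python sketch. It is only a toy: the three "alignments", their weights, and all function names are hypothetical stand-ins, and a real project would replace the target with the conditional posterior over alignments and the proposal with a fast heuristic aligner. The sketch assumes that a constant `bound` is known such that the target weight never exceeds `bound` times the proposal probability.

```python
import random

def rejection_sample(target_weight, proposal_prob, draw_proposal, bound, rng):
    """Return one draw from the distribution proportional to target_weight.

    Assumes target_weight(x) <= bound * proposal_prob(x) for every x
    that the proposal can produce.
    """
    while True:
        x = draw_proposal(rng)
        # Accept with probability target / (bound * proposal); this ratio
        # is what "corrects" for the inaccuracy of the heuristic proposal.
        accept_prob = target_weight(x) / (bound * proposal_prob(x))
        if rng.random() < accept_prob:
            return x

# Hypothetical toy problem: three candidate alignments.
# Unnormalized target weights (stand-in for the exact conditional posterior).
target = {"aln_A": 6.0, "aln_B": 3.0, "aln_C": 1.0}
# Probabilities under a cheap, slightly wrong heuristic proposal.
proposal = {"aln_A": 0.5, "aln_B": 0.25, "aln_C": 0.25}

def draw_proposal(rng):
    """Draw one candidate from the heuristic proposal distribution."""
    names = list(proposal)
    return rng.choices(names, weights=[proposal[n] for n in names])[0]

# Smallest valid envelope constant for this toy problem.
bound = max(target[x] / proposal[x] for x in target)

rng = random.Random(0)
draws = [
    rejection_sample(target.get, proposal.get, draw_proposal, bound, rng)
    for _ in range(20000)
]
freq = {x: draws.count(x) / len(draws) for x in target}
print(freq)
```

The empirical frequencies should be close to the normalized target (0.6, 0.3, 0.1) rather than to the proposal. In practice the hard parts are finding a proposal that is fast yet close to the target and bounding the ratio between the two; when no such bound is available, a Metropolis-Hastings acceptance step is a common alternative.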
Example Projects
Below are some example project reports from previous years. These were generally strong projects that earned A or A- grades. They are not correct in all respects, so beware of occasional errors.
- "Predicting the loss of a microRNA motif in one of twelve species of Drosophila," Nandita Garud (undergrad, Biology/Biometry). This is a good example of a class project that nicely complemented an undergraduate thesis project. ( PDF )
- "RNA motif finding: A fully Bayesian approach," Benjamin Logsdon (grad, Computational Biology). This is a first-rate grad project that could grow into a publication. ( PDF )
- "Prediction of RNA secondary structures by modified Nussinov algorithm," Jalal Siddiqui (undergrad, Chemical Engineering). This is an example of a solid undergrad project by a student who didn't have much previous background in computational biology. Jalal started with a fairly straightforward algorithm discussed in the Durbin book but extended it in an interesting way and applied it to a real biological data set. ( PDF )
- "Motif finding," Chun-Nam Yu (grad, Computer Science). This is a good example of a solid "reimplementation" project. Chun-Nam built a motif finder similar in many ways to MEME and used it to reanalyze a published data set. ( PDF )
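For readers unfamiliar with the starting point of the RNA secondary structure project listed above, the following is a minimal Python sketch of the basic textbook Nussinov recursion, which maximizes the number of nested base pairs. It is not the modified algorithm from that report. The function name, the restriction to Watson-Crick pairs, and the `min_loop` parameter are assumptions made for illustration.

```python
def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in an RNA sequence.

    Basic Nussinov dynamic program, Watson-Crick pairs only.
    min_loop is the minimum number of unpaired bases required
    between two paired positions (an illustrative assumption).
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # score[i][j] = best number of pairs within seq[i..j], inclusive.
    score = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case 1 and 2: leave position i or position j unpaired.
            best = max(score[i + 1][j], score[i][j - 1])
            # Case 3: pair i with j, if the bases are complementary.
            if (seq[i], seq[j]) in pairs and j - i > min_loop:
                inner = score[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            # Case 4: bifurcation into two independent substructures.
            for k in range(i + 1, j):
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score[0][n - 1]

print(nussinov("GGGAAAUCC"))
```

This version returns only the optimal score in O(n^3) time; recovering an actual structure requires a traceback through the table, and more realistic extensions replace the pair count with energy or probabilistic scores.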
