STA613/CBB540: Statistical methods in computational biology: Spring 2016

Prof: Sayan Mukherjee sayan@stat.duke.edu OH: Wednesday 2:15-3:15pm, 112 Old Chem
TA: Ryan Muraglia OH: Wednesday 10:00am-12:00pm, SCC in Old Chem
Class: Tu/Thu 3:05-4:20pm, 025 Old Chem

Description

This course is based on case studies of statistical approaches to problems in computational biology. We will learn about statistical modeling in computational biology by formulating biological questions and repeating the following steps:
  1. formalize the question as a probabilistic model (typically via a likelihood);
  2. clarify the interpretation of model parameters and the model assumptions;
  3. develop methods for parameter estimation;
  4. quantify uncertainty in parameter estimation;
  5. interpret the parameters to address the biological question.
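
As a minimal sketch of the five steps above, consider a toy question that is not taken from any of the course case studies: how common is a variant allele in a population? The example assumes a binomial likelihood for the number of variant alleles among 2n sampled chromosomes, uses the closed-form maximum likelihood estimate, and reports a Wald (normal approximation) 95% interval. The function name, argument names, and counts are hypothetical, and Python is used purely for illustration; the same few lines translate directly to R.

```python
import math


def allele_frequency_mle(alt_allele_count, n_individuals):
    """Estimate an allele frequency under a binomial model (illustrative only).

    Step 1 (model): alt_allele_count ~ Binomial(2 * n_individuals, p).
    Step 2 (assumptions): p is the population frequency of the variant allele;
        chromosomes are treated as independent draws from the population.
    """
    n_chrom = 2 * n_individuals  # diploid: two chromosomes per individual
    if n_individuals <= 0 or not 0 <= alt_allele_count <= n_chrom:
        raise ValueError("counts are inconsistent with a diploid sample")

    # Step 3 (estimation): the binomial MLE is the sample proportion.
    p_hat = alt_allele_count / n_chrom

    # Step 4 (uncertainty): Wald standard error and approximate 95% interval.
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n_chrom)
    z = 1.96  # normal quantile for a two-sided 95% interval
    interval = (max(0.0, p_hat - z * std_err), min(1.0, p_hat + z * std_err))
    return p_hat, std_err, interval


# Hypothetical data: 60 variant alleles observed in 100 diploid individuals.
p_hat, std_err, interval = allele_frequency_mle(alt_allele_count=60, n_individuals=100)

# Step 5 (interpretation): report the estimate together with its uncertainty.
print(f"estimated allele frequency: {p_hat:.3f}")
print(f"approximate 95% interval: ({interval[0]:.3f}, {interval[1]:.3f})")
```

The Wald interval is chosen only because it is short to write down; it behaves poorly when the estimate is near 0 or 1 or the sample is small, which is the kind of model assumption that step 2 asks us to make explicit.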

Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.

The course grade is based on a midterm (30%), a final project (50%), and biweekly homeworks (20%). The project can be either a reanalysis of the data in one of the case studies covered during the semester or a project of interest to the student (rotation projects are great). Homeworks are due to me exactly two weeks after they are handed out at the beginning of class. Late homeworks will not be accepted, with one exception: each student is allowed one late homework (at most one week late) during the course. Students may (and should) discuss the homework assignments collaboratively; however, I expect each student to program and write up their own homework solutions. Please write the names of the students you discussed the homework assignment with at the top of your solutions.

A second set of references for R will also be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and tutorials on R are widely available online. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.

We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:

  1. Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
  2. Ewens and Grant, Statistical Methods in Bioinformatics
  3. Cristianini and Hahn, Introduction to Computational Genomics
  4. Sayan Mukherjee, Statistical methods for computational biology
  5. Kevin Murphy, Machine Learning: A Probabilistic Perspective
  6. Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
  7. Joseph Felsenstein, Inferring Phylogenies

This syllabus is tentative and will almost surely be superseded. Reload your browser for the current version.

A link to a list of possible projects will appear soon.

Note: The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. Presentations will be on April 10th and 15th; the reports will be due on April 18th (Friday).


Syllabus

Lecture notes | Topic | Homework
Jan 14 | Modelling biological phenomena [Pearson, 1893] [Turing, 1952] |
Jan 19 | Inference of population structure I [Pritchard et al., 2000] [Stinging commentary to Nicholas Wade] | HW 1 due Jan 28
Jan 21 | eQTL mapping [Stranger et al., 2007] |
Jan 26-28 | Hypothesis testing [Storey et al., 2003] [Stephens, 2014] [Subramanian et al., 2005] |
Feb 2 | No class |
Feb 4 | Markov chain Monte Carlo [Rosenthal] | HW 2 due Feb 22 (genotypes, expression)
Feb 9, 11 | Linear Mixed Models [Runcie, 2013] [Yang et al., 2014] |
Feb 16 | Epigenomics [Lea et al., 2015] |
Feb 18 | Epistasis [Sharp et al., 2016] [Crawford et al., 2015] |
Feb 23-25, Mar 1 | Motif finding and EM [Bailey and Elkan, 1994] [Dempster et al., 1977] |
Mar 3 (HMM notes, Coalescent notes) | Inference of population histories and HMMs [Li and Durbin, 2011] |
Mar 8 | Mixture models and EM [Bailey et al., 1995] |
Mar 10 | Hidden Markov models [Burge & Karlin, 1997] | Midterm due Mar 19
Mar 22 | Review and proof of EM [Proof of EM] |
Mar 24 | Gene network models [Schafer & Strimmer, 2005] [Wright, 1918] |
Mar 29 (more notes) | HMMs |
Mar 31 (more notes) | Morphometrics [Bookstein, 1996] [Milnor, 2010] |
Apr 5 | Microbiomes |
Apr 7 | Species models and the Enigma code [IJ Good, 1953] [IJ Good, 1979] |
Apr 12 | Open |
Apr 14 | Open |
Apr 19 | Final project presentations |
Apr 21 | Final project presentations |