STA613/CBB540: Statistical methods in computational biology: Spring 2017

Prof: Sayan Mukherjee sayan@stat.duke.edu OH: Wednesday 2:15-3:15pm, 112 Old Chem
TA: Justin Silverman OH: Th 10-12, 211A Old Chem
Class: Tu/Thu 8:30-9:45am 025 Old Chem

Description

This course is based on case studies of statistical approaches to problems in computational biology. We will learn about statistical modeling in computational biology by formulating biological questions and repeating the following steps:
  1. formalize the question as a probabilistic model (typically via a likelihood);
  2. clarify the interpretation of model parameters and the model assumptions;
  3. develop methods for parameter estimation;
  4. quantify uncertainty in parameter estimation;
  5. interpret the parameters to address the biological question.
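As a rough illustration of these five steps, here is a minimal sketch on a toy problem: estimating the frequency of an allele from a sample of chromosomes. The data (30 copies of the allele among 100 chromosomes), the binomial model, the Wald interval, and all function names are assumptions made up for this example; they are not taken from the course material, and the course itself recommends R rather than Python.

```python
import math

def log_likelihood(p, k, n):
    """Step 1: binomial log-likelihood (up to an additive constant)
    for allele frequency p, given k copies among n chromosomes."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def mle(k, n):
    """Step 3: the maximum likelihood estimate of p is the sample proportion."""
    return k / n

def wald_interval(k, n, z=1.96):
    """Step 4: approximate 95% interval from the asymptotic standard error."""
    p_hat = mle(k, n)
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

# Hypothetical data: 30 of 100 sampled chromosomes carry the allele.
# Step 2 (assumption): chromosomes are sampled independently with the same p.
k, n = 30, 100
p_hat = mle(k, n)
lower, upper = wald_interval(k, n)

# Step 5: report the estimate and its uncertainty in biological terms.
print(f"estimated allele frequency: {p_hat:.3f}")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

The Wald interval is only one of several ways to quantify uncertainty here; it is chosen because it is short to write down, and it behaves poorly when the frequency is close to 0 or 1.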

Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.

A second set of references for R will also be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and tutorials on R are easy to find online. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.

We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:

  1. Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
  2. Ewens and Grant, Statistical Methods in Bioinformatics
  3. Cristianini and Hahn, Introduction to Computational Genomics
  4. Sayan Mukherjee, Statistical methods for computational biology
  5. Kevin Murphy, Machine Learning: A Probabilistic Perspective
  6. Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
  7. Joseph Felsenstein, Inferring Phylogenies

Grading

Course grade is based on a take-home midterm (30%), a final project (50%), and biweekly homeworks (20%). The project can be either a reanalysis of the data in one of the case studies covered during the semester or a project of interest to the student (rotation projects are great). Homeworks are due to me exactly two weeks after they are handed out at the beginning of class. Late homeworks will not be accepted. Students may (and should) collaboratively discuss the homework assignments; however, I expect each student to program and write up their own homework solutions. Please write the names of the students you discussed the homework assignment with at the top of your solutions.


Previous final projects

Exploring Genetic Pleiotropy in Complex Disease through Linkage Analysis
Dog Population Splits and Mixtures from Genome-wide Allele Frequency Data
Parameter inference for a differential equations model of the NKCC2 cotransporter
Histone Occupancy and Gene Expression
Estimating the Effect of Single Nucleotide Variation on Transcription Factor Binding Affinity
Dealing With Censorship in Animal Models Involving Large Biological Datasets
Partial Factor Regression Model in a Genetics Context
Determining Critical Features for Protein Crystallization using Regression
Identification of cofactors of NPR1 by exploratory factor analysis of public microarrays.
Classifying and Clustering Vegetation in Belize Rainforests using Support Vector Machine
Differential expression analysis for RNA-Seq of single olfactory sensory neurons
Modeling Shear Stress in Schlemm’s Canal
Multi-Model Gene Expression Data Generation Framework with Linear Regression and Mixed-Effect Models
Dynamics of correlated sets of reactions in metabolic networks
Application of Bayesian Sparse Latent Factor Models in Metabolomic Profiling of Peripheral Blood

Note: The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. Presentations will be on April 13th and 17th; the reports will be due on May 1.


This syllabus is tentative, and will almost surely be superseded. Reload your browser for the current version.

Syllabus

  1. (Jan 12) Modeling biological phenomena:

  2. (Jan 17) Inference of population structure:

  3. (Jan 19) eQTL mapping:

  4. (Jan 24, 26) Multiple hypothesis testing:

  5. (Jan 31, Feb 2) Markov chain Monte Carlo:

  6. (Feb 7) Epistasis and nonlinear regression:

  7. (Feb 9, 14) Linear mixed models, Quantitative genetics, and Statistical genetics:

  8. (Feb 16, 21) Motif finding, Mixture models, EM:

  9. (Feb 23, 28) Hidden Markov models and gene finding:

  10. (Mar 2, 7) Reconstructing population histories and coalescent models:

  11. (Mar 9) Midterm review:

  12. (Mar 21, 23) Guest lectures by Justin Silverman:

  13. (Mar 28, 30) Gene networks, Path analysis, Graphical models:

  14. (Apr 4) Analysis of Hi-C data:

  15. (April 6, 11) TBD:

  16. (April 13, 17) Final project presentations: