STA613/CBB540: Statistical methods in computational biology: Spring 2018

Prof: Sayan Mukherjee. OH: Wednesday 2:15-3:15pm, 112 Old Chem
Class: Tuesday 11:45am-2:40pm, 025 Old Chem


This course is based on case studies of statistical approaches to problems in computational biology. We will learn about statistical modeling in computational biology by formulating biological questions and repeating the following steps:
  1. formalize the question as a probabilistic model (typically via a likelihood);
  2. clarify the interpretation of model parameters and the model assumptions;
  3. develop methods for parameter estimation;
  4. quantify uncertainty in parameter estimation;
  5. interpret the parameters to address the biological question.
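As a toy illustration of these five steps (not one of the course case studies), suppose we ask how common an allele is in a population. The sketch below assumes a binomial model for the number of allele copies observed, uses the closed-form maximum likelihood estimate, and quantifies uncertainty with a simple Wald interval. The counts, function names, and the choice of Python are all assumptions made for this example only.

```python
import math

def binomial_log_likelihood(p, k, n):
    """Step 1: log-likelihood of k allele copies among n sampled
    chromosomes under a binomial model (additive constant dropped)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def fit_allele_frequency(k, n, z=1.96):
    """Steps 3-4: maximum likelihood estimate of the allele frequency p
    and an approximate 95% Wald interval.

    Step 2 (assumptions): chromosomes are sampled independently and
    each carries the allele with the same probability p.
    """
    p_hat = k / n                              # closed-form MLE
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # asymptotic standard error
    return p_hat, (p_hat - z * se, p_hat + z * se)

# Made-up data: 30 copies of the allele among 200 sampled chromosomes.
k, n = 30, 200
p_hat, (lower, upper) = fit_allele_frequency(k, n)

# Step 5: interpret the estimate in terms of the biological question.
print(f"estimated allele frequency: {p_hat:.3f}")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

The Wald interval is only one way to quantify uncertainty and can behave poorly when the count is close to 0 or n; a likelihood-ratio or Bayesian interval would follow the same five steps.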

Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.

A set of references for R will also be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and tutorials on R are widely available online. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.

We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:

  1. Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
  2. Ewens and Grant, Statistical Methods in Bioinformatics
  3. Cristianini and Hahn, Introduction to Computational Genomics
  4. Sayan Mukherjee, Statistical methods for computational biology
  5. Kevin Murphy, Machine Learning: a probabilistic perspective
  6. Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
  7. Joseph Felsenstein, Inferring phylogenies


Course grade is based on a take-home midterm (30%), a final project (50%), and biweekly homeworks (20%). The project can be either a reanalysis of the data in one of the case studies covered during the semester or a project of interest to the student (rotation projects are great). Homeworks are due to me exactly two weeks after they are handed out at the beginning of class. Late homeworks will not be accepted. Students may (and should) collaboratively discuss the homework assignments; however, I expect each student to program and write up their own homework solutions. Please write the names of the students you discussed the homework assignment with at the top of your solutions.

Previous final projects

Exploring Genetic Pleiotropy in Complex Disease through Linkage Analysis
Dog Population Splits and Mixtures from Genome-wide Allele Frequency Data
Parameter inference for a differential equations model of the NKCC2 cotransporter
Histone Occupancy and Gene Expression
Estimating the Effect of Single Nucleotide Variation on Transcription Factor Binding Affinity
Dealing With Censorship in Animal Models Involving Large Biological Datasets
Partial Factor Regression Model in a Genetics Context
Determining Critical Features for Protein Crystallization using Regression
Identification of cofactors of NPR1 by exploratory factor analysis of public microarrays.
Classifying and Clustering Vegetation in Belize Rainforests using Support Vector Machine
Differential expression analysis for RNA-Seq of single olfactory sensory neurons
Modeling Shear Stress in Schlemm’s Canal
Multi-Model Gene Expression Data Generation Framework with Linear Regression and Mixed-Effect Models
Dynamics of correlated sets of reactions in metabolic networks
Application of Bayesian Sparse Latent Factor Models in Metabolomic Profiling of Peripheral Blood

Note: The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. There will be a poster session on April 24th and the reports will be due on May 1.

This syllabus is tentative, and will almost surely be superseded. Reload your browser for the current version.



  1. (Jan 16) Modeling biological phenomena:

  2. (Jan 16) Inference of population structure:

  3. (Jan 23) Multiple hypothesis testing:

  4. (Jan 30) eQTL mapping:

  5. (Jan 30) Epistasis and nonlinear regression:

  6. (Feb 6) Markov chain Monte Carlo:

  7. (Feb 13) Linear mixed models, Quantitative genetics, and Statistical genetics:

  8. (Feb 20) Motif finding, Mixture models, EM:

  9. (Feb 27) Hidden Markov models and gene finding:

  10. (Mar 6) Reconstructing population histories and coalescent models:

  11. (Mar 20) Compositional data, time series models:

  12. (Mar 27) Gene networks, Path analysis, Graphical models:

  13. (Apr 3) Class cancelled due to machine learning day

  14. (Apr 10) New functional assays: single cell expression

  15. (Apr 17) New functional assays: 3D structure of the genome, also optimization