STA613/CBB540: Statistical methods in computational biology: Spring 2017
Prof: Sayan Mukherjee, sayan@stat.duke.edu
OH: Wednesday 2:15-3:15pm, 112 Old Chem
TA: Justin Silverman
OH: Th 10-12, 211A Old Chem
Class: Tu/Thu 8:30-9:45am, 025 Old Chem
Description
This course is based on case studies of statistical approaches to problems in computational biology. We will learn about statistical modeling in computational biology by formulating biological questions and repeating the following steps:
- formalize the question as a probabilistic model (typically via a likelihood);
- clarify the interpretation of model parameters and the model assumptions;
- develop methods for parameter estimation;
- quantify uncertainty in parameter estimation;
- interpret the parameters to address the biological question.
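As a minimal illustration of the steps above (not taken from the course materials), consider estimating the frequency of an allele from counts. The sketch is in Python for brevity, although the syllabus points to R. The binomial model, the Wald interval, and the function names are all assumptions chosen for this example.

```python
import math


def log_likelihood(p, alt_count, total_alleles):
    """Step 1: binomial log-likelihood of allele frequency p.

    The additive constant log C(n, k) is dropped because it does not
    depend on p. Assumes alleles are sampled independently.
    """
    ref_count = total_alleles - alt_count
    return alt_count * math.log(p) + ref_count * math.log(1.0 - p)


def allele_frequency_mle(alt_count, total_alleles, z=1.96):
    """Steps 3 and 4: point estimate and approximate uncertainty.

    Returns (p_hat, standard_error, (lower, upper)), where the interval
    is a Wald interval clipped to [0, 1]. z=1.96 gives roughly 95%
    coverage when the sample is large and p is not near 0 or 1.
    """
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("need 0 <= alt_count <= total_alleles and total_alleles > 0")
    # Setting the derivative of the log-likelihood to zero gives k / n.
    p_hat = alt_count / total_alleles
    # Standard error from the inverse Fisher information, p(1 - p) / n.
    se = math.sqrt(p_hat * (1.0 - p_hat) / total_alleles)
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return p_hat, se, (lower, upper)


if __name__ == "__main__":
    # Hypothetical data: 30 copies of the alternate allele among 100 sampled.
    p_hat, se, interval = allele_frequency_mle(30, 100)
    print(p_hat, se, interval)
```

Steps 2 and 5 are the parts the code cannot do: here p is the population frequency of the alternate allele only if the sampled alleles are independent draws from that population, and the interval is what one would use to argue whether the frequency differs from some reference value.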
Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.
References for R will also be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and tutorials on R are widely available. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.
We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:
Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
Ewens and Grant, Statistical Methods in Bioinformatics
Cristianini and Hahn, Introduction to Computational Genomics
Sayan Mukherjee, Statistical Methods for Computational Biology
Kevin Murphy, Machine Learning: A Probabilistic Perspective
Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
Joseph Felsenstein, Inferring Phylogenies
Grading
Course grade is based on a take-home midterm (30%), a final project (50%), and biweekly homeworks (20%). The project can be either a reanalysis of the data in one of the case studies covered during the semester or a project of interest to the student (rotation projects are great). Homeworks are due to me exactly two weeks after they are handed out at the beginning of class. Late homeworks will not be accepted. Students may (and should) collaboratively discuss the homework assignments; however, I expect each student to program and write up their own homework solutions. Please write the names of the students you discussed the homework assignment with at the top of your solutions.
Previous final projects
Exploring Genetic Pleiotropy in Complex Disease through Linkage Analysis
Dog Population Splits and Mixtures from Genome-wide Allele Frequency Data
Parameter inference for a differential equations model of the NKCC2 cotransporter
Histone Occupancy and Gene Expression
Estimating the Effect of Single Nucleotide Variation on Transcription Factor Binding Affinity
Dealing With Censorship in Animal Models Involving Large Biological Datasets
Partial Factor Regression Model in a Genetics Context
Determining Critical Features for Protein Crystallization using Regression
Identification of cofactors of NPR1 by exploratory factor analysis of public microarrays
Classifying and Clustering Vegetation in Belize Rainforests using Support Vector Machine
Differential expression analysis for RNA-Seq of single olfactory sensory neurons
Modeling Shear Stress in Schlemm’s Canal
Multi-Model Gene Expression Data Generation Framework with Linear Regression and Mixed-Effect Models
Dynamics of correlated sets of reactions in metabolic networks
Application of Bayesian Sparse Latent Factor Models in Metabolomic Profiling of Peripheral Blood
Note: The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. Presentations will be on April 13th and 17th; the reports will be due on May 1.
This syllabus is tentative, and will almost surely be superseded. Reload your browser for the current version.
- (Jan 12) Modeling biological phenomena:
- (Jan 17) Inference of population structure:
- (Jan 19) eQTL mapping:
- (Jan 24, 26) Multiple hypothesis testing:
- (Jan 31, Feb 2) Markov chain Monte Carlo:
- (Feb 7) Epistasis and nonlinear regression:
- (Feb 9, 14) Linear mixed models, Quantitative genetics, and Statistical genetics:
- (Feb 16, 21) Motif finding, Mixture models, EM:
- (Feb 23, 28) Hidden Markov models and gene finding:
- (Mar 2, 7) Reconstructing population histories and coalescent models:
- (Mar 9) Midterm review:
- Take-home midterm due Mar 28:
- (Mar 21, 23) Guest lectures Justin Silverman:
- (Mar 28, 30) Gene networks, Path analysis, Graphical models:
- (Apr 4) Analysis of Hi-C data
- (April 6, 11) TBD:
- (April 13, 17) Final project presentations: