STA613/CBB540, Spring 2018

STA613/CBB540: Statistical methods in computational biology: Spring 2018

Prof:	Sayan Mukherjee	sayan@stat.duke.edu	OH: Wednesday 2:15-3:15pm, 112 Old Chem

Class:	Tuesday 11:45am-2:40pm		025 Old Chem

Description

This course is based on case studies of statistical approaches to problems in computational biology. We will learn about statistical modeling in computational biology by formulating biological questions and repeating the following steps:

formalize the question as a probabilistic model (typically via a likelihood);
clarify the interpretation of model parameters and the model assumptions;
develop methods for parameter estimation;
quantify uncertainty in parameter estimation;
interpret the parameters to address the biological question.
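
As a rough illustration of the steps above, here is a minimal Python sketch for one toy question: how common is an allele in a population? It assumes a binomial likelihood for the allele count, uses the closed-form maximum likelihood estimate, and quantifies uncertainty with a Wald (normal approximation) interval. The data (30 copies among 100 sampled chromosomes) and all function names are hypothetical and are not taken from the course materials.

```python
import math

def log_likelihood(p, k, n):
    """Binomial log-likelihood of allele frequency p, given k copies of the
    allele among n sampled chromosomes (the constant binomial term is dropped)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def mle(k, n):
    """Closed-form maximum likelihood estimate of p under the binomial model."""
    return k / n

def wald_interval(k, n, z=1.96):
    """Approximate 95% interval from the normal approximation to the MLE."""
    p_hat = mle(k, n)
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error from Fisher information
    return p_hat - z * se, p_hat + z * se

# Hypothetical data: 30 copies of the allele among 100 chromosomes.
k, n = 30, 100

p_hat = mle(k, n)

# Sanity check: a grid search over p should agree with the closed form.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, k, n))

lower, upper = wald_interval(k, n)
print(f"MLE: {p_hat:.3f} (grid search: {p_grid:.3f})")
print(f"Approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

The last step, interpretation, is where the assumptions matter: the interval is only meaningful if the sampled chromosomes are independent draws from the population of interest, which is exactly the kind of assumption the second step asks you to make explicit.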

Statistics at the level of STA611 (Introduction to Statistical Methods) is expected, along with knowledge of linear algebra and multivariate calculus.

References for R will also be useful. First, you can download R from the CRAN website. Many resources, such as RStudio, can help with the programming interface, and R tutorials are widely available online. If you are getting bored with the standard graphics package, I really like ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far for beginners.

We will have daily readings for the course, but there is no formal text for this class. However, some texts and notes that may be useful include:

Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
Ewens and Grant, Statistical Methods in Bioinformatics
Cristianini and Hahn, Introduction to Computational Genomics
Sayan Mukherjee, Statistical methods for computational biology
Kevin Murphy, Machine Learning: A Probabilistic Perspective
Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis
Joseph Felsenstein, Inferring Phylogenies

Grading

The course grade is based on a take-home midterm (30%), a final project (50%), and biweekly homeworks (20%). The project can be either a reanalysis of the data in one of the case studies covered during the semester or a project of interest to the student (rotation projects are great). Homeworks are due to me exactly two weeks after they are handed out at the beginning of class. Late homeworks will not be accepted. Students may (and should) discuss the homework assignments collaboratively; however, I expect each student to program and write up their own homework solutions. Please write the names of the students with whom you discussed the homework assignment at the top of your solutions.

Previous final projects

Exploring Genetic Pleiotropy in Complex Disease through Linkage Analysis
Dog Population Splits and Mixtures from Genome-wide Allele Frequency Data
Parameter inference for a differential equations model of the NKCC2 cotransporter
Histone Occupancy and Gene Expression
Estimating the Effect of Single Nucleotide Variation on Transcription Factor Binding Affinity
Dealing With Censorship in Animal Models Involving Large Biological Datasets
Partial Factor Regression Model in a Genetics Context
Determining Critical Features for Protein Crystallization using Regression
Identification of cofactors of NPR1 by exploratory factor analysis of public microarrays
Classifying and Clustering Vegetation in Belize Rainforests using Support Vector Machine
Differential expression analysis for RNA-Seq of single olfactory sensory neurons
Modeling Shear Stress in Schlemm’s Canal
Multi-Model Gene Expression Data Generation Framework with Linear Regression and Mixed-Effect Models
Dynamics of correlated sets of reactions in metabolic networks
Application of Bayesian Sparse Latent Factor Models in Metabolomic Profiling of Peripheral Blood

Note: Use the final project TeX template and final project style file to prepare your final project report. Please follow the instructions and let me know if you have questions. There will be a poster session on April 24, and the reports will be due on May 1.

This syllabus is tentative, and will almost surely be superseded. Reload your browser for the current version.