This is a master's-level introductory course on statistical learning methods for prediction and inference. It introduces students to the concepts and techniques of modern regression and predictive modeling, blending theory and application through a range of examples. Topics include exploratory data analysis and visualization, linear and generalized linear models, model selection, and penalized estimation and shrinkage methods, including the lasso, ridge regression, and Bayesian regression, as well as decision trees and ensemble methods. Other advanced topics, such as robust estimation, smoothing splines, support vector machines, and neural networks, will be discussed briefly. The R programming language and its applications are used throughout.
Corequisite: Statistical Science 323D or 523L, and Statistical Science 360, 601, or 602L. All students should be comfortable with linear/matrix algebra and mathematical statistics at the level of STA 611 and should be familiar with the R programming language and linear regression. Students should also be familiar with Bayesian statistics, either by having taken an introduction to Bayesian inference (STA 360, 601, or 602) or by being currently co-registered in one of those courses. Please see me if you have questions about the prerequisites or background.
Acknowledgement: This course webpage contains materials such as lecture slides, homework and datasets that were developed or adapted by Merlise Clyde, Bin Yu, Raaz Dwivedi and Ryan Tibshirani.
Lecture Location: Old Chem 116
Lecture Time: Tuesday and Thursday 3:30pm - 4:45pm
Lab1 Location: LSRC A155
Lab1 Time: Monday 3:30pm - 4:45pm
Lab2 Location: Social Sciences 124
Lab2 Time: Monday 5:15pm - 6:30pm
Email: yuansi.chen at duke.edu
Office hours: by appointment, or before and after lecture (2:30 - 3:00pm and 4:45 - 5:15pm in Old Chem 223B)
Jose Pliego San Martin
Email: jose.pliego.san.martin at duke.edu
Leads Lab1
Zoom office hour: Tu 11:00 - 12:00 (see Sakai for the link)
In-person office hour: Mo 4:45 - 5:45 in LSRC A155
Aihua Li
Email: aihua.li at duke.edu
Leads Lab2
In-person office hours: Fr 11:30 - 1:30 in Old Chem 025
Nancy Huang
Email: ranxin.huang at duke.edu
Leads projects: you may ask her project-related questions
In-person office hour: Th 10:00 - 11:00 in Old Chem 025
Zoom office hour: Th 12:30 - 1:30 (see Sakai for the link)
Ed discussion active hours: Mo 4:00 - 5:00
Main text:
An Introduction to Statistical Learning: with Applications in R (Second Edition) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. ISL is our main reference; we intend to cover most of its content in this course. The book is freely available on the authors’ website.
The Elements of Statistical Learning (12th printing) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. ESL is more advanced than ISL. It is also freely available on the authors’ website.
Optional:
Applied Predictive Modeling by Kuhn and Johnson, which covers the caret package for R
Applied Linear Regression by Sanford Weisberg. In-depth coverage of linear regression and extensions, model checking, and more. The associated Computing Primer for Applied Linear Regression Using R is useful for the companion R package
A First Course in Bayesian Statistical Methods by Peter Hoff.
Tentative; please refresh for the latest version
Attend lectures
Complete required reading in textbook
Complete bi-weekly homework (posted on Sakai)
Complete two projects (posted on Sakai)
Midterm (in class on Oct 6)
Final (2:00 - 5:00pm on Dec 15 in Old Chem 116)
Homework (25%) + Midterm (20%) + Final (20%) + Projects (30%) + Participation (5%). Participation includes lecture attendance, lab attendance, and discussion participation on Sakai. Grades may be curved at the end of the semester. Cumulative numerical averages of 90 - 100 are guaranteed at least an A-, 80 - 89 at least a B-, and 70 - 79 at least a C-; however, the exact ranges for letter grades will be determined after the final exam. The more evidence there is that the class has mastered the material, the more generous the curve will be.
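As a purely hypothetical illustration of how the weights combine (the component scores below are made up and imply nothing about actual grades), the cumulative average could be computed in R as:

    # Hypothetical component scores on a 0-100 scale; weights from the breakdown above
    scores  <- c(homework = 94, midterm = 85, final = 88, projects = 92, participation = 100)
    weights <- c(homework = 0.25, midterm = 0.20, final = 0.20, projects = 0.30, participation = 0.05)
    sum(scores * weights)  # 90.7, which would guarantee at least an A-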
An Introduction to R (pdf version), the most up-to-date official R intro
Swirl, an interactive tutorial that runs in the R console
R for Data Science, teaches you how to do data science with R: you’ll learn how to get your data into R, put it into the most useful structure, transform it, visualise it, and model it
ggplot2: elegant graphics for data analysis, covers the basics of ggplot2 for plotting
Duke StatSci Learning R Tutorials, tutorials designed for Duke Stat students.
The caret package, a well-documented wrapper package to streamline the process for creating predictive models; a minimal usage sketch appears after this list
Tidymodels, a collection of packages for modeling and machine learning using tidyverse principles, possible alternative to caret
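To give a flavor of the caret workflow mentioned above, here is a minimal sketch (not course code): it assumes the caret and glmnet packages are installed and uses the built-in mtcars data purely for illustration.

    # Fit a cross-validated penalized regression model with caret (illustrative only)
    library(caret)
    set.seed(123)
    ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
    fit  <- train(mpg ~ ., data = mtcars,
                  method = "glmnet",                  # lasso/ridge-type fits via glmnet
                  trControl = ctrl)
    fit$bestTune                                      # tuning values selected by CV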
A short summary of math for machine learning written by Garrett Thomas
Stanford's machine learning class provides additional reviews of linear algebra and probability theory
Matrix Algebra from a Statistician’s Perspective by David A. Harville.
The Matrix Cookbook by Kaare B. Petersen and Michael S. Pedersen.
Linear Algebra Done Right by Sheldon Axler.
STA 523 Statistical Programming, helpful for learning R, Git, GitHub, and computing at Duke
Duke StatSci Learning R Tutorials, a great set of online R tutorials
Linear Regression and Modeling, online course by Mine Cetinkaya-Rundel