This is a master's-level introductory course on statistical learning methods for prediction and inference. It introduces students to the concepts and techniques of modern regression and predictive modeling, blending theory and application through a range of examples. Topics include exploratory data analysis and visualization; linear and generalized linear models; model selection; penalized estimation and shrinkage methods, including the lasso, ridge regression, and Bayesian regression; and decision trees and ensemble methods. Other advanced topics, such as robust estimation, smoothing splines, support vector machines, and neural networks, will be discussed briefly. The R programming language and its applications are used throughout.

Corequisites: Statistical Science 323D or 523L, and Statistical Science 360, 601, or 602L. All students should be comfortable with linear and matrix algebra and with mathematical statistics at the level of STA 611, and should be familiar with the R programming language and linear regression. Students should also be familiar with Bayesian statistics, either by having taken an introduction to Bayesian inference (STA 360/601/602) or by being currently co-registered in one of those courses. Please see me if you have questions about the prerequisites or background.

Acknowledgement: This course webpage contains materials such as lecture slides, homework, and datasets that were developed or adapted by Merlise Clyde, Bin Yu, Raaz Dwivedi, and Ryan Tibshirani.

Lecture Location: Old Chem 116

Lecture Time: Tuesday and Thursday 3:30pm - 4:45pm

Lab1 Location: LSRC A155

Lab1 Time: Monday 3:30pm - 4:45pm

Lab2 Location: Social Sciences 124

Lab2 Time: Monday 5:15pm - 6:30pm

Yuansi Chen

Email: yuansi.chen at duke.edu

Office hours: by appointment or before and after lecture (2:30 - 3:00, 4:45 - 5:15 in Old Chem 223B)

Jose Pliego San Martin

Email: jose.pliego.san.martin at duke.edu

Leads Lab1

Zoom office hour: Tu 11:00 - 12:00 (see Sakai for the link)

In-person office hour: Mo 4:45 - 5:45 in LSRC A155

Aihua Li

Email: aihua.li at duke.edu

Leads Lab2

In-person office hours: Fr 11:30 - 1:30 in Old Chem 025

Nancy Huang

Email: ranxin.huang at duke.edu

Leads projects: you may ask her project-related questions

In-person office hour: Th 10:00 - 11:00 in Old Chem 025

Zoom office hour: Th 12:30 - 1:30 (see Sakai for the link)

Ed discussion active hours: Mo 4:00 - 5:00

Main text:

An Introduction to Statistical Learning: with Applications in R (Second Edition) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. ISL is our main book reference; we intend to cover most of its content in this course. The book is freely available on the authors’ website.

The Elements of Statistical Learning (12th printing) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. ESL is more advanced than ISL. It is also freely available on the authors’ website.

Optional:

Applied Predictive Modeling by Kuhn and Johnson; covers the caret package for R.

Applied Linear Regression by Sanford Weisberg. In-depth coverage of linear regression and its extensions, model checking, and more. The associated Computing Primer for Applied Linear Regression Using R is useful for the companion R package.

A First Course in Bayesian Statistical Methods by Peter Hoff.

Tentative; please refresh for the latest version.

Attend lectures

Complete required reading in textbook

Complete bi-weekly homework (posted on Sakai)

Complete two projects (posted on Sakai)

Midterm (in class on Oct 6)

Final (2:00-5:00pm on 12/15 in Old Chem 116)

Homework (25%) + Midterm (20%) + Final (20%) + Projects (30%) + Participation (5%). Participation includes lecture attendance, lab attendance, and discussion participation on Sakai. Grades may be curved at the end of the semester. Cumulative numerical averages of 90 - 100 are guaranteed at least an A-, 80 - 89 at least a B-, and 70 - 79 at least a C-; however, the exact ranges for letter grades will be determined after the final exam. The more evidence there is that the class has mastered the material, the more generous the curve will be.
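As a quick sketch of how the weights above combine into a cumulative average (the component scores here are hypothetical, for illustration only):

```python
# Component weights from the syllabus; scores are hypothetical examples out of 100.
weights = {"Homework": 0.25, "Midterm": 0.20, "Final": 0.20,
           "Projects": 0.30, "Participation": 0.05}
scores = {"Homework": 92, "Midterm": 85, "Final": 88,
          "Projects": 90, "Participation": 100}

# Cumulative numerical average is the weight-by-score dot product.
course_average = sum(weights[k] * scores[k] for k in weights)
print(course_average)  # 89.6 -> guaranteed at least a B-
```

Note that an 89.6 falls just below the 90 cutoff, so this student is guaranteed at least a B-; whether it rounds up to an A- would depend on the final curve.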

An Introduction to R (pdf version), the most up-to-date official R intro

Swirl, interactive tutorial in R console

R for Data Science, teaches you how to do data science with R: you’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it, and model it

ggplot2: elegant graphics for data analysis, covers the basics of ggplot2 for plotting

Duke StatSci Learning R Tutorials, tutorials designed for Duke Stat students.

The caret package, a well-documented wrapper package to streamline the process for creating predictive models

Tidymodels, a collection of packages for modeling and machine learning using tidyverse principles, possible alternative to caret

A short summary of math for machine learning written by Garrett Thomas

Stanford's machine learning class provides additional reviews of linear algebra and probability theory

Matrix Algebra from a Statistician’s Perspective by David A. Harville.

The Matrix Cookbook by Kaare B. Petersen and Michael S. Pedersen.

Linear Algebra Done Right by Sheldon Axler.

STA 523 Statistical Programming, helpful for learning R, git, github and computing at Duke

Duke StatSci Learning R Tutorials, great online R tutorial

Linear Regression and Modeling, online course by Mine Cetinkaya-Rundel