STA 521: Predictive Modeling

Summary

Predictive modeling and applied machine learning methods are increasingly important tools in both industry and academia. We will start by understanding the basics of data science: what it is and why it is important for modern data. We will also review important facts from matrix algebra and learn how to build simple predictive models. We will then learn tools that will aid us in predictive modeling and data mining (data science), such as reproducible research through Markdown, RStudio, and version control (git). You will be responsible for learning these. Then we will delve into unsupervised and supervised learning, where the data may or may not be normally distributed. The data are often high-dimensional in the covariate or parameter space, calling for dimension reduction. We will explore a range of approaches, including factor analysis, principal component analysis, and shrinkage methods, and then move on to data mining techniques such as classification and clustering methods. Time permitting, we will cover some special topics.

Readings are posted at the top of each slide.

Time and Location

T/Th: 8:30--9:45 AM, Labs (Mondays): 11:45 -- 1:00 PM, 071 Perkins

Course Staff

Assistant Professor of Statistical Science

Rebecca C. Steorts
Old Chemistry
beka [At] stat [dot] duke [dot] edu
Office: TBD
Office hours: Tuesday and Wednesday 12--1 PM.

Head TA

Abbas Zaidi, PhD Student
abbas [dot] zaidi [AT] duke [dot] edu
Office hours: Thursday, 10-11 am, Old Chem 211A

TA

Yikun (Joey) Zhou, MS Student
yikun[dot]zhou[at]duke.edu
Office hours: M,Tu,W, 2-3 pm, Old Chem 211A

Prerequisites

Students are expected to be very familiar with R and to have learned LaTeX by the end of the course. All reports, exams, etc. should be submitted as PDFs produced with LaTeX. Please see Prof. Steorts if you are unsure whether you meet the requirements.

Course Grades and Workload

Homework assignments will be announced in class (along with the due date). Each assignment must be turned in at the beginning of the lecture on the due date. Late homework will not be accepted.

All homework assignments and take-home exams must be submitted to the Sakai website. You must follow the submission instructions and the format specified on the homework or lab. Any failure to do so will result in deductions or a grade of 0. Submissions via email to the TAs or instructor will not be accepted for credit. See below for more information about LaTeX.

Makeup exams must be approved before the time of the exam and will be given only in case of medical or family emergencies (which must be appropriately documented). All work turned in for a grade must be entirely your own. This particularly relates to homework. You are encouraged to talk to each other or to the instructor/TA about homework problems; however, the write-up and solution must be entirely your own work.

Furthermore, you are responsible for everything from lecture. Do not depend on the course web page for announcements regarding due dates for homework, changes in schedules, etc. Such announcements will be made in class. Homework assignments will be uploaded to the course webpage along with course readings (please check here frequently for updates).

There is a Google Group course discussion page called dataMining521. Please direct questions about homework and other matters to that page. Otherwise, you can email the instructors (TAs and professor). Note that we are more likely to respond to questions on the Google Group than to email, and your classmates may respond too, so that is a good place to start.

Most questions should be directed to the Google group and Discussion Forum for the course. The webpage can be found at Multivariate Google Groups. Posting via email is done through: datamining521 [at] googlegroups [dot] com.

Cell phones should be turned off (or set to silent). Laptops are allowed when we are doing applied examples or labs in class, but otherwise should not be out or in use.

Please see the syllabus for other course policies, such as those on cheating, missed classes, etc.