STA 721: Linear Models

Fall 2019

Linear Models

Professor Merlise Clyde

Department of Statistical Science Duke University

Course Description

This course introduces students to Gaussian linear models and extensions for model building, including exploratory data analysis techniques and model checking, variable transformations and selection, parameter estimation and interpretation, prediction, hierarchical models, model selection and Bayesian model averaging. The concepts of linear models will be covered from Bayesian and classical viewpoints. Topics in Markov chain Monte Carlo simulation will be introduced as required. Co-requisite STA 601.

Instructional Team

Name	email	location	office hours
Dr. Merlise Clyde	clyde@duke.edu	223E Old Chem	Wed 3-4 & Fri 2:15-3:15 or by appointment
Vittorio Orlandi (TA)	vittorio.orlandi@duke.edu	TBA	TBA
Pritam Dey (TA)	pritam.dey@duke.edu	TBA	TBA

Books & Materials

Textbook	Ordering Information
	Plane Answers to Complex Questions Ronald Christensen (2011) 4th Edition Springer-Verlag, NY. The textbook is freely as an eBook thru the Duke Library. You’re welcomed to read on screen or print it out. If you prefer a paperback version you can buy it at the cost of printing from Springer or purchase a hardback version at the Bookstore.
	Data Analysis Using Regression and Multilevel/Hierarchical Models Gelman & Hill (2009) ISBN-13: 978-0521686891
	Bayesian and Frequentist Regression Methods Wakefield (2013). First edition Springer-Verlag ISBN 978-1-4419-0924-4. The textbook is freely as an eBook thru the Duke Library. You’re welcomed to read on screen or print out chapters.
	A First Course in Bayesian Statistical Methods, Hoff, P. L. (2009), Springer. ISBN 978-0-387-92299-7 Review chapters via eBook through the Duke Library. Used in co-requisite course STA 601/STA 360

Computation

We will use R as a programming language for computation and data analysis, with the use existing packages written in R to support the course. All students will have access to RStudio/R on a server within the department and support during the labs. You are free to all run RStudio/R on your personal laptop or desktop. See the Resources page for books and other resources for learning R. You should bring a calculator to exams. There is no restrictions on the type of calculator (but not on a mobile device).

Discussion Forums

We will be using Piazza for class discussion. The system is highly catered to getting you help fast and efficiently from classmates, the TAs, and myself. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza (peer answers earn participation points!). If you have any problems or feedback for the developers, email team@piazza.com.

Find our class page at: Piazza or through the link on Sakai. Click to signup

Green Classroom

This course has achieved Duke’s Green Classroom Certification. The certification indicates that the faculty member teaching this course has taken significant steps to green the delivery of this course. Your faculty member has completed a checklist indicating their common practices in areas of this course that have an environmental impact, such as paper and energy consumption. Some common practices implemented by faculty to reduce the environmental impact of their course include allowing electronic submission of assignments, providing online readings and turning off lights and electronics in the classroom when they are not in use. The eco-friendly aspects of course delivery may vary by faculty, by course and throughout the semester. Learn more at http://sustainability.duke.edu/action/certifications/classroom/index.php.

Current Lectures

November 2019

Robust Estimation"

In this lecture we look at robust regression methods to automatically account for potential outliers.

Project Slides Source Document Readings

Lectures

Under construction!

Introduction to Linear Models

Overview of linear models higlighting what’s to come. Click onlinks for additional information and supporting material

Code Dataset Project Slides Source Document Readings

Maximum Likelihood Estimation

Maximum likelihood estimation in linear models via projections.

Merlise Clyde

Project Slides Source Document Readings

Projections

In this lecture we exam the geometric properties of OLS and the role of projections. In particular we will find expectations of OLS …

Project Slides Source Document Readings

Multivariate Normal Theory

In this lecture we will review/present distribution theory related to the Multivariate normal distribution, in particular, linear …

Project Slides Source Document Readings

Sampling Distributions of MLEs

In this lecture we will review/present distribution theory related to the sampling distribution of the MLEs, in particular, Student-t …

Code Project Slides Source Document Readings

Prediction and Properties of MLEs

In this lecture we will cover prediction and optimal estimation/prediction and what quantities can be estimated or predicted.

Project Slides Source Document Readings

Identifiability and Gauss-Markov

In this lecture we will cover the Gauss-Markov theorem that establishes that out of the class of all linear unbiased estimators that …

Project Slides Source Document Readings

Introduction to Bayesian Linear Models

In this lecture we will introduce Bayesian estimation for linear models using the Normal-Gamma conjugate prior.

Project Slides Source Document Readings

Choice of Conjugate Prior Distributions

In this lecture we will go into more details about the Normal-Gamma conjugate prior and limiting cases in linear models.

Project Slides Source Document Readings

Non-Informative Priors and Default Priors

In this lecture we will go into more details about the Normal-Gamma conjugate prior and limiting cases in linear models, including …

Project Slides Source Document Readings

Zellner's g-Priors

In this lecture we will go into more details about the Normal-Gamma conjugate prior looking at a special case of the g-prior.

Code Project Slides Source Document Readings

Cauchy Priors: Mixtures of Normals & MCMC

In this lecture we will go show how Cauchy priors can be derived as a mixture of normal distributions and introduce MCMC sampling for …

Project Slides Source Document Readings

Properties of Bayesian Estimators

In this lecture we will illustrate MCMC sampling with the Cauchy prior as mixtures of g-priors and look at properties of estimators. To …

Project Slides Source Document Readings

Shrinkage Estimation and Ridge Regression

In this lecture we will look at properties of estimators. To address problems for estimation with nearly singular matrices, we will …

Project Slides Source Document Readings

Bayesian Ridge Regression and Shrinkage Estimation

In this lecture we look at ridge regression from a Bayesian perspective and discuss choice of priors and inference via MCMC.

Project Slides Source Document Readings

Shrinkage Priors and Selection

In this lecture we look at model comparison using ANOVA and sequential F tests.

Code Project Slides Source Document Readings

Shrinkage Priors and Selection

In this lecture we look at properties on priors for shrinkage estimators to have desirable properties.

Code Project Slides Source Document Readings

LASSO and Baysian LASSO Regression

In this lecture we look at shrinkage and selection estimators based on LASSO regression from a penalized likelihood approach and a …

Project Slides Source Document Readings

Model Choice

In this lecture we look at model selection from a Bayesian perspective.

Code Project Slides Source Document Readings

BMA

In this lecture we look at Bayesian model averaging and choice of prior distributions with a focus on g-priors or mixtures of g-priors.

Project Slides Source Document Readings

Desiderata for BMA/BVS

In this lecture we look at desirable features that priors for Bayesian model averaging or varialbe selection which leads us to mixtures …

Dataset Project Slides Source Document Readings

MCMC in BMA/BVS and Inference

In this lecture show how MCMC can be used for BMA/BVS and challenges within. Using the output we discuss variaous estimators for …

Dataset Project Slides Source Document Readings

Residuals and Checking Assumptions

In this lecture we look at residual disgnostics and methods to identify influential points and outliers.

Project Slides Source Document Readings

Robust Estimation"

In this lecture we look at robust regression methods to automatically account for potential outliers.

Project Slides Source Document Readings

Homework

Homework Assignment 6

Oct 28, 2019 12:00 AM

Homework Assignment 5

Oct 11, 2019 12:00 AM

Homework Assignment 4

Sep 23, 2019 12:00 AM

Homework Assignment 4

Sep 23, 2019 12:00 AM

Homework Assignment 3

Sep 15, 2019 12:00 AM

Homework Assignment 2

Sep 7, 2019 12:00 AM

Homework Assignment 1

Aug 28, 2019 12:00 AM

Project

Data Analysis Project

What do Barbie dolls, food wrap, edamame, and spermicides have in common? And what do they have to do with low sperm counts, precocious puberty, and breast cancer? “Everything” say those who support the notion that hormone mimics are disrupting everything from fish gender to human fertility. “Nothing” counter others who regard the connection as trumped up, alarmist chemophobia. The controversy swirls around the significance of a number of substances that behave like estrogens and appear to be practically everywhere–from plastic toys to topical sunscreens. Read more

Repo Data

Calendar

Tentative outline; please refresh for the latest version. Each Lecture/HW has additional details, including reading assignments, code and data.

Week	Date	Topic	HW
1	08-26-2019	Introduction
	08-28-2019	MLE
	08-30-2019	Lab 1: Intro to Weaving Latex and R
2	09-02-2019	Projections & Expectations	HW1
	09-04-2019	Normal Theory
	09-06-2019	Lab 2: Introduction to GitHub and Rstudio	See invitation sent from Sakai
3	09-09-2019	Sampling Distributions	HW2
	09-11-2019	Prediction
	09-13-2019	Lab 3:
4	09-16-2019	Gauss-Markov and Prediction	HW3
	09-18-2019	Bayes Estimation in Linear Models
	09-20-2019	Lab 4: Writing functions, coding style, and Q&A
5	09-23-2019	[Conjugate Priors in Linear Models]	HW4
	09-25-2019	Non-informative Priors
	09-27-2019	Lab 5: Q&A
6	09-30-2019	G-Priors and Prior Choices
	10-02-2019	Review
	10-04-2019	Midterm
7	10-07-2019	Fall Break
	10-09-2019	Cauchy Priors: Mixtures & MCMC
	10-11-2019	Lab 6: JAGS	HW5
8	10-14-2019	Bayes Estimation
	10-14-2019	Ridge Regression
	10-16-2019	Bayesian Ridge Regression
9	10-21-2019	Lasso and Bayesian Lasso Regression	HW6
	10-23-2019	Shrinkage Priors and Selection
10	10-28-2019	Testing and Model Comparison	HW7 (Nott & Kohn code)
	10-30-2019	Testing and Model Comparison continued
11	11-04-2019	Model Choice
	11-06-2019	BMA	HW8
12	11-11-2019	Criteria for Priors for use in BMA/BVS
	11-13-2019	MCMC in BMA/BVS and inference
13	11-18-2019	Factors and Hierarchical Models
	11-20-2019	Residuals and Checking
		Transformations & Normality
14	11-25-2019	Robustness	TakeHome Data Analysis
	11-27-2019	Thanksgiving Break
15	12-02-2019	Graduate Reading Period
	12-04-2019	Graduate Reading Period
16	12-09-2019	Graduate Reading Period
	12-12-2019	Final Exam 2-5	Link Classroom 5

Resources

Computing & Other Resources

R resources:

The R Project for Statistical Computing
R Downloads Duke Mirror with Linux, Mac, Windows
Rstudio Easy user interface for R/R Markdown and more for Linux, Mac and Windows Download
An Introduction to R (pdf) (html version) the most up-to-date official R intro
CodeSchool R Tutorial: Another brief R tutorial, in case you would like to have another avenue by which to get introduced to R.
twotorials: how to do stuff in r. two minutes or less.

CRAN Comprehensive R Archive Network

R CRAN SITE downloads, documentation, and more
FAQs please read before posting a question! at CRAN or search using google
Packages contributed packages for R
R reference card (pdf) cheatsheet for useful functions
Writing R extensions (pdf) Explanations of writing packages and calling C from R

R Books

R in a Nutshell (Joseph Adler) Buy it @ Amazon
ggplot2: Elegant Graphics for Data Analysis (Use R!) Buy it at Amazon
Advanced R (Chapman & Hall/CRC The R Series) Buy it @ Amazon

JAGS

JAGS is Just Another Gibbs Sampler. It is a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation not wholly unlike BUGS. The name is a misnomer as JAGS implements more than just Gibbs Samplers. JAGS was written with three aims in mind:

To have a cross-platform engine for the BUGS language
To be extensible, allowing users to write their own functions, distributions and samplers.
To be a plaftorm for experimentation with ideas in Bayesian modelling

Resources for JAGS:

JAGS website on Sourceforge for Downloads, Manuals, etc
JAGS News Martin Plummer’s
site for all things JAGS
rjags on CRAN

Linear/Matrix Algebra

Gilbert Strang’s Online Course at MIT
- Video Lectures
- Introduction to Linear Algebra. Strang, Gilbert. 4th ed. Wellesley, MA: Wellesley-Cambridge Press, 2009. ISBN: 9780980232714. Buy @ Amazon
Matrix Algebra from a Statistician’s Perspective. Harville, David A. eBook in Duke Library
Plane Answers to Complex Questions, Christensen, Ronald. eBook in Duke Library

Emacs

Using emacs as an editor for R, C/C++, LaTeX provides a great environment for editing, compiling and debugging - you can even use it as a shell!

emacs reference card
Emacs Speaks Statistics is an add-on package for GNU Emacs and XEmacs. It is designed to support editing of scripts and interaction with various statistical analysis programs such as R, S-Plus, SAS, Stata and OpenBUGS/JAGS. Although all users of these statistical analysis programs are welcome to apply ESS, advanced users or professionals who regularly work with text-based statistical analysis scripts, with various statistical languages/programs, or with different operating systems might benefit from it the most.
- ESS reference card

Syllabus

Course expectations, outline, grading policy, and more

Course goals & objectives

This course introduces students to linear models and its extensions for model building, including exploratory data analysis techniques, variable transformations and selection, parameter estimation and interpretation, prediction, hierarchical models, model selection and Bayesian model averaging. The concepts of linear models will be covered from Bayesian and classical viewpoints. Topics in Markov chain Monte Carlo simulation will be introduced as required, however it is expected that students have either taken STA 601 or are co-registered.

All students should be extremely comfortable with linear/matrix algebra and mathematical statistics at the level of STA 611 or equivalent; Statistical Inference - Casella and Berger is an excellent resource in case you need to review any mathematical statistics. If you need to review linear algebra, please explore material under Resources and links - Gilbert Strang’s online course is highly recommended.

The course goals are as follows:

Understand the different philosophical approaches to statistical analyses (Bayesian and frequentists)
Build a solid foundation for the probability theory and inference for Gaussian linear models and extensions.
Build appropriate statistical models for data, perform data analysis using appropriate software, and communicate results without use of statistical jargon.
Become familiar with reproducible research using github, RStudio, knitr and $\LaTeX$ to produce technical, literate data analyses.

Course Outline

Course topics will be drawn (but subject to change) from

Motivation for Studying Linear Models as Foundation
Random Vectors and Matrices
Multivariate Normal Distribution Theory
Conditional Normal Distribution Theory
Linear Models via Coordinate free representations (examples)
Maximum Likelihood Estimation & Projections
Interval Estimation: Distribution of Quadratic Forms
Gauss-Markov Theorem & Optimality of OLS
Formulation of Bayesian Inference
Subjective and Default Priors
Related Shrinkage Methods and Penalized Likelihoods (Ridge regression, lasso, horseshoe etc)
Model Selection (comparison of classical and Bayesian approaches)
Bayes Factors
Bayesian Model Averaging
Model Checking: Residual Analysis, Added-Variable Plots, Cooks-Distance Transformations
Bayesian Outliers
Bayesian Robust Methods for Outliers
Generalized Linear Model and Weighted Regression
Hierarchical Models

Please check the website for updates, slides and current readings.

Grading:

Homework	20%
Midterm	25%
TakeHome	25%
Final	25%
Participation	5%

Grades may be curved at the end of the semester. Cumulative numerical averages of 90 - 100 are guaranteed at least an A-, 80 - 89 at least a B-, and 70 - 79 at least a C-, however the exact ranges for letter grades will be determined after the final exam. The more evidence there is that the class has mastered the material, the more generous the curve will be.

Homework:

These will be assigned weekly on the course webpage.

The objective of the problem sets is to help you develop a more in-depth understanding of the material and help you prepare for exams and projects. Grading will be based on completeness as well as accuracy. In order to receive credit you must show all your work.

No late assignments will be allowed, however the lowest score will be dropped.

You are welcomed, and encouraged, to work with each other on the problems, but you must turn in your own work. If you copy someone else’s work, both parties will receive a 0 for the problem set grade as well as being reported to the Office of Student Conduct. Work submitted on Sakai will be checked for instances of plagiarism prior to being graded.

Submission instructions: You will submit your HW on Sakai by uploading a PDF. If the TAs cannot view your work, or read your handwriting, you will lose points accordingly. We will be using R/knitr with $\LaTeX$ for preparing assignments using github classroom for data analysis.

Attendance & Participation:

You are expected to be present at class meeting and actively participate in the discussion. Your attendance and participation during class, as well as your activity on the discussion forum on Sakai will make up 5% of your grade in this class. While I might sometimes call on you during the class discussion, it is your responsibility to be an active participant without being called on.

Takehome Data Analysis Problem

The objective of the TakeHome is to give you independent applied research experience using real data and statistical methods. You will use all (relevant) techniques learned in this class to analyze a dataset provided by me.

Further details on the TakeHome will be provided as due dates approach.

Note that you must score at least 30% of the points on the TakeHome Exam in order to pass this class.

Exams:

There will be one midterm and one final in this class. See course info for dates and times of the exams. You are allowed to use one sheet of notes (``cheat sheet”) on the midterm and two for the final. This sheet must be no larger than 8 ¹⁄₂ x 11, and must be prepared by you. You may use both sides of the sheet and can write as small as you wish.

Policies Regarding Homework:

No late Homework
The lowest HW score will be dropped automatically at the end of the semester
Late work policy for TakeHome Data Analysis: 10% off for each day late.
The final exam must be taken at the stated time. Please book flights accordingly!
There will be no Makeup exams; if you miss the midterm for any reason, your predicted grade given the other information from the class will be used to fill in the missing grade.
Regrade requests must be made within 3 days of when the assignment is returned, and must be submitted in writing. These will be honored if points were tallied incorrectly, or if you feel your answer is correct but it was marked wrong. No regrade will be made to alter the number of points deducted for a mistake. There will be no grade changes after the final exam.
Use of disallowed materials (textbook, class notes, web references, any form of communication with classmates or other persons, etc.) during inclass exams will not be tolerated. For the Take Home data analysis, students are limited to materials covered in class or course resources; no external queries or use of outside resources. This will result in a 0 on the exam for all students involved, possible failure of the course, and will be reported to the Office of Student Conduct. If you have any questions about whether something is or is not allowed, please ask me beforehand.

Email & Forum (Piazza):

I will regularly send announcements by email through Sakai; please make sure to check your email daily.

Any non-personal questions related to the material covered in class, problem sets, labs, projects, etc. should be posted on Piazza forum. Before posting a new question please make sure to check if your question has already been answered. The TAs and myself will be answering questions on the forum daily and all students are expected to answer questions as well. Please use informative titles for your posts.

Note that it is more efficient to answer most statistical questions ``in person” so make use of Office Hours.

Students with disabilities:

Students with disabilities who believe they may need accommodations in this class are encouraged to contact the Student Disability Access Office at (919) 668-1267 as soon as possible to better ensure that such accommodations can be made.

Academic integrity:

Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and non-academic endeavors, and to protect and promote a culture of integrity. Cheating on exams and quizzes, plagiarism on homework assignments and projects, lying about an illness or absence and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the Duke Community Standard, and will not be tolerated. Such incidences will result in a 0 grade for all parties involved as well as being reported to the Office of Student Conduct. Additionally, there may be penalties to your final class grade. Please review the Duke’s Academic Dishonesty policies.