Statistics 322/522
Design of Surveys and Causal Studies

  Spring 2018


Course Description

In this course, we learn about different designs for collecting data and their implications for statistical inference. We cover two main topics: how to design surveys of populations in ways that give reliable estimates, and how to design studies in ways that allow for valid causal claims.

With regard to surveys, we investigate the mathematical underpinnings of randomization as a tool for data collection. We focus on the benefits and pitfalls of deviating from purely randomized samples, including stratification, clustering, and convenience sampling. We learn how to design and analyze the complex surveys typically employed by government agencies. We also discuss special designs for hard-to-reach populations, as well as issues of fairness and generalizability when using big data to train algorithms for predictive analytics.

With regard to causal studies, we again discuss the central role of randomization as a tool for ensuring fair comparisons of treatments. We focus on the benefits and pitfalls of variations on randomized designs, including blocking and factorial designs. We discuss design for observational studies, focusing on methods like propensity score matching. Throughout, we discuss a variety of genuine designs spanning applications in public policy, health, and the social and natural sciences.

Course Objectives

Logistics

Prerequisites

Students must have passed STA 210 or an equivalent course in regression analysis. We do a lot of manipulations with discrete random variables, so comfort applying expectation and variance formulas is necessary.

Readings

There are no required texts for this course. Instead, we will read book chapters and articles posted on the course website on Sakai. Two useful reference texts are:

Lohr, S. L. (2010), Sampling: Design and Analysis, 2nd edition, Cengage Learning. ISBN 0495105279.

Imbens, G. W. and Rubin, D. B. (2015), Causal Inference for Statistics, Social, and Behavioral Sciences: An Introduction, Cambridge University Press. ISBN 0521885884.

Computing

We will use the statistical software package R for analyzing data.  It can be downloaded for free at http://www.r-project.org/.  Alternatively, R is available on the public computers on campus.

Calculator

Students don't need a calculator for this course.

Schedule of Topics

We will cover the topics in the table below. We may spend more or less time on a topic than shown, depending on the interests of the participants in the course.

Topic                                                                    Lectures
Introduction to course.                                                  1
Basics of surveys of finite populations. Questionnaire design.           1
Design-based estimation in simple random samples                         1
Design-based estimation in general samples (Horvitz-Thompson estimator)  1
Stratified samples                                                       2
Cluster and unequal probability samples                                  2
Multi-stage sampling designs                                             2
Model-based methods for finite population inference                      1
Non-random sampling for finite population inference                      1
Data privacy and surveys                                                 1
Capture-recapture methods for population size estimation                 1
Basics of causal studies. Potential outcomes. Randomization.             1
Fisher randomization tests                                               1
Blocked designs                                                          1
Factorial designs                                                        1
Fractional factorial designs                                             2
Observational study design, including propensity score methods           4


Graded work

Graded work for the course will consist of two midterm exams, homework assignments, and two projects. Students' final grades will be determined as follows:
 
Component        Weight
Assignments      30%
Midterm Exam 1   30%
Midterm Exam 2   30%
Projects         5% each (10% total)
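
As a rough illustration of how the weights above combine, here is a minimal sketch in Python (the course itself uses R). Only the weights come from this syllabus. The function name, the 0-100 scale for each component, and the sample scores are hypothetical, and the instructor's actual grade computation may differ.

```python
# Weights are taken from the grading breakdown above.
# Everything else in this sketch is a hypothetical illustration.
WEIGHTS = {
    "assignments": 0.30,
    "midterm_1": 0.30,
    "midterm_2": 0.30,
    "project_1": 0.05,
    "project_2": 0.05,
}


def weighted_course_score(scores):
    """Combine component scores (each assumed to be on a 0-100 scale)
    into a single weighted score on the same scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores for one student, for illustration only.
example_scores = {
    "assignments": 90,
    "midterm_1": 80,
    "midterm_2": 85,
    "project_1": 95,
    "project_2": 100,
}

print(round(weighted_course_score(example_scores), 2))
```

Because the two midterms and the assignments carry 30% each, a ten-point change in any one of them moves the weighted score by three points, while the same change in a single project moves it by half a point.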

There are no make-ups for graded work except for medical or familial emergencies or for reasons approved by the instructor before the due date.  See the instructor in advance of relevant due dates to discuss possible alternatives.

Descriptions of graded work

Assignments:

Assignments are posted on the Statistics 322/522 course web site on Sakai. Students turn in these assignments at the beginning of class on the due date. Students are permitted to work with others on the assignments, but each person must write up and turn in their own answers. The assignments are designed to build students' knowledge of the computational and mathematical aspects of study design, and to give practice analyzing survey or causal inference data.

Exams:

The first midterm exam will cover mathematical and conceptual aspects of survey sampling. The second midterm exam will cover mathematical and conceptual aspects of causal inference.   

Projects:

One project will cover surveys and the other project will cover causal inference. Students work in pairs on the projects. The projects involve designing or analyzing surveys or causal studies, applying the methods learned in the course.

Academic honesty

Students are expected to abide by Duke's Community Standard for all work for this course.  Violations of the Standard will result in a zero grade for the relevant assignment and will be reported to the Dean of Students for adjudication. Additionally, there may be penalties to the final grade for the course.   Ignorance of what constitutes academic dishonesty is not a justifiable excuse for violations.

For the exams, students are required to work alone.  For the assignments, students may work with others but each student must submit his or her own answers. For assignments involving computer programming, students can get advice from each other but are required to write their own code. For the projects, students are required to work in teams of two individuals. Teams are permitted to talk with others in the course, but each team must write up their own project report.