This class covers data-analytic tools for discrete data. Starting from the properties of exponential families, we will investigate the general concepts behind generalized linear models (GLMs) and survey a variety of models in the particular contexts for which they are suited. Examples include models for binary data, polytomous data (ordered and unordered), count data, contingency tables, matrix data, and tree-structured data. These models cover a wide range of applications, from classical to modern. We will also cover models involving random/mixed effects and (if time permits) some recent methodological developments for coping with the increasing dimensionality and complexity of modern data sets, in particular generative models, graphical models, and latent variable models.
STA 611/532/732, STA 521/721, and linear algebra.
Some familiarity with statistical software such as R.
Li Ma (Instructor), Email: li.maPENGUIN@dukePENGUIN.edu
Joe Mathews (TA), Email: joseph.mathewsPUFFIN@dukePUFFIN.edu
Don't forget to remove the arctic birds from the email addresses!
TBD.
WF 1:25-2:40PM in Old Chem 025
Lecture notes
All lecture notes and handouts will be posted on Canvas.
Textbook
Categorical data analysis. 3rd Ed. By Alan Agresti. (Available online at Duke library.)
Computing references
Other references (that may be helpful but are not necessary)
Generalized linear models. 2nd Ed. By Peter McCullagh and John Nelder.
Exponential families in theory and practice. By Bradley Efron.
Bayesian data analysis. 3rd Ed. By Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin.
Bayesian models for categorical data. By Peter Congdon.
Homework: 3 to 4 assignments (30%). Answers or reports for the data analysis problems must be typeset in LaTeX. Answers to theoretical problems may be handwritten. Code should be attached in an appendix. Homework assignments will be graded on a 4-point scale (Excellent, Good, Fair, and Poor). Both an Excellent and a Good will give you full credit for grading purposes. You must show your work to receive credit. Late homework will be accepted, but will incur a one-level grade penalty for each 24-hour period it is late (starting from the first minute past the deadline). The lowest homework grade will be dropped. Homework assignments will be released and submitted on Gradescope.
Review of one designated topic area (30%): presenting and leading the discussion in class (20%) + active participation during others’ presentations and discussions (10%).
A course project (40%): a proposal (5%) + a final report (25%) + an in-class presentation (10%).
About collaboration on homework: While discussions and collaborations are extremely important and greatly encouraged in scientific research, the essential skills for data analysis are best acquired through independent effort during the training stage. Therefore some of the homework problems, especially open-ended data analysis problems, will be marked as “to be completed independently”. For those problems, before the homework is submitted, you may discuss them only with the instructor and/or the TAs.