STA 345: Multivariate Analysis

Class: Tu Th 2:50-4:05pm, Old Chem 025
Prof: Robert L. Wolpert (wolpert@stat.duke.edu)
OH: Wed 4:15-5:00pm, Old Chem 211c
TA: Anirban Bhattacharya (anib86@gmail.com)
OH: Mon 3:00-4:30pm, Old Chem 211a

Tentative Schedule

Description

Half a century ago the phrase Multivariate Statistics was generally understood to describe sampling-theory based statistical methods for studying multi-dimensional normally-distributed data. The fundamental tools for this aspect of the subject are a deep understanding of linear algebra and of the probability distributions associated with the normal, such as the Wishart and its kin. The best-known methods arising in this area are PCA (Principal Components Analysis), FA (Factor Analysis), Hotelling's T² test, and perhaps relatives like Principal Components Regression and multivariate ANOVA.

More recently, interest in computational methods, causality, and model formulation has led to a growth in the study of Graphical Models, in which the conditional (in)dependence structure for a family of random variables is encoded in the form of a graph: a collection of points (the vertices), some of which are connected (by edges, which are possibly-ordered pairs of vertices). For non-Gaussian distributions it is sometimes necessary to go beyond graphs to "hypergraphs".
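As a small illustration of how a graph can encode conditional independence, here is a minimal sketch for the Gaussian case only. It relies on the standard fact that, for a multivariate normal vector, two coordinates are conditionally independent given all the others exactly when the corresponding entry of the inverse covariance (precision) matrix is zero. The helper names `invert` and `edges_from_covariance`, and the three-variable covariance with entries rho^|i-j|, are made up for this example and are not course material.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix exactly by Gauss-Jordan elimination.

    Assumes the matrix is nonsingular (otherwise the pivot search fails).
    """
    n = len(matrix)
    # Augment the matrix with the identity on the right.
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap a row with a nonzero pivot into position.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot equals 1.
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Clear the pivot column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edges_from_covariance(cov):
    """Return the undirected edges (i, j), i < j, of the Gaussian graph.

    An edge is present exactly when the precision-matrix entry is nonzero.
    """
    precision = invert(cov)
    n = len(cov)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if precision[i][j] != 0}


# Hypothetical example: three variables with covariance rho^|i-j|.
rho = Fraction(1, 2)
cov = [[rho ** abs(i - j) for j in range(3)] for i in range(3)]
precision = invert(cov)
edges = edges_from_covariance(cov)
print(edges)
```

Exact rational arithmetic is used so that a zero in the precision matrix is an exact zero and not a rounding artifact. Here the graph is the chain 0 - 1 - 2: variables 0 and 2 are marginally correlated (covariance 1/4) but conditionally independent given variable 1, so the graph has no edge between them.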

My plan is to try to cover the highlights of both traditional (multivariate Gaussian) MVA and of graphical models. This will be the first time I've taught this material and I'll be learning some of it as we go along, so don't expect a smooth ride or a polished syllabus. I hope to have some computing aspects for the course if I can manage it.

Students are expected to be (or become) comfortable with probability theory at the level of STA214 or STA205, statistical inference at the level of STA215, and linear models at the level of STA244. Some experience computing in R or MATLAB would be helpful.

The methods we consider will most often be tailored for problems in which the number of observations (traditionally denoted n) exceeds (maybe by a lot) the number of uncertain parameters (traditionally p). Recently there has been a great deal of interest in problems where p ≫ n; this arises naturally in genomic applications, intrusion detection, and other emerging areas of interest. Since those problems are studied in detail in our sister course Statistical Data Mining (STA 218), we won't spend much time on them here.

Assessment

This is a 300-level course and really shouldn't be graded; but since it is, there will be five problem sets (about one every two weeks) and an optional final project. The final project can be either a five page (or so) paper presenting a data analysis using methods from this course on data of interest to you (or I can help you find some, if you prefer), or a five page (or so) paper (or 15 minute oral presentation) on a journal article that either develops or applies interesting multivariate methods. Any student who turns in the homeworks with a good-faith effort at completing them will receive at least an A- in the course; any student who also completes an optional project will receive an A; students who do neither of these will receive a B or B+.

Textbooks

The textbooks for the course are:

Other interesting books with complementary strengths (which I'll use at times) include:

Morrison's book is good but insanely expensive ($185 at Amazon.com or $157 straight from the publisher); however, Chapters 1 & 2 of the third edition (3/e) are accessible on-line from the publisher.

Here's a helpful multimedia introduction to MVA: click here.