Class: | Tu Th 2:50-4:05pm | Old Chem 025 |
Prof: | Robert L. Wolpert | (wolpert@stat.duke.edu) |
OH: | Wed 4:15-5:00pm | Old Chem 211c |
TA: | Anirban Bhattacharya | (anib86@gmail.com) |
OH: | Mon 3:00-4:30pm | Old Chem 211a |
Half a century ago the phrase Multivariate Statistics was generally understood to describe sampling-theory based statistical methods for studying multi-dimensional normally-distributed data. The fundamental tools for this aspect of the subject are a deep understanding of linear algebra and of the probability distributions associated with the normal, such as Wishart and its kin. The best-known methods arising in this area are PCA (Principal Components Analysis), FA (Factor Analysis), Hotelling's T^{2} test, and perhaps relatives like Principal Components Regression and multivariate ANOVA.
More recently, interest in computational methods, causality, and model formulation has led to a growth in the study of Graphical Models, in which the conditional (in)dependence structure for a family of random variables is encoded in the form of a graph: a collection of points (the vertices), some of which are connected (by edges, or possibly-ordered pairs of vertices). For non-Gaussian distributions it is sometimes necessary to go beyond graphs to "hypergraphs".
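To make the encoding concrete, here is a minimal sketch for the Gaussian case only. It relies on the standard fact that, for a multivariate normal vector, a zero off-diagonal entry in the precision (inverse covariance) matrix means the two variables are conditionally independent given all the others. The three-variable precision matrix and the helper names `edges_from_precision` and `conditionally_independent` are made up for illustration; they are not part of the course materials.

```python
# Hypothetical precision (inverse covariance) matrix for a Gaussian
# vector (X0, X1, X2) arranged as a chain X0 - X1 - X2.
# The numbers are an assumption chosen purely for illustration.
precision = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]


def edges_from_precision(K, tol=1e-12):
    """Return the undirected edges (i, j), i < j, where K[i][j] is nonzero."""
    p = len(K)
    return {
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(K[i][j]) > tol
    }


def conditionally_independent(i, j, edges):
    """In the Gaussian case, a missing edge between i and j means X_i and
    X_j are independent given all of the remaining variables."""
    return (min(i, j), max(i, j)) not in edges


vertices = list(range(len(precision)))   # the points of the graph
edges = edges_from_precision(precision)  # the connections between them

print("vertices:", vertices)
print("edges:", sorted(edges))
print("X0 and X2 independent given X1?", conditionally_independent(0, 2, edges))
```

Here the graph has an edge between X0 and X1 and between X1 and X2 but none between X0 and X2, so X0 and X2 are conditionally independent given X1 even though they are marginally correlated.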
My plan is to try to cover the highlights of both traditional (multivariate Gaussian) MVA and graphical models. This will be the first time I've taught this material and I'll be learning some of it as we go along, so don't expect a smooth ride or a polished syllabus. I hope to have some computing aspects for the course if I can manage it.
Students are expected to be (or become) comfortable with probability theory at the level of STA214 or STA205, statistical inference at the level of STA215, and linear models at the level of STA244. Some computing experience in R or MATLAB would be helpful.
The methods we consider will most often be tailored for problems in which the number of observations (traditionally denoted n) exceeds (maybe by a lot) the number of uncertain parameters (traditionally p). Recently there has been a great deal of interest in problems where p ≫ n; this arises naturally in genomic applications, intrusion detection, and other emerging areas of interest. Since those problems are studied in detail in our sister course Statistical Data Mining (STA 218), we won't spend much time on them here.