STA561 COMPSCI571: Probabilistic Machine Learning: Fall 2015

Prof: Sayan Mukherjee OH: Mon 10-12, 112 Old Chem
Abhishek Dubey abhisdub@cs.duke.edu OH: Wednesday 10-11am LSRC D309
Yuhao Liang yuhao.liang@duke.edu OH: Monday 7:00-9:00pm Old Chem 211a
Xinyi Li
Class: M/W 8:30-9:45am Social Sciences 136


Introduction to machine learning techniques. Graphical models, latent variable models, dimensionality reduction techniques, statistical learning, regression, kernel methods, state space models, HMMs, MCMC. Emphasis is on applying these techniques to real data in a variety of application areas.

News and information

All students: we will have one poster session, on Dec 4, in the Gross Hall 3rd floor East Meeting Space. For an example poster, see the tex example or keynote example. If you are auditing the course, we'd love to have you at the poster session (bring your research groups too!).

Statistics at the level of STA611 (Introduction to Statistical Methods) is encouraged, along with knowledge of linear algebra and multivariate calculus.

Course grade is based on an in-class midterm (15%), an in-class final (35%), a final project (40%), and the poster session for the final project (10%). We will have homeworks, but they will not be graded; we will post solutions.

There is a Piazza course discussion page. Please direct questions about homeworks and other matters to that page. Otherwise, you can email the instructors (TAs and professor). Note that we are more likely to respond to Piazza questions than to email, and your classmates may respond too, so Piazza is a good place to start.

The final projects should be written in LaTeX. If you have never used LaTeX before, there are online tutorials, Mac GUIs, and even online compilers that might help you.

The course project will include a project proposal due mid-semester, a four-page writeup of the project at the end of the semester, and an all-campus poster session where you will present your work. This is the most important part of the course; we strongly encourage you to come and discuss project ideas with us early and often throughout the semester. We expect some of these projects to become publications. You are absolutely permitted to use your current rotation or research project as your course project. Examples of last year's projects.

A second set of references for R may be useful. First, you can download R from the CRAN website. There are many resources, such as RStudio, that can help with the programming interface, and tutorials on R are widely available online. If you are getting bored with the standard graphics package, I really like using ggplot2 for beautiful graphics and figures. Finally, you can integrate R code and output with plain text using knitr, but that might be going a bit too far if you are a beginner.

The course will follow my Lecture Notes (these will be updated as the course proceeds). Some other texts and notes that may be useful include:

  1. Kevin Murphy, Machine Learning: a probabilistic perspective
  2. Michael Lavine, Introduction to Statistical Thought (an introductory statistical textbook with plenty of R examples, and it's online too)
  3. Chris Bishop, Pattern Recognition and Machine Learning
  4. Daphne Koller & Nir Friedman, Probabilistic Graphical Models
  5. Hastie, Tibshirani, Friedman, Elements of Statistical Learning (ESL) (PDF available online)
  6. David J.C. MacKay Information Theory, Inference, and Learning Algorithms (PDF available online)

The final project TeX template and final project style file should be used in preparation of your final project report. Please follow the instructions and let me know if you have questions. We will have a poster session where you present your research project in lieu of a final exam.

This syllabus is tentative, and will almost surely be modified. Reload your browser for the current version.

This year's final projects

  1. Predicting sales of Rossman's stores
  2. Gentrification Index Using Yelp Data
  3. Risk estimates of tree mortality across species using Bayesian hierarchical models
  4. Classification of TV Channels
  5. Prediction of Coupon Purchasing Behavior
  6. Classification of Cardiac Tissue Regions Based on Motion Profile in Ultrasound Images
  7. Spectral Clustering of Chinese Herbal Medicine Network
  8. Use of Machine Learning in Predicting Bankruptcy
  9. Distinguishing malignant from benign breast tumors
  10. Detection of Solar Panels from Satellite Imagery
  11. Yelp Customer Review Bias Analysis through Linear Mixed Effect Models with Natural Language Sentiment Polarity Scores
  12. Testing the CAPM Theory for German CDS Based on a Model with GARCH-type Volatilities and SSAEPD Errors
  13. Bayesian Non-Parametrics and Dirichlet Process Clustering Techniques
  14. Text Analysis of News Articles (Building a Protest Dataset through Machine Learning)
  15. Information Popularity and Diffusion Size Prediction in Online Social Networks
  16. Cascading Classifier for Face Detection
  17. What's Cooking? Predicting Cuisines from Recipe Ingredients
  18. Analysing Senator Community Structure from Roll Call Data
  19. Handwritten Digits Recognition
  20. A Neural Algorithm for Artistic Style
  21. Machine Learning with Python
  22. Predictive Modeling of Bank Marketing for Term Deposit
  23. Air Pollution Distribution Analysis for Beijing Haze
  24. Beyond SVD
  25. Legislation approval ratings prediction via vote correlation
  26. Categorical Prediction of Song Popularity Using Topological Data Analysis
  27. Movie Recommender System
  28. The Effect of Racial Diversity on High School Graduation Rates
  29. Comparison of feature selection methods in modeling resting metabolic rate
  30. Randomization as regularization
  31. Designing an optimum traffic signal system using reinforcement learning
  32. Topic modeling for community analysis and range estimation
  33. Classifying Soccer Matches in the English Premier League
  34. Spectral algorithms and tensor methods for learning in POMDPs
  35. World Cup Recap
  36. Dimension Reduction Methods on Handwritten Digits Recognition
  37. ML methods for Drosophila Dorsal closure
  38. The Animal Model for Censored Traits
  39. Spectral Clustering and Community Detection in Labeled Graph
  40. Cluster Analysis of Endogenous Taxi Driver Schedule Patterns

Syllabus

    1. (August 24th) Introduction and review: Lecture 1 in notes

    2. (August 26th) No class
    3. (August 31st) Linear regression, the proceduralist approach: Lecture 2 in notes

    4. (September 2nd) Bayesian motivation for proceduralist approach: Lecture 3 in notes

    5. (September 7th) Bayesian linear regression: Lecture 4 in notes
    6. (September 9th) Reproducing kernel Hilbert spaces: Lecture 5 in notes

    7. (September 14th) Nonlinear regression: Lecture 6 in notes

    8. (September 16th, 21st) Support Vector Machines: Lecture 7 in notes
    9. (September 23rd) Regularized logistic regression: Lecture 8 in notes
    10. (September 28th) Gaussian process regression: Lecture 9 in notes

    11. (September 30th) Sparse regression: Lecture 10 in notes

    12. (October 5th) The boosting hypothesis and Adaboost: Lecture 11 in notes

    13. (October 7th) In class midterm

    14. (October 14th, 19th) Statistical learning theory: Lecture 12 in notes
    15. (October 19th, 21st) Mixture models and latent space models: Lecture 13 in notes

    16. (October 26th, 28th) Latent Dirichlet Allocation: Lecture 14 in notes

    17. (November 2nd, 4th) Markov chain Monte Carlo: Lecture 15 in notes

    18. (November 9th, 11th) Hidden Markov models: Lecture 16 in notes

    19. (November 23rd) In class final
    20. (December 4th) Poster session (2pm)
    21. (December 7th) Final projects due