This course introduces students to concepts and techniques of Classical and Bayesian approaches for modern regression and predictive modelling. The course will blend theory and application using a range of examples.
It is expected that students have either taken STA 601 (or 360/602), are co-registered, or are familiar with some basics of Bayesian analysis. We will introduce JAGS to simplify modelling and use a range of R packages to support computing.
All students should be comfortable with mathematical statistics at the level of STA 250/611. Linear algebra and the basics of linear regression are also considered prerequisites. Materials for reviewing linear algebra are under the Resource tab.
The course goals are as follows:
Course topics will be drawn (but subject to change) from
An Introduction to Statistical Learning: with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani.
This book is freely available as an eBook through the Duke Library (Get it @Duke). You are welcome to download or print it out. If you prefer a paperback version, you may buy it at cost from Springer (see links from the library site) or purchase a hardback version at the Duke Bookstore or through Amazon.
For additional information, check out Videos for the ISL book.
Data Analysis using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill is available from Amazon. We will refer to this book for practical aspects of regression and Bayesian hierarchical modelling, model checking, and more.
Other resources for reference books, statistical computing using R, etc. are provided on the Resource tab.
Screencasts from lectures are available on Sakai under Lessons.
| Component | Weight |
|---|---|
| Homework | 25% |
| Midterm I | 20% |
| Midterm II | 20% |
| Data Analysis Project Part I | 15% |
| Data Analysis Project Part II | 15% |
| Participation | 5% |
Grades may be curved at the end of the semester. Cumulative numerical averages of 90 - 100 are guaranteed at least an A-, 80 - 89 at least a B-, and 70 - 79 at least a C-; however, the exact ranges for letter grades will be determined after the final exam. The more evidence there is that the class has mastered the material, the more generous the curve will be.
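As a rough illustration of how the weights above combine into a cumulative average, here is a minimal Python sketch. The function names and the example scores are hypothetical, and the cutoffs assume the guaranteed floors apply at exactly 90, 80, and 70 with no rounding; the actual letter-grade ranges are set by the instructor after any curve.

```python
# Weights taken from the grading breakdown above (they sum to 100%).
WEIGHTS = {
    "Homework": 0.25,
    "Midterm I": 0.20,
    "Midterm II": 0.20,
    "Data Analysis Project Part I": 0.15,
    "Data Analysis Project Part II": 0.15,
    "Participation": 0.05,
}


def cumulative_average(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def guaranteed_floor(average):
    """Lowest letter grade guaranteed before any curve (None below 70).

    Assumption: cutoffs sit at exactly 90, 80, and 70 with no rounding.
    """
    if average >= 90:
        return "A-"
    if average >= 80:
        return "B-"
    if average >= 70:
        return "C-"
    return None


# Hypothetical scores, for illustration only.
example_scores = {
    "Homework": 92,
    "Midterm I": 80,
    "Midterm II": 85,
    "Data Analysis Project Part I": 88,
    "Data Analysis Project Part II": 90,
    "Participation": 100,
}

average = cumulative_average(example_scores)
print(round(average, 2), guaranteed_floor(average))
```

The floor is only a lower bound: a curve can raise the letter grade for a given average but, under the policy above, cannot lower it below the guaranteed level.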
We will use R as the programming language for data analysis and use existing packages written in R to support the course, in addition to Bayesian computation using JAGS. You should have access to a laptop or desktop capable of running R or RStudio. We will also provide all students with access to a dedicated server running RStudio Pro, which offers a unified environment. See the Resources page for books and other resources for learning R.
Assignments are in the Course Calendar. These will be assigned weekly.
The objective of the problem sets is to help you develop a more in-depth understanding of the material and help you prepare for exams and projects. Grading will be based on completeness as well as accuracy. The assignments will be a mix of individual and team-based coding assignments.
Submission instructions: You will submit your HW on Sakai by uploading a PDF or through the Course Organization repo on GitHub for your team.
All assignments will be time stamped and late work will be penalized based on this time stamp (see late work policy below).
The objective of the Project is to give you real-world research experience using real data and statistical methods. You will use all (relevant) techniques learned in this class or explore additional advanced material to solve a problem, explore its properties (either analytically or through simulation), and present it using reproducible methods.
Further details will be provided as due dates approach.
There will be two midterms in this class. See course info for dates and times of the exams. You are allowed to use one sheet of notes ("cheat sheet") for each exam. This sheet must be no larger than 8 1/2 x 11 inches and must be prepared by you. You may use both sides of the sheet and can write as small as you wish.
See the Course Calendar for dates for the Midterms and other assignments. If MSS students have more than 3 exams in one week, please notify me as early in the semester as possible so that we may adjust the timing if possible.
There will be no makeup exams. If a student misses one exam for any reason, their score will be imputed based on the previous or future exam. Missing both exams will result in a grade of 0 on the exams.
You are expected to be present at class meetings and actively participate in the discussion. Your attendance and participation during class, as well as your activity on the discussion forum on Piazza, commits on GitHub, and peer evaluation, will make up 5% of your grade in this class. While I might sometimes call on you during the class discussion, it is your responsibility to be an active participant without being called on.
We will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TAs, and me. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza (peer answers earn participation points!). If you have any problems or feedback for the developers, email team@piazza.com.
Any non-personal questions related to the material covered in class, problem sets, labs, projects, etc. should be posted on Piazza. Before posting a new question, please check whether your question has already been answered. The TAs and I will be answering questions on the forum daily, and all students are expected to answer questions as well as part of the participation grade. Please use informative titles for your posts and link to the appropriate topics.
Note that it is more efficient to answer most statistical questions "in person", so make use of Office Hours.
Students with disabilities who believe they may need accommodations in this class are encouraged to contact the Student Disability Access Office at (919) 668-1267 as soon as possible to better ensure that such accommodations can be made.
Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and non-academic endeavors, and to protect and promote a culture of integrity. Cheating on exams and quizzes, plagiarism on homework assignments and projects, lying about an illness or absence, and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the Duke Community Standard, and will not be tolerated. Such incidents will result in a 0 grade for all parties involved and will be reported to the Office of Student Conduct. Additionally, there may be penalties to your final class grade. Please review Duke’s Academic Dishonesty policies.
Use of disallowed materials (textbook, class notes, web references, any form of communication with classmates or other persons, etc.) during exams or take home projects will not be tolerated. This will result in a 0 on the exam for all students involved, possible failure of the course, and will be reported to the Office of Student Conduct. If you have any questions about whether something is or is not allowed, ask me beforehand.
Reusing and building upon ideas or code are major parts of modern open-source software development. As a professional data scientist you will often build on the code of others. This class is structured such that all solutions are public. You are encouraged to learn from the work of your peers; however, this should not involve simply cutting and pasting. I won’t hunt down people who are simply copying and pasting solutions, because without challenging themselves to learn the material, they are simply wasting their time and money taking this class.
However, please respect the terms of use and/or license of any code you find, and if you reimplement or duplicate an algorithm or code from elsewhere, you must credit the original source with an inline comment. Failure to credit may result in a grade of zero.