## STA114: Statistics

### Optional Term Project

#### Due: 7pm Wednesday, May 2, 2001

Students who wish to may replace the final exam with a final project in which they apply statistical methods from this course to a real data set in order to solve a scientific problem. Each project must include:
• a description of a scientific question being addressed;
• a description of a data set collected in the hope of answering that question;
• a description of the statistical methods you used to help illuminate the evidence offered by the data;
• some critical analysis of the statistical model you used. Graphical methods are especially useful here: scatter plots, histograms, residual plots, etc. will be helpful. If your analysis involved regression, did you transform one or more of the variables? How? Why? Did you have to include a quadratic term? How did you handle your variable selection problem? Are you confident that the assumptions of linearity, equality of variance, and approximate normality hold? Why?
• and the conclusions that your analysis helps you to draw, in the context of the original scientific question.
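
As one illustration of the regression questions in the list above, here is a minimal sketch of a residual check. The data are made up for the example, and the helper names (`fit_line`, `residuals`, `curvature_check`) are my own, not from any library. The idea is that residuals from a straight-line fit that track the squared distance of x from its mean suggest a quadratic term or a transformation may be needed; a residual plot shows the same thing more vividly.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

def residuals(x, y, a, b):
    """Observed minus fitted values for the line y = a + b*x."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def curvature_check(x, res):
    """Correlation between the residuals and the squared centered x.

    Values far from 0 hint that the straight-line model misses curvature.
    """
    n = len(x)
    xbar = sum(x) / n
    q = [(xi - xbar) ** 2 for xi in x]
    qbar = sum(q) / n
    rbar = sum(res) / n
    sqq = sum((qi - qbar) ** 2 for qi in q)
    srr = sum((ri - rbar) ** 2 for ri in res)
    if sqq == 0 or srr == 0:
        return 0.0
    sqr = sum((qi - qbar) * (ri - rbar) for qi, ri in zip(q, res))
    return sqr / math.sqrt(sqq * srr)

# Hypothetical data (not from any real study): y grows quadratically in x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 2 for xi in x]

a, b = fit_line(x, y)
res = residuals(x, y, a, b)
corr_raw = curvature_check(x, res)  # near 1: strong leftover curvature

# A square-root transform of y makes this particular relationship linear.
y_sqrt = [math.sqrt(yi) for yi in y]
a2, b2 = fit_line(x, y_sqrt)
res_sqrt = residuals(x, y_sqrt, a2, b2)
max_abs_res_sqrt = max(abs(r) for r in res_sqrt)

print("raw fit: slope", b, "curvature correlation", corr_raw)
print("sqrt fit: largest absolute residual", max_abs_res_sqrt)
```

With real data the pattern will be noisier, so treat a number like this as a companion to the residual plot, not a replacement for it.
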
Projects:
• Must include full references for the data set used and for any statistical techniques used that were not taught in this class;
• Will be judged on the statistical sophistication and insight they show (no bonus for time spent collecting your own data). This assignment replaces a cumulative final exam, so it should show your mastery of this entire course;
• Must be between 5 and 10 pages long. Computers should be used, but the project should be a paper and not just computer output: include only the relevant plots or tables, and describe in your own words what light they shed on the scientific problem at hand;
• Are due any time up to the beginning of the scheduled final exam (7pm Wednesday May 2).

Projects must demonstrate mastery of a range of statistical ideas; a routine binomial analysis of survey data would not be appropriate. The project takes the place of a 3-hour comprehensive final exam and is not any easier than studying for and taking the final: you must show just as much depth and breadth of statistical knowledge in a project as you would have in an exam. Most projects will involve model building and model elaboration, computing and displaying posterior distributions for quantities of interest, often using regression and linear models (you may want to read ahead a bit).
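
To make "computing posterior distributions for quantities of interest" concrete, here is a minimal sketch for one simple case: a Poisson rate with a conjugate Gamma prior. The counts and the Gamma(1, 1) prior are invented for the illustration, and the function names are my own. A real project would need to justify its prior and go well beyond a one-parameter model.

```python
import math

def gamma_poisson_posterior(counts, prior_shape, prior_rate):
    """Conjugate update: a Gamma(shape, rate) prior with Poisson counts
    gives a Gamma(shape + sum(counts), rate + n) posterior."""
    return prior_shape + sum(counts), prior_rate + len(counts)

def gamma_logpdf(lam, shape, rate):
    """Log density of a Gamma(shape, rate) distribution at lam > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(lam) - rate * lam)

def credible_interval(shape, rate, level=0.95, grid_size=20000):
    """Equal-tailed credible interval by summing the density on a grid."""
    mean = shape / rate
    sd = math.sqrt(shape) / rate
    upper = mean + 10 * sd              # grid covers essentially all the mass
    step = upper / grid_size
    grid = [(i + 0.5) * step for i in range(grid_size)]
    dens = [math.exp(gamma_logpdf(lam, shape, rate)) for lam in grid]
    total = sum(dens)
    lo_p = (1 - level) / 2
    hi_p = 1 - lo_p
    cum = 0.0
    lo = hi = None
    for lam, d in zip(grid, dens):
        cum += d / total
        if lo is None and cum >= lo_p:
            lo = lam
        if hi is None and cum >= hi_p:
            hi = lam
            break
    return lo, hi

# Hypothetical counts of events in 10 observation periods (made up).
counts = [0, 1, 0, 2, 1, 0, 0, 1, 3, 0]

# Assumed prior for the illustration only: Gamma(shape=1, rate=1).
post_shape, post_rate = gamma_poisson_posterior(counts, 1.0, 1.0)
post_mean = post_shape / post_rate
lo, hi = credible_interval(post_shape, post_rate)

print("posterior: Gamma(%g, %g), mean %.3f" % (post_shape, post_rate, post_mean))
print("95%% credible interval: (%.3f, %.3f)" % (lo, hi))
```

The grid summation is used here only to keep the sketch free of outside libraries; a statistics package's Gamma quantile function would do the same job.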

Substantial use of methods from outside this course (for example, from econometrics or environmetrics courses) is discouraged, since your goal is to show mastery of ideas from this course. Statistical methods not covered in the class syllabus must be carefully referenced and explained to demonstrate that the analysis is original.

Team projects are possible, with a maximum of three team members, but will have to be substantially deeper (and a bit longer) than individual projects and must show each participant's specific contribution in detail.

One source of data sets is the book A Handbook of Small Data Sets by Hand et al. While the book's 510 data sets are only described in the book itself (you can borrow my copy in my office, and xerox a copy of whatever data sets you like), the data sets (just numbers, no stories) are on-line. You can get to them by following the Data link from the Home or Syllabus pages, then taking the Hand et al. link from there. A similar collection of 100 data sets appears in the book Data by Andrews and Herzberg; this one (also on-line) includes some famous data sets, like the Stanford Heart Transplant data we've looked at in class and the 1875-1894 Deaths by Horsekicks in the Prussian Army data (they follow the Poisson distribution, probably too well, making some people suspect that outliers were altered or removed). There are lots of other data sets available on-line too; start with the class web page, or use a search engine and have fun. CMU's StatLib and its Data & Story archive are especially good places to start.
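
If you want to ask whether a table of counts follows the Poisson distribution "too well," one rough approach is sketched below. The frequency table is made up for the illustration (it is not the horsekick data), and `poisson_fit_summary` is a name I chose. The sketch compares observed frequencies with those expected under a Poisson model whose rate is the sample mean, and reports the variance-to-mean ratio, which should be near 1 for Poisson data.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing exactly k events at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_fit_summary(freq):
    """Summarize how well a Poisson model fits a frequency table.

    freq maps each count value to the number of observations with that value.
    The largest value is treated as a pooled "that many or more" cell.
    """
    n = sum(freq.values())
    mean = sum(k * c for k, c in freq.items()) / n
    var = sum(c * (k - mean) ** 2 for k, c in freq.items()) / (n - 1)
    kmax = max(freq)
    expected = {k: n * poisson_pmf(k, mean) for k in range(kmax)}
    expected[kmax] = n - sum(expected.values())   # pooled upper tail
    chi2 = sum((freq.get(k, 0) - e) ** 2 / e for k, e in expected.items())
    return {"mean": mean, "dispersion": var / mean,
            "expected": expected, "chi2": chi2}

# Hypothetical frequency table (made up): 50 zeros, 35 ones, 12 twos, 3 threes.
freq = {0: 50, 1: 35, 2: 12, 3: 3}
summary = poisson_fit_summary(freq)

print("sample mean:", summary["mean"])
print("variance / mean:", round(summary["dispersion"], 3))
print("chi-square statistic:", round(summary["chi2"], 3))
```

The chi-square statistic would be compared with a chi-square distribution on (number of cells − 2) degrees of freedom, since one parameter was estimated. An unusually small value is what "fits too well" would look like.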

While you're welcome to use your own data, it's probably not worthwhile to go collecting data only for this project (it takes too much time to do well). On the other hand, if you already have data from your ongoing research, coursework in another class, hobbies, etc., especially something you know and care about, feel free to use that data set for this project.

Ask by e-mail (wolpert@stat.duke.edu) or in person if you have additional questions. You can find me before or after class, during my office hours, or at other times when I'm not teaching or away. I'm also happy to look over outlines or drafts and give you some feedback and suggestions UNTIL THE LAST WEEK OF CLASS. Sorry, but I will have little if any time during reading and exam weeks, so please start your projects early if you would like some feedback or help on them.