The project in this class represents an opportunity for you to tackle an open-ended statistical analysis that addresses a specific research question. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class and apply them to a complex dataset in a meaningful and appropriate way. All analyses must be done in RStudio and written up using R Markdown.
You should write as if you are explaining your results to someone who would be interested in your research question, whether this is another scholar in your field or a peer sharing your interest in the topic. Keep in mind that this audience may or may not have taken statistics: you must be statistically accurate and use correct statistical terminology, but you must also explain your conclusions in a way that a layperson can understand.
To download the template files for the project, run the following code inside RStudio:
download.file("http://stat.duke.edu/~cr173/Sta102_Fa15/Proj/proposal.Rmd", destfile="proposal.Rmd")
download.file("http://stat.duke.edu/~cr173/Sta102_Fa15/Proj/project.Rmd", destfile="project.Rmd")
In order for you to have the greatest chance of success with this project, it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset should have at least 30 observations and between 5 and 20 variables (exceptions can be made, but you must speak with me first). Additionally, your data must represent a sample and not a population, as it is not possible to perform inference with population data. The dataset’s variables should include both categorical variables (e.g. political party affiliation, gender) and numerical variables (e.g. years of education, number of foreign languages spoken fluently, height, weight).
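The size and variable-type requirements above can be checked quickly in R. The following is a minimal sketch, not part of the provided templates: `check_dataset` is a hypothetical helper, and the built-in `iris` data frame stands in for your own data. It treats factor and character columns as categorical, which is an assumption about how your data is stored. Whether the data is a sample rather than a population cannot be checked by code and has to be argued from how the data was collected.

```r
# Hypothetical helper: checks the size and variable-type requirements.
check_dataset <- function(df) {
  # Assumption: categorical variables are stored as factors or character strings.
  is_categorical <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  is_numerical <- vapply(df, is.numeric, logical(1))
  c(
    enough_observations = nrow(df) >= 30,
    variable_count_ok   = ncol(df) >= 5 && ncol(df) <= 20,
    has_categorical     = any(is_categorical),
    has_numerical       = any(is_numerical)
  )
}

# iris is a built-in dataset used here only as a stand-in for your own data.
check_dataset(iris)
```

Any `FALSE` in the result points to a requirement worth discussing before the proposal is due.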
All analyses must be done in RStudio using the template files provided. Make sure that you are able to load your data into RStudio, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late. Also remember that you must include the code to load your data in the Rmd document, as well as any supplementary code or tools (e.g. the inference function).
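As an illustration of what the data-loading code might look like, here is a minimal sketch that assumes your data is a CSV file. The file contents, the variable names, and the `survey` object are all made up for this example; the sketch writes a tiny file to a temporary location only so that it runs on its own. In your Rmd document you would point `read.csv` at your actual file or URL instead.

```r
# Hypothetical data: write a tiny CSV to a temporary file so this sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "party,years_education,height",
  "Democrat,16,170.2",
  "Republican,12,165.1",
  "Independent,14,180.3"
), csv_path)

# Load the data; stringsAsFactors = TRUE stores text columns as categorical factors.
survey <- read.csv(csv_path, stringsAsFactors = TRUE)

# Confirm that the import worked: dimensions, then the type of each variable.
dim(survey)
str(survey)
```

Other sources (Excel files, SPSS or Stata exports, web tables) need different import functions, which is where loading can become tricky.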
On November 13th you will hand in a project proposal. This consists of completing the provided template (proposal.Rmd) and answering the included questions. The proposal should introduce your general research question (including your hypothesized answer) and your data (where it came from, how it was collected, what the cases are, what the variables are, etc.). You will also include some preliminary exploratory data analysis (univariate descriptions of the variables relevant to your research question are sufficient) in order to show that the data has been imported into RStudio and is correctly formatted. You will be provided with feedback on the quality of your research question and data so that you will be able to address any issues before completing the final project.
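A univariate description of this kind can be quite short. The sketch below is only an illustration and uses the built-in `iris` data frame as a stand-in for your own data; the choice of variables is arbitrary. It summarizes one numerical and one categorical variable, each with a numeric summary and a plot.

```r
# iris is a built-in dataset used here only as a stand-in for your own data.

# Numerical variable: five-number summary plus mean, then a histogram.
sepal_summary <- summary(iris$Sepal.Length)
sepal_summary
hist(iris$Sepal.Length,
     main = "Distribution of sepal length",
     xlab = "Sepal length (cm)")

# Categorical variable: counts per level, then a bar plot of those counts.
species_counts <- table(iris$Species)
species_counts
barplot(species_counts,
        main = "Observations per species",
        ylab = "Count")
```

In the proposal, each summary and plot should be accompanied by a sentence or two describing what it shows.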
As stated above, the goal of the project is to use the skills you have acquired in this class to undertake a novel statistical analysis of a research question of your choosing. You must use RStudio for your analysis and write up all results with knitr, using the provided template. Specifically, your writeup should provide a narrative of your analysis, including all necessary background information. In general, the writeup should have three primary components: an introduction, the analysis, and a conclusion. The introduction should contain a description of the dataset and research question, and should also address the significance of the research question as well as the relevance of the data to answering that question.
The bulk of the assignment will consist of a detailed analysis of the data using the methodologies we have discussed in class. While some questions can be addressed directly by a single univariate or bivariate inference test, this likely indicates that the research question is too specific and should be broadened. Conversely, if you find yourself performing more than a handful of tests, your question is either too broad or the tests are redundant. The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show me that you are proficient at picking the correct tool for the job at hand and that you are able to correctly interpret and present the results of that tool. Focus on methods that help you answer your specific research question.
The writeup should also include a one to two page conclusion and discussion. This should summarize what you have learned about your research question along with statistical arguments supporting your conclusions. It is also a good idea to critique/assess your own methods by discussing any potential limitations and providing suggestions for improvements. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here. A paragraph on what you would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project can also be included.
Some other general guidelines:
There is no reason to directly include formulas or calculations - the audience is not interested in how you calculated a p-value or confidence interval.
Simply reporting summary statistics, confidence intervals, or p-values is never enough; you must describe what these values mean.
Never include a figure that is not directly mentioned and described in the text.
Avoid including redundant information (e.g. tests, figures, text, etc.).
Pay attention to the presentation of your write up - neatness, organization, coherency, and clarity count.
For each assignment you must turn in your write up using Sakai’s Assignments tool. You will be allowed to upload the assignment(s) multiple times without penalty until the deadline.
Your submission must include:
All markdown files (.Rmd)
All knit output files (.html)
You do not need to include your data.
The late work policy applies (-10% per day) until all files are submitted in working format. It is your responsibility to confirm that any files uploaded to Sakai are working properly (i.e. corrupted files are not an excuse for late work).
Grading of the project by the professor and TAs will take into account the following:
Content - What is the quality of the research and/or policy question, and how relevant is the data to that question?
Correctness - Are statistical procedures carried out and explained correctly?
Writing and Presentation - What is the quality of the statistical presentation, writing and explanations?
Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
A general breakdown of grading is as follows:
90%-100% - Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
80%-89% - Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
70%-79% - Passing effort. Student misunderstands concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
60%-69% - Struggling effort. Student is making some effort, but misunderstands many concepts and is unable to put together a cogent argument. Communication of results is unclear.
Below 60% - Student is not making a sufficient effort.
Please note that if you score less than 30% on the project you cannot pass this course, and that late projects are assessed a 10% per day penalty. As such, your project must be turned in within one week of the deadline in order to pass this class.
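The one-week limit follows from simple arithmetic, sketched below. This assumes the penalty is subtracted as 10 percentage points per day from the raw score, which is one reading of the policy (it does not say whether the penalty compounds). `score_after_penalty` is a hypothetical helper written for this illustration only.

```r
# Hypothetical helper. Assumption: 10 percentage points are subtracted per day late,
# and the score cannot drop below zero.
score_after_penalty <- function(raw_score, days_late) {
  pmax(raw_score - 10 * days_late, 0)
}

# Best case: a project that would otherwise earn 100%.
days_late <- 0:8
final_scores <- score_after_penalty(100, days_late)

# A score below 30% means the course cannot be passed.
can_pass <- final_scores >= 30

data.frame(days_late, final_scores, can_pass)
```

Under this assumption even a perfect project drops below the 30% threshold once it is more than seven days late, and a project with a lower raw score has correspondingly less room.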