The project in this class represents an opportunity for you to tackle an open-ended statistical analysis that addresses a specific research question. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class and to apply them to a complex dataset in a meaningful and appropriate way. All analyses must be done in RStudio and written up using R Markdown.
You should write as if you are explaining your results to someone who would be interested in your research question, whether this is another scholar in your field or a peer sharing your interest in the topic. Keep in mind that this audience may or may not have taken statistics: you must be statistically accurate and use correct statistical terminology, but you must also explain your conclusions in a way that a layperson can understand.
To download the template files for the project, run the following code inside RStudio:
download.file("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/project.Rmd", destfile="project.Rmd")
For your analysis you will be using data from the 2013 Behavioral Risk Factor Surveillance System (BRFSS) that has been organized and imported into R by the Duke Libraries staff (from Data and Visualization Services). This data set was chosen because it is large, high quality, and clean, and because it covers a wide variety of health and socioeconomic factors.
Your first task will be to examine the available variables via the codebook and devise a research question that you believe can be addressed using these data and the techniques we have covered in class. You are only expected to look at a small subset of the data; successful projects will likely use only 2-6 variables. If you would like to view the original CDC codebook, it can be found here.
Unlike in the labs, you should first download the data,
download.file("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/brfss2013.RData", destfile="brfss2013.RData")
once it has downloaded, it can be loaded into R using
load("brfss2013.RData")
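After loading, it helps to check which objects the file actually contained and to narrow the data to the few variables your question needs. The sketch below is self-contained and uses a made-up stand-in file: the object name `toy_survey` and its columns are hypothetical and are not the real BRFSS variable names, which you should look up in the codebook.

```r
# Stand-in for brfss2013.RData: save a tiny, made-up data frame to a
# temporary file so the example runs without the real data.
toy_survey <- data.frame(
  state       = c("NC", "VA", "SC"),
  sleep_hours = c(7, 6, 8),
  smoker      = c("No", "Yes", "No")
)
toy_file <- tempfile(fileext = ".RData")
save(toy_survey, file = toy_file)
rm(toy_survey)

# load() invisibly returns the names of the objects it restored, so
# capturing its value shows what the file contained.
loaded_names <- load(toy_file)
print(loaded_names)

# Inspect the restored data frame by name.
restored <- get(loaded_names[1])
dim(restored)    # number of rows and columns
names(restored)  # variable names to look up in the codebook

# Keep only the handful of variables relevant to the research question.
analysis_data <- restored[, c("sleep_hours", "smoker")]
str(analysis_data)
```

With the real file, the same pattern applies: capture the value of `load("brfss2013.RData")`, then subset the resulting data frame to your chosen variables.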
You should avoid repeatedly downloading the data, as it is significantly larger than anything we’ve used in lab and will cause unnecessary delays when knitting your document. If you want to use the inference function, be sure that your document also loads it by including
load(url("http://www.stat.duke.edu/~cr173/Sta102_Sp16/Lab/inference.RData"))
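One way to avoid re-downloading the data every time the document is knit is to guard the download with a check for the local file. This is only a minimal sketch: the helper name `download_once` is hypothetical and is not part of the course materials.

```r
# Hypothetical helper (not part of the course materials): download a file
# only when it is not already present on disk.
download_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # already on disk, nothing downloaded
  }
  # mode = "wb" writes the file in binary mode, which matters on Windows.
  download.file(url, destfile = destfile, mode = "wb")
  invisible(TRUE)             # a fresh download was performed
}

# Possible usage inside project.Rmd, with the data URL given earlier:
# download_once("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/brfss2013.RData",
#               "brfss2013.RData")
# load("brfss2013.RData")
```

Because the check looks only at whether the file exists, a partially downloaded or corrupted file would need to be deleted by hand before trying again.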
From BRFSS Overview
The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews. Over time, the number of states participating in the survey increased; by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point-in-time (usually one to three months). In this document, the term “state” is used to refer to all areas participating in BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico.
The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.
Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger.
As stated above, the goal of the project is to use the skills you have acquired in this class to undertake a novel statistical analysis of a research question of your choosing. You must use RStudio for your analysis and write up all results using knitr and the provided template. Specifically, your writeup should provide a narrative of your analysis including all necessary background information. In general, the writeup should have three primary components: an introduction, the analysis, and a conclusion. The introduction should contain a description of the research question, a summary / description of the relevant data in the BRFSS, and should also address the significance of the research question as well as the relevance of the data to answering that question.
The bulk of the assignment will consist of a detailed analysis of the data using the methodologies we have discussed in class. While some questions can be addressed directly by a single univariate or bivariate inference test, this likely indicates that the research question is too specific and should be broadened. Conversely, if you find yourself performing more than a handful of tests, your question is either too broad or the tests are redundant. The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather to show me that you are proficient at picking the correct tool for the job at hand and that you are able to correctly interpret and present the results of that tool. Focus on methods that help you answer your specific research question.
The writeup should also include a one to two page conclusion and discussion. This should summarize what you have learned about your research question along with statistical arguments supporting your conclusions. It is also a good idea to critique/assess your own methods by discussing any potential limitations and providing suggestions for improvements. Issues pertaining to the reliability and validity of the data and to the appropriateness of the statistical analysis should be discussed here. A paragraph on what you would do differently if you were able to start over with the project, or what you would do next if you were going to continue work on the project, can also be included.
Some other general guidelines:
There is no reason to directly include formulas or calculations - the audience is not interested in how you calculated a p-value or confidence interval.
Simply reporting summary statistics, CIs, or p-values is never enough; you must describe what these values mean.
Never include a figure that is not directly mentioned and described in the text.
Avoid including redundant information (e.g. tests, figures, or text).
Pay attention to the presentation of your write up - neatness, organization, coherency, and clarity count.
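As an illustration of describing what a value means rather than only reporting it, the sketch below computes a confidence interval and turns it into a sentence in context. The data are simulated and the variable `sleep_hours` is hypothetical; these are not BRFSS values, and the wording of the interpretation is one possibility among many.

```r
set.seed(102)  # simulated data only; these are not BRFSS values
sleep_hours <- rnorm(200, mean = 7, sd = 1.5)  # hypothetical nightly sleep

# One-sample t interval for the population mean.
result <- t.test(sleep_hours, conf.level = 0.95)
ci <- result$conf.int

# Report the interval together with what it means, in context.
cat(sprintf(
  "We are 95%% confident that the average nightly sleep in this population is between %.2f and %.2f hours.\n",
  ci[1], ci[2]
))
```

In a writeup, the sentence (with its context and units) is what the reader needs; the call that produced the interval is secondary.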
You will turn in your writeup using Sakai’s Assignments tool. You will be allowed to upload the assignment(s) multiple times without penalty until the deadline.
Your submission must include:
The project markdown file (project.Rmd)
The knit output file (project.html)
The late work policy applies (-10% per day) until all files are submitted in working format. It is your responsibility to confirm that any files uploaded to Sakai are working properly (i.e. corrupted files are not an excuse for late work).
Grading of the project by the professor and TAs will take into account the following:
Content - What is the quality of the research and/or policy question, and how relevant are the data to that question?
Correctness - Are statistical procedures carried out and explained correctly?
Writing and Presentation - What is the quality of the statistical presentation, writing and explanations?
Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
A general breakdown of grading is as follows:
90%-100% - Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
80%-89% - Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
70%-79% - Passing effort. Student misunderstands concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
60%-69% - Struggling effort. Student is making some effort, but misunderstands many concepts and is unable to put together a cogent argument. Communication of results is unclear.
Below 60% - Student is not making a sufficient effort.
Please note that if you score less than 30% on the project you cannot pass this course, and that late projects are assessed a 10% per day penalty. As such, your project must be turned in within one week of the deadline in order to pass this class.
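The one-week limit follows from simple arithmetic, sketched below. This assumes the 10% per day penalty is subtracted in percentage points from the score the project would otherwise earn; the helper name `max_score_after` is hypothetical.

```r
# Assumption: the penalty is 10 percentage points per day late, subtracted
# from the score the project would otherwise have earned (floored at 0).
max_score_after <- function(days_late) pmax(0, 100 - 10 * days_late)

days <- 0:8
print(data.frame(days_late = days, max_score = max_score_after(days)))
```

Under that assumption, a project turned in 7 days late can earn at most 30%, and one turned in 8 days late can earn at most 20%, which is below the 30% needed to pass.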