Project


Background

The project in this class represents an opportunity for you to tackle an open-ended statistical analysis that addresses a specific research question. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class and apply them to a complex dataset in a meaningful and appropriate way. All analyses must be done in RStudio and written up using R Markdown.

You should write as if you are explaining your results to someone who would be interested in your research question, whether this is another scholar in your field or a peer sharing your interest in the topic. Keep in mind that this audience may or may not have taken statistics: you must be statistically accurate and use correct statistical terminology, but you must also explain your conclusions in a way that a lay person can understand.



Template Files

To download the template files for the project run the following code inside RStudio:

download.file("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/project.Rmd",  destfile="project.Rmd")



Data set

For your analysis you will be using data from the 2013 Behavioral Risk Factor Surveillance System (BRFSS), which has been organized and imported into R by the Duke Libraries staff (from Data and Visualization Services). This data set was chosen because it is large, high quality, and clean, and because it covers a wide variety of health and socioeconomic factors.

Your first task will be to examine the available variables via the codebook and devise a research question that you believe can be addressed using these data and the techniques we have covered in class. You are only expected to look at a small subset of the data; successful projects will likely use only 2-6 variables. If you would like to view the original CDC codebook it can be found here.
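Once the data are loaded (see Getting the data), narrowing the data frame to the handful of variables you need might look like the sketch below. It uses a small made-up data frame so that it runs on its own; the object name `survey` and the column names are placeholders, not actual BRFSS variable names, which you should take from the codebook.

```r
# Toy stand-in for the survey data; object and column names are hypothetical.
survey <- data.frame(
  health      = c("Good", "Fair", NA, "Excellent"),
  sleep_hours = c(7, 5, 8, NA),
  state       = c("NC", "VA", "NC", "SC"),
  stringsAsFactors = FALSE
)

# Keep only the variables relevant to the research question.
vars <- c("health", "sleep_hours")
analysis <- survey[, vars]

# Count missing values per variable before deciding how to handle them.
missing_counts <- colSums(is.na(analysis))

str(analysis)
missing_counts
```

Working with a small subset like this keeps summaries and plots focused on the research question and makes missing values easier to spot early.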

Getting the data

Unlike in the labs, you should first download the data,

download.file("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/brfss2013.RData",  destfile="brfss2013.RData")

once it has downloaded, it can be loaded into R using

load("brfss2013.RData")

You should avoid repeatedly downloading the data as it is significantly larger than anything we’ve used in lab and as such will cause unnecessary delays when knitting your document. If you want to use the inference function then be sure that it is also loaded by your document by including

load(url("http://www.stat.duke.edu/~cr173/Sta102_Sp16/Lab/inference.RData"))
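One way to follow the advice about not re-downloading the data is to guard the download with a check for a local copy. The sketch below is only a suggestion, not part of the provided template; `fetch_once` is a hypothetical helper name, and it assumes the data file is kept in the same directory as your document.

```r
# Hypothetical helper: download `url` to `destfile` only when no local copy exists.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" requests a binary transfer, which .RData files require.
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage inside the project document (URL as given above):
# fetch_once("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/brfss2013.RData",
#            "brfss2013.RData")
# load("brfss2013.RData")
```

With a guard like this, knitting the document repeatedly only reads the local file after the first download.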

BRFSS Background

From BRFSS Overview

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews. Over time, the number of states participating in the survey increased; by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point-in-time (usually one to three months). In this document, the term “state” is used to refer to all areas participating in BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico.

The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger.



Project - due Wednesday, May 4th by 5 pm

As stated above, the goal of the project is to use the skills you have acquired in this class to undertake a novel statistical analysis of a research question of your choosing. You must use RStudio for your analysis and write up all results using knitr and the provided template. Specifically, your writeup should provide a narrative of your analysis including all necessary background information. In general, the writeup should have three primary components: an introduction, the analysis, and a conclusion. The introduction should contain a description of the research question, a summary / description of the relevant data in the BRFSS, and should also address the significance of the research question as well as the relevance of the data to answering that question.

The bulk of the assignment will consist of a detailed analysis of the data using the methodologies we have discussed in class. While some questions can be addressed directly by a single univariate or bivariate inference test, this likely indicates the research question is too specific and should be broadened. Conversely, if you find yourself performing more than a handful of tests, your question is either too broad or the tests are redundant. The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and run every procedure you have learned for every variable), but rather to show me that you are proficient at picking the correct tool for the job at hand and that you are able to correctly interpret and present the results of that tool. Focus on methods that help you to answer your specific research question.

The writeup should also include a one to two page conclusion and discussion. This should summarize what you have learned about your research question along with statistical arguments supporting your conclusions. It is also a good idea to critique/assess your own methods by discussing any potential limitations and providing suggestions for improvements. Issues pertaining to the reliability and validity of the data, and appropriateness of the statistical analysis should be discussed here. A paragraph on what you would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project can also be included.

Some other general guidelines:



Submission

You will turn in your writeup using Sakai’s Assignments tool. You will be allowed to upload the assignment(s) multiple times without penalty until the deadline.

Your submission must include:

The late work policy applies (-10% per day) until all files are submitted in working format. It is your responsibility to confirm that any files uploaded to Sakai are working properly (i.e. corrupted files are not an excuse for late work).



Grading

Grading of the project by the professor and TAs will take into account the following:


A general breakdown of grading is as follows:


Please note that if you score less than 30% on the project you cannot pass this course, and that late projects are assessed a 10% per day penalty. As such, your project must be turned in within one week of the deadline in order to pass this class.