Statistics 104: Summer 2013 - Session 1
Data Analysis and Statistical Inference
Professor Mine Çetinkaya-Rundel
FAQ:
I will update this FAQ page as the class progresses, adding common questions I receive from students. I recommend that you check this page when you have a question to see if it has already been answered. If not, send me an email, and if the answer to your question might benefit the whole class I'll post it here as well. Note that references are to OpenIntro Statistics, Second Edition. Click on a question to reveal its answer.
If you have a question about something in the textbook that looks like it might be a typo or doesn't make sense, first check to see if it's been corrected already on the textbook errata. If not, send me an email about it so that I can confirm whether it's a typo or not, and if it is, it'll be added to the list.
To identify the explanatory variable in a pair of variables, identify which of the two is suspected of affecting the other. We're saying "suspected" here because in order to make a causal statement we need to have a randomized experiment. But these terms are defined for all types of studies. For example, let's say we're interested in finding out if studying for an exam with the TV on affects exam performance. Below are two studies investigating this issue:
At the beginning of an exam ask students if they have studied for the exam while also watching TV. Once the exams are graded, compare the average exam scores of those who studied with the TV on and those who studied without.
Randomly assign a group of students to study for an upcoming exam while watching TV and others to study without the TV on. Once the exams are graded, compare the average exam scores of those who studied with the TV on and those who studied without.
In both scenarios the explanatory variable is studying with the TV on or off, and the response variable is exam score. However, in the first scenario we have an observational study. Therefore, even if we find a significant difference between the average scores of the two groups, we can't make a causal connection between watching TV and exam performance. The second study is a randomized experiment; therefore, a causal connection can be made between the explanatory and response variables. Review section 1.3.4 for more information.
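In either design, the end-of-study comparison is the same computation. Here is a minimal R sketch using made-up scores (the numbers, variable names, and the t.test call are illustrative assumptions, not data from a real study):

```r
# Made-up exam scores for the two groups (illustrative only)
tv_on  = c(72, 65, 80, 70, 68, 75)   # studied with the TV on
tv_off = c(78, 85, 74, 90, 82, 80)   # studied with the TV off

mean(tv_off) - mean(tv_on)   # observed difference in average scores
t.test(tv_off, tv_on)        # is the difference more than chance variation?
```

Even if the difference turns out to be statistically significant, remember that only the randomized design supports a causal interpretation of it.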
A confounding variable (also called a lurking variable) is a variable that is correlated with both the explanatory and response variables. Confounding variables are basically the reason why we cannot make causal connections between the explanatory and response variables in observational studies. For example, a study found that the 20-year survival rate (how likely a subject is to be alive in 20 years) for smokers was higher than that of non-smokers, which seems to indicate that smokers are less likely to die. This seems to contradict what we know about the health effects of smoking. A closer look at the data shows that the majority of the smokers were young people, and the majority of the non-smokers were old people. In this case age seems to be a confounding variable. What is causing the survival rate to appear higher for the smokers is that they're younger (hence more likely to live longer), not that smoking is good for you. Another example, on sunscreen use and skin cancer, is provided in the textbook. Review section 1.4.1 for more information.
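Here is a toy version of that smoking example in R. Every count below is invented so that the arithmetic comes out cleanly, so treat it as an illustration of confounding, not as real data:

```r
# Invented data: 100 smokers (mostly young), 100 non-smokers (mostly old).
# Survival depends only on age: 95% of the young and 60% of the old survive.
d = data.frame(
  smoker   = rep(c("yes", "no"), c(100, 100)),
  age      = c(rep(c("young", "old"), c(80, 20)),    # smokers
               rep(c("young", "old"), c(20, 80))),   # non-smokers
  survived = c(rep(c(1, 0), c(76, 4)), rep(c(1, 0), c(12, 8)),
               rep(c(1, 0), c(19, 1)), rep(c(1, 0), c(48, 32)))
)

tapply(d$survived, d$smoker, mean)               # smokers *look* better overall
tapply(d$survived, list(d$age, d$smoker), mean)  # ...but rates match within age
```

Within each age group the survival rates are identical, so the overall gap comes entirely from the age mix of the two groups.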
In both cases we first divide our target population into smaller groups. The main difference between stratified and cluster sampling is that in stratified sampling the groups (strata) are homogeneous with respect to a variable we think might have an effect on the response variable. For example, say we're studying the effect of watching TV on academic performance at a particular college. We might want to make sure that we include equal numbers of first-years, sophomores, juniors, and seniors in our study. So we divide up our target population into four strata based on year, and randomly sample students from within each stratum. This will probably require getting a list of all students from the registrar, randomly picking a given number of students from each year, and finding these students and collecting information on how much they watch TV and their academic performance, like GPA. Sounds like a tedious process... Alternatively, we might randomly pick a few courses from the list of all courses offered at this college, go to the classrooms for these courses, and sample everyone in each of the classes we picked. In this case the classes are the groups (clusters), and they're probably not homogeneous groups. We simply picked them to make data collection a little more manageable. Review section 1.4.2 for more information.
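The two schemes can be sketched in R. Everything here (the 400-student roster, the 20 courses, the sample sizes) is made up for illustration:

```r
set.seed(1)
# Hypothetical roster: 400 students, 100 per class year, each enrolled
# in one of 20 courses
roster = data.frame(
  id     = 1:400,
  year   = rep(c("first-year", "sophomore", "junior", "senior"), each = 100),
  course = sample(1:20, 400, replace = TRUE)
)

# Stratified sampling: draw 10 students at random from *each* year
strat = do.call(rbind, lapply(split(roster, roster$year),
                              function(s) s[sample(nrow(s), 10), ]))
table(strat$year)    # exactly 10 per stratum, by construction

# Cluster sampling: pick 3 courses at random and take *everyone* in them
picked  = sample(1:20, 3)
cluster = roster[roster$course %in% picked, ]
table(cluster$year)  # class years need not be balanced within clusters
```

Note that the stratified sample guarantees balance across years, while the cluster sample just takes whoever happens to be in the chosen classes.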
Au contraire, you want to use the IQR in this case to describe variability in distributions with extreme observations. The IQR is robust to these observations, and this is preferred because we want statistics that describe the bulk of the data. However, as usual when describing distributions, you should always mention shape, center, spread, and unusual observations. So these extreme cases, which may be the most interesting ones, still get their place in the spotlight, we just don't want to factor them into the calculation of statistics that describe the distribution as a whole. Review section 1.6.6 for more information.
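You can see that robustness directly in R, with a small made-up dataset:

```r
# A small made-up dataset, then the same data plus one extreme observation
x     = c(2, 4, 5, 5, 6, 7, 8)
x_out = c(x, 100)

IQR(x); IQR(x_out)   # barely moves (2 vs 2.5)
sd(x);  sd(x_out)    # inflates (roughly 2 vs 33)
```

One extreme value barely budges the IQR but inflates the standard deviation many times over, which is exactly why the IQR is preferred for skewed distributions or those with extreme observations.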
There are two common errors; first check whether your issue is related to one of these:
Error: > sign in code chunks in lab report - This sign in the R console means R is done processing the previous command and is ready to accept another. When you copy and paste your code from the console to your lab document, you should not include that sign.
Solution: Get rid of > signs at the beginning of your commands in your lab report and reprocess. If this was the only issue, it should now be resolved, and your document should now look how you expect it to.
Error: Error: code outside of ```{r} and ``` - In your markdown document all R code should be between the ```{r} and ``` lines, such that the code chunk is highlighted in a light gray band. If you insert your code outside of these allotted code chunk areas, the code won't process and the document might give you an error or might just not look how you expect it to.
Solution: Place all code in the allotted code chunk regions between ```{r} and ``` lines. If this was the only issue, it should now be resolved, and your document should now look how you expect it to.
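For reference, a well-formed chunk looks like this (the mean(1:10) line is just a placeholder for your own code):

````markdown
```{r}
# every piece of R code goes inside a chunk like this one
mean(1:10)
```
````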
If your problem isn't related to one of these, send me an email with your lab report document attached, or copy and paste your entire lab report at the bottom of your email and I'll help you figure out the issue.
You should now see a file called mom.csv in the Files window. You can select that file and export it (check the box next to the file, click on More, and then Export...) so that you can have a copy of this file on your computer, and can submit it with your project. For more information on the write.csv() function, pull up its help file with ?write.csv.
Under the Files tab in the bottom right corner of RStudio you should see a button called Upload (with a yellow up arrow). Click on that, and then click on Choose File and find your data file and hit OK. You should then see this file listed in the Files window.
This means that you have successfully uploaded your file to RStudio, but it's not yet in your Workspace. In order to get it in your Workspace, click on Import Dataset (under the Workspace tab on the top right corner of RStudio), then click on From Text File...
and choose your data file from the list. Make sure the radio button for Heading is selected for Yes (assuming that the first row of your dataset is the header row).
In order to use this dataset as a part of your write up, you need to include a piece of code in your .Rmd file to read the data in. Suppose your data file's name is "d_prj1.csv" and you want to call your dataset "d"; then use the following:
d = read.csv("d_prj1.csv")
Locate the .Rmd file you want to export in the Files pane (lower right corner), check the box next it, then click on More -> Export, and then click on Download in the pop-up window.
The simplest approach is to use the plot function on the entire dataset. The second approach is to use a new function from a contributed R package to get a much fancier plot. The two downsides of the second option are (1) it doesn't handle NAs automatically (this may not be an issue with your second project since there aren't many NAs), and (2) it takes a while to generate the plot, so you'll need to be patient. The examples below use the ACS data from the multiple regression lab.
plot function:
plot(acs)
In this output you'll see that the lower diagonal of the plot matrix has repetitive information from the upper diagonal (same plots, with axes reversed). Also, depending on the number of variables you have, the plots may be small. If R complains about the plotting window being too small, just increase the size of your plotting window by dragging the margins in RStudio. You can use this to quickly determine which variables are related, then make single plots for those relationships that you'd like to view more closely. If you want to plot only certain variables, you can first make a subset, and then use the plot function.
Subsetting based on column number: Only plot relationships between variables in columns 1 through 5.
plot(acs[,1:5])
Subsetting based on variable names: First subset the data, selecting the variables that are numerical, and then plot the relationships between them.
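As a sketch of that name-based subsetting (the acs data frame below is a tiny made-up stand-in; in the lab you would already have the real one loaded):

```r
# Tiny stand-in for the ACS data (made-up values)
acs = data.frame(income   = c(30, 55, 72, 41, 90),
                 hrs_work = c(20, 40, 45, 35, 50),
                 gender   = c("f", "m", "f", "m", "f"))

acs_num = acs[, c("income", "hrs_work")]  # select the numerical columns by name
plot(acs_num)                             # pairwise plots of just those columns
```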
ggpairs function (from the GGally package):
install.packages("GGally") # install package (only needed once)
library(GGally)            # load package
acs_noNA = na.omit(acs)    # omit rows with NAs
ggpairs(acs_noNA)
If you only want to plot certain columns of the dataset (say, 1 through 5), use
ggpairs(acs_noNA, columns = c(1:5))
This might be very useful since otherwise the plot gets very busy. Another parameter you might want to change in the ggpairs function is the font size of the correlation coefficients (they're pretty small by default).
You can calculate confidence intervals for slopes manually (finding the appropriate t* for the degrees of freedom and confidence level you need), or you can use the confint function in R. The example below uses the ACS dataset from the multiple regression lab. You can either get confidence intervals for all slopes using:
m = lm(income ~ gender + hrs_work, data = acs)
confint(m)
or for one parameter at a time using:
confint(m, parm = "hrs_work")
Use the help file for the function to figure out how to change the confidence level.
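For the manual route, the recipe is point estimate ± t* × standard error, with t* from qt(). The numbers below are made up; in practice you would read the estimate, its standard error, and the degrees of freedom off the summary(m) output:

```r
b  = 0.8   # hypothetical slope estimate
se = 0.1   # its (hypothetical) standard error
df = 98    # residual degrees of freedom: n minus the number of coefficients

t_star = qt(0.975, df)     # t* for a 95% confidence interval
b + c(-1, 1) * t_star * se # lower and upper bounds of the interval
```

For a different confidence level, change the quantile: e.g. qt(0.95, df) gives the t* for a 90% interval.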
You can do this in one of two ways: (1) create the table elsewhere (like Excel), save the table as an image, and embed the image in your report, or (2) create the table in knitr.
Create the table elsewhere (like Excel), save table as an image, embed image in report: Once you create your table you can either save it as an image file, or take a screenshot. Save this image file in the same directory as your .Rmd file (by uploading it onto RStudio). Let's assume your image file is called "table_screenshot.png". In order to embed this in your report use the following code:
![table of blahs](table_screenshot.png)
where "table of blahs" is just a short description of your table.
Create the table in knitr: Tables in knitr have the following structure
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
Just replace the text with your content, and extend the table as needed.
Here is a quick example: I have this table that I created in Excel and saved as an image file by taking a screenshot on my computer and cropping around the table:
I upload this file into RStudio, in the same directory as the .Rmd file for my project. Then using the command
![table of blahs](table_screenshot.png)
I embed the file in my report. Alternatively, I can create the same table using the following code:
              | col name 1 | col name 2 | col name 3
------------- | ----------- | ----------- | -----------
row name 1 | [some text] | [some text] | [some text]
row name 2 | [some text] | [some text] | [some text]
row name 3 | [some text] | [some text] | [some text]
No, you need to have your clicker to be able to get credit for the day. Note that up to two unexcused late arrivals or absences will not affect your clicker grade.
Yes and no. If there is a readiness assessment that day and you walk in late, you won't be given additional time and you may not be able to perform as well as you would have had you had more time. If there is no readiness assessment and you walk in just a few minutes late, you'll at most miss one or two clicker questions for the day. This shouldn't affect your score since answering at least 75% of the questions gets you a full score for the day.