The goal of this lab is to use Analysis of Variance (ANOVA) to compare means in multiple groups. Additionally, you will be introduced to new R functions used for wrangling and summarizing data.
Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-03-anova-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
We will use the following packages in today’s lab.
library(tidyverse)
library(knitr)
library(broom)
Currently your project is called Untitled Project. Update the name of your project to be “Lab 03 - ANOVA”.
Before we introduce the data, let’s warm up with a simple exercise.
Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to GitHub.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
In today’s lab, we will analyze the diamonds dataset from the ggplot2 package. Type ?diamonds in the console to see a dictionary of the variables in the dataset. This analysis will focus on the relationship between a diamond’s carat weight and its color. Before starting the exercises, take a moment to read more about the diamond attributes on the Gemological Institute of America webpage: https://www.gia.edu/diamond-quality-factor.
The diamonds dataset contains the price and other characteristics for over 50,000 diamonds priced from $326 to $18,823. In this lab, we will analyze the subset of diamonds that are priced $1200 or less.
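Subsetting rows by a condition, such as the price cutoff described above, is commonly done with the filter function from dplyr. The sketch below illustrates the idea on the built-in mtcars data rather than on diamonds; the cutoff of 20 and the name low_mpg are made up for illustration.

```r
library(dplyr)

# Keep only the rows of mtcars where mpg is at most 20
# (illustrative cutoff and dataset; not part of the lab data)
low_mpg <- mtcars %>%
  filter(mpg <= 20)

# Number of observations remaining after filtering
nrow(low_mpg)
```

The same pattern applies to any data frame and any logical condition on its columns.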
Create a new data frame called diamonds_low that is the subset of diamonds priced $1200 or less. How many observations are in diamonds_low?
When using Analysis of Variance (ANOVA) to compare group means, it is ideal to have approximately the same number of observations in each group. Therefore, we will combine the worst two color groups, I and J, and create a new color category called “I/J”. Since color is an ordinal (<ord>) variable, we need to use the recode_factor function in the dplyr package to create the new category.
Use the count function before and after making the new color category to ensure the recoding worked as expected.
# number of observations at each color level
diamonds_low %>%
  count(color)
# create a new vector of the recoded values
color_recoded <- recode_factor(diamonds_low$color,
                               `I` = "I/J", `J` = "I/J",
                               .default = levels(diamonds_low$color))
# replace the color variable with the recoded data
diamonds_low <- diamonds_low %>%
  mutate(color = color_recoded)
Refer to the ggplot2 Cheat Sheet and ggplot2 reference for plot ideas and help with code.
We begin by plotting the relationship between color and carat. As a group, brainstorm ways to plot the relationship between the two variables, then make one of the plots. Be sure to include informative axis labels and an informative title.
Fill in the code below to calculate the mean and variance of carat at each level of color.
The group_by function is used to perform calculations within groups. The summarise function is used to reduce each variable to a single summary value per group.
diamonds_low %>%
  group_by(_______) %>%
  summarise(n = n(),
            avg_carat = mean(carat),
            var_carat = _______)
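For reference, here is the same group_by and summarise pattern written out in full on a different dataset. This is only a sketch: it uses the built-in PlantGrowth data (plant weight under three treatment groups) instead of the lab data, and the column names avg_weight and sd_weight are chosen for illustration.

```r
library(dplyr)

# PlantGrowth is a built-in dataset with columns weight and group
plant_summary <- PlantGrowth %>%
  group_by(group) %>%                    # one set of calculations per group
  summarise(n = n(),                     # number of observations in the group
            avg_weight = mean(weight),   # group mean
            sd_weight = sd(weight))      # group standard deviation

plant_summary
```

The result has one row per level of the grouping variable and one column per summary statistic.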
Based on the plots and summary statistics, does there appear to be a relationship between the carat weight and the color of diamonds? In other words, does there appear to be a significant difference in the mean carat weight across colors?
Are the assumptions for ANOVA met? Comment on each assumption using the summary statistics and/or plots from previous exercises to support your conclusion. You may also calculate any additional summary statistics or make additional plots as needed.
Regardless of your answer to Exercise 4, we will proceed with the analysis in the remainder of this lab as if the assumptions are met.
The tidy function from the broom package is used to put the ANOVA output in a data frame, and with the kable function from the knitr package, you can display the results in an easy-to-read table.
anova <- aov(carat ~ color, data = diamonds_low)
anova %>%
  tidy() %>%
  kable()
Use the ANOVA table to calculate the total mean square, i.e. the sample variance of carat. Show your calculations. You can put the calculations in a code chunk to use R like a calculator.
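The total mean square is the total sum of squares divided by the total degrees of freedom, where both totals come from adding up the rows of the ANOVA table. As a sketch of that arithmetic on a different dataset (the built-in PlantGrowth data, used here only for illustration), the following checks that the result matches the sample variance of the response.

```r
# Fit a one-way ANOVA on the built-in PlantGrowth data (illustration only)
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]   # ANOVA table as a data frame

# Total sum of squares and total degrees of freedom: add up the rows
ss_total <- sum(tab[["Sum Sq"]])
df_total <- sum(tab[["Df"]])

# Total mean square
ms_total <- ss_total / df_total

# Should agree with the sample variance of the response
all.equal(ms_total, var(PlantGrowth$weight))
```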
What is \(\hat{\sigma}^2\), the estimated variance of carat within each level of color?
We can use ANOVA to test if the true mean value of carat is equal for all levels of color, i.e.
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_6\]
State the alternative hypothesis in the context of the data.
Based on the ANOVA table, what is your conclusion from the test of the hypotheses in the previous question? State the conclusion in the context of the data.
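When reading an ANOVA table, the F statistic is the ratio of the between-group mean square to the within-group (residual) mean square, and the p-value is the upper-tail probability of that statistic under the F distribution. A minimal sketch, again on the built-in PlantGrowth data rather than the lab data, with an assumed significance level of 0.05:

```r
# One-way ANOVA on the built-in PlantGrowth data (illustration only)
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]

# Row 1 is the grouping variable, row 2 is the residuals
f_stat <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]

# Upper-tail probability under F with (K - 1, n - K) degrees of freedom
p_value <- pf(f_stat, df1 = tab[["Df"]][1], df2 = tab[["Df"]][2],
              lower.tail = FALSE)

alpha <- 0.05     # assumed significance level
p_value < alpha   # TRUE means reject H0 in favor of "not all means are equal"
```

Both values computed by hand here should match the F value and p-value columns reported in the table itself.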
Use the code below to plot a 95% confidence interval for the mean carat weight at each level of color. Calculate the value of sigma by filling in the estimated variance from Exercise 7.
The formula for the confidence interval for the mean of group \(k\) is
\[\bar{y}_k \pm t^* \frac{\hat{\sigma}}{\sqrt{n_k}}\]
The critical value \(t^*\) is calculated using the t distribution with \(n-K\) degrees of freedom, where \(n\) is the total number of observations and \(K\) is the number of groups.
The standard error of the mean is calculated using \(\hat{\sigma}\), the square root of the variance within each group calculated from the ANOVA table.
n.groups <- diamonds_low %>% distinct(color) %>% count()
crit.val <- qt(0.975, (nrow(diamonds_low) - n.groups$n))
sigma <- sqrt(_________)
conf.intervals <- diamonds_low %>%
  group_by(color) %>%
  summarise(mean_carat = mean(carat), n = n(),
            lower = mean_carat - crit.val * sigma / sqrt(n),
            upper = mean_carat + crit.val * sigma / sqrt(n))
ggplot(data = conf.intervals, aes(x = color, y = mean_carat)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
  labs(title = "95% confidence interval for the mean value of carat",
       subtitle = "by Color") +
  coord_flip()
For what color level is the mean carat weight the most different from all the others?
Based on this analysis, describe the relationship between the color and the mean carat weight in diamonds that cost $1200 or less. Refer to the diamond documentation to recall what the color scale means.
You’re done! Commit all remaining changes, use the commit message “Done with Lab 3!”, and push. Before you wrap up the assignment, make sure the .Rmd, .html, and .md documents are all updated on your GitHub repo.