Lab 03: Comparing group means using ANOVA

due Wed, Feb 6 at 11:59p

The goal of this lab is to use Analysis of Variance (ANOVA) to compare means in multiple groups. Additionally, you will be introduced to new R functions used for wrangling and summarizing data.

Getting Started

library(usethis)
use_git_config(user.name="your name", user.email="your email")

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(broom)

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 03 - ANOVA”.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Data

In today’s lab, we will analyze the diamonds dataset from the ggplot2 package. Type ?diamonds in the console to see a dictionary of the variables in the data set. This analysis will focus on the relationship between a diamond’s carat weight and its color. Before starting the exercises, take a moment to read more about the diamond attributes on the Gemological Institute of America webpage: https://www.gia.edu/diamond-quality-factor.

Exercises

The diamonds dataset contains the price and other characteristics for over 50,000 diamonds priced from $326 to $18,823. In this lab, we will analyze the subset of diamonds that are priced $1200 or less.

  1. Create a dataframe called diamonds_low that is the subset of diamonds priced $1200 or less. How many observations are in diamonds_low?
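As a rough illustration of subsetting rows by a condition, the sketch below uses the filter function from dplyr on the built-in mtcars data rather than on diamonds. The cutoff of 20 miles per gallon and the name mtcars_low are arbitrary choices made only for this example.

```r
library(dplyr)

# keep only the cars with fuel efficiency of 20 mpg or less
# (mtcars and the cutoff are stand-ins, not part of the lab data)
mtcars_low <- mtcars %>%
  filter(mpg <= 20)

# number of observations in the subset
nrow(mtcars_low)
```

The same pattern should carry over to any dataframe and any logical condition on its columns.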

When using Analysis of Variance (ANOVA) to compare group means, it is ideal to have approximately the same number of observations in each group. Therefore, we will combine the worst two color groups, I and J, and create a new color category called “I/J”. Since color is an ordinal (<ord>) variable, we need to use the recode_factor function in the dplyr package to create the new category.

Use the count function before and after making the new color category to ensure the recoding worked as expected.

# number of observations at each color level 
diamonds_low %>% 
  count(color)
# create a new vector of the recoded values
color_recoded <-  recode_factor(diamonds_low$color,
                                `I` = "I/J", `J` = "I/J",
                                .default = levels(diamonds_low$color))

# replace the color variable with the recoded data
diamonds_low <- diamonds_low %>% 
  mutate(color = color_recoded)
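To see what the recoding does on a small scale, here is a minimal sketch on a made-up ordered factor. The tibble toy and its five values are hypothetical and not part of the diamonds data. Counting before and after should show the I and J observations merging into a single I/J level.

```r
library(dplyr)

# hypothetical toy data: five diamonds with ordered color grades
toy <- tibble(
  color = factor(c("G", "H", "I", "J", "I"),
                 levels = c("G", "H", "I", "J"), ordered = TRUE)
)

# counts before recoding
toy %>% count(color)

# merge I and J; all other levels keep their original labels
toy_recoded <- toy %>%
  mutate(color = recode_factor(color,
                               `I` = "I/J", `J` = "I/J",
                               .default = levels(color)))

# counts after recoding: I/J should hold the former I and J rows
toy_recoded %>% count(color)
```

Comparing the two count tables is a quick way to confirm that no observations were lost or turned into missing values.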

Refer to the ggplot2 Cheat Sheet and ggplot2 reference for plot ideas and help with code.

  2. We begin by plotting the relationship between color and carat. As a group, brainstorm ways to plot the relationship between the two variables, then make one of the plots. Be sure to include informative axis labels and an informative title.

  3. Fill in the code below to calculate the mean and variance of carat at each level of color.

The group_by function is used to perform calculations separately within each group. The summarise function is used to collapse each group to a single row of summary values.

diamonds_low %>% 
  group_by(_______) %>%
  summarise(n = n(), 
            avg_carat = mean(carat),
            var_carat = _______)

Based on the plots and summary statistics, does there appear to be a relationship between the carat weight and the color of diamonds? In other words, does there appear to be a significant difference in the mean carat weight across colors?
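The group_by and summarise pattern can be sketched on a different dataset. The example below uses the built-in mtcars data, with cyl as the grouping variable and the standard deviation as the second summary. These choices are assumptions for illustration only, not the answer to the exercise.

```r
library(dplyr)

# group the built-in mtcars data by number of cylinders, then
# reduce each group to one row of summary statistics
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(),
            avg_mpg = mean(mpg),
            sd_mpg = sd(mpg))

cyl_summary
```

The result has one row per level of the grouping variable, which makes it easy to compare group sizes and spreads side by side.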

  4. When using ANOVA to compare means across groups, we make the following assumptions (note how similar they are to the assumptions for regression):

Are the assumptions for ANOVA met? Comment on each assumption using the summary statistics and/or plots from previous exercises to support your conclusion. You may also calculate any additional summary statistics or make additional plots as needed.

Regardless of your answer to Exercise 4, we will proceed with the analysis in the remainder of this lab as if the assumptions are met.

  5. Use the code below to calculate the ANOVA table. The tidy function from the broom package is used to put the ANOVA output in a dataframe, and with the kable function from the knitr package, you can display the results in an easy-to-read table.

anova <- aov(carat ~ color, data = diamonds_low)
anova %>% 
  tidy() %>%
  kable()

  6. Use the ANOVA table to calculate the total mean square, i.e. the sample variance of carat. Show your calculations. You can put the calculations in a code chunk to use R like a calculator.

  7. What is \(\hat{\sigma}^2\), the estimated variance of carat within each level of color?
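The relationship between the rows of an ANOVA table and these two variances can be checked on a stand-in dataset. The sketch below uses the built-in PlantGrowth data (an assumption for illustration, not the lab data) and base R's summary of an aov fit instead of tidy. It adds the sums of squares and degrees of freedom to get the total mean square, and reads the within-group variance from the residual row.

```r
# PlantGrowth is a built-in dataset used here only as a stand-in
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]   # rows: group (between), Residuals (within)

ss <- tab[["Sum Sq"]]
df <- tab[["Df"]]

# total mean square = total sum of squares / total degrees of freedom
ms_total <- sum(ss) / sum(df)

# within-group (residual) mean square
ms_within <- ss[2] / df[2]

# the total mean square should match the sample variance of the response
all.equal(ms_total, var(PlantGrowth$weight))
```

The final comparison is a useful sanity check: if the hand calculation from the table disagrees with var, a row of the table was probably misread.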

  8. We can use ANOVA to test if the true mean value of carat is equal for all levels of color, i.e.

\[ H_0: \mu_1 = \mu_2 = \dots = \mu_6\]

State the alternative hypothesis in the context of the data.

  9. Based on the ANOVA table, what is your conclusion from the test of the hypotheses in the previous question? State the conclusion in the context of the data.

  10. Use the code below to plot a 95% confidence interval for the mean carat weight at each level of color. Calculate the value of sigma by filling in the estimated variance from Exercise 7.

The formula for the confidence interval for the mean of group \(k\) is

\[\bar{y}_k \pm t^* \frac{\hat{\sigma}}{\sqrt{n_k}}\]

The critical value \(t^*\) is calculated using the t distribution with \(n-K\) degrees of freedom, where \(n\) is the total number of observations and \(K\) is the number of groups.

The standard error of the mean is calculated using \(\hat{\sigma}\), the square root of the variance within each group calculated from the ANOVA table, and \(n_k\), the number of observations in group \(k\).

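# number of color groups (K)
n.groups <- diamonds_low %>% distinct(color) %>% count()
# critical value from the t distribution with n - K degrees of freedom
crit.val <- qt(0.975, (nrow(diamonds_low) - n.groups$n))
# fill in the estimated within-group variance from Exercise 7
sigma <- sqrt(_________)

# mean and 95% confidence bounds of carat for each color
conf.intervals <- diamonds_low %>%
  group_by(color) %>% 
  summarise(mean_carat = mean(carat), n = n(), 
            lower = mean_carat - crit.val * sigma/sqrt(n),
            upper = mean_carat + crit.val * sigma/sqrt(n))

# plot each group mean with its confidence interval
ggplot(data = conf.intervals, aes(x = color, y = mean_carat)) +
  geom_point() + 
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) + 
  labs(title = "95% confidence interval for the mean value of carat",
       subtitle = "by Color") +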
  coord_flip()

  11. For what color level is the mean carat weight the most different from all the others?

  12. Based on this analysis, describe the relationship between the color and the mean carat weight in diamonds that cost $1200 or less. Refer to the diamond documentation to recall what the color scale means.

You’re done! Commit all remaining changes, use the commit message “Done with Lab 3!”, and push. Before you wrap up the assignment, make sure the .Rmd, .html, and .md documents are all updated on your GitHub repo.