Getting Started

Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix hw-02. It contains the starter documents you need to complete the lab.
Clone the repo in RStudio Cloud
Configure git using the use_git_config() function.

Tips

Here are some tips as you complete HW 02:

Make sure all your code chunks are informatively named, are not repeated, and there are no spaces in the code chunk names.
Periodically knit your document and commit changes (the more often the better, for example, once per each new task)
Push all your changes back to your GitHub repo
Once you push your changes back you do not need to do anything else to “submit” your work. And you can of course push multiple times throughout the assignment. At the time of the deadline we will take whatever is in your repo and consider it your final submission, and grade the state of your work at that time (which means even if you made mistakes before then, you won’t be penalized for them as long as the final state of your work is correct).

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr)

Questions

Part 1: Computations & Concepts

The Computations & Concepts section of homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences.

Analysis of Variance is used to compare means across multiple groups. Some of the values from the table are provided; however, many are missing.

dfw <- 727 # degrees of freedom within (residuals)
dft <- 730 # total degress of freedom
ssw <- 6.486 # sum of squares within (residuals)
sst <- 19.386 # total sum of squares

Fill in the missing values from the ANOVA table, and use the code to display the completed table. Once you have filled in the values for the table, change the eval= to TRUE so the code will be evaluated and the results will be displayed.

dfb <- # degrees of freedom between (model)
ssb <-  # sum of squares between (model)
msb <-  # mean square between
msw <-  # mean square within (residuals)
f_stat <-  #F -statistic 
p_val <- 1-pf(____, _____, _____)  # p-value

Use the code below to combine all of the values and print an ANOVA table.

source <- c("Between Groups", "Within Groups", "Total")
df <- c(dfb, dfw,dft)
ss <- c(ssb, ssw, sst)
ms <- c(msb, msw,NA)  
f.statistic <- c(f_stat, NA, NA)
p.value <- c(p_val,NA,NA)

# combine the columns to make a table called "anova"
anova <- bind_cols("Source"=source,"df"=df,"Sum of squares"=ss,
            "Mean square"=ms,"F-statistic"=f.statistic,"p-value"=p.value)

# print the table 
kable(anova)

Use the table you created in the previous question. How many groups are there? Is there sufficient evidence against the claim that the group population means are equal? Briefly explain.
A linear regression model was used to describe the relationship between a quantitative predictor variable $x$ and a response variable $y$. The output from the Analysis of Variance is shown below. Use the table to answer Questions 3 - 7. You can assume all assumptions for linear regression (and thus ANOVA) are sufficiently met.

term	df	sumsq	meansq	statistic	p.value
x	1	0.3569372	0.3569372	3.187972	0.0757126
Residuals	198	22.1688128	0.1119637	NA	NA

How many observations are the dataset used to conduct this analysis?

What is the estimate of $\sigma^2$?
Use the ANOVA table to calculate the proportion of the variation in $y$ that is explained by $x$. Show how you calculated this value in your R code, i.e. use R as a “calculator”.
What does the F-statistic mean? In other words, how is the F-statistic calculated?
We can use this table to test the following hypotheses: \[H_0: \beta_1 = 0 \hspace{5mm} \text{ vs }\hspace{5mm} H_a: \beta_1 \neq 0\] State the conclusion of the test in terms of the relationship between $x$ and $y$. Briefly describe how you came to this conclusion.

Part 2: Data Analysis

The Data Analysis section of homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. This means that in addition to addressing the question there should also be exploratory data analysis and an analysis of the model assumptions. In short, these questions should be treated as “mini-projects”.

We will use the diamonds dataset for this analysis. Use linear regression to description the relationship between the price and carat of diamonds that cost $1000 or less and have a “Good” cut. Your analysis should address the following:
- Use the results from the ANOVA table to discuss carat sufficiently explains the variation in price.
- Compare the conclusions from the ANOVA table to the conclusions you derive from the statistical inference (hypothesis test and confidence interval) on the slope.
- Describe the relationship between a diamond’s price and the carat weight. Be sure to include the confidence interval in your description.
- Use the model to predict the price of a single diamond that that is 0.55 carats and include the appropriate interval. Is this prediction reliable? Briefly explain why or why not.

As usual, be sure to include exploratory data analysis and an analysis of the model assumptions.

Grading

Total	70
Questions 1 - 7	30
Question 8	30
Documents complete and neatly organized (Markdown and knitted documents)	5
Answers written in complete sentences	3
Regular and informative commit messages	2

HW 02: Analysis of Variance

due Mon, Feb 11 at 11:59p