October 9, 2014

"All statistical models are wrong, but some are useful." –George Box

- Every model makes assumptions that aren’t true. Understanding those assumptions is critical!

- Every model only estimates the average value response variable for a given set of inputs. Assessing the quality of those estimates is important.

- Modeling involves a delicate balance between simplicity & accuracy. There is usually no “correct” model, but there is no value in making a model more complicated than it needs to be. Remember that you will eventually have to explain it to someone else!

- For models with a quantitative response, we can quantify the goodness-of-fit using \(R^2\), the coefficient of determination, which is the proportion of the variability in the response explained by the model

- Key assumptions for linear regression:
- Linearity: the form of the relationship is roughly linear
- Normality of Residuals: the residuals are distributed roughly normally, centered at 0
- Constant Variance of Residuals: the variance of the residuals remains relatively constant with respect to the response and explanatory variables.

library(downloader) download("https://stat.duke.edu/courses/Fall14/sta112.01/data/movies.Rdata", destfile = "movies.Rdata") load("movies.Rdata") qplot(critics_score, audience_score, data = movies)

mod = lm(audience_score ~ critics_score, data = movies) summary(mod)

## ## Call: ## lm(formula = audience_score ~ critics_score, data = movies) ## ## Residuals: ## Min 1Q Median 3Q Max ## -42.43 -8.69 1.26 9.68 52.40 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 36.5960 1.4807 24.7 <2e-16 *** ## critics_score 0.4052 0.0242 16.8 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 14.3 on 424 degrees of freedom ## Multiple R-squared: 0.398, Adjusted R-squared: 0.397 ## F-statistic: 281 on 1 and 424 DF, p-value: <2e-16

qplot(mod$fitted, mod$residuals) + geom_abline(intercept = 0, slope = 0, lty = 2)

**Overall:**

\(H_0:\) All \(\beta_i\) = 0 (\(i = 1, \cdots, k\))

\(H_A:\) There is at least one \(\beta_i\) 0

**Individual slopes:**

\(H_0: \beta_i = 0\) - True slope is 0, i.e. no relationship between expalantory and response variables (\(i = 1, \cdots, k\))

\(H_0: \beta_i \ne 0\) - True slope is 0, i.e. there is a relationship between expalantory and response variables

Is critics' score a significant predictor of audience score?

summary(mod)

## ## Call: ## lm(formula = audience_score ~ critics_score, data = movies) ## ## Residuals: ## Min 1Q Median 3Q Max ## -42.43 -8.69 1.26 9.68 52.40 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 36.5960 1.4807 24.7 <2e-16 *** ## critics_score 0.4052 0.0242 16.8 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 14.3 on 424 degrees of freedom ## Multiple R-squared: 0.398, Adjusted R-squared: 0.397 ## F-statistic: 281 on 1 and 424 DF, p-value: <2e-16

Interpret the following confidence interval for the slope of the regression line.

confint(mod, 'critics_score', level = 0.95)

## 2.5 % 97.5 % ## critics_score 0.3577 0.4528

Predict the audience score of a movie with a critics' score of 50.

new_movie50 = data.frame(critics_score = 50) predict(mod, new_movie50)

## 1 ## 56.86

But there must be some uncertainty around this value…

Suppose I want to predict (a) the potential salary of a single Duke StatSci alum vs. (b) the average potential salary of a group of StatSci alum. Which is easier to predict, i.e. how will the intervals compare?

A confidence interval for the average (expected) value of \(y\), \(E(y)\), for a given \(x^\star\), is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals, calculated as \(\sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}}\).

Predict the average audience score of movies with a critics' score of 50.

predict(mod, new_movie50, interval = "confidence", level = 0.95)

## fit lwr upr ## 1 56.86 55.48 58.23

How would you expect the width of the 95% confidence interval for the average audience scores of movies with a critics' score of 50 (\(x^\star = 10\)) to compare to the previous confidence interval (where \(x^\star = 50\))?

\[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \]