October 9, 2014

Quality of fit

Assessing quality of fit

"All statistical models are wrong, but some are useful." –George Box
  • Every model makes assumptions that aren’t true. Understanding those assumptions is critical!
  • Every model only estimates the average value of the response variable for a given set of inputs. Assessing the quality of those estimates is important.
  • Modeling involves a delicate balance between simplicity & accuracy. There is usually no “correct” model, but there is no value in making a model more complicated than it needs to be. Remember that you will eventually have to explain it to someone else!
  • For models with a quantitative response, we can quantify the goodness-of-fit using \(R^2\), the coefficient of determination, which is the proportion of the variability in the response explained by the model.
  • Key assumptions for linear regression:
    • Linearity: the form of the relationship is roughly linear
    • Normality of Residuals: the residuals are distributed roughly normally, centered at 0
    • Constant Variance of Residuals: the variance of the residuals remains relatively constant with respect to the response and explanatory variables.
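
The definition of \(R^2\) as the proportion of variability explained can be checked by hand. The following is a minimal sketch on simulated data; the variables `x` and `y` and the simulation settings are assumptions made for illustration and are not the movies data.

```r
# Simulated stand-in data (hypothetical; not the movies data set)
set.seed(1)
x = runif(100, min = 0, max = 100)
y = 35 + 0.4 * x + rnorm(100, sd = 14)

fit = lm(y ~ x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_resid = sum(resid(fit)^2)
ss_total = sum((y - mean(y))^2)
r_squared = 1 - ss_resid / ss_total

# Compare with the value reported by summary()
all.equal(r_squared, summary(fit)$r.squared)
```

The same quantity appears as "Multiple R-squared" in the `summary()` output of any `lm` fit.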

Movies on Rotten Tomatoes

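library(downloader)
library(ggplot2)  # provides qplot()
# Download the course data file and load the `movies` data frame
download("https://stat.duke.edu/courses/Fall14/sta112.01/data/movies.Rdata", destfile = "movies.Rdata")
load("movies.Rdata")
# Scatterplot of audience score against critics' score
qplot(critics_score, audience_score, data = movies)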

[Figure: scatterplot of audience_score against critics_score]

Predicting audience score

mod = lm(audience_score ~ critics_score, data = movies)
summary(mod)
## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.43  -8.69   1.26   9.68  52.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    36.5960     1.4807    24.7   <2e-16 ***
## critics_score   0.4052     0.0242    16.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared:  0.398,  Adjusted R-squared:  0.397 
## F-statistic:  281 on 1 and 424 DF,  p-value: <2e-16

(1) Linearity

qplot(mod$fitted, mod$residuals) +
  geom_abline(intercept = 0, slope = 0, lty = 2)

[Figure: residuals against fitted values, with a dashed horizontal line at 0]

(2) Nearly normal residuals

[Figure: plot for checking that the residuals are nearly normal]
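
The code behind this slide's plot is not shown. One common way to check the condition is a histogram plus a normal Q-Q plot of the residuals. This is a sketch on simulated data (the names `x`, `y`, `fit`, and `res` are hypothetical), using base R graphics rather than whatever the original chunk used.

```r
# Simulated stand-in data (hypothetical; not the movies data set)
set.seed(1)
x = runif(100, min = 0, max = 100)
y = 35 + 0.4 * x + rnorm(100, sd = 14)

fit = lm(y ~ x)
res = resid(fit)

# Histogram: look for a roughly symmetric, bell-shaped distribution centered at 0
hist(res, main = "Histogram of residuals", xlab = "Residual")

# Normal Q-Q plot: points close to the reference line suggest near-normality
qqnorm(res)
qqline(res, lty = 2)
```

For the movies model, the same calls would apply with `resid(mod)` in place of `res`.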

(3) Constant variance of residuals

[Figure: plot for checking constant variance of the residuals]
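
The code for this check is also not shown. A sketch of one possible approach, again on simulated data with hypothetical names: plot the absolute residuals against the fitted values and compare the residual spread in the lower and upper halves of the fitted values.

```r
# Simulated stand-in data (hypothetical; not the movies data set)
set.seed(1)
x = runif(100, min = 0, max = 100)
y = 35 + 0.4 * x + rnorm(100, sd = 14)

fit = lm(y ~ x)

# Absolute residuals vs. fitted values: a fan or funnel shape
# would suggest that the variance is not constant
plot(fitted(fit), abs(resid(fit)),
     xlab = "Fitted values", ylab = "|Residuals|")

# Rough numeric check: residual standard deviation below vs. above
# the median fitted value (similar values support constant variance)
upper_half = fitted(fit) > median(fitted(fit))
spread = tapply(resid(fit), upper_half, sd)
spread
```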

Inference

Hypotheses

Overall:

\(H_0:\) All \(\beta_i\) = 0 (\(i = 1, \cdots, k\))

\(H_A:\) There is at least one \(\beta_i \ne 0\)

Individual slopes:

\(H_0: \beta_i = 0\) - True slope is 0, i.e. no relationship between explanatory and response variables (\(i = 1, \cdots, k\))

\(H_A: \beta_i \ne 0\) - True slope is not 0, i.e. there is a relationship between explanatory and response variables

Is critics' score a significant predictor of audience score?
summary(mod)
## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.43  -8.69   1.26   9.68  52.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    36.5960     1.4807    24.7   <2e-16 ***
## critics_score   0.4052     0.0242    16.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared:  0.398,  Adjusted R-squared:  0.397 
## F-statistic:  281 on 1 and 424 DF,  p-value: <2e-16
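
The t value and p-value in the coefficient table come from the test of \(H_0: \beta_1 = 0\): the t statistic is the estimate divided by its standard error, and the two-sided p-value uses a t distribution with \(n - 2\) degrees of freedom. A minimal sketch on simulated data (the names `x`, `y`, and `fit` are hypothetical):

```r
# Simulated stand-in data (hypothetical; not the movies data set)
set.seed(1)
x = runif(100, min = 0, max = 100)
y = 35 + 0.4 * x + rnorm(100, sd = 14)

fit = lm(y ~ x)
coefs = summary(fit)$coefficients

# t statistic for the slope: estimate / standard error
est = coefs["x", "Estimate"]
se = coefs["x", "Std. Error"]
t_stat = est / se

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
p_value = 2 * pt(-abs(t_stat), df = df.residual(fit))

# Both should match the table produced by summary()
all.equal(t_stat, coefs["x", "t value"])
all.equal(p_value, coefs["x", "Pr(>|t|)"])
```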

Confidence interval for a slope

Interpret the following confidence interval for the slope of the regression line.
confint(mod, 'critics_score', level = 0.95)
##                2.5 % 97.5 %
## critics_score 0.3577 0.4528
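
The interval above can be reproduced approximately from the regression output as estimate \(\pm\ t^\star_{424} \times\) standard error. The sketch below uses the rounded values printed by `summary(mod)`, so it should agree with `confint()` only up to rounding.

```r
# Values copied from the summary(mod) output shown earlier (rounded)
estimate = 0.4052
std_error = 0.0242
df = 424

# Critical value for a 95% interval
t_star = qt(0.975, df = df)

# Point estimate plus/minus the margin of error
ci = estimate + c(-1, 1) * t_star * std_error
round(ci, 3)
```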

Predictions, and uncertainty around predictions

Prediction

Predict the audience score of a movie with a critics' score of 50.
new_movie50 = data.frame(critics_score = 50)
predict(mod, new_movie50)
##     1 
## 56.86
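
The predicted value is simply the fitted line evaluated at a critics' score of 50. Using the rounded coefficients printed by `summary(mod)`:

```r
# Coefficients copied from the summary(mod) output shown earlier (rounded)
b0 = 36.5960  # intercept
b1 = 0.4052   # slope for critics_score

# Fitted line evaluated at critics_score = 50
critics = 50
pred = b0 + b1 * critics
round(pred, 2)
```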

But there must be some uncertainty around this value…

Averages vs. an individual observation

Suppose I want to predict (a) the potential salary of a single Duke StatSci alum vs. (b) the average potential salary of a group of StatSci alumni. Which is easier to predict, i.e. how will the intervals compare?
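
One way to explore this question numerically is to compare the two interval types that `predict()` offers for an `lm` fit: `interval = "confidence"` for the average response and `interval = "prediction"` for a single new observation. This is a sketch on simulated data; the names `x`, `y`, `fit`, and `new_x` are hypothetical.

```r
# Simulated stand-in data (hypothetical; not salary or movies data)
set.seed(1)
x = runif(100, min = 0, max = 100)
y = 35 + 0.4 * x + rnorm(100, sd = 14)

fit = lm(y ~ x)
new_x = data.frame(x = 50)

# Interval for the average response at x = 50
ci_mean = predict(fit, new_x, interval = "confidence", level = 0.95)

# Interval for a single new observation at x = 50
pi_single = predict(fit, new_x, interval = "prediction", level = 0.95)

# Compare the widths of the two intervals
width_mean = ci_mean[1, "upr"] - ci_mean[1, "lwr"]
width_single = pi_single[1, "upr"] - pi_single[1, "lwr"]
c(average = width_mean, individual = width_single)
```

The interval for an individual is wider because it includes the variability of a single observation around the mean, in addition to the uncertainty in the estimated mean itself.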

Confidence intervals for average values

A confidence interval for the average (expected) value of \(y\), \(E(y)\), for a given \(x^\star\), is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals, calculated as \(\sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}}\).
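
As a check on the formula, the interval can be computed term by term and compared with `predict()`. This is a minimal sketch on simulated data; `x`, `y`, `fit`, and `x_star` are hypothetical names, not objects from the movies analysis.

```r
# Simulated stand-in data (hypothetical; not the movies data set)
set.seed(1)
x = runif(100, min = 0, max = 100)
y = 35 + 0.4 * x + rnorm(100, sd = 14)

fit = lm(y ~ x)
n = length(x)
x_star = 50

# s: standard deviation of the residuals, with n - 2 in the denominator
s = sqrt(sum(resid(fit)^2) / (n - 2))

# Point estimate and critical value
y_hat = unname(coef(fit)[1] + coef(fit)[2] * x_star)
t_star = qt(0.975, df = n - 2)

# Margin of error from the formula; var(x) is s_x^2
margin = t_star * s * sqrt(1 / n + (x_star - mean(x))^2 / ((n - 1) * var(x)))
by_hand = c(y_hat, y_hat - margin, y_hat + margin)

# Same interval from predict()
from_predict = predict(fit, data.frame(x = x_star),
                       interval = "confidence", level = 0.95)
all.equal(by_hand, unname(from_predict[1, ]))
```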

Confidence intervals for average values

Predict the average audience score of movies with a critics' score of 50.
predict(mod, new_movie50, interval = "confidence", level = 0.95)
##     fit   lwr   upr
## 1 56.86 55.48 58.23

Sliding across the x-axis

How would you expect the width of the 95% confidence interval for the average audience scores of movies with a critics' score of 10 (\(x^\star = 10\)) to compare to the previous confidence interval (where \(x^\star = 50\))?

\[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \]
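
The \((x^\star - \bar{x})^2\) term in the formula means the interval widens as \(x^\star\) moves away from \(\bar{x}\). A sketch on simulated data, where \(\bar{x}\) is near 50 by construction (the mean critics' score in the movies data is not shown here, so this is an assumption of the illustration, and all variable names are hypothetical):

```r
# Simulated stand-in data (hypothetical; not the movies data set)
set.seed(1)
x = runif(100, min = 0, max = 100)
y = 35 + 0.4 * x + rnorm(100, sd = 14)

fit = lm(y ~ x)

# 95% confidence intervals for the average response at two values of x
x_star = c(10, 50)
ci = predict(fit, data.frame(x = x_star), interval = "confidence", level = 0.95)

# Interval widths: the value further from mean(x) gives the wider interval
width = ci[, "upr"] - ci[, "lwr"]
names(width) = paste0("x_star = ", x_star)
width
```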