Quality of fit

Assessing quality of fit

"All statistical models are wrong, but some are useful." –George Box
  • Every model makes assumptions that aren’t true. Understanding those assumptions is critical!
  • Every model only estimates the average value response variable for a given set of inputs. Assessing the quality of those estimates is important.
  • Modeling involves a delicate balance between simplicity & accuracy. There is usually no “correct” model, but there is no value in making a model more complicated than it needs to be. Remember that you will eventually have to explain it to someone else!
  • For models with a quantitative response, we can quantify the goodness-of-fit using \(R^2\), the coefficient of determination, which is the proportion of the variability in the response explained by the model
  • Key assumptions for linear regression:
    • Linearity: the form of the relationship is roughly linear
    • Normality of Residuals: the residuals are distributed roughly normally, centered at 0
    • Constant Variance of Residuals: the variance of the residuals remains relatively constant with respect to the response and explanatory variables.

Movies on Rotten Tomatoes

library(downloader)
download("https://stat.duke.edu/courses/Fall14/sta112.01/data/movies.Rdata", destfile = "movies.Rdata")
load("movies.Rdata")
qplot(critics_score, audience_score, data = movies)

plot of chunk unnamed-chunk-2

Predicting audience score

mod = lm(audience_score ~ critics_score, data = movies)
summary(mod)
## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.43  -8.69   1.26   9.68  52.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    36.5960     1.4807    24.7   <2e-16 ***
## critics_score   0.4052     0.0242    16.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared:  0.398,  Adjusted R-squared:  0.397 
## F-statistic:  281 on 1 and 424 DF,  p-value: <2e-16

(1) Linearity

qplot(mod$fitted, mod$residuals) +
  geom_abline(intercept = 0, slope = 0, lty = 2)

plot of chunk unnamed-chunk-4

(2) Nearly normal residuals

plot of chunk unnamed-chunk-6

(2) Constant variance of residuals

plot of chunk unnamed-chunk-7

Inference

Hypotheses

Overall:

\(H_0:\) All \(\beta_i\) = 0 (\(i = 1, \cdots, k\))

\(H_A:\) There is at least one \(\beta_i\) 0

Individual slopes:

\(H_0: \beta_i = 0\) - True slope is 0, i.e. no relationship between expalantory and response variables (\(i = 1, \cdots, k\))

\(H_0: \beta_i \ne 0\) - True slope is 0, i.e. there is a relationship between expalantory and response variables

Is critics’ score a significant predictor of audience score?
summary(mod)
## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.43  -8.69   1.26   9.68  52.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    36.5960     1.4807    24.7   <2e-16 ***
## critics_score   0.4052     0.0242    16.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared:  0.398,  Adjusted R-squared:  0.397 
## F-statistic:  281 on 1 and 424 DF,  p-value: <2e-16

Confidence interval for a slope

Interpret the following confidence interval for the slope of the regression line.
confint(mod, 'critics_score', level = 0.95)
##                2.5 % 97.5 %
## critics_score 0.3577 0.4528

Predictions, and uncertainty around predictions

Prediction

Predict the audience score of a movie with a critics’ score of 50.
new_movie50 = data.frame(critics_score = 50)
predict(mod, new_movie50)
##     1 
## 56.86

But there must be some uncertainty around this value…

Averages vs. an individual observation

Suppose I want to predict (a) the potential salary of a single Duke StatSci alum vs. (b) the average potential salary of a group of StatSci alum. Which is easier to predict, i.e. how will the intervals compare?

Confidence intervals for average values

A confidence interval for the average (expected) value of \(y\), \(E(y)\), for a given \(x^\star\), is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals, calculated as \(\sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}}\).

Confidence intervals for average values

Predict the average audience score of movies with a critics’ score of 50.
predict(mod, new_movie50, interval = "confidence", level = 0.95)
##     fit   lwr   upr
## 1 56.86 55.48 58.23

Sliding across the x-axis

How would you expect the width of the 95% confidence interval for the average audience scores of movies with a critics’ score of 50 (\(x^\star = 10\)) to compare to the previous confidence interval (where \(x^\star = 50\))?

\[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \]

plot of chunk unnamed-chunk-12

Confirm

(ci50 = predict(mod, new_movie50, interval="confidence", level = 0.95))
##     fit   lwr   upr
## 1 56.86 55.48 58.23
ci50[3] - ci50[2]
## [1] 2.75
new_movie10 = data.frame(critics_score = 10)
(ci10 = predict(mod, new_movie10, interval="confidence", level = 0.95))
##     fit   lwr   upr
## 1 40.65 38.15 43.15
ci10[3] - ci10[2]
## [1] 5

Recap

The width of the confidence interval for \(E(y)\) increases as \(x^\star\) moves away from the center.

  • Conceptually: We are much more certain of our predictions at the center of the data than at the edges (and our level of certainty decreases even further when predicting outside the range of the data – extrapolation).

  • Mathematically: As \((x^\star - \bar{x})^2\) term increases, the margin of error of the confidence interval increases as well.

Prediction intervals for single values

Earlier we learned how to calculate a confidence interval for average \(y\), \(E(y)\), for a given \(x^\star\).

Suppose we’re not interested in the average, but instead we want to predict a future value of \(y\) for a given \(x^\star\).

Would you expect there to be more uncertainty around an average or a specific predicted value?

Prediction intervals for single values

A for \(y\) for a given \(x^\star\) is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ 1 + \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals.

  • The formula is very similar, except the variability is higher since there is an added 1 in the formula.

  • Prediction level: If we repeat the study of obtaining a regression data set many times, each time forming a XX% prediction interval at \(x^\star\), and wait to see what the future value of \(y\) is at \(x^\star\), then roughly XX% of the prediction intervals will contain the corresponding actual value of \(y\).

Prediction intervals for single values

Predict the audience score of a movie with a critics’ score of 50. How does this interval compare to the confidence interval for the average audience score of movies with a critics’ score of 50?
(pi50 = predict(mod, new_movie50, interval = "prediction", level = 0.95))
##     fit   lwr   upr
## 1 56.86 28.73 84.98
pi50[3] - pi50[2]
## [1] 56.25
ci50[3] - ci50[2]
## [1] 2.75

Recap - CI vs. PI

CI for \(E(y)\) vs. PI for \(y\) - differences

  • A prediction interval is similar in spirit to a confidence interval, except that
    • the prediction interval is designed to cover a ``moving target“, \ the random future value of \(y\), while
    • the confidence interval is designed to cover the ``fixed target“, \ the average (expected) value of \(y\), \(E(y)\),

for a given \(x^\star\).

  • Although both are centered at \(\hat{y}\), the prediction interval is wider than the confidence interval, for a given \(x^\star\) and confidence level. This makes sense, since
    • the prediction interval must take account of the tendency of \(y\) to fluctuate from its mean value, while
    • the confidence interval simply needs to account for the uncertainty in estimating the mean value.

CI for \(E(y)\) vs. PI for \(y\) - similarities

  • For a given data set, the error in estimating \(E(y)\) and \(\hat{y}\) grows as \(x^\star\) moves away from \(\bar{x}\). Thus, the further \(x^\star\) is from \(\bar{x}\), the wider the confidence and prediction intervals will be.

  • If any of the conditions underlying the model are violated, then the confidence intervals and prediction intervals may be invalid as well. This is why it’s so important to check the conditions by examining the residuals, etc.

Application Exercise 13 [time permitting]

  1. Plot the confidence and prediction intervals for \(x^\star = 80\) on the same plot and compare their widths.
  2. Calculate the confidence and prediction interval for movies with a critics score of 0 - 100, and plot the two intervals on the same plot.