"All statistical models are wrong, but some are useful." –George Box
- Every model makes assumptions that aren’t true. Understanding those assumptions is critical!
- Every model only estimates the average value of the response variable for a given set of inputs. Assessing the quality of those estimates is important.
- Modeling involves a delicate balance between simplicity & accuracy. There is usually no “correct” model, but there is no value in making a model more complicated than it needs to be. Remember that you will eventually have to explain it to someone else!
- For models with a quantitative response, we can quantify the goodness-of-fit using \(R^2\), the coefficient of determination, which is the proportion of the variability in the response explained by the model.
- Key assumptions for linear regression:
- Linearity: the form of the relationship is roughly linear
- Normality of Residuals: the residuals are distributed roughly normally, centered at 0
- Constant Variance of Residuals: the variance of the residuals remains relatively constant with respect to the response and explanatory variables.
library(downloader)
library(ggplot2)   # needed for qplot()
download("https://stat.duke.edu/courses/Fall14/sta112.01/data/movies.Rdata", destfile = "movies.Rdata")
load("movies.Rdata")
# scatterplot of audience score vs. critics score
qplot(critics_score, audience_score, data = movies)
# fit the simple linear regression model
mod = lm(audience_score ~ critics_score, data = movies)
summary(mod)
##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.43 -8.69 1.26 9.68 52.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.5960 1.4807 24.7 <2e-16 ***
## critics_score 0.4052 0.0242 16.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared: 0.398, Adjusted R-squared: 0.397
## F-statistic: 281 on 1 and 424 DF, p-value: <2e-16
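As a check on the \(R^2\) reported above, we can recompute it by hand as the proportion of variability in the response explained by the model (a quick sketch using the fitted mod object):
# R^2 by hand: 1 - SS_residuals / SS_total (should match Multiple R-squared above)
y = mod$model$audience_score                 # response values actually used in the fit
1 - sum(residuals(mod)^2) / sum((y - mean(y))^2)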
# residuals vs. fitted values, with a dashed reference line at 0
qplot(mod$fitted, mod$residuals) +
  geom_abline(intercept = 0, slope = 0, lty = 2)
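The plot above helps check the linearity and constant variance conditions. A quick sketch for checking the normality condition, in the same qplot style (the binwidth here is an arbitrary choice):
# Normality of residuals: histogram should be roughly symmetric and centered at 0
qplot(mod$residuals, binwidth = 5)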
Overall:
\(H_0:\) All \(\beta_i = 0\) (\(i = 1, \cdots, k\))
\(H_A:\) There is at least one \(\beta_i \ne 0\)
Individual slopes:
\(H_0: \beta_i = 0\) - True slope is 0, i.e. no relationship between the explanatory and response variables (\(i = 1, \cdots, k\))
\(H_A: \beta_i \ne 0\) - True slope is not 0, i.e. there is a relationship between the explanatory and response variables
summary(mod)
##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.43 -8.69 1.26 9.68 52.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.5960 1.4807 24.7 <2e-16 ***
## critics_score 0.4052 0.0242 16.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared: 0.398, Adjusted R-squared: 0.397
## F-statistic: 281 on 1 and 424 DF, p-value: <2e-16
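The test for an individual slope can be pulled straight out of the coefficient table (a sketch; the row and column names match the summary output above):
# t statistic and p-value for H_0: slope of critics_score = 0
coef(summary(mod))["critics_score", c("t value", "Pr(>|t|)")]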
confint(mod, 'critics_score', level = 0.95)
## 2.5 % 97.5 %
## critics_score 0.3577 0.4528
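This interval can be reproduced by hand as \(b_1 \pm t^\star_{n-2} \times SE_{b_1}\), using the values reported in the summary output (a sketch):
# 95% CI for the slope by hand; df = n - 2 = 424
b1 = coef(mod)["critics_score"]
se_b1 = coef(summary(mod))["critics_score", "Std. Error"]
t_star = qt(0.975, df = df.residual(mod))
b1 + c(-1, 1) * t_star * se_b1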
new_movie50 = data.frame(critics_score = 50)
predict(mod, new_movie50)
## 1
## 56.86
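This is just the regression equation evaluated at \(x^\star = 50\): \(\hat{y} = 36.5960 + 0.4052 \times 50 \approx 56.86\). In R:
# point prediction by hand: intercept + slope * x_star
coef(mod)["(Intercept)"] + coef(mod)["critics_score"] * 50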
But there must be some uncertainty around this value…
A confidence interval for the average (expected) value of \(y\), \(E(y)\), for a given \(x^\star\), is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals, calculated as \(\sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}}\).
predict(mod, new_movie50, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 56.86 55.48 58.23
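The same interval can be computed by hand from the formula above (a sketch; it should agree with the predict() output up to rounding):
# 95% CI for E(y) at x_star = 50 by hand
x = mod$model$critics_score
n = length(x)
s = summary(mod)$sigma                # residual standard deviation
me = qt(0.975, df = n - 2) * s * sqrt(1/n + (50 - mean(x))^2 / ((n - 1) * var(x)))
predict(mod, new_movie50) + c(-1, 1) * me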
\[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \]
(ci50 = predict(mod, new_movie50, interval="confidence", level = 0.95))
## fit lwr upr
## 1 56.86 55.48 58.23
ci50[3] - ci50[2]
## [1] 2.75
new_movie10 = data.frame(critics_score = 10)
(ci10 = predict(mod, new_movie10, interval="confidence", level = 0.95))
## fit lwr upr
## 1 40.65 38.15 43.15
ci10[3] - ci10[2]
## [1] 5
The width of the confidence interval for \(E(y)\) increases as \(x^\star\) moves away from the center.
Conceptually: We are much more certain of our predictions at the center of the data than at the edges (and our level of certainty decreases even further when predicting outside the range of the data – extrapolation).
Mathematically: As the \((x^\star - \bar{x})^2\) term increases, the margin of error of the confidence interval increases as well.
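A quick sketch of this pattern: compute the width of the confidence interval for \(E(y)\) over a grid of critics_score values and note that it is smallest near \(\bar{x}\).
# CI width for E(y) across a grid of x_star values
grid = data.frame(critics_score = seq(10, 90, by = 20))
ci = predict(mod, grid, interval = "confidence", level = 0.95)
data.frame(critics_score = grid$critics_score, width = ci[, "upr"] - ci[, "lwr"])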
Earlier we learned how to calculate a confidence interval for average \(y\), \(E(y)\), for a given \(x^\star\).
Suppose we’re not interested in the average, but instead we want to predict a future value of \(y\) for a given \(x^\star\).
Would you expect there to be more uncertainty around an average or a specific predicted value?
A prediction interval for \(y\) for a given \(x^\star\) is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ 1 + \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals.
The formula is very similar, except the variability is higher because of the extra 1 under the square root.
Prediction level: If we repeat the process of obtaining a regression data set many times, each time forming a XX% prediction interval at \(x^\star\) and then observing the future value of \(y\) at \(x^\star\), roughly XX% of those prediction intervals will contain the corresponding actual value of \(y\).
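A minimal simulation sketch of this interpretation, using a made-up true model (intercept 35, slope 0.4, error SD 15) and \(x^\star = 50\):
# simulate: fit a regression, form a 95% PI at x_star = 50, then see whether a new y falls inside
set.seed(1)
covered = replicate(1000, {
  x = runif(100, 0, 100)
  y = 35 + 0.4 * x + rnorm(100, sd = 15)          # hypothetical true model
  fit = lm(y ~ x)
  pred_int = predict(fit, data.frame(x = 50), interval = "prediction", level = 0.95)
  y_new = 35 + 0.4 * 50 + rnorm(1, sd = 15)       # a future observation at x_star = 50
  pred_int[1, "lwr"] <= y_new & y_new <= pred_int[1, "upr"]
})
mean(covered)   # should be close to 0.95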
(pi50 = predict(mod, new_movie50, interval = "prediction", level = 0.95))
## fit lwr upr
## 1 56.86 28.73 84.98
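The hand calculation is the same as for the confidence interval, just with the extra 1 under the square root (a sketch):
# 95% PI for a new y at x_star = 50 by hand
x = mod$model$critics_score
n = length(x)
s = summary(mod)$sigma
me = qt(0.975, df = n - 2) * s * sqrt(1 + 1/n + (50 - mean(x))^2 / ((n - 1) * var(x)))
predict(mod, new_movie50) + c(-1, 1) * me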
pi50[3] - pi50[2]
## [1] 56.25
ci50[3] - ci50[2]
## [1] 2.75
The prediction interval is much wider than the confidence interval for the same \(x^\star\): it accounts for the variability of an individual observation around the regression line, not just the uncertainty in the estimated mean.
For a given data set, the error in estimating \(E(y)\) and \(\hat{y}\) grows as \(x^\star\) moves away from \(\bar{x}\). Thus, the further \(x^\star\) is from \(\bar{x}\), the wider the confidence and prediction intervals will be.
If any of the conditions underlying the model are violated, then the confidence intervals and prediction intervals may be invalid as well. This is why it’s so important to check the conditions by examining the residuals, etc.