Lecture 14 - Single Predictor Regression, Pt 2

`"All statistical models are wrong, but some are useful." –George Box`

- Every model makes assumptions that aren’t true. Understanding those assumptions is critical!

- Every model only estimates the average value of the response variable for a given set of inputs. Assessing the quality of those estimates is important.

- Modeling involves a delicate balance between simplicity & accuracy. There is usually no “correct” model, but there is no value in making a model more complicated than it needs to be. Remember that you will eventually have to explain it to someone else!

- For models with a quantitative response, we can quantify the goodness-of-fit using \(R^2\), the coefficient of determination, which is the proportion of the variability in the response explained by the model

- Key assumptions for linear regression:

- Linearity: the form of the relationship is roughly linear
- Normality of Residuals: the residuals are distributed roughly normally, centered at 0
- Constant Variance of Residuals: the variance of the residuals remains relatively constant with respect to the response and explanatory variables.

```
library(downloader)
library(ggplot2)  # needed for qplot
download("https://stat.duke.edu/courses/Fall14/sta112.01/data/movies.Rdata", destfile = "movies.Rdata")
load("movies.Rdata")
qplot(critics_score, audience_score, data = movies)
```

```
mod = lm(audience_score ~ critics_score, data = movies)
summary(mod)
```

```
##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -42.43  -8.69   1.26   9.68  52.40
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    36.5960     1.4807    24.7   <2e-16 ***
## critics_score   0.4052     0.0242    16.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared:  0.398, Adjusted R-squared:  0.397
## F-statistic:  281 on 1 and 424 DF,  p-value: <2e-16
```
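The \(R^2\) and residual standard error reported above can be verified by hand from the definition \(R^2 = 1 - SSE/SST\); a quick sketch, assuming `mod` and `movies` from the chunks above:

```
# Sanity check: recompute R^2 and the residual standard error by hand
sse = sum(residuals(mod)^2)   # sum of squared residuals (unexplained variability)
sst = sum((movies$audience_score - mean(movies$audience_score))^2)  # total variability
1 - sse / sst                 # R^2: proportion of variability explained (~0.398)
sqrt(sse / df.residual(mod))  # residual standard error s (~14.3)
```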

```
qplot(mod$fitted, mod$residuals) +
  geom_abline(intercept = 0, slope = 0, lty = 2)
```

**Overall:**

\(H_0:\) All \(\beta_i\) = 0 (\(i = 1, \cdots, k\))

\(H_A:\) There is at least one \(\beta_i \ne 0\)

**Individual slopes:**

\(H_0: \beta_i = 0\) - True slope is 0, i.e. no relationship between explanatory and response variables (\(i = 1, \cdots, k\))

\(H_A: \beta_i \ne 0\) - True slope is not 0, i.e. there is a relationship between explanatory and response variables

Is critics’ score a significant predictor of audience score?

`summary(mod)`

```
##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -42.43  -8.69   1.26   9.68  52.40
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    36.5960     1.4807    24.7   <2e-16 ***
## critics_score   0.4052     0.0242    16.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.3 on 424 degrees of freedom
## Multiple R-squared:  0.398, Adjusted R-squared:  0.397
## F-statistic:  281 on 1 and 424 DF,  p-value: <2e-16
```

Interpret the following confidence interval for the slope of the regression line.

`confint(mod, 'critics_score', level = 0.95)`

```
## 2.5 % 97.5 %
## critics_score 0.3577 0.4528
```
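This interval is just the usual \(b_1 \pm t^\star_{n-2} \times SE_{b_1}\) construction, using the estimate and standard error from the summary output; a sketch, assuming `mod` from above:

```
# Build the 95% CI for the slope by hand from the summary table
est   = coef(summary(mod))["critics_score", "Estimate"]    # 0.4052
se    = coef(summary(mod))["critics_score", "Std. Error"]  # 0.0242
tstar = qt(0.975, df = df.residual(mod))  # critical value, n - 2 = 424 df
est + c(-1, 1) * tstar * se               # matches confint(): (0.3577, 0.4528)
```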

Predict the audience score of a movie with a critics’ score of 50.

```
new_movie50 = data.frame(critics_score = 50)
predict(mod, new_movie50)
```

```
## 1
## 56.86
```
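This point estimate is simply the fitted line evaluated at \(x^\star = 50\), i.e. \(\hat{y} = b_0 + b_1 \cdot 50\); assuming `mod` from above:

```
# Point prediction by hand from the fitted coefficients
b = coef(mod)     # intercept b_0 and slope b_1
b[1] + b[2] * 50  # 36.5960 + 0.4052 * 50 = 56.86
```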

But there must be some uncertainty around this value…

Suppose I want to predict (a) the potential salary of a single Duke StatSci alum vs. (b) the average potential salary of a group of StatSci alums. Which is easier to predict, i.e. how will the widths of the intervals compare?

A confidence interval for the average (expected) value of \(y\), \(E(y)\), for a given \(x^\star\), is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals, calculated as \(\sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}}\).

Predict the average audience score of movies with a critics’ score of 50.

`predict(mod, new_movie50, interval = "confidence", level = 0.95)`

```
## fit lwr upr
## 1 56.86 55.48 58.23
```
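The same interval can be assembled directly from the formula above; a sketch, assuming `mod` and `movies` are loaded:

```
# 95% CI for E(y) at x* = 50, computed from the formula by hand
n    = nrow(movies)
s    = summary(mod)$sigma         # residual standard deviation
xbar = mean(movies$critics_score)
sx2  = var(movies$critics_score)  # s_x^2
yhat = predict(mod, data.frame(critics_score = 50))
me   = qt(0.975, n - 2) * s * sqrt(1/n + (50 - xbar)^2 / ((n - 1) * sx2))
yhat + c(-1, 1) * me              # matches predict(..., interval = "confidence")
```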

How would you expect the width of the 95% confidence interval for the average audience score of movies with a critics’ score of 10 (\(x^\star = 10\)) to compare to the previous confidence interval (where \(x^\star = 50\))?

\[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \]

`(ci50 = predict(mod, new_movie50, interval="confidence", level = 0.95))`

```
## fit lwr upr
## 1 56.86 55.48 58.23
```

`ci50[3] - ci50[2]`

`## [1] 2.75`

```
new_movie10 = data.frame(critics_score = 10)
(ci10 = predict(mod, new_movie10, interval="confidence", level = 0.95))
```

```
## fit lwr upr
## 1 40.65 38.15 43.15
```

`ci10[3] - ci10[2]`

`## [1] 5`

The width of the confidence interval for \(E(y)\) increases as \(x^\star\) moves away from the center.

Conceptually: We are much more certain of our predictions at the center of the data than at the edges (and our level of certainty decreases even further when predicting outside the range of the data – extrapolation).

Mathematically: As the \((x^\star - \bar{x})^2\) term increases, the margin of error of the confidence interval increases as well.

Earlier we learned how to calculate a confidence interval for average \(y\), \(E(y)\), for a given \(x^\star\).

Suppose we’re not interested in the average, but instead we want to predict a future value of \(y\) for a given \(x^\star\).

Would you expect there to be more uncertainty around an average or a specific predicted value?

A prediction interval for a future value of \(y\) for a given \(x^\star\) is \[ \hat{y} \pm t^\star_{n - 2} s \sqrt{ 1 + \frac{1}{n} + \frac{(x^\star - \bar{x})^2}{(n - 1)s_x^2} } \] where \(s\) is the standard deviation of the residuals.

The formula is very similar to that of the confidence interval, except the variability is higher since there is an added 1 under the square root.

Prediction level: If we repeat the study of obtaining a regression data set many times, each time forming a XX% prediction interval at \(x^\star\), and then observe the actual future value of \(y\) at \(x^\star\), then roughly XX% of the prediction intervals will contain that value.

Predict the audience score of a movie with a critics’ score of 50. How does this interval compare to the confidence interval for the average audience score of movies with a critics’ score of 50?

`(pi50 = predict(mod, new_movie50, interval = "prediction", level = 0.95))`

```
## fit lwr upr
## 1 56.86 28.73 84.98
```

`pi50[3] - pi50[2]`

`## [1] 56.25`

`ci50[3] - ci50[2]`

`## [1] 2.75`
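The prediction interval can also be checked against the formula, which differs from the confidence interval only by the extra 1 under the square root; a sketch, assuming `mod` and `movies` from above:

```
# 95% PI for a new y at x* = 50, computed from the formula by hand
n    = nrow(movies)
s    = summary(mod)$sigma         # residual standard deviation
xbar = mean(movies$critics_score)
sx2  = var(movies$critics_score)  # s_x^2
yhat = predict(mod, data.frame(critics_score = 50))
me   = qt(0.975, n - 2) * s * sqrt(1 + 1/n + (50 - xbar)^2 / ((n - 1) * sx2))
yhat + c(-1, 1) * me              # matches predict(..., interval = "prediction")
```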

- A prediction interval is similar in spirit to a confidence interval, except that
  - the prediction interval is designed to cover a "moving target", the random future value of \(y\), while
  - the confidence interval is designed to cover the "fixed target", the average (expected) value of \(y\), \(E(y)\),

  for a given \(x^\star\).

- Although both are centered at \(\hat{y}\), the prediction interval is wider than the confidence interval, for a given \(x^\star\) and confidence level. This makes sense, since
  - the prediction interval must take account of the tendency of \(y\) to fluctuate from its mean value, while
  - the confidence interval simply needs to account for the uncertainty in estimating the mean value.

For a given data set, the error in estimating \(E(y)\) and \(\hat{y}\) grows as \(x^\star\) moves away from \(\bar{x}\). Thus, the further \(x^\star\) is from \(\bar{x}\), the wider the confidence and prediction intervals will be.

If any of the conditions underlying the model are violated, then the confidence intervals and prediction intervals may be invalid as well. This is why it’s so important to check the conditions by examining the residuals, etc.

- Plot the confidence and prediction intervals for \(x^\star = 80\) on the same plot and compare their widths.
- Calculate the confidence and prediction intervals for movies with critics’ scores from 0 to 100, and plot the two intervals on the same plot.