class: center, middle, inverse, title-slide # Simple Linear Regression ## Inference & Prediction ### Prof. Maria Tackett & Youngsoo Baek ### 01.27.20 --- class: middle, center ### [Click for PDF of slides](04-slr-inf.pdf) --- ### Announcements - Lab 02 due **Tuesday at 11:59p** - One person from each group submit the assignment on Gradescope. - Click to add all group members names to the submission. - HW 01 due **Wed, Jan 29 at 11:59p** - Fill out daily engagement survey (should receive it at the end of class) - [Reading for today](https://www2.stat.duke.edu/courses/Spring20/sta210.001/reading/reading-02.html) - No lecture on Wedneseday. Watch rstudio::conf Jan 29 - 30 and complete reflection by Jan 31 at 11:59p - Click [here](https://www2.stat.duke.edu/courses/Spring20/sta210.001/hw/rstudio-conf.html) for details --- ### Today's Agenda - Model Assumptions - Inference for regression - Prediction (time permitting) --- ### Packages and Data ```r library(tidyverse) library(broom) library(modelr) library(knitr) library(fivethirtyeight) #fandango dataset library(cowplot) #plot_grid() function ``` ```r movie_scores <- fandango %>% rename(critics = rottentomatoes, audience = rottentomatoes_user) ``` --- ### rottentomatoes.com What is the relationship between the critics' score and audience score for movies? -- <img src="img/03/movie-rating-1.png" width="70%" style="display: block; margin: auto;" /> -- <img src="img/03/movie-rating-2.png" width="70%" style="display: block; margin: auto;" /> --- ### Critic vs. Audience Ratings - To answer this question, we will analyze the critic and audience scores from rottentomatoes.com. - The data was first used in the article [Be Suspicious of Online Movie Ratings, Especially Fandango's](https://fivethirtyeight.com/features/fandango-movies-ratings/). - Variables: - **`critics`**: critics score for the film (0 - 100) - **`audience`**: Audience score for the film (0 - 100) --- .small[ ```r ggplot(data = movie_scores, mapping = aes(x = critics, y = audience)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Audience Score vs. Critics Score") ``` <img src="04-slr-inf_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- ### The Model ```r model <- lm(audience ~ critics, data = movie_scores) tidy(model) %>% kable(format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| |:-----------|--------:|---------:|---------:|-------:| |(Intercept) | 32.316| 2.343| 13.795| 0| |critics | 0.519| 0.035| 15.028| 0| `$$\hat{\text{audience}} = 32.316 + 0.519 \times \text{critics}$$` -- - <font class = "vocab">Slope: </font> For each additional percentage point in the critics score, the audience score is expected to increase by 0.519 percentage points on average. - <font class = "vocab">Intercept: </font> If a movie gets a 0% from the critics, the audience score is expected to be 32.316%. --- class: middle, center ### Checking Model Assumptions --- ### Recall: The regression model When we fit a simple linear regression model, we assume for every `\(x_i\)` the distribution of `\(y\)` is approximately Normal... - with mean `\(\beta_0 + \beta_1x_i\)`. - and variance `\(\sigma^2\)` <img src="img/02/regression.png" width="80%" style="display: block; margin: auto;" /> --- ### Assumptions for Regression When we fit a simple linear regression model, we must check that our data follows the relationship on the previous slide by checking the following assumptions... <br> 1. <font class="vocab">Linearity: </font>The plot of the mean value of `\(y\)` versus `\(x\)` falls on a straight line 2. <font class="vocab">Constant Variance: </font>The variance of the distribution of `\(y\)` for a given value of `\(x\)` is `\(\sigma^2\)`, i.e. is the same for all values of `\(x\)` 3. <font class="vocab">Normality: </font> For a given `\(x\)`, the distribution of `\(y\)` is Normal 4. <font class="vocab">Independence: </font>All observations are independent --- ### Checking assumptions We will use various plots to check the assumptions for regression: 1. Scatterplot of `\(y\)` vs. `\(x\)` - Do this as part of the exploratory data analysis to see if there are any obvious departures from linearity 2. Plot of residuals vs. predictor variable - Check linearity condition - Check constant variance condition 3. Histogram and Normal QQ-Plot of residuals - Check Normality condition --- ### Linearity: Plot of residuals vs. predictor - If the linear model, `\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\)`, adequately describes the relationship between `\(x\)` and `\(y\)`, then the residuals should reflect random (chance) error - To assess this, we can look at a plot of the residuals vs. the predictor variable - .vocab[Linearity satisifed] if there is no distinguishable pattern in the residuals plot, i.e. the residuals should be randomly scattered - A non-random pattern (e.g. a parabola) suggests a linear model does not adequately describe the relationship between `\(x\)` and `\(y\)` --- ### Constant Variance: Plot of residuals vs. predictor - If the spread of the distribution of `\(y\)` is equal for all values of `\(x\)`, then the spread of the residuals should be approximately equal for each value of `\(x\)` - To assess this, we can look at a plot of the residuals vs. the predictor variable - .vocab[Constant variance satisifed] if the vertical spread of the residuals is approximately equal as you move from left to right (i.e. there is no "fan" pattern) - A fan pattern suggests the constant variance assumption is not satisified and transformation or some other remedy is required (more on this later in the semester) --- ### Example residual plots <img src="img/02/resid_plots.png" width="80%" style="display: block; margin: auto;" /> --- .small[ ```r movie_scores <- movie_scores %>% mutate(residuals = resid(model)) ``` ```r ggplot(data = movie_scores, mapping = aes(x = critics, y = residuals)) + geom_point() + geom_hline(yintercept = 0, color = "red")+ labs(title = "Residuals vs. Critics Score") ``` <img src="04-slr-inf_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> ] .question[ What are your conclusions about the linearity and constant variance assumptions? ] --- ### Normality: Histogram & Normal QQ-Plot - The linear model assumes that the distribution of `\(y\)` is Normal for every value of `\(x\)`. - This is impossible to check in practice, so we will look at the overall distribution of the residuals to assess if the Normality assumption is satisified - .vocab[Normaltiy satisified] if a histogram of the residuals is approximately Normal - Can also check that a Normal QQ-plot falls along a diagonal line - Note: Most inference methods for regression are robust to some departures from Normality --- ### Understanding the Normal QQ plot <img src="img/02/normal_qqplot.png" width="100%" style="display: block; margin: auto;" /> --- ```r ggplot(data = movie_scores, mapping = aes(x = residuals)) + geom_histogram() + labs(title = "Distribution of Residuals") ``` <img src="04-slr-inf_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ```r ggplot(data = movie_scores, mapping = aes(sample = residuals)) + stat_qq() + stat_qq_line() + labs(title = "Normal QQ Plot of Residuals") ``` <img src="04-slr-inf_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ### Checking Independence - Often, we can conclude that the independence assumption is sufficiently met based on a description of the data and how it was collected. - Two common violations of the independence assumption: - <font class="vocab">Serial Effect</font>: If the data were collected over time, the residuals should be plotted in time order to determine if there is serial correlation - <font class="vocab">Cluster Effect</font>: If there are subgroups represented in the data that are not accounted for in the model (e.g. type of movie), you can color the points in the residual plots by group to see if the model systematically over or under predicts for a particular subgroup --- class: middle, center ### Inference for `\(\beta_1\)` --- ### Questions of interest In our example, we will treat the data as a random sample of movies from rottentomatoes.com **Questions of interest** - What is a plausible range of values of the true population slope for `critics`? (<font class = "vocab">confidence interval</font>) - Is there actually a linear relationship between `critics` and `audience`, or is the relationshp we observed due to random chance? - We estimated `\(\hat{\beta}_1 = 0.519\)`, but is there sufficient evidence to conclude that the true population slope `\(\beta\)` is different from 0? (<font class = "vocab">hypothesis test</font>) --- class: middle, center ### What is a plausible range of values of the true population slope for `critics`? --- ### General form of the confidence interval - Let <font class="vocab">SE</font> be the standard error of the statistic used to estimate the parameter of interest, then the general form of the confidence interval is .alert[ `$$\text{ Estimate} \pm \text{ (critical value) } \times \text{SE}$$` ] - *Note*: The critical value is determined by the distribution of the estimate (statistic) and the confidence level - For the regression slope: - `\(\hat{\beta}_1\)` is the statistic used to estimate the parameter, `\(\beta_1\)` - We will write the confidence interval as `$$\mathbf{\hat{\beta}_1 \pm t^* SE(\hat{\beta}_1)}$$` --- ### Confidence interval for `\(\beta_1\)` - The confidence interval for the regression slope is .alert[ `$$\mathbf{\hat{\beta}_1 \pm t^* SE(\hat{\beta}_1)}$$` ] - `\(t^*\)` is the critical value associated with the confidence level. + It is calculated from a `\(t\)` distribution with `\(n-2\)` degrees of freedom - `\(SE(\hat{\beta}_1)\)` is the standard error for the slope `$$SE(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum\limits_{i=1}^n (x_i - \bar{x})^2}} \hspace{2.5mm} = \hspace{2.5mm} \hat{\sigma}\sqrt{\frac{1}{(n-1)s_X^2}}$$` --- ### What is `\(\hat{\sigma}\)`? - Recall, the residual is the difference between the observed response the predicted response (the estimated mean) - The residual for the ith observation, `\((x_i, y_i)\)`, is `$$e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$` - The <font class = "vocab">Residual Standard Error</font> is the estimate of variation about the regression line - Also known as the **Root Mean Square Error (RMSE)** .alert[ `$$\hat{\sigma} = \sqrt{\frac{1}{n-2}\sum\limits_{i=1}^{n} e_i^2}$$` ] --- ### Why *t*? .alert[ `$$\hat{\beta}_1 \sim N \Bigg(\beta_1, \sigma\sqrt{\frac{1}{(n-1)s_X^2}} \Bigg)$$` ] - We don't know `\(\sigma\)`, so we use its estimate `\(\hat{\sigma}\)` in our calculations. Therefore, we use the *t* distribution when we calculate the confidence interval (and conduct hypothesis tests) to account for the extra variability that's been introduced (same reason we use *t* when doing inference for a mean) - The critical value `\(t^*\)` is calculated from the *t(n-2)* distribution - the *t* distribution with *n-2* degrees of freedom. --- ### Movies data: Critical value ```r qt(0.975, 144) ``` ``` ## [1] 1.976575 ``` <img src="04-slr-inf_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ### Calculating the 95% CI for `\(\beta_1\)` | n| var.x| sigma| beta1| crit.val| |---:|-------:|------:|-----:|--------:| | 146| 910.156| 12.538| 0.519| 1.977| .instructions[ Write the equation for the 95% confidence interval for `\(\beta_1\)`, the coefficient (slope) of `critics`. ] --- ### Interpretation ```r model %>% tidy(conf.int=TRUE) %>% kable(format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| conf.low| conf.high| |:-----------|--------:|---------:|---------:|-------:|--------:|---------:| |(Intercept) | 32.316| 2.343| 13.795| 0| 27.685| 36.946| |critics | 0.519| 0.035| 15.028| 0| 0.450| 0.587| .instructions[ Interpret the 95% confidence interval for `\(\beta_1\)`, the coefficient (slope) of critics. ] --- class: middle, center ### Is there actually a linear relationship between `critics` and `audience`, or is the relationshp we observed due to random chance? --- ### Recall: Outline of Hypothesis Test 1. State the hypotheses 2. Calculate the test statistic 3. Calculate the p-value 4. State the conclusion in the context of the problem --- ### 1. State the hypotheses - We are often interested in testing whether there is a statistically significant linear relationship between the predictor and response variables - If there is actually no linear relationship between the two variables, the population regression slope, `\(\beta_1\)`, would equal 0 -- - Therefore, let's test the hypotheses: .alert[ `$$\begin{aligned}& H_0: \beta_1 = 0\\& H_a: \beta_1 \neq 0\end{aligned}$$` ] - These are the hypotheses corresponding to the output from the `lm` function --- ### 2. Calculate the test statistic `$$\begin{aligned}& H_0: \beta_1 = 0\\& H_a: \beta_1 \neq 0\end{aligned}$$` .alert[ **Test Statistic:** `$$\begin{aligned}\text{test statistic} &= \frac{\text{Estimate} - \text{Hypothesized}}{SE} \\[10pt] &= \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}\end{aligned}$$` ] --- ### 3. Calculate the p-value <font class="vocab">p-value</font> is calculated from a `\(t\)` distribution with `\(n-2\)` degrees of freedom .alert[ `$$\text{p-value} = P(t \geq |\text{test statistic}|)$$` ] <img src="04-slr-inf_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- class: middle, center .instructions[ **Write the general definition of the p-value for tests of `\(\beta_1\)`.** ] --- ### 4. State the conclusion | Magnitude of p-value | Interpretation | |:---------------------:|:-------------------------------------:| | p-value < 0.01 | strong evidence against `\(H_0\)` | | 0.01 < p-value < 0.05 | moderate evidence against `\(H_0\)` | | 0.05 < p-value < 0.1 | weak evidence against `\(H_0\)` | | p-value > 0.1 | effectively no evidence against `\(H_0\)` | <br> <br> **Notes:** - These are general guidelines. The strength of evidence depends on the context of the problem. - Don't just rely on the reject/fail to reject conclusion from the hypothesis test to draw conclusions about the relationship between `\(x\)` and `\(y\)`. Use the EDA and confidence intervals to provide more context about the implications of your results. --- ### Movie data: Hypothesis test for `\(\beta_1\)` ```r model %>% tidy() %>% kable(format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| |:-----------|--------:|---------:|---------:|-------:| |(Intercept) | 32.316| 2.343| 13.795| 0| |critics | 0.519| 0.035| 15.028| 0| .instructions[ a. State the hypotheses in (1) words and (2) statistical notation. b. What is the meaning of the test statistic in the context of the problem? c. What is the meaning of the p-value in the context of the problem? d. State the conclusion in context of the problem. ] --- class: middle, center ### Predictions --- class: regular ### Predictions for New Observations - We can use the regression model to predict for a response at `\(x_0\)` `$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_0$$` <br> - Because the regression models produces the mean response for a given value of `\(x_0\)`, it will produce the same estimate whether we want to predict the mean response at `\(x_0\)` or an individual response at `\(x_0\)` --- class: regular ### Movies Data .instructions[ What is the predicted audience score **for a movie** that has a critic score of 60%? ] <br><br> .instructions[ What is the predicted average audience score **for the subset of movies** that have a critic score of 60%? ] --- class: regular ### Predictions for New Observations - There is uncertainty in our predictions, so we need to calculate an a standard error (SE) to capture the uncertainty - The SE is different depending on whether you are predicting an average value or an individual value - SE is larger when predicting for an individual value than for an average value --- ### Standard errors for predictions .alert[ **Predicting the mean response** `$$SE(\hat{\mu}) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}$$` ] <br><br> -- .alert[ **Predicting an individual response** `$$SE(\hat{y}) = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}$$` ] --- ### Movies data: Predicting an individual response We wish to predict the audience score for a single movie with critics score of 60%. ```r x0 <- data.frame(critics = c(60)) *predict.lm(model, x0, interval = "prediction", * conf.level = 0.95) ``` ``` ## fit lwr upr ## 1 63.43619 38.56876 88.30362 ``` .instructions[ Interpret the interval in the context of the data. ] --- ### Movie data: Predicting the mean response We wish to predict the <font class="vocab">mean</font> audience score for the subset of movies with a critics score of 60%. ```r x0 <- data.frame(critics = c(60)) *predict.lm(model, x0, interval = "confidence", * conf.level = 0.95) ``` ``` ## fit lwr upr ## 1 63.43619 61.38435 65.48804 ``` .instructions[ Interpret the interval in the context of the data. How does the predicted value compare to the prediction for an individual movie? How does the interval compare? ]