class: center, middle, inverse, title-slide # Simple Linear Regression ### Dr. Maria Tackett ### 09.04.19 --- class: middle, center ### [Click for PDF of slides](03-slr.pdf) --- ## Announcements - Lab 01 due **TODAY at 11:59p** - Will work in your lab groups starting tomorrow --- ## Check in - Any questions from last class? - Any questions about the lab? - Any questions about course logistics? --- ## Today's Agenda - Motivating Regression Analysis - Simple Linear Regression - Estimating & interpreting coefficients - Assessing model fit: `\(R^2\)` - Residuals and model assumptions --- ## Packages and Data ```r library(tidyverse) library(broom) library(modelr) library(knitr) library(fivethirtyeight) #fandango dataset library(cowplot) #plot_grid() function ``` ```r movie_scores <- fandango %>% rename(critics = rottentomatoes, audience = rottentomatoes_user) ``` --- class: middle, center ### Motivating Regression Analysis --- ## rottentomatoes.com Can the ratings from movie critics be used to predict what movies the audience will like? -- <img src="img/03/movie-rating-1.png" width="70%" style="display: block; margin: auto;" /> -- <img src="img/03/movie-rating-2.png" width="70%" style="display: block; margin: auto;" /> --- ## Critic vs. Audience Ratings - To answer this question, we will analyze the critic and audience scores from rottentomatoes.com. - The data was first used in the article [Be Suspicious of Online Movie Ratings, Especially Fandango's](https://fivethirtyeight.com/features/fandango-movies-ratings/). 
- Variables: - **`critics`**: Tomatometer score for the film (0 - 100) - **`audience`**: Audience score for the film (0 - 100) --- ## `glimpse` of the data .small[ ```r glimpse(movie_scores) ``` ``` ## Observations: 146 ## Variables: 23 ## $ film <chr> "Avengers: Age of Ultron", "Cinderell… ## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2… ## $ critics <int> 74, 85, 80, 18, 14, 63, 42, 86, 99, 8… ## $ audience <int> 86, 80, 90, 84, 28, 62, 53, 64, 82, 8… ## $ metacritic <int> 66, 67, 64, 22, 29, 50, 53, 81, 81, 8… ## $ metacritic_user <dbl> 7.1, 7.5, 8.1, 4.7, 3.4, 6.8, 7.6, 6.… ## $ imdb <dbl> 7.8, 7.1, 7.8, 5.4, 5.1, 7.2, 6.9, 6.… ## $ fandango_stars <dbl> 5.0, 5.0, 5.0, 5.0, 3.5, 4.5, 4.0, 4.… ## $ fandango_ratingvalue <dbl> 4.5, 4.5, 4.5, 4.5, 3.0, 4.0, 3.5, 3.… ## $ rt_norm <dbl> 3.70, 4.25, 4.00, 0.90, 0.70, 3.15, 2… ## $ rt_user_norm <dbl> 4.30, 4.00, 4.50, 4.20, 1.40, 3.10, 2… ## $ metacritic_norm <dbl> 3.30, 3.35, 3.20, 1.10, 1.45, 2.50, 2… ## $ metacritic_user_nom <dbl> 3.55, 3.75, 4.05, 2.35, 1.70, 3.40, 3… ## $ imdb_norm <dbl> 3.90, 3.55, 3.90, 2.70, 2.55, 3.60, 3… ## $ rt_norm_round <dbl> 3.5, 4.5, 4.0, 1.0, 0.5, 3.0, 2.0, 4.… ## $ rt_user_norm_round <dbl> 4.5, 4.0, 4.5, 4.0, 1.5, 3.0, 2.5, 3.… ## $ metacritic_norm_round <dbl> 3.5, 3.5, 3.0, 1.0, 1.5, 2.5, 2.5, 4.… ## $ metacritic_user_norm_round <dbl> 3.5, 4.0, 4.0, 2.5, 1.5, 3.5, 4.0, 3.… ## $ imdb_norm_round <dbl> 4.0, 3.5, 4.0, 2.5, 2.5, 3.5, 3.5, 3.… ## $ metacritic_user_vote_count <int> 1330, 249, 627, 31, 88, 34, 17, 124, … ## $ imdb_user_vote_count <int> 271107, 65709, 103660, 3136, 19560, 3… ## $ fandango_votes <int> 14846, 12640, 12055, 1793, 1021, 397,… ## $ fandango_difference <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.… ``` ] --- .small[ ```r p1 <- ggplot(data = movie_scores, mapping = aes(x = critics)) + geom_histogram() p2 <- ggplot(data = movie_scores, mapping = aes(x = audience)) + geom_histogram() plot_grid(p1, p2, ncol = 2) ``` <img
src="03-slr_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- .small[ ```r ggplot(data = movie_scores, mapping = aes(x = critics, y = audience)) + geom_point() + labs(title = "Audience Score vs. Critics Score") ``` <img src="03-slr_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] --- .small[ ```r ggplot(data = movie_scores, mapping = aes(x = critics, y = audience)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Audience Score vs. Critics Score") ``` <img src="03-slr_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- ## Terminology - **`audience`** is the <font class="vocab">response variable `\((Y)\)`</font> - variable whose variation we want to understand and/or variable we wish to predict - also known as the *dependent*, *outcome*, or *target* variable <br> - **`critics`** is the <font class="vocab">predictor variable `\((X)\)`</font> - variable used to account for variation in the response - also known as the *independent* or *explanatory* variable --- ## Model .alert[ `$$\text{audience} = f(\text{critics}) + \epsilon$$` ] - We want to estimate `\(f\)`. How do we do it?
-- - <font class="vocab">Non-parametric methods:</font> estimate `\(f\)` by getting as close to the data points as possible without making explicit assumptions about the functional form of `\(f\)` -- - <font class="vocab">Parametric methods:</font> estimate `\(f\)` by making an assumption about the functional form of `\(f\)`, then using the data to fit a model based on that form --- class: middle, center .alert[ In this class, we will focus on <font class="vocab">parametric methods</font> ] <small> *Non-parametric methods are covered in STA 325: Machine Learning & Data Mining* </small> --- ## General form of model .alert[ `$$Y = f(\mathbf{X}) + \epsilon$$` ] - `\(Y\)`: quantitative response variable - `\(\mathbf{X} = (X_1, X_2, \ldots, X_p)\)`: predictor variables - `\(f\)`: fixed but unknown function - systematic information `\(\mathbf{X}\)` provides about `\(Y\)` - `\(\epsilon\)`: random error term with mean 0 that is independent of `\(\mathbf{X}\)` --- ## How to estimate `\(f\)`? In general, we will use the following steps to estimate `\(f\)` - Choose the functional form of `\(f\)`, i.e. <font class="vocab">choose the appropriate model given the data</font> - Ex: `\(f\)` is a linear model `$$f(\mathbf{X}) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$` <br> -- - Use the data to fit the model, i.e. <font class="vocab">estimate the model parameters</font> - Ex: Use a method to estimate the model parameters `\(\beta_0, \beta_1, \ldots, \beta_p\)` --- ## Why estimate `\(f\)`? Suppose we have the model `$$\text{audience} = \beta_0 + \beta_1 \times \text{critics} + \epsilon$$` <br><br><br> -- .question[ What are 1-2 questions you can answer using this model? ] --- ## Why estimate `\(f\)`? There are two types of questions we may wish to answer using our model: - <font class="vocab">Prediction: </font> What is the expected `\(Y\)` given particular values of `\(X_1, X_2, \ldots, X_p\)`?
-- - Ex: What is the expected audience score for a movie that receives a critic score of 70%? -- - <font class="vocab">Inference: </font> What is the relationship between `\(\mathbf{X}\)` and `\(Y\)`? How does `\(Y\)` change as a function of `\(\mathbf{X}\)`? -- - Ex: How much can we expect the audience score to change for each additional point in the critic score? --- ## Prediction `$$\hat{\text{audience}} = \hat{f}(\text{critics})$$` - **Question**: How accurate is `\(\hat{\text{audience}}\)` as a prediction of `\(\text{audience}\)`? - This depends on two quantities: reducible and irreducible error -- - <font class="vocab">Reducible error:</font> error that can potentially be reduced by using a model, `\(\hat{f}\)`, that is a more appropriate fit for the data -- - <font class="vocab">Irreducible error: </font> Error that cannot be reduced even if the model, `\(\hat{f}\)`, is a perfect fit. In other words, this is the variability associated with `\(\epsilon\)` --- class: middle, center .question[ In the case of the `movie_scores` data, what types of factors are included in the irreducible error? ] --- ## Inference `$$\hat{\text{audience}} = \hat{f}(\text{critics})$$` - **Question**: How is `audience` affected as `critics` changes? -- - We can break this into several questions: -- - Is the critic score an important predictor of the audience score? -- - What is the relationship between `audience` and `critics`? -- - Does a linear model adequately describe the relationship between `audience` and `critics`, or is a more complex model required? .alert[ This semester, we will learn how to fit and interpret models to answer these questions using different types of data.
] --- ## Course Outline - **Unit 1: Quantitative Response Variables** - Simple Linear Regression - Multiple Linear Regression <br> - **Unit 2: Categorical Response Variable** - Logistic Regression - Multinomial Logistic Regression <br> - **Unit 3: Modeling in Practice** - Dealing with messy data - Special topics --- class: middle, center ## Simple Linear Regression --- ## Least-Squares Regression - There is some true relationship between `\(X\)` and `\(Y\)` that exists in the population `$$Y = f(X) + \epsilon$$` -- - If `\(f\)` is approximated by a linear function, then we can write the relationship as `$$Y = \beta_0 + \beta_1 X + \epsilon$$` -- - We'll estimate the slope and intercept of this linear function using <font class = "vocab">least-squares regression</font> -- - We'll use statistical inference to determine if the relationship we observe in the data is statistically significant or if it's due to random chance (we'll talk about this more next class) --- ## Regression Model `$$Y|X \sim N(\beta_0 + \beta_1 X, \sigma^2)$$` <img src="img/02/regression.png" width="80%" style="display: block; margin: auto;" /> - `\(\sigma\)`: the standard deviation of `\(Y\)` for a given value of `\(X\)` - **<i>Assumption</i>:** `\(\sigma\)` is equal for all values of `\(X\)` --- ## Regression Model .alert[ `$$Y|X \sim N(\beta_0 + \beta_1 X, \sigma^2)$$` ] -- - For a single observation `\((x_i, y_i)\)` `$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i \hspace{10mm} \epsilon_i \sim N(0,\sigma^2)$$` <br><br> -- - We want to use the `\(n\)` observations `\((x_1,y_1), \ldots, (x_n, y_n)\)` to estimate `\(\beta_0\)` and `\(\beta_1\)`. We will use *least-squares regression* estimates. --- ## Residuals <img src="03-slr_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> The .vocab[residual] is the difference between the observed and predicted response.
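---

## Computing Residuals in R

As a quick sketch of the definition above, a residual is just the observed response minus the predicted response. The `model` object is fit formally on a later slide; we fit it here so the chunk is self-contained.

```r
library(tidyverse)
library(fivethirtyeight)  # fandango dataset

movie_scores <- fandango %>%
  rename(critics = rottentomatoes, audience = rottentomatoes_user)

# Fit the least-squares line (the same model used throughout these slides)
model <- lm(audience ~ critics, data = movie_scores)

# Residual for the first movie: observed minus predicted
movie_scores$audience[1] - predict(model)[1]

# resid() returns all n residuals at once
head(resid(model))
```

The two computations agree: `resid(model)` is exactly `movie_scores$audience - predict(model)`.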
--- ## Residual Sum of Squares - The residual for the `\(i^{th}\)` observation is `$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$` -- - The *residual sum of squares* is `$$RSS = e^2_1 + e^2_2 + \dots + e^2_n$$` -- - The .vocab[least-squares regression] approach chooses coefficients `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` to minimize RSS. --- ## Estimating Coefficients - .vocab[Slope:] `$$\hat\beta_1 = \frac{\sum\limits_{i=1}^n(x_i-\bar{x})(y_i - \bar{y})}{\sum\limits_{i=1}^n(x_i-\bar{x})^2} = r\frac{s_y}{s_x}$$` where `\(r\)` is the correlation between `\(x\)` and `\(y\)`. <br><br> - .vocab[Intercept:] `\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\)` --- ## Least-Squares Model ```r model <- lm(audience ~ critics, data = movie_scores) tidy(model) %>% kable(format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| |:-----------|--------:|---------:|---------:|-------:| |(Intercept) | 32.316| 2.343| 13.795| 0| |critics | 0.519| 0.035| 15.028| 0| <br><br> `$$\hat{\text{audience}} = 32.316 + 0.519 \times \text{critics}$$` --- ## Interpreting Slope & Intercept - <font class="vocab">Slope: </font> Change in the mean response for every one-unit increase in the predictor variable - <font class="vocab">Intercept: </font> Mean response when the explanatory variable equals 0 <br><br> -- - The regression equation for the Rotten Tomatoes data is `$$\hat{\text{audience}} = 32.316 + 0.519 \times \text{critics}$$` .instructions[ Write the interpretation of the slope and intercept ] --- ## Nonsensical Intercept - Sometimes it doesn't make sense to interpret the intercept + When the predictor variable doesn't take values close to 0 + When the intercept is negative even though the response variable should always be positive - The intercept helps the line fit the data as closely as possible - It is fine to have a nonsensical intercept if it helps the model give better overall predictions --- ### Does it make sense to
interpret the intercept? - Example 1: - **Explanatory**: number of home runs in a baseball game - **Response**: attendance at the next baseball game <br><br> - Example 2: - **Explanatory**: height of a person - **Response**: weight of a person --- class: middle, center ## Assessing Model Fit --- ## `\(R^2\)` We can use the coefficient of determination, `\(R^2\)`, to measure how well the model fits the data - `\(R^2\)` is the proportion of variation in `\(Y\)` that is explained by the regression line (often reported as a percentage) - It is difficult to determine what's a "good" value of `\(R^2\)`. It depends on the context of the data. --- ## Calculating `\(R^2\)` .instructions[ `$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$` ] - <font class="vocab">Total Sum of Squares (TSS): </font>Total variation in the `\(Y\)`'s before fitting the regression `$$\text{TSS}= \sum\limits_{i=1}^{n}(y_i - \bar{y})^2 = (n-1)s^2_y$$` - <font class="vocab">Residual Sum of Squares (RSS): </font>Total variation in the `\(Y\)`'s around the regression line (sum of squared residuals) `$$\text{RSS} = \sum\limits_{i=1}^{n}[y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i)]^2$$` --- ## Rotten Tomatoes Data ```r rsquare(model, movie_scores) ``` ``` ## [1] 0.6106479 ``` .alert[ The critics score explains about 61% of the variation in audience scores on rottentomatoes.com. ] --- class: middle, center ## Checking Model Assumptions --- ## Assumptions for Regression 1. <font class="vocab">Linearity: </font>The plot of the mean value for `\(y\)` against `\(x\)` falls on a straight line 2. <font class="vocab">Constant Variance: </font>The regression variance is the same for all values of `\(x\)` 3. <font class="vocab">Normality: </font> For a given `\(x\)`, the distribution of `\(y\)` around its mean is Normal 4.
<font class="vocab">Independence: </font>All observations are independent --- ## Checking Assumptions We can use plots of the residuals to check the assumptions for regression. 1. Scatterplot of `\(Y\)` vs. `\(X\)` (linearity). - Check this before fitting the regression model. 2. Plot of residuals vs. predictor variable (constant variance, linearity) 3. Histogram and Normal QQ-Plot of residuals (Normality) --- ## Residuals vs. Predictor - When all the assumptions are true, the values of the residuals reflect random (chance) error - We can look at a plot of the residuals vs. the predictor variable - There should be no distinguishable pattern in the residuals plot, i.e. the residuals should be randomly scattered - A non-random pattern suggests assumptions might be violated --- ## Plots of Residuals <img src="img/02/resid_plots.png" width="80%" style="display: block; margin: auto;" /> --- ```r movie_scores <- movie_scores %>% mutate(residuals = resid(model)) ``` ```r ggplot(data = movie_scores, mapping = aes(x = critics, y = residuals)) + geom_point() + geom_hline(yintercept = 0, color = "red") + labs(title = "Residuals vs. 
Critics Score") ``` <img src="03-slr_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ## Checking Normality - Examine the distribution of the residuals to determine if the Normality assumption is satisfied - Plot the residuals in a histogram and a Normal QQ plot to visualize their distribution and assess Normality - Most inference methods for regression are robust to some departures from Normality --- ## Normal QQ-Plot <img src="img/02/normal_qqplot.png" width="100%" style="display: block; margin: auto;" /> --- ```r ggplot(data = movie_scores, mapping = aes(x = residuals)) + geom_histogram() + labs(title = "Distribution of Residuals") ``` <img src="03-slr_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- ```r ggplot(data = movie_scores, mapping = aes(sample = residuals)) + stat_qq() + stat_qq_line() + labs(title = "Normal QQ Plot of Residuals") ``` <img src="03-slr_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- ## Checking Independence - Often, we can conclude that the independence assumption is sufficiently met based on a description of the data and how it was collected. - Two common violations of the independence assumption: - <font class="vocab">Serial Effect</font>: If the data were collected over time, the residuals should be plotted in time order to determine if there is serial correlation - <font class="vocab">Cluster Effect</font>: You can plot the residuals vs. a group identifier or use different markers (colors/shapes) in the residual plot to determine if there is a cluster effect. --- ### Example: Birth rate vs. Per Capita Income - Pew Research did a <a href="http://www.pewsocialtrends.org/2011/10/12/in-a-down-economy-fewer-births/" target="_blank">study in 2011</a> looking at the economy's effect on the birth rate in the United States. - We will look at data for Virginia and Washington D.C. for
years 2000 - 2009 - Birth rate: Births per 100,000 women ages 15-44 - Per Capita Income: average income per person --- <small> ```r ggplot(data = pew_data, mapping = aes(x = percapitaincome, y = birthrate)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(title = "Birth rate vs. Per Capita Income", x = "Per Capita Income ($)", y = "Birth Rate") ``` <img src="03-slr_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> </small> --- ## Birthrate vs. Per Capita Income ```r pew_model <- lm(birthrate ~ percapitaincome, data = pew_data) tidy(pew_model) %>% kable(format = "markdown", digits = 3) ``` |term | estimate| std.error| statistic| p.value| |:---------------|--------:|---------:|---------:|-------:| |(Intercept) | -18.218| 15.33| -1.188| 0.25| |percapitaincome | 0.002| 0.00| 5.125| 0.00| <br><br> `$$\hat{\text{Birth Rate}} = -18.2 + 0.0019 \times \text{Per Capita Income}$$` --- ### Residuals vs. Explanatory Variable <small> ```r pew_data <- pew_data %>% mutate(residuals = resid(pew_model)) ``` <img src="03-slr_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> </small> --- ### Residuals: Cluster Effect <img src="03-slr_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> --- ### Residuals: Serial Effect <img src="03-slr_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> --- ## Recap - Motivating Regression Analysis - Simple Linear Regression - Estimating & interpreting coefficients - Assessing model fit: `\(R^2\)` - Residuals and model assumptions
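---

## Appendix: Coefficients by Hand

As a quick check of the least-squares formulas, the slope `\(r\frac{s_y}{s_x}\)` and intercept `\(\bar{y} - \hat{\beta}_1\bar{x}\)` can be computed directly and compared to `lm()`. A minimal sketch, assuming `movie_scores` has been created as in the earlier slides:

```r
library(tidyverse)
library(fivethirtyeight)  # fandango dataset

movie_scores <- fandango %>%
  rename(critics = rottentomatoes, audience = rottentomatoes_user)

x <- movie_scores$critics
y <- movie_scores$audience

# Slope: correlation times the ratio of standard deviations
b1 <- cor(x, y) * sd(y) / sd(x)

# Intercept: the line passes through the point of means
b0 <- mean(y) - b1 * mean(x)

# Compare the hand computations to lm()
c(b0, b1)
coef(lm(audience ~ critics, data = movie_scores))
```

The two agree exactly, reproducing the estimates reported earlier (intercept 32.316, slope 0.519), since the formulas above are precisely what least-squares minimization yields.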