class: center, middle, inverse, title-slide

# Inference for Regression

## Intro to Data Science

### Shawn Santo

### 04-09-20

---

## Video table of contents

1. Introduction and a simulation-based confidence interval for `\(\beta_1\)`:
   [slides 3 - 16](https://warpwire.duke.edu/w/g5oDAA/)
<br/><br/>
2. A simulation-based hypothesis test for `\(\beta_1\)`:
   [slides 17 - 24](https://warpwire.duke.edu/w/hZoDAA/)
<br/><br/>
3. CLT-based inference for regression:
   [slides 25 - 35](https://warpwire.duke.edu/w/jZoDAA/)

---

## Announcements

- Homework 4 due 04-09-20 at 11:59pm ET; remember, you get to drop your lowest homework score

- Lab 08 due 04-10-20 at 11:59pm ET

- The last graded lab assignment, Lab 09, is out on 04-10-20; remember, you get to drop your lowest lab score

- Prepare for Exam 2; the focus will be on inference and model building

---

class: middle, center, inverse

# Inference for Regression

---

## Riders in Florence, MA

Packages:

```r
library(tidyverse)
library(infer)
library(broom)      # tidy model results
library(mosaicData) # package where our data lives
```

Data: The Pioneer Valley Planning Commission collected data in Florence, MA on 90 days between April 5 and November 15, 2005. A laser sensor recorded each break in its beam, indicating that a rail-trail user had passed the data collection station.

```r
data(RailTrail)
```

---

## Data preview

```r
glimpse(RailTrail)
```

```
#> Observations: 90
#> Variables: 11
#> $ hightemp   <int> 83, 73, 74, 95, 44, 69, 66, 66, 80, 79, 78, 65, 41, 5…
#> $ lowtemp    <int> 50, 49, 52, 61, 52, 54, 39, 38, 55, 45, 55, 48, 49, 3…
#> $ avgtemp    <dbl> 66.5, 61.0, 63.0, 78.0, 48.0, 61.5, 52.5, 52.0, 67.5,…
#> $ spring     <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,…
#> $ summer     <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,…
#> $ fall       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ cloudcover <dbl> 7.6, 6.3, 7.5, 2.6, 10.0, 6.6, 2.4, 0.0, 3.8, 4.1, 8.…
#> $ precip     <dbl> 0.00, 0.29, 0.32, 0.00, 0.14, 0.02, 0.00, 0.00, 0.00,…
#> $ volume     <int> 501, 419, 397, 385, 200, 375, 417, 629, 533, 547, 432…
#> $ weekday    <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FAL…
#> $ dayType    <chr> "weekday", "weekday", "weekday", "weekend", "weekday"…
```

We'll focus on two variables:

- `hightemp`: daily high temperature (in degrees Fahrenheit)
- `volume`: estimated number of trail users that day (number of breaks recorded)

---

## Riders in Florence, MA

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

???

```r
gg_scatter <- ggplot(RailTrail, mapping = aes(x = hightemp, y = volume)) +
  geom_point() +
  labs(y = "Volume", x = "High Temperature (F)") +
  theme_minimal(base_size = 16)

gg_scatter +
  geom_smooth(method = "lm", se = FALSE, color = "red")
```

---

## Coefficient interpretation

Interpret the coefficients of the regression model for predicting `volume` (estimated number of trail users that day) from `hightemp` (daily high temperature, in degrees Fahrenheit).

```r
m_riders <- lm(volume ~ hightemp, data = RailTrail)
tidy(m_riders) %>%
  select(term, estimate)
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)   -17.1 
#> 2 hightemp        5.70
```

---

## Variability in our estimates

Similar to the sample mean and sample proportion, there exists a sampling distribution for our intercept and slope estimates. The intercept and slope values we obtain from our fitted regression line are only point estimates of the true (unknown) population parameters.
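To see this variability directly, here is a minimal sketch (not part of the original deck) that refits the model on a few resamples of the data; the seed and the number of resamples are arbitrary.

```r
# Each resample (drawn with replacement from the 90 days) yields a somewhat
# different slope estimate -- a preview of the bootstrap idea below
set.seed(4920)
replicate(3, {
  resample <- RailTrail[sample(nrow(RailTrail), replace = TRUE), ]
  coef(lm(volume ~ hightemp, data = resample))[["hightemp"]]
})
```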
<br/><br/>

A natural question to ask: do the data provide convincing evidence that the true linear model has a non-zero slope?

<br/><br/>

Using the techniques we have learned before, we can create confidence intervals and perform hypothesis tests for the population regression coefficients.

---

class: center, middle, inverse

# A confidence interval for `\(\beta_1\)`

---

## Bootstrapping the data

To create a 95% confidence interval:

1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.

2. Calculate the bootstrap statistic - fit a linear model on the bootstrapped data and extract the estimate for the slope.

3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.

4. Calculate the bounds of the 95% confidence interval as the middle 95% of the bootstrap distribution.

---

## Bootstrap animation

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-7-.gif" style="display: block; margin: auto;" />

???

```r
boot_samples <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  generate(reps = 100, type = "bootstrap")

for (i in 1:100) {
  plot <- boot_samples %>%
    filter(replicate == i) %>%
    ggplot(aes(x = hightemp, y = volume)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    ggpubr::stat_regline_equation(label.x = 40, label.y = 650,
                                  color = "red", size = 5) +
    labs(y = "Volume", x = "High Temperature (F)",
         title = paste0("Bootstrap sample ", i)) +
    theme_minimal(base_size = 16) +
    xlim(40, 100) +
    ylim(120, 750)

  print(plot)
}
```

---

## All those bootstrapped lines

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

???

```r
ggplot(boot_samples, aes(x = hightemp, y = volume, color = factor(replicate))) +
  geom_smooth(method = "lm", se = FALSE, lwd = 0.2) +
  labs(y = "Volume", x = "High Temperature (F)",
       title = "Bootstrapped regression lines") +
  theme_minimal(base_size = 16) +
  xlim(40, 100) +
  ylim(120, 750) +
  theme(legend.position = "none") +
  scale_color_manual(values = rep("gray", 100))
```

---

## Visualize

.tiny[

```r
boot_dist <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  generate(reps = 5000, type = "bootstrap") %>%
* calculate(stat = "slope")
```

```r
boot_dist %>%
  visualize() +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---

## Confidence interval for `\(\beta_1\)`

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

???

```r
ci <- boot_dist %>%
  get_ci()

boot_dist %>%
  visualize() +
  shade_ci(ci, level = 0.95) +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Confidence interval for `\(\beta_1\)`

Interpret the bootstrap interval in the context of the data. Below are two ways to obtain the interval endpoints.

Using `infer`:

.tiny[

```r
boot_dist %>%
  get_ci()
```

```
#> # A tibble: 1 x 2
#>   `2.5%` `97.5%`
#>    <dbl>   <dbl>
#> 1   4.21    7.39
```
]

Using `dplyr`:

.tiny[

```r
boot_dist %>%
  summarise(lb = quantile(stat, 0.025),
            ub = quantile(stat, 0.975))
```

```
#> # A tibble: 1 x 2
#>      lb    ub
#>   <dbl> <dbl>
#> 1  4.21  7.39
```
]
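As an aside (not in the original deck), `infer` can also build a standard-error bootstrap interval from the same `boot_dist`; a minimal sketch, where the observed slope is computed with `calculate(stat = "slope")` just as `obs_b1` is later:

```r
# Observed slope, needed as the point estimate for the SE method
obs_slope <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  calculate(stat = "slope")

# Interval of the form: observed slope +/- critical value * bootstrap SE
boot_dist %>%
  get_confidence_interval(level = 0.95, type = "se",
                          point_estimate = obs_slope)
```

Since the bootstrap distribution here is roughly symmetric, this interval should be close to the percentile interval above.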
---

class: center, middle, inverse

# A hypothesis test for `\(\beta_1\)`

---

## Hypothesis testing for `\(\beta_1\)`

Based on our sample data, we might wonder: is there convincing evidence that temperature is associated with the volume of riders?

We can set this up as a hypothesis test, where

<br/>

--

`\(H_0: \beta_1 = 0\)`. There is no linear relationship between volume and temperature; the slope is 0.

`\(H_A: \beta_1 \ne 0\)`. There is a linear relationship between volume and temperature.

--

<br/>

We would only reject `\(H_0\)` in favor of `\(H_A\)` if the data provide strong evidence that the true slope parameter is different from zero. We can assess this with a simulation-based hypothesis test.

---

## Compute observed slope

.tiny[

```r
obs_b1 <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  calculate(stat = "slope")

obs_b1
```

```
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  5.70
```
]

This is the same as doing

.tiny[

```r
lm(volume ~ hightemp, data = RailTrail) %>%
  tidy() %>%
  select(term, estimate)
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)   -17.1 
#> 2 hightemp        5.70
```
]

and extracting the slope estimate. Use whichever method you like, although I think the first method is easiest, as it generalizes to any statistic `infer` supports.

---

## Simulate the null distribution

```r
null_dist <- RailTrail %>%
  specify(response = volume, explanatory = hightemp) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
* calculate(stat = "slope")
```

--

```r
null_dist
```

```
#> # A tibble: 5,000 x 2
#>    replicate   stat
#>        <int>  <dbl>
#>  1         1  0.112
#>  2         2  1.90 
#>  3         3  0.223
#>  4         4  1.35 
#>  5         5 -0.295
#>  6         6 -0.417
#>  7         7 -1.11 
#>  8         8  0.879
#>  9         9  2.05 
#> 10        10  0.516
#> # … with 4,990 more rows
```

---

## Visualize the distribution under `\(H_0\)`

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

???

```r
null_dist %>%
  visualize() +
  xlim(-6, 6) +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Visualize the p-value

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

???

```r
null_dist %>%
  visualize() +
* shade_p_value(obs_stat = obs_b1, direction = "two_sided") +
  xlim(-6, 6) +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Computing the p-value

```r
null_dist %>%
  get_p_value(obs_b1, direction = "two_sided")
```

```
#> # A tibble: 1 x 1
#>   p_value
#>     <dbl>
#> 1       0
```

A reported p-value of 0 means none of the 5,000 permuted slopes were as extreme as the observed slope; the true p-value is not exactly 0, just very small.

--

<br/>

What is our conclusion?

---

## Shortcomings of `infer`

- It does not support a multiple regression setting.

- It does not support a logistic regression setting.

--

<br/>

<b>How can we conduct inference with regard to the parameters in a multiple regression framework?</b>

---

class: center, middle, inverse

# CLT-based inference for regression

---

## Hypothesis testing for the slope

Recall our fitted model `m_riders`.

```r
tidy(m_riders)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic       p.value
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774        
#> 2 hightemp        5.70     0.848     6.72  0.00000000171
```

--

<br/>

How can we use this information to conduct a hypothesis test for the slope? Also, when can we use this information?

---

## Conditions for inference for regression

Four conditions (see the diagnostic-plot aside after this list):

--

1. Observations should be independent

--

2. Residuals should be randomly distributed around 0

--

3. Residuals should be nearly normally distributed, centered at 0

--

4. Residuals should have constant variance
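--

As an aside (not in the original deck), base R's `plot()` method for `lm` objects gives a quick first look at conditions 2 - 4 in one call; a minimal sketch:

```r
# Standard lm diagnostics: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(m_riders)
```

Below, we verify each condition with `ggplot2` instead.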
---

## Condition 1

**Verify** observations are independent.

One consideration might be the time series structure of the data. We can check whether one residual depends on the previous one by plotting the residuals in the order of data collection.

.tiny[

```r
m_riders_aug <- augment(m_riders)

ggplot(data = m_riders_aug, aes(x = 1:nrow(m_riders_aug), y = .resid)) +
  geom_point() +
  labs(x = "Index", y = "Residual") +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## Conditions 2, 4

**Verify** the residuals are centered around 0 and have constant variance.

.tiny[

```r
ggplot(data = m_riders_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lty = 3, color = "gray") +
  labs(y = "Residuals", x = expression(hat(y))) +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

---

## Condition 3

**Verify** normality of residuals.

.tiny[

```r
ggplot(data = m_riders_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Residuals", y = "Count") +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

## Tidy model output

```r
tidy(m_riders)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic       p.value
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774        
#> 2 hightemp        5.70     0.848     6.72  0.00000000171
```

--

A 95% confidence interval for `\(\beta_1\)` is given by

```r
broom::confint_tidy(m_riders, conf.level = 0.95) %>%
  slice(2)
```

```
#> # A tibble: 1 x 2
#>   conf.low conf.high
#>      <dbl>     <dbl>
#> 1     4.02      7.39
```

---

You could also use:

```r
tidy(m_riders, conf.int = TRUE, conf.level = 0.95)
```

```
#> # A tibble: 2 x 7
#>   term        estimate std.error statistic       p.value conf.low conf.high
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774           -135.     101.  
#> 2 hightemp        5.70     0.848     6.72  0.00000000171      4.02      7.39
```

--

<br/>

Using the model output directly and the function `qt()`, we can also obtain the 95% confidence interval by hand. Note that the relevant t-distribution has `\(n - 2 = 88\)` degrees of freedom.

```r
c(5.7 - qt(.975, 88) * 0.848, 5.7 + qt(.975, 88) * 0.848)
```

```
#> [1] 4.014778 7.385222
```

**Do not calculate it this way. Use `tidy()` or `confint_tidy()` from `broom`.**

---

## What about those p-values?

```r
tidy(m_riders)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic       p.value
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774        
#> 2 hightemp        5.70     0.848     6.72  0.00000000171
```

<br/>

The p-values in the output of `tidy()` are the two-sided p-values associated with the observed statistics. For each coefficient, the test under consideration is

`$$H_0: \beta_k = 0$$`
`$$H_A: \beta_k \neq 0$$`

Based on the result above, as was the case with the simulation-based approach, we reject `\(H_0\)` in favor of `\(H_A\)` for the test involving the slope. Hence, we have enough evidence to suggest that there is an association between volume and the daily high temperature.

---

## Application exercise

https://classroom.github.com/a/IOdhZU0x

---

## References

1. Tidy Statistical Inference. (2020). infer.netlify.com. Retrieved 3 April 2020, from https://infer.netlify.com/index.html

2. Convert Statistical Objects into Tidy Tibbles. (2020). broom.tidyverse.org. Retrieved 5 April 2020, from https://broom.tidyverse.org/