class: center, middle, inverse, title-slide # Inference for regression
📏 --- layout: true <div class="my-footer"> <span> Dr. Mine Çetinkaya-Rundel - <a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01 </a> </span> </div> --- ## Announcements - New teams assigned --- class: center, middle # Inference for regression --- ## Riders in Florence, MA .small[ The Pioneer Valley Planning Commission collected data in Florence, MA for 90 days from April 5 to November 15, 2005 using a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station. - `hightemp` daily high temperature (in degrees Fahrenheit) - `volume` estimated number of trail users that day (number of breaks recorded) ] ```r library(mosaicData) ``` <!-- --> --- ## Coefficient interpretation .question[ Interpret the coefficients of the regression model for predicting volume (estimated number of trail users that day) from hightemp (daily high temperature, in F). ] ```r m_riders <- lm(volume ~ hightemp, data = RailTrail) tidy(m_riders) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -17.1 59.4 -0.288 0.774 ## 2 hightemp 5.70 0.848 6.72 0.00000000171 ``` --- ## Uncertainty around the slope <!-- --> --- ## Bootstrapping the data, once <!-- --> ``` ## # A tibble: 2 x 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -145. ## 2 hightemp 7.76 ``` --- ## Bootstrapping the data, once again <!-- --> ``` ## # A tibble: 2 x 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -32.5 ## 2 hightemp 5.89 ``` --- ## Bootstrapping the data, again <!-- --> ``` ## # A tibble: 2 x 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) -114. ## 2 hightemp 7.30 ``` --- ## Bootstrapping the regression line <!-- --> --- ## Bootstrap interval for the slope ```r boot_dist <- RailTrail %>% specify(response = volume, explanatory = hightemp) %>% generate(reps = 1000, type = "bootstrap") %>% calculate(stat = "slope") ``` ``` ## # A tibble: 1 x 2 ## l u ## <dbl> <dbl> ## 1 4.19 7.48 ``` <!-- --> --- ## Bootstrap interval for the slope .question[ Interpret the bootstrap interval in context of the data. ] ```r boot_dist %>% summarise(l = quantile(stat, 0.025), u = quantile(stat, 0.975)) ``` ``` ## # A tibble: 1 x 2 ## l u ## <dbl> <dbl> ## 1 4.19 7.48 ``` --- ## Hypothesis testing for the slope `\(H_0\)`: No relationship, `\(\beta_1 = 0\)` `\(H_A\)`: There is a relationship, `\(\beta_1 \ne 0\)` -- ```r null_dist <- RailTrail %>% specify(volume ~ hightemp) %>% hypothesize(null = "independence") %>% generate(reps = 1000, type = "permute") %>% calculate(stat = "slope") ``` -- ``` ## Warning: Removed 2 rows containing missing values (geom_bar). ``` <!-- --> --- ## Finding the p-value ```r null_dist %>% filter(stat >= m_riders$coefficients[2]) %>% summarise(p_val = 2 * (n()/1000)) ``` ``` ## # A tibble: 1 x 1 ## p_val ## <dbl> ## 1 0 ``` --- ## Conditions for inference for regression Three conditions: -- 1. Observations should be independent -- 2. Residuals should be randomly distributed around 0 -- 3. Residuals should have constant variance --- ## Checking independence One consideration might be time series structure of the data. We can check whether one residual depends on the previous by plotting the residuals in the order of data collection. .midi[ ```r m_riders_aug <- augment(m_riders) ggplot(data = m_riders_aug, aes(x = 1:nrow(m_riders_aug), y = .resid)) + geom_point() + labs(x = "Index", y = "Residual") ``` <!-- --> ] --- ## Checking distribution of residuals around 0 and constant variance ```r ggplot(data = m_riders_aug, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0, lty = 3, color = "gray") + labs(y = "Residuals", x = "Predicted values, y-hat") ``` <!-- --> --- ## What else can we do with these p-values? Model selection based on p-value method: - Backwards elimination: Remove the variable with the highest p-value, re-fit, repeat until all variables are significant at the chosen significance level. - Forward selection: Start with the variable with the lowest p-value, re-fit, repeat until all no more significant variables left at the chosen significance level. This is sometimes seen in the literature, but is not recommended! - Relies on arbitrary significance level cutoff - Multiple testing! Instead use adjusted `\(R^2\)`, or AIC, or other criterion based model selection. --- ## My thoughts Both sets of p-values are largely useless other than in a few very narrow circumstances * Coefficient p-value - If you truly want to know if a coefficient is significantly different from zero (taking the other predictors into account) then great - If you want to know which predictors are important - use model selection * Overall model p-value - Strawman comparison, I don't really care that my model is better than a mean only model, the latter is almost always going to be terrible --- class: center, middle # Testing errors and power --- ## Testing errors - Type 1: Reject `\(H_0\)` when you shouldn't have + P(Type 1 error) = `\(\alpha\)` - Type 2: Fail to reject `\(H_0\)` when you should have + P(Type 2 error) is harder to calculate, but increases as `\(\alpha\)` decreases <div class="question"> In a court of law <ul> <li> Null hypothesis: Defendant is innocent <li> Alternative hypothesis: Defendant is guilty </ul> Which is worse: Type 1 or Type 2 error? </div> --- ## Probabilities of testing errors - P(Type 1 error) = `\(\alpha\)` - P(Type 2 error) = 1 - Power - Power = P(correctly rejecting the null hypothesis) <div class="question"> Fill in the blanks in terms of correctly/incorrectly rejecting/failing to reject the null hypothesis: <ul> <li> `\(alpha\)` is the probability of ___. <li> 1 - Power is the probability of ___. </ul> </div> --- class: center, middle # Projects --- ## It's project time! - Get in your new teams - Find a team name and let me know - Start working on the project proposal and ask questions