class: center, middle, inverse, title-slide

# Inference for Regression

## Intro to Data Science

### Shawn Santo

### 04-09-20

---

## Video table of contents

1. Introduction and a simulation-based confidence interval for `\(\beta_1\)`:
   [slides 3 - 16](https://warpwire.duke.edu/w/g5oDAA/)
<br/><br/>
2. A simulation-based hypothesis test for `\(\beta_1\)`:
   [slides 17 - 24](https://warpwire.duke.edu/w/hZoDAA/)
<br/><br/>
3. CLT-based inference for regression:
   [slides 25 - 35](https://warpwire.duke.edu/w/jZoDAA/)

---

## Announcements

- Homework 4 due 04-09-20 at 11:59pm ET; remember, you get to drop your lowest homework score

- Lab 08 due 04-10-20 at 11:59pm ET

- The last graded lab assignment, Lab 09, is out on 04-10-20; remember, you get to drop your lowest lab score

- Prepare for Exam 2; the focus will be on inference and model building

---

class: middle, center, inverse

# Inference for Regression

---

## Riders in Florence, MA

Packages:

```r
library(tidyverse)
library(infer)
library(broom)      # tidy model results
library(mosaicData) # package where our data lives
```

Data: The Pioneer Valley Planning Commission collected data in Florence, MA on 90 days between April 5 and November 15, 2005. A laser sensor recorded each break in its beam, indicating that a rail-trail user had passed the data collection station.

```r
data(RailTrail)
```

---

## Data preview

```r
glimpse(RailTrail)
```

```
#> Observations: 90
#> Variables: 11
#> $ hightemp   <int> 83, 73, 74, 95, 44, 69, 66, 66, 80, 79, 78, 65, 41, 5…
#> $ lowtemp    <int> 50, 49, 52, 61, 52, 54, 39, 38, 55, 45, 55, 48, 49, 3…
#> $ avgtemp    <dbl> 66.5, 61.0, 63.0, 78.0, 48.0, 61.5, 52.5, 52.0, 67.5,…
#> $ spring     <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,…
#> $ summer     <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,…
#> $ fall       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ cloudcover <dbl> 7.6, 6.3, 7.5, 2.6, 10.0, 6.6, 2.4, 0.0, 3.8, 4.1, 8.…
#> $ precip     <dbl> 0.00, 0.29, 0.32, 0.00, 0.14, 0.02, 0.00, 0.00, 0.00,…
#> $ volume     <int> 501, 419, 397, 385, 200, 375, 417, 629, 533, 547, 432…
#> $ weekday    <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FAL…
#> $ dayType    <chr> "weekday", "weekday", "weekday", "weekend", "weekday"…
```

We'll focus on two variables:

- `hightemp`: daily high temperature (in degrees Fahrenheit)
- `volume`: estimated number of trail users that day (number of breaks recorded)

---

## Riders in Florence, MA

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

???

```r
gg_scatter <- ggplot(RailTrail, mapping = aes(x = hightemp, y = volume)) +
  geom_point() +
  labs(y = "Volume", x = "High Temperature (F)") +
  theme_minimal(base_size = 16)

gg_scatter +
  geom_smooth(method = "lm", se = FALSE, color = "red")
```

---

## Coefficient interpretation

Interpret the coefficients of the regression model for predicting `volume` (estimated number of trail users that day) from `hightemp` (daily high temperature, in degrees Fahrenheit).

```r
m_riders <- lm(volume ~ hightemp, data = RailTrail)
tidy(m_riders) %>%
  select(term, estimate)
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)   -17.1 
#> 2 hightemp        5.70
```

---

## Variability in our estimates

Similar to the sample mean and sample proportion, there exists a sampling distribution for our intercept and slope estimates. The intercept and slope values we obtain from our fitted regression line are only point estimates of the true (unknown) population parameters.
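To see this variability directly, here is a minimal sketch (not part of the original deck) that refits the model on a few resamples of the data; the seed and the number of resamples are arbitrary.

```r
# Each resample (drawn with replacement from the 90 days) yields a somewhat
# different slope estimate -- a preview of the bootstrap idea below
set.seed(4920)
replicate(3, {
  resample <- RailTrail[sample(nrow(RailTrail), replace = TRUE), ]
  coef(lm(volume ~ hightemp, data = resample))[["hightemp"]]
})
```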
<br/><br/>

A natural question to ask: do the data provide convincing evidence that the true linear model has a non-zero slope?

<br/><br/>

Using the techniques we have learned before, we can create confidence intervals and perform hypothesis tests for the population regression coefficients.

---

class: center, middle, inverse

# A confidence interval for `\(\beta_1\)`

---

## Bootstrapping the data

To create a 95% confidence interval:

1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample.

2. Calculate the bootstrap statistic - fit a linear model on the bootstrapped data and extract the estimate for the slope.

3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics.

4. Calculate the bounds of the 95% confidence interval as the middle 95% of the bootstrap distribution.

---

## Bootstrap animation

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-7-.gif" style="display: block; margin: auto;" />

???

```r
boot_samples <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  generate(reps = 100, type = "bootstrap")

for (i in 1:100) {
  plot <- boot_samples %>%
    filter(replicate == i) %>%
    ggplot(aes(x = hightemp, y = volume)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    ggpubr::stat_regline_equation(label.x = 40, label.y = 650,
                                  color = "red", size = 5) +
    labs(y = "Volume", x = "High Temperature (F)",
         title = paste0("Bootstrap sample ", i)) +
    theme_minimal(base_size = 16) +
    xlim(40, 100) +
    ylim(120, 750)

  print(plot)
}
```

---

## All those bootstrapped lines

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

???

```r
ggplot(boot_samples, aes(x = hightemp, y = volume, color = factor(replicate))) +
  geom_smooth(method = "lm", se = FALSE, lwd = 0.2) +
  labs(y = "Volume", x = "High Temperature (F)",
       title = "Bootstrapped regression lines") +
  theme_minimal(base_size = 16) +
  xlim(40, 100) +
  ylim(120, 750) +
  theme(legend.position = "none") +
  scale_color_manual(values = rep("gray", 100))
```

---

## Visualize

.tiny[

```r
boot_dist <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  generate(reps = 5000, type = "bootstrap") %>%
* calculate(stat = "slope")
```

```r
boot_dist %>%
  visualize() +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---

## Confidence interval for `\(\beta_1\)`

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

???

```r
ci <- boot_dist %>%
  get_ci()

boot_dist %>%
  visualize() +
  shade_ci(ci, level = 0.95) +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Confidence interval for `\(\beta_1\)`

Interpret the bootstrap interval in the context of the data. Below are two ways to obtain the interval endpoints.

Using `infer`:

.tiny[

```r
boot_dist %>%
  get_ci()
```

```
#> # A tibble: 1 x 2
#>   `2.5%` `97.5%`
#>    <dbl>   <dbl>
#> 1   4.21    7.39
```
]

Using `dplyr`:

.tiny[

```r
boot_dist %>%
  summarise(lb = quantile(stat, 0.025),
            ub = quantile(stat, 0.975))
```

```
#> # A tibble: 1 x 2
#>      lb    ub
#>   <dbl> <dbl>
#> 1  4.21  7.39
```
]
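As an aside (not in the original deck), `infer` can also build a standard-error bootstrap interval from the same `boot_dist`; a minimal sketch, where the observed slope is computed with `calculate(stat = "slope")` just as `obs_b1` is later:

```r
# Observed slope, needed as the point estimate for the SE method
obs_slope <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  calculate(stat = "slope")

# Interval of the form: observed slope +/- critical value * bootstrap SE
boot_dist %>%
  get_confidence_interval(level = 0.95, type = "se",
                          point_estimate = obs_slope)
```

Since the bootstrap distribution here is roughly symmetric, this interval should be close to the percentile interval above.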
---

class: center, middle, inverse

# A hypothesis test for `\(\beta_1\)`

---

## Hypothesis testing for `\(\beta_1\)`

Based on our sample data, we might wonder: is there convincing evidence that temperature is associated with the volume of riders?

We can set this up as a hypothesis test, where

<br/>

--

`\(H_0: \beta_1 = 0\)`. There is no linear relationship between volume and temperature; the slope is 0.

`\(H_A: \beta_1 \ne 0\)`. There is a linear relationship between volume and temperature.

--

<br/>

We would only reject `\(H_0\)` in favor of `\(H_A\)` if the data provide strong evidence that the true slope parameter is different from zero. We can assess this with a simulation-based hypothesis test.

---

## Compute observed slope

.tiny[

```r
obs_b1 <- RailTrail %>%
  specify(volume ~ hightemp) %>%
  calculate(stat = "slope")

obs_b1
```

```
#> # A tibble: 1 x 1
#>    stat
#>   <dbl>
#> 1  5.70
```
]

This is the same as doing

.tiny[

```r
lm(volume ~ hightemp, data = RailTrail) %>%
  tidy() %>%
  select(term, estimate)
```

```
#> # A tibble: 2 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)   -17.1 
#> 2 hightemp        5.70
```
]

and extracting the slope estimate. Use whichever method you like, although I think the first method is easiest, as it generalizes to any statistic `infer` supports.

---

## Simulate the null distribution

```r
null_dist <- RailTrail %>%
  specify(response = volume, explanatory = hightemp) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
* calculate(stat = "slope")
```

--

```r
null_dist
```

```
#> # A tibble: 5,000 x 2
#>    replicate   stat
#>        <int>  <dbl>
#>  1         1  0.112
#>  2         2  1.90 
#>  3         3  0.223
#>  4         4  1.35 
#>  5         5 -0.295
#>  6         6 -0.417
#>  7         7 -1.11 
#>  8         8  0.879
#>  9         9  2.05 
#> 10        10  0.516
#> # … with 4,990 more rows
```

---

## Visualize the distribution under `\(H_0\)`

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

???

```r
null_dist %>%
  visualize() +
  xlim(-6, 6) +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Visualize the p-value

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

???

```r
null_dist %>%
  visualize() +
* shade_p_value(obs_stat = obs_b1, direction = "two_sided") +
  xlim(-6, 6) +
  labs(x = expression(b[1]), y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Computing the p-value

```r
null_dist %>%
  get_p_value(obs_b1, direction = "two_sided")
```

```
#> # A tibble: 1 x 1
#>   p_value
#>     <dbl>
#> 1       0
```

A reported p-value of 0 means none of the 5,000 permuted slopes were as extreme as the observed slope; the true p-value is not exactly 0, just very small.

--

<br/>

What is our conclusion?

---

## Shortcomings of `infer`

- It does not support a multiple regression setting.

- It does not support a logistic regression setting.

--

<br/>

<b>How can we conduct inference with regard to the parameters in a multiple regression framework?</b>

---

class: center, middle, inverse

# CLT-based inference for regression

---

## Hypothesis testing for the slope

Recall our fitted model `m_riders`.

```r
tidy(m_riders)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic       p.value
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774        
#> 2 hightemp        5.70     0.848     6.72  0.00000000171
```

--

<br/>

How can we use this information to conduct a hypothesis test for the slope? Also, when can we use this information?

---

## Conditions for inference for regression

Four conditions (see the diagnostic-plot aside after this list):

--

1. Observations should be independent

--

2. Residuals should be randomly distributed around 0

--

3. Residuals should be nearly normally distributed, centered at 0

--

4. Residuals should have constant variance
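--

As an aside (not in the original deck), base R's `plot()` method for `lm` objects gives a quick first look at conditions 2 - 4 in one call; a minimal sketch:

```r
# Standard lm diagnostics: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(m_riders)
```

Below, we verify each condition with `ggplot2` instead.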
---

## Condition 1

**Verify** observations are independent.

One consideration might be the time series structure of the data. We can check whether one residual depends on the previous one by plotting the residuals in the order of data collection.

.tiny[

```r
m_riders_aug <- augment(m_riders)

ggplot(data = m_riders_aug, aes(x = 1:nrow(m_riders_aug), y = .resid)) +
  geom_point() +
  labs(x = "Index", y = "Residual") +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]

---

## Conditions 2, 4

**Verify** the residuals are centered around 0 and have constant variance.

.tiny[

```r
ggplot(data = m_riders_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lty = 3, color = "gray") +
  labs(y = "Residuals", x = expression(hat(y))) +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]

---

## Condition 3

**Verify** normality of residuals.

.tiny[

```r
ggplot(data = m_riders_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Residuals", y = "Count") +
  theme_minimal(base_size = 16)
```

<img src="lec-14b-inf-reg_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

## Tidy model output

```r
tidy(m_riders)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic       p.value
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774        
#> 2 hightemp        5.70     0.848     6.72  0.00000000171
```

--

A 95% confidence interval for `\(\beta_1\)` is given by

```r
broom::confint_tidy(m_riders, conf.level = 0.95) %>%
  slice(2)
```

```
#> # A tibble: 1 x 2
#>   conf.low conf.high
#>      <dbl>     <dbl>
#> 1     4.02      7.39
```

---

You could also use:

```r
tidy(m_riders, conf.int = TRUE, conf.level = 0.95)
```

```
#> # A tibble: 2 x 7
#>   term        estimate std.error statistic       p.value conf.low conf.high
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774           -135.     101.  
#> 2 hightemp        5.70     0.848     6.72  0.00000000171      4.02      7.39
```

--

<br/>

Using the model output directly and the function `qt()`, we can also obtain the 95% confidence interval by hand. Note that the relevant t-distribution has `\(n - 2 = 88\)` degrees of freedom.

```r
c(5.7 - qt(.975, 88) * 0.848, 5.7 + qt(.975, 88) * 0.848)
```

```
#> [1] 4.014778 7.385222
```

**Do not calculate it this way. Use `tidy()` or `confint_tidy()` from `broom`.**

---

## What about those p-values?

```r
tidy(m_riders)
```

```
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic       p.value
#>   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
#> 1 (Intercept)   -17.1     59.4      -0.288 0.774        
#> 2 hightemp        5.70     0.848     6.72  0.00000000171
```

<br/>

The p-values in the output of `tidy()` are the two-sided p-values associated with the observed statistics. For each coefficient, the test under consideration is

`$$H_0: \beta_k = 0$$`
`$$H_A: \beta_k \neq 0$$`

Based on the result above, as was the case with the simulation-based approach, we reject `\(H_0\)` in favor of `\(H_A\)` for the test involving the slope. Hence, we have enough evidence to suggest that there is an association between volume and the daily high temperature.

---

## Application exercise

https://classroom.github.com/a/IOdhZU0x

---

## References

1. Tidy Statistical Inference. (2020). infer.netlify.com. Retrieved 3 April 2020, from https://infer.netlify.com/index.html

2. Convert Statistical Objects into Tidy Tibbles. (2020). broom.tidyverse.org. Retrieved 5 April 2020, from https://broom.tidyverse.org/