class: center, middle, inverse, title-slide

# Inference for SLR
## Part 2
### Dr. Maria Tackett
### 01.28.19

---

## Announcements

- HW 01 due Wed, Jan 30 at 11:59p

- Lab 02 due Wed, Jan 30 at 11:59p

---

## Agenda

- Inference for the slope, `\(\beta_1\)`
- Prediction
- Cautions

---

## Packages and Data

```r
library(tidyverse)
library(broom)
library(readr)
library(modelr)
```

```r
beer <- read_csv("data/beer.csv")
```

---

## Carbs vs. alcohol in beer

- **What is the relationship between the percentage of alcohol and the amount of carbohydrates in beer?**

- The dataset contains the following characteristics for 173 domestic beers:
    - `Brand`
    - `Brewery`
    - `PercentAlcohol`
    - `CaloriesPer12Oz`
    - `Carbohydrates` (in grams)

---

class: middle, center

## Inference for `\(\beta_1\)`

---

## Statistical Inference

- <font class="vocab">Confidence Interval: </font> Estimate a range of plausible values for a parameter

- <font class="vocab">Hypothesis Test: </font> Test a specified claim or hypothesis about the parameter

- In this class, our focus will be on inference for the regression coefficients, i.e., the `\(\beta_j\)`'s

---

## Confidence Intervals

- **What**: Estimates a population parameter using a sample statistic
    - Assumes the sample data are a simple random sample

- **Why**: Because the statistic is a random variable, its value is subject to chance error
    - We want a range of plausible values for the population parameter that takes the chance error into account

- We assume the data are from a random sample that is representative of the population. A confidence interval will not make up for systematic bias in the data

---

## Confidence interval for `\(\beta_1\)`

- The confidence interval for the regression slope is

.alert[
`$$\mathbf{\hat{\beta}_1 \pm t^* SE(\hat{\beta}_1)}$$`
]

- `\(t^*\)` is the critical value associated with the confidence level
    + It is calculated from a `\(t\)` distribution with `\(n-2\)` degrees of freedom

- `\(SE(\hat{\beta}_1)\)` is the standard error of the slope

`$$SE(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum\limits_{i=1}^n (x_i - \bar{x})^2}} \hspace{2.5mm} = \hspace{2.5mm} \hat{\sigma}\sqrt{\frac{1}{(n-1)s_X^2}}$$`

---

### Beer data: 95% confidence interval for `\(\beta_1\)`

.small[

```r
model <- lm(Carbohydrates ~ PercentAlcohol, data = beer)
```

```r
model %>% tidy(conf.int = TRUE)
```

```
## # A tibble: 2 x 7
##   term           estimate std.error statistic  p.value conf.low conf.high
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)        2.64     1.31       2.01 4.56e- 2   0.0520      5.23
## 2 PercentAlcohol     1.80     0.242      7.44 6.09e-12   1.32        2.28
```
]

.instructions[
Interpret the confidence interval for the slope in the context of the data.
]
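---

### Beer data: CI with `confint()`

As a quick check, the same interval is available directly from base R's `confint()`; it matches the `conf.low` and `conf.high` columns in the `tidy()` output above:

```r
confint(model, "PercentAlcohol", level = 0.95)
```

The next slide builds the same interval by hand from the formula.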
---

## Calculating the 95% CI

```r
sigma <- sigma(model)
beta1 <- nth(model$coefficients, 2)
```

```r
n <- nrow(beer)
crit.val <- qt(0.975, n - 2)
```

```r
beer %>%
  summarise(var = var(PercentAlcohol),
            sigma = sigma,
            beta1 = beta1,
            crit.val = crit.val,
            se = sqrt(sigma^2 / ((n - 1) * var)),
            lb = beta1 - crit.val * se,
            ub = beta1 + crit.val * se)
```

```
## # A tibble: 1 x 7
##     var sigma beta1 crit.val    se    lb    ub
##   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl>
## 1  1.89  4.27  1.80     1.97 0.237  1.33  2.27
```

---

class: middle, center

**Is there truly a linear relationship between the percent alcohol and the amount of carbohydrates in a domestic beer?**

**In other words, is there enough evidence to conclude that the true population slope for percent alcohol is different from 0?**

---

## Hypothesis Tests

Question being answered:

- Given the data in our sample, is there evidence <u>**against**</u> a specified hypothesis about the parameter of interest?

- In other words:
    - Are the data consistent with the specified hypothesis?

---

## Outline of a Hypothesis Test

- Assume some hypothesis about the parameter is true
    - This hypothesis is the <font class="vocab3">null hypothesis</font>

- Identify the appropriate statistic based on the distribution of the parameter of interest
    - This statistic is the <font class="vocab3">test statistic</font>
    - The statistic should take an extreme value when the hypothesis is false

- Use the data in your sample to calculate the value of the test statistic

---

## Outline of a Hypothesis Test

- Calculate the probability of observing a value of the statistic that is as extreme or more extreme than the observed value, under the assumed hypothesis
    - This calculated probability is the <font class="vocab3">p-value</font>

- Assess the p-value to make your conclusion

---

## Conducting a Hypothesis Test

1. State the hypotheses
2. Calculate the test statistic
3. Calculate the p-value
4. State the conclusion in the context of the problem

---

### 1. State the hypotheses

- We are often interested in testing whether there is a significant linear relationship between the explanatory and response variables

- If there is no linear relationship between the two variables, the population regression slope should equal 0

--

- We can test the hypotheses:

`$$\begin{aligned}&\mathbf{H_0: \boldsymbol{\beta_1} = 0}\\&\mathbf{H_a: \boldsymbol{\beta_1} \neq 0}\end{aligned}$$`

- This is the test conducted by the `lm()` function in R

---

### 2. Calculate the test statistic

- <font class="vocab">Test Statistic: </font>

`$$\begin{aligned}\text{test statistic} &= \frac{\text{Estimate} - \text{Hypothesized}}{SE} \\ &= \frac{\hat{\boldsymbol{\beta}}_1 - 0}{SE(\hat{\beta}_1)}\end{aligned}$$`

- **Interpretation**: The number of standard errors the estimated slope is from the hypothesized slope, i.e., the mean of its sampling distribution given `\(H_0\)` is true

---

### 3. Calculate the p-value

- <font class="vocab">p-value:</font> Calculated from a `\(t\)` distribution with `\(n-2\)` degrees of freedom. For the two-sided alternative,

`$$\text{p-value} = 2P(t \geq |\text{test statistic}|)$$`

A small p-value means either...

1. The assumed hypothesis is incorrect, or
2. The assumed hypothesis is correct and a rare event has occurred
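---

### Steps 2 & 3 in R

A sketch of the calculation for the beer data, reusing `model`, `beta1`, and `n` from the confidence interval slides:

```r
# standard error of the slope, taken from the tidy() output
se.beta1 <- tidy(model)$std.error[2]

# test statistic: number of SEs the estimate is from 0
test.stat <- (beta1 - 0) / se.beta1

# two-sided p-value from a t distribution with n - 2 df
p.value <- 2 * pt(abs(test.stat), df = n - 2, lower.tail = FALSE)

c(test.stat = test.stat, p.value = p.value)
```

These should match the `statistic` and `p.value` columns that `tidy()` reports for `PercentAlcohol` in the upcoming output.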
---

### 4. State the conclusion

- State a conclusion about the hypothesis based on the assessment of the p-value

- Since event (2) is by definition rare, we will conclude that a <font color="green">small p-value</font> indicates that there <font color="green">is sufficient evidence</font> to claim <font color="green">that the assumed hypothesis is false</font>

- When the p-value is <font color="red">not small</font>, we will conclude that there <font color="red">is not sufficient evidence</font> to claim the assumed hypothesis is false

---

### Hypothesis test for coefficient of `PercentAlcohol`

.small[

```r
model %>% tidy(conf.int = TRUE)
```

```
## # A tibble: 2 x 7
##   term           estimate std.error statistic  p.value conf.low conf.high
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)        2.64     1.31       2.01 4.56e- 2   0.0520      5.23
## 2 PercentAlcohol     1.80     0.242      7.44 6.09e-12   1.32        2.28
```
]

.question[
What is the relationship between the confidence interval and the hypothesis test?
]

---

class: middle, center

### Understanding Hypothesis Tests

---

### What the *p-value* is and is not

**What the p-value is NOT**:

- It is <u><b>not</b></u> the probability the null hypothesis is true
    + The null hypothesis is either true or not true

- (1 - *p-value*) is <u><b>not</b></u> the probability that the alternative hypothesis is true
    + The alternative hypothesis is either true or not true

.alert[
**What the p-value <u>IS</u>**

The probability of getting a test statistic as extreme or more extreme than the calculated test statistic, *assuming the null hypothesis is true*
]

---

## Interpreting the p-value

| Magnitude of p-value  | Interpretation                             |
|:---------------------:|:------------------------------------------:|
| p-value < 0.01        | strong evidence against `\(H_0\)`          |
| 0.01 < p-value < 0.05 | moderate evidence against `\(H_0\)`        |
| 0.05 < p-value < 0.1  | weak evidence against `\(H_0\)`            |
| p-value > 0.1         | effectively no evidence against `\(H_0\)`  |

<br> <br>

**Note:** These are general guidelines. The strength of evidence depends on the context of the problem.

---

## Statistical Significance

- A threshold can be used to decide whether or not to reject `\(H_0\)`
    - This threshold is called the <font class="vocab3">significance level</font> and is usually denoted by `\(\alpha\)`

- When `\(H_0\)` is rejected, we use the term <font class="vocab3">statistically significant</font> to describe the outcome of the test

- *Example*: When `\(\alpha = 0.05\)`, results are statistically significant when the p-value is `\(< 0.05\)`

---

## Statistical Significance

- Do not rely strictly on the significance level to make a conclusion!

<br>

--

- Suppose the significance level is 0.05

<br>

--

    + If the p-value is 0.05001, we do not reject `\(H_0\)`

<br>

--

    + If the p-value is 0.04999, we do reject `\(H_0\)`

<br>

--

- 0.05001 and 0.04999 are practically the same, yet they lead to different conclusions

<br>

--

- Always state the p-value when reporting results, and assess its magnitude in the context of your problem

---

### Not Statistically Significant

- An outcome of failing to reject `\(H_0\)` is <u>**not**</u> a failed study/experiment

- Obtaining an outcome of "no significant effect" or "no significant difference" is still a valid result

- It is often just as important to learn that `\(H_0\)` can't be refuted

---

## Type I & Type II Errors

<img src="img/03/errors.png" width="80%" style="display: block; margin: auto;" />

.small[
Image: *The Basic Practice of Statistics (7th Ed.)*
]

- <font class="vocab3">Type I Error</font>: Reject `\(H_0\)` when `\(H_0\)` is true

- <font class="vocab3">Type II Error</font>: Fail to reject `\(H_0\)` when `\(H_a\)` is true

- Replicate the study when possible to reduce these errors

---

## Sample Size

- The probability of a Type I error is not affected by the sample size
    - What does affect the probability of a Type I error?

--

- The probability of a Type II error decreases as the sample size increases

--

- If the hypothesized value is not very different from the actual parameter value, you need a large sample size to detect the difference

--

- When designing a study, it is good practice to conduct a power analysis to determine the sample size required to minimize the chance of a Type II error

---

class: regular

## Sample Size

- It is always preferable to collect as much data as possible (as long as it is accurate and relevant)

--

- A test can reject almost any false `\(H_0\)`, i.e., detect even very small effects, when the sample size is large enough

--

- *Note*: Statistical significance is <u>not</u> the same as practical significance
    + Even if `\(H_0\)` is rejected, the detected effect may be too small to be of any practical use

---

class: middle, center

## Predictions

---

class: regular

### Predictions for New Observations

- We can use the regression model to predict the response at `\(x_0\)`

`$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_0$$`

<br>

- In other words, we have the same estimate whether we want to predict the mean response at `\(x_0\)` or an individual response at `\(x_0\)`

---

class: regular

## Beer data

.instructions[
What is the predicted amount of carbohydrates (in grams) for **a beer** with 5% alcohol?
]

<br><br>

.instructions[
What is the predicted mean amount of carbohydrates (in grams) for **the subset of beers** with 5% alcohol?
]
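---

class: regular

### Beer data: point prediction

A sketch of the calculation: because the fitted line gives a single point estimate, `\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 5\)`, both questions on the previous slide have the same answer.

```r
x0 <- data.frame(PercentAlcohol = 5)
predict(model, x0)
```

What differs between the two predictions is their uncertainty, which the next slides quantify.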
---

class: regular

### Predictions for New Observations

- There is uncertainty in our predictions, so we need to calculate an SE to capture that uncertainty

- The SE is different depending on whether you are predicting an average value or an individual value

- The SE is larger when predicting for an individual value than for an average value

---

### SE

When predicting for a new observation `\(x\)`, the SE for the **mean** response is

`$$SE(\hat{\mu}) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}$$`

and the SE for an **individual** response is

`$$SE(\hat{y}) = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}$$`
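---

### Computing the SEs by hand

A sketch for `\(x = 5\)`, reusing `sigma` and `n` from the confidence interval calculation; the variable names here are illustrative. Compare with the widths of the intervals returned by `predict.lm()` on the next slides:

```r
x <- 5
xbar <- mean(beer$PercentAlcohol)
ssx <- sum((beer$PercentAlcohol - xbar)^2)

# SE for the mean response at x
se.mean <- sigma * sqrt(1/n + (x - xbar)^2 / ssx)

# SE for an individual response at x (note the extra 1)
se.indiv <- sigma * sqrt(1 + 1/n + (x - xbar)^2 / ssx)

c(se.mean = se.mean, se.indiv = se.indiv)
```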
---

### Beer data: predicting mean

- We wish to predict the <font class="vocab">mean</font> amount of carbohydrates for the subset of beers with 5% alcohol.

```r
x0 <- data.frame(PercentAlcohol = 5)
*predict.lm(model, x0, interval = "confidence", level = 0.95)
```

```
##        fit      lwr    upr
## 1 11.63853 10.96005 12.317
```

- Interpret the interval in the context of the data.

---

### Beer data: predicting individual

- We wish to predict the amount of carbohydrates for a <font class="vocab">particular</font> beer with 5% alcohol.

```r
x0 <- data.frame(PercentAlcohol = 5)
*predict.lm(model, x0, interval = "prediction", level = 0.95)
```

```
##        fit      lwr      upr
## 1 11.63853 3.178022 20.09903
```

- Interpret the interval in the context of the data.

---

class: middle, center

## Cautions

---

## Caution: Extrapolation

- The regression model is only useful for predicting the response variable `\(Y\)` within the range of the explanatory variable `\(X\)` that was used to fit the model

- It is risky to predict far beyond that range of `\(X\)`, since you don't have data to tell you whether or not the relationship continues to follow a straight line

---

## Caution: Extrapolation

<img src="04-slr-inference-pt2_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

We should not use this regression model to predict the amount of carbohydrates for beers with more than about 12% alcohol or less than about 1% alcohol.

---

### Caution: Correlation `\(\neq\)` Causation

- The regression model is **<u>not</u>** a statement of causality
    + The regression model provides a description of the averages of `\(Y\)` for different values of `\(X\)`

- The regression model alone **<u>cannot</u>** prove causality. You need either
    - a randomized experiment, or
    - an observational study in which all relevant confounding variables are adequately controlled for