class: center, middle, inverse, title-slide

# CLT-based Inference
## Intro to Data Science
### Shawn Santo
### 04-07-20

---

## Video table of contents

1. Introduction and theory: [slides 1 - 12](https://warpwire.duke.edu/w/wZcDAA/)
<br/><br/>
2. Example: [slides 13 - 27](https://warpwire.duke.edu/w/xZcDAA/)

---

## Announcements

- Project proposal due 04-06-20 at 11:59pm EST
- Homework 4 due 04-09-20 at 11:59pm EST
- Lab 08 due 04-10-20 at 11:59pm EST

---

## The Central Limit Theorem

Remember: for a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`, assuming certain conditions hold:

- The distribution of the sample statistic is nearly normal

- The distribution is centered at the (often unknown) population parameter `\(\mu\)`

- The variability of the distribution is inversely proportional to the square root of the sample size

---

## Why do we care?

Knowing the distribution of the sample statistic `\(\bar{X}\)` can help us

- estimate a population parameter as point estimate `\(\pm\)` margin of error, where the margin of error combines a measure of how confident we want to be with a measure of how variable the sample statistic is

- test for a population parameter by evaluating how likely it is to obtain the observed sample statistic when assuming that the null hypothesis is true, since this probability depends on how variable the sampling distribution is

---

class: center, middle, inverse

# Inference based on the CLT

---

## Inference based on the CLT

If the necessary conditions are met, we can also use inference methods based on the CLT.

Suppose we know the true population standard deviation. Then the CLT tells us that `\(\bar{X}\)` approximately has the distribution `\(N\left(\mu, \sigma/\sqrt{n}\right)\)`.
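A quick simulation can illustrate this behavior (a sketch only; the exponential population and the sample size are arbitrary choices for illustration, not from the slides):

```r
# Sketch: sample means from a skewed (exponential) population are
# approximately normal, centered at mu, with sd about sigma / sqrt(n)
set.seed(1)
n <- 50
# Exp(rate = 1) has mu = 1 and sigma = 1
x_bars <- replicate(5000, mean(rexp(n, rate = 1)))

mean(x_bars)  # close to mu = 1
sd(x_bars)    # close to sigma / sqrt(n) = 1 / sqrt(50), about 0.14
```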
That is,

`$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$`

---

## Standard normal distribution: N(0, 1)

.small[
Finding probabilities under the normal curve:

```r
pnorm(-1.5)
```

```
#> [1] 0.0668072
```
]

<img src="lec-14a-clt-inference_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Standard normal distribution: N(0, 1)

.small[
Finding cutoff values under the normal curve:

```r
qnorm(0.2)
```

```
#> [1] -0.8416212
```
]

<img src="lec-14a-clt-inference_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## T distribution

- In practice, we never know the true value of `\(\sigma\)`, and so we estimate it from our data with `\(s\)`. We can construct the following test statistic for testing a single sample's population mean, which has a **t-distribution** with `\(n-1\)` *degrees of freedom*:

`$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$`

- The t-distribution is also unimodal and symmetric, and is centered at 0

- It has thicker tails than the normal distribution (to account for the additional variability introduced by using `\(s\)` instead of `\(\sigma\)` in the calculation of the SE)

- If we want to test two sample means against each other, the degrees of freedom are a bit more complicated and the test statistic is slightly different. It is best to let technology handle it.
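The thicker tails can be seen directly by comparing tail probabilities (a quick sketch, not part of the original slides; the cutoff and degrees of freedom are arbitrary choices):

```r
# Tail probability beyond -2: the t-distribution (here with 5 df)
# puts more mass in the tails than the standard normal
pt(-2, df = 5)  # larger tail probability
pnorm(-2)       # smaller tail probability
```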
---

## T vs Z distributions

<img src="lec-14a-clt-inference_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## T distribution

.small[
Finding probabilities under the t curve:

```r
pt(-1.96, df = 9)
```

```
#> [1] 0.0408222
```

```r
pt(1.96, df = 9, lower.tail = FALSE)
```

```
#> [1] 0.0408222
```
]

--

<br/>

.small[
Finding cutoff values under the t curve:

```r
qt(0.025, df = 9)
```

```
#> [1] -2.262157
```

```r
qt(0.975, df = 9)
```

```
#> [1] 2.262157
```
]

---

class: center, middle, inverse

# Example

---

## Relaxing after work

.question[
The GSS asks "After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?". Do these data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing? Note that the variable of interest in the dataset is `hrsrelax`.
]

--

```r
library(tidyverse)
gss <- read_csv("data/gss2010.csv")

gss %>%
  filter(!is.na(hrsrelax)) %>%
  summarise(x_bar = mean(hrsrelax),
            med   = median(hrsrelax),
            sd    = sd(hrsrelax),
            n     = n())
```

```
#> # A tibble: 1 x 4
#>   x_bar   med    sd     n
#>   <dbl> <dbl> <dbl> <int>
#> 1  3.68     3  2.63  1154
```

---

## Exploratory data analysis

```r
ggplot(data = gss, aes(x = hrsrelax)) +
  geom_histogram(binwidth = 1) +
  theme_minimal(base_size = 16) +
  labs(x = "Relaxation hours", y = "Count")
```

<img src="lec-14a-clt-inference_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## Hypotheses

.question[
What are the hypotheses for evaluating if Americans, on average, spend more than 3 hours per day relaxing?
]

--

`$$H_0: \mu = 3$$`
`$$H_A: \mu > 3$$`

---

## Conditions

.question[
What conditions must be satisfied to conduct this hypothesis test using methods based on the CLT? Are these conditions satisfied?
]

---

## Calculating the test statistic

Summary statistics from the sample:

```
#> # A tibble: 1 x 3
#>    xbar     s     n
#>   <dbl> <dbl> <int>
#> 1  3.68  2.63  1154
```

And the CLT says:

`$$\bar{x} \sim N\left(\mbox{mean} = \mu, \mbox{SE} = \frac{\sigma}{\sqrt{n}}\right)$$`

--

.question[
How many standard errors away from the population mean is the observed sample mean?
]

<br/>

--

.question[
How likely are we to observe a sample mean that is at least as extreme as the observed sample mean, if in fact the null hypothesis is true?
]

---

## Calculations

```r
n <- hrsrelax_summ %>% pull(n)
s <- hrsrelax_summ %>% pull(s)
xbar <- hrsrelax_summ %>% pull(xbar)
```

--

```r
(se <- s / sqrt(n))
```

```
#> [1] 0.07740938
```

```r
(t <- (xbar - 3) / se)
```

```
#> [1] 8.7876
```

```r
(df <- n - 1)
```

```
#> [1] 1153
```

```r
pt(t, df, lower.tail = FALSE)
```

```
#> [1] 2.720895e-18
```

---

## Conclusion

- Since the p-value is small, we reject `\(H_0\)`.

- The data provide sufficient evidence at the `\(\alpha = 0.05\)` level that Americans, on average, spend more than 3 hours per day relaxing after work.

--

.question[
Would you expect a 90% confidence interval for the average number of hours Americans spend relaxing after work to include 3 hours?
]

---

## Confidence interval for a mean

`$$\mbox{point estimate} \pm \mbox{critical value} \times \mbox{SE}$$`

```r
t_star <- qt(0.95, df)
pt_est <- xbar
round(pt_est + c(-1, 1) * t_star * se, 2)
```

```
#> [1] 3.55 3.81
```

.question[
Interpret this interval in context of the data.
]

---

## Built-in functionality in R

- There are built-in functions for doing some of these tests in R.

- A learning goal of this course is not to go through an exhaustive list of all CLT-based tests and how to implement them.

- Instead, you should try to understand how these methods are / are not like the simulation-based methods we learned about earlier.

.question[
What is similar, and what is different, between a CLT-based test of means and a simulation-based test?
]

---

## Built-in functionality in R

**Hypothesis testing**

```r
t.test(gss$hrsrelax, mu = 3, alternative = "greater")
```

```
#> 
#> 	One Sample t-test
#> 
#> data:  gss$hrsrelax
#> t = 8.7876, df = 1153, p-value < 2.2e-16
#> alternative hypothesis: true mean is greater than 3
#> 95 percent confidence interval:
#>  3.552813      Inf
#> sample estimates:
#> mean of x 
#>  3.680243
```

--

```r
infer::t_test(x = gss, response = hrsrelax, mu = 3, alternative = "greater")
```

```
#> # A tibble: 1 x 6
#>   statistic  t_df  p_value alternative lower_ci upper_ci
#>       <dbl> <dbl>    <dbl> <chr>          <dbl>    <dbl>
#> 1      8.79  1153 2.72e-18 greater         3.55      Inf
```

---

## Built-in functionality in R

**Confidence interval**

```r
t.test(gss$hrsrelax, conf.level = 0.90)
```

```
#> 
#> 	One Sample t-test
#> 
#> data:  gss$hrsrelax
#> t = 47.543, df = 1153, p-value < 2.2e-16
#> alternative hypothesis: true mean is not equal to 0
#> 90 percent confidence interval:
#>  3.552813 3.807672
#> sample estimates:
#> mean of x 
#>  3.680243
```

--

```r
infer::t_test(x = gss, response = hrsrelax, conf_int = TRUE, conf_level = 0.95)
```

```
#> # A tibble: 1 x 6
#>   statistic  t_df   p_value alternative lower_ci upper_ci
#>       <dbl> <dbl>     <dbl> <chr>          <dbl>    <dbl>
#> 1      47.5  1153 5.37e-274 two.sided       3.53     3.83
```

---

## Additional resources

1. See [Section 5.1](https://drive.google.com/file/d/0B-DHaDEbiOGkc1RycUtIcUtIelE/view) for more information about the t-distribution and inference involving a population mean.

2. See [Section 6.1](https://drive.google.com/file/d/0B-DHaDEbiOGkc1RycUtIcUtIelE/view) for information about inference for a single proportion.

---

## Application exercise

https://classroom.github.com/a/i1KFO5VO

---

## References

1. Tidy Statistical Inference. (2020). Infer.netlify.com. Retrieved 3 April 2020, from https://infer.netlify.com/