class: center, middle, inverse, title-slide

# The Central Limit Theorem

### Yue Jiang

### Duke University

---

class: center, middle

## Sample Statistics and Sampling Distributions

---

## For the fifth time...

- Statistical inference is the act of generalizing from a sample in order to draw conclusions about a population.

- We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

- As part of this process, we must quantify the degree of uncertainty in our sample statistic.

---

## Variability of sample statistics

- We've seen that each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

- Previously, we quantified this variability via simulation

- Today we discuss some of the theory underlying .vocab[sampling distributions], particularly as they relate to sample means.

---

## Sampling distribution of the mean

Suppose we're interested in the mean resting heart rate of students at Duke, and are able to do the following:

--

1. Take a random sample of size `\(n\)` from this population, and calculate the mean resting heart rate in this sample, `\(\bar{X}_1\)`

--

2. Put the sample back, take a second random sample of size `\(n\)`, and calculate the mean resting heart rate from this new sample, `\(\bar{X}_2\)`

--

3. Put the sample back, take a third random sample of size `\(n\)`, and calculate the mean resting heart rate from this sample, too...

--

...and so on.

--

After repeating this many times, we have a dataset of sample means from the population: `\(\bar{X}_1\)`, `\(\bar{X}_2\)`, `\(\cdots\)`, `\(\bar{X}_K\)` (assuming we took `\(K\)` total samples).

---

## Sampling distribution of the mean

.question[
Can we say anything about the distribution of these sample means (that is, the .vocab[sampling distribution] of the mean)?
]

*(Keep in mind, we don't know what the underlying distribution of resting heart rate looks like in Duke students!)*

---

class: center, middle

## The Central Limit Theorem

---

## A quick caveat for now...

For now, let's assume that the population standard deviation `\(\sigma\)` is known.

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample mean `\(\bar{X}\)`, assuming certain conditions hold:

1. The mean of the sampling distribution of the mean is identical to the population mean `\(\mu\)`,

2. The standard deviation of the sampling distribution of the mean is `\(\sigma/\sqrt{n}\)`, also known as the .vocab[standard error] (SE) of the mean, and

3. For `\(n\)` large enough (in the limit, as `\(n \to \infty\)`), the shape of the sampling distribution of the mean is approximately .vocab[normally distributed].

---

## The normal (Gaussian) distribution?

The normal distribution is unimodal and symmetric and is described by its .vocab[density function]:

If a random variable `\(X\)` follows the normal distribution, then
`$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}$$`
where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance.

We often write `\(N(\mu, \sigma^2)\)` to describe this distribution.

---

## The normal distribution (graphically)

<img src="13-clt_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Wait, *any* distribution?

The central limit theorem tells us that *sample means* are approximately normally distributed, if we have enough data and certain assumptions hold.

This is true *even if our original variables are not normally distributed*.

[**Check out this interactive demonstration**](http://onlinestatbook.com/stat_sim/sampling_dist/index.html)

---

## Conditions

What are the conditions we need for the CLT to hold?
- **Independence:** The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
    - the sample must be taken randomly
    - if sampling without replacement, the sample size must be less than 10% of the population size

If observations are independent, then *a priori*, one observation's value does not "influence" another's.

- **Sample size / distribution:**
    - if the data are numerical, usually `\(n > 30\)` is considered a large enough sample for the CLT to kick in
    - if we know for sure that the underlying data are normally distributed, then the distribution of sample means will also be exactly normal, regardless of the sample size
    - if the data are categorical, we need at least 10 successes and 10 failures

---

## Let's run our own simulation

**The underlying population** (we never observe this!)

```r
rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```

<img src="13-clt_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

**The true population parameters**

```r
rs_pop %>%
  summarise(mu = mean(x), sigma = sd(x))
```

```
#> # A tibble: 1 x 2
#>      mu sigma
#>   <dbl> <dbl>
#> 1  16.7  14.1
```

---

## Sampling from the population - 1

```r
set.seed(1)
samp_1 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

---

## Sampling from the population - 2

```r
set.seed(2)
samp_2 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

---

## Sampling from the population - 3

```r
samp_3 <- rs_pop %>%
  sample_n(size = 50, replace = TRUE) %>%
  summarise(x_bar = mean(x))
```

--

keep repeating...
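Before simulating the full set of repetitions, note what the CLT predicts for these repeated sample means: they should be centered near `\(\mu \approx 16.7\)`, with standard error `\(\sigma/\sqrt{n}\)`. A quick sketch, plugging in the population values estimated from `rs_pop` above:

```r
# CLT-predicted standard error of the mean, sigma / sqrt(n),
# using the population sd (~14.1) from rs_pop above and n = 50
sigma <- 14.1
n <- 50
sigma / sqrt(n)
#> [1] 1.994036
```

We can compare this theoretical prediction to the empirical standard error from the simulation that follows.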
---

## Sampling distribution

```r
sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))
```

<img src="13-clt_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

```r
sampling %>%
  summarise(mean = mean(xbar), se = sd(xbar))
```

```
#> # A tibble: 1 x 2
#>    mean    se
#>   <dbl> <dbl>
#> 1  16.7  2.01
```

---

## Comparing two distributions

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<img src="13-clt_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## Recap

- If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.

- The center of the sampling distribution is at the center of the population distribution.

- The sampling distribution is less variable than the population distribution (and we can quantify by how much).

.question[
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases?
]

---

class: center, middle

## Finding probabilities in R

---

## Standard normal distribution: N(0, 1)

.small[
Finding probabilities under the normal curve:

```r
pnorm(-1.5)
```

```
#> [1] 0.0668072
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

--

<img src="13-clt_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

```r
pnorm(2) - pnorm(-1)
```

```
#> [1] 0.8185946
```

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

```r
pnorm(2) - pnorm(-1)
```

```
#> [1] 0.8185946
```

---

## Standard normal distribution: N(0, 1)

.small[
Finding cutoff values under the normal curve:

```r
qnorm(0.25)
```

```
#> [1] -0.6744898
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Your turn!

[https://classroom.github.com/a/bDVjo9kD](https://classroom.github.com/a/bDVjo9kD)
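---

## Beyond N(0, 1)

An aside: `pnorm()` and `qnorm()` also accept `mean` and `sd` arguments, so probabilities and cutoffs for *any* normal distribution can be computed without standardizing first. A quick sketch (the `\(N(10, 2^2)\)` distribution here is arbitrary, for illustration only):

```r
# P(8 < X < 14) for X ~ N(10, 2^2)...
pnorm(14, mean = 10, sd = 2) - pnorm(8, mean = 10, sd = 2)
#> [1] 0.8185946

# ...equals P(-1 < Z < 2) after standardizing
pnorm(2) - pnorm(-1)
#> [1] 0.8185946

# 25th percentile of N(10, 2^2), i.e., 10 + 2 * qnorm(0.25)
qnorm(0.25, mean = 10, sd = 2)
#> [1] 8.65102
```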