class: center, middle, inverse, title-slide

# Central limit theorem
🐪

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01</a>
</span>
</div>

---

## Announcements

- Office hours 2:30-4:30pm today
- Project proposals due next Tuesday

---

class: center, middle

# Sample Statistics and Sampling Distributions

---

## Notation

- Means:
    - Population: mean = `\(\mu\)`, standard deviation = `\(\sigma\)`
    - Sample: mean = `\(\bar{x}\)`, standard deviation = `\(s\)`

- Proportions:
    - Population: `\(p\)`
    - Sample: `\(\hat{p}\)`

<br/>

- Standard error: `\(SE\)`

---

## Variability of sample statistics

- Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

- The variability of these sample statistics is measured by the **standard error**

- Previously we quantified this value via simulation

- Today we talk about the theory underlying **sampling distributions**

---

## Sampling distribution

- A **sampling distribution** is the distribution of sample statistics computed on random samples of size `\(n\)` taken with replacement from a population

- In practice it is impossible to construct sampling distributions, since doing so would require access to the entire population

- Today, for demonstration purposes, we will assume we have access to the population data, construct sampling distributions, and examine their shapes, centers, and spreads

.question[
What is the difference between the sampling and bootstrap distribution?
]

---

## The sampling distribution

.question[
We have a population that is normally distributed with mean 20 and standard deviation 3. Suppose we take samples of size 50 from this distribution, and plot their sample means. What shape, center, and spread will this distribution have?
]

---

## The population

```r
set.seed(20180328)

norm_pop <- tibble(x = rnorm(n = 100000, mean = 20, sd = 3))

ggplot(data = norm_pop, aes(x = x)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Population distribution")
```

<!-- plot: histogram of the population distribution -->

---

## Sampling from the population - 1

```r
samp_1 <- norm_pop %>%
  sample_n(size = 50, replace = TRUE)
```

--

```r
samp_1 %>%
  summarise(x_bar = mean(x))
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  20.2
```

---

## Sampling from the population - 2

```r
samp_2 <- norm_pop %>%
  sample_n(size = 50, replace = TRUE)
```

--

```r
samp_2 %>%
  summarise(x_bar = mean(x))
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  19.5
```

---

## Sampling from the population - 3

```r
samp_3 <- norm_pop %>%
  sample_n(size = 50, replace = TRUE)
```

--

```r
samp_3 %>%
  summarise(x_bar = mean(x))
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  20.4
```

--

...

---

## Sampling distribution

```r
sampling <- norm_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))

sampling
```

```
## # A tibble: 1,000 x 2
##    replicate  xbar
##        <int> <dbl>
##  1         1  19.9
##  2         2  19.6
##  3         3  19.6
##  4         4  20.8
##  5         5  20.1
##  6         6  20.2
##  7         7  20.2
##  8         8  20.5
##  9         9  19.7
## 10        10  19.8
## # ... with 990 more rows
```

---

## Population vs. sampling

<!-- plot: population distribution and sampling distribution, side by side -->

---

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<!-- plot: population distribution and sampling distribution, side by side -->

---

.question[
How do the centers and spreads of these distributions compare?
]

```r
norm_pop %>%
  summarise(mu = mean(x), sigma = sd(x))
```

```
## # A tibble: 1 x 2
##      mu sigma
##   <dbl> <dbl>
## 1  20.0  3.00
```

```r
sampling %>%
  summarise(mean = mean(xbar), se = sd(xbar))
```

```
## # A tibble: 1 x 2
##    mean    se
##   <dbl> <dbl>
## 1  20.0 0.431
```
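---

The simulated spread also matches what theory (formalized shortly as the Central Limit Theorem) predicts. Below is a minimal sanity check, assuming the `sampling` data frame from the previous slides is still in scope: for samples of size 50 from a population with `\(\sigma = 3\)`, the standard error of the mean should be `\(3/\sqrt{50} \approx 0.424\)`.

```r
# theoretical standard error of the mean: sigma / sqrt(n)
3 / sqrt(50)       # 0.4242641

# standard error estimated from the 1,000 simulated sample means
sd(sampling$xbar)  # about 0.43, matching the summary on the previous slide
```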
---

## Simulating another sampling distribution

```r
rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```

<!-- plot: histogram of the right-skewed population distribution -->

```
## # A tibble: 1 x 2
##      mu sigma
##   <dbl> <dbl>
## 1  16.7  14.1
```

---

## Sampling distribution

```r
sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))
```

<!-- plot: histogram of the sampling distribution of the mean -->

```
## # A tibble: 1 x 2
##    mean    se
##   <dbl> <dbl>
## 1  16.6  1.94
```

---

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<!-- plot: population distribution and sampling distribution, side by side -->

---

## Recap

- Regardless of the shape of the population distribution, the sampling distribution of the sample mean is nearly normal.
- The center of the sampling distribution is at the center of the population distribution.
- The sampling distribution is less variable than the population distribution.

--

.question[
What was the one (very unrealistic) assumption we had in simulating these sampling distributions?
]

---

class: center, middle

# Central Limit Theorem

---

## In practice...

We can't directly generate sampling distributions, because we only draw a single sample from the population.

- Statistical inference deals with this very issue: we observe only one sample, and try to make inference about the entire population.
- We have already seen that there are simulation-based methods that help us estimate the sampling distribution.
- Additionally, there are theoretical results (the **Central Limit Theorem**) that tell us about the sampling distributions of certain sample statistics.

---

## Central Limit Theorem

If certain conditions are met (more on this in a bit), the sampling distribution of the sample statistic will be

- nearly normal
- with mean equal to the unknown population parameter
- with standard error inversely proportional to the square root of the sample size

**Single mean:**
`\(\bar{x} \sim N\left(mean = \mu, sd = \frac{\sigma}{\sqrt{n}}\right)\)`

**Difference between two means:**
`\((\bar{x}_1 - \bar{x}_2) \sim N\left(mean = (\mu_1 - \mu_2), sd = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \right)\)`

**Single proportion:**
`\(\hat{p} \sim N\left(mean = p, sd = \sqrt{\frac{p (1-p)}{n}} \right)\)`

**Difference between two proportions:**
`\((\hat{p}_1 - \hat{p}_2) \sim N\left(mean = (p_1 - p_2), sd = \sqrt{\frac{p_1 (1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right)\)`
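---

Even for the strongly right-skewed beta population simulated earlier, the single-mean result holds. A minimal check, assuming `rs_pop` and the corresponding `sampling` data frame of 1,000 sample means are still in scope:

```r
# CLT prediction for the standard error of the mean: sigma / sqrt(n)
sd(rs_pop$x) / sqrt(50)  # roughly 14.1 / sqrt(50), i.e. about 1.99

# standard error estimated from the simulated sampling distribution
sd(sampling$xbar)        # about 1.94 on the earlier slide, a close match
```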
---

## Conditions

- **Independence:** The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
    - the sample must be random
    - if sampling without replacement, the sample size must be less than 10% of the population size

- **Sample size / distribution:**
    - numerical data: The more skewed the sample (and hence the population) distribution, the larger a sample we need. Usually `\(n > 30\)` is considered large enough for population distributions that are not extremely skewed.
    - categorical data: At least 10 successes and 10 failures.

- If comparing two populations, the groups must be independent of each other, and all conditions should be checked for both groups.

---

## Standard Error

The **standard error** is the *standard deviation* of the *sampling distribution*, calculated using sample statistics (since we don't know the population parameters like `\(\sigma\)` or `\(p\)`).

--

- **Single mean:** `\(SE = \frac{s}{\sqrt{n}}\)`

- **Difference between two means:** `\(SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)`

- **Single proportion:** `\(SE = \sqrt{\frac{\hat{p} (1-\hat{p})}{n}}\)`

- **Difference between two proportions:** `\(SE = \sqrt{\frac{\hat{p}_1 (1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)`

--

.question[
How are standard error and sample size associated? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases?
]

---

## What is the normal distribution?

The normal distribution

- is unimodal and symmetric
- follows the 68-95-99.7 rule

<!-- figure: normal curve annotated with the 68-95-99.7 rule -->

---

.question[
Speeds of cars on a highway are normally distributed with mean 65 miles / hour. The minimum speed recorded is 48 miles / hour and the maximum speed recorded is 83 miles / hour. Which of the following is most likely to be the standard deviation of the distribution?
]

(a) -5
(b) 5
(c) 10
(d) 15
(e) 30

---

## What is "unusual" under the normal distribution?

Observations that are more than 2 standard deviations away from the mean are often considered unusual under the normal distribution.

---

.question[
Suppose my iPod has 3,000 songs. The histogram below shows the distribution of the lengths of these songs. We also know that, for songs on this iPod, the mean length is 3.45 minutes and the standard deviation is 1.63 minutes. Calculate the probability that a randomly selected song lasts more than 5 minutes.
]

<!-- plot: histogram of song lengths -->

---

.question[
I'm about to take a trip to visit friends and the drive is 6 hours. I make a random playlist of 100 songs. What is the probability that my playlist lasts the entire drive?

Reminder: For songs on this iPod, the mean length is 3.45 minutes and the standard deviation is 1.63 minutes.
]

Hints:

- You know how to find the distribution of `\(\bar{x}\)` (average song length)
- To find probabilities under the normal curve, use the `pnorm()` function (a worked sketch follows on the next slide)
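---

One way to work through both song questions with `pnorm()`: a solution sketch using only the summary values given above. For the playlist, note that a 6 hour drive is 360 minutes, so the playlist lasts long enough exactly when the average song length `\(\bar{x}\)` is at least `\(360/100 = 3.6\)` minutes.

```r
# P(randomly selected song lasts more than 5 minutes),
# treating song lengths as N(mean = 3.45, sd = 1.63)
pnorm(5, mean = 3.45, sd = 1.63, lower.tail = FALSE)  # about 0.17

# By the CLT, x_bar ~ N(mean = 3.45, sd = 1.63 / sqrt(100)).
# P(playlist of 100 songs lasts at least 360 minutes) = P(x_bar >= 3.6)
pnorm(3.6, mean = 3.45, sd = 1.63 / sqrt(100), lower.tail = FALSE)  # about 0.18
```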
---

## Recap -- why do we care?

Knowing the distribution of the sample statistic can help us

- estimate a population parameter as point estimate `\(\pm\)` margin of error, where the margin of error combines a measure of how confident we want to be with the variability of the sample statistic
- test for a population parameter by evaluating how likely it is to observe a sample statistic at least as extreme as the one observed, assuming the null hypothesis is true, since this probability depends on the variability of the sampling distribution

---

class: center, middle

# Inference methods based on CLT

---

## What is the CLT?

The Central Limit Theorem tells us the distribution of certain sample statistics if necessary conditions are met.

- The distribution of the sample statistic is nearly normal
- The distribution is centered at the (often unknown) population parameter
- The variability of the distribution is inversely proportional to the square root of the sample size

---

## Inference methods based on CLT

If necessary conditions are met, we can also use inference methods based on the CLT:

--

- use the CLT to calculate the SE of the sample statistic of interest (sample mean, sample proportion, difference between sample means, etc.)

--

- calculate the **test statistic**, the number of standard errors the observed sample statistic is away from the null value
    - Z for proportions
    - T for means, along with appropriate degrees of freedom

--

- use the test statistic to calculate the **p-value**, the probability of an outcome at least as extreme as the one observed, given that the null hypothesis is true

---

## Z distribution

Also called the **standard normal distribution**: `\(Z \sim N(mean = 0, sd = 1)\)`

--

Finding probabilities under the normal curve:

.midi[
```r
pnorm(-1.96)
```

```
## [1] 0.0249979
```

```r
pnorm(1.96, lower.tail = FALSE)
```

```
## [1] 0.0249979
```
]

--

Finding cutoff values under the normal curve:

.midi[
```r
qnorm(0.025)
```

```
## [1] -1.959964
```

```r
qnorm(0.975)
```

```
## [1] 1.959964
```
]

---

## T distribution

- Also unimodal and symmetric, and centered at 0
- Thicker tails than the normal distribution (to make up for the additional variability introduced by using `\(s\)` instead of `\(\sigma\)` in the calculation of the SE)
- Parameter: **degrees of freedom**
    - df for single mean: `\(df = n - 1\)`
    - df for comparing two means:

`$$df \approx \frac{(s_1^2/n_1+s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1)+(s_2^2/n_2)^2/(n_2-1)} \approx \min(n_1 - 1, n_2 - 1)$$`

---

## T vs Z distributions

<!-- plot: t distributions for various df overlaid on the standard normal -->

---

## T distribution

Finding probabilities under the t curve:

.midi[
```r
pt(-1.96, df = 9)
```

```
## [1] 0.0408222
```

```r
pt(1.96, df = 9, lower.tail = FALSE)
```

```
## [1] 0.0408222
```
]

--

<br/>

Finding cutoff values under the t curve:

.midi[
```r
qt(0.025, df = 9)
```

```
## [1] -2.262157
```

```r
qt(0.975, df = 9)
```

```
## [1] 2.262157
```
]

---

class: center, middle

# Example

---

## Relaxing after work

.question[
The GSS asks "After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?". Do these data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing? Note that the variable of interest in the dataset is `hrsrelax`.
]

```r
gss <- read_csv("data/gss2010.csv")

gss %>%
  filter(!is.na(hrsrelax)) %>%
  summarise(x_bar = mean(hrsrelax), med = median(hrsrelax),
            sd = sd(hrsrelax), n = n())
```

```
## # A tibble: 1 x 4
##   x_bar   med    sd     n
##   <dbl> <dbl> <dbl> <int>
## 1  3.68     3  2.63  1154
```

---

## Exploratory Data Analysis

```r
ggplot(data = gss, aes(x = hrsrelax)) +
  geom_histogram(binwidth = 1)
```

```
## Warning: Removed 890 rows containing non-finite values (stat_bin).
```

<!-- plot: histogram of hrsrelax -->

---

## Hypotheses

.question[
What are the hypotheses for evaluating whether Americans, on average, spend more than 3 hours per day relaxing?
]

--

`$$H_0: \mu = 3$$`
`$$H_A: \mu > 3$$`

---

## Conditions

.question[
What conditions must be satisfied to conduct this hypothesis test using methods based on the CLT? Are these conditions satisfied?
]

---

## Calculating the test statistic

Summary stats from the sample:

```
## # A tibble: 1 x 3
##    xbar     s     n
##   <dbl> <dbl> <int>
## 1  3.68  2.63  1154
```

And the CLT says:

`$$\bar{x} \sim N\left(mean = \mu, sd = \frac{\sigma}{\sqrt{n}}\right)$$`

--

.question[
How many standard errors away from the population mean is the observed sample mean?
]

--

.question[
How likely are we to observe a sample mean that is at least as extreme as the observed sample mean, if in fact the null hypothesis is true?
]
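---

Spelling the first question out: the test statistic measures the distance between the observed sample mean and the null value in standard error units. Plugging in the summary statistics above, with `\(\mu_0 = 3\)` denoting the null value (the next slide performs the same arithmetic in R),

`$$T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{3.68 - 3}{2.63/\sqrt{1154}} \approx 8.79$$`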
] --- ## Calculations ```r (se <- hrsrelax_summ$s / sqrt(hrsrelax_summ$n)) ``` ``` ## [1] 0.07740938 ``` ```r (t <- (hrsrelax_summ$xbar - 3) / se) ``` ``` ## [1] 8.7876 ``` ```r (df <- hrsrelax_summ$n - 1) ``` ``` ## [1] 1153 ``` ```r pt(t, df, lower.tail = FALSE) ``` ``` ## [1] 2.720895e-18 ``` --- ## Conclusion - Since the p-value is small, we reject `\(H_0\)`. - The data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing after work. -- .question[ Would you expect a 90% confidence interval for the average number of hours Americans spend relaxing after work to include 3 hours? ] --- ## Confidence interval for a mean `$$point~estimate \pm critical~value \times SE$$` ```r t_star <- qt(0.95, df) pt_est <- hrsrelax_summ$xbar round(pt_est + c(-1,1) * t_star * se, 2) ``` ``` ## [1] 3.55 3.81 ``` .question[ Interpret this interval in context of the data. ] --- ## Built-in functionality in R - There are built in functions for doing some of these tests in R: - However a learning goal is this course is not to go through an exhaustive list of all CLT based tests and how to implement them - Instead you should try to understand how these methods are / are not like the simulation based methods we learned about earlier .question[ What is similar, and what is different, between CLT based test of means vs. simulation based test? ] --- .small[ ```r # HT t.test(gss$hrsrelax, mu = 3, alternative = "greater") ``` ``` ## ## One Sample t-test ## ## data: gss$hrsrelax ## t = 8.7876, df = 1153, p-value < 2.2e-16 ## alternative hypothesis: true mean is greater than 3 ## 95 percent confidence interval: ## 3.552813 Inf ## sample estimates: ## mean of x ## 3.680243 ``` ```r # CI t.test(gss$hrsrelax, conf.level = 0.90)$conf.int ``` ``` ## [1] 3.552813 3.807672 ## attr(,"conf.level") ## [1] 0.9 ``` ]