class: center, middle, inverse, title-slide

# The Central Limit Theorem

### Yue Jiang

### Duke University

---

class: center, middle

## Sample Statistics and Sampling Distributions

---

## For the fifth time...

- Statistical inference is the act of generalizing from a sample in order to draw conclusions about a population.

- We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

- As part of this process, we must quantify the degree of uncertainty in our sample statistic.

---

## Variability of sample statistics

- We've seen that each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

- Previously, we quantified this variability via simulation

- Today we discuss some of the theory underlying .vocab[sampling distributions], particularly as they relate to sample means.

---

## Sampling distribution of the mean

Suppose we're interested in the mean resting heart rate of students at Duke, and are able to do the following:

--

1. Take a random sample of size `\(n\)` from this population, and calculate the mean resting heart rate in this sample, `\(\bar{X}_1\)`

--

2. Put the sample back, take a second random sample of size `\(n\)`, and calculate the mean resting heart rate from this new sample, `\(\bar{X}_2\)`

--

3. Put the sample back, take a third random sample of size `\(n\)`, and calculate the mean resting heart rate from this sample, too...

--

...and so on.

--

After repeating this many times, we have a dataset of sample means from the population: `\(\bar{X}_1\)`, `\(\bar{X}_2\)`, `\(\cdots\)`, `\(\bar{X}_K\)` (assuming we took `\(K\)` total samples).

---

## Sampling distribution of the mean

.question[
Can we say anything about the distribution of these sample means (that is, the .vocab[sampling distribution] of the mean)?
]

*(Keep in mind, we don't know what the underlying distribution of resting heart rate looks like in Duke students!)*

---

class: center, middle

## The Central Limit Theorem

---

## A quick caveat for now...

For now, let's assume that the population standard deviation `\(\sigma\)` is known.

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample mean `\(\bar{X}\)`, assuming certain conditions hold:

1. The mean of the sampling distribution of the mean is identical to the population mean `\(\mu\)`,

2. The standard deviation of the sampling distribution of the mean is `\(\sigma/\sqrt{n}\)`, also known as the .vocab[standard error] (SE) of the mean, and

3. For `\(n\)` large enough (in the limit, as `\(n \to \infty\)`), the shape of the sampling distribution of the mean is approximately .vocab[normally distributed].

---

## The normal (Gaussian) distribution?

The normal distribution is unimodal and symmetric and is described by its .vocab[density function]:

If a random variable `\(X\)` follows the normal distribution, then
`$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}$$`
where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance.

We often write `\(N(\mu, \sigma^2)\)` to describe this distribution.

---

## The normal distribution (graphically)

<img src="13-clt_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Wait, *any* distribution?

The central limit theorem tells us that *sample means* are approximately normally distributed, if we have enough data and certain assumptions hold.

This is true *even if our original variables are not normally distributed*.

[**Check out this interactive demonstration**](http://onlinestatbook.com/stat_sim/sampling_dist/index.html)

---

## Conditions

What are the conditions we need for the CLT to hold?
- **Independence:** The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
    - the sample must be taken randomly
    - if sampling without replacement, the sample size must be less than 10% of the population size

If observations are independent, then *a priori*, one observation's value does not "influence" another's.

- **Sample size / distribution:**
    - if the data are numerical, usually `\(n > 30\)` is considered a large enough sample for the CLT to kick in
    - if we know for sure that the underlying data are normally distributed, then the distribution of sample means will also be exactly normal, regardless of the sample size
    - if the data are categorical, we need at least 10 successes and 10 failures

---

## Let's run our own simulation

**The underlying population** (we never observe this!)

```r
rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```

<img src="13-clt_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

**The true population parameters**

```r
rs_pop %>%
  summarise(mu = mean(x), sigma = sd(x))
```

```
#> # A tibble: 1 x 2
#>      mu sigma
#>   <dbl> <dbl>
#> 1  16.7  14.1
```

---

## Sampling from the population - 1

```r
set.seed(1)
samp_1 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

---

## Sampling from the population - 2

```r
set.seed(2)
samp_2 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

---

## Sampling from the population - 3

```r
samp_3 <- rs_pop %>%
  sample_n(size = 50, replace = TRUE) %>%
  summarise(x_bar = mean(x))
```

--

keep repeating...
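Before simulating the full set of repetitions, note what the CLT predicts for these repeated sample means: they should be centered near `\(\mu \approx 16.7\)`, with standard error `\(\sigma/\sqrt{n}\)`. A quick sketch, plugging in the population values estimated from `rs_pop` above:

```r
# CLT-predicted standard error of the mean, sigma / sqrt(n),
# using the population sd (~14.1) from rs_pop above and n = 50
sigma <- 14.1
n <- 50
sigma / sqrt(n)
#> [1] 1.994036
```

We can compare this theoretical prediction to the empirical standard error from the simulation that follows.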
---

## Sampling distribution

```r
sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))
```

<img src="13-clt_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

```r
sampling %>%
  summarise(mean = mean(xbar), se = sd(xbar))
```

```
#> # A tibble: 1 x 2
#>    mean    se
#>   <dbl> <dbl>
#> 1  16.7  2.01
```

---

## Comparing two distributions

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<img src="13-clt_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## Recap

- If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.

- The center of the sampling distribution is at the center of the population distribution.

- The sampling distribution is less variable than the population distribution (and we can quantify by how much).

.question[
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases?
]

---

class: center, middle

## Finding probabilities in R

---

## Standard normal distribution: N(0, 1)

.small[
Finding probabilities under the normal curve:

```r
pnorm(-1.5)
```

```
#> [1] 0.0668072
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

--

<img src="13-clt_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

```r
pnorm(2) - pnorm(-1)
```

```
#> [1] 0.8185946
```

---

## Standard normal distribution: N(0, 1)

How might we find the probability of being *between* two values?

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

```r
pnorm(2) - pnorm(-1)
```

```
#> [1] 0.8185946
```

---

## Standard normal distribution: N(0, 1)

.small[
Finding cutoff values under the normal curve:

```r
qnorm(0.25)
```

```
#> [1] -0.6744898
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Your turn!

[https://classroom.github.com/a/bDVjo9kD](https://classroom.github.com/a/bDVjo9kD)
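---

## Beyond N(0, 1)

An aside: `pnorm()` and `qnorm()` also accept `mean` and `sd` arguments, so probabilities and cutoffs for *any* normal distribution can be computed without standardizing first. A quick sketch (the `\(N(10, 2^2)\)` distribution here is arbitrary, for illustration only):

```r
# P(8 < X < 14) for X ~ N(10, 2^2)...
pnorm(14, mean = 10, sd = 2) - pnorm(8, mean = 10, sd = 2)
#> [1] 0.8185946

# ...equals P(-1 < Z < 2) after standardizing
pnorm(2) - pnorm(-1)
#> [1] 0.8185946

# 25th percentile of N(10, 2^2), i.e., 10 + 2 * qnorm(0.25)
qnorm(0.25, mean = 10, sd = 2)
#> [1] 8.65102
```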