class: center, middle, inverse, title-slide

# The CLT and Confidence Intervals

### Yue Jiang

### Duke University

---

### A roadmap for today

Today's topic may be a bit theoretical, but it is fundamental to making sound statistical inferences and conclusions. The basic idea is called the .copper[central limit theorem], which states that for **any** distribution with a well-defined mean and variance, the distribution of the means computed from samples of size `\(n\)` will be approximately normal (Gaussian).

We will extend this idea to talk about interval estimation, a way to define a "plausible range of values" for a population parameter.

---

### Review: What is statistical inference?

.copper[Statistical inference] is the act of generalizing from a .copper[sample] in order to make conclusions regarding a .copper[population], while quantifying the degree of certainty we have.

We are interested in population .copper[parameters], which we do not observe. Instead, we must calculate .copper[statistics] from our sample in order to learn about the parameters.

---

### The sampling distribution of the mean

Suppose we're interested in the resting heart rate of students at Duke, and are able to do the following:

1. Take a random sample of size `\(n\)` from this population
    - Calculate the mean resting heart rate *in this sample*, `\(\bar{x}_1\)`
2. Put the sample back, take a second random sample of size `\(n\)`
    - Calculate the mean resting heart rate from this new sample, `\(\bar{x}_2\)`
3. Put the sample back, take a third random sample of size `\(n\)`
    - Calculate the mean resting heart rate from this sample, too...
4. ...and so on.

After repeating this process `\(M\)` times, we have a dataset of sample averages from the population: `\(\{ \bar{x}_1, \bar{x}_2, \cdots, \bar{x}_M \}\)`.

.question[
Can we say anything about the distribution of these sample means?
]

---

### The central limit theorem

The .copper[central limit theorem] states that for a population with mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of sample averages:

1. The mean of the sampling distribution is identical to the population mean `\(\mu\)`
2. The standard deviation of the distribution of the sample averages is `\(\sigma/\sqrt{n}\)`, known as the .copper[standard error] (SE) of the mean
3. For `\(n\)` large enough (in the limit, as `\(n \to \infty\)`), the shape of the sampling distribution of means is approximately normal

---

### What if the population dist. is not normal?

The central limit theorem tells us that **sample averages** are approximately normally distributed if we have enough data. This is true **even if** our original variables are not normally distributed.

[Interactive central limit theorem demonstration](http://onlinestatbook.com/stat_sim/sampling_dist/)

---

### Another experiment

Define a variable `\(X\)` to be 1 if a student at Duke has brown eyes, and 0 if a student does not have brown eyes.

- What is the distribution of `\(X\)`?
- If we take a random sample, the average is an estimate of the true proportion of brown-eyed students in our population of interest
- If we take repeated random samples from our population and calculate the proportion in each sample with brown eyes, what values might we expect? Will we get the same values every time?

---

### Another experiment

The central limit theorem tells us that the distribution of sample averages should have mean `\(E(X)\)` and standard deviation `\(SD(X)/\sqrt{n}\)`

.question[
In our case, what are these values? (We can check with the simulation on the next slide.)
]
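---

### Another experiment

Here is a minimal simulation sketch in R; the true proportion `p` and sample size `n` below are made-up values for illustration:

```r
set.seed(42)
p <- 0.55    # hypothetical true proportion of brown-eyed students
n <- 50      # hypothetical sample size
M <- 10000   # number of repeated samples

# Each sample mean is the proportion with brown eyes in a sample of size n
xbar <- replicate(M, mean(rbinom(n, size = 1, prob = p)))

mean(xbar)              # close to E(X) = p
sd(xbar)                # close to SD(X) / sqrt(n)
sqrt(p * (1 - p) / n)   # the standard error predicted by the CLT
```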
---

### IQ tests

.pull-left[
IQ tests are designed to have a probability distribution with `\(\mu = 100\)` and `\(\sigma = 15\)`.

Suppose we draw samples of size `\(n = 20\)` from this population. From the central limit theorem, the distribution of the sample averages will be approximately normal with mean 100 and standard deviation `\(15/\sqrt{20} \approx 3.35\)`
]

.pull-right[
<img src="iqtest2.jpg" width="100%" style="display: block; margin: auto;" />
]

---

### IQ tests

If the population distribution is normal to begin with, then the distribution of the sample averages will also be exactly normal

If the population distribution is not normal, then the rule of thumb is that we need at least `\(n = 30\)` for the central limit theorem to kick in and provide approximate normality

---

### Example

Suppose I give a random sample of `\(n = 30\)` statistics students an IQ test*, and the sample average score is 120. Does this mean that these students are smarter than average?

.small[*I know there are lots of problems with IQ and IQ testing...bear with me here!]

---

### Example

The central limit theorem tells us that the distribution of means of samples of size 30 from this population is also normal, with mean `\(\mu = 100\)` and `\(SE = \sigma/\sqrt{n} = 15/\sqrt{30} \approx 2.7\)`.

`\(Z = \frac{\bar{X} - \mu}{SE}\)` is a standard normal random variable, and here `\(Z \approx 7.3\)`. The probability of a `\(z\)`-score greater than this is extremely small (a quick check in R follows).

.question[
What does this mean? Could we be "wrong"?
]
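---

### Example

A quick check of this calculation in R, using the values from the previous slide:

```r
mu <- 100; sigma <- 15; n <- 30; xbar <- 120

se <- sigma / sqrt(n)          # approx 2.74
z  <- (xbar - mu) / se         # approx 7.30
pnorm(z, lower.tail = FALSE)   # P(Z > 7.3) is on the order of 10^-13
```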
<!-- --- -->

<!-- ### Some additional questions -->

<!-- What are the upper and lower limits that enclose 95% of the means for samples of size `\(n\)` drawn from the population? -->

<!-- How large would our random samples need to be for 95% of their averages to lie within `\(\pm 10\)` of the population mean `\(\mu\)`? -->

---

### The opioid crisis

<img src="westvirginia.PNG" width="40%" style="display: block; margin: auto;" />

<img src="fig1.jpg" width="80%" style="display: block; margin: auto;" />

West Virginia has the highest age-adjusted rate of drug overdose deaths involving opioids

---

### Statistical inference

- .copper[Point estimation]: estimating an unknown parameter using a single number calculated from the sample
- .copper[Interval estimation]: estimating an unknown parameter using a range of values that is likely to contain the true parameter
- .copper[Hypothesis testing]: evaluating whether our observed sample data provides evidence against some population claim

---

### Why should we care about interval estimation?

<img src="errorbars.png" width="60%" style="display: block; margin: auto;" />

---

### What is a confidence interval, anyway?

<img src="confidence.png" width="80%" style="display: block; margin: auto;" />

A .copper[confidence interval] gives a range of values that is intended to cover the parameter of interest with a certain degree of "confidence"

Confidence interval = .copper[point estimate] `\(\pm\)` .copper[margin of error]

---

### How do you interpret a confidence interval?

<img src="ci.png" width="60%" style="display: block; margin: auto;" />

Researchers conducted a clinical trial of a drug intended for severe asthma patients. Their primary endpoint was evaluating whether the mean rate of asthma exacerbation over 48 weeks was different between the placebo and treatment arms. Above is the 95% confidence interval for the mean rate among the placebo patients.

.question[
How do you interpret this interval? (more on this very soon)
]

---

### Brief caveat

For now, let's assume that we know `\(\sigma\)` (this very rarely happens, since it is a population parameter)

---

### Two-sided confidence intervals

Given a random variable `\(X\)` with mean `\(\mu\)` and standard deviation `\(\sigma\)`, the CLT tells us that

`\begin{align*}
Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}},
\end{align*}`

where `\(Z\)` has a standard normal distribution if `\(X\)` is normal, and is approximately standard normal if `\(X\)` is not normal but `\(n\)` is large enough

---

### Deriving the two-sided interval

For a standard normal random variable `\(Z \sim N(0, 1)\)`, 95% of the observations lie between -1.96 and 1.96, so

`\begin{align*}
0.95 = P(-1.96 \le Z \le 1.96)
\end{align*}`

Substituting `\(Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\)` and rearranging the inequalities to isolate `\(\mu\)` gives

`\begin{align*}
0.95 = P\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

So, a 95% CI is given by

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

---

### Confidence intervals for the mean

`\begin{align*}
\left( \bar{X} - z^\star_{1 - \alpha/2}\frac{\sigma}{\sqrt{n}}, \bar{X} + z^\star_{1 - \alpha/2}\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

Point estimate `\(\pm\)` margin of error

Point estimate `\(\pm\)` confidence multiplier `\(\times\)` SE

---

### Other coverage probabilities

The .copper[confidence multiplier], `\(z^\star_{1-\alpha/2}\)`, is the z-score that cuts off the upper `\(100 \times \alpha/2\)`% of the distribution (i.e., the `\(100(1 - \alpha/2)\)`th percentile)

For `\(\alpha = 0.05\)`, we have `\(1 - \alpha/2 = 0.975\)`, and so `\(z^\star\)` is the 97.5th percentile of the standard normal distribution (calculated using software packages; see the next slide)

Compromising on the confidence level to obtain narrower CIs is highly frowned upon!
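---

### Other coverage probabilities

A minimal sketch of these calculations in R; the sample mean, `\(\sigma\)`, and `\(n\)` below are made-up values for illustration:

```r
# Confidence multipliers are quantiles of the standard normal distribution
qnorm(0.975)   # z* for a 95% CI (alpha = 0.05), approx 1.96
qnorm(0.995)   # z* for a 99% CI (alpha = 0.01), approx 2.58

# 95% CI for the mean when sigma is assumed known (hypothetical values)
xbar  <- 72    # sample mean resting heart rate
sigma <- 10    # "known" population standard deviation
n     <- 50

xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)   # point estimate +/- margin of error
```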
---

### CI interpretation

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

Suppose we select `\(M\)` different random samples of size `\(n\)` from the population, and use them to calculate `\(M\)` different 95% CIs in the same way as above. We expect about 95% of these intervals to cover the true `\(\mu\)`, and about 5% not to

[Interactive activity](https://digitalfirst.bfwpub.com/stats_applet/stats_applet_4_ci.html)

---

### CI interpretation

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

.pull-center[
.question[
<b><strike>There is a 95% chance that `\(\mu\)` lies in the interval</strike></b>.
]
]

---

### CI interpretation

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

**Important**: we do not know whether any particular interval is among the 95% that cover the mean or the 5% that do not

Since `\(\mu\)` is a parameter, it is either in our confidence interval or not

---

### When can we use this CI?

`\begin{align*}
\left( \bar{X} - z^\star\frac{\sigma}{\sqrt{n}}, \bar{X} + z^\star\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

Remember, this interval is only appropriate when `\(\sigma\)` is known, and either `\(X\)` is normal, or `\(X\)` is *not* normal but `\(n\)` is sufficiently large

---

### What can we do if SD isn't known?

<img src="guinness.jpg" width="80%" style="display: block; margin: auto;" />

---

### What can we do if SD isn't known?

.pull-left[
<img src="gosset.jpg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
As a Guinness Brewery employee, William Sealy Gosset published a paper on the `\(t\)` distribution under the pseudonym "Student" (the brewery didn't allow him to publish under his own name), which is why it became known as Student's `\(t\)`

The `\(t\)` distribution is used to construct a confidence interval for the mean when `\(\sigma\)` is unknown and must be estimated from the data
]

---

### Student's t distribution

.pull-left[
- The `\(t\)` distribution looks like the normal distribution, except it has fatter tails, leading to wider CIs
- This is due to the uncertainty involved in estimating `\(\sigma\)` using `\(s\)`
- As the sample size increases, `\(s\)` becomes a better and better estimate of `\(\sigma\)`, and so the `\(t\)` distribution looks more and more like the normal distribution
]

.pull-right[
<img src="normalvt.png" width="100%" style="display: block; margin: auto;" />
]

---

### Degrees of freedom

The .copper[degrees of freedom] of a `\(t\)` distribution tell us how much information is "available" for estimating `\(\sigma\)` using `\(s\)`. The random variable

`\begin{align*}
t = \frac{\bar{X} - \mu}{s/\sqrt{n}}
\end{align*}`

has a `\(t\)` distribution with `\(n-1\)` degrees of freedom (`\(df\)`), which we denote by `\(t_{n-1}\)` (we lose one `\(df\)` because `\(s\)` is computed using `\(\bar{X}\)` as an estimate of `\(\mu\)`). The `\(t\)` distribution has only one parameter (`\(df\)`)

---

### Two-sided interval with unknown SD

`\begin{align*}
\left( \bar{X} - t^\star_{n-1; 1 - \alpha/2}\frac{s}{\sqrt{n}}, \bar{X} + t^\star_{n-1; 1 - \alpha/2}\frac{s}{\sqrt{n}} \right)
\end{align*}`

.question[
What about one-sided intervals? When might we want to use such a thing? (see the sketch on the next slide)
]
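---

### Two-sided interval with unknown SD

A minimal sketch of this interval in R, using a simulated (made-up) sample:

```r
set.seed(1)
x <- rnorm(20, mean = 100, sd = 15)   # hypothetical sample of 20 IQ scores
n <- length(x)

# 95% t-interval computed "by hand"
t_star <- qt(0.975, df = n - 1)
mean(x) + c(-1, 1) * t_star * sd(x) / sqrt(n)

# t.test() constructs the same two-sided 95% interval by default
t.test(x)$conf.int
```

For a one-sided interval, `t.test()` accepts `alternative = "less"` or `alternative = "greater"`.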