Central Limit Theorem
Inference based on the Central Limit Theorem
Due Thursday: Read Sections 2.5 - 2.8 on OpenIntro: Intro Stat with Randomization and Simulation (http://www.openintro.org/isrs)
Due Next Tuesday: HW4
Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)
The variability of these sample statistics is measured by the standard error
Previously we quantified this value via simulation
Today we talk about the theory underlying sampling distributions
Sampling distribution is the distribution of sample statistics of random samples of size \(n\) taken from a population
In practice it is impossible to construct sampling distributions since it would require having access to the entire population
Today, for demonstration purposes, we will assume we have access to the population data so that we can construct sampling distributions and examine their shapes, centers, and spreads
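A minimal sketch of this idea (not part of the original example): simulate a population, draw many samples, and plot the resulting sample means. The population parameters and sample size below are made up for illustration.

library(ggplot2)
# simulated "population" for demonstration purposes only
population = data.frame(x = rnorm(1e5, mean = 50, sd = 5))
# draw 1,000 random samples of size n = 30 and record each sample mean
samp_means = replicate(1000, mean(sample(population$x, size = 30)))
# the distribution of these means approximates the sampling distribution of the sample mean
ggplot(data = data.frame(xbar = samp_means), aes(x = xbar)) +
  geom_histogram(binwidth = 0.25)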
d = data.frame(norm_samp = rnorm(100, mean = 50, sd = 5))
ggplot(data = d, aes(sample = norm_samp)) +
  geom_point(alpha = 0.7, stat = "qq")
Data are plotted on the y-axis of a normal probability plot and theoretical quantiles (following a normal distribution) on the x-axis.
If there is a one-to-one relationship between the data and the theoretical quantiles, then the data follow a nearly normal distribution.
Since a one-to-one relationship would appear as a straight line on a scatter plot, the closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.
Data (y-coordinates) | Percentile | Theoretical Quantiles (x-coordinates) |
---|---|---|
37.5 | 0.5 / 100 = 0.005 | qnorm(0.005) = -2.58 |
38.0 | 1.5 / 100 = 0.015 | qnorm(0.015) = -2.17 |
38.3 | 2.5 / 100 = 0.025 | qnorm(0.025) = -1.96 |
39.5 | 3.5 / 100 = 0.035 | qnorm(0.035) = -1.81 |
… | … | … |
61.9 | 99.5 / 100 = 0.995 | qnorm(0.995) = 2.58 |
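As a hedged sketch (assuming the d data frame of 100 values from the earlier chunk), the three columns of this table can be reproduced in R:

x = sort(d$norm_samp)               # ordered data (y-coordinates)
percentile = (1:100 - 0.5) / 100    # percentile of each ordered observation
theoretical = qnorm(percentile)     # theoretical quantiles (x-coordinates)
head(data.frame(x, percentile, theoretical), 3)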
Best to think about what is happening with the most extreme values - here the biggest values are bigger than we would expect and the smallest values are smaller than we would expect (for a normal).
Here the biggest values are smaller than we would expect and the smallest values are bigger than we would expect.
Here the biggest values are bigger than we would expect and the smallest values are also bigger than we would expect.
Here the biggest values are smaller than we would expect and the smallest values are also smaller than we would expect.
We can't directly know what the sampling distribution looks like, because we only draw a single sample.
The whole point of statistical inference is to deal with this issue: we observe only one sample and try to make inferences about the entire population.
We have already seen that there are simulation-based methods that help us derive the sampling distribution
Additionally, there are theoretical results (Central Limit Theorem) that tell us what the sampling distribution should look like (for certain sample statistics)
If certain conditions are met (more on this in a bit), the sampling distribution of the sample statistic will be nearly normally distributed with mean equal to the population parameter and standard error proportional to the inverse of the square root of the sample size.
The standard error is the standard deviation of the sampling distribution.
Single mean: \(SE = \frac{\sigma}{\sqrt{n}}\)
Difference between two means: \(SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
Single proportion: \(SE = \sqrt{\frac{p (1-p)}{n}}\)
Difference between two proportions: \(SE = \sqrt{\frac{p_1 (1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\)
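As a rough illustration (the summary statistics below are made up, not from any dataset in these notes), each SE formula can be evaluated directly in R:

sigma = 5; n = 100                        # single mean
sigma / sqrt(n)
s1 = 4; n1 = 50; s2 = 6; n2 = 60          # difference between two means
sqrt(s1^2 / n1 + s2^2 / n2)
p = 0.3; n = 100                          # single proportion
sqrt(p * (1 - p) / n)
p1 = 0.3; n1 = 100; p2 = 0.4; n2 = 120    # difference between two proportions
sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)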
If necessary conditions are met, we can also use inference methods based on the CLT:
use the CLT to calculate the SE of the sample statistic of interest (sample mean, sample proportion, difference between sample means, etc.)
use the test statistic to calculate the p-value, the probability of an observed or more extreme outcome given that the null hypothesis is true
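For example, a minimal sketch of a CLT-based test for a single proportion (the values p0 = 0.5, p_hat = 0.55, and n = 1000 are hypothetical, not from these notes):

p0 = 0.5; p_hat = 0.55; n = 1000   # hypothetical values for illustration
se = sqrt(p0 * (1 - p0) / n)       # SE of the sample proportion under H0
z = (p_hat - p0) / se              # test statistic
pnorm(z, lower.tail = FALSE)       # one-sided p-value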
Also called the standard normal distribution: \(Z \sim N(\mu = 0, \sigma = 1)\)
Finding probabilities under the normal curve:
pnorm(-1.96)
## [1] 0.0249979
pnorm(1.96, lower.tail = FALSE)
## [1] 0.0249979
Finding cutoff values under the normal curve:
qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964
Also unimodal and symmetric, and centered at 0
Thicker tails than the normal distribution (to make up for additional variability introduced by using \(s\) instead of \(\sigma\) in calculation of the SE)
Parameter: degrees of freedom
df for single mean: \(df = n - 1\)
df for comparing two means:
\[df \approx \frac{(s_1^2/n_1+s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1)+(s_2^2/n_2)^2/(n_2-1)} \approx min(n_1 - 1, n_2 - 1)\]
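A quick sketch comparing the two approximations, with made-up sample summaries:

s1 = 4; n1 = 20; s2 = 6; n2 = 25   # hypothetical summary statistics
v1 = s1^2 / n1; v2 = s2^2 / n2
(v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))   # Welch approximation
min(n1 - 1, n2 - 1)                                 # conservative approximation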
Finding probabilities under the t curve:
pt(-1.96, df = 9)
## [1] 0.0408222
pt(1.96, df = 9, lower.tail = FALSE)
## [1] 0.0408222
Finding cutoff values under the t curve:
qt(0.025, df = 9)
## [1] -2.262157
qt(0.975, df = 9)
## [1] 2.262157
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
2010 GSS:
gss = read.csv("https://stat.duke.edu/~mc301/data/gss2010.csv")
One of the questions on the survey is "After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?". Do these data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing? Note that the variable of interest in the dataset is hrsrelax.
gss %>%
  filter(!is.na(hrsrelax)) %>%
  summarise(mean(hrsrelax), median(hrsrelax), sd(hrsrelax), length(hrsrelax))
##   mean(hrsrelax) median(hrsrelax) sd(hrsrelax) length(hrsrelax)
## 1       3.680243                3     2.629641             1154
ggplot(data = gss, aes(x = hrsrelax)) + geom_histogram(binwidth = 1)
What are the hypotheses for evaluating whether Americans, on average, spend more than 3 hours per day relaxing?
\[H_0: \mu = 3\] \[H_A: \mu > 3\]
Independence: The GSS uses a reasonably random sample, and the sample size of 1,154 is less than 10% of the US population, so we can assume that the respondents in this sample are independent of each other.
Sample size / skew: The distribution of hours relaxed is right skewed, however the sample size is large enough for the sampling distribution to be nearly normal.
\[\bar{x} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)\] \[ \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \sim T_{df=n-1} \]
\[T_{df} = \frac{obs - null}{SE} = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}\] \[df = n - 1\]
# summary stats
hrsrelax_summ = gss %>%
  filter(!is.na(hrsrelax)) %>%
  summarise(xbar = mean(hrsrelax), s = sd(hrsrelax), n = n())
# calculations
(se = hrsrelax_summ$s / sqrt(hrsrelax_summ$n))
## [1] 0.07740938
(t = (hrsrelax_summ$xbar - 3) / se)
## [1] 8.7876
(df = hrsrelax_summ$n - 1)
## [1] 1153
p-value = P(observed or more extreme outcome | \(H_0\) true)
pt(t, df, lower.tail = FALSE)
## [1] 2.720895e-18
Since the p-value is small, we reject \(H_0\).
The data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing after work.
Would you expect a 90% confidence interval for the average number of hours Americans spend relaxing after work to include 3 hours?
\[point~estimate \pm critical~value \times SE\]
t_star = qt(0.95, df)
pt_est = hrsrelax_summ$xbar
round(pt_est + c(-1,1) * t_star * se, 2)
## [1] 3.55 3.81
Interpret this interval in context of the data.
# HT
t.test(gss$hrsrelax, mu = 3, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  gss$hrsrelax
## t = 8.7876, df = 1153, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 3
## 95 percent confidence interval:
##  3.552813      Inf
## sample estimates:
## mean of x 
##  3.680243
# CI
t.test(gss$hrsrelax, conf.level = 0.90)$conf.int
## [1] 3.552813 3.807672
## attr(,"conf.level")
## [1] 0.9