November 5, 2015
CLT
Inference for a mean
Inference for difference in two means
Due Thursday: HW 3 (will be emailed after class)
If certain conditions are met, the sampling distribution of the sample statistic will be nearly normal, with mean equal to the population parameter and standard error inversely proportional to the square root of the sample size.
If necessary conditions are met, we can also use inference methods based on the CLT:
use the CLT to calculate the SE of the sample statistic of interest (sample mean, sample proportion, difference between sample means, etc.)
use the test statistic to calculate the p-value, the probability of the observed or a more extreme outcome given that the null hypothesis is true
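The CLT claim can be checked with a small simulation. This is a minimal sketch, not part of the original analysis: the exponential population, the sample size `n`, the number of simulations `n_sims`, and the seed are all assumptions chosen for illustration. If the CLT holds, the simulated sample means should center near \(\mu = 1\) and have a standard deviation near \(\sigma / \sqrt{n}\).

```r
# Simulate the sampling distribution of the sample mean for a skewed population.
set.seed(1105)                    # arbitrary seed, for reproducibility only
n      <- 50                      # size of each sample (assumed)
n_sims <- 10000                   # number of simulated samples (assumed)

# Population: exponential with rate 1, so mu = 1 and sigma = 1
sample_means <- replicate(n_sims, mean(rexp(n, rate = 1)))

sim_center <- mean(sample_means)  # should be close to mu = 1
sim_se     <- sd(sample_means)    # should be close to sigma / sqrt(n)
theory_se  <- 1 / sqrt(n)

c(center = sim_center, simulated_se = sim_se, theoretical_se = theory_se)
```

Quadrupling `n` should roughly halve the simulated standard error, which is the square-root relationship in action.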
Also called the standard normal distribution: \(Z \sim N(mean = 0, \sigma = 1)\)
Finding probabilities under the normal curve:
pnorm(-1.96)
## [1] 0.0249979
pnorm(1.96, lower.tail = FALSE)
## [1] 0.0249979
qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964
Also unimodal and symmetric, and centered at 0
Thicker tails than the normal distribution (to make up for additional variability introduced by using \(s\) instead of \(\sigma\) in calculation of the SE)
pt(-1.96, df = 9)
## [1] 0.0408222
pt(1.96, df = 9, lower.tail = FALSE)
## [1] 0.0408222
qt(0.025, df = 9)
## [1] -2.262157
qt(0.975, df = 9)
## [1] 2.262157
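To see how the thicker tails fade as the degrees of freedom grow, one option is to compare t critical values against the normal one. This is an illustrative sketch; the particular `dfs` values are arbitrary choices, not from the original slides.

```r
# Upper 2.5% critical values of the t distribution for increasing df
dfs    <- c(2, 9, 30, 100, 1000)
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)

# Side-by-side comparison; the last column is how far t exceeds z
data.frame(df = dfs, t_crit = t_crit, excess_over_z = t_crit - z_crit)
```

The excess shrinks toward zero as df increases, so for large samples the t and normal distributions give nearly the same answers.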
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
2010 GSS:
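library(dplyr)    # %>%, filter(), summarise(), group_by()
library(ggplot2)  # ggplot(), geom_histogram()
gss <- read.csv("https://stat.duke.edu/~mc301/data/gss2010.csv")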
Data dictionary at https://gssdataexplorer.norc.org/variables/vfilter
Note that not all questions are asked every year
One of the questions on the survey is "After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?". Do these data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing? Note that the variable of interest in the dataset is hrsrelax.
gss %>% filter(!is.na(hrsrelax)) %>% summarise(mean(hrsrelax), median(hrsrelax), sd(hrsrelax), length(hrsrelax))
##   mean(hrsrelax) median(hrsrelax) sd(hrsrelax) length(hrsrelax)
## 1       3.680243                3     2.629641             1154
ggplot(data = gss, aes(x = hrsrelax)) + geom_histogram(binwidth = 1)
What are the hypotheses for evaluating whether Americans, on average, spend more than 3 hours per day relaxing?
\[H_0: \mu = 3\] \[H_A: \mu > 3\]
Independence: The GSS uses a reasonably random sample, and the sample size of 1,154 is less than 10% of the US population, so we can assume that the respondents in this sample are independent of each other.
Sample size / skew: The distribution of hours relaxed is right skewed; however, the sample size is large enough for the sampling distribution to be nearly normal.
\[\bar{x} \sim N\left(mean = \mu, SE = \frac{s}{\sqrt{n}}\right)\] \[T_{df} = \frac{obs - null}{SE}\] \[df = n - 1\]
# summary stats
hrsrelax_summ <- gss %>%
  filter(!is.na(hrsrelax)) %>%
  summarise(xbar = mean(hrsrelax), s = sd(hrsrelax), n = length(hrsrelax))
# calculations
se <- hrsrelax_summ$s / sqrt(hrsrelax_summ$n)
t <- (hrsrelax_summ$xbar - 3) / se
df <- hrsrelax_summ$n - 1
p-value = P(observed or more extreme outcome | \(H_0\) true)
pt(t, df, lower.tail = FALSE)
## [1] 2.720895e-18
Since the p-value is small, we reject \(H_0\).
The data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing after work.
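The same test can be reproduced from the reported summary statistics alone, without downloading the data. This is a sketch that hard-codes the values printed in the summary output above; the variable names (`se_hat`, `t_stat`, `dof`) are chosen here for illustration and to avoid masking R's built-in `t()` and `df()`.

```r
# Summary statistics copied from the output above
xbar <- 3.680243
s    <- 2.629641
n    <- 1154
mu0  <- 3                      # null value

se_hat <- s / sqrt(n)          # standard error of the sample mean
t_stat <- (xbar - mu0) / se_hat
dof    <- n - 1

p_value <- pt(t_stat, dof, lower.tail = FALSE)  # one-sided, H_A: mu > 3
c(se = se_hat, t = t_stat, df = dof, p_value = p_value)
```

The test statistic is close to 8.79, so the sample mean sits almost nine standard errors above the null value, which is why the p-value is so small.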
Would you expect a 90% confidence interval for the average number of hours Americans spend relaxing after work to include 3 hours?
\[point~estimate \pm critical~value \times SE\]
t_star <- qt(0.95, df)
pt_est <- hrsrelax_summ$xbar
round(pt_est + c(-1,1) * t_star * se, 2)
## [1] 3.55 3.81
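A one-sided test at \(\alpha = 0.05\) and a 90% confidence interval share the same critical value, which is why the two are expected to agree. The sketch below illustrates this using the summary statistics reported earlier; the names `reject_h0` and `null_inside` are introduced here only for illustration.

```r
xbar <- 3.680243; s <- 2.629641; n <- 1154   # from the summary output above
mu0   <- 3
alpha <- 0.05

se_hat <- s / sqrt(n)
t_stat <- (xbar - mu0) / se_hat
t_star <- qt(1 - alpha, df = n - 1)          # same critical value for both

ci_90 <- xbar + c(-1, 1) * t_star * se_hat   # 90% CI: alpha in each tail
reject_h0   <- t_stat > t_star               # one-sided test decision
null_inside <- mu0 >= ci_90[1] && mu0 <= ci_90[2]

round(ci_90, 2)
c(reject_h0 = reject_h0, null_inside = null_inside)
```

Rejecting \(H_0\) in favor of \(\mu > 3\) happens exactly when 3 falls below the lower bound of the 90% interval, so here the interval excludes 3.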
Interpret this interval in context of the data.
# HT
t.test(gss$hrsrelax, mu = 3, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  gss$hrsrelax
## t = 8.7876, df = 1153, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 3
## 95 percent confidence interval:
##  3.552813      Inf
## sample estimates:
## mean of x 
##  3.680243
# CI
t.test(gss$hrsrelax, conf.level = 0.90)$conf.int
## [1] 3.552813 3.807672
## attr(,"conf.level")
## [1] 0.9
Is there a difference in the average number of hours spent relaxing after work between males and females? What are the hypotheses?
\[H_0: \mu_{M} = \mu_{F}\] \[H_A: \mu_{M} \ne \mu_{F}\]
Note that the variable identifying males and females in the dataset is sex.
What type of visualization would be appropriate for evaluating this research question?
hrsrelax_sex_summ <- gss %>%
  filter(!is.na(hrsrelax)) %>%
  group_by(sex) %>%
  summarise(xbar = mean(hrsrelax), s = sd(hrsrelax), n = length(hrsrelax))
hrsrelax_sex_summ
## Source: local data frame [2 x 4]
## 
##      sex     xbar        s     n
##   (fctr)    (dbl)    (dbl) (int)
## 1 FEMALE 3.449180 2.396948   610
## 2   MALE 3.939338 2.848216   544
\[(\bar{x}_1 - \bar{x}_2) \sim N\left(mean = (\mu_1 - \mu_2), SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right)\] \[T_{df} = \frac{obs - null}{SE}\] \[df = min(n_1 - 1, n_2 - 1)\]
se <- sqrt((hrsrelax_sex_summ$s[1]^2 / hrsrelax_sex_summ$n[1])
+ (hrsrelax_sex_summ$s[2]^2 / hrsrelax_sex_summ$n[2]))
t <- ((hrsrelax_sex_summ$xbar[1] - hrsrelax_sex_summ$xbar[2]) - 0) / se
df <- min(hrsrelax_sex_summ$n[1], hrsrelax_sex_summ$n[2]) - 1
p-value = P(observed or more extreme outcome | \(H_0\) true)
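# two-sided p-value; abs() keeps this correct whichever group is subtracted first
2 * pt(abs(t), df, lower.tail = FALSE)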
## [1] 0.001767347
Assuming \(\alpha = 0.05\), what is the conclusion of the hypothesis test?
What is the equivalent confidence level to this hypothesis test? At this level would you expect a confidence interval to include the difference in average number of hours relaxed by all American males and females?
\[point~estimate \pm critical~value \times SE\]
t_star <- qt(0.975, df)
pt_est <- hrsrelax_sex_summ$xbar[1] - hrsrelax_sex_summ$xbar[2]
round(pt_est + c(-1,1) * t_star * se, 2)
## [1] -0.80 -0.18
Interpret this interval in context of the data. Make sure to indicate which group has a higher/lower mean in your interpretation.
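Note that the t.test function computes the degrees of freedom with the Welch (Satterthwaite) formula rather than the conservative \(min(n_1 - 1, n_2 - 1)\) used above, so its p-value and interval differ slightly from the hand calculations.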
# HT
t.test(gss$hrsrelax ~ gss$sex, mu = 0, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  gss$hrsrelax by gss$sex
## t = -3.1424, df = 1066.3, p-value = 0.001722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7962283 -0.1840875
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             3.449180             3.939338
# CI
t.test(gss$hrsrelax ~ gss$sex)$conf.int
## [1] -0.7962283 -0.1840875
## attr(,"conf.level")
## [1] 0.95
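The degrees of freedom reported by t.test (1066.3) can be reproduced by hand from the group summaries. This is a sketch assuming the Welch-Satterthwaite formula, which is what t.test applies when `var.equal = FALSE` (its default); the variable names are illustrative.

```r
# Group summaries copied from the output above (1 = FEMALE, 2 = MALE)
s1 <- 2.396948; n1 <- 610
s2 <- 2.848216; n2 <- 544

v1 <- s1^2 / n1                 # squared-SE contribution of group 1
v2 <- s2^2 / n2                 # squared-SE contribution of group 2

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
df_cons  <- min(n1 - 1, n2 - 1) # conservative choice used in the hand calculation

c(welch = df_welch, conservative = df_cons)
```

With samples this large, both choices lead to nearly the same critical value, which is why the hand-calculated interval and the t.test interval agree to two decimal places.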