Sta112FS 17. CLT based inference, Pt. 2

November 5, 2015

Today's agenda

CLT
Inference for a mean
Inference for difference in two means
Due Thursday: HW 3 (will be emailed after class)

Central Limit Theorem

In practice…

We can't construct sampling distributions directly, because we don't have access to the entire population data
- this is the whole point of statistical inference: observe only one sample, try to make inference about the entire population
Hence we rely on the Central Limit Theorem to tell us what the sampling distribution would look like, if we could construct it

Central Limit Theorem

If certain conditions are met, the sampling distribution of the sample statistic will be nearly normally distributed, with mean equal to the population parameter and standard error inversely proportional to the square root of the sample size.

Single mean: \(\bar{x} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)\)
Difference between two means: \((\bar{x}_1 - \bar{x}_2) \sim N\left(mean = (\mu_1 - \mu_2), SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right)\)
Single proportion: \(\hat{p} \sim N\left(mean = p, SE = \sqrt{\frac{p (1-p)}{n}} \right)\)
Difference between two proportions: \((\hat{p}_1 - \hat{p}_2) \sim N\left(mean = (p_1 - p_2), SE = \sqrt{\frac{p_1 (1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right)\)
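The single-mean result can be checked by simulation when the population is known. The sketch below is illustrative only: the exponential population, the sample size, the number of repetitions, and the seed are all assumptions chosen for the demonstration, not part of the lecture data.

```r
# Simulation sketch: the population here is an assumption (right-skewed
# exponential with mean 1 and sd 1), chosen only to illustrate the CLT.
set.seed(112)                      # arbitrary seed for reproducibility
n     <- 50                        # size of each sample
reps  <- 10000                     # number of simulated samples
mu    <- 1                         # population mean of Exp(rate = 1)
sigma <- 1                         # population sd of Exp(rate = 1)

# Draw many samples and keep the mean of each one
xbars <- replicate(reps, mean(rexp(n, rate = 1)))

# Compare the simulated sampling distribution with the CLT's prediction
sim_mean <- mean(xbars)            # should be close to mu
sim_se   <- sd(xbars)              # should be close to sigma / sqrt(n)
clt_se   <- sigma / sqrt(n)

c(sim_mean = sim_mean, sim_se = sim_se, clt_se = clt_se)
```

A histogram of xbars should look roughly bell shaped even though the population itself is strongly right skewed.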

Conditions:

Independence: The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
- the sample must be random
- if sampling without replacement, sample size must be less than 10% of population size
Sample size / distribution:
- numerical data: The more skewed the sample (and hence the population) distribution, the larger samples we need. Usually n > 30 is considered a large enough sample for population distributions that are not extremely skewed.
- categorical data: At least 10 successes and 10 failures.
If comparing two populations, the samples must be independent of each other, and all conditions should be checked for both groups.

Inference methods based on CLT

If necessary conditions are met, we can also use inference methods based on the CLT:

use the CLT to calculate the SE of the sample statistic of interest (sample mean, sample proportion, difference between sample means, etc.)
calculate the test statistic: the number of standard errors the observed sample statistic is away from the null value
- T for means, along with appropriate degrees of freedom
- Z for proportions
use the test statistic to calculate the p-value, the probability of an observed or more extreme outcome given that the null hypothesis is true
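The three steps can be sketched for the Z case with a single proportion. The counts and the null value below are hypothetical, made up for illustration. Evaluating the SE at the null value p0 is the usual convention for a hypothesis test on a proportion and is assumed here.

```r
# Hypothetical data (not from the lecture): 60 successes in n = 100 trials,
# testing H0: p = 0.5 against HA: p > 0.5
n     <- 100
x     <- 60
p0    <- 0.5
p_hat <- x / n

# Success-failure condition, checked with the null value
stopifnot(n * p0 >= 10, n * (1 - p0) >= 10)

# Step 1: standard error from the CLT, evaluated at the null value
se <- sqrt(p0 * (1 - p0) / n)

# Step 2: test statistic = number of SEs the observed statistic is from the null
z <- (p_hat - p0) / se

# Step 3: p-value = upper-tail area, since HA is "greater than"
p_value <- pnorm(z, lower.tail = FALSE)

c(se = se, z = z, p_value = p_value)
```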

Z and T distributions

Z distribution

Also called the standard normal distribution: \(Z \sim N(mean = 0, \sigma = 1)\)
Finding probabilities under the normal curve:

pnorm(-1.96)

## [1] 0.0249979

pnorm(1.96, lower.tail = FALSE)

## [1] 0.0249979

Finding cutoff values under the normal curve:

qnorm(0.025)

## [1] -1.959964

qnorm(0.975)

## [1] 1.959964

T distribution

Also unimodal and symmetric, and centered at 0
Thicker tails than the normal distribution (to make up for additional variability introduced by using \(s\) instead of \(\sigma\) in calculation of the SE)
Parameter: degrees of freedom
- df for single mean: \(df = n - 1\)
- df for comparing two means: \(df = min(n_1 - 1, n_2 - 1)\)
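The effect of the thicker tails can be seen by comparing tail areas across degrees of freedom. This is a small sketch. The df values are arbitrary choices for illustration.

```r
# Lower-tail area below -1.96 for several degrees of freedom
# (the df values are arbitrary choices for illustration)
dfs     <- c(2, 9, 30, 100, 1000)
t_tails <- pt(-1.96, df = dfs)     # tail area under each t curve
z_tail  <- pnorm(-1.96)            # tail area under the standard normal curve

data.frame(df = dfs, t_tail = t_tails, z_tail = z_tail)
```

The t tail areas are always larger than the normal tail area and shrink toward it as the degrees of freedom increase.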

T distribution (cont.)

Finding probabilities under the t curve:

pt(-1.96, df = 9)

## [1] 0.0408222

pt(1.96, df = 9, lower.tail = FALSE)

## [1] 0.0408222

Finding cutoff values under the t curve:

qt(0.025, df = 9)

## [1] -2.262157

qt(0.975, df = 9)

## [1] 2.262157

Examples

General Social Survey

Data

2010 GSS:

gss <- read.csv("https://stat.duke.edu/~mc301/data/gss2010.csv")

Data dictionary at https://gssdataexplorer.norc.org/variables/vfilter
Note that not all questions are asked every year

Inference for a single mean

Hypothesis testing for a mean

One of the questions on the survey is "After an average work day, about how many hours do you have to relax or pursue activities that you enjoy?". Do these data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing? Note that the variable of interest in the dataset is hrsrelax.

library(dplyr)  # provides %>%, filter(), summarise()
gss %>% filter(!is.na(hrsrelax)) %>%
  summarise(mean(hrsrelax), median(hrsrelax), sd(hrsrelax), length(hrsrelax))

##   mean(hrsrelax) median(hrsrelax) sd(hrsrelax) length(hrsrelax)
## 1       3.680243                3     2.629641             1154

library(ggplot2)  # provides ggplot(), aes(), geom_histogram()
ggplot(data = gss, aes(x = hrsrelax)) + geom_histogram(binwidth = 1)

Hypotheses

What are the hypotheses for evaluating whether Americans, on average, spend more than 3 hours per day relaxing?

\[H_0: \mu = 3\] \[H_A: \mu > 3\]

Conditions

Independence: The GSS uses a reasonably random sample, and the sample size of 1,154 is less than 10% of the US population, so we can assume that the respondents in this sample are independent of each other.
Sample size / skew: The distribution of hours relaxed is right skewed, but the sample size (n = 1,154) is large enough for the sampling distribution of the mean to be nearly normal.

Calculating the test statistic

\[\bar{x} \sim N\left(mean = \mu, SE = \frac{s}{\sqrt{n}}\right)\] \[T_{df} = \frac{obs - null}{SE}\] \[df = n - 1\]

# summary stats
hrsrelax_summ <- gss %>% 
  filter(!is.na(hrsrelax)) %>%
  summarise(xbar = mean(hrsrelax), s = sd(hrsrelax), n = length(hrsrelax))

# calculations
se <- hrsrelax_summ$s / sqrt(hrsrelax_summ$n)
t <- (hrsrelax_summ$xbar - 3) / se
df <- hrsrelax_summ$n - 1

p-value

p-value = P(observed or more extreme outcome | \(H_0\) true)

pt(t, df, lower.tail = FALSE)

## [1] 2.720895e-18

Conclusion

Since the p-value is small, we reject \(H_0\).
The data provide convincing evidence that Americans, on average, spend more than 3 hours per day relaxing after work.

Would you expect a 90% confidence interval for the average number of hours Americans spend relaxing after work to include 3 hours?

Confidence interval for a mean

\[point~estimate \pm critical~value \times SE\]

t_star <- qt(0.95, df)
pt_est <- hrsrelax_summ$xbar
round(pt_est + c(-1,1) * t_star * se, 2)

## [1] 3.55 3.81

Interpret this interval in context of the data.

In R

# HT
t.test(gss$hrsrelax, mu = 3, alternative = "greater")

## 
##  One Sample t-test
## 
## data:  gss$hrsrelax
## t = 8.7876, df = 1153, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 3
## 95 percent confidence interval:
##  3.552813      Inf
## sample estimates:
## mean of x 
##  3.680243

# CI
t.test(gss$hrsrelax, conf.level = 0.90)$conf.int

## [1] 3.552813 3.807672
## attr(,"conf.level")
## [1] 0.9

Aside: confidence vs. significance levels

Equivalency of confidence and significance levels

Two sided alternative HT with \(\alpha\) \(\rightarrow\) \(CL = 1 - \alpha\)
One sided alternative HT with \(\alpha\) \(\rightarrow\) \(CL = 1 - (2 \times \alpha)\)
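The one-sided case can be checked numerically with the hrsrelax example. The sketch below works from the summary statistics printed earlier rather than the raw data, so small rounding differences are possible. The variable names are illustrative.

```r
# Summary statistics copied from the hrsrelax output above
xbar  <- 3.680243
s     <- 2.629641
n     <- 1154
null  <- 3
alpha <- 0.05

se     <- s / sqrt(n)
df     <- n - 1
t_stat <- (xbar - null) / se

# One-sided test at significance level alpha
p_value        <- pt(t_stat, df, lower.tail = FALSE)
reject_by_test <- p_value < alpha

# Matching two-sided interval: CL = 1 - 2 * alpha = 0.90
t_star       <- qt(1 - alpha, df)
ci           <- xbar + c(-1, 1) * t_star * se
reject_by_ci <- ci[1] > null       # lower bound lies above the null value

round(ci, 2)
c(reject_by_test = reject_by_test, reject_by_ci = reject_by_ci)
```

Both decisions agree: the one-sided test at \(\alpha = 0.05\) rejects \(H_0\) exactly when the lower bound of the 90% interval lies above the null value.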

Inference for difference of two means

Hypothesis testing for a difference of two means

Is there a difference in the average number of hours spent relaxing after work between males and females? What are the hypotheses?

\[H_0: \mu_{M} = \mu_{F}\] \[H_A: \mu_{M} \ne \mu_{F}\]

Note that the variable identifying males and females in the dataset is sex.

Exploratory analysis

What type of visualization would be appropriate for evaluating this research question?

Summary statistics

hrsrelax_sex_summ <- gss %>% 
  filter(!is.na(hrsrelax)) %>%
  group_by(sex) %>%
  summarise(xbar = mean(hrsrelax), s = sd(hrsrelax), n = length(hrsrelax))
hrsrelax_sex_summ

## Source: local data frame [2 x 4]
## 
##      sex     xbar        s     n
##   (fctr)    (dbl)    (dbl) (int)
## 1 FEMALE 3.449180 2.396948   610
## 2   MALE 3.939338 2.848216   544

Calculating the test statistic

\[(\bar{x}_1 - \bar{x}_2) \sim N\left(mean = (\mu_1 - \mu_2), SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right)\] \[T_{df} = \frac{obs - null}{SE}\] \[df = min(n_1 - 1, n_2 - 1)\]

se <- sqrt((hrsrelax_sex_summ$s[1]^2 / hrsrelax_sex_summ$n[1]) 
           + (hrsrelax_sex_summ$s[2]^2 / hrsrelax_sex_summ$n[2]))
t <- ((hrsrelax_sex_summ$xbar[1] - hrsrelax_sex_summ$xbar[2]) - 0) / se
df <- min(hrsrelax_sex_summ$n[1], hrsrelax_sex_summ$n[2]) - 1

p-value

p-value = P(observed or more extreme outcome | \(H_0\) true)

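# two-sided p-value: double the tail area beyond |t|, whatever the sign of t
2 * pt(abs(t), df, lower.tail = FALSE)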

## [1] 0.001767347

Assuming \(\alpha = 0.05\), what is the conclusion of the hypothesis test?

Equivalency to a confidence interval

What is the equivalent confidence level to this hypothesis test? At this level would you expect a confidence interval to include the difference in average number of hours relaxed by all American males and females?

Confidence interval for a difference in means

\[point~estimate \pm critical~value \times SE\]

t_star <- qt(0.975, df)
pt_est <- hrsrelax_sex_summ$xbar[1] - hrsrelax_sex_summ$xbar[2]
round(pt_est + c(-1,1) * t_star * se, 2)

## [1] -0.80 -0.18

Interpret this interval in context of the data. Make sure to indicate which group has a higher/lower mean in your interpretation.

In R

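Note that the t.test function uses the Welch-Satterthwaite approximation for the degrees of freedom rather than the conservative \(min(n_1 - 1, n_2 - 1)\), so its p-value and interval differ slightly from the hand calculations.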

# HT
t.test(gss$hrsrelax ~ gss$sex, mu = 0, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  gss$hrsrelax by gss$sex
## t = -3.1424, df = 1066.3, p-value = 0.001722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7962283 -0.1840875
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             3.449180             3.939338

# CI
t.test(gss$hrsrelax ~ gss$sex)$conf.int

## [1] -0.7962283 -0.1840875
## attr(,"conf.level")
## [1] 0.95
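The degrees of freedom in the Welch output above (1066.3) can be reproduced from the group summary statistics. The sketch below applies the Welch-Satterthwaite formula to the values printed in the summary table. The variable names are illustrative, and results may differ in the last digits because the inputs are rounded.

```r
# Summary statistics copied from the hrsrelax_sex_summ output above
xbar1 <- 3.449180; s1 <- 2.396948; n1 <- 610    # FEMALE
xbar2 <- 3.939338; s2 <- 2.848216; n2 <- 544    # MALE

# Per-group variance of the sample mean, and the SE of the difference
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
# Conservative choice used in the hand calculation
df_min   <- min(n1 - 1, n2 - 1)

# Same test statistic, two-sided p-value under each df
t_stat  <- (xbar1 - xbar2) / se
p_welch <- 2 * pt(abs(t_stat), df_welch, lower.tail = FALSE)
p_min   <- 2 * pt(abs(t_stat), df_min,   lower.tail = FALSE)

c(df_welch = df_welch, df_min = df_min, p_welch = p_welch, p_min = p_min)
```

The smaller df gives a slightly larger p-value, which is why the min-based rule is described as conservative.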