CLT based inference, Pt 2

Today's agenda

Inference for difference in two means
Inference for a proportion
Inference for difference in two proportions
Reminder - HW4 due on Tuesday 11/15 by 11 pm (repos on GitHub)

Inference for difference of two means

Data

2010 GSS:

library(dplyr)  # provides %>%, filter(), group_by(), and summarise() used below
gss = read.csv("https://stat.duke.edu/~mc301/data/gss2010.csv")

Data dictionary at https://gssdataexplorer.norc.org/variables/vfilter

Hypothesis testing for a difference of two means

Is there a difference between males and females in the average number of hours spent relaxing after work? What are the hypotheses?

\[H_0: \mu_{M} = \mu_{F}\] \[H_A: \mu_{M} \ne \mu_{F}\]

Note that the variable identifying males and females in the dataset is sex.

Exploratory analysis

What type of visualization would be appropriate for evaluating this research question?

Summary statistics

(hrsrelax_sex_summ = gss %>% 
  filter(!is.na(hrsrelax)) %>%
  group_by(sex) %>%
  summarise(xbar = mean(hrsrelax), s = sd(hrsrelax), n = length(hrsrelax)))

## # A tibble: 2 × 4
##      sex     xbar        s     n
##   <fctr>    <dbl>    <dbl> <int>
## 1 FEMALE 3.449180 2.396948   610
## 2   MALE 3.939338 2.848216   544

Calculating the test statistic

\[ \begin{aligned} t &= \frac{obs - null}{SE} = \frac{(\bar{x}_1-\bar{x}_2) - 0}{\sqrt{s^2_1/n_1+s^2_2/n_2}} \sim T_{df} \\ df &\approx \frac{(s_1^2/n_1+s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1)+(s_2^2/n_2)^2/(n_2-1)} \approx min(n_1 - 1, n_2 - 1) \end{aligned} \]

(se = sqrt((hrsrelax_sex_summ$s[1]^2 / hrsrelax_sex_summ$n[1]) 
           + (hrsrelax_sex_summ$s[2]^2 / hrsrelax_sex_summ$n[2])))

## [1] 0.155984

(t = ((hrsrelax_sex_summ$xbar[1] - hrsrelax_sex_summ$xbar[2]) - 0) / se)

## [1] -3.14236

(df = min(hrsrelax_sex_summ$n[1], hrsrelax_sex_summ$n[2]) - 1)

## [1] 543
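The two degrees-of-freedom formulas above can give quite different values. A minimal sketch, plugging in the rounded summary statistics printed earlier (the helper name `welch_df` is hypothetical, introduced only for illustration):

```r
# Summary statistics copied from the output above (group 1 = FEMALE, group 2 = MALE)
s1 = 2.396948; n1 = 610
s2 = 2.848216; n2 = 544

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df = function(s1, n1, s2, n2) {
  v1 = s1^2 / n1
  v2 = s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

df_welch = welch_df(s1, n1, s2, n2)
df_conservative = min(n1 - 1, n2 - 1)

c(welch = df_welch, conservative = df_conservative)
```

The smaller df has heavier tails, so it yields a slightly larger p-value and a slightly wider interval. That is why the min rule is described as conservative.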

p-value

p-value = P(observed or more extreme outcome | \(H_0\) true)

pt(-abs(t), df) * 2

## [1] 0.001767347

pt(t, df) + pt(-t, df, lower.tail=FALSE)

## [1] 0.001767347

Equivalence to a confidence interval

What is the equivalent confidence level to this hypothesis test? At this level would you expect a confidence interval to include the difference in average number of hours relaxed by all American males and females?

Confidence interval for a difference in means

\[point~estimate \pm critical~value \times SE\] \[(\bar{x}_1-\bar{x}_2) \pm t^* \times \sqrt{s^2_1/n_1+s^2_2/n_2} \]

(t_star = qt(0.975, df))

## [1] 1.964342

(pt_est = hrsrelax_sex_summ$xbar[1] - hrsrelax_sex_summ$xbar[2])

## [1] -0.4901579

round(pt_est + c(-1,1) * t_star * se, 3)

## [1] -0.797 -0.184
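The link between a two-sided test at significance level alpha and a (1 - alpha) confidence interval can be checked numerically. The sketch below reuses the rounded values printed above; `ci_contains_zero` is a hypothetical helper, not part of the original slides.

```r
# Values copied from the output above
pt_est = -0.4901579   # xbar_FEMALE - xbar_MALE
se     = 0.155984
df     = 543

t_stat  = (pt_est - 0) / se
p_value = 2 * pt(-abs(t_stat), df)

# Does a (1 - alpha) confidence interval contain the null value 0?
ci_contains_zero = function(alpha) {
  ci = pt_est + c(-1, 1) * qt(1 - alpha / 2, df) * se
  ci[1] <= 0 && 0 <= ci[2]
}

for (alpha in c(0.10, 0.05, 0.01, 0.001)) {
  cat("alpha =", alpha,
      "| reject H0:", p_value < alpha,
      "| CI contains 0:", ci_contains_zero(alpha), "\n")
}
```

At each level the two columns disagree: H0 is rejected exactly when the matching interval excludes 0.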

HT in R

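Note that the t.test function uses the Welch-Satterthwaite approximation for the degrees of freedom, not the conservative \(\min(n_1-1,n_2-1)\), so its p-value and interval differ slightly from the hand calculation.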

# HT
t.test(gss$hrsrelax ~ gss$sex, mu = 0, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  gss$hrsrelax by gss$sex
## t = -3.1424, df = 1066.3, p-value = 0.001722
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7962283 -0.1840875
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             3.449180             3.939338

CI in R

t.test(gss$hrsrelax ~ gss$sex)$conf.int

## [1] -0.7962283 -0.1840875
## attr(,"conf.level")
## [1] 0.95

HT - Simulation

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_MALE = 544, y_bar_MALE = 3.9393, s_MALE = 3
## n_FEMALE = 610, y_bar_FEMALE = 3.4492, s_FEMALE = 3
## H0: mu_MALE =  mu_FEMALE
## HA: mu_MALE != mu_FEMALE
## p_value = 9e-04
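The code that produced this output is not shown on the slide. One common simulation approach for this kind of test is a permutation test, sketched below on synthetic data. The data, the group sizes, and the helper `perm_test_diff_means` are all hypothetical and stand in for the GSS variables; this is not necessarily the method behind the output above.

```r
set.seed(1115)

# Hypothetical data standing in for hrsrelax and sex (not the GSS data)
hours = c(rnorm(60, mean = 3.4, sd = 2.4), rnorm(50, mean = 3.9, sd = 2.8))
group = rep(c("FEMALE", "MALE"), times = c(60, 50))

perm_test_diff_means = function(y, g, n_sim = 5000) {
  lev = sort(unique(g))  # expects exactly two groups
  diff_means = function(labels) {
    mean(y[labels == lev[1]]) - mean(y[labels == lev[2]])
  }
  obs = diff_means(g)
  # Shuffling the labels breaks any link between group and response,
  # which is what H0 claims
  null_diffs = replicate(n_sim, diff_means(sample(g)))
  # Two-sided p-value: share of shuffled differences at least as extreme
  list(observed = obs, p_value = mean(abs(null_diffs) >= abs(obs)))
}

res = perm_test_diff_means(hours, group)
res
```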

CI - Simulation

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_MALE = 544, y_bar_MALE = 3.9393, s_MALE = 2.8482
## n_FEMALE = 610, y_bar_FEMALE = 3.4492, s_FEMALE = 2.3969
## 95% CI (MALE - FEMALE): (0.1838 , 0.7939)
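Note the sign: this interval is for MALE - FEMALE, while the hand calculation earlier used FEMALE - MALE. A quick check with the rounded summary statistics from above shows that reversing the order simply flips and negates the endpoints:

```r
# Summary statistics copied from the earlier output
xbar_f = 3.449180; s_f = 2.396948; n_f = 610
xbar_m = 3.939338; s_m = 2.848216; n_m = 544

se     = sqrt(s_f^2 / n_f + s_m^2 / n_m)
t_star = qt(0.975, df = min(n_f, n_m) - 1)

ci_f_minus_m = (xbar_f - xbar_m) + c(-1, 1) * t_star * se
ci_m_minus_f = (xbar_m - xbar_f) + c(-1, 1) * t_star * se

round(rbind(ci_f_minus_m, ci_m_minus_f), 3)
```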

Inference for a proportion

Hypothesis testing for a proportion

Another question on the survey is "Do you think the use of marijuana should be made legal or not?". Do these data provide convincing evidence that a majority of Americans think that the use of marijuana should not be legal? Note that the variable of interest in the dataset is grass.

(grass_summ = gss %>%
  filter(!is.na(grass)) %>%
  summarise(x = sum(grass == "NOT LEGAL"), n = length(grass), p_hat = x / n))

##     x    n     p_hat
## 1 656 1259 0.5210485

What are the hypotheses?

Let \(p\) be the proportion of all Americans who do not think marijuana should be legalized.

\[H_0: p = 0.5\] \[H_A: p > 0.5\]

Calculating the test statistic

\[Z = \frac{obs - null}{SE} = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1-p_0)}{n}}} \sim N(0,1)\]

p_0 = 0.5
(se = sqrt(p_0 * (1-p_0) / grass_summ$n))

## [1] 0.0140915

(Z = (grass_summ$p_hat - p_0) / se)

## [1] 1.493699

p-value

p-value = \(P(\text{observed or more extreme outcome}~|~H_0~\text{true})\)

pnorm(Z, lower.tail = FALSE)

## [1] 0.06762719

Confidence interval for a proportion

\[point~estimate \pm critical~value \times SE\] \[ \hat{p} \pm z^* \times \sqrt{\frac{\hat{p} (1-\hat{p})}{n}} \]

(z_star = qnorm(0.95))  # critical value for a 90% confidence interval

## [1] 1.644854

(se = sqrt(grass_summ$p_hat * (1 - grass_summ$p_hat) / grass_summ$n))

## [1] 0.01407901

round(grass_summ$p_hat + c(-1,1) * z_star * se, 3)

## [1] 0.498 0.544

HT in R

Note that the prop.test function reports a chi-squared statistic with 1 degree of freedom instead of Z. The two are equivalent: X-squared = \(Z^2\).

prop.test(grass_summ$x, grass_summ$n, p = 0.5, alternative = "greater", correct = FALSE)

## 
##  1-sample proportions test without continuity correction
## 
## data:  grass_summ$x out of grass_summ$n, null probability 0.5
## X-squared = 2.2311, df = 1, p-value = 0.06763
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.4978702 1.0000000
## sample estimates:
##         p 
## 0.5210485
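The relationship between the hand-computed Z statistic and the X-squared value reported by prop.test can be verified directly. A short check using the counts from the summary above:

```r
x = 656; n = 1259; p_0 = 0.5   # counts from the summary above

Z   = (x / n - p_0) / sqrt(p_0 * (1 - p_0) / n)
res = prop.test(x, n, p = p_0, alternative = "greater", correct = FALSE)

# The chi-squared statistic is the square of the Z statistic
c(Z_squared = Z^2, X_squared = unname(res$statistic))

# The one-sided p-values agree as well
c(from_Z = pnorm(Z, lower.tail = FALSE), from_prop_test = res$p.value)
```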

CI in R

prop.test(grass_summ$x, grass_summ$n, correct = FALSE, conf.level = 0.90)$conf.int

## [1] 0.4978702 0.5441364
## attr(,"conf.level")
## [1] 0.9
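This interval agrees with the hand calculation only after rounding. Without the continuity correction, prop.test computes a Wilson score interval for a single proportion rather than the \(\hat{p} \pm z^* \times SE\) (Wald) interval used above. A sketch comparing the two with the counts from the summary:

```r
x = 656; n = 1259
p_hat = x / n
z = qnorm(0.95)   # critical value for a 90% interval

# Wald interval: the formula used in the hand calculation
wald = p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Wilson score interval
center = (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   = z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson = center + c(-1, 1) * half

from_prop_test = as.numeric(
  prop.test(x, n, correct = FALSE, conf.level = 0.90)$conf.int
)

rbind(wald, wilson, from_prop_test)
```

With n this large the two intervals differ only in the fourth or fifth decimal place.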

HT - Simulation

## Single categorical variable, success: NOT LEGAL
## n = 1259, p-hat = 0.521
## H0: p = 0.5
## HA: p > 0.5
## p_value = 0.0717
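The code behind this output is not shown. One way to obtain a simulation-based p-value of this kind is to draw many samples of size n under \(H_0: p = 0.5\) and count how often the simulated count is at least as large as the observed one. A minimal sketch (the seed and number of simulations are arbitrary choices):

```r
set.seed(1115)

x_obs = 656; n = 1259; p_0 = 0.5
n_sim = 10000

# Number of "NOT LEGAL" answers in each simulated sample when H0 is true
x_null = rbinom(n_sim, size = n, prob = p_0)

# One-sided p-value: share of simulated samples at least as extreme as observed
(p_value_sim = mean(x_null >= x_obs))
```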

CI - Simulation

## Single categorical variable, success: NOT LEGAL
## n = 1259, p-hat = 0.521
## 95% CI: (0.4932 , 0.5488)
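A bootstrap percentile interval is one way such a simulation-based interval can be built; whether the output above used exactly this method is an assumption. The sketch rebuilds the responses from the counts in the summary, so the raw data are not needed:

```r
set.seed(1115)

x = 656; n = 1259

# Rebuild the observed responses as a 0/1 vector (1 = "NOT LEGAL")
responses = rep(c(1, 0), times = c(x, n - x))

# Resample with replacement and record the sample proportion each time
boot_p_hat = replicate(5000, mean(sample(responses, replace = TRUE)))

# Middle 95% of the bootstrap distribution
(ci_boot = quantile(boot_p_hat, c(0.025, 0.975)))
```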

Inference for a difference in two proportions

Hypothesis test for a difference in two proportions

Is there a difference between the proportions of people who think marijuana should not be legalized based on whether they favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?

Let \(p\) represent the proportion of people who do not think marijuana should be legalized.

\[H_0: p_{not~legal|favor~gunlaw} = p_{not~legal|oppose~gunlaw}\] \[H_A: p_{not~legal|favor~gunlaw} \ne p_{not~legal|oppose~gunlaw}\]

Note that the variable identifying people who are pro and anti gun laws in the dataset is gunlaw.

Exploratory analysis

What type of visualization would be appropriate for evaluating this research question?

Summary statistics

table(gss$gunlaw, gss$grass) %>% addmargins()

##         
##          LEGAL NOT LEGAL Sum
##   FAVOR    209       211 420
##   OPPOSE    70        65 135
##   Sum      279       276 555

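# Assumption: gss_gun_grss (not defined on these slides) holds rows with both answers
gss_gun_grss = gss %>% filter(!is.na(gunlaw), !is.na(grass))
(gss_gun_grss_summ = gss_gun_grss %>% 
  group_by(gunlaw) %>%
  summarise(x = sum(grass == "NOT LEGAL"), n = length(grass), p_hat = x / n))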

## # A tibble: 2 × 4
##   gunlaw     x     n     p_hat
##   <fctr> <int> <int>     <dbl>
## 1  FAVOR   211   420 0.5023810
## 2 OPPOSE    65   135 0.4814815

Calculating the test statistic

\[(\hat{p}_1 - \hat{p}_2) \sim N\left(mean = (p_1 - p_2),~SE = \sqrt{ \frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2} } \right)\] \[Z = \frac{obs - null}{SE} = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{ \frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2} }}\]

Under \(H_0\), \(p_1\) and \(p_2\) are equal, so we need a single reasonable value to use for both, one that makes sense in the context of these data.

Remember the null hypothesis is equivalent to claiming the two variables are independent.

Pooled proportion

         LEGAL  NOT LEGAL  Sum
FAVOR      209        211  420
OPPOSE      70         65  135
Sum        279        276  555

(p_pool = 276 / 555)

## [1] 0.4972973

Calculating the test statistic

(se = sqrt( (p_pool * (1-p_pool))/gss_gun_grss_summ$n[1] + 
              (p_pool * (1-p_pool))/gss_gun_grss_summ$n[2] ) )

## [1] 0.04946735

(Z = ((gss_gun_grss_summ$p_hat[1] - gss_gun_grss_summ$p_hat[2]) - 0) / se)

## [1] 0.4224902

p-value

p-value = P(observed or more extreme outcome | \(H_0\) true)

pnorm(abs(Z), lower.tail = FALSE) * 2

## [1] 0.6726672

Confidence interval

What is the equivalent confidence level to this hypothesis test? At this level would you expect a confidence interval to include 0?

Confidence interval for a difference in proportions

\[point~estimate \pm critical~value \times SE\]

\[(\hat{p}_1-\hat{p}_2) \pm Z^* \times \sqrt{ \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2} } \]

The only difference is that SE is calculated using the sample proportions, and not the pooled proportion.

(z_star = qnorm(0.95))  # critical value for a 90% confidence interval

## [1] 1.644854

p1 = gss_gun_grss_summ$p_hat[1]; n1 = gss_gun_grss_summ$n[1]
p2 = gss_gun_grss_summ$p_hat[2]; n2 = gss_gun_grss_summ$n[2]
pt_est = p1 - p2
(se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2))

## [1] 0.04944225

round(pt_est + c(-1,1) * z_star * se, 3)

## [1] -0.060  0.102
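The interval above uses qnorm(0.95), which corresponds to a 90% confidence level, whereas prop.test reports a 95% interval by default. A small sketch computing both levels from the counts in the table (`wald_ci` is a hypothetical helper):

```r
x1 = 211; n1 = 420   # FAVOR
x2 = 65;  n2 = 135   # OPPOSE

p1 = x1 / n1
p2 = x2 / n2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Interval for p1 - p2 at a given confidence level, using the unpooled SE
wald_ci = function(conf_level) {
  z_star = qnorm(1 - (1 - conf_level) / 2)
  (p1 - p2) + c(-1, 1) * z_star * se
}

round(rbind("90%" = wald_ci(0.90), "95%" = wald_ci(0.95)), 3)
```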

In R

prop.test(x = c(gss_gun_grss_summ$x[1], gss_gun_grss_summ$x[2]),
          n = c(gss_gun_grss_summ$n[1], gss_gun_grss_summ$n[2]), correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  c(gss_gun_grss_summ$x[1], gss_gun_grss_summ$x[2]) out of c(gss_gun_grss_summ$n[1], gss_gun_grss_summ$n[2])
## X-squared = 0.1785, df = 1, p-value = 0.6727
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.07600556  0.11780450
## sample estimates:
##    prop 1    prop 2 
## 0.5023810 0.4814815
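The X-squared value here is the square of the pooled Z statistic, and it is also the statistic of a chi-squared test of independence on the two-way table. This is the earlier remark about independence made concrete. A sketch that rebuilds the table from the counts shown above:

```r
# Two-way table rebuilt from the counts above (filled column by column)
tab = matrix(c(209, 70, 211, 65), nrow = 2,
             dimnames = list(gunlaw = c("FAVOR", "OPPOSE"),
                             grass  = c("LEGAL", "NOT LEGAL")))

n       = rowSums(tab)
p_hat   = tab[, "NOT LEGAL"] / n
p_pool  = sum(tab[, "NOT LEGAL"]) / sum(tab)
se_pool = sqrt(p_pool * (1 - p_pool) * sum(1 / n))
Z       = unname((p_hat["FAVOR"] - p_hat["OPPOSE"]) / se_pool)

# Chi-squared test of independence without continuity correction
chi = chisq.test(tab, correct = FALSE)

c(Z_squared = Z^2, X_squared = unname(chi$statistic), p_value = chi$p.value)
```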

HT - Simulation

## Response variable: categorical (2 levels, success: NOT LEGAL)
## Explanatory variable: categorical (2 levels) 
## n_FAVOR = 420, p_hat_FAVOR = 0.5024
## n_OPPOSE = 135, p_hat_OPPOSE = 0.4815
## H0: p_FAVOR =  p_OPPOSE
## HA: p_FAVOR != p_OPPOSE
## p_value = 0.7339

CI - Simulation

## Response variable: categorical (2 levels, success: NOT LEGAL)
## Explanatory variable: categorical (2 levels) 
## n_FAVOR = 420, p_hat_FAVOR = 0.5024
## n_OPPOSE = 135, p_hat_OPPOSE = 0.4815
## 95% CI (FAVOR - OPPOSE): (-0.0751 , 0.118)

Recap

We have now been introduced to both simulation-based and CLT-based methods for statistical inference.
For most simulation-based methods you wrote your own code; for CLT-based methods we introduced some built-in functions.
Take-away message: If certain conditions are met, CLT-based methods may be used for statistical inference. To do so, we need to know how the standard error is calculated for the sample statistic of interest.
What you should know:
- What does standard error mean?
- What does the p-value mean?
- How do we make decisions based on the p-value?
- How do we make decisions based on a CI?
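On the first question: the standard error is the standard deviation of the sampling distribution of a statistic. A short simulation ties the simulation-based and CLT-based views together by comparing the spread of many simulated sample proportions with the formula \(\sqrt{p(1-p)/n}\). The values of n and p are borrowed from the marijuana example; the seed and number of simulations are arbitrary.

```r
set.seed(1115)

n = 1259; p = 0.5

# Sample proportions from 10,000 simulated samples of size n
p_hats = rbinom(10000, size = n, prob = p) / n

c(simulated_sd = sd(p_hats), formula_se = sqrt(p * (1 - p) / n))
```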