CLT based inference

# CLT based inference
### Dr. Çetinkaya-Rundel
### November 14, 2017

---

# Inference methods based on CLT

---

## Inference methods based on CLT

If necessary conditions are met, we can also use inference methods based on the CLT:

- use the CLT to calculate the SE of the sample statistic of interest (sample mean, 
sample proportion, difference between sample means, etc.)

- calculate the **test statistic**, number of standard errors away from the null 
value the observed sample statistic is
    - T for means, along with appropriate degrees of freedom
    - Z for proportions

- use the test statistic to calculate the **p-value**, the probability of an observed 
or more extreme outcome given that the null hypothesis is true

---

## Z distribution

<br/>

```r
pnorm(-1.96)
```

```
## [1] 0.0249979
```

```r
pnorm(1.96, lower.tail = FALSE)
```

```
## [1] 0.0249979
```
]
.pull-right[
Finding cutoff values under the normal curve:

```r
qnorm(0.025)
```

```
## [1] -1.959964
```

```r
qnorm(0.975)
```

```
## [1] 1.959964
```
]
]

---

## T distribution

- Also unimodal and symmetric, and centered at 0

- Thicker tails than the normal distribution (to make up for additional variability
introduced by using `$s$` instead of `$\sigma$` in calculation of the SE)

- Parameter: **degrees of freedom**

- df for single mean: `$df = n - 1$`

- df for comparing two means:

`$$df \approx \frac{(s_1^2/n_1+s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1)+(s_2^2/n_2)^2/(n_2-1)} \approx min(n_1 - 1, n_2 - 1)$$`

---

## T vs Z distributions

![](20-deck_files/figure-html/unnamed-chunk-4-1.png)

---

## T distribution (cont.)

```r
pt(-1.96, df = 9)
```

```
## [1] 0.0408222
```

```r
pt(1.96, df = 9, lower.tail = FALSE)
```

```
## [1] 0.0408222
```

<br/>

Finding cutoff values under the normal curve:

```r
qt(0.025, df = 9)
```

```
## [1] -2.262157
```

```r
qt(0.975, df = 9)
```

```
## [1] 2.262157
```
]

---

# Examples

---

## General Social Survey

.small[
- Since 1972, the General Social Survey (GSS) has been monitoring societal change and 
studying the growing complexity of American society.

- The GSS aims to gather data on contemporary American society in order to 
    - monitor and explain trends and constants in attitudes, behaviors, attributes;
    - examine the structure and functioning of society in general as well as the role
    played by relevant subgroups;
    - compare the US to other societies to place American society  in comparative 
    perspective and develop cross-national models of human society;
    - make high-quality data easily accessible to scholars, students, policy makers, 
    and others, with minimal cost and waiting.

- GSS questions cover a diverse range of issues including national spending priorities, 
marijuana use, crime and punishment, race relations, quality of life, confidence in 
institutions, and sexual behavior.
]

---

## Data

2010 GSS:

```r
gss = read_csv("data/gss2010.csv")
```

<br/>

- Data dictionary at https://gssdataexplorer.norc.org/variables/vfilter

---

## Relaxing after work

<div class="question">
One of the questions on the survey is "After an average work day, about how many 
hours do you have to relax or pursue activities that you enjoy?". Do these data
provide convincing evidence that Americans, on average, spend more than 3 hours
per day relaxing? Note that the variable of interest in the dataset is `hrsrelax`.
</div>

```r
gss %>% 
  filter(!is.na(hrsrelax)) %>%
  summarise(x_bar = mean(hrsrelax), med = median(hrsrelax), 
            sd = sd(hrsrelax), n = n())
```

```
## # A tibble: 1 x 4
##      x_bar   med       sd     n
##      <dbl> <dbl>    <dbl> <int>
## 1 3.680243     3 2.629641  1154
```

---

## Exploratory Data Analysis

```r
ggplot(data = gss, aes(x = hrsrelax)) + 
  geom_histogram(binwidth = 1)
```

```
## Warning: Removed 890 rows containing non-finite values (stat_bin).
```

![](20-deck_files/figure-html/unnamed-chunk-9-1.png)

---

## Hypotheses

<div class="question">
What are the hypotheses for evaluation Americans, on average, spend more than 3 hours
per day relaxing?
</div>

`$$H_0: \mu = 3$$` 
`$$H_A: \mu > 3$$`

---

## Conditions

1. Independence: The GSS uses a reasonably random sample, and the sample
size of 1,154 is less than 10% of the US population, so we can assume that the
respondents in this sample are independent of each other.

2. Sample size / skew: The distribution of hours relaxed is right skewed, however
the sample size is large enough for the sampling distribution to be nearly
normal.

---

## Calculating the test statistic

`$$\bar{x} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)$$`

`$$\frac{\bar{x}-\mu_0}{s/\sqrt{n}} \sim t_{df=n-1}$$`

`$$T_{df} = \frac{obs - null}{SE} = \frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$$`
`$$df = n - 1$$`

```r
# summary stats
hrsrelax_summ <- gss %>% 
  filter(!is.na(hrsrelax)) %>%
  summarise(xbar = mean(hrsrelax), s = sd(hrsrelax), n = n())
```

---

```r
# calculations
(se <- hrsrelax_summ$s / sqrt(hrsrelax_summ$n))
```

```
## [1] 0.07740938
```

```r
(t <- (hrsrelax_summ$xbar - 3) / se)
```

```
## [1] 8.7876
```

```r
(df <- hrsrelax_summ$n - 1)
```

```
## [1] 1153
```

---

## p-value

p-value = P(observed or more extreme outcome | `$H_0$` true)

```r
pt(t, df, lower.tail = FALSE)
```

```
## [1] 2.720895e-18
```

---

## Conclusion

- Since the p-value is small, we reject `$H_0$`.

- The data provide convincing evidence that Americans, on average, spend more than
3 hours per day relaxing after work.

<div class="question">
Would you expect a 90% confidence interval for the average number of hours Americans 
spend relaxing after work to include 3 hours?
</div>

---

## Confidence interval for a mean

`$$point~estimate \pm critical~value \times SE$$`

```r
t_star <- qt(0.95, df)
pt_est <- hrsrelax_summ$xbar
round(pt_est + c(-1,1) * t_star * se, 2)
```

```
## [1] 3.55 3.81
```

<div class="question">
Interpret this interval in context of the data.
</div>

---

## In R

```r
# HT
t.test(gss$hrsrelax, mu = 3, alternative = "greater")
```

```
## 
## 	One Sample t-test
## 
## data:  gss$hrsrelax
## t = 8.7876, df = 1153, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 3
## 95 percent confidence interval:
##  3.552813      Inf
## sample estimates:
## mean of x 
##  3.680243
```

```r
# CI
t.test(gss$hrsrelax, conf.level = 0.90)$conf.int
```

```
## [1] 3.552813 3.807672
## attr(,"conf.level")
## [1] 0.9
```
]

---

# Case study

---

See the course website for **Case Study 04: Inferring from the GSS**