Sta112FS 15. Central Limit Theorem + CLT based inference, Pt. 1

November 3, 2015

Today's agenda

Central Limit Theorem
Aside: Evaluating normality graphically
Application exercise: demonstrating the CLT via simulation
Inference based on the Central Limit Theorem
Due Thursday: Read Sections 2.5 - 2.8 on OpenIntro: Intro Stat with Randomization and Simulation: http://www.openintro.org/isrs

Notation

Means:
- Population: mean = \(\mu\), standard deviation = \(\sigma\)
- Sample: mean = \(\bar{x}\), standard deviation = \(s\)
Proportions:
- Population: \(p\)
- Sample: \(\hat{p}\)
Standard error: \(SE\)

Central Limit Theorem

Variability of sample statistics

Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)
The variability of these sample statistics is measured by the standard error
Previously we quantified this value via simulation
Today we talk about the theory underlying sampling distributions

Sampling distribution

A sampling distribution is the distribution of a sample statistic across random samples of size \(n\) taken from a population
In practice it is impossible to construct sampling distributions, since doing so would require access to the entire population
Today, for demonstration purposes, we will assume we have access to the population data, construct sampling distributions, and examine their shapes, centers, and spreads
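A minimal sketch of this idea in base R. The population here is simulated (a right-skewed exponential distribution chosen purely for illustration), and the names `population`, `n`, and `n_samples` are assumptions, not part of the course materials.

```r
set.seed(112)  # arbitrary seed, only for reproducibility

# Hypothetical population: 10,000 right-skewed values (an assumption for illustration)
population <- rexp(10000, rate = 1 / 10)

n <- 50            # size of each random sample
n_samples <- 5000  # number of samples drawn from the population

# Each replicate draws one sample of size n (without replacement) and records its mean
sample_means <- replicate(n_samples, mean(sample(population, size = n)))

# Center and spread of the simulated sampling distribution
mean(sample_means)  # compare with mean(population)
sd(sample_means)    # compare with sd(population) / sqrt(n)

# Shape of the simulated sampling distribution
hist(sample_means, main = "Simulated sampling distribution of the mean")
```

Changing `n` and re-running shows how the spread and shape of the sampling distribution respond to sample size.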

Evaluating normality: Normal probability plots

Normal probability plot

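library(ggplot2)

# 100 draws from a normal distribution with mean 50 and sd 5
d <- data.frame(norm_samp = rnorm(100, mean = 50, sd = 5))

# Normal probability plot: data on the y-axis, theoretical quantiles on the x-axis
ggplot(data = d, aes(sample = norm_samp)) +
  geom_point(alpha = 0.7, stat = "qq")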

Anatomy of a normal probability plot

Data are plotted on the y-axis of a normal probability plot and theoretical quantiles (following a normal distribution) on the x-axis.
If there is a one-to-one relationship between the data and the theoretical quantiles, then the data follow a nearly normal distribution.
Since a one-to-one relationship would appear as a straight line on a scatter plot, the closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.

Constructing a normal probability plot

Data (y-coordinates)	Percentile	Theoretical Quantiles (x-coordinates)
37.5	0.5 / 100 = 0.005	`qnorm(0.005) = -2.58`
38.0	1.5 / 100 = 0.015	`qnorm(0.015) = -2.17`
38.3	2.5 / 100 = 0.025	`qnorm(0.025) = -1.96`
39.5	3.5 / 100 = 0.035	`qnorm(0.035) = -1.81`
…	…	…
61.9	99.5 / 100 = 0.995	`qnorm(0.995) = 2.58`
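The table above can be reproduced by hand, sketched here in base R. The sample is freshly simulated, so the data values will differ from those in the table; the percentiles and theoretical quantiles depend only on the sample size and will match.

```r
set.seed(112)  # arbitrary seed, only for reproducibility
norm_samp <- rnorm(100, mean = 50, sd = 5)

n <- length(norm_samp)
y <- sort(norm_samp)                   # data, smallest to largest (y-coordinates)
percentile <- (seq_len(n) - 0.5) / n   # 0.005, 0.015, ..., 0.995
x <- qnorm(percentile)                 # theoretical quantiles (x-coordinates)

# First few rows of the construction table
head(data.frame(data = y, percentile = percentile, theoretical = x), 4)

# The normal probability plot itself
plot(x, y, xlab = "Theoretical quantiles", ylab = "Data")
```

For samples with more than 10 observations, the built-in `ppoints(n)` returns the same percentiles.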


Fat tails

Best to think about what is happening with the most extreme values - here the biggest values are bigger than we would expect and the smallest values are smaller than we would expect (for a normal).

Skinny tails

Here the biggest values are smaller than we would expect and the smallest values are bigger than we would expect.

Right skew

Here the biggest values are bigger than we would expect and the smallest values are also bigger than we would expect.

Left skew

Here the biggest values are smaller than we would expect and the smallest values are also smaller than we would expect.
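The four patterns can be reproduced with simulated data. This is a sketch using base R's `qqnorm()` and `qqline()`; the distributions chosen for each shape (t with 3 df, uniform, exponential, negated exponential) are illustrative assumptions.

```r
set.seed(112)  # arbitrary seed, only for reproducibility
n <- 500

# One simulated sample per pattern (illustrative choices of distribution)
shapes <- list(
  "Fat tails"    = rt(n, df = 3),  # heavier tails than the normal
  "Skinny tails" = runif(n),       # bounded, so tails are lighter than the normal
  "Right skew"   = rexp(n),        # long right tail
  "Left skew"    = -rexp(n)        # long left tail
)

# Draw the four normal probability plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
for (name in names(shapes)) {
  qqnorm(shapes[[name]], main = name)
  qqline(shapes[[name]])  # reference line through the first and third quartiles
}
par(old_par)
```

In each panel, look at how the points at the two ends depart from the reference line.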

Back to sampling distributions

Application exercise

See course website for details

Central Limit Theorem

In practice…

We can't construct sampling distributions directly, because we don't have access to the entire population data
- this is the whole point of statistical inference: observe only one sample, try to make inference about the entire population
Hence we rely on the Central Limit Theorem to tell us what the sampling distribution would look like, if we could construct it

Central Limit Theorem

If certain conditions are met, the sampling distribution of the sample statistic will be nearly normally distributed, with mean equal to the population parameter and standard error inversely proportional to the square root of the sample size.

Single mean: \(\bar{x} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)\)
Difference between two means: \((\bar{x}_1 - \bar{x}_2) \sim N\left(mean = (\mu_1 - \mu_2), SE = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \right)\)
Single proportion: \(\hat{p} \sim N\left(mean = p, SE = \sqrt{\frac{p (1-p)}{n}} \right)\)
Difference between two proportions: \((\hat{p}_1 - \hat{p}_2) \sim N\left(mean = (p_1 - p_2), SE = \sqrt{\frac{p_1 (1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \right)\)
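The four standard errors can be written as small helper functions. The function names below are hypothetical, introduced only for illustration. In practice the population values are unknown and are replaced by sample estimates.

```r
# Hypothetical helpers, one per case listed above
se_mean <- function(sd, n) sd / sqrt(n)

se_diff_means <- function(sd1, n1, sd2, n2) sqrt(sd1^2 / n1 + sd2^2 / n2)

se_prop <- function(p, n) sqrt(p * (1 - p) / n)

se_diff_props <- function(p1, n1, p2, n2) {
  sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
}

# Quadrupling the sample size halves the standard error
se_prop(p = 0.5, n = 100)  # 0.05
se_prop(p = 0.5, n = 400)  # 0.025
```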

Conditions:

Independence: The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
- the sample must be random
- if sampling without replacement, sample size must be less than 10% of population size
Sample size / distribution:
- numerical data: The more skewed the sample (and hence the population) distribution, the larger samples we need. Usually n > 30 is considered a large enough sample for population distributions that are not extremely skewed.
- categorical data: At least 10 successes and 10 failures.
If comparing two populations, the samples must be independent of each other, and all conditions should be checked for both groups.
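The countable parts of these conditions can be checked mechanically. The sketch below covers a single proportion; `clt_conditions_prop` is a hypothetical helper, and the counts in the usage line are made up. Whether the sample is random cannot be checked in code.

```r
# Hypothetical helper: checks the success-failure and 10% guidelines
clt_conditions_prop <- function(successes, failures, population_size = Inf) {
  n <- successes + failures
  c(
    success_failure  = successes >= 10 && failures >= 10,
    under_10_percent = n < 0.10 * population_size
  )
}

# Made-up example: 45 successes and 55 failures from a population of 6,500
clt_conditions_prop(successes = 45, failures = 55, population_size = 6500)
```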

Confirm your findings…

Confirm that your findings from the application exercise match up with what the CLT outlines.

Inference methods based on CLT

If necessary conditions are met, we can also use inference methods based on the CLT:

use the CLT to calculate the SE of the sample statistic of interest (sample mean, sample proportion, difference between sample means, etc.)
calculate the test statistic: the number of standard errors the observed sample statistic is away from the null value
- T for means, along with appropriate degrees of freedom
- Z for proportions
use the test statistic to calculate the p-value: the probability of the observed or a more extreme outcome, given that the null hypothesis is true
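A sketch of the three steps for a single proportion. The data are hypothetical (60 successes in 100 trials, null value 0.5), and the SE is computed using the null value of \(p\), which is the usual convention for hypothesis tests but is an assumption here since the slides do not spell it out.

```r
# Hypothetical data, for illustration only
n <- 100
successes <- 60
p_null <- 0.5
p_hat <- successes / n

# Step 1: standard error of p_hat, assuming the null hypothesis is true
se <- sqrt(p_null * (1 - p_null) / n)

# Step 2: test statistic, the number of SEs p_hat is away from the null value
z_obs <- (p_hat - p_null) / se

# Step 3: two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)

z_obs
p_value
```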

Z and T distributions

Z distribution

Also called the standard normal distribution: \(Z \sim N(mean = 0, \sigma = 1)\)
Finding probabilities under the normal curve:

pnorm(-1.96)

## [1] 0.0249979

pnorm(1.96, lower.tail = FALSE)

## [1] 0.0249979

Finding cutoff values under the normal curve:

qnorm(0.025)

## [1] -1.959964

qnorm(0.975)

## [1] 1.959964

T distribution

Also unimodal and symmetric, and centered at 0
Thicker tails than the normal distribution (to make up for additional variability introduced by using \(s\) instead of \(\sigma\) in calculation of the SE)
Parameter: degrees of freedom
- df for single mean: \(df = n - 1\)
- df for comparing two means: \(df = min(n_1 - 1, n_2 - 1)\)
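To see the thicker tails, compare the area below -1.96 under t curves with different degrees of freedom against the same area under the normal curve. The df values are arbitrary choices for illustration.

```r
dfs <- c(2, 9, 30, 100)        # arbitrary degrees of freedom to compare

t_tail <- pt(-1.96, df = dfs)  # lower-tail area under each t curve
z_tail <- pnorm(-1.96)         # lower-tail area under the normal curve

data.frame(df = dfs, t_tail = t_tail, z_tail = z_tail)
```

The t tail areas are all larger than the normal tail area, and they shrink toward it as the degrees of freedom increase.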

T distribution (cont.)

Finding probabilities under the t curve:

pt(-1.96, df = 9)

## [1] 0.0408222

pt(1.96, df = 9, lower.tail = FALSE)

## [1] 0.0408222

Finding cutoff values under the t curve:

qt(0.025, df = 9)

## [1] -2.262157

qt(0.975, df = 9)

## [1] 2.262157

Example

In 2001 the average GPA of students at Duke University was 3.37. This semester we surveyed 63 students in a statistics course about their GPAs. The mean was 3.58, and the standard deviation 0.53. A histogram of the data is shown below. Assuming that this sample is random and representative of all Duke students (bit of a leap of faith?), do these data provide convincing evidence that the average GPA of Duke students has changed over the last decade?

\(H_0: \mu = 3.37; H_A: \mu \ne 3.37\)

\(\bar{x} \sim N\left(mean = \mu = 3.37, SE = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} = \frac{0.53}{\sqrt{63}} = 0.0668 \right)\)

\(T = \frac{3.58 - 3.37}{0.0668} \approx 3.14\), \(df = n - 1 = 63 - 1 = 62\)

mu <- 3.37
x_bar <- 3.58
s <- 0.53
n <- 63
t_obs <- (x_bar - mu) / (s / sqrt(n))
(1 - pt(t_obs, df = n - 1)) * 2

## [1] 0.002550524

Recap

We now have been introduced to both simulation based and CLT based methods for statistical inference.
For most simulation based methods you wrote your own code; for CLT based methods we introduced some built in functions.
Take away message: If certain conditions are met CLT based methods may be used for statistical inference. To do so, we would need to know how the standard error is calculated for the given sample statistic of interest.
What you should know:
- What does standard error mean?
- What does the p-value mean?
- How do we make decisions based on the p-value?

in R

numerical data - t.test
- testing for one mean: \(H_0: \mu_x = \mu_0\)
- comparing two means (groups 1 and 2): \(H_0: \mu_1 = \mu_2\)
categorical data - prop.test
- testing for one proportion: \(H_0: p_x = p_0\)
- comparing two proportions (groups 1 and 2): \(H_0: p_1 = p_2\)
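A sketch of how these functions might be called. All data below are simulated or made up for illustration; they are not the survey data from the GPA example.

```r
set.seed(112)  # arbitrary seed, only for reproducibility

# Simulated GPA-like data (hypothetical)
gpa <- rnorm(63, mean = 3.58, sd = 0.53)

# One mean: H0: mu = 3.37
one_mean <- t.test(gpa, mu = 3.37)
one_mean$statistic  # T
one_mean$parameter  # df = n - 1
one_mean$p.value

# Two means: H0: mu_1 = mu_2 (two simulated groups)
group1 <- rnorm(40, mean = 3.5, sd = 0.5)
group2 <- rnorm(45, mean = 3.3, sd = 0.5)
two_means <- t.test(group1, group2)

# One proportion: H0: p = 0.5, with 60 successes out of 100 (made-up counts)
one_prop <- prop.test(x = 60, n = 100, p = 0.5, correct = FALSE)

# Two proportions: H0: p_1 = p_2 (made-up counts)
two_props <- prop.test(x = c(60, 45), n = c(100, 100), correct = FALSE)
```

Two things to keep in mind when comparing with hand calculations: by default `t.test` with two samples uses the Welch approximation for the degrees of freedom rather than \(min(n_1 - 1, n_2 - 1)\), and `prop.test` reports a chi-squared statistic, which with `correct = FALSE` equals the square of the Z statistic.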