Housekeeping

Central Limit Theorem

Variability of sample statistics

  • Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

  • The variability of these sample statistics is measured by the standard error

  • Previously we quantified the standard error via simulation

  • Today we talk about the theory underlying sampling distributions

Aside: Normal probability plots

Normal probability plot

library(ggplot2)

temp = rnorm(100, mean = 50, sd = 5)
# normal probability plot (geom_qq replaces the deprecated qplot(..., stat = "qq"))
g = ggplot(mapping = aes(sample = temp)) + geom_qq()
g + geom_abline(intercept = mean(temp), slope = sd(temp), linetype = "dashed")


Alternative code for normal probability plot

qqnorm(temp)
qqline(temp)


Anatomy of a normal probability plot

  • Data are plotted on the y-axis of a normal probability plot and theoretical quantiles (following a normal distribution) on the x-axis.

  • If there is a one-to-one relationship between the data and the theoretical quantiles, then the data follow a nearly normal distribution.

  • Since a one-to-one relationship would appear as a straight line on a scatter plot, the closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.

Constructing a normal probability plot

Data (y-coordinates)   Percentile             Theoretical quantiles (x-coordinates)
37.5                   0.5 / 100 = 0.005      qnorm(0.005) = -2.58
38.0                   1.5 / 100 = 0.015      qnorm(0.015) = -2.17
38.3                   2.5 / 100 = 0.025      qnorm(0.025) = -1.96
39.5                   3.5 / 100 = 0.035      qnorm(0.035) = -1.81
...                    ...                    ...
61.9                   99.5 / 100 = 0.995     qnorm(0.995) = 2.58
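
The same construction in R, using the temp vector from above (ppoints(100) generates exactly these (i - 0.5) / 100 percentiles):

t = sort(temp)    # ordered data: y-coordinates
p = ppoints(100)  # percentiles: 0.005, 0.015, ..., 0.995
q = qnorm(p)      # theoretical quantiles: x-coordinates
cbind(data = t, percentile = p, quantile = q)[c(1:4, 100), ]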

Constructing a normal probability plot

qqnorm(temp)
qqline(temp)
t = sort(temp)
# theoretical quantiles (x) for the four smallest and the largest observations
abline(v = c(-2.58, -2.17, -1.96, -1.81, 2.58), lty = 2, col = 1:5)
# the corresponding ordered data values (y)
abline(h = c(t[1:4], t[100]), lty = 2, col = 1:5)


Fat tails

It is best to think about what is happening at the most extreme values. Here the biggest values are bigger than we would expect and the smallest values are smaller than we would expect (for a normal distribution).

[Figure: normal probability plot of a fat-tailed distribution]

Skinny tails

Here the biggest values are smaller than we would expect and the smallest values are bigger than we would expect.

[Figure: normal probability plot of a skinny-tailed distribution]

Right skew

Here the biggest values are bigger than we would expect and the smallest values are also bigger than we would expect.

[Figure: normal probability plot of a right-skewed distribution]

Left skew

Here the biggest values are smaller than we would expect and the smallest values are also smaller than we would expect.

[Figure: normal probability plot of a left-skewed distribution]
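
Sketches that reproduce all four shapes; the particular distributions below (heavy-tailed t, uniform, exponential) are standard illustrative choices, not the ones used for the plots above:

# illustrative generators for the four shapes above
shapes = list("Fat tails"    = rt(100, df = 3),   # heavy-tailed t
              "Skinny tails" = runif(100),        # bounded, thin tails
              "Right skew"   = rexp(100),         # long right tail
              "Left skew"    = -rexp(100))        # long left tail
par(mfrow = c(2, 2))
for (nm in names(shapes)) {
  qqnorm(shapes[[nm]], main = nm)
  qqline(shapes[[nm]])
}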

Back to sampling distributions

Application exercise 11:

  1. Create the following distributions of size 100.
    • We R Prepared: normal
    • It’s All Ogre Now: wonkiest distribution you can imagine
    • #Shreklμvers: somewhat right skewed
    • Fantastic four (minus 1): extremely left skewed
    • Statisfaction: bimodal
  2. Treat the distribution you created as your population. Create sampling distributions of the sample mean using samples of size n = 10, 50, and 100. Make histograms and normal probability plots of these distributions (see the sketch after this list).
  3. Describe the shapes of these distributions, and calculate their centers and spreads. Compare these to the shapes, centers, and spreads of the parent population distributions from (1).
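
A minimal sketch of steps (2) and (3) for a single population, in R. The rexp() population and the 1000 repetitions are illustrative choices, and sampling is done with replacement so that samples of size n = 100 from a population of 100 still vary:

pop = rexp(100)   # illustrative population: somewhat right skewed

for (n in c(10, 50, 100)) {
  # sampling distribution: 1000 sample means, each from a sample of size n
  xbar = replicate(1000, mean(sample(pop, size = n, replace = TRUE)))
  hist(xbar, main = paste("n =", n))
  qqnorm(xbar, main = paste("n =", n)); qqline(xbar)
  cat("n =", n, " center:", mean(xbar), " spread:", sd(xbar), "\n")
}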

Central Limit Theorem

Central Limit Theorem - for means

\[\bar{x} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)\]

Conditions:

  • Independence: The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
    • the sample must be random
    • if sampling without replacement, sample size must be less than 10% of population size
  • Sample size: If the population distribution is normal, the sample size doesn’t matter. But if the population distribution is not normal, we need larger samples for the sampling distribution to be normal (the more skewed the population, the higher the sample size needed). Usually n > 30 is considered a large enough sample for population distributions that are not extremely skewed.

Confirm your findings…

Confirm that your findings from the application exercise match up with what the CLT outlines.
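
For example, under the CLT the spread of each simulated sampling distribution should be close to \(\sigma/\sqrt{n}\). A sketch, reusing the illustrative pop vector from the application exercise:

n = 50
xbar = replicate(1000, mean(sample(pop, size = n, replace = TRUE)))
sd(xbar)            # simulated standard error
sd(pop) / sqrt(n)   # standard error predicted by the CLT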

Central Limit Theorem - for proportions

\[\hat{p} \sim N\left(mean = p, SE = \sqrt{\frac{p(1-p)}{n}}\right)\]

Conditions:

  • Independence: The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:
    • the sample must be random
    • if sampling without replacement, sample size must be less than 10% of population size
  • Sample size: There should be at least 10 expected successes and 10 expected failures:
    • np \(\ge\) 10
    • n(1-p) \(\ge\) 10
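
A quick check of these conditions and the CLT-based standard error in R; the values of p and n below are made up for illustration:

p = 0.1; n = 150                               # illustrative values
c(successes = n * p, failures = n * (1 - p))   # both should be >= 10
sqrt(p * (1 - p) / n)                          # SE of p-hat under the CLT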

Inference methods based on CLT

Inference methods based on CLT

If necessary conditions are met, we can also use inference methods based on the CLT:

  • use the CLT to calculate the SE of the sample statistic of interest (sample mean, sample proportion, difference between sample means, etc.)

  • calculate the test statistic: the number of standard errors the observed sample statistic falls from the null value (different data types call for different test statistics, e.g. T, Z, \(\chi^2\), F, etc.)

  • use the test statistic to calculate the p-value, the probability of observing an outcome at least as extreme as the observed one, given that the null hypothesis is true

Example

In 2001 the average GPA of students at Duke University was 3.37. This semester we surveyed 63 students in a statistics course about their GPAs. The mean was 3.58, and the standard deviation 0.53. A histogram of the data is shown below. Assuming that this sample is random and representative of all Duke students (bit of a leap of faith?), do these data provide convincing evidence that the average GPA of Duke students has changed over the last decade?

\(H_0: \mu = 3.37; H_A: \mu \ne 3.37\)

\(\bar{x} \sim N\left(mean = \mu = 3.37, SE = \frac{\sigma}{\sqrt{n}} = \frac{0.53}{\sqrt{63}} = 0.0668 \right)\)

\(T = \frac{3.58 - 3.37}{0.0668} \approx 3.14\), \(df = n - 1 = 63 - 1 = 62\)

(1 - pt(3.14, df = 62)) * 2
## [1] 0.002588
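
With the raw data in hand, a single t.test call carries out all of these steps. The gpa vector below is a hypothetical stand-in for the 63 surveyed GPAs, not the actual survey data:

set.seed(1)
gpa = rnorm(63, mean = 3.58, sd = 0.53)   # hypothetical stand-in data
t.test(gpa, mu = 3.37, alternative = "two.sided")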

in R - numerical and categorical data

  • numerical data - t.test
    • testing for one mean: \(H_0: \mu_x = 0\)
    • comparing two independent means (groups x and y): \(H_0: \mu_x = \mu_y\)
    • comparing two dependent means (groups x and y): \(H_0: \mu_x = \mu_y\)
  • categorical data - prop.test
    • testing for one proportion: \(H_0: p_x = 0.3\)
    • comparing two independent proportions (groups x and y): \(H_0: p_x = p_y\)
    • comparing many proportions (categorical data with many levels): \(H_0:\) x and y are independent (can also use chisq.test)
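
Minimal sketches of these calls; all data and counts below are simulated or made up for illustration:

x = rnorm(30); y = rnorm(30, mean = 0.5)   # illustrative numerical data

t.test(x, mu = 0)             # one mean: H0 mu_x = 0
t.test(x, y)                  # two independent means
t.test(x, y, paired = TRUE)   # two dependent (paired) means

prop.test(x = 40, n = 100, p = 0.3)         # one proportion: H0 p_x = 0.3
prop.test(x = c(40, 55), n = c(100, 100))   # two independent proportions

# many proportions / independence of two categorical variables
chisq.test(table(sample(c("a", "b", "c"), 100, replace = TRUE),
                 sample(c("x", "y"), 100, replace = TRUE)))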

Recap

Recap

  • We have now been introduced to both simulation-based and CLT-based methods for statistical inference.

  • For most simulation-based methods you wrote your own code; for CLT-based methods we introduced some built-in functions.

  • Take-away message: If certain conditions are met, CLT-based methods may be used for statistical inference. To do so, we need to know how the standard error is calculated for the given sample statistic of interest.

  • What you should know:
    • What does standard error mean?
    • What does the p-value mean?
    • How do we make decisions based on the p-value?
  • What you don’t need to know: how to calculate standard errors and p-values by hand

HW

  1. Read Sections 2.5 - 2.8 on OpenIntro: Intro Stat with Randomization and Simulation:

http://www.openintro.org/stat/textbook.php?stat_book=isrs

  2. HW3: Complete exercises 2.4, 2.10, 2.12, 3.16, 3.18, 4.13, 4.20
    • For 4.13, the dataset is called gifted, in the openintro package
    • For 4.20, the dataset is called ncbirths, in the openintro package

(Note that these are the end-of-chapter exercises.)