Due 10/21: https://stat.duke.edu/courses/Fall14/sta112.01/project/mt_project.html
Each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)
The variability of these sample statistics is measured by the standard error
Previously we quantified this value via simulation (see the short simulation sketch below)
Today we talk about the theory underlying sampling distributions
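For reference, a minimal sketch of that simulation approach, with a made-up population (mean 50, standard deviation 5) and sample size chosen only for illustration: draw many samples, compute the mean of each, and take the standard deviation of those sample means as the estimated standard error.

# estimate the SE of the sample mean by simulation
# (population parameters and sample size are made up for illustration)
set.seed(1)
sample_means = replicate(1000, mean(rnorm(50, mean = 50, sd = 5)))
sd(sample_means)  # simulation-based estimate of the standard error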
library(ggplot2)
# simulate 100 observations from a normal distribution
temp = rnorm(100, mean = 50, sd = 5)
# normal probability plot with ggplot2
g = qplot(sample = temp, stat = "qq")
g + geom_abline(intercept = mean(temp), slope = sd(temp), linetype = "dashed")
# equivalent normal probability plot with base R
qqnorm(temp)
qqline(temp)
Data are plotted on the y-axis of a normal probability plot and theoretical quantiles (following a normal distribution) on the x-axis.
If there is a one-to-one relationship between the data and the theoretical quantiles, then the data follow a nearly normal distribution.
Since a one-to-one relationship would appear as a straight line on a scatter plot, the closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.
Data (y-coordinates) | Percentile | Theoretical Quantiles (x-coordinates) |
---|---|---|
37.5 | 0.5 / 100 = 0.005 | qnorm(0.005) = -2.58 |
38.0 | 1.5 / 100 = 0.015 | qnorm(0.015) = -2.17 |
38.3 | 2.5 / 100 = 0.025 | qnorm(0.025) = -1.96 |
39.5 | 3.5 / 100 = 0.035 | qnorm(0.035) = -1.81 |
… | … | … |
61.9 | 99.5 / 100 = 0.995 | qnorm(0.995) = 2.58 |
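The table can be reproduced directly in R. The sketch below uses the simulated temp vector from above; because temp is random, the data values will differ from the ones shown in the table, but the percentiles and theoretical quantiles are the same.

# pair each sorted observation with the normal quantile at percentile (i - 0.5) / n
n = length(temp)
d = data.frame(data = sort(temp), percentile = (1:n - 0.5) / n)
d$theoretical = qnorm(d$percentile)
head(d)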
qqnorm(temp)
qqline(temp)
t = sort(temp)
# vertical dashed lines at the theoretical quantiles from the table above
abline(v = c(-2.58, -2.17, -1.96, -1.81, 2.58), lty = 2, col = 1:5)
# horizontal dashed lines at the matching data values (four smallest and the largest)
abline(h = c(t[1:4], t[100]), lty = 2, col = 1:5)
Best to think about what is happening with the most extreme values: here the biggest values are bigger than we would expect and the smallest values are smaller than we would expect (for a normal), so the distribution has heavier tails than the normal.
Here the biggest values are smaller than we would expect and the smallest values are bigger than we would expect, so the distribution has shorter (thinner) tails than the normal.
Here the biggest values are bigger than we would expect and the smallest values are also bigger than we would expect, so the distribution is right skewed.
Here the biggest values are smaller than we would expect and the smallest values are also smaller than we would expect, so the distribution is left skewed.
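To see these four patterns concretely, here is an illustrative sketch (the distributions below are my own choices, not part of the original notes): simulate heavy-tailed, short-tailed, right-skewed, and left-skewed data and make a normal probability plot of each.

# four departures from normality on normal probability plots (illustrative)
par(mfrow = c(2, 2))
y = rt(100, df = 2);  qqnorm(y, main = "Heavy tails");  qqline(y)
y = runif(100);       qqnorm(y, main = "Short tails");  qqline(y)
y = rexp(100);        qqnorm(y, main = "Right skewed"); qqline(y)
y = -rexp(100);       qqnorm(y, main = "Left skewed");  qqline(y)
par(mfrow = c(1, 1))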
Application exercise 11:
\[\bar{x} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)\]
Conditions:

1. Independence: sampled observations must be independent (random sample/assignment; if sampling without replacement, n < 10% of the population)
2. Sample size / skew: n ≥ 30, or the population distribution is nearly normal (a larger sample is needed if the population is very skewed)
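To connect this formula with the earlier simulation approach, here is a small sketch (population parameters and sample size are again made up): the standard deviation of many simulated sample means should be close to \(\sigma / \sqrt{n}\).

# compare the simulation-based SE of the mean to the CLT formula
mu = 50; sigma = 5; n = 30
sample_means = replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
sd(sample_means)   # simulation-based SE
sigma / sqrt(n)    # CLT-based SE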
\[\hat{p} \sim N\left(mean = p, SE = \sqrt{\frac{p(1-p)}{n}}\right)\]
Conditions:

1. Independence: sampled observations must be independent (random sample/assignment; if sampling without replacement, n < 10% of the population)
2. Success-failure: at least 10 expected successes and 10 expected failures, i.e. np ≥ 10 and n(1 − p) ≥ 10
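A similar sketch for the sample proportion (the true proportion and sample size below are made up): the standard deviation of many simulated sample proportions should be close to the formula above.

# compare the simulation-based SE of the proportion to the CLT formula
p = 0.3; n = 100
sample_props = replicate(10000, mean(rbinom(n, size = 1, prob = p)))
sd(sample_props)        # simulation-based SE
sqrt(p * (1 - p) / n)   # CLT-based SE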
If necessary conditions are met, we can also use inference methods based on the CLT:
use the CLT to calculate the SE of the sample statistic of interest (sample mean, sample proportion, difference between sample means, etc.)
calculate the test statistic, the number of standard errors the observed sample statistic is away from the null value (different test statistics for different data types, e.g. T, Z, \(\chi^2\), F, etc.)
use the test statistic to calculate the p-value, the probability of obtaining the observed or a more extreme outcome given that the null hypothesis is true
\(H_0: \mu = 3.37; H_A: \mu \ne 3.37\)
\(\bar{x} \sim N\left(mean = \mu = 3.37, SE = \frac{\sigma}{\sqrt{n}} = \frac{0.53}{\sqrt{63}} = 0.0668 \right)\)
\(T = \frac{3.58 - 3.37}{0.0668} \approx 3.14\), \(df = n - 1 = 63 - 1 = 62\)
(1 - pt(3.14, df = 62)) * 2
## [1] 0.002588
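The same calculation can be scripted from the summary statistics; the short sketch below reproduces the SE, test statistic, and p-value above.

# reproduce the test from the summary statistics
xbar = 3.58; mu0 = 3.37; s = 0.53; n = 63
se = s / sqrt(n)
t_stat = (xbar - mu0) / se
2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)  # two-sided p-value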
Built-in R functions for CLT-based inference include:

- `t.test` (comparing means of two samples `x` and `y`): \(H_0: \mu_x = \mu_y\)
- `prop.test` (comparing proportions of two groups): \(H_0: p_x = p_y\)
- `chisq.test`

We now have been introduced to both simulation based and CLT based methods for statistical inference.
For most simulation based methods you wrote your own code; for CLT based methods we introduced some built-in functions.
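As a quick illustration of these functions (the data below are simulated and made up for the example), each call reports a test statistic and a p-value.

# illustrative calls to the built-in inference functions (made-up data)
x = rnorm(50, mean = 10, sd = 2)
y = rnorm(50, mean = 11, sd = 2)
t.test(x, y)                                     # H0: mu_x = mu_y
prop.test(x = c(40, 30), n = c(100, 100))        # H0: p_1 = p_2 (40/100 vs 30/100 successes)
chisq.test(matrix(c(20, 30, 25, 25), nrow = 2))  # independence in a 2x2 table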
Take away message: If certain conditions are met, CLT based methods may be used for statistical inference. To do so, we would need to know how the standard error is calculated for the given sample statistic of interest.
What you don’t need to know: how to calculate standard errors and p-values by hand
http://www.openintro.org/stat/textbook.php?stat_book=isrs
- `gifted`, in the `openintro` package
- `ncbirths`, in the `openintro` package

(Note that these are the end of chapter exercises.)