cat_pops = read.csv("https://stat.duke.edu/~cr173/Sta112_Fa16/data/cat_pops.csv")
Assume each column of this dataset represents a categorical population, and each population has only two levels: success and failure.
Create sampling distributions of sample proportions using samples of the following sizes:
Generate each sampling distribution using at least 15,000 samples. Note that sampling distributions are created by taking random samples, with replacement, from the original population.
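The resampling procedure described above could be sketched as follows. This is a minimal sketch, not a reference solution: `pop` is a made-up stand-in for one column of cat_pops, the helper name `sim_props` is hypothetical, and n = 50 is only an illustrative sample size.

```r
set.seed(112)

# Hypothetical population with two levels (NOT the real cat_pops data)
pop = sample(c("success", "failure"), size = 10000, replace = TRUE,
             prob = c(0.3, 0.7))

# Draw n_sims random samples of size n, with replacement, from the population
# and record the proportion of successes in each sample
sim_props = function(pop, n, n_sims = 15000) {
  replicate(n_sims, mean(sample(pop, size = n, replace = TRUE) == "success"))
}

props_50 = sim_props(pop, n = 50)   # n = 50 is an assumed, illustrative size

# Center and spread of the simulated sampling distribution
c(center = mean(props_50), spread = sd(props_50))
```

Calling `hist(props_50)` afterwards would show the shape of the distribution; repeating the call with other values of `n` gives one distribution per sample size.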
num_pops = read.csv("https://stat.duke.edu/~cr173/Sta112_Fa16/data/num_pops.csv")
Reuse the data from Part 2.
Repeat steps 2. - 4. from Part 1 for these numerical data and use the sample median as the sample statistic.
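For the numerical case, the same idea applies with the median as the statistic. The sketch below assumes a made-up right-skewed population in place of a num_pops column; `sim_medians` is a hypothetical helper name and n = 30 is an illustrative sample size.

```r
set.seed(112)

# Hypothetical right-skewed numerical population (NOT the real num_pops data)
pop = rexp(10000, rate = 1 / 10)

# Draw n_sims samples of size n, with replacement, and record each sample median
sim_medians = function(pop, n, n_sims = 15000) {
  replicate(n_sims, median(sample(pop, size = n, replace = TRUE)))
}

meds_30 = sim_medians(pop, n = 30)   # n = 30 is an assumed, illustrative size

# Compare the center of the sampling distribution with the population median
c(center = mean(meds_30), spread = sd(meds_30), pop_median = median(pop))
```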
Load the data from the 2010 General Social Survey:
gss = read.csv("https://stat.duke.edu/~mc301/data/gss2010.csv", stringsAsFactors = FALSE)
Remember that the data dictionary can be found at https://gssdataexplorer.norc.org/variables/vfilter.
For each of the questions below, make sure to check whether conditions for inference have been met and show all your work.
You will answer each question below using both a theoretical (CLT-based) approach and a simulation-based approach. Compare your answers to see whether they agree or disagree with each other.
Working extra: Do these data provide convincing evidence that Americans work extra hours beyond their usual schedule more than 5 days per month on average? Evaluate at the 5% significance level. Also construct a confidence interval, at the equivalent level, estimating the average number of days per month Americans work extra hours beyond their usual schedule. Interpret both results in context of the data and the research question, and comment on whether your findings from the hypothesis test and confidence interval agree. The variable of interest is moredays.
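One possible shape for the two approaches is sketched below on simulated values, not on the GSS data. The vector `moredays` here is hypothetical. Pairing a one-sided 5% test with a 90% two-sided interval is one common convention and is treated as an assumption, as is the choice to simulate the null distribution by recentering the sample at 5.

```r
set.seed(112)

# Hypothetical stand-in for gss$moredays (NOT the real data)
moredays = rpois(500, lambda = 5.5)
x = moredays[!is.na(moredays)]      # drop missing values before inference
n = length(x)

# CLT-based: one-sided t-test of H0: mu = 5 vs HA: mu > 5
tt = t.test(x, mu = 5, alternative = "greater")

# CLT-based interval; 90% is assumed to be the "equivalent level"
ci_t = t.test(x, conf.level = 0.90)$conf.int

# Simulation-based test: shift the sample so its mean equals 5 (the null value),
# then bootstrap means from the shifted sample
x_null = x - mean(x) + 5
boot_null = replicate(15000, mean(sample(x_null, size = n, replace = TRUE)))
p_sim = mean(boot_null >= mean(x))

# Simulation-based interval: bootstrap percentile method
boot_means = replicate(15000, mean(sample(x, size = n, replace = TRUE)))
ci_boot = quantile(boot_means, c(0.05, 0.95))

list(p_t = tt$p.value, p_sim = p_sim, ci_t = ci_t, ci_boot = ci_boot)
```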
Working extra and education: Does the average number of days Americans work extra hours beyond their usual schedule vary between those with and without a college degree? The variable of interest here is degree, which has 5 levels: graduate, bachelor, junior college, high school, and less than high school.
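Because degree has five levels, it has to be collapsed into two groups before comparing means. The sketch below uses simulated data; the level spellings, the data frame `gss_toy`, the helper `diff_means`, and the decision to count only graduate and bachelor as a college degree are all assumptions to adapt to the actual data and your own definition.

```r
set.seed(112)

# Hypothetical data (NOT the real GSS); level spellings are assumed
lv = c("graduate", "bachelor", "junior college", "high school",
       "less than high school")
gss_toy = data.frame(
  degree   = sample(lv, 600, replace = TRUE),
  moredays = rpois(600, lambda = 5)
)

# Collapse five levels into two groups (an assumed definition of "college degree")
gss_toy$college = ifelse(gss_toy$degree %in% c("graduate", "bachelor"),
                         "degree", "no degree")

# CLT-based: Welch two-sample t-test and its confidence interval
tt = t.test(moredays ~ college, data = gss_toy)

# Simulation-based: permutation test that shuffles the group labels
diff_means = function(y, g) mean(y[g == "degree"]) - mean(y[g == "no degree"])
obs  = diff_means(gss_toy$moredays, gss_toy$college)
perm = replicate(5000, diff_means(gss_toy$moredays, sample(gss_toy$college)))
p_perm = mean(abs(perm) >= abs(obs))   # two-sided p-value

c(p_t = tt$p.value, p_perm = p_perm)
```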
Life after death: Estimate the proportion of Americans who believe in a life after death using a 95% confidence interval. The variable of interest is called postlife in the dataset. Calculate the interval two ways: using a CLT-based approach as well as using bootstrapping. Interpret the interval in context of the data, and comment on whether or not the two intervals you calculate match. (They may not be exactly the same, but they should be close. You can re-use old bootstrapping code, but this is an opportunity to review and revise your code if needed.)
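The two intervals could be computed along these lines. This is a sketch on simulated responses: the vector `postlife` and its "YES"/"NO" coding are assumptions, so check the actual coding in the data dictionary first.

```r
set.seed(112)

# Hypothetical stand-in for gss$postlife (NOT the real data; coding is assumed)
postlife = sample(c("YES", "NO", NA), 1200, replace = TRUE,
                  prob = c(0.7, 0.2, 0.1))
x = postlife[!is.na(postlife)]     # drop non-responses
n = length(x)
p_hat = mean(x == "YES")

# Success-failure condition for the CLT-based interval
conditions_ok = n * p_hat >= 10 && n * (1 - p_hat) >= 10

# CLT-based 95% interval: p_hat +/- z* x SE
se = sqrt(p_hat * (1 - p_hat) / n)
ci_clt = p_hat + c(-1, 1) * qnorm(0.975) * se

# Bootstrap 95% interval: percentile method on resampled proportions
boot_props = replicate(15000, mean(sample(x, size = n, replace = TRUE) == "YES"))
ci_boot = quantile(boot_props, c(0.025, 0.975))

list(conditions_ok = conditions_ok, ci_clt = ci_clt, ci_boot = ci_boot)
```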
Pick your own: Pick two categorical variables from the dataset, and identify one as the explanatory variable and the other as the response variable. Make sure that these variables each have only two levels, or combine their levels into two levels. Then, compare population proportions across the two groups of your explanatory variable using both a confidence interval and a hypothesis test. As always, interpret all results in context of the data and the research question, and comment on whether your findings from the hypothesis test and confidence interval agree.
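A comparison of two proportions might be set up as in the sketch below. The data frame `toy`, its groups "A" and "B", and the helper `diff_props` are hypothetical; substitute your own two-level explanatory and response variables. Note that `prop.test()` bases its test on a pooled proportion while its interval uses the unpooled standard error, so the two can occasionally disagree near the significance cutoff.

```r
set.seed(112)

# Hypothetical two-level explanatory (group) and response (0/1) variables
toy = data.frame(
  group    = rep(c("A", "B"), times = c(400, 450)),
  response = c(rbinom(400, 1, 0.55), rbinom(450, 1, 0.45))
)

# CLT-based: test and 95% interval for p_A - p_B
succ = tapply(toy$response, toy$group, sum)   # successes per group
size = table(toy$group)                       # group sizes
pt = prop.test(x = as.vector(succ), n = as.vector(size), correct = FALSE)

# Simulation-based: permutation test that shuffles the group labels
diff_props = function(y, g) mean(y[g == "A"]) - mean(y[g == "B"])
obs  = diff_props(toy$response, toy$group)
perm = replicate(5000, diff_props(toy$response, sample(toy$group)))
p_perm = mean(abs(perm) >= abs(obs))          # two-sided p-value

list(p_clt = pt$p.value, ci = pt$conf.int, p_perm = p_perm)
```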