Tasks

Part 1: Sampling distribution of sample proportions

  1. Load the following dataset:
cat_pops = read.csv("https://stat.duke.edu/~cr173/Sta112_Fa16/data/cat_pops.csv")

Assume that each column of this dataset represents a categorical population and that each population has only two levels: success and failure.

  2. Create sampling distributions of sample proportions using samples of the following sizes (a minimal R sketch of this process appears after this list):

    • \(n = 10\)
    • \(n = 50\)
    • \(n = 200\)

Generate each sampling distribution using at least 15,000 samples. Note that sampling distributions are created by taking random samples, with replacement, from the original population.

  3. For each sample size, create four graphs:
    • A visualization of the population distribution
    • A visualization of a single sample
    • A visualization of the shape of the sampling distribution
    • A normal probability plot of the sampling distribution
  4. Describe the shapes of your sampling distributions, and calculate their centers (means) and spreads (standard deviations). Compare these to the shapes, centers, and spreads of the parent population distributions from step 1.

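Below is a minimal R sketch of steps 2 and 3, assuming a column in cat_pops named pop1 whose levels are labeled "success" and "failure"; adjust the column name and level labels to match the actual data.

pop = cat_pops$pop1                               # one categorical population (assumed column name)
n_sim = 15000                                     # number of samples per sampling distribution

barplot(table(pop), main = "Population")          # population distribution

for (n in c(10, 50, 200)) {
  # sampling distribution of the sample proportion of successes
  p_hats = replicate(n_sim,
                     mean(sample(pop, size = n, replace = TRUE) == "success"))

  one_sample = sample(pop, size = n, replace = TRUE)
  barplot(table(one_sample), main = paste("One sample, n =", n))
  hist(p_hats, main = paste("Sampling distribution, n =", n))
  qqnorm(p_hats); qqline(p_hats)                  # normal probability plot

  cat("n =", n, "  mean =", mean(p_hats), "  sd =", sd(p_hats), "\n")
}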

Part 2: Sampling distribution of sample means

  1. Load the following dataset:
num_pops = read.csv("https://stat.duke.edu/~cr173/Sta112_Fa16/data/num_pops.csv")
  2. Repeat steps 2-4 from Part 1 for these numerical data, using the sample mean as the sample statistic. (A minimal sketch follows below.)

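A corresponding sketch for sample means, again assuming a column named pop1 (here in num_pops); only the statistic and the plots of the raw data change from the Part 1 sketch.

pop = num_pops$pop1                               # one numerical population (assumed column name)
n_sim = 15000

hist(pop, main = "Population")                    # population distribution

for (n in c(10, 50, 200)) {
  x_bars = replicate(n_sim, mean(sample(pop, size = n, replace = TRUE)))

  hist(sample(pop, size = n, replace = TRUE), main = paste("One sample, n =", n))
  hist(x_bars, main = paste("Sampling distribution of the mean, n =", n))
  qqnorm(x_bars); qqline(x_bars)                  # normal probability plot

  cat("n =", n, "  mean =", mean(x_bars), "  sd =", sd(x_bars), "\n")
}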

Part 3: Sampling distribution of sample median

  1. Reuse the data from Part 2.

  2. Repeat steps 2-4 from Part 1 for these numerical data, using the sample median as the sample statistic. (See the short note below.)

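Only the statistic recorded for each sample changes from the Part 2 sketch; inside the same loop you would compute, for example:

# record the median of each sample instead of the mean
sample_medians = replicate(n_sim, median(sample(pop, size = n, replace = TRUE)))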

Part 4: Inference

Load the data from the 2010 General Social Survey:

gss = read.csv("https://stat.duke.edu/~mc301/data/gss2010.csv", stringsAsFactors = FALSE)

Remember that the data dictionary can be found at https://gssdataexplorer.norc.org/variables/vfilter.

For each of the questions below, make sure to check whether conditions for inference have been met and show all your work.

You will answer each question below using both a theoretical approach (using the CLT) and a simulation-based approach. Compare your answers to see whether they agree. (A minimal R sketch of both approaches for Question 1 appears after the question list.)

  1. Working extra: Do these data provide convincing evidence that Americans work extra hours beyond their usual schedule more than 5 days per month on average? Evaluate at the 5% significance level. Also construct a confidence interval, at the equivalent level, estimating the average number of days per month Americans work extra hours beyond their usual schedule. Interpret both results in context of the data and the research question, and comment on whether your findings from the hypothesis test and confidence interval agree. The variable of interest is moredays.

  2. Working extra and education: Does the average number of days Americans work extra hours beyond their usual schedule vary between those with and without a college degree? The variable of interest here is degree, which has five levels: graduate, bachelor, junior college, high school, and less than high school.

    • First, combine levels to make this a binary variable with levels college and no college. College contains junior college, bachelor, and graduate, and no college contains the remaining levels.
    • Then, run the relevant hypothesis test (at the 5% significance level), and also estimate the difference between the two population means at the equivalent confidence level. Interpret all results in context of the data and the research question, and comment on whether your findings from the hypothesis test and confidence interval agree.
  3. Life after death: Estimate the proportion of Americans who believe in a life after death using a 95% confidence interval. The variable of interest is called postlife in the dataset. Calculate the interval two ways: using a CLT-based approach as well as using bootstrapping. Interpret the interval in context of the data, and comment on whether or not the two intervals you calculate match. (They may not be exactly the same, but they should be close. You can re-use old bootstrapping code, but this is an opportunity to review and revise your code if needed.)

  4. Pick your own: Pick two categorical variables from the dataset, identify one as the explanatory and the other as the response variable. Make sure that these variables each only have two levels, or combine their levels into two levels. Then, compare population proportions across the two groups of your explanatory variable using both a confidence interval and a hypothesis test. As always, interpret all results in context of the data and the research question, and comment on whether your findings from the hypothesis test and confidence interval agree.
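
As an illustration of the two approaches for Question 1 only, here is a minimal, non-definitive sketch using the moredays variable; the same pattern (test plus interval, theoretical and simulation-based) can be adapted to the other questions. The confidence levels below are placeholders; use whatever level you argue is equivalent to your test.

# theoretical (CLT / t-distribution) approach: H0: mu = 5 vs HA: mu > 5
extra = na.omit(gss$moredays)
t.test(extra, mu = 5, alternative = "greater")    # one-sample hypothesis test
t.test(extra, conf.level = 0.95)$conf.int         # confidence interval for the mean

# simulation-based approach: bootstrap distribution of the sample mean
set.seed(1)
boot_means = replicate(15000,
                       mean(sample(extra, size = length(extra), replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))             # percentile bootstrap interval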