num_pops <- read_csv("data/num_pops.csv")
Each column represents a different population. Determine how many columns are in this dataset. Create a histogram of each of the populations, and describe their distributions. Specifically, make sure to note the shape, center, and spread of each distribution.
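As a rough illustration of this step, here is a minimal base R sketch. It does not use the lab data: `toy_pops` is a made-up data frame standing in for `num_pops`, and its column names are hypothetical.

```r
# Hypothetical stand-in for num_pops: two made-up populations
set.seed(1)
toy_pops <- data.frame(
  right_skewed = rexp(1000, rate = 1),
  symmetric    = rnorm(1000, mean = 10, sd = 2)
)

# Number of columns, i.e. number of populations
n_populations <- ncol(toy_pops)
n_populations

# One histogram per population to judge the shape
for (pop in names(toy_pops)) {
  hist(toy_pops[[pop]], main = paste("Population", pop), xlab = pop)
}

# Center (mean) and spread (standard deviation) of each population
pop_summary <- data.frame(
  population = names(toy_pops),
  mean       = sapply(toy_pops, mean),
  sd         = sapply(toy_pops, sd),
  row.names  = NULL
)
pop_summary
```

The same pattern should carry over to the real data by replacing `toy_pops` with `num_pops`.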
Create sampling distributions of sample means using samples of the following sizes for each of the populations:
Generate each sampling distribution using at least 15,000 samples. Note that sampling distributions are created by taking random samples, with replacement, from the original population.
To help you out, here is the code for creating a sampling distribution for samples of size 5 with 20 samples for the population in column M of the data. Also make sure to read the help for the rep_sample_n function. Note that this function is in the oilabs package.
sampling_mean_n5_rep20_M <- num_pops %>%
  select(M) %>%                                          # select column M
  rep_sample_n(size = 5, reps = 20, replace = TRUE) %>%  # 20 random samples w/ replacement, each of size 5
  group_by(replicate) %>%                                # group by each replicate
  summarise(xbar = mean(M))                              # calculate the mean of each sample
Make sure to use a clear naming convention for your sampling distributions. You can follow the naming convention I used above for sampling_mean_n5_rep20_M, or choose a different, but reasonable, style.
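To make the resampling idea concrete, the following is a minimal base R sketch of the same procedure, scaled up to 15,000 samples. It deliberately avoids `rep_sample_n` and uses a made-up population (`toy_population` is hypothetical, not column M), so treat it as an illustration of the logic rather than the required approach.

```r
set.seed(2)
toy_population <- rexp(1000, rate = 1)  # hypothetical population, not column M

n_size <- 5      # size of each sample
n_reps <- 15000  # number of samples

# Each replicate: draw n_size values with replacement, keep the sample mean
sampling_mean_n5_rep15000_toy <- replicate(
  n_reps,
  mean(sample(toy_population, size = n_size, replace = TRUE))
)

length(sampling_mean_n5_rep15000_toy)
hist(sampling_mean_n5_rep15000_toy, main = "Sample means, n = 5", xlab = "xbar")
```

The object name follows the same pattern as above: statistic, sample size, number of replicates, population.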
Describe the shapes of your sampling distributions; calculate the centers (mean) and the spreads (standard deviation). Compare these to the shapes, centers, and spreads of the parent population distributions from (1).
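One way to organize this comparison is sketched below in base R, again with a hypothetical population (`toy_population`) rather than the lab data. It places the center and spread of the sampling distribution next to those of the population, along with the standard error sd / sqrt(n) that the sampling distribution's spread is expected to approximate.

```r
set.seed(3)
toy_population <- runif(2000, min = 0, max = 10)  # hypothetical population

n_size <- 30
xbar <- replicate(
  15000,
  mean(sample(toy_population, size = n_size, replace = TRUE))
)

comparison <- c(
  population_mean = mean(toy_population),
  sampling_mean   = mean(xbar),
  population_sd   = sd(toy_population),
  sampling_sd     = sd(xbar),
  expected_se     = sd(toy_population) / sqrt(n_size)
)
round(comparison, 3)
```

In this sketch the two means should nearly agree, while the sampling distribution's standard deviation should be much smaller than the population's and close to `expected_se`.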
Note the Central Limit Theorem for a single mean. For each one of your samples, evaluate whether the Central Limit Theorem holds. If it does not, explain why.
cat_pops <- read_csv("data/cat_pops.csv")
Each column represents a different population. Determine how many columns are in this dataset. Create a bar plot of each of the populations, and describe their distributions. Specifically, make sure to note the proportion of successes.
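A minimal base R sketch of this step follows. It assumes the populations are coded with the labels "success" and "failure" (the sample code in this lab only confirms the "success" label), and `toy_cat` is a made-up data frame, not `cat_pops`.

```r
set.seed(4)
# Hypothetical categorical population with roughly 30% successes
toy_cat <- data.frame(
  A = sample(c("success", "failure"), size = 1000,
             replace = TRUE, prob = c(0.3, 0.7))
)

# Counts per category, shown as a bar plot
counts <- table(toy_cat$A)
barplot(counts, main = "Population A (toy data)", ylab = "Count")

# Proportion of successes in the population
p_success <- mean(toy_cat$A == "success")
p_success
```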
Create sampling distributions of sample proportions using samples of the following sizes for each of the populations:
Generate each sampling distribution using at least 15,000 samples. Note that sampling distributions are created by taking random samples, with replacement, from the original population.
To help you out, here is the code for creating a sampling distribution for samples of size 5 with 20 samples for the population in column A of the data.
sampling_prop_n5_rep20_A <- cat_pops %>%
  select(A) %>%                                          # select column A
  rep_sample_n(size = 5, reps = 20, replace = TRUE) %>%  # 20 random samples w/ replacement, each of size 5
  group_by(replicate) %>%                                # group by each replicate
  summarise(phat = sum(A == "success") / n())            # calculate the proportion of successes in each sample
Make sure to use a clear naming convention for your sampling distributions. You can follow the naming convention I used above for sampling_prop_n5_rep20_A, or choose a different, but reasonable, style.
Describe the shapes of your sampling distributions; calculate the centers (mean) and the spreads (standard deviation). Compare the centers of your sampling distributions to the true population proportions from Question (1).
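The comparison for proportions can be sketched in base R as follows. The population here is made up (`toy_cat_population` is hypothetical, not a column of `cat_pops`), and the sample size of 50 is chosen only for illustration.

```r
set.seed(5)
# Hypothetical categorical population with roughly 30% successes
toy_cat_population <- sample(c("success", "failure"), size = 1000,
                             replace = TRUE, prob = c(0.3, 0.7))
p_true <- mean(toy_cat_population == "success")

n_size <- 50
# Each replicate: sample with replacement, keep the proportion of successes
sampling_prop_n50_rep15000_toy <- replicate(
  15000,
  mean(sample(toy_cat_population, size = n_size, replace = TRUE) == "success")
)

prop_comparison <- c(
  true_proportion = p_true,
  sampling_center = mean(sampling_prop_n50_rep15000_toy),
  sampling_spread = sd(sampling_prop_n50_rep15000_toy),
  expected_se     = sqrt(p_true * (1 - p_true) / n_size)
)
round(prop_comparison, 3)
```

The center of the sampling distribution should sit very close to the true proportion, and its spread should be close to sqrt(p(1 - p) / n).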
Note the Central Limit Theorem for a single proportion. For each one of your samples, evaluate whether the Central Limit Theorem holds. If it does not, explain why.
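For the sample-size part of this check, a small helper like the one below may be useful. It assumes the common textbook success-failure condition (at least 10 expected successes and 10 expected failures); the function name `clt_ok_for_proportion` and the threshold are assumptions, so defer to the statement of the theorem referenced above if it differs. Independence of observations still has to be argued separately.

```r
# Success-failure condition: n * p and n * (1 - p) both at least `threshold`
# (threshold = 10 is an assumed textbook convention)
clt_ok_for_proportion <- function(p, n, threshold = 10) {
  (n * p >= threshold) && (n * (1 - p) >= threshold)
}

clt_ok_for_proportion(p = 0.3, n = 5)    # FALSE: only 1.5 expected successes
clt_ok_for_proportion(p = 0.3, n = 100)  # TRUE: 30 expected successes, 70 failures
```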
Describe precisely how you would set up the simulation for the following hypothesis tests. Imagine using index cards or chips to represent the data. Also specify whether the null hypothesis would be independence or point and whether the simulation type would be bootstrap, simulate, or permute. In each of the scenarios you can assume sample size is 100 and number of simulations is 15,000.
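As an example of the level of detail expected, here is a sketch for a made-up scenario that is not one of the assigned ones: testing whether a population proportion equals 0.5 when 60 of 100 sampled observations are successes. This is a point null, and the simulation type would be simulate. With index cards: put 50 "success" and 50 "failure" cards in a bag, draw 100 cards with replacement, record the number of successes, and repeat 15,000 times. All numbers below are hypothetical.

```r
set.seed(7)
p_null         <- 0.5    # hypothesized proportion (point null)
n              <- 100    # sample size
n_sims         <- 15000  # number of simulations
observed_count <- 60     # hypothetical observed number of successes

# Each simulation: count successes in n draws when the null is true
sim_counts <- rbinom(n_sims, size = n, prob = p_null)

# Two-sided p-value: share of simulations at least as far from the
# expected count (n * p_null) as the observed count
p_value <- mean(abs(sim_counts - n * p_null) >= abs(observed_count - n * p_null))
p_value
```

A test of independence would instead use permute: write each observation's response on a card, shuffle the cards, and deal them back out to the two groups.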
You’re working in the same repo as your teammates now, so merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.
Review the grading guidelines below and ask questions if any of the expectations are unclear.
Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).
Set aside time to work together and apart (physically).
When you’re done, review the .md document on GitHub to make sure you’re happy with the final state of your work. Then go get some rest!
Total | 100 pts |
---|---|
Part 1: Sampling distributions of sample means | 30 pts |
Part 2: Sampling distributions of sample proportions | 30 pts |
Part 3: Simulations for hypothesis testing | 20 pts |
Code quality | 10 pts |
Document organization (team name, code chunk names, commits, overall organization, etc.) | 10 pts |