October 15, 2015
New teams! (I'll need team names ASAP)
Introduction to inference via simulations
People providing an organ for donation sometimes seek the help of a special "medical consultant". These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant's clients.
One consultant tried to attract patients by noting that the average complication rate for liver donor surgeries in the US is about 10%, but her clients have only had 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).
A parameter for a hypothesis test is the "true" value of interest. We typically estimate the parameter using a sample statistic as a point estimate.
\(p\): true rate of complication
\(\hat{p}\): rate of complication in the sample = \(\frac{3}{62}\) = 0.048
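A quick way to compute this point estimate in R (a minimal sketch; the variable names are just for illustration):

# observed data: 3 complications out of 62 surgeries
n_complications <- 3
n_surgeries <- 62
(p_hat <- n_complications / n_surgeries)

## [1] 0.0483871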
Is it possible to assess the consultant’s claim using the data?
No. The claim is that there is a causal connection, but the data are observational. For example, maybe patients who can afford a medical consultant can afford better medical care, which can also lead to a lower complication rate.
While it is not possible to assess the causal claim, it is still possible to test for an association using these data. For this question we ask, could the low complication rate of \(\hat{p}\) = 0.048 be due to chance?
Null hypothesis, \(H_0\): Complication rate for this consultant is no different than the US average of 10% (\(p = 0.10\))
Alternative hypothesis, \(H_A\): Complication rate for this consultant is lower than the US average of 10% (\(p < 0.10\))
Null hypothesis, \(H_0\): Defendant is innocent
Alternative hypothesis, \(H_A\): Defendant is guilty
Present the evidence: Collect data
Start with a null hypothesis (\(H_0\)) that represents the status quo
Set an alternative hypothesis (\(H_A\)) that represents the research question, i.e. what we’re testing for
Which of the following is the correct set of hypotheses?
Describe how you would simulate the null distribution for this study using a bag of chips. How many chips? What colors? What do the colors indicate? How many draws? With replacement or without replacement? (One possible setup is sketched after the next question.)
When sampling from the null distribution, what is the expected proportion of success (complications)?
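A minimal sketch of the chip simulation in R, assuming one particular setup: a bag with 1 red chip (complication) for every 9 blue chips (no complication), drawn 62 times with replacement. Under the null distribution, the expected proportion of successes (complications) is 0.10.

# bag with 10% red chips (complication), 90% blue (no complication)
bag <- rep(c("red", "blue"), times = c(1, 9))
# draw 62 chips with replacement, one per surgery
draws <- sample(bag, size = 62, replace = TRUE)
# simulated complication rate under the null hypothesis
mean(draws == "red")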
set.seed(9)
library(ggplot2)
# create sample space
outcomes <- c("complication", "no complication")
# draw the first sample of size 62 from the null distribution
sim1 <- sample(outcomes, size = 62, prob = c(0.1, 0.9), replace = TRUE)
# view the sample
table(sim1)
## sim1
##    complication no complication
##              11              51
# calculate the simulated sample proportion of complications (red chips)
(p_hat_sim1 <- sum(sim1 == "complication") / length(sim1))
## [1] 0.1774194
# create an empty data frame
sim_dist <- data.frame(p_hat_sim = rep(NA, 100))
# record the simulated p-hat as the first observation
sim_dist$p_hat_sim[1] <- p_hat_sim1
# plot
ggplot(sim_dist, aes(x = p_hat_sim)) +
  geom_dotplot() +
  xlim(0, 0.26) +
  ylim(0, 10)
sim2 <- sample(outcomes, size = 62, prob = c(0.1, 0.9), replace = TRUE)
(p_hat_sim2 <- sum(sim2 == "complication") / length(sim2))
## [1] 0.08064516
sim_dist$p_hat_sim[2] <- p_hat_sim2
ggplot(sim_dist, aes(x = p_hat_sim)) +
  geom_dotplot() +
  xlim(0, 0.26) +
  ylim(0, 10)
sim3 <- sample(outcomes, size = 62, prob = c(0.1, 0.9), replace = TRUE)
(p_hat_sim3 <- sum(sim3 == "complication") / length(sim3))
## [1] 0.2096774
sim_dist$p_hat_sim[3] <- p_hat_sim3
ggplot(sim_dist, aes(x = p_hat_sim)) +
  geom_dotplot() +
  xlim(0, 0.26) +
  ylim(0, 10)
We need a way to automate this process!
for loops
Simplest and most common type of loop in R: iterate through the elements of a vector and evaluate the code block for each.
for (x in 1:10) {
  cat(x^2, " ", sep = "")  # cat: concatenate and print
}
## 1 4 9 16 25 36 49 64 81 100
for (y in list(1:3, LETTERS[1:7], c(TRUE, FALSE))) {
  cat(length(y), " ", sep = "")
}
## 3 7 2
It is almost always better to create an object to store your results first, rather than growing the object as you go.
# Good
res <- rep(NA, 10)
for (x in 1:10) {
  res[x] <- x^2
}
res
## [1] 1 4 9 16 25 36 49 64 81 100
# Bad
res <- c()
for (x in 1:10) {
  res <- c(res, x^2)
}
res
## [1] 1 4 9 16 25 36 49 64 81 100
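To see why preallocation matters, here is a rough timing comparison at a larger size (a sketch only; exact timings vary by machine):

n <- 50000
# preallocate, then fill in place
system.time({
  res <- rep(NA, n)
  for (x in 1:n) res[x] <- x^2
})
# grow the vector one element at a time (copies res on every iteration)
system.time({
  res <- c()
  for (x in 1:n) res <- c(res, x^2)
})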
Earlier we simulated each iteration one-by-one and showed how we would fill in each element of sim_dist$p_hat_sim. Using for loops, we can automate this process.
for loops
sim_dist <- data.frame(p_hat_sim = rep(NA, 100))
for (i in 1:100) {
  sim <- sample(outcomes, size = 62, prob = c(0.1, 0.9), replace = TRUE)
  p_hat_sim <- sum(sim == "complication") / length(sim)
  sim_dist$p_hat_sim[i] <- p_hat_sim
}
ggplot(sim_dist, aes(x = p_hat_sim)) +
  geom_dotplot()
Remember: the p-value is the probability of an outcome at least as extreme as the observed one, given that the null hypothesis is true.
What is the p-value, i.e., in what percent of the simulations was the simulated \(\hat{p}\) at least as extreme as the observed \(\hat{p}\) of 0.048?
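One way to approximate this p-value from the simulated null distribution built above (a sketch, assuming sim_dist from the for loop is still in your workspace):

# observed sample proportion
p_hat <- 3 / 62
# proportion of simulated p-hats at least as extreme (as small or smaller)
mean(sim_dist$p_hat_sim <= p_hat)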
We often use 5% as the cutoff for whether the p-value is low enough that the data are unlikely to have come from the null model. This cutoff value is called the significance level (\(\alpha\)).
If p-value < \(\alpha\), reject \(H_0\) in favor of \(H_A\): The data provide convincing evidence for the alternative hypothesis.
If p-value > \(\alpha\), fail to reject \(H_0\): The data do not provide convincing evidence for the alternative hypothesis.
What is the conclusion of the hypothesis test?
Since the p-value is greater than the significance level, we fail to reject the null hypothesis. These data do not provide convincing evidence that this consultant's complication rate is lower than the overall US complication rate of 10%.
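As a cross-check on the simulation (a different technique than the one covered in class), base R's exact binomial test gives a one-sided p-value of about 0.12, consistent with the conclusion above:

# exact one-sided binomial test of H0: p = 0.1 vs HA: p < 0.1
binom.test(x = 3, n = 62, p = 0.1, alternative = "less")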
100 simulations is not sufficient
We usually simulate around 15,000 times to get an accurate distribution
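A sketch of the scaled-up simulation, using the same null model as the for loop above (replicate is one alternative to the explicit loop):

n_sim <- 15000
p_hat_sims <- replicate(n_sim, {
  sim <- sample(outcomes, size = 62, prob = c(0.1, 0.9), replace = TRUE)
  sum(sim == "complication") / length(sim)
})
# approximate p-value based on 15,000 simulations
mean(p_hat_sims <= 3 / 62)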
in your new teams!
See course website for details.