From last time

Application exercise 8:

  1. Write a function with the inputs data (a vector of the observed data), null (the null value), alt (the alternative hypothesis; options: less than, greater than, not equal to), success (the level of the categorical variable to be considered a success), and seed (the seed to be set). As output, the function should report the null hypothesis, the alternative hypothesis, a summary of the dataset, the observed sample statistic, the p-value, and the simulated null distribution.

  2. Confirm that your function is working properly using the earlier dataset/hypothesis test on liver donor surgery complications.

  3. One of the earliest examples of behavioral asymmetry is a preference in humans for turning the head to the right, rather than to the left, during the final weeks of gestation and for the first 6 months after birth. This is thought to influence subsequent development of perceptual and motor preferences. A study of 124 couples found that 64.5% turned their heads to the right when kissing. Do these data provide convincing evidence that a majority of couples turn their heads to the right when kissing? Use the function you created earlier to conduct the appropriate hypothesis test. You can download the dataset at https://stat.duke.edu/courses/Fall14/sta112.01/data/kissing.html.

Hint: If you need a reminder on how to load a csv file into R, see HW2: https://stat.duke.edu/courses/Fall14/sta112.01/hw/hw2.html.

Pseudo code

  1. Set hypotheses, using inputs null and alt

  2. Calculate sample statistic, using data and the level of the data defined as success

  3. Create the simulation distribution, using sample size based on data, null value as the assumed true probability of success, and nsim

  4. Calculate the p-value, as the proportion of simulations where the simulated probability of success is at least as extreme as the observed sample statistic calculated in item 2. Note that the definition of “extreme” will depend on the alt

  5. Report all desired output: hypotheses, summary statistics, p-value, plot of simulation distribution

Skeleton

one_cat_test <- function(data, null, alt, success, nsim, seed){
  # code corresponding to steps outlined in pseudo code
}

Set defaults for input arguments

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, 
                         nsim = 10000, seed = NULL){
  # code corresponding to steps outlined in pseudo code
}

Step 1 - Set hypotheses

Set hypotheses, using inputs null and alt

… and also think about possible ways the user might make an error

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
  # errors associated with the null hypothesis (checked before null is used)
  if(is.null(null)) { stop("Missing null value.") }
  if(!is.numeric(null)) { stop("Null value should be a numeric value between 0 and 1.") }
  if(null < 0 | null > 1) { stop("Null value should be between 0 and 1.") }
  
  # errors associated with the alternative hypothesis (checked before alt is used)
  if(!(alt %in% c("greater", "less", "not equal"))) {
    stop("Alternative hypothesis not specified properly. Set alt = 'less', 'greater', or 'not equal'.")
  }
  
  # set null hypothesis
  print(paste("H0: p =", null))
  
  # set alternative hypothesis
  if(alt == "not equal") { alt_sign = "!=" }
  if(alt == "greater") { alt_sign = ">" }
  if(alt == "less") { alt_sign = "<" }
  print(paste("HA: p", alt_sign, null))
}
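
As a quick check of the error handling (a hypothetical call on a made-up toy vector; the expected message comes from the stop() calls above):

one_cat_test(data = c("yes", "no", "yes"), null = 1.5, alt = "greater", success = "yes")
## Error: Null value should be between 0 and 1.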

Step 1 - Set hypotheses (another approach)

… and also think about possible ways the user might make an error

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
  # set null hypothesis
  print(paste("H0: The true population proportion is", null))
  
  # set alternative hypothesis
  if(alt %in% c("greater", "less")){
    print(paste("HA: The true population proportion is", alt, "than", null))
  }
  if(alt == "not equal"){
    print(paste("HA: The true population proportion is", alt, "to", null))
  }  
  
  # and you can add similar errors as before...
}

Step 2 - Calculate sample statistic

Calculate sample statistic, using data and the level of the data defined as success

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
  # omitting earlier code so that everything fits on one slide
  
  # errors associated with data format (checked before data is used)
  if(is.null(data)) { stop("Missing data.") }
  if(is.data.frame(data)) { stop("Data should be a vector, not a data frame.") }
  
  # sample size
  n = length(data)
  
  # observed number of successes
  n_suc = sum(data == success)
  
  # observed proportion of successes
  p_hat = n_suc / n
}
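
To see what this step computes, here is the same calculation on a made-up toy vector, outside the function:

toy_data = c("right", "left", "right", "right")
n = length(toy_data)               # 4
n_suc = sum(toy_data == "right")   # 3
p_hat = n_suc / n                  # 0.75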

Step 3 - Create the simulation distribution

Create the simulation distribution, using sample size based on data, null value as the assumed true probability of success, and nsim

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
  # omitting earlier code so that everything fits on one slide
    
  # outcomes to sample from
  outcomes = levels(as.factor(data))
  # error if data has more than 2 levels
  if(length(outcomes) > 2) { stop("Input data has more than two levels.") }
  # error if success is not one of the levels of the data
  if(!(success %in% outcomes)) { stop("Success should be one of the levels of the data.") }
  
  # probability with which to sample, ordered to match the levels in outcomes
  if(which(outcomes == success) == 1) { p = c(null, 1-null) }
  if(which(outcomes == success) == 2) { p = c(1-null, null) }
  
  # set the seed, if one is given
  if(!is.null(seed)) { set.seed(seed) }
  
  # loop for simulation: each iteration draws a sample of size n under H0
  # and records the simulated proportion of successes
  sim = data.frame(phat = rep(NA, nsim))
  for(i in 1:nsim){
    sim_sample = sample(outcomes, size = n, prob = p, replace = TRUE)
    sim$phat[i] = sum(sim_sample == success) / n
  }
}
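
As an aside, the for loop could be replaced by a vectorized alternative, since under the null hypothesis the number of successes in each simulated sample is Binomial(n, null). This is just a sketch of another way to do it, not a required change:

# draw nsim binomial counts under H0 and convert them to simulated proportions
sim = data.frame(phat = rbinom(nsim, size = n, prob = null) / n)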

Step 4 - Calculate the p-value

Calculate the p-value, as the proportion of simulations where the simulated probability of success is at least as extreme as the observed sample statistic calculated in item 2. Note that the definition of “extreme” will depend on the alt

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
  # omitting earlier code so that everything fits on one slide
  
  # calculate number of simulated p-hats at least as extreme as observed p-hat
  if(alt  == "greater"){ nsim_extreme = sum(sim$phat >= p_hat) }
  if(alt  == "less"){ nsim_extreme = sum(sim$phat <= p_hat) }  
  if(alt  == "not equal"){
    if(p_hat > null) { nsim_extreme = 2 * sum(sim$phat >= p_hat) }
    if(p_hat < null) { nsim_extreme = 2 * sum(sim$phat <= p_hat) }
  }
  
  # calculate p-value
  p_value = nsim_extreme / nsim
}
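
Note that the doubling approach above leaves nsim_extreme undefined when p_hat is exactly equal to null, and can give a count larger than nsim (so a p-value above 1). One defensive variant for the two-sided case (a sketch, not necessarily what the sourced function does) is to double the smaller tail and cap the p-value at 1:

if(alt == "not equal"){
  nsim_extreme = 2 * min(sum(sim$phat >= p_hat), sum(sim$phat <= p_hat))
}
p_value = min(nsim_extreme / nsim, 1)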

Step 5 - Report output

Report all desired output: hypotheses, summary statistics, p-value, plot of simulation distribution

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
  # omitting earlier code so that everything fits on one slide
  
  # hypotheses were printed earlier
  
  # summary statistics
  cat(paste("Summary stats: n =", n, ", number of successes =", n_suc, ", p-hat =", round(p_hat, 4), "\n"))
  if(round(p_value, 4) == 0) { cat(paste("p-value < 0.0001\n")) }
  if(round(p_value, 4) > 0) { cat(paste("p-value = ", round(p_value, 4), "\n")) }
  
  # plot (requires the ggplot2 package to be loaded)
  if(nsim <= 100){
    simdist_plot = ggplot(sim, aes(x = phat)) + geom_dotplot() 
    suppressWarnings( suppressMessages( print( simdist_plot ) ) ) 
  }
  if(nsim > 100){
    simdist_plot = qplot(phat, data = sim, geom = "histogram")
    suppressWarnings( suppressMessages( print( simdist_plot ) ) ) 
  }
}

Return

one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
  # omitting earlier code so that everything fits on one slide
  
  return(list(p_value = p_value))
}
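
The exercise also asks for the observed sample statistic and the simulated null distribution as output, so you might prefer a richer return list (a sketch; the version sourced below returns only the p-value):

return(list(statistic = p_hat, p_value = p_value, sim_dist = sim$phat))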

Complete function

First load the devtools package, so that we can source from an https address:

library(devtools)

Then source the function:

source_url("https://stat.duke.edu/courses/Fall14/sta112.01/code/one_cat_test.R")

Load data

download.file("https://stat.duke.edu/courses/Fall14/sta112.01/data/kissing.csv", 
    destfile = "kissing.csv", method = "curl")
kissing = read.csv("kissing.csv")
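
Before running the test it is worth confirming the column and level names, since the call below assumes a column called side with a level called right:

str(kissing)
table(kissing$side)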

Run my function, compare to yours

one_cat_test(data = kissing$side, null = 0.5, alt = "greater", success = "right", 
             nsim = 100, seed = 1234)
## H0: p = 0.5 
## HA: p > 0.5 
## Summary stats: n = 124 , number of successes = 80 , p-hat = 0.6452 
## p-value < 0.0001

[plot of the simulated null distribution]

## $p_value
## [1] 0

Return

mytest = one_cat_test(data = kissing$side, null = 0.5, alt = "greater", success = "right", 
             nsim = 100, seed = 1234)
## H0: p = 0.5 
## HA: p > 0.5 
## Summary stats: n = 124 , number of successes = 80 , p-hat = 0.6452 
## p-value < 0.0001

[plot of the simulated null distribution]

mytest$p_value
## [1] 0

Confidence intervals via bootstrapping

Confidence intervals

A plausible range of values for the population parameter is called a confidence interval.

  • If we report a point estimate, we probably won’t hit the exact population parameter.

  • If we report a range of plausible values we have a good shot at capturing the parameter.

Bootstrapping


  • One method of constructing a confidence interval, via simulation, is bootstrapping.

  • This term comes from the phrase “pulling oneself up by one’s bootstraps”, which is a metaphor for accomplishing an impossible task without any outside help.

  • In this case the impossible task is estimating a population parameter, and we’ll accomplish it using data from only the given sample.

Bootstrapping scheme

  1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample

  2. Calculate the bootstrap statistic - a statistic such as the mean, median, or proportion, computed on the bootstrap sample

  3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics

Bootstrapping in action

Let’s create the bootstrap distribution for the kissing dataset.

set.seed(2345)
nsim = 100
n = length(kissing$side)   # 124 couples
boot = data.frame(phat = rep(NA, nsim))
for(i in 1:nsim){
  # resample the original sample, with replacement, at its original size
  boot_sample = sample(kissing$side, size = n, replace = TRUE)
  boot$phat[i] = sum(boot_sample == "right") / n
}

Plotting the bootstrap distribution

What does each dot in the following plot represent?

ggplot(boot, aes(x = phat)) + geom_dotplot()
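
To turn the bootstrap distribution of p-hats into a confidence interval, one common approach is the percentile method: take the middle 95% of the bootstrap statistics (a sketch):

quantile(boot$phat, probs = c(0.025, 0.975))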