September 22, 2014

## From last time

Application exercise 8:

1. Write a function with the inputs `data` (a vector of the observed data), `null` (null value), `alt` (the alternative hypothesis, options: less than, greater than, not equal to), `success` (level of the categorical variable to be considered a success), `seed` (the seed to be set). As an output, the function should report the null hypothesis, the alternative hypothesis, a summary of the dataset, the observed sample statistic, the p-value, and the simulated null distribution.

2. Confirm that your function is working properly using the earlier dataset/hypothesis test on livery donor surgery complications.

3. One of the earliest examples of behavioral asymmetry is a preference in humans for turning the head to the right, rather than to the left, during the final weeks of gestation and for the first 6 months after birth.This is thought to influence subsequent development of perceptual and motor preferences. A study of 124 couples found that 64.5% turned their heads to the right when kissing. Do these data provide convincing evidence that majority of couples turn their heads to the right when kissing? Use the function you created earlier to conduct the appropriate hypothesis test. You can download the dataset at https://stat.duke.edu/courses/Fall14/sta112.01/data/kissing.html.

Hint: If you need a reminder on how to load a csv file into R, see HW2: https://stat.duke.edu/courses/Fall14/sta112.01/hw/hw2.html.

## Pseudo code

1. Set hypotheses, using inputs `null` and `alt`

2. Calculate sample statistic, using `data` and the level of the data defined as `success`

3. Create the simulation distribution, using sample size based on `data`, `null` value as the assumed true probability of `success`, and `nsim`

4. Calculate the p-value, as the proportion of simulations where the simulated probability of success is at least as extreme as the observed sample statistic calculated in item 2. Note that the definition of "extreme" will depend on the `alt`

5. Report all desired output: hypotheses, summary statistics, p-value, plot of simulation distribution

## Skeleton

```one_cat_test <- function(data, null, alt, success, nsim, seed = NULL){
# code corresponding to steps outlined in pseudo code
}```

## Set defaults for input arguments

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL,
nsim = 10000, seed = NULL){
# code corresponding to steps outlined in pseudo code
}```

## Step 1 - Set hypotheses

Set hypotheses, using inputs `null` and `alt`

… and also think about possible ways the user might make an error

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
# set null hypothesis
print(paste("H0: p =", null))

# errors associated with null hypothesis
if(null < 0 | null > 1) { stop("Null value should be between 0 and 1.") }
if(is.numeric(null) == FALSE) { stop("Null value should be a numeric value between 0 and 1.") }
if(is.null(null)) { stop("Missing null value.") }

# set alternative hypothesis
if(alt == "not equal") { alt_sign = "!=" }
if(alt == "greater") { alt_sign = ">" }
if(alt == "less") { alt_sign = "<" }
print(paste("HA: p", alt_sign, null))

# errors associated with alternative hypothesis
if(!(alt %in% c("greater","less","not equal"))) {
stop("Alternative hypothesis not specified properly. Set alt = 'less', 'greater', or 'not equal'.")
}
}```

## Step 1 - Set hypotheses (another approach)

… and also think about possible ways the user might make an error

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
# set null hypothesis
print(paste("H0: The true population proportion is", null))

# set alternative hypothesis
if(alt %in% c("greater", "less")){
print(paste("HA: The true population proportion is", alt, "than", null))
}
if(alt == "not equal"){
print(paste("HA: The true population proportion is", alt, "to", null))
}

# and you can add similar errors as before...
}```

## Step 2 - Calculate sample statistic

Calculate sample statistic, using `data` and the level of the data defined as `success`

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
# omitting earlier code so that everything fits on one slide

# sample size
n = length(data)

# observed number of successes
n_suc = sum(data == success)

# observed proportion of successes
p_hat = n_suc / n

# errors associated with data format
if(is.null(data)) { stop("Missing data.") }
if(is.data.frame(data)) { stop("Data should be a vector, not a data frame.") }
}```

## Step 3 - Create the simulation distribution

Create the simulation distribution, using sample size based on `data`, `null` value as the assumed true probability of `success`, and `nsim`

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
# omitting earlier code so that everything fits on one slide

# outcomes to sample from
outcomes = levels(as.factor(data))
# error if data has more than 2 levels
if(length(outcomes) > 2) { stop("Input data has more than two levels.") }

# probability with which to sample
if(which(outcomes == success) == 1) { p = c(null, 1-null) }
if(which(outcomes == success) == 2) { p = c(1-null, null) }

# seed, if set
if(!is.null(seed)) { set.seed(seed) }

# loop for simulation
sim = data.frame(phat = rep(NA, nsim))
for(i in 1:nsim){
sim_sample = sample(outcomes, size = n, prob = p, replace = TRUE)
sim\$phat[i] = sum(sim_sample == success) / n
}
}```

## Step 4 - Create the simulation distribution

Calculate the p-value, as the proportion of simulations where the simulated probability of success is at least as extreme as the observed sample statistic calculated in item 2. Note that the definition of "extreme" will depend on the `alt`

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
# omitting earlier code so that everything fits on one slide

# calculate number of simulated p-hats at least as extreme as observed p-hat
if(alt  == "greater"){ nsim_extreme = sum(sim\$phat >= p_hat) }
if(alt  == "less"){ nsim_extreme = sum(sim\$phat <= p_hat) }
if(alt  == "not equal"){
if(p_hat > null) { nsim_extreme = 2 * sum(sim\$phat >= p_hat) }
if(p_hat < null) { nsim_extreme = 2 * sum(sim\$phat <= p_hat) }
}

# calculate p-value
p_value = nsim_extreme / nsim
}```

## Step 5 - Report output

Report all desired output: hypotheses, summary statistics, p-value, plot of simulation distribution

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
# omitting earlier code so that everything fits on one slide

# hypotheses were printed earlier

# summary statistics
cat(paste("Summary stats: n =", n, ", number of successes =", n_suc, ", p-hat =", round(p_hat, 4), "\n"))
if(round(p_value, 4) == 0) { cat(paste("p-value < 0.0001\n")) }
if(round(p_value, 4) > 0) { cat(paste("p-value = ", round(p_value, 4), "\n")) }

# plot
if(nsim <= 100){
simdist_plot = ggplot(sim, aes(x = phat)) + geom_dotplot()
suppressWarnings( suppressMessages( print( simdist_plot ) ) )
}
if(nsim > 100){
simdist_plot = qplot(phat, data = sim, geom = "histogram")
suppressWarnings( suppressMessages( print( simdist_plot ) ) )
}
}```

## Return

```one_cat_test <- function(data, null = NULL, alt = "not equal", success = NULL, nsim = 10000, seed = NULL){
# omitting earlier code so that everything fits on one slide

return(list(p_value = p_value))
}```

## Complete function

First load the `devtools` package, so that we can source from an https address:

`library(devtools)`

Then `source` the function:

`source_url("https://stat.duke.edu/courses/Fall14/sta112.01/code/one_cat_test.R")`

```download.file("https://stat.duke.edu/courses/Fall14/sta112.01/data/kissing.csv",
destfile = "kissing.csv", method = "curl")

## Run my function, compare to yours

```one_cat_test(data = kissing\$side, null = 0.5, alt = "greater", success = "right",
nsim = 100, seed = 1234)```
```## H0: p = 0.5
## HA: p > 0.5
## Summary stats: n = 124 , number of successes = 80 , p-hat = 0.6452
## p-value < 0.0001```

```## \$p_value
## [1] 0```

## Return

```mytest = one_cat_test(data = kissing\$side, null = 0.5, alt = "greater", success = "right",
nsim = 100, seed = 1234)```
```## H0: p = 0.5
## HA: p > 0.5
## Summary stats: n = 124 , number of successes = 80 , p-hat = 0.6452
## p-value < 0.0001```

`mytest\$p_value`
`## [1] 0`

## Confidence intervals

A plausible range of values for the population parameter is called a confidence interval.

• If we report a point estimate, we probably wonâ€™t hit the exact population parameter.

• If we report a range of plausible values we have a good shot at capturing the parameter.