Today’s lab will serve as a refresher for many of the functions you’ve used in R. It will not cover dplyr
or ggplot
functions (though they may still be useful, for instance for manipulating or plotting data in order to check assumptions), but will focus on the statistical tests themselves.
We’ll begin with a short refresher on the assignment operator <-
and $
, and then review the functions used for statistical tests introduced on homeworks 4, 5, and 6.
This lab is a compilation of the pre-homework activities. The lab TAs will be available to help in case you have any questions. There are no exercises to turn in.
We’ll be revisiting the licorice example presented in lecture and HW 05. As a review, postoperative sore throat is an annoying and painful complication of intubation after surgery, particularly with wider-gauge double-lumen tubes. Ruetzler et al. (2013) performed an experimental study among patients having elective surgery who required intubation with a double-lumen tube. Prior to anesthesia, patients were randomly assigned to gargle either a licorice-based solution or sugar water (as placebo). Sore throat was evaluated 30 minutes, 90 minutes, and 4 hours after conclusion of the surgery, on a scale from 0 to 10.
The data are available with the following code:
library(tidyverse)
licorice <- read.csv("https://www2.stat.duke.edu/courses/Spring22/sta102.001/hw/data/licorice.csv")
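Before running any tests, it can help to confirm the data loaded correctly. A minimal sketch (none of this is required; glimpse() comes from dplyr, which tidyverse loads):
# Quick checks on the imported data
dim(licorice)       # number of rows (patients) and columns (variables)
glimpse(licorice)   # variable names and types at a glance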
If we have a dataset of interest, we can use the syntax t.test(x, mu = _, alternative = _, conf.level = _)
to perform t-tests in R
and construct confidence intervals for the mean (or difference in means). For further details, you may type ?t.test
or help(t.test)
. The arguments for this function are as follows:
- x, a numeric vector of data values
- mu, a number indicating the true value of the mean under the null hypothesis
- alternative, specifying the alternative hypothesis. It must be either "two.sided", "greater", or "less".
- conf.level, specifying the confidence level of the confidence interval.

Example: Suppose we want to test the null hypothesis that the mean BMI among all patients was 25 vs. the alternative hypothesis that the mean BMI among all patients was not equal to 25:
t.test(licorice$preOp_calcBMI, mu = 25,
alternative = "two.sided",
conf.level = 0.95)
##
## One Sample t-test
##
## data: licorice$preOp_calcBMI
## t = 2.1224, df = 234, p-value = 0.03486
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
## 25.04243 26.14047
## sample estimates:
## mean of x
## 25.59145
The dollar sign $
tells R which variable to use. For instance, licorice$preOp_calcBMI
tells R that you are interested in the preOp_calcBMI
variable from the licorice
dataset.
Note that the output displays the t-statistic, the degrees of freedom, the p-value, the alternative hypothesis tested, the 95% confidence interval, and the sample mean (wow, that’s a lot!).
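As an aside on the $ syntax above: it works anywhere a function expects a single vector. For instance, a quick sketch of descriptive statistics for BMI (output not shown):
# Descriptive statistics for pre-operative BMI using the $ syntax
mean(licorice$preOp_calcBMI)
sd(licorice$preOp_calcBMI)
summary(licorice$preOp_calcBMI)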
If you are performing a two-sample t-test, there are some additional arguments:
- y, a numeric vector of data values (placed after x)
- paired, a logical (TRUE or FALSE) indicating whether you want a paired t-test, and
- var.equal, a logical (TRUE or FALSE) indicating whether you assume the variance is the same in both groups (this affects the degrees of freedom used in the test)

Example: Suppose we want to test the null hypothesis that the mean BMI among men was less than or equal to the mean BMI among women vs. the alternative hypothesis that the mean BMI among men was greater than the mean BMI among women:
t.test(licorice$preOp_calcBMI ~ licorice$preOp_gender, mu = 0,
alternative = "greater",
paired = FALSE,
var.equal = FALSE,
conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: licorice$preOp_calcBMI by licorice$preOp_gender
## t = 2.9044, df = 182.62, p-value = 0.002067
## alternative hypothesis: true difference in means between group 0 and group 1 is greater than 0
## 95 percent confidence interval:
## 0.716381 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 26.24958 24.58656
Here, since there were two variables, we used formula syntax, given by a tilde ~. In this syntax, we have variable of interest ~ grouping variable. So when we type licorice$preOp_calcBMI ~ licorice$preOp_gender, we are asking R to perform an independent two-sample t-test on the preOp_calcBMI variable from licorice, with the groups defined by the preOp_gender variable.
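An equivalent way to run the same test is to pass two separate numeric vectors as x and y rather than using formula syntax. A sketch, assuming preOp_gender is coded 0/1 as in the output above:
# Split BMI into the two gender groups "by hand" and pass them as x and y
bmi_group0 <- licorice$preOp_calcBMI[licorice$preOp_gender == 0]
bmi_group1 <- licorice$preOp_calcBMI[licorice$preOp_gender == 1]
t.test(x = bmi_group0, y = bmi_group1,
       alternative = "greater",
       paired = FALSE,
       var.equal = FALSE,
       conf.level = 0.95)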
To conduct an ANOVA in R, we wrap the aov function in summary to create the ANOVA table. The general syntax is summary(aov(outcome_var ~ grouping_var, data = dataset)), where outcome_var is the numerical variable whose means you are interested in comparing (e.g., the pulse rate in the experiment discussed in class) and grouping_var is the grouping variable (e.g., pet vs. friend vs. neither).
Example: Suppose you’re interested in testing whether there is a difference in mean BMI between patients of the three ASA categories:
summary(aov(preOp_calcBMI ~ preOp_asa, data = licorice))
## Df Sum Sq Mean Sq F value Pr(>F)
## preOp_asa 2 262 131.21 7.595 0.000638 ***
## Residuals 232 4008 17.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that the ANOVA table provides all output talked about in the lecture slides. The between-group variance component is given in the top line and the within-group variance component is given in the bottom line.
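Since dplyr can still be useful for checking assumptions, here is a sketch of how you might look at the group sizes, means, and standard deviations alongside the ANOVA table (output not shown):
# Sample size, mean, and SD of BMI within each ASA category (dplyr)
licorice %>%
  group_by(preOp_asa) %>%
  summarize(n = n(),
            mean_bmi = mean(preOp_calcBMI, na.rm = TRUE),
            sd_bmi   = sd(preOp_calcBMI, na.rm = TRUE))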
In R, we can use the wilcox.test() function to perform both the signed-rank test (for paired observations) and the rank sum test (for unpaired observations). In general, the format is as follows:
wilcox.test(outcome ~ groups, data = dataset, paired = ___)
where outcome is the outcome variable, groups is the group variable, dataset is the dataset, and paired is either TRUE or FALSE depending on whether you want a signed-rank test or a rank sum test, respectively. You may also specify the alternative to be "two.sided", "less", or "greater".
Example: Suppose you wanted to perform a rank sum test, examining differences in median throat pain score between patients receiving placebo and licorice:
wilcox.test(pacu90min_throatPain ~ treat, data = licorice,
paired = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: pacu90min_throatPain by treat
## W = 8620, p-value = 1.17e-06
## alternative hypothesis: true location shift is not equal to 0
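For paired observations (the signed-rank test), you can instead pass two numeric vectors and set paired = TRUE. A sketch comparing throat pain at 30 and 90 minutes within the same patients; note that the column name pacu30min_throatPain is assumed here and may differ in the actual dataset:
# Signed-rank test for paired measurements on the same patients
# (pacu30min_throatPain is an assumed column name)
wilcox.test(licorice$pacu30min_throatPain,
            licorice$pacu90min_throatPain,
            paired = TRUE)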
For the Kruskal-Wallis test, use the kruskal.test()
function. The general format is:
kruskal.test(outcome ~ groups, data = dataset)
where again, outcome
is the outcome variable, groups
is the group variable, and dataset
is the dataset.
Example: Suppose you wanted to perform a Kruskal-Wallis test, examining differences in median throat pain score between male and female patients (the preOp_gender variable):
kruskal.test(pacu90min_throatPain ~ preOp_gender, data = licorice)
##
## Kruskal-Wallis rank sum test
##
## data: pacu90min_throatPain by preOp_gender
## Kruskal-Wallis chi-squared = 4.2705, df = 1, p-value = 0.03878
Tables in R can be created straightforwardly with the table() function. For instance, suppose we wanted to create a table looking at the ASA score prior to surgery:
table(licorice$preOp_asa)
## 
##          healthy     mild_disease systemic_disease 
##               41              134               60
We can also make a 2-way table by specifying another variable. For instance, we could look at the relationship between ASA score and treatment (perhaps to see whether the treatment groups were “balanced” in terms of ASA):
table(licorice$preOp_asa, licorice$treat)
##
## 0 1
## healthy 19 22
## mild_disease 67 67
## systemic_disease 31 29
Notice that the first variable specified is displayed along the rows, and the second variable specified is displayed along the columns. Although R will let you create tables with any variables, they only really make sense for categorical data.
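If you want proportions rather than counts, one option is a short sketch with prop.table(); margin = 2 gives proportions within each column (here, within each treatment group):
# Convert the two-way table of counts to column proportions
asa_by_treat <- table(licorice$preOp_asa, licorice$treat)
prop.table(asa_by_treat, margin = 2)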
Suppose you are interested in testing a single proportion. You may use the prop.test() function to do so, and it contains the following arguments:
- x, the number of “successes”
- n, the total sample size
- p, the hypothesized proportion
- alternative, specifying the alternative hypothesis. It must be either "two.sided", "greater", or "less".
- conf.level, specifying the confidence level of the confidence interval.

Example: We saw earlier in the table that there were 117 patients receiving placebo and 118 receiving licorice. If we wanted to test whether this proportion is different from the null hypothesized proportion \(p_0 = 0.5\) at the \(\alpha = 0.05\) significance level, we could use the following code (looking at placebo):
prop.test(x = 117, n = 235, p = 0.5,
alternative = "two.sided", conf.level = 0.95)
##
## 1-sample proportions test with continuity correction
##
## data: 117 out of 235, null probability 0.5
## X-squared = 0, df = 1, p-value = 1
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4324046 0.5634108
## sample estimates:
## p
## 0.4978723
Note that the output displays a \(\chi^2\) statistic and degrees of freedom. This is similar to the test using the normal distribution covered in the lecture slides. You may safely use R to evaluate any hypothesis tests for proportions, and thus may use this \(\chi^2\) statistic for a test of proportions. The output also includes the p-value, the alternative hypothesis tested, a confidence interval, and the data itself.
Chi-square tests are performed in R by passing a table to the chisq.test() function; to conduct the test, simply wrap the table in chisq.test().
Example: Is there an association between ASA and gender among patients receiving surgeries requiring intubation?
chisq.test(table(licorice$preOp_asa, licorice$preOp_gender))
##
## Pearson's Chi-squared test
##
## data: table(licorice$preOp_asa, licorice$preOp_gender)
## X-squared = 0.18711, df = 2, p-value = 0.9107
Note that the output provides the \(\chi^2\) statistic, the degrees of freedom, and the p-value. We created a table looking at ASA and gender, and then directly used the chisq.test()
function on this table.
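If you also want to check that the expected cell counts are large enough for the chi-square approximation (commonly at least 5 per cell), the object returned by chisq.test() stores them; a short sketch:
# Save the test object and inspect the expected counts under independence
asa_gender_test <- chisq.test(table(licorice$preOp_asa, licorice$preOp_gender))
asa_gender_test$expected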
Fisher’s exact test works similarly to the chi-square test: simply apply the fisher.test function to a table.
Example: Same as above, but using Fisher’s exact test:
fisher.test(table(licorice$preOp_asa, licorice$preOp_gender))
##
## Fisher's Exact Test for Count Data
##
## data: table(licorice$preOp_asa, licorice$preOp_gender)
## p-value = 0.9159
## alternative hypothesis: two.sided
Today’s dataset was made available by the Lerner Research Institute and Dr. Amy S. Nowacki of the Cleveland Clinic. These data are representative of a study by Ruetzler et al. (2013).