library(tidyverse)
library(infer)
The data is from https://archive.ics.uci.edu/ml/datasets/Liver+Disorders.
Each observation in the dataset constitutes the record of a male individual. The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The last variable is meaningless, and we will drop it once we assign variable names.
Read in the data directly from the site with read_delim(). Function str_c() (from stringr in tidyverse) is for string concatenation and allows us to break up long strings. This is a nice way to keep your code within 80 characters.
liver <- read_delim(str_c("https://archive.ics.uci.edu/ml/",
                          "machine-learning-databases/liver-disorders/",
                          "bupa.data"),
                    delim = ",", col_names = FALSE)
Feature Information (as given in the data):
mcv: mean corpuscular volume
alkphos: alkaline phosphatase
sgpt: alanine aminotransferase
sgot: aspartate aminotransferase
gammagt: gamma-glutamyl transpeptidase
drink_qty: number of half-pint equivalents of alcoholic beverages drunk per day
Let’s set the variable names in the tibble and drop the last variable.
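liver <- liver %>%
  # keep and rename the first six columns; the remaining column is dropped
  select(mcv = X1, alkphos = X2, sgpt = X3, sgot = X4, gammagt = X5,
         drink_qty = X6) %>%
  # flag individuals who have more than 2 drinks per day
  mutate(drink = if_else(drink_qty > 2, "yes", "no"),
         drink = factor(drink))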
We created a new variable, drink, using mutate(). What does it mean?
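To see how the recoding behaves at the cutoff, here is a minimal sketch on a toy tibble. The drink_qty values are hypothetical and chosen only to straddle 2; they are not taken from the liver data.

```r
library(dplyr)

# Hypothetical drink_qty values, chosen only to show the cutoff at 2
toy <- tibble(drink_qty = c(0, 2, 2.5, 6)) %>%
  mutate(drink = if_else(drink_qty > 2, "yes", "no"),
         drink = factor(drink))

toy

# Factor levels are sorted alphabetically by default
levels(toy$drink)
```

Note that a value of exactly 2 is recoded as "no", since the condition is strictly greater than 2.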
set.seed(032420)
liver
Use the sample data to create a 96% confidence interval for the mean corpuscular volume for all males. What assumptions must you make?
liver %>%
  specify(response = mcv) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  summarise(
    lb = quantile(stat, .02),
    ub = quantile(stat, .98)
  )
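It may help to see the same percentile bootstrap written out in base R. This is only a sketch of the idea, not the internals of infer, and the vector x is a simulated stand-in for the mcv column (its size, mean, and standard deviation are arbitrary assumptions).

```r
set.seed(1)

# Simulated stand-in for liver$mcv (hypothetical values, not the real data)
x <- rnorm(200, mean = 90, sd = 4.5)

# One bootstrap replicate: resample x with replacement at the same size,
# then take the mean; repeat 1000 times
boot_means <- replicate(
  1000,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Middle 96% of the bootstrap means: cut 2% from each tail
ci <- quantile(boot_means, probs = c(0.02, 0.98))
ci
```

The probabilities .02 and .98 come from splitting the remaining 4% evenly between the two tails.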
Write out an interpretation of your interval from Exercise 1.
We are 96% confident that the mean corpuscular volume for all males is captured by the interval (89.72, 90.64).
Your goal will now be to create a 96% confidence interval for the difference in mean corpuscular volume between males who have more than 2 drinks per day and those who don’t. Define the difference as the mean for more than 2 drinks minus the mean for 2 or fewer drinks.
Before you get started, consult https://infer.netlify.com/articles/flights_examples.html#one-numerical-variable-one-categorical-2-levels-diff-in-means-1
What do you need to change in specify() and calculate()?
Specify both mcv and drink, and set order in function calculate().
Plot the bootstrapped sampling distribution for the difference in means. Comment on what you observe.
boot_dist <- liver %>%
  specify(mcv ~ drink) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
visualise(boot_dist)
ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = .25, color = "darkgreen", alpha = .5) +
  theme_bw()
# Same bootstrap, with named arguments in specify() instead of a formula
boot_dist2 <- liver %>%
  specify(response = mcv, explanatory = drink) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
visualise(boot_dist2)
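For intuition about what each bootstrap replicate of the difference in means involves, here is a base R sketch. It assumes whole rows are resampled with replacement, so each mcv value stays paired with its drink label. The data frame toy is simulated, with arbitrary group sizes and means; it is not the liver data.

```r
set.seed(2)

# Simulated stand-in for the liver tibble (hypothetical values, not the real data)
toy <- data.frame(
  mcv   = c(rnorm(60, mean = 91, sd = 4), rnorm(90, mean = 89, sd = 4)),
  drink = rep(c("yes", "no"), times = c(60, 90))
)

# One replicate: resample whole rows, then take mean(yes) - mean(no),
# matching order = c("yes", "no")
one_diff <- function(d) {
  b <- d[sample(nrow(d), replace = TRUE), ]
  mean(b$mcv[b$drink == "yes"]) - mean(b$mcv[b$drink == "no"])
}

boot_diffs <- replicate(1000, one_diff(toy))
summary(boot_diffs)
```

Swapping the two terms in one_diff() would flip the sign of every replicate, which is why the order argument matters when interpreting the interval.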
Create a 96% confidence interval for the difference in mean corpuscular volume between males who have more than 2 drinks per day and those who don’t.
boot_dist %>%
  summarise(
    lb = quantile(stat, .02),
    ub = quantile(stat, .98)
  )
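As an alternative to computing the quantiles by hand, infer also provides get_confidence_interval(). The sketch below uses a simulated stand-in for the liver tibble (hypothetical values, arbitrary group sizes) and checks that the percentile interval agrees with the quantile() approach.

```r
library(dplyr)
library(infer)

set.seed(3)

# Simulated stand-in for the liver tibble (hypothetical values, not the real data)
toy <- tibble(
  mcv   = c(rnorm(60, mean = 91, sd = 4), rnorm(90, mean = 89, sd = 4)),
  drink = factor(rep(c("yes", "no"), times = c(60, 90)))
)

toy_boot <- toy %>%
  specify(mcv ~ drink) %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Percentile interval at the 96% level
ci <- get_confidence_interval(toy_boot, level = 0.96, type = "percentile")
ci

# The same endpoints computed directly with quantile()
manual <- quantile(toy_boot$stat, probs = c(0.02, 0.98))
manual
```

The level argument takes the confidence level itself (0.96), so there is no need to work out the tail probabilities separately.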
Write out an interpretation of your interval from Exercise 5.
We are 96% confident that the difference in mean corpuscular volume between males who have more than 2 drinks per day and those who don’t (more than 2 drinks minus 2 or fewer) is captured by the interval (1.07, 2.89).