Packages and Data

library(tidyverse)
library(infer)

The data is from https://archive.ics.uci.edu/ml/datasets/Liver+Disorders.

Each observation in the dataset is the record of one male individual. The first five variables are blood tests thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The sixth variable records daily alcohol consumption. The last variable is a train/test selector that we do not need for this analysis, so we will drop it once we assign variable names.

Read in the data directly from the site with read_delim(). The function str_c() (from stringr, part of the tidyverse) concatenates strings, which lets us break a long string such as a URL into shorter pieces. This is a nice way to keep each line of code within 80 characters.
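As a small illustration of that point, the sketch below builds the data URL from short pieces and prints the result. The name data_url is introduced here only for illustration and is not used elsewhere in the lab.

```r
library(stringr)

# Build the long URL from short pieces (data_url is an illustrative name)
data_url <- str_c("https://archive.ics.uci.edu/ml/",
                  "machine-learning-databases/liver-disorders/",
                  "bupa.data")

# str_c() joins its arguments with no separator by default
data_url
```

Because the default separator is the empty string, the pieces must already contain any slashes that the final URL needs.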

liver <- read_delim(str_c("https://archive.ics.uci.edu/ml/",
                          "machine-learning-databases/liver-disorders/",
                          "bupa.data"), 
                    delim = ",", col_names = FALSE)

Feature Information (as given in the data):

mcv: mean corpuscular volume
alkphos: alkaline phosphatase
sgpt: alanine aminotransferase
sgot: aspartate aminotransferase
gammagt: gamma-glutamyl transpeptidase
drink_qty: number of half-pint equivalents of alcoholic beverages drunk per day
selector: field created by the BUPA researchers to split the data into train/test sets

Let’s set the variable names in the tibble and drop the last variable.

liver <- liver %>% 
  select(mcv = X1, alkphos = X2, sgpt = X3, sgot = X4, gammagt = X5, 
         drink_qty = X6) %>% 
  mutate(drink = if_else(drink_qty > 2, "yes", "no"),
         drink = factor(drink))

We created a new variable, drink, using mutate(). What does it mean?
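One way to explore this question is to apply the same mutate() step to a few made-up values and look at the result. The tibble toy and its values below are hypothetical and are not taken from the liver data.

```r
library(dplyr)

# Hypothetical toy values, not taken from the liver data
toy <- tibble(drink_qty = c(0, 0.5, 2, 3, 6))

# Same recoding as applied to liver above
toy <- toy %>%
  mutate(drink = if_else(drink_qty > 2, "yes", "no"),
         drink = factor(drink))

# Tabulate how many toy rows fall in each group
toy %>% count(drink)
```

Note how a value of exactly 2 is handled, since the condition uses a strict inequality.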

set.seed(032420)
liver

Exercises

Part 1

Exercise 1

Use the sample data to create a 96% confidence interval for the mean corpuscular volume for all males. What assumptions must you make?

liver %>% 
  specify(response = mcv) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean") %>% 
  summarise(
    lb = quantile(stat, .02),
    ub = quantile(stat, .98)
  )
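To see what the pipeline above is doing, the same percentile bootstrap can be sketched in base R. This is only an illustration under stated assumptions: the vector x is simulated and merely stands in for liver$mcv, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical sample standing in for liver$mcv
x <- rnorm(50, mean = 90, sd = 4)

# Resample with replacement 1000 times, recording the mean each time
boot_means <- replicate(
  1000,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Middle 96% of the bootstrap means: cut 2% from each tail
ci <- quantile(boot_means, probs = c(0.02, 0.98))
ci
```

A 96% interval leaves 4% outside, split evenly between the two tails, which is why the quantiles are 0.02 and 0.98.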

Exercise 2

Write out an interpretation of your interval from Exercise 1.

We are 96% confident that the interval (89.72, 90.64) captures the true mean corpuscular volume for all males.

Part 2

Your goal is now to create a 96% confidence interval for the difference in mean corpuscular volume between males who have more than 2 drinks per day and those who don’t. Define the difference as the mean for males with more than 2 drinks minus the mean for males with 2 or fewer drinks.

Exercise 3

Before you get started, consult https://infer.netlify.com/articles/flights_examples.html#one-numerical-variable-one-categorical-2-levels-diff-in-means-1

What do you need to change in specify() and calculate()?

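In specify(), account for two variables: mcv as the response and drink as the explanatory variable.
In calculate(), use stat = "diff in means" and set the order argument so the difference is "yes" minus "no".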

Exercise 4

Plot the bootstrapped sampling distribution for the difference in means. Comment on what you observe.

boot_dist <- liver %>% 
  specify(mcv ~ drink) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "diff in means", order = c("yes", "no"))

visualise(boot_dist)

ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = .25, color = "darkgreen", alpha = .5) +
  theme_bw()

boot_dist2 <- liver %>% 
  specify(response = mcv, explanatory = drink) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "diff in means", order = c("yes", "no"))

visualise(boot_dist2)

Exercise 5

Create a 96% confidence interval for the difference in mean corpuscular volume between males who have more than 2 drinks per day and those who don’t.

boot_dist %>% 
  summarise(
    lb = quantile(stat, .02),
    ub = quantile(stat, .98)
  )
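As an alternative to computing the quantiles by hand, infer provides get_confidence_interval(). The sketch below uses it on simulated data so that it runs on its own. The tibble toy, its values, and the seed are hypothetical, and the names of the columns in the returned tibble may differ between infer versions.

```r
library(dplyr)
library(infer)

set.seed(1)  # arbitrary seed for reproducibility of the sketch

# Hypothetical data standing in for the liver tibble
toy <- tibble(
  mcv   = c(rnorm(40, mean = 89, sd = 4), rnorm(40, mean = 91, sd = 4)),
  drink = factor(rep(c("no", "yes"), each = 40))
)

# Bootstrap distribution of the difference in means ("yes" minus "no")
toy_boot <- toy %>%
  specify(mcv ~ drink) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Percentile interval at the 96% level (the 2% and 98% quantiles of stat)
toy_ci <- get_confidence_interval(toy_boot, level = 0.96, type = "percentile")
toy_ci
```

With the lab data, the same call could be applied to boot_dist in place of toy_boot.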

Exercise 6

Write out an interpretation of your interval from Exercise 5.

We are 96% confident that the interval (1.07, 2.89) captures the true difference in mean corpuscular volume between males who have more than 2 drinks per day and those who don’t (more than 2 drinks minus 2 or fewer).

Stage, commit and push

Stage your modified files.
Commit your changes with an informative message.
Push your changes to your GitHub repo.
Verify your files were updated on GitHub.

References

UCI Machine Learning Repository: Liver Disorders Data Set. (2020). Archive.ics.uci.edu. Retrieved 9 March 2020, from https://archive.ics.uci.edu/ml/datasets/Liver+Disorders

Bootstrap Confidence Intervals

Part II