```
library(tidyverse)
library(infer)
```

The data is from https://archive.ics.uci.edu/ml/datasets/Liver+Disorders.

Each observation in the dataset constitutes the record of a male individual. The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The last variable is a train/test selector field that we do not need, so we will drop it once we assign variable names.

Read in the data directly from the site with `read_delim()`. Function `str_c()` (from `stringr` in `tidyverse`) is for string concatenation and allows us to break up long strings. This is a nice way to keep your code within 80 characters.

```
liver <- read_delim(str_c("https://archive.ics.uci.edu/ml/",
                          "machine-learning-databases/liver-disorders/",
                          "bupa.data"),
                    delim = ",", col_names = FALSE)
```

**Feature Information (as given in the data):**

- `mcv`: mean corpuscular volume
- `alkphos`: alkaline phosphatase
- `sgpt`: alanine aminotransferase
- `sgot`: aspartate aminotransferase
- `gammagt`: gamma-glutamyl transpeptidase
- `drink_qty`: number of half-pint equivalents of alcoholic beverages drunk per day
- selector field created by the BUPA researchers to split the data into train/test sets

Let’s set the variable names in the tibble and drop the last variable.

```
liver <- liver %>%
  select(mcv = X1, alkphos = X2, sgpt = X3, sgot = X4, gammagt = X5,
         drink_qty = X6) %>%
  mutate(drink = if_else(drink_qty > 2, "yes", "no"),
         drink = factor(drink))
```

We created a new variable, `drink`, using `mutate()`. What does it mean?
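To see what `drink` encodes, here is a minimal sketch on a few made-up `drink_qty` values. The tibble `toy` is hypothetical and not part of the liver data; it only reuses the same `if_else()` and `factor()` steps as the code above.

```r
library(tidyverse)

# Hypothetical values chosen to sit on both sides of the cutoff of 2
toy <- tibble(drink_qty = c(0, 0.5, 2, 3, 6))

toy <- toy %>%
  mutate(drink = if_else(drink_qty > 2, "yes", "no"),  # strictly more than 2
         drink = factor(drink))

toy
levels(toy$drink)
```

A value of exactly 2 is labelled `"no"` because the comparison is strict, so `drink` is `"yes"` only for individuals reporting more than 2 half-pint equivalents per day.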

```
set.seed(032420)
liver
```

Use the sample data to create a 96% confidence interval for the mean corpuscular volume for all males. What assumptions must you make?

```
liver %>%
  specify(response = mcv) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  summarise(
    lb = quantile(stat, .02),
    ub = quantile(stat, .98)
  )
```
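The pipeline above is a percentile bootstrap: resample the observations with replacement, recompute the mean each time, and take the 2nd and 98th percentiles of those means. A minimal base R sketch of the same idea follows, assuming a simulated vector `x` in place of `liver$mcv` (the values and the names `x`, `boot_means`, and `ci` are made up for illustration).

```r
set.seed(1)

# Hypothetical sample standing in for liver$mcv
x <- rnorm(50, mean = 90, sd = 4)

# Resample with replacement and record the mean of each resample
boot_means <- replicate(
  1000,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Cutting 2% from each tail leaves the middle 96% of the bootstrap means
ci <- quantile(boot_means, probs = c(0.02, 0.98))
ci
```

This is only meant to show what the resampling does; for the actual exercise the `infer` version above is the one to use.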

Write out an interpretation of your interval from Exercise 1.

We are 96% confident that the mean corpuscular volume for all males is captured by the interval (89.72, 90.64).

Your goal will now be to create a 96% confidence interval for the difference in mean corpuscular volume between males who have more than 2 drinks per day and those who don’t. Define the difference as the mean for more than 2 drinks minus the mean for 2 or fewer drinks.

Before you get started, consult https://infer.netlify.com/articles/flights_examples.html#one-numerical-variable-one-categorical-2-levels-diff-in-means-1

What do you need to change in `specify()` and `calculate()`?

- account for two variables, `mcv` and `drink`
- use “diff in means” and argument `order` in function `calculate()`
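The `order` argument fixes the direction of the subtraction: with `order = c("yes", "no")` the statistic is the mean of the `"yes"` group minus the mean of the `"no"` group. A small sketch with made-up numbers (the data frame `toy_groups` is hypothetical) computes that quantity by hand so the sign is easy to check.

```r
# Hypothetical responses for the two groups
toy_groups <- data.frame(
  mcv   = c(92, 94, 96, 88, 90),
  drink = c("yes", "yes", "yes", "no", "no")
)

# Mean of mcv within each level of drink
group_means <- tapply(toy_groups$mcv, toy_groups$drink, mean)

# "yes" minus "no", matching order = c("yes", "no")
diff_yes_no <- group_means[["yes"]] - group_means[["no"]]
diff_yes_no  # 94 - 89 = 5
```

Reversing the order to `c("no", "yes")` would flip the sign of the statistic and of both interval endpoints.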

Plot the bootstrapped sampling distribution for the difference in means. Comment on what you observe.

```
boot_dist <- liver %>%
  specify(mcv ~ drink) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

visualise(boot_dist)
```

```
ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = .25, color = "darkgreen", alpha = .5) +
  theme_bw()
```
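The stated goal is a 96% interval for the difference, so one plausible next step is to take the 2nd and 98th percentiles of `stat`, as in Exercise 1. The sketch below assumes a simulated stand-in for `liver` (the tibble `toy_liver`, its values, and the names `toy_boot` and `diff_ci` are made up) so that it runs without downloading the data.

```r
library(tidyverse)
library(infer)

set.seed(1)

# Simulated stand-in for the liver data; the values are made up
toy_liver <- tibble(
  mcv   = c(rnorm(40, mean = 91, sd = 4), rnorm(60, mean = 89, sd = 4)),
  drink = factor(rep(c("yes", "no"), times = c(40, 60)))
)

# Same bootstrap pipeline as above, applied to the simulated data
toy_boot <- toy_liver %>%
  specify(mcv ~ drink) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("yes", "no"))

# Percentile interval: 2% in each tail leaves the middle 96%
diff_ci <- toy_boot %>%
  summarise(lb = quantile(stat, .02),
            ub = quantile(stat, .98))
diff_ci
```

With the real data, replacing `toy_liver` by `liver` (or applying the `summarise()` step directly to `boot_dist`) should give the interval the exercise asks for.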