Packages and Data

library(tidyverse)
library(infer)

The songs data contain the length, in minutes, of each of the 3,000 songs on a person’s phone. Treat these 3,000 songs as the entire population.

songs <- read_csv("data/songs.csv")

Exercises

Exercise 1

What are the population mean and standard deviation of song length?

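# Give the columns descriptive names and add the length in seconds.
# Note: X1 is the name older readr versions assign to an unnamed first column;
# newer versions call it ...1, so adjust the rename() call if needed.
songs <- songs %>% 
  rename(id = X1,
         length_minutes = length) %>% 
  mutate(length_seconds = length_minutes * 60)

# Histogram of song lengths, reused in Exercise 2
songs_plot <- songs %>% 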
  ggplot(aes(x = length_minutes)) +
  geom_histogram(binwidth = .25, alpha = .5, color = "violet") +
  labs(x = "Song length (minutes)", y = "Count") +
  theme_minimal(base_size = 16)

songs_plot

songs %>% 
  summarise(mean_length = mean(length_minutes),
            sd_length   = sd(length_minutes))
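
One caveat on the summary above: sd() divides by n - 1 (the sample formula), whereas a population standard deviation divides by N. The sketch below illustrates the difference on a small made-up vector (the values in x are hypothetical and are not taken from the songs data). With N = 3,000 the correction factor sqrt((N - 1) / N) is about 0.9998, so the distinction barely matters here.

```r
# Hypothetical toy "population" of song lengths in minutes (not the songs data)
x <- c(2.5, 3.0, 3.5, 4.0, 4.5)
N <- length(x)

sample_sd <- sd(x)                        # divides the sum of squares by N - 1
pop_sd    <- sqrt(mean((x - mean(x))^2))  # divides the sum of squares by N

# The two differ only by a factor of sqrt((N - 1) / N)
all.equal(pop_sd, sample_sd * sqrt((N - 1) / N))
```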

Exercise 2

What is the probability that a randomly selected song is longer than 5 minutes?

songs_plot +
  geom_vline(xintercept = 5, lty = 2, linewidth = 1, color = "darkblue")

songs %>% 
  mutate(length_test = length_minutes > 5) %>% 
  summarise(prop = mean(length_test))
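
The calculation above works because the mean of a logical vector is the proportion of TRUE values (TRUE counts as 1, FALSE as 0). A minimal illustration on a made-up vector (toy_lengths is hypothetical, not the songs data):

```r
# Hypothetical song lengths in minutes (not the songs data)
toy_lengths <- c(2.1, 5.4, 3.3, 6.0, 4.8)

is_long <- toy_lengths > 5   # FALSE TRUE FALSE TRUE FALSE
prop_long <- mean(is_long)   # 2 of the 5 values exceed 5 minutes
prop_long
```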

Exercise 3

Set the seed to 53251 and take a random sample of 100 songs.

Construct a 95% confidence interval for the mean song length using your random sample. Did your interval contain the true population mean?

set.seed(53251)

songs_sample <- songs %>% 
  sample_n(size = 100)

songs_sample

t_test(x = songs_sample, response = length_minutes, 
       conf_int = TRUE, conf_level = 0.95) %>% 
  select(lower_ci:upper_ci)

songs_minutes <- songs_sample %>% 
  pull(length_minutes)

xbar <- mean(songs_minutes)
s <- sd(songs_minutes)
n <- length(songs_minutes)
critical_value <- qt(p = 0.975, df = n - 1)

c(xbar - critical_value * (s / sqrt(n)), xbar + critical_value * (s / sqrt(n)))
#> [1] 3.345638 4.119063

Yes. The true population mean is 3.8 minutes, which lies inside our 95% confidence interval of (3.35, 4.12).
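
A single interval either contains the population mean or it does not. The 95% refers to how often the procedure succeeds over repeated samples. The simulation below sketches that idea under stated assumptions: population is a synthetic right-skewed stand-in for the songs data (generated with rgamma(), not read from songs.csv), and the seed, the helper one_interval(), and the number of replications are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, used only to make this sketch reproducible

# Hypothetical right-skewed population of 3,000 "song lengths" (not the songs data)
population <- rgamma(3000, shape = 2.5, scale = 1.5)
mu <- mean(population)

# One t-based 95% interval from a simple random sample of size n
one_interval <- function(n = 100) {
  x <- sample(population, size = n)
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# Repeat the sampling many times; the result is a 2 x 2000 matrix
intervals <- replicate(2000, one_interval())

# Proportion of intervals that contain the population mean (expected near 0.95)
covered <- intervals["lower", ] <= mu & mu <= intervals["upper", ]
mean(covered)
```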

Exercise 4

On a 6-hour drive, what is the probability that a randomly selected playlist of 100 songs lasts at least the length of the trip?

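A 6-hour drive lasts 360 minutes, so a playlist of 100 songs covers the trip when its total length is at least 360 minutes, that is, when the mean song length is at least 360 / 100 = 3.6 minutes. The question is therefore the same as asking: what is \(P(\bar{X} \geq 3.6)\) for a random sample of 100 songs?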

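From the CLT we know that, approximately, \(\bar{X} \sim N(\mu, \sigma / \sqrt{n})\), where the second parameter is the standard deviation of the sample mean. Here \(\mu = 3.8\), \(\sigma = 2.39\), and \(n = 100\). To compute this probability we can use the function pnorm() with lower.tail = FALSE.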

ggplot(NULL, aes(c(2, 5))) +
  geom_area(stat = "function", fun = dnorm, args = list(mean = 3.8, sd = 2.39 / sqrt(n)),
            fill = "#00998a", xlim = c(3.6, 5)) +
  geom_area(stat = "function", fun = dnorm, args = list(mean = 3.8, sd = 2.39 / sqrt(n)), 
            fill = "grey80", xlim = c(2, 3.6)) +
  labs(x = "x", y = "f(x)") +
  theme_minimal(base_size = 16)

pnorm(q = 3.6, mean = 3.80, sd = 2.39 / sqrt(n), lower.tail = FALSE)
#> [1] 0.7986531
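
The same probability can be traced step by step. The sketch below is self-contained and uses the values quoted in the text (mean 3.8, standard deviation 2.39, 100 songs, 6 hours); the variable names are illustrative only. It also checks that working on the scale of the total playlist length gives the same answer as working with the sample mean.

```r
trip_minutes <- 6 * 60   # 6-hour drive expressed in minutes
n_songs <- 100
mu <- 3.8                # population mean quoted in the text
sigma <- 2.39            # population standard deviation quoted in the text

threshold <- trip_minutes / n_songs   # required mean song length: 3.6 minutes
se <- sigma / sqrt(n_songs)           # standard error of the sample mean: 0.239
z <- (threshold - mu) / se            # standardized threshold

# P(sample mean >= 3.6) under the normal approximation
p_mean <- pnorm(z, lower.tail = FALSE)

# Equivalent calculation for the total length: P(sum >= 360)
p_total <- pnorm(trip_minutes, mean = n_songs * mu,
                 sd = sqrt(n_songs) * sigma, lower.tail = FALSE)

c(p_mean = p_mean, p_total = p_total)
```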

Stage, commit and push

  1. Stage your modified files.
  2. Commit your changes with an informative message.
  3. Push your changes to your GitHub repo.
  4. Verify your files were updated on GitHub.