Packages and Data

library(tidyverse)
library(infer)

The songs data set contains the length, in minutes, of each of the 3,000 songs on a person's phone. Treat these 3,000 songs as the entire population.

songs <- read_csv("data/songs.csv")

Exercises

Exercise 1

What are the population mean and standard deviation of song length?

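# Rename the first column to id and add the song length in seconds.
# X1 is presumably an unnamed column as labelled by readr before version 2.0;
# newer readr versions would label such a column ...1 instead.
songs <- songs %>% 
  rename(id = X1,
         length_minutes = length) %>% 
  mutate(length_seconds = length_minutes * 60)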

songs_plot <- songs %>% 
  ggplot(aes(x = length_minutes)) +
  geom_histogram(binwidth = .25, alpha = .5, color = "violet") +
  labs(x = "Song length (minutes)", y = "Count") +
  theme_minimal(base_size = 16)

songs_plot

songs %>% 
  summarise(mean_length = mean(length_minutes),
            sd_length   = sd(length_minutes))
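
Note that sd() uses the sample divisor n - 1, whereas a population standard deviation divides by N. With N = 3,000 the two differ by a factor of about 0.9998, so the answer is practically unchanged. A minimal sketch of the comparison, where pop_sd() is a hypothetical helper (not a base R or tidyverse function) and the five lengths are made-up values rather than the songs data:

```r
# Hypothetical helper: population standard deviation (divisor N, not N - 1)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Made-up song lengths in minutes, for illustration only
x <- c(2.5, 3.1, 3.8, 4.4, 5.2)

sample_sd     <- sd(x)       # divisor n - 1
population_sd <- pop_sd(x)   # divisor n

# The ratio of the two is always sqrt((n - 1) / n)
ratio_small <- population_sd / sample_sd

# For a population of 3,000 values the same ratio is very close to 1
ratio_3000 <- sqrt((3000 - 1) / 3000)
```

If the exact population value is wanted, multiplying sd_length by ratio_3000 would give it.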

Exercise 2

What is the probability that a randomly selected song is longer than 5 minutes?

songs_plot +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_vline(xintercept = 5, lty = 2, linewidth = 1, color = "darkblue")

songs %>% 
  mutate(length_test = length_minutes > 5) %>% 
  summarise(prop = mean(length_test))
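
The calculation above works because the mean of a logical vector is the proportion of TRUE values, and for a song drawn uniformly at random from the population that proportion is the probability. A small sketch with made-up lengths (not the songs data):

```r
# Made-up song lengths in minutes, for illustration only
song_lengths <- c(2.9, 5.4, 3.7, 6.1, 4.8)

# TRUE where a song is longer than 5 minutes
longer_than_5 <- song_lengths > 5

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUE values
prop <- mean(longer_than_5)
```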

Exercise 3

Set the seed to 53251 and take a random sample of 100 songs.

Construct a 95% confidence interval for the mean song length using your random sample. Did your interval contain the true population mean?

set.seed(53251)

songs_sample <- songs %>% 
  sample_n(size = 100)

songs_sample

t_test(x = songs_sample, response = length_minutes, 
       conf_int = TRUE, conf_level = 0.95) %>% 
  select(lower_ci:upper_ci)

songs_minutes <- songs_sample %>% 
  pull(length_minutes)

xbar <- mean(songs_minutes)
s <- sd(songs_minutes)
n <- length(songs_minutes)
critical_value <- qt(p = 0.975, df = n - 1)

c(xbar - critical_value * (s / sqrt(n)), xbar + critical_value * (s / sqrt(n)))

#> [1] 3.345638 4.119063

Yes, it did. The true population mean is 3.8 minutes, which lies inside our 95% confidence interval of (3.35, 4.12).
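
A single interval either contains the population mean or it does not; the 95% refers to how often the procedure succeeds over repeated samples. The sketch below illustrates this under stated assumptions: the population is a hypothetical gamma-distributed stand-in with mean near 3.8 and standard deviation near 2.4, not the real songs data, and the seed and number of repetitions are arbitrary.

```r
set.seed(1)  # arbitrary seed, used only for this illustration

# Hypothetical stand-in population of 3,000 song lengths (minutes)
population <- rgamma(3000, shape = 2.5, rate = 2.5 / 3.8)
mu <- mean(population)

# Draw many samples of 100 songs and check whether each t interval covers mu
covers <- replicate(2000, {
  x <- sample(population, size = 100)
  half_width <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
  (mean(x) - half_width <= mu) && (mu <= mean(x) + half_width)
})

# Proportion of intervals that contain the population mean
coverage <- mean(covers)
```

Under these assumptions coverage should come out close to 0.95.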

Exercise 4

On a 6-hour drive, what is the probability that a randomly selected playlist of 100 songs lasts at least the length of the trip?

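A 6-hour trip is 360 minutes, so 100 songs last the whole trip when their mean length is at least 360 / 100 = 3.6 minutes. This is the same as asking for \(P(\bar{X} \geq 3.6)\) for a random sample of 100 songs.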

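By the central limit theorem, \(\bar{X}\) is approximately distributed as \(N(\mu, \sigma / \sqrt{n})\), where the second parameter is the standard deviation. Here, \(\mu = 3.8\), \(\sigma = 2.39\), and \(n = 100\). To compute this probability, we can use the function pnorm().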

ggplot(NULL, aes(c(2, 5))) +
  geom_area(stat = "function", fun = dnorm, args = list(mean = 3.8, sd = 2.39 / sqrt(n)),
            fill = "#00998a", xlim = c(3.6, 5)) +
  geom_area(stat = "function", fun = dnorm, args = list(mean = 3.8, sd = 2.39 / sqrt(n)), 
            fill = "grey80", xlim = c(2, 3.6)) +
  labs(x = "x", y = "f(x)") +
  theme_minimal(base_size = 16)

pnorm(q = 3.6, mean = 3.80, sd = 2.39 / sqrt(n), lower.tail = FALSE)

#> [1] 0.7986531
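
As a check, the same probability can be reached by standardizing the threshold first. This is a minimal base R sketch using only the values quoted in this exercise; the variable names are chosen here for illustration.

```r
# Values taken from the exercise text
mu    <- 3.8      # population mean song length (minutes)
sigma <- 2.39     # population standard deviation (minutes)
n     <- 100      # number of songs in the playlist
trip  <- 6 * 60   # trip length in minutes

threshold <- trip / n        # mean song length needed to fill the trip
se <- sigma / sqrt(n)        # standard error of the sample mean
z  <- (threshold - mu) / se  # standardized threshold

# Upper-tail probability, computed on the original and on the standardized scale
p_direct   <- pnorm(threshold, mean = mu, sd = se, lower.tail = FALSE)
p_standard <- pnorm(z, lower.tail = FALSE)
```

Both p_direct and p_standard should agree with the value printed above.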

Stage, commit and push

Stage your modified files.
Commit your changes with an informative message.
Push your changes to your GitHub repo.
Verify your files were updated on GitHub.