library(tidyverse)
library(infer)
The songs data contain the length, in minutes, of each of the 3,000 songs on a person’s phone (treat this as the entire population).
songs <- read_csv("data/songs.csv")
What are the population mean and standard deviation of song length?
songs <- songs %>%
  rename(id = X1,
         length_minutes = length) %>%
  mutate(length_seconds = length_minutes * 60)
songs_plot <- songs %>%
  ggplot(aes(x = length_minutes)) +
  geom_histogram(binwidth = .25, alpha = .5, color = "violet") +
  labs(x = "Song length (minutes)", y = "Count") +
  theme_minimal(base_size = 16)
songs_plot
songs %>%
  summarise(mean_length = mean(length_minutes),
            sd_length = sd(length_minutes))
What is the probability that a randomly selected song is longer than 5 minutes?
songs_plot +
  geom_vline(xintercept = 5, lty = 2, size = 1, color = "darkblue")
songs %>%
  mutate(length_test = length_minutes > 5) %>%
  summarise(prop = mean(length_test))
Set the seed to 53251 and take a random sample of 100 songs.
Construct a 95% confidence interval for the mean song length using your random sample. Did your interval contain the true population mean?
set.seed(53251)
songs_sample <- songs %>%
  sample_n(size = 100)
songs_sample
t_test(x = songs_sample, response = length_minutes,
       conf_int = TRUE, conf_level = 0.95) %>%
  select(lower_ci:upper_ci)
songs_minutes <- songs_sample %>%
  pull(length_minutes)
xbar <- mean(songs_minutes)
s <- sd(songs_minutes)
n <- length(songs_minutes)
critical_value <- qt(p = 0.975, df = n - 1)
c(xbar - critical_value * (s / sqrt(n)), xbar + critical_value * (s / sqrt(n)))
#> [1] 3.345638 4.119063
Yes, it did. The true population mean is known to be 3.8 minutes, which lies inside our 95% confidence interval of (3.35, 4.12).
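To see what "95% confidence" means in practice, the following is a minimal sketch that repeats the sampling many times and counts how often the interval captures the population mean. It does not use the songs data: `population` is a simulated, right-skewed stand-in, and `t_interval()` is a hypothetical helper that wraps the same formula used above.

```r
# Hypothetical stand-in population: 3,000 right-skewed "song lengths" in minutes.
# These values are simulated for illustration; they are NOT the songs data.
set.seed(1)
population <- rgamma(3000, shape = 2.5, rate = 2.5 / 3.8)
mu <- mean(population)

# t-based confidence interval for the mean of a numeric vector
# (illustrative helper, same formula as the manual calculation)
t_interval <- function(x, conf_level = 0.95) {
  n <- length(x)
  critical_value <- qt(p = 1 - (1 - conf_level) / 2, df = n - 1)
  margin <- critical_value * sd(x) / sqrt(n)
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Draw 1,000 samples of size 100; record whether each interval covers mu
covers <- replicate(1000, {
  ci <- t_interval(sample(population, size = 100))
  ci[["lower"]] <= mu && mu <= ci[["upper"]]
})

# Proportion of intervals that contain the population mean
coverage <- mean(covers)
coverage
```

If the method behaves as advertised, `coverage` should land close to 0.95: most intervals contain the population mean, but roughly one in twenty does not.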
On a 6-hour drive, what is the probability that a randomly selected playlist of 100 songs lasts at least the length of the trip?
This is the same as asking for \(P(\bar{X} \geq 3.6)\) for a random sample of 100 songs: a 6-hour trip is 360 minutes, so the 100 songs must average at least 360 / 100 = 3.6 minutes each.
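From the CLT we know that, approximately, \(\bar{X} \sim N(\mu, \sigma / \sqrt{n})\), where the second parameter is the standard deviation of the sample mean. Here, \(\mu = 3.8\), \(\sigma = 2.39\), and \(n = 100\). To compute this probability we can use the function pnorm().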
ggplot(NULL, aes(c(2, 5))) +
  geom_area(stat = "function", fun = dnorm, args = list(mean = 3.8, sd = 2.39 / sqrt(n)),
            fill = "#00998a", xlim = c(3.6, 5)) +
  geom_area(stat = "function", fun = dnorm, args = list(mean = 3.8, sd = 2.39 / sqrt(n)),
            fill = "grey80", xlim = c(2, 3.6)) +
  labs(x = "x", y = "f(x)") +
  theme_minimal(base_size = 16)
pnorm(q = 3.6, mean = 3.80, sd = 2.39 / sqrt(n), lower.tail = FALSE)
#> [1] 0.7986531
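As a cross-check, the same tail probability can be obtained by standardizing the threshold first and using the standard normal distribution. This is a small sketch using the values quoted in the text (mean 3.8, standard deviation 2.39, 100 songs); the variable names are illustrative.

```r
# Values taken from the text; the variable names are illustrative
mu <- 3.8      # population mean (minutes)
sigma <- 2.39  # population standard deviation (minutes)
n <- 100       # number of songs in the playlist

threshold <- (6 * 60) / n   # required average song length: 3.6 minutes
se <- sigma / sqrt(n)       # standard deviation of the sample mean
z <- (threshold - mu) / se  # standardized threshold

# Upper-tail probability, computed two equivalent ways
p_standardized <- pnorm(z, lower.tail = FALSE)
p_direct <- pnorm(q = threshold, mean = mu, sd = se, lower.tail = FALSE)
c(z = z, p_standardized = p_standardized, p_direct = p_direct)
```

Both routes agree, giving a probability of about 0.80 that the playlist lasts the whole trip.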