In this mini analysis, we will work with the Advertising data used in Chapters 2 and 3 of An Introduction to Statistical Learning.

Data and packages

We start by loading the packages we’ll use and reading in the data.

library(readr)
library(tidyverse)
library(skimr)
library(broom)
advertising <- read_csv("data/advertising.csv")

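We will analyze the advertising and sales data for 200 markets. The variables we’ll use are tv, radio, and newspaper (advertising spending in each medium, in $thousands) and sales (in $millions).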

Analysis

We’ll begin the analysis by getting a quick view of the data:

glimpse(advertising)
## Observations: 200
## Variables: 4
## $ tv        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6…
## $ radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 2…
## $ sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.…

Next, we can calculate summary statistics for each of the variables in the data set.

advertising %>% skim()
## Skim summary statistics
##  n obs: 200 
##  n variables: 4 
## 
## ── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────
##   variable missing complete   n   mean    sd  p0   p25    p50    p75  p100 hist
##  newspaper       0      200 200  30.55 21.78 0.3 12.75  25.75  45.1  114   ▇▇▅▅▂▁▁▁
##      radio       0      200 200  23.26 14.85 0    9.97  22.9   36.52  49.6 ▇▆▅▅▆▅▆▅
##      sales       0      200 200  14.02  5.22 1.6 10.38  12.9   17.4   27   ▁▃▇▇▆▃▃▂
##         tv       0      200 200 147.04 85.85 0.7 74.38 149.75 218.82 296.4 ▆▅▆▅▅▇▆▅
  1. What type of advertising has the smallest median spending?
  2. What type of advertising has the largest variation in spending?
  3. Describe the shape of the distribution of sales.
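
The median and standard deviation can also be computed directly, without skim(). The following is a minimal sketch in base R. To stay self-contained it uses only the first eight markets shown in the glimpse() output, stored in a data frame called ad_head (a name made up for this illustration), so its values will differ from the full 200-market summary.

```r
# First eight markets, copied from the glimpse() output above.
# ad_head is a hypothetical name used only for this illustration.
ad_head <- data.frame(
  tv        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2)
)

# Median (center) and standard deviation (spread) of every column
ad_medians <- sapply(ad_head, median)
ad_sds     <- sapply(ad_head, sd)

ad_medians
ad_sds
```

Running the same two sapply() calls on advertising instead of ad_head should reproduce the p50 and sd columns of the skim() output, up to rounding.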

We are most interested in understanding how advertising spending affects sales. One way to quantify the linear relationship between each pair of variables is to calculate the correlation matrix.

advertising %>% 
  cor()
##                   tv      radio  newspaper     sales
## tv        1.00000000 0.05480866 0.05664787 0.7822244
## radio     0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales     0.78222442 0.57622257 0.22829903 1.0000000
  1. What is the correlation between radio and sales? Interpret this value.
  2. What type of advertising has the strongest linear relationship with sales?
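
To see what cor() reports for a single pair of variables, the sketch below computes the Pearson correlation from its definition (the covariance divided by the product of the two standard deviations) and compares it with cor(). It is illustrative only: it uses the first eight radio and sales values from the glimpse() output, and the names radio_head and sales_head are introduced here, so the result will not match the value in the matrix above.

```r
# First eight radio and sales values, copied from the glimpse() output.
# radio_head and sales_head are hypothetical names for this illustration.
radio_head <- c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6)
sales_head <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2)

# Pearson correlation from its definition
r_manual <- cov(radio_head, sales_head) / (sd(radio_head) * sd(sales_head))

# Built-in equivalent
r_builtin <- cor(radio_head, sales_head)

# The two agree up to floating-point error
all.equal(r_manual, r_builtin)
```

Because the covariance is rescaled by both standard deviations, the result is unitless and always lies between -1 and 1, which is what makes correlations comparable across the three advertising types.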

Below are visualizations of sales versus each explanatory variable.

advertising %>%
  ggplot(mapping = aes(x=tv,y=sales)) + 
  geom_point(alpha=0.7) + 
  geom_smooth(method="lm",se=FALSE,color="blue") + 
  labs(title = "Sales vs. TV Advertising", 
       x= "TV Advertising (in $thousands)", 
       y="_____") #fill in the Y axis label

advertising %>%
  ggplot(mapping = aes(x=radio,y=sales)) + 
  geom_point(alpha=0.7) + 
  geom_smooth(method="lm",se=FALSE,color="red") + 
  labs(title = "Sales vs. Radio Advertising", 
       x= "Radio Advertising (in $thousands)", 
       y="Sales (in $millions)")

advertising %>%
  ggplot(mapping = aes(x=newspaper,y=sales)) + 
  geom_point(alpha=0.7) + 
  geom_smooth(method="lm",se=FALSE,color="purple") + 
  labs(title = "Sales vs. Newspaper Advertising", 
       x= "Newspaper Advertising (in $thousands)", 
       y="Sales (in $millions)")

Since tv appears to have the strongest linear relationship with sales, let’s fit a simple linear regression model using these two variables.

ad_model <- lm(sales ~ tv, data=advertising)
ad_model
## 
## Call:
## lm(formula = sales ~ tv, data = advertising)
## 
## Coefficients:
## (Intercept)           tv  
##     7.03259      0.04754
  1. Write the model equation.
  2. Interpret the intercept in the context of the problem.
  3. Interpret the slope in the context of the problem.
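
Once the model equation is written down, it can be used to compute a fitted value. The following is a minimal sketch that uses the coefficients exactly as printed in the output above and a hypothetical TV budget of 100 (in $thousands). The helper predict_sales() is a name introduced here for illustration and is not part of the original analysis.

```r
# Coefficients copied from the printed lm() output above
b0 <- 7.03259   # intercept
b1 <- 0.04754   # slope for tv

# Hypothetical helper: fitted sales for a given TV budget (in $thousands)
predict_sales <- function(tv) b0 + b1 * tv

# Fitted value for an assumed TV budget of 100
predict_sales(100)
```

With the fitted object available, predict(ad_model, newdata = data.frame(tv = 100)) should give essentially the same value, differing only by the rounding of the printed coefficients.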