In this mini analysis, we will work with the Advertising data used in Chapters 2 and 3 of An Introduction to Statistical Learning.
We start by loading the packages we’ll use.
library(readr)
library(tidyverse)
library(skimr)
library(broom)
advertising <- read_csv("data/advertising.csv")
We will analyze the advertising and sales data for 200 markets. The variables we’ll use are
tv: total spending on TV advertising (in $thousands)
radio: total spending on radio advertising (in $thousands)
newspaper: total spending on newspaper advertising (in $thousands)
sales: total sales (in $millions)
We’ll begin the analysis by getting a quick view of the data:
glimpse(advertising)
## Observations: 200
## Variables: 4
## $ tv <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6…
## $ radio <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 2…
## $ sales <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.…
Next, we can calculate summary statistics for each of the variables in the data set.
advertising %>% skim()
## Skim summary statistics
## n obs: 200
## n variables: 4
##
## ── Variable type:numeric ───────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## newspaper 0 200 200 30.55 21.78 0.3 12.75 25.75 45.1 114
## radio 0 200 200 23.26 14.85 0 9.97 22.9 36.52 49.6
## sales 0 200 200 14.02 5.22 1.6 10.38 12.9 17.4 27
## tv 0 200 200 147.04 85.85 0.7 74.38 149.75 218.82 296.4
## hist
## ▇▇▅▅▂▁▁▁
## ▇▆▅▅▆▅▆▅
## ▁▃▇▇▆▃▃▂
## ▆▅▆▅▅▇▆▅
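To make the columns of the summary table concrete, here is a minimal base R sketch that computes the same kinds of statistics for one variable. It uses only the first eight `sales` values shown in the `glimpse()` output, so the numbers will not match the 200-market summary. `summarise_numeric` is a hypothetical helper written for this illustration, not part of skimr, and it assumes the percentiles use R's default quantile definition.

```r
# First eight `sales` values, copied from the glimpse() output above.
sales_head <- c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2)

# Hypothetical helper: statistics analogous to the columns skim() reports
# for a numeric variable (missing, n, mean, sd, and five percentiles).
summarise_numeric <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1),
                na.rm = TRUE, names = FALSE)
  c(missing = sum(is.na(x)),
    n       = length(x),
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    p0 = q[1], p25 = q[2], p50 = q[3], p75 = q[4], p100 = q[5])
}

round(summarise_numeric(sales_head), 2)
```

Here `p0` and `p100` are the minimum and maximum, and `p50` is the median.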
We are most interested in understanding how advertising spending affects sales. One way to quantify the relationship between the variables is by calculating the correlation matrix.
advertising %>%
  cor()
## tv radio newspaper sales
## tv 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
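Individual entries of a correlation matrix can be looked up by name, and each entry is the Pearson correlation of one pair of columns. The sketch below illustrates this on the first eight rows shown in the `glimpse()` output; `ads_head` is a name introduced here for that small subset, so its correlations differ from the full 200-market matrix above.

```r
# First eight rows of the data, copied from the glimpse() output above.
ads_head <- data.frame(
  tv        = c(230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2),
  radio     = c(37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6),
  newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6),
  sales     = c(22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2)
)

# Full matrix of pairwise Pearson correlations.
cor_mat <- cor(ads_head)

# Look up one pair by row and column name.
r_matrix <- cor_mat["tv", "sales"]

# The same value computed directly from the two columns.
r_direct <- cor(ads_head$tv, ads_head$sales)

# Pearson's r from its definition: covariance scaled by both
# standard deviations.
r_manual <- cov(ads_head$tv, ads_head$sales) /
  (sd(ads_head$tv) * sd(ads_head$sales))

c(matrix = r_matrix, direct = r_direct, manual = r_manual)
```

All three values agree. The matrix is symmetric with ones on the diagonal, which is why each correlation appears twice in the output above.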
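What is the correlation between radio and sales? Interpret this value.
Which explanatory variable has the strongest linear relationship with sales?
Below are visualizations of sales versus each explanatory variable.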
advertising %>%
  ggplot(mapping = aes(x = tv, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Sales vs. TV Advertising",
       x = "TV Advertising (in $thousands)",
       y = "_____") # fill in the y-axis label
advertising %>%
  ggplot(mapping = aes(x = radio, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Sales vs. Radio Advertising",
       x = "Radio Advertising (in $thousands)",
       y = "Sales (in $millions)")
advertising %>%
  ggplot(mapping = aes(x = newspaper, y = sales)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "Sales vs. Newspaper Advertising",
       x = "Newspaper Advertising (in $thousands)",
       y = "Sales (in $millions)")
Since tv appears to have the strongest linear relationship with sales, let’s fit a simple linear regression model using these two variables.
ad_model <- lm(sales ~ tv, data=advertising)
ad_model
##
## Call:
## lm(formula = sales ~ tv, data = advertising)
##
## Coefficients:
## (Intercept) tv
## 7.03259 0.04754
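The printed coefficients define the fitted line: predicted sales = 7.03259 + 0.04754 × tv. The sketch below plugs a few TV budgets into that line and then cross-checks the slope and intercept against the correlation and summary statistics reported earlier. `predict_sales` is a hypothetical helper written for this illustration. The cross-check uses the rounded means and standard deviations from the summary table, so it should agree only approximately.

```r
# Coefficients copied from the printed lm() output above.
intercept <- 7.03259
slope     <- 0.04754

# Hypothetical helper: predicted sales for a TV budget given in $thousands.
predict_sales <- function(tv) intercept + slope * tv

# Predictions for TV budgets of 0, 100, and 200 (in $thousands).
predict_sales(c(0, 100, 200))

# Cross-check. In simple linear regression the least-squares slope is
# r * sd(y) / sd(x), and the line passes through (mean(x), mean(y)).
r_tv_sales <- 0.7822244   # from the correlation matrix
sd_sales   <- 5.22        # from the summary statistics (rounded)
sd_tv      <- 85.85
mean_sales <- 14.02
mean_tv    <- 147.04

slope_check     <- r_tv_sales * sd_sales / sd_tv
intercept_check <- mean_sales - slope_check * mean_tv

round(c(slope = slope_check, intercept = intercept_check), 4)
```

Taking the units as stated above, the slope says that each additional $1,000 of TV advertising is associated with an increase of about 0.048 in sales, and the intercept is the predicted sales for a market with no TV advertising. With the fitted object itself, the same predictions come from `predict(ad_model, newdata = data.frame(tv = c(0, 100, 200)))`.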