Review App Ex from last time
Application Exercise: model prices of Paris Paintings
Due Tuesday: Finish App Ex + Reading (you’ll receive an email with a link after class)
library(ggplot2)
library(dplyr)
library(stringr)
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()  # convert to a tibble for nicer printing
class(pp$price)
## [1] "character"
table(pp$price)[1:30]
##
## 1,000.0 1,000.4 1,001.0 1,002.0 1,004.0 1,005.0 1,006.0 1,008.0 1,011.0 1,035.5 1,050.0 1,051.0
## 20 1 4 4 1 1 2 1 1 2 4 1
## 1,055.0 1,060.0 1,077.0 1,079.0 1,080.0 1,086.0 1,099.0 1,100.0 1,100.5 1,105.0 1,110.0 1,140.0
## 2 1 1 1 1 1 1 6 2 1 2 1
## 1,150.0 1,155.0 1,161.0 1,180.0 1,200.0 1,201.0
## 4 2 1 3 14 5
Replace "," with "" (blank), and save the variable as numeric:
pp <- pp %>%
  mutate(price = as.numeric(str_replace(price, ",", "")))
Much better…
class(pp$price)
## [1] "numeric"
ggplot(data = pp, aes(x = price)) +
geom_histogram(binwidth = 1000)
We can represent relationships between variables using functions:
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
geom_point() +
stat_smooth(method = "lm") # lm for linear model
Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)
Explanatory variables: Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)
Here is the code for the two plots on the previous slide:
# points colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in, color = factor(landsALL))) +
geom_point(alpha = 0.4) +
stat_smooth(method = "lm")
# points not colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
geom_point(alpha = 0.4) +
stat_smooth(method = "lm")
Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data.
There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.
Variation around the model is just as important as the model, if not more!
Statistics is the explanation of variation in the context of what remains unexplained.
The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.
Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We’ll talk more about this later.)
Explanation: Characterize the relationship between \(y\) and \(x\) via slopes for numerical explanatory variables or differences for categorical explanatory variables
Prediction: Plug in \(x\), get the predicted \(y\)
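Both uses start from the same fitted model object. A minimal sketch (assuming the pp data frame loaded above):

```r
# Fit the linear model once and reuse it
m <- lm(Height_in ~ Width_in, data = pp)

# Explanation: inspect the estimated intercept and slope
coef(m)

# Prediction: plug in a new width, get the predicted height
predict(m, newdata = data.frame(Width_in = 60))
```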
lm(Height_in ~ Width_in, data = pp)
##
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
##
## Coefficients:
## (Intercept) Width_in
## 3.6214 0.7808
\[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]
Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.
lm(Height_in ~ factor(landsALL), data = pp)
##
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
##
## Coefficients:
## (Intercept) factor(landsALL)1
## 22.680 -5.645
\[ \widehat{Height_{in}} = 22.68 - 5.65~landsALL \]
Slope: Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings without landscape features. The slope compares the baseline level (landsALL = 0) to the other level (landsALL = 1).
lm(Height_in ~ school_pntg, data = pp)
##
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
##
## Coefficients:
## (Intercept) school_pntgD/FL school_pntgF school_pntgG school_pntgI
## 14.000 2.329 10.197 1.650 10.287
## school_pntgS school_pntgX
## 30.429 2.869
When the categorical explanatory variable has many levels, it's encoded as a set of dummy variables.
Each coefficient describes the expected difference in heights between that particular school and the baseline level (school A, the level absorbed into the intercept).
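To see the dummy variables R builds behind the scenes, you can inspect the model matrix (a sketch, assuming pp as loaded above):

```r
# One 0/1 column per non-baseline school; the baseline level
# has no column of its own -- it lives in the intercept
head(model.matrix(~ school_pntg, data = pp))
```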
Remember this when interpreting model coefficients
3.62 + 0.78 * 60
## [1] 50.42
“On average, we expect paintings that are 60 inches wide to be 50.42 inches high.”
Warning: We “expect” this to happen, but there will be some variability. (We’ll learn about measuring the variability around the prediction later.)
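Rather than typing the rounded coefficients by hand, predict() does the plug-in using the model's full-precision coefficients (a sketch, assuming pp as loaded above). Expect an answer near 50.47 rather than exactly 50.42, since nothing gets rounded along the way:

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = pp)

# Predicted height for a 60-inch-wide painting, using unrounded coefficients
predict(m_ht_wt, newdata = data.frame(Width_in = 60))
```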
“When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.” - Stephen Colbert April 6th, 2010
Source: OpenIntro Statistics, "Extrapolation is treacherous."
The strength of the fit of a linear model is most commonly evaluated using \(R^2\)
It tells us what percent of variability in the response variable is explained by the model.
The remainder of the variability is explained by variables not included in the model.
\(R^2\) is sometimes called the coefficient of determination.
m_ht_wt <- lm(Height_in ~ Width_in, data = pp) # fit and save
summary(m_ht_wt)$r.squared # extract R-squared
## [1] 0.6829468
Roughly 68% of the variability in heights of paintings can be explained by their widths.
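The same number can be computed from its definition, 1 minus the ratio of unexplained variability to total variability (a sketch, assuming pp as loaded above):

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = pp)

e <- residuals(m_ht_wt)        # what the model leaves unexplained
y <- m_ht_wt$model$Height_in   # response values actually used in the fit (NA rows dropped)

1 - sum(e^2) / sum((y - mean(y))^2)  # matches summary(m_ht_wt)$r.squared
```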
m_ht_land <- lm(Height_in ~ landsALL, data = pp)
summary(m_ht_land)$r.squared
## [1] 0.03456724
Some of the material on these slides is based on
Daniel Kaplan. Statistical Modeling: A Fresh Approach. (2009).
See course website