Application Exercise: model prices of Paris Paintings
Due Thursday: Finish HW2 & App Ex
library(ggplot2)
library(dplyr)
library(stringr)
pp = read.csv("~/paris_paintings.csv", stringsAsFactors = FALSE) %>%
tbl_df()
class(pp$price)
## [1] "character"
head(pp$price, 150)
## [1] "360.0" "6.0" "12.0" "6.0" "6.0" "9.0" "12.0" "12.0" "24.0" "6.0"
## [11] "6.0" "1.3" "1.3" "1.3" "12.0" "100.0" "100.0" "12.0" "12.0" "12.0"
## [21] "6.0" "12.0" "12.0" "12.0" "315.0" "315.0" "1.2" "1.2" "1.2" "1.2"
## [31] "1.2" "59.0" "36.0" "41.0" "6.0" "3.0" "3.0" "3.0" "3.0" "16.3"
## [41] "16.3" "16.3" "16.3" "298.0" "240.0" "427.0" "120.0" "63.0" "160.0" "500.0"
## [51] "141.0" "340.0" "151.0" "150.0" "300.0" "27.0" "9.0" "48.0" "68.0" "68.0"
## [61] "24.0" "24.0" "8.0" "38.0" "24.0" "24.0" "9.0" "9.0" "24.0" "36.0"
## [71] "9.0" "37.0" "36.0" "24.0" "24.0" "145.0" "51.0" "24.0" "24.0" "36.0"
## [81] "36.0" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7"
## [91] "11.7" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7" "11.7"
## [101] "11.7" "11.7" "11.7" "11.7" "11.7" "120.0" "72.0" "140.0" "75.0" "75.0"
## [111] "30.0" "30.0" "60.0" "12.0" "24.0" "3.0" "312.0" "1,000.0" "312.0" "360.0"
## [121] "40.1" "150.0" "502.0" "166.0" "595.0" "370.0" "240.5" "25.0" "25.0" "152.0"
## [131] "230.0" "350.0" "83.0" "72.0" "60.0" "45.0" "45.0" "440.0" "200.5" "200.5"
## [141] "532.0" "1,842.5" "1,842.5" "3,035.0" "3,035.0" "1,000.4" "800.1" "371.0" "371.0" "651.1"
pp$price %>% as.numeric() %>% head(150)
## Warning in function_list[[i]](value): NAs introduced by coercion
## [1] 360.0 6.0 12.0 6.0 6.0 9.0 12.0 12.0 24.0 6.0 6.0 1.3 1.3 1.3 12.0 100.0 100.0
## [18] 12.0 12.0 12.0 6.0 12.0 12.0 12.0 315.0 315.0 1.2 1.2 1.2 1.2 1.2 59.0 36.0 41.0
## [35] 6.0 3.0 3.0 3.0 3.0 16.3 16.3 16.3 16.3 298.0 240.0 427.0 120.0 63.0 160.0 500.0 141.0
## [52] 340.0 151.0 150.0 300.0 27.0 9.0 48.0 68.0 68.0 24.0 24.0 8.0 38.0 24.0 24.0 9.0 9.0
## [69] 24.0 36.0 9.0 37.0 36.0 24.0 24.0 145.0 51.0 24.0 24.0 36.0 36.0 11.7 11.7 11.7 11.7
## [86] 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7
## [103] 11.7 11.7 11.7 120.0 72.0 140.0 75.0 75.0 30.0 30.0 60.0 12.0 24.0 3.0 312.0 NA 312.0
## [120] 360.0 40.1 150.0 502.0 166.0 595.0 370.0 240.5 25.0 25.0 152.0 230.0 350.0 83.0 72.0 60.0 45.0
## [137] 45.0 440.0 200.5 200.5 532.0 NA NA NA NA NA 800.1 371.0 371.0 651.1
The commas are causing us problems, lets replace ,
with `` (blank) before coercing:
pp$price %>% str_replace(",", "") %>% as.numeric() %>% head(150)
## [1] 360.0 6.0 12.0 6.0 6.0 9.0 12.0 12.0 24.0 6.0 6.0 1.3 1.3 1.3 12.0
## [16] 100.0 100.0 12.0 12.0 12.0 6.0 12.0 12.0 12.0 315.0 315.0 1.2 1.2 1.2 1.2
## [31] 1.2 59.0 36.0 41.0 6.0 3.0 3.0 3.0 3.0 16.3 16.3 16.3 16.3 298.0 240.0
## [46] 427.0 120.0 63.0 160.0 500.0 141.0 340.0 151.0 150.0 300.0 27.0 9.0 48.0 68.0 68.0
## [61] 24.0 24.0 8.0 38.0 24.0 24.0 9.0 9.0 24.0 36.0 9.0 37.0 36.0 24.0 24.0
## [76] 145.0 51.0 24.0 24.0 36.0 36.0 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7
## [91] 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7 11.7
## [106] 120.0 72.0 140.0 75.0 75.0 30.0 30.0 60.0 12.0 24.0 3.0 312.0 1000.0 312.0 360.0
## [121] 40.1 150.0 502.0 166.0 595.0 370.0 240.5 25.0 25.0 152.0 230.0 350.0 83.0 72.0 60.0
## [136] 45.0 45.0 440.0 200.5 200.5 532.0 1842.5 1842.5 3035.0 3035.0 1000.4 800.1 371.0 371.0 651.1
pp = pp %>% mutate(price = str_replace(price, ",", "") %>% as.numeric())
str(pp$price)
## num [1:3393] 360 6 12 6 6 9 12 12 24 6 ...
We can represent relationships between variables using function
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
geom_point() +
stat_smooth(method = "lm") # lm for linear model
Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)
Explanatory variables: Other variables that you want to use to explain the variation in the response variable, usually on the x-axis (and or color, shape, etc.) (independent variables)
What does a negative residual mean? Which paintings on the plot have have negative residuals?
How do we decide which line to use for our linear model?
We can get a rough sense of model fit via visual inspection. How well do you think a painting’s width explains its height?
Here is the code for the plot from the previous slide
# points not colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
geom_point(alpha = 0.2) +
stat_smooth(method = "lm", se=FALSE)
Models can sometimes reveal patterns that are not evident in a graph of the data, particularly in high dimensions where simple visual inspection of the data is not possible.
There is a real risk that a model is imposing structure that is not really there on the data. A skeptical approach is always warranted.
is just as important as the model, if not more so!
Statistics is the explanation of variation in the context of what remains unexplained.
The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.
Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We’ll talk more about this later.)
Explanation: Characterize the relationship between \(y\) and \(x\) via slopes for numerical explanatory variables or differences for categorical explanatory variables
Prediction: Plug in \(x\), get the predicted \(y\)
lm(Height_in ~ Width_in, data = pp)
##
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
##
## Coefficients:
## (Intercept) Width_in
## 3.6214 0.7808
\[ \widehat{Height} = 3.62 + 0.78~Width \]
Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.
lm(Height_in ~ factor(landsALL), data = pp)
##
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
##
## Coefficients:
## (Intercept) factor(landsALL)1
## 22.680 -5.645
\[ \widehat{Height} = 22.68 - 5.65~landsALL \]
landsALL = 0
) to the other level (landsALL = 1
).lm(Height_in ~ school_pntg, data = pp)
##
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
##
## Coefficients:
## (Intercept) school_pntgD/FL school_pntgF school_pntgG school_pntgI school_pntgS
## 14.000 2.329 10.197 1.650 10.287 30.429
## school_pntgX
## 2.869
When the categorical explanatory variable has many levels, they’re encoded using dummy variables.
Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.
pp %>%
select(Height_in,school_pntg) %>%
na.omit() %>%
group_by(school_pntg) %>%
summarize(mean = mean(Height_in))
## # A tibble: 7 × 2
## school_pntg mean
## <chr> <dbl>
## 1 A 14.00000
## 2 D/FL 16.32873
## 3 F 24.19671
## 4 G 15.65000
## 5 I 24.28671
## 6 S 44.42857
## 7 X 16.86875
On average, how tall are paintings that are 60 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]
3.62 + 0.78 * 60
## [1] 50.42
“On average, we expect paintings that are 60 inches wide to be 50.42 inches high.”
Warning: We “expect” this to happen, but there will be some variability. (We’ll learn about measuring the variability around the prediction later.)
On average, how tall are paintings that are 400 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]
“When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.” - Stephen Colbert April 6th, 2010
Source: OpenIntro Statistics. “Extrapolation is treacherous.” OpenIntro Statistics.
The strength of the fit of a linear model is most commonly evaluated using \(R^2\)
It tells us what percent of variability in the response variable is explained by the model.
The remainder of the variability is explained by variables not included in the model (or just measurement error).
\(R^2\) is sometimes called the coefficient of determination.
m_ht_wt <- lm(Height_in ~ Width_in, data = pp) # fit and save
summary(m_ht_wt)$r.squared # extract R-squared
## [1] 0.6829468
Roughly 68% of the variability in heights of paintings can be explained by their widths.
m_ht_land <- lm(Height_in ~ landsALL, data = pp)
summary(m_ht_land)$r.squared
## [1] 0.03456724
See course website