Load packages:
library(ggplot2)
library(dplyr)
library(stringr)
library(magrittr)
Load (and fix) data:
pp = read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df() %>%
  mutate(price = as.numeric(str_replace(price, ",", "")))
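As a quick sanity check of why the `str_replace()` step is needed (using a made-up value, not one from the dataset): `as.numeric()` cannot parse comma-formatted strings on its own.

```r
# as.numeric() returns NA (with a warning) for comma-formatted numbers;
# stripping the comma first gives the intended numeric value.
as.numeric("1,500")                        # NA, with a coercion warning
as.numeric(str_replace("1,500", ",", ""))  # 1500
```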
pp %>% group_by(Shape) %>% summarise(n = n()) %>% arrange(n)
## # A tibble: 9 × 2
##        Shape     n
##        <chr> <int>
## 1    octagon     1
## 2    octogon     1
## 3      ronde     5
## 4  miniature    10
## 5      ovale    24
## 6       oval    28
## 7               36
## 8      round    69
## 9   squ_rect  3219
| original  | new       | change |
|-----------|-----------|--------|
|           | NA        | yes    |
| miniature | miniature | no     |
| octagon   | octagon   | no     |
| octogon   | octagon   | yes    |
| oval      | oval      | no     |
| ovale     | oval      | yes    |
| squ_rect  | squ_rect  | no     |
| ronde     | round     | yes    |
| round     | round     | no     |
library(forcats)
Another package by Hadley Wickham (of dplyr and ggplot2 fame) for handling categorical variables.
Everything in the package is possible with base R, but a lot of it is harder than it needs to be.
Package documentation is available: https://hadley.github.io/forcats/reference/index.html.
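To illustrate the point that all of this is possible in base R (just more awkwardly), here is a rough sketch of the Shape recoding shown below done without forcats, using a named lookup vector. The vector name `recode_map` is ours, not from the slides, and the result stays a character vector rather than a factor.

```r
# Base R sketch of the same recoding: map misspelled values to the
# corrected ones and turn blank entries into NA.
recode_map = c(octogon = "octagon", ovale = "oval",
               ronde = "round", squ_rect = "rect")
shape_fixed = pp$Shape
shape_fixed[shape_fixed == ""] = NA
idx = shape_fixed %in% names(recode_map)
shape_fixed[idx] = recode_map[shape_fixed[idx]]
table(shape_fixed, useNA = "ifany")
```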
pp %>%
  mutate(Shape = fct_recode(Shape,
                            octagon = "octogon",
                            oval = "ovale",
                            round = "ronde",
                            rect = "squ_rect",
                            NULL = "")) %>%
  group_by(Shape) %>%
  summarize(n = n()) %>%
  arrange(n)
## # A tibble: 6 × 2
##       Shape     n
##      <fctr> <int>
## 1   octagon     2
## 2 miniature    10
## 3        NA    36
## 4      oval    52
## 5     round    74
## 6      rect  3219
pp %>% mutate(Shape = fct_lump(Shape)) %>% group_by(Shape) %>% summarize(n=n()) %>% arrange(n)
## # A tibble: 2 × 2
##      Shape     n
##     <fctr> <int>
## 1    Other   174
## 2 squ_rect  3219
What happened to our missing values (NAs)? The blank entries were still an ordinary level of Shape, so fct_lump() folded them into Other along with the infrequent shapes. We can recode them to NA first:
pp %>%
  mutate(Shape = fct_recode(Shape, NULL = "") %>% fct_lump()) %>%
  group_by(Shape) %>%
  summarize(n = n()) %>%
  arrange(n)
## # A tibble: 3 × 2
##      Shape     n
##     <fctr> <int>
## 1       NA    36
## 2    Other   138
## 3 squ_rect  3219
Much better: NAs are preserved and infrequent shapes are lumped together.
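fct_lump() also takes arguments that control how much lumping happens. For example (a sketch run on the raw Shape values, not output shown in the slides), we can keep only the most frequent levels or lump levels below a frequency threshold:

```r
# Keep the 3 most frequent shapes and lump the rest into "Other"
pp %>% mutate(Shape = fct_lump(Shape, n = 3)) %>% count(Shape)

# Lump any shape that makes up less than 1% of paintings
pp %>% mutate(Shape = fct_lump(Shape, prop = 0.01)) %>% count(Shape)
```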
Now that we are happy with our changes to the Shape variable, let's make them permanent by replacing the existing faulty values.
library(magrittr)
pp %<>% mutate(Shape = fct_recode(Shape,
                                  octagon = "octogon",
                                  oval = "ovale",
                                  round = "ronde",
                                  rect = "squ_rect",
                                  NULL = ""))
For this we need to make sure that the magrittr package is loaded (not just dplyr).
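The compound assignment pipe %<>% comes from magrittr; it pipes an object into a chain and assigns the result back to the same name, i.e. `x %<>% f()` is shorthand for `x = x %>% f()`. A small toy illustration (values chosen just for the example):

```r
# %<>% pipes x through the chain and overwrites x with the result.
x = c(4, 1, 9)
x %<>% sort() %>% sqrt()
x
# [1] 1 2 3
# equivalent long form: x = x %>% sort() %>% sqrt()
```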
Let's tackle the mat variable:
pp %>% group_by(mat) %>% summarise(n=n())
## # A tibble: 21 × 2
##      mat     n
##    <chr> <int>
## 1          169
## 2      a     2
## 3     al     1
## 4     ar     1
## 5      b   886
## 6     br     7
## 7      c   312
## 8     ca     3
## 9     co     6
## 10     e     1
## # ... with 11 more rows
| mat | explanation         | new categories | mat | explanation         | new categories |
|-----|---------------------|----------------|-----|---------------------|----------------|
| a   | silver              | metal          | h   | oil technique       | other          |
| al  | alabaster           | stone          | m   | marble              | stone          |
| ar  | slate               | stone          | mi  | miniature technique | other          |
| b   | wood                | wood           | o   | other               | other          |
| bc  | wood and copper     | metal          | p   | paper               | paper          |
| br  | bronze frames       | metal          | pa  | pastel              | other          |
| bt  | canvas on wood      | canvas         | t   | canvas              | canvas         |
| c   | copper              | metal          | ta  | canvas?             | canvas         |
| ca  | cardboard           | paper          | v   | glass               | other          |
| co  | cloth               | canvas         | n/a | NA                  | NA             |
| e   | wax                 | other          |     | NA                  | NA             |
| g   | grisaille technique | other          |     |                     |                |
pp %>%
  mutate(mat = fct_collapse(mat,
                            metal  = c("a", "bc", "br", "c"),
                            stone  = c("al", "ar", "m"),
                            canvas = c("co", "bt", "t", "ta"),
                            paper  = c("p", "ca"),
                            wood   = c("b"),
                            other  = c("o", "e", "v", "h", "mi", "pa", "g"),
                            NULL   = c("n/a", ""))) %>%
  group_by(mat) %>%
  summarize(n = n()) %>%
  arrange(n)
## Warning: Unknown levels in `f`: bc, bt
## # A tibble: 7 × 2
##      mat     n
##   <fctr> <int>
## 1  stone     3
## 2  paper    38
## 3  other    56
## 4     NA   306
## 5  metal   321
## 6   wood   886
## 7 canvas  1783
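The warning is telling us that two codes from the codebook, bc and bt, never actually occur in mat, so there is nothing to collapse for them; the result is still correct. One way to check which codes are present in the data (a sketch, run before making the change permanent) is:

```r
# List the material codes that actually occur in the raw data;
# codes in fct_collapse() but not in this list trigger the warning.
pp %>% distinct(mat) %>% arrange(mat)
```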
pp %<>%
  mutate(mat = fct_collapse(mat,
                            metal  = c("a", "bc", "br", "c"),
                            stone  = c("al", "ar", "m"),
                            canvas = c("co", "bt", "t", "ta"),
                            paper  = c("p", "ca"),
                            wood   = c("b"),
                            other  = c("o", "e", "v", "h", "mi", "pa", "g"),
                            NULL   = c("n/a", "")))
## Warning: Unknown levels in `f`: bc, bt
Any estimate comes with some uncertainty around it.
Later in the course we'll discuss how to estimate the uncertainty around an estimate, such as the slope, and the conditions required for quantifying uncertainty around estimates using various methods.
Describe the relationship between price and width of painting.
ggplot(data = pp, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2)
## Warning: Removed 256 rows containing missing values (geom_point).
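The warning refers to rows where one of the plotted variables is missing. We can count those rows directly (a sketch; we are not asserting the exact count here):

```r
# Count rows that ggplot2 drops because width or price is missing.
pp %>% filter(is.na(Width_in) | is.na(price)) %>% nrow()
```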
Let's focus on paintings with Width_in < 100.
pp_width = pp %>% filter(Width_in < 100)
ggplot(data = pp_width, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2)
ggplot(data = pp_width, aes(x = price)) + geom_histogram()
ggplot(data = pp_width, aes(x = log(price))) + geom_histogram()
Which plot shows a more linear relationship?
ggplot(data = pp_width, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2)
ggplot(data = pp_width, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.2)
Which plot shows a more linear relationship?
ggplot(data = pp_width, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se=FALSE)
ggplot(data = pp_width, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se=FALSE)
price has a right-skewed distribution, and the relationship between price and width of painting is non-linear. How do we interpret the slope of this model?
ggplot(data = pp_width, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.2) + stat_smooth(method = "lm")
(m = lm(log(price) ~ Width_in, data = pp_width))
## 
## Call:
## lm(formula = log(price) ~ Width_in, data = pp_width)
## 
## Coefficients:
## (Intercept)     Width_in  
##     4.66852      0.01915
\[ \widehat{log(price)} = 4.67 + 0.02~Width\_in \]
For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 log livres.
which is not a very useful statement… (what is a log livre?)
\[log(a) − log(b) = log\left(\frac{a}{b}\right)\]
\[e^{log(x)} = x\]
Assume that a painting has a price \(y\) and is \(x\) inches wide, if another painting is one inch wider (\(x+1\)) what is its price (\(y'\)) in terms of \(y\)?
\[ \begin{aligned} \log(y) &= 4.67 + 0.02~x \\ \log(y') &= 4.67 + 0.02~(x+1) = 4.67 + 0.02~x + 0.02 \end{aligned} \]
\[log(y') - log(y) = 0.02 \]
\[log\left(\frac{y'}{y}\right) = 0.02 \]
\[e^{log\left(\frac{y'}{y}\right)} = e^{0.02} \]
\[\frac{y'}{y} = e^{0.02} = 1.02\]
\[y' \approx 1.02~y\]
For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.
m$coefficients
## (Intercept)    Width_in 
##   4.6685206   0.0191532
exp(m$coefficients)
## (Intercept)    Width_in 
##  106.540014    1.019338
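To get a predicted price back on the original (livres) scale, we can predict on the log scale and then exponentiate. The sketch below uses a hypothetical width of 30 inches, a value chosen just for illustration:

```r
# Predict log(price) for a hypothetical 30-inch-wide painting,
# then back-transform to livres with exp().
newpt = data.frame(Width_in = 30)
exp(predict(m, newdata = newpt))
# equivalently: exp(4.6685206 + 0.0191532 * 30)
```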
When using a log transformation on the response variable the interpretation of the slope changes: "For each unit increase in x, y is expected on average to change by a factor of \(e^{b_1}\)."
Another useful transformation is the square root, \(\sqrt{y}\); it is also used when the data are right skewed (but not as severely right skewed as when you would use a \(log\)).
In the case of left skewed data you can try using power transformations like \(y^2\) or \(y^3\).
Most transformations don't have natural interpretations for the slope parameter in terms of untransformed units.
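As a sketch of what a square-root transformation would look like in this setting (an illustration only; we do not interpret this model further here):

```r
# Plot and fit a square-root-transformed response; the slope of this
# model has no natural interpretation in terms of livres.
ggplot(data = pp_width, aes(x = Width_in, y = sqrt(price))) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE)

m_sqrt = lm(sqrt(price) ~ Width_in, data = pp_width)
```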
In some cases the value of the response variable might be 0, and the log of 0 is undefined:
log(0)
## [1] -Inf
One trick is to add a very small number to the value of the response variable for these cases so that the \(log\) function can still be applied:
log(0 + 0.001)
## [1] -6.907755
However, this correction is sensitive to the units of \(y\).
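A quick numeric illustration of that sensitivity (made-up constants): the result depends heavily on the size of the added constant relative to the scale of \(y\).

```r
# The log of "zero plus a small constant" changes a lot with the constant,
# so the correction depends on the units of the response variable.
log(0 + 0.001)  # about -6.9
log(0 + 0.01)   # about -4.6
log(0 + 1)      # 0
```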
See course website for details on the application exercise.