September 22, 2015
Load packages:
library(ggplot2) library(dplyr) library(stringr)
Load data:
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>% tbl_df()
We did this already!
Overwrite existing variable price
:
pp <- pp %>% mutate(price = as.numeric(str_replace(price, ",", "")))
pp %>% group_by(Shape) %>% summarise(count = n()) %>% arrange()
## Source: local data frame [9 x 2] ## ## Shape count ## 1 36 ## 2 miniature 10 ## 3 octagon 1 ## 4 octogon 1 ## 5 oval 28 ## 6 ovale 24 ## 7 ronde 5 ## 8 round 69 ## 9 squ_rect 3219
original | new | change |
---|---|---|
NA |
yes | |
miniature |
miniature |
no |
octagon |
octagon |
no |
octogon |
octogon |
yes |
oval |
oval |
no |
ovale |
oval |
yes |
squ_rect |
squ_rect |
no |
ronde |
round |
yes |
round |
round |
no |
ifelse()
functionUsage
ifelse(test, yes, no)
Arguments
test
- an object which can be coerced to logical mode.
yes
- return values for true elements of test.
no
- return values for false elements of test.
You can nest many ifelse
statetements.
See ?ifelse
for more details (just like any other function).
Create new variable shape_recode
:
pp <- pp %>% mutate(shape_recode = ifelse(Shape == "", NA, ifelse(Shape == "ovale", "oval", ifelse(Shape == "ronde", "round", ifelse(Shape == "octogon", "octagon", Shape)))))
pp %>% group_by(shape_recode, Shape) %>% summarise(count = n()) %>% arrange(shape_recode)
## Source: local data frame [9 x 3] ## Groups: shape_recode ## ## shape_recode Shape count ## 1 miniature miniature 10 ## 2 octagon octagon 1 ## 3 octagon octogon 1 ## 4 oval oval 28 ## 5 oval ovale 24 ## 6 round ronde 5 ## 7 round round 69 ## 8 squ_rect squ_rect 3219 ## 9 NA 36
nfigures
== 0 \(\rightarrow\) fig_mention
= no figures
nfigures
>= 1 \(\rightarrow\) fig_mention
= some figures
# recode pp <- pp %>% mutate(fig_mention = ifelse(nfigures == 0, "no figures", "some figures")) # check pp %>% group_by(fig_mention, nfigures) %>% summarise(n())
## Source: local data frame [46 x 3] ## Groups: fig_mention ## ## fig_mention nfigures n() ## 1 no figures 0 3181 ## 2 some figures 2 6 ## 3 some figures 3 10 ## 4 some figures 4 18 ## 5 some figures 5 16 ## 6 some figures 6 4 ## 7 some figures 7 11 ## 8 some figures 8 11 ## 9 some figures 9 17 ## 10 some figures 10 8 ## .. ... ... ...
Let's tackle the mat
variable:
pp %>% group_by(mat) %>% summarise(n())
## Source: local data frame [21 x 2] ## ## mat n() ## 1 169 ## 2 a 2 ## 3 al 1 ## 4 ar 1 ## 5 b 886 ## 6 br 7 ## 7 c 312 ## 8 ca 3 ## 9 co 6 ## 10 e 1 ## .. ... ...
mat | explanation | new categories | mat | explanation | new categories |
---|---|---|---|---|---|
a |
silver | metal |
h |
oil technique | uncertain |
al |
alabaster | stone |
m |
marble | stone |
ar |
slate | stone |
mi |
miniature technique | uncertain |
b |
wood | wood |
o |
other | other |
bc |
wood and copper | metal |
p |
paper | paper |
br |
bronze frames | uncertain |
pa |
pastel | uncertain |
bt |
canvas on wood | canvas |
t |
canvas | canvas |
c |
copper | metal |
ta |
canvas? | uncertain |
ca |
cardboard | paper |
v |
glass | other |
co |
cloth | canvas |
n/a |
NA | NA |
e |
wax | other |
NA | NA | |
g |
grissaille technique | uncertain |
%in%
operatorCreate new variable mat_recode
:
pp <- pp %>% mutate(mat_recode = ifelse(mat %in% c("a", "bc", "c"), "metal", ifelse(mat %in% c("al", "ar", "m"), "stone", ifelse(mat %in% c("co", "bt", "t"), "canvas", ifelse(mat %in% c("p", "ca"), "paper", ifelse(mat %in% c("b"), "wood", ifelse(mat %in% c("o", "e", "v"), "other", ifelse(mat %in% c("n/a", ""), NA, "uncertain"))))))))
pp %>% group_by(mat_recode) %>% summarise(count = n()) %>% arrange(desc(count))
## Source: local data frame [8 x 2] ## ## mat_recode count ## 1 canvas 1695 ## 2 wood 886 ## 3 metal 314 ## 4 NA 306 ## 5 uncertain 135 ## 6 paper 38 ## 7 other 16 ## 8 stone 3
Any estimate comes with some uncertainty around it.
Later in the course we'll discuss how to estimate the uncertainty around an estimate, such as the slope, and the conditions required for quantifying uncertainty around estimates using various methods.
ggplot(data = pp, aes(x = Width_in, y = price)) + geom_point(alpha = 0.5)
## Warning: Removed 256 rows containing missing values (geom_point).
Width_in < 100
pp_wt_lt_100 <- pp %>% filter(Width_in < 100)
ggplot(data = pp_wt_lt_100, aes(x = price)) + geom_histogram()
ggplot(data = pp_wt_lt_100, aes(x = log(price))) + geom_histogram()
ggplot(data = pp_wt_lt_100, aes(x = Width_in, y = price)) + geom_point(alpha = 0.5)
ggplot(data = pp_wt_lt_100, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.5)
price
has a right-skewed distribution, and the relationship between price and width of painting is non-linear.ggplot(data = pp_wt_lt_100, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.5) + stat_smooth(method = "lm")
m_lprice_wt <- lm(log(price) ~ Width_in, data = pp_wt_lt_100) m_lprice_wt
## ## Call: ## lm(formula = log(price) ~ Width_in, data = pp_wt_lt_100) ## ## Coefficients: ## (Intercept) Width_in ## 4.66852 0.01915
\[ \widehat{log(price)} = 4.67 + 0.02 Width\_in \]
For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 livres.
which is not a very useful statement…
Subtraction and logs: \(log(a) − log(b) = log(a / b)\)
Natural logarithm: \(e^{log(x)} = x\)
We can these identities to "undo" the log transformation
The slope coefficient for the log transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.
\[log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 \]
\[log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 \]
\[e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} \]
\[\frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02\]
For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.
m_lprice_wt$coefficients
## (Intercept) Width_in ## 4.6685206 0.0191532
exp(m_lprice_wt$coefficients)
## (Intercept) Width_in ## 106.540014 1.019338
Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable.
The most common transformation when the response variable is right skewed is the log transform: \(log(y)\), especially useful when the response variable is (extremely) right skewed.
This transformation is also useful for variance stabilization.
When using a log transformation on the response variable the interpretation of the slope changes: "For each unit increase in x, y is expected on average to be higher/lower by a factor of \(e^{b_1}\)."
Another useful transformation is the square root: \(\sqrt{y}\), especially useful when the response variable is counts.
These transformations may also be useful when the relationship is non-linear, but in those cases a polynomial regression may also be needed (this is beyond the scope of this course, but you’re welcomed to try it for your final project, and I’d be happy to provide further guidance).
In some cases the value of the response variable might be 0, and
log(0)
## [1] -Inf
The trick is to add a very small number to the value of the response variable for these cases so that the \(log\) function can still be applied:
log(0 + 0.00001)
## [1] -11.51293
See course website for details on the application exercise.