September 22, 2015

Today's agenda

Today's agenda

  • Review App Ex from last time
    • Recap modeling terminology and interpretations
  • Recoding variables
    • So you can make them more meaningful and use them in your analysis
  • Transformations
    • So you actually fit linear models to linear relationships
  • Due Thursday:
    • App Ex 4
    • Reading (you'll receive an email with the link after class)

Prepping the data

Load packages + Paris Paintings data

Load packages:

library(ggplot2)
library(dplyr)
library(stringr)

Load data:

pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()

Recoding variables

Fixing price

  • We did this already!

  • Overwrite existing variable price:

pp <- pp %>%
  mutate(price = as.numeric(str_replace(price, ",", "")))

Shapes of paintings

pp %>%
  group_by(Shape) %>%
  summarise(count = n()) %>%
  arrange()
## Source: local data frame [9 x 2]
## 
##       Shape count
## 1              36
## 2 miniature    10
## 3   octagon     1
## 4   octogon     1
## 5      oval    28
## 6     ovale    24
## 7     ronde     5
## 8     round    69
## 9  squ_rect  3219

Recode scheme

original new change
NA yes
miniature miniature no
octagon octagon no
octogon octogon yes
oval oval no
ovale oval yes
squ_rect squ_rect no
ronde round yes
round round no

ifelse() function

Usage

ifelse(test, yes, no)

Arguments

test - an object which can be coerced to logical mode.
yes - return values for true elements of test.
no - return values for false elements of test.

  • You can nest many ifelse statetements.

  • See ?ifelse for more details (just like any other function).

Recoding shape of painting

Create new variable shape_recode:

pp <- pp %>%
  mutate(shape_recode = ifelse(Shape == "", NA,
                               ifelse(Shape == "ovale", "oval",
                                      ifelse(Shape == "ronde", "round",
                                             ifelse(Shape == "octogon", "octagon", Shape)))))

Check before you move on!

pp %>%
  group_by(shape_recode, Shape) %>%
  summarise(count = n()) %>%
  arrange(shape_recode)
## Source: local data frame [9 x 3]
## Groups: shape_recode
## 
##   shape_recode     Shape count
## 1    miniature miniature    10
## 2      octagon   octagon     1
## 3      octagon   octogon     1
## 4         oval      oval    28
## 5         oval     ovale    24
## 6        round     ronde     5
## 7        round     round    69
## 8     squ_rect  squ_rect  3219
## 9           NA              36

Converting numerical variables to categorical

  • If nfigures == 0 \(\rightarrow\) fig_mention = no figures
  • If nfigures >= 1 \(\rightarrow\) fig_mention = some figures
# recode
pp <- pp %>%
  mutate(fig_mention = ifelse(nfigures == 0, "no figures", "some figures"))
  
# check
pp %>%
  group_by(fig_mention, nfigures) %>%
  summarise(n())
## Source: local data frame [46 x 3]
## Groups: fig_mention
## 
##     fig_mention nfigures  n()
## 1    no figures        0 3181
## 2  some figures        2    6
## 3  some figures        3   10
## 4  some figures        4   18
## 5  some figures        5   16
## 6  some figures        6    4
## 7  some figures        7   11
## 8  some figures        8   11
## 9  some figures        9   17
## 10 some figures       10    8
## ..          ...      ...  ...

Recoding, kicked up a notch…

Let's tackle the mat variable:

pp %>% 
  group_by(mat) %>% 
  summarise(n())
## Source: local data frame [21 x 2]
## 
##    mat n()
## 1      169
## 2    a   2
## 3   al   1
## 4   ar   1
## 5    b 886
## 6   br   7
## 7    c 312
## 8   ca   3
## 9   co   6
## 10   e   1
## .. ... ...

mat explanation new categories mat explanation new categories
a silver metal h oil technique uncertain
al alabaster stone m marble stone
ar slate stone mi miniature technique uncertain
b wood wood o other other
bc wood and copper metal p paper paper
br bronze frames uncertain pa pastel uncertain
bt canvas on wood canvas t canvas canvas
c copper metal ta canvas? uncertain
ca cardboard paper v glass other
co cloth canvas n/a NA NA
e wax other NA NA
g grissaille technique uncertain

Making use of the %in% operator

Create new variable mat_recode:

pp <- pp %>%
  mutate(mat_recode = ifelse(mat %in% c("a", "bc", "c"), "metal",
                             ifelse(mat %in% c("al", "ar", "m"), "stone",
                                    ifelse(mat %in% c("co", "bt", "t"), "canvas",
                                           ifelse(mat %in% c("p", "ca"), "paper",
                                                  ifelse(mat %in% c("b"), "wood",
                                                         ifelse(mat %in% c("o", "e", "v"), "other",
                                                                ifelse(mat %in% c("n/a", ""), NA,
                                                                       "uncertain"))))))))

Check before you move on!

pp %>%
  group_by(mat_recode) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
## Source: local data frame [8 x 2]
## 
##   mat_recode count
## 1     canvas  1695
## 2       wood   886
## 3      metal   314
## 4         NA   306
## 5  uncertain   135
## 6      paper    38
## 7      other    16
## 8      stone     3

Formalizing the linear model

The linear model with a single predictor

  • We're interested in the \(\beta_0\) (population parameter for the intercept) and the \(\beta_1\) (population parameter for the slope) in the following model: \[ \hat{y} = \beta_0 + \beta_1~x \]
  • Tough luck, you can't have them…
  • So we use the sample statistics to estimate them: \[ \hat{y} = b_0 + b_1~x \]

The linear model with multiple predictors

  • Population model: \[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]
  • Sample model that we use to estimate the population model: \[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]

Uncertainty around estimates

  • Any estimate comes with some uncertainty around it.

  • Later in the course we'll discuss how to estimate the uncertainty around an estimate, such as the slope, and the conditions required for quantifying uncertainty around estimates using various methods.

Transformations

Price vs. surface

Describe the relationship between price and width of painting.
ggplot(data = pp, aes(x = Width_in, y = price)) +
  geom_point(alpha = 0.5)
## Warning: Removed 256 rows containing missing values (geom_point).

Let's focus on paintings with Width_in < 100

pp_wt_lt_100 <- pp %>% 
  filter(Width_in < 100)

Distribution of price

ggplot(data = pp_wt_lt_100, aes(x = price)) +
  geom_histogram()

ggplot(data = pp_wt_lt_100, aes(x = log(price))) +
  geom_histogram()

Price vs. surface

Which plot shows a more linear relationship?
ggplot(data = pp_wt_lt_100, 
       aes(x = Width_in, y = price)) +
  geom_point(alpha = 0.5)

ggplot(data = pp_wt_lt_100, 
       aes(x = Width_in, y = log(price))) +
  geom_point(alpha = 0.5)

Transforming the data

  • We saw that price has a right-skewed distribution, and the relationship between price and width of painting is non-linear.
  • In these situations a transformation applied to the response variable may be useful.
  • In order to decide which transformation to use, we should examine the distribution of the response variable.
  • The extremely right skewed distribution suggests that a log transformation may be useful.
    • log = natural log (\(ln\))

Logged price vs. surface

How do we interpret the slope of this model?
ggplot(data = pp_wt_lt_100, aes(x = Width_in, y = log(price))) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm")

Interpreting models with log transformation

m_lprice_wt <- lm(log(price) ~ Width_in, data = pp_wt_lt_100)
m_lprice_wt
## 
## Call:
## lm(formula = log(price) ~ Width_in, data = pp_wt_lt_100)
## 
## Coefficients:
## (Intercept)     Width_in  
##     4.66852      0.01915

Linear model with log transformation

\[ \widehat{log(price)} = 4.67 + 0.02 Width\_in \]

  • For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 livres.

  • which is not a very useful statement…

Working with logs

  • Subtraction and logs: \(log(a) − log(b) = log(a / b)\)

  • Natural logarithm: \(e^{log(x)} = x\)

  • We can these identities to "undo" the log transformation

Interpreting models with log transformation

The slope coefficient for the log transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.

\[log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 \]

\[log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 \]

\[e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} \]

\[\frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02\]

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.

Shortcuts in R

m_lprice_wt$coefficients
## (Intercept)    Width_in 
##   4.6685206   0.0191532
exp(m_lprice_wt$coefficients)
## (Intercept)    Width_in 
##  106.540014    1.019338

Recap

  • Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable.

  • The most common transformation when the response variable is right skewed is the log transform: \(log(y)\), especially useful when the response variable is (extremely) right skewed.

  • This transformation is also useful for variance stabilization.

Recap (cont.)

  • When using a log transformation on the response variable the interpretation of the slope changes: "For each unit increase in x, y is expected on average to be higher/lower by a factor of \(e^{b_1}\)."

  • Another useful transformation is the square root: \(\sqrt{y}\), especially useful when the response variable is counts.

  • These transformations may also be useful when the relationship is non-linear, but in those cases a polynomial regression may also be needed (this is beyond the scope of this course, but you’re welcomed to try it for your final project, and I’d be happy to provide further guidance).

Aside: when \(y = 0\)

In some cases the value of the response variable might be 0, and

log(0)
## [1] -Inf

The trick is to add a very small number to the value of the response variable for these cases so that the \(log\) function can still be applied:

log(0 + 0.00001)
## [1] -11.51293

Application Exercise

Modeling log transformed prices of Paris Paintings

See course website for details on the application exercise.