Sta112FS 8. Recoding variables and transformations

September 22, 2015

Today's agenda

Review App Ex from last time
- Recap modeling terminology and interpretations
Recoding variables
- So you can make them more meaningful and use them in your analysis
Transformations
- So you actually fit linear models to linear relationships
Due Thursday:
- App Ex 4
- Reading (you'll receive an email with the link after class)

Prepping the data

Load packages + Paris Paintings data

Load packages:

library(ggplot2)
library(dplyr)
library(stringr)

Load data:

pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()

Recoding variables

Fixing price

We did this already!
Overwrite existing variable price:

pp <- pp %>%
  mutate(price = as.numeric(str_replace(price, ",", "")))

Shapes of paintings

pp %>%
  group_by(Shape) %>%
  summarise(count = n()) %>%
  arrange()

## Source: local data frame [9 x 2]
## 
##       Shape count
## 1              36
## 2 miniature    10
## 3   octagon     1
## 4   octogon     1
## 5      oval    28
## 6     ovale    24
## 7     ronde     5
## 8     round    69
## 9  squ_rect  3219

Recode scheme

original	new	change
	`NA`	yes
`miniature`	`miniature`	no
`octagon`	`octagon`	no
`octogon`	`octogon`	yes
`oval`	`oval`	no
`ovale`	`oval`	yes
`squ_rect`	`squ_rect`	no
`ronde`	`round`	yes
`round`	`round`	no

`ifelse()` function

Usage

ifelse(test, yes, no)

Arguments

test - an object which can be coerced to logical mode.
yes - return values for true elements of test.
no - return values for false elements of test.

You can nest many ifelse statetements.
See ?ifelse for more details (just like any other function).

Recoding shape of painting

Create new variable shape_recode:

pp <- pp %>%
  mutate(shape_recode = ifelse(Shape == "", NA,
                               ifelse(Shape == "ovale", "oval",
                                      ifelse(Shape == "ronde", "round",
                                             ifelse(Shape == "octogon", "octagon", Shape)))))

Check before you move on!

pp %>%
  group_by(shape_recode, Shape) %>%
  summarise(count = n()) %>%
  arrange(shape_recode)

## Source: local data frame [9 x 3]
## Groups: shape_recode
## 
##   shape_recode     Shape count
## 1    miniature miniature    10
## 2      octagon   octagon     1
## 3      octagon   octogon     1
## 4         oval      oval    28
## 5         oval     ovale    24
## 6        round     ronde     5
## 7        round     round    69
## 8     squ_rect  squ_rect  3219
## 9           NA              36

Converting numerical variables to categorical

If nfigures == 0 \(\rightarrow\) fig_mention = no figures
If nfigures >= 1 \(\rightarrow\) fig_mention = some figures

# recode
pp <- pp %>%
  mutate(fig_mention = ifelse(nfigures == 0, "no figures", "some figures"))
  
# check
pp %>%
  group_by(fig_mention, nfigures) %>%
  summarise(n())

## Source: local data frame [46 x 3]
## Groups: fig_mention
## 
##     fig_mention nfigures  n()
## 1    no figures        0 3181
## 2  some figures        2    6
## 3  some figures        3   10
## 4  some figures        4   18
## 5  some figures        5   16
## 6  some figures        6    4
## 7  some figures        7   11
## 8  some figures        8   11
## 9  some figures        9   17
## 10 some figures       10    8
## ..          ...      ...  ...

Recoding, kicked up a notch…

Let's tackle the mat variable:

pp %>% 
  group_by(mat) %>% 
  summarise(n())

## Source: local data frame [21 x 2]
## 
##    mat n()
## 1      169
## 2    a   2
## 3   al   1
## 4   ar   1
## 5    b 886
## 6   br   7
## 7    c 312
## 8   ca   3
## 9   co   6
## 10   e   1
## .. ... ...

mat	explanation	new categories	mat	explanation	new categories
`a`	silver	`metal`	`h`	oil technique	uncertain
`al`	alabaster	`stone`	`m`	marble	stone
`ar`	slate	`stone`	`mi`	miniature technique	uncertain
`b`	wood	`wood`	`o`	other	other
`bc`	wood and copper	`metal`	`p`	paper	paper
`br`	bronze frames	`uncertain`	`pa`	pastel	uncertain
`bt`	canvas on wood	`canvas`	`t`	canvas	canvas
`c`	copper	`metal`	`ta`	canvas?	uncertain
`ca`	cardboard	`paper`	`v`	glass	other
`co`	cloth	`canvas`	`n/a`	NA	NA
`e`	wax	`other`		NA	NA
`g`	grissaille technique	`uncertain`

Making use of the `%in%` operator

Create new variable mat_recode:

pp <- pp %>%
  mutate(mat_recode = ifelse(mat %in% c("a", "bc", "c"), "metal",
                             ifelse(mat %in% c("al", "ar", "m"), "stone",
                                    ifelse(mat %in% c("co", "bt", "t"), "canvas",
                                           ifelse(mat %in% c("p", "ca"), "paper",
                                                  ifelse(mat %in% c("b"), "wood",
                                                         ifelse(mat %in% c("o", "e", "v"), "other",
                                                                ifelse(mat %in% c("n/a", ""), NA,
                                                                       "uncertain"))))))))

Check before you move on!

pp %>%
  group_by(mat_recode) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

## Source: local data frame [8 x 2]
## 
##   mat_recode count
## 1     canvas  1695
## 2       wood   886
## 3      metal   314
## 4         NA   306
## 5  uncertain   135
## 6      paper    38
## 7      other    16
## 8      stone     3

Formalizing the linear model

The linear model with a single predictor

We're interested in the \(\beta_0\) (population parameter for the intercept) and the \(\beta_1\) (population parameter for the slope) in the following model: \[ \hat{y} = \beta_0 + \beta_1~x \]

Tough luck, you can't have them…

So we use the sample statistics to estimate them: \[ \hat{y} = b_0 + b_1~x \]

The linear model with multiple predictors

Population model: \[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]

Sample model that we use to estimate the population model: \[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]

Uncertainty around estimates

Any estimate comes with some uncertainty around it.
Later in the course we'll discuss how to estimate the uncertainty around an estimate, such as the slope, and the conditions required for quantifying uncertainty around estimates using various methods.

Transformations

Price vs. surface

Describe the relationship between price and width of painting.

ggplot(data = pp, aes(x = Width_in, y = price)) +
  geom_point(alpha = 0.5)

## Warning: Removed 256 rows containing missing values (geom_point).

Let's focus on paintings with `Width_in < 100`

pp_wt_lt_100 <- pp %>% 
  filter(Width_in < 100)

Distribution of price

ggplot(data = pp_wt_lt_100, aes(x = price)) +
  geom_histogram()

ggplot(data = pp_wt_lt_100, aes(x = log(price))) +
  geom_histogram()

Price vs. surface

Which plot shows a more linear relationship?

ggplot(data = pp_wt_lt_100, 
       aes(x = Width_in, y = price)) +
  geom_point(alpha = 0.5)

ggplot(data = pp_wt_lt_100, 
       aes(x = Width_in, y = log(price))) +
  geom_point(alpha = 0.5)

Transforming the data

We saw that price has a right-skewed distribution, and the relationship between price and width of painting is non-linear.

In these situations a transformation applied to the response variable may be useful.

In order to decide which transformation to use, we should examine the distribution of the response variable.

The extremely right skewed distribution suggests that a log transformation may be useful.
- log = natural log (\(ln\))

Logged price vs. surface

How do we interpret the slope of this model?

ggplot(data = pp_wt_lt_100, aes(x = Width_in, y = log(price))) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm")

Interpreting models with log transformation

m_lprice_wt <- lm(log(price) ~ Width_in, data = pp_wt_lt_100)
m_lprice_wt

## 
## Call:
## lm(formula = log(price) ~ Width_in, data = pp_wt_lt_100)
## 
## Coefficients:
## (Intercept)     Width_in  
##     4.66852      0.01915

Linear model with log transformation

\[ \widehat{log(price)} = 4.67 + 0.02 Width\_in \]

For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 livres.
which is not a very useful statement…

Working with logs

Subtraction and logs: \(log(a) − log(b) = log(a / b)\)
Natural logarithm: \(e^{log(x)} = x\)
We can these identities to "undo" the log transformation

Interpreting models with log transformation

The slope coefficient for the log transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.

\[log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 \]

\[log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 \]

\[e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} \]

\[\frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02\]

For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.

Shortcuts in R

m_lprice_wt$coefficients

## (Intercept)    Width_in 
##   4.6685206   0.0191532

exp(m_lprice_wt$coefficients)

## (Intercept)    Width_in 
##  106.540014    1.019338

Recap

Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable.
The most common transformation when the response variable is right skewed is the log transform: \(log(y)\), especially useful when the response variable is (extremely) right skewed.
This transformation is also useful for variance stabilization.

Recap (cont.)

When using a log transformation on the response variable the interpretation of the slope changes: "For each unit increase in x, y is expected on average to be higher/lower by a factor of \(e^{b_1}\)."
Another useful transformation is the square root: \(\sqrt{y}\), especially useful when the response variable is counts.
These transformations may also be useful when the relationship is non-linear, but in those cases a polynomial regression may also be needed (this is beyond the scope of this course, but you’re welcomed to try it for your final project, and I’d be happy to provide further guidance).

Aside: when \(y = 0\)

In some cases the value of the response variable might be 0, and

log(0)

## [1] -Inf

The trick is to add a very small number to the value of the response variable for these cases so that the \(log\) function can still be applied:

log(0 + 0.00001)

## [1] -11.51293

Application Exercise

Modeling log transformed prices of Paris Paintings

See course website for details on the application exercise.

Today's agenda

Today's agenda

Prepping the data

Load packages + Paris Paintings data

Recoding variables

Fixing price

Shapes of paintings

Recode scheme

ifelse() function

Recoding shape of painting

Check before you move on!

Converting numerical variables to categorical

Recoding, kicked up a notch…

Making use of the %in% operator

Check before you move on!

Formalizing the linear model

The linear model with a single predictor

The linear model with multiple predictors

Uncertainty around estimates

Transformations

Price vs. surface

Let's focus on paintings with Width_in < 100

Distribution of price

Price vs. surface

Transforming the data

Logged price vs. surface

Interpreting models with log transformation

Linear model with log transformation

Working with logs

Interpreting models with log transformation

Shortcuts in R

Recap

Recap (cont.)

Aside: when \(y = 0\)

Application Exercise

Modeling log transformed prices of Paris Paintings

`ifelse()` function

Making use of the `%in%` operator

Let's focus on paintings with `Width_in < 100`