Load packages:
library(ggplot2)
library(dplyr)
library(stringr)
library(magrittr)
Load (and fix) data:
pp = read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df() %>%
  mutate(price = as.numeric(str_replace(price, ",", "")))
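As a quick sanity check of why the `str_replace()` step is needed (using a made-up value, not one from the dataset): `as.numeric()` cannot parse comma-formatted strings on its own.

```r
# as.numeric() returns NA (with a warning) for comma-formatted numbers;
# stripping the comma first gives the intended numeric value.
as.numeric("1,500")                        # NA, with a coercion warning
as.numeric(str_replace("1,500", ",", ""))  # 1500
```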
pp %>% group_by(Shape) %>% summarise(n = n()) %>% arrange(n)
## # A tibble: 9 × 2
##        Shape     n
##        <chr> <int>
## 1    octagon     1
## 2    octogon     1
## 3      ronde     5
## 4  miniature    10
## 5      ovale    24
## 6       oval    28
## 7               36
## 8      round    69
## 9   squ_rect  3219
| original  | new       | change |
|-----------|-----------|--------|
|           | NA        | yes    |
| miniature | miniature | no     |
| octagon   | octagon   | no     |
| octogon   | octagon   | yes    |
| oval      | oval      | no     |
| ovale     | oval      | yes    |
| squ_rect  | squ_rect  | no     |
| ronde     | round     | yes    |
| round     | round     | no     |
library(forcats)
Another package by Hadley Wickham (of dplyr and ggplot2 fame) for handling categorical variables.
Everything in the package is possible with base R, but a lot of it is harder than it needs to be.
Package documentation is available: https://hadley.github.io/forcats/reference/index.html.
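To illustrate the point that all of this is possible in base R (just more awkwardly), here is a rough sketch of the Shape recoding shown below done without forcats, using a named lookup vector. The vector name `recode_map` is ours, not from the slides, and the result stays a character vector rather than a factor.

```r
# Base R sketch of the same recoding: map misspelled values to the
# corrected ones and turn blank entries into NA.
recode_map = c(octogon = "octagon", ovale = "oval",
               ronde = "round", squ_rect = "rect")
shape_fixed = pp$Shape
shape_fixed[shape_fixed == ""] = NA
idx = shape_fixed %in% names(recode_map)
shape_fixed[idx] = recode_map[shape_fixed[idx]]
table(shape_fixed, useNA = "ifany")
```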
pp %>%
  mutate(Shape = fct_recode(Shape,
                            octagon = "octogon",
                            oval = "ovale",
                            round = "ronde",
                            rect = "squ_rect",
                            NULL = "")) %>%
  group_by(Shape) %>%
  summarize(n = n()) %>%
  arrange(n)
## # A tibble: 6 × 2
##       Shape     n
##      <fctr> <int>
## 1   octagon     2
## 2 miniature    10
## 3        NA    36
## 4      oval    52
## 5     round    74
## 6      rect  3219
pp %>% mutate(Shape = fct_lump(Shape)) %>% group_by(Shape) %>% summarize(n=n()) %>% arrange(n)
## # A tibble: 2 × 2
##      Shape     n
##     <fctr> <int>
## 1    Other   174
## 2 squ_rect  3219
What happened to our missing values (NAs)? The blank entries were still an ordinary level of Shape, so fct_lump() folded them into Other along with the infrequent shapes. We can recode them to NA first:
pp %>%
  mutate(Shape = fct_recode(Shape, NULL = "") %>% fct_lump()) %>%
  group_by(Shape) %>%
  summarize(n = n()) %>%
  arrange(n)
## # A tibble: 3 × 2
##      Shape     n
##     <fctr> <int>
## 1       NA    36
## 2    Other   138
## 3 squ_rect  3219
Much better: NAs are preserved and infrequent shapes are lumped together.
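fct_lump() also takes arguments that control how much lumping happens. For example (a sketch run on the raw Shape values, not output shown in the slides), we can keep only the most frequent levels or lump levels below a frequency threshold:

```r
# Keep the 3 most frequent shapes and lump the rest into "Other"
pp %>% mutate(Shape = fct_lump(Shape, n = 3)) %>% count(Shape)

# Lump any shape that makes up less than 1% of paintings
pp %>% mutate(Shape = fct_lump(Shape, prop = 0.01)) %>% count(Shape)
```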
Now that we are happy with our changes to the Shape variable, let's make them permanent by replacing the existing faulty values.
library(magrittr)
pp %<>% mutate(Shape = fct_recode(Shape,
                                  octagon = "octogon",
                                  oval = "ovale",
                                  round = "ronde",
                                  rect = "squ_rect",
                                  NULL = ""))
For this we need to make sure that the magrittr package is loaded (not just dplyr).
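The compound assignment pipe %<>% comes from magrittr; it pipes an object into a chain and assigns the result back to the same name, i.e. `x %<>% f()` is shorthand for `x = x %>% f()`. A small toy illustration (values chosen just for the example):

```r
# %<>% pipes x through the chain and overwrites x with the result.
x = c(4, 1, 9)
x %<>% sort() %>% sqrt()
x
# [1] 1 2 3
# equivalent long form: x = x %>% sort() %>% sqrt()
```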
Let's tackle the mat variable:
pp %>% group_by(mat) %>% summarise(n=n())
## # A tibble: 21 × 2
##      mat     n
##    <chr> <int>
## 1          169
## 2      a     2
## 3     al     1
## 4     ar     1
## 5      b   886
## 6     br     7
## 7      c   312
## 8     ca     3
## 9     co     6
## 10     e     1
## # ... with 11 more rows
| mat | explanation         | new categories | mat | explanation         | new categories |
|-----|---------------------|----------------|-----|---------------------|----------------|
| a   | silver              | metal          | h   | oil technique       | other          |
| al  | alabaster           | stone          | m   | marble              | stone          |
| ar  | slate               | stone          | mi  | miniature technique | other          |
| b   | wood                | wood           | o   | other               | other          |
| bc  | wood and copper     | metal          | p   | paper               | paper          |
| br  | bronze frames       | metal          | pa  | pastel              | other          |
| bt  | canvas on wood      | canvas         | t   | canvas              | canvas         |
| c   | copper              | metal          | ta  | canvas?             | canvas         |
| ca  | cardboard           | paper          | v   | glass               | other          |
| co  | cloth               | canvas         | n/a | NA                  | NA             |
| e   | wax                 | other          |     | NA                  | NA             |
| g   | grisaille technique | other          |     |                     |                |
pp %>%
  mutate(mat = fct_collapse(mat,
                            metal  = c("a", "bc", "br", "c"),
                            stone  = c("al", "ar", "m"),
                            canvas = c("co", "bt", "t", "ta"),
                            paper  = c("p", "ca"),
                            wood   = c("b"),
                            other  = c("o", "e", "v", "h", "mi", "pa", "g"),
                            NULL   = c("n/a", ""))) %>%
  group_by(mat) %>%
  summarize(n = n()) %>%
  arrange(n)
## Warning: Unknown levels in `f`: bc, bt
## # A tibble: 7 × 2
##      mat     n
##   <fctr> <int>
## 1  stone     3
## 2  paper    38
## 3  other    56
## 4     NA   306
## 5  metal   321
## 6   wood   886
## 7 canvas  1783
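The warning is telling us that two codes from the codebook, bc and bt, never actually occur in mat, so there is nothing to collapse for them; the result is still correct. One way to check which codes are present in the data (a sketch, run before making the change permanent) is:

```r
# List the material codes that actually occur in the raw data;
# codes in fct_collapse() but not in this list trigger the warning.
pp %>% distinct(mat) %>% arrange(mat)
```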
pp %<>%
  mutate(mat = fct_collapse(mat,
                            metal  = c("a", "bc", "br", "c"),
                            stone  = c("al", "ar", "m"),
                            canvas = c("co", "bt", "t", "ta"),
                            paper  = c("p", "ca"),
                            wood   = c("b"),
                            other  = c("o", "e", "v", "h", "mi", "pa", "g"),
                            NULL   = c("n/a", "")))
## Warning: Unknown levels in `f`: bc, bt
Any estimate comes with some uncertainty around it.
Later in the course we'll discuss how to estimate the uncertainty around an estimate, such as the slope, and the conditions required for quantifying uncertainty around estimates using various methods.
Describe the relationship between price and width of painting.
ggplot(data = pp, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2)
## Warning: Removed 256 rows containing missing values (geom_point).
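The warning refers to rows where one of the plotted variables is missing. We can count those rows directly (a sketch; we are not asserting the exact count here):

```r
# Count rows that ggplot2 drops because width or price is missing.
pp %>% filter(is.na(Width_in) | is.na(price)) %>% nrow()
```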
Let's focus on paintings with Width_in < 100.
pp_width = pp %>% filter(Width_in < 100)
ggplot(data = pp_width, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2)
ggplot(data = pp_width, aes(x = price)) + geom_histogram()
ggplot(data = pp_width, aes(x = log(price))) + geom_histogram()
Which plot shows a more linear relationship?
ggplot(data = pp_width, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2)
ggplot(data = pp_width, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.2)
Which plot shows a more linear relationship?
ggplot(data = pp_width, aes(x = Width_in, y = price)) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se=FALSE)
ggplot(data = pp_width, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se=FALSE)
price has a right-skewed distribution, and the relationship between price and width of painting is non-linear. How do we interpret the slope of this model?
ggplot(data = pp_width, aes(x = Width_in, y = log(price))) + geom_point(alpha = 0.2) + stat_smooth(method = "lm")
(m = lm(log(price) ~ Width_in, data = pp_width))
## 
## Call:
## lm(formula = log(price) ~ Width_in, data = pp_width)
## 
## Coefficients:
## (Intercept)     Width_in  
##     4.66852      0.01915
\[ \widehat{log(price)} = 4.67 + 0.02~Width\_in \]
For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 log livres.
which is not a very useful statement… (what is a log livre?)
\[log(a) − log(b) = log\left(\frac{a}{b}\right)\]
\[e^{log(x)} = x\]
Assume that a painting has a price \(y\) and is \(x\) inches wide, if another painting is one inch wider (\(x+1\)) what is its price (\(y'\)) in terms of \(y\)?
\[ \begin{aligned} \log(y) &= 4.67 + 0.02~x \\ \log(y') &= 4.67 + 0.02~(x+1) = 4.67 + 0.02~x + 0.02 \end{aligned} \]
\[log(y') - log(y) = 0.02 \]
\[log\left(\frac{y'}{y}\right) = 0.02 \]
\[e^{log\left(\frac{y'}{y}\right)} = e^{0.02} \]
\[\frac{y'}{y} = e^{0.02} = 1.02\]
\[y' \approx 1.02~y\]
For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.
m$coefficients
## (Intercept)    Width_in 
##   4.6685206   0.0191532
exp(m$coefficients)
## (Intercept)    Width_in 
##  106.540014    1.019338
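To get a predicted price back on the original (livres) scale, we can predict on the log scale and then exponentiate. The sketch below uses a hypothetical width of 30 inches, a value chosen just for illustration:

```r
# Predict log(price) for a hypothetical 30-inch-wide painting,
# then back-transform to livres with exp().
newpt = data.frame(Width_in = 30)
exp(predict(m, newdata = newpt))
# equivalently: exp(4.6685206 + 0.0191532 * 30)
```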
When using a log transformation on the response variable the interpretation of the slope changes: "For each unit increase in x, y is expected on average to change by a factor of \(e^{b_1}\)."
Another useful transformation is the square root, \(\sqrt{y}\); it is also used when the data are right skewed (but not as severely right skewed as when you would use a \(log\)).
In the case of left skewed data you can try using power transformations like \(y^2\) or \(y^3\).
Most transformations don't have natural interpretations for the slope parameter in terms of untransformed units.
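As a sketch of what a square-root transformation would look like in this setting (an illustration only; we do not interpret this model further here):

```r
# Plot and fit a square-root-transformed response; the slope of this
# model has no natural interpretation in terms of livres.
ggplot(data = pp_width, aes(x = Width_in, y = sqrt(price))) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE)

m_sqrt = lm(sqrt(price) ~ Width_in, data = pp_width)
```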
In some cases the value of the response variable might be 0, and the log of 0 is undefined:
log(0)
## [1] -Inf
One trick is to add a very small number to the value of the response variable for these cases so that the \(log\) function can still be applied:
log(0 + 0.001)
## [1] -6.907755
However, this correction is sensitive to the units of \(y\).
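A quick numeric illustration of that sensitivity (made-up constants): the result depends heavily on the size of the added constant relative to the scale of \(y\).

```r
# The log of "zero plus a small constant" changes a lot with the constant,
# so the correction depends on the units of the response variable.
log(0 + 0.001)  # about -6.9
log(0 + 0.01)   # about -4.6
log(0 + 1)      # 0
```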
See course website for details on the application exercise.