September 22, 2015

Today's agenda

Today's agenda

  • Review App Ex from last time
    • Recap modeling terminology and interpretations
  • Recoding variables
    • So you can make them more meaningful and use them in your analysis
  • Transformations
    • So you actually fit linear models to linear relationships
  • Due Thursday:
    • App Ex 4
    • Reading (you'll receive an email with the link after class)

Prepping the data

Load packages + Paris Paintings data

Load packages:

library(ggplot2)
library(dplyr)
library(stringr)

Load data:

pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()

Recoding variables

Fixing price

  • We did this already!

  • Overwrite existing variable price:

pp <- pp %>%
  mutate(price = as.numeric(str_replace(price, ",", "")))

Shapes of paintings

pp %>%
  group_by(Shape) %>%
  summarise(count = n()) %>%
  arrange()
## Source: local data frame [9 x 2]
## 
##       Shape count
## 1              36
## 2 miniature    10
## 3   octagon     1
## 4   octogon     1
## 5      oval    28
## 6     ovale    24
## 7     ronde     5
## 8     round    69
## 9  squ_rect  3219

Recode scheme

original new change
NA yes
miniature miniature no
octagon octagon no
octogon octogon yes
oval oval no
ovale oval yes
squ_rect squ_rect no
ronde round yes
round round no

ifelse() function

Usage

ifelse(test, yes, no)

Arguments

test - an object which can be coerced to logical mode.
yes - return values for true elements of test.
no - return values for false elements of test.

  • You can nest many ifelse statetements.

  • See ?ifelse for more details (just like any other function).

Recoding shape of painting

Create new variable shape_recode:

pp <- pp %>%
  mutate(shape_recode = ifelse(Shape == "", NA,
                               ifelse(Shape == "ovale", "oval",
                                      ifelse(Shape == "ronde", "round",
                                             ifelse(Shape == "octogon", "octagon", Shape)))))

Check before you move on!

pp %>%
  group_by(shape_recode, Shape) %>%
  summarise(count = n()) %>%
  arrange(shape_recode)
## Source: local data frame [9 x 3]
## Groups: shape_recode
## 
##   shape_recode     Shape count
## 1    miniature miniature    10
## 2      octagon   octagon     1
## 3      octagon   octogon     1
## 4         oval      oval    28
## 5         oval     ovale    24
## 6        round     ronde     5
## 7        round     round    69
## 8     squ_rect  squ_rect  3219
## 9           NA              36

Converting numerical variables to categorical

  • If nfigures == 0 \(\rightarrow\) fig_mention = no figures
  • If nfigures >= 1 \(\rightarrow\) fig_mention = some figures
# recode
pp <- pp %>%
  mutate(fig_mention = ifelse(nfigures == 0, "no figures", "some figures"))
  
# check
pp %>%
  group_by(fig_mention, nfigures) %>%
  summarise(n())
## Source: local data frame [46 x 3]
## Groups: fig_mention
## 
##     fig_mention nfigures  n()
## 1    no figures        0 3181
## 2  some figures        2    6
## 3  some figures        3   10
## 4  some figures        4   18
## 5  some figures        5   16
## 6  some figures        6    4
## 7  some figures        7   11
## 8  some figures        8   11
## 9  some figures        9   17
## 10 some figures       10    8
## ..          ...      ...  ...

Recoding, kicked up a notch…

Let's tackle the mat variable:

pp %>% 
  group_by(mat) %>% 
  summarise(n())
## Source: local data frame [21 x 2]
## 
##    mat n()
## 1      169
## 2    a   2
## 3   al   1
## 4   ar   1
## 5    b 886
## 6   br   7
## 7    c 312
## 8   ca   3
## 9   co   6
## 10   e   1
## .. ... ...

mat explanation new categories mat explanation new categories
a silver metal h oil technique uncertain
al alabaster stone m marble stone
ar slate stone mi miniature technique uncertain
b wood wood o other other
bc wood and copper metal p paper paper
br bronze frames uncertain pa pastel uncertain
bt canvas on wood canvas t canvas canvas
c copper metal ta canvas? uncertain
ca cardboard paper v glass other
co cloth canvas n/a NA NA
e wax other NA NA
g grissaille technique uncertain

Making use of the %in% operator

Create new variable mat_recode:

pp <- pp %>%
  mutate(mat_recode = ifelse(mat %in% c("a", "bc", "c"), "metal",
                             ifelse(mat %in% c("al", "ar", "m"), "stone",
                                    ifelse(mat %in% c("co", "bt", "t"), "canvas",
                                           ifelse(mat %in% c("p", "ca"), "paper",
                                                  ifelse(mat %in% c("b"), "wood",
                                                         ifelse(mat %in% c("o", "e", "v"), "other",
                                                                ifelse(mat %in% c("n/a", ""), NA,
                                                                       "uncertain"))))))))

Check before you move on!

pp %>%
  group_by(mat_recode) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
## Source: local data frame [8 x 2]
## 
##   mat_recode count
## 1     canvas  1695
## 2       wood   886
## 3      metal   314
## 4         NA   306
## 5  uncertain   135
## 6      paper    38
## 7      other    16
## 8      stone     3

Formalizing the linear model

The linear model with a single predictor

  • We're interested in the \(\beta_0\) (population parameter for the intercept) and the \(\beta_1\) (population parameter for the slope) in the following model: \[ \hat{y} = \beta_0 + \beta_1~x \]
  • Tough luck, you can't have them…
  • So we use the sample statistics to estimate them: \[ \hat{y} = b_0 + b_1~x \]

The linear model with multiple predictors

  • Population model: \[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]
  • Sample model that we use to estimate the population model: \[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]

Uncertainty around estimates

  • Any estimate comes with some uncertainty around it.

  • Later in the course we'll discuss how to estimate the uncertainty around an estimate, such as the slope, and the conditions required for quantifying uncertainty around estimates using various methods.

Transformations

Price vs. surface

Describe the relationship between price and width of painting.
ggplot(data = pp, aes(x = Width_in, y = price)) +
  geom_point(alpha = 0.5)
## Warning: Removed 256 rows containing missing values (geom_point).

Let's focus on paintings with Width_in < 100

pp_wt_lt_100 <- pp %>% 
  filter(Width_in < 100)

Distribution of price

ggplot(data = pp_wt_lt_100, aes(x = price)) +
  geom_histogram()

ggplot(data = pp_wt_lt_100, aes(x = log(price))) +
  geom_histogram()

Price vs. surface

Which plot shows a more linear relationship?
ggplot(data = pp_wt_lt_100, 
       aes(x = Width_in, y = price)) +
  geom_point(alpha = 0.5)

ggplot(data = pp_wt_lt_100, 
       aes(x = Width_in, y = log(price))) +
  geom_point(alpha = 0.5)

Transforming the data

  • We saw that price has a right-skewed distribution, and the relationship between price and width of painting is non-linear.
  • In these situations a transformation applied to the response variable may be useful.
  • In order to decide which transformation to use, we should examine the distribution of the response variable.
  • The extremely right skewed distribution suggests that a log transformation may be useful.
    • log = natural log (\(ln\))

Logged price vs. surface

How do we interpret the slope of this model?
ggplot(data = pp_wt_lt_100, aes(x = Width_in, y = log(price))) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm")