September 22, 2015

## Today's agenda

• Review App Ex from last time
• Recap modeling terminology and interpretations
• Recoding variables
• So you can make them more meaningful and use them in your analysis
• Transformations
• So you actually fit linear models to linear relationships
• Due Thursday:
• App Ex 4

## Load packages + Paris Paintings data

library(ggplot2)
library(dplyr)
library(stringr)

pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
tbl_df()

## Fixing price

• Overwrite existing variable price:

pp <- pp %>%
mutate(price = as.numeric(str_replace(price, ",", "")))

## Shapes of paintings

pp %>%
group_by(Shape) %>%
summarise(count = n()) %>%
arrange()
## Source: local data frame [9 x 2]
##
##       Shape count
## 1              36
## 2 miniature    10
## 3   octagon     1
## 4   octogon     1
## 5      oval    28
## 6     ovale    24
## 7     ronde     5
## 8     round    69
## 9  squ_rect  3219

## Recode scheme

original new change
NA yes
miniature miniature no
octagon octagon no
octogon octogon yes
oval oval no
ovale oval yes
squ_rect squ_rect no
ronde round yes
round round no

## ifelse() function

Usage

ifelse(test, yes, no)

Arguments

test - an object which can be coerced to logical mode.
yes - return values for true elements of test.
no - return values for false elements of test.

• You can nest many ifelse statetements.

• See ?ifelse for more details (just like any other function).

## Recoding shape of painting

Create new variable shape_recode:

pp <- pp %>%
mutate(shape_recode = ifelse(Shape == "", NA,
ifelse(Shape == "ovale", "oval",
ifelse(Shape == "ronde", "round",
ifelse(Shape == "octogon", "octagon", Shape)))))

## Check before you move on!

pp %>%
group_by(shape_recode, Shape) %>%
summarise(count = n()) %>%
arrange(shape_recode)
## Source: local data frame [9 x 3]
## Groups: shape_recode
##
##   shape_recode     Shape count
## 1    miniature miniature    10
## 2      octagon   octagon     1
## 3      octagon   octogon     1
## 4         oval      oval    28
## 5         oval     ovale    24
## 6        round     ronde     5
## 7        round     round    69
## 8     squ_rect  squ_rect  3219
## 9           NA              36

## Converting numerical variables to categorical

• If nfigures == 0 $$\rightarrow$$ fig_mention = no figures
• If nfigures >= 1 $$\rightarrow$$ fig_mention = some figures
# recode
pp <- pp %>%
mutate(fig_mention = ifelse(nfigures == 0, "no figures", "some figures"))

# check
pp %>%
group_by(fig_mention, nfigures) %>%
summarise(n())
## Source: local data frame [46 x 3]
## Groups: fig_mention
##
##     fig_mention nfigures  n()
## 1    no figures        0 3181
## 2  some figures        2    6
## 3  some figures        3   10
## 4  some figures        4   18
## 5  some figures        5   16
## 6  some figures        6    4
## 7  some figures        7   11
## 8  some figures        8   11
## 9  some figures        9   17
## 10 some figures       10    8
## ..          ...      ...  ...

## Recoding, kicked up a notch…

Let's tackle the mat variable:

pp %>%
group_by(mat) %>%
summarise(n())
## Source: local data frame [21 x 2]
##
##    mat n()
## 1      169
## 2    a   2
## 3   al   1
## 4   ar   1
## 5    b 886
## 6   br   7
## 7    c 312
## 8   ca   3
## 9   co   6
## 10   e   1
## .. ... ...
mat explanation new categories mat explanation new categories
a silver metal h oil technique uncertain
al alabaster stone m marble stone
ar slate stone mi miniature technique uncertain
b wood wood o other other
bc wood and copper metal p paper paper
br bronze frames uncertain pa pastel uncertain
bt canvas on wood canvas t canvas canvas
c copper metal ta canvas? uncertain
ca cardboard paper v glass other
co cloth canvas n/a NA NA
e wax other NA NA
g grissaille technique uncertain

## Making use of the %in% operator

Create new variable mat_recode:

pp <- pp %>%
mutate(mat_recode = ifelse(mat %in% c("a", "bc", "c"), "metal",
ifelse(mat %in% c("al", "ar", "m"), "stone",
ifelse(mat %in% c("co", "bt", "t"), "canvas",
ifelse(mat %in% c("p", "ca"), "paper",
ifelse(mat %in% c("b"), "wood",
ifelse(mat %in% c("o", "e", "v"), "other",
ifelse(mat %in% c("n/a", ""), NA,
"uncertain"))))))))

## Check before you move on!

pp %>%
group_by(mat_recode) %>%
summarise(count = n()) %>%
arrange(desc(count))
## Source: local data frame [8 x 2]
##
##   mat_recode count
## 1     canvas  1695
## 2       wood   886
## 3      metal   314
## 4         NA   306
## 5  uncertain   135
## 6      paper    38
## 7      other    16
## 8      stone     3

## The linear model with a single predictor

• We're interested in the $$\beta_0$$ (population parameter for the intercept) and the $$\beta_1$$ (population parameter for the slope) in the following model: $\hat{y} = \beta_0 + \beta_1~x$
• Tough luck, you can't have them…
• So we use the sample statistics to estimate them: $\hat{y} = b_0 + b_1~x$

## The linear model with multiple predictors

• Population model: $\hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k$
• Sample model that we use to estimate the population model: $\hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k$

## Uncertainty around estimates

• Any estimate comes with some uncertainty around it.

• Later in the course we'll discuss how to estimate the uncertainty around an estimate, such as the slope, and the conditions required for quantifying uncertainty around estimates using various methods.

## Price vs. surface

Describe the relationship between price and width of painting.
ggplot(data = pp, aes(x = Width_in, y = price)) +
geom_point(alpha = 0.5)
## Warning: Removed 256 rows containing missing values (geom_point).

## Let's focus on paintings with Width_in < 100

pp_wt_lt_100 <- pp %>%
filter(Width_in < 100)

## Distribution of price

ggplot(data = pp_wt_lt_100, aes(x = price)) +
geom_histogram()

ggplot(data = pp_wt_lt_100, aes(x = log(price))) +
geom_histogram()

## Price vs. surface

Which plot shows a more linear relationship?
ggplot(data = pp_wt_lt_100,
aes(x = Width_in, y = price)) +
geom_point(alpha = 0.5)

ggplot(data = pp_wt_lt_100,
aes(x = Width_in, y = log(price))) +
geom_point(alpha = 0.5)

## Transforming the data

• We saw that price has a right-skewed distribution, and the relationship between price and width of painting is non-linear.
• In these situations a transformation applied to the response variable may be useful.
• In order to decide which transformation to use, we should examine the distribution of the response variable.
• The extremely right skewed distribution suggests that a log transformation may be useful.
• log = natural log ($$ln$$)

## Logged price vs. surface

How do we interpret the slope of this model?
ggplot(data = pp_wt_lt_100, aes(x = Width_in, y = log(price))) +
geom_point(alpha = 0.5) +
stat_smooth(method = "lm")