The langauge of models


Today’s agenda

Today’s agenda

  • Modeling the relationship between variables
    • Focus on linear models for numeric outcomes (but remember there are other types of models too!)
  • Application Exercise: model prices of Paris Paintings

  • Due Thursday: Finish HW2 & App Ex

Prepping the data

Load packages + Paris Paintings data

library(ggplot2)
library(dplyr)
library(stringr)
pp = read.csv("~/paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()

What’s going on with prices?

class(pp$price)
## [1] "character"
head(pp$price, 150)
##   [1] "360.0"   "6.0"     "12.0"    "6.0"     "6.0"     "9.0"     "12.0"    "12.0"    "24.0"    "6.0"    
##  [11] "6.0"     "1.3"     "1.3"     "1.3"     "12.0"    "100.0"   "100.0"   "12.0"    "12.0"    "12.0"   
##  [21] "6.0"     "12.0"    "12.0"    "12.0"    "315.0"   "315.0"   "1.2"     "1.2"     "1.2"     "1.2"    
##  [31] "1.2"     "59.0"    "36.0"    "41.0"    "6.0"     "3.0"     "3.0"     "3.0"     "3.0"     "16.3"   
##  [41] "16.3"    "16.3"    "16.3"    "298.0"   "240.0"   "427.0"   "120.0"   "63.0"    "160.0"   "500.0"  
##  [51] "141.0"   "340.0"   "151.0"   "150.0"   "300.0"   "27.0"    "9.0"     "48.0"    "68.0"    "68.0"   
##  [61] "24.0"    "24.0"    "8.0"     "38.0"    "24.0"    "24.0"    "9.0"     "9.0"     "24.0"    "36.0"   
##  [71] "9.0"     "37.0"    "36.0"    "24.0"    "24.0"    "145.0"   "51.0"    "24.0"    "24.0"    "36.0"   
##  [81] "36.0"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"   
##  [91] "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "11.7"   
## [101] "11.7"    "11.7"    "11.7"    "11.7"    "11.7"    "120.0"   "72.0"    "140.0"   "75.0"    "75.0"   
## [111] "30.0"    "30.0"    "60.0"    "12.0"    "24.0"    "3.0"     "312.0"   "1,000.0" "312.0"   "360.0"  
## [121] "40.1"    "150.0"   "502.0"   "166.0"   "595.0"   "370.0"   "240.5"   "25.0"    "25.0"    "152.0"  
## [131] "230.0"   "350.0"   "83.0"    "72.0"    "60.0"    "45.0"    "45.0"    "440.0"   "200.5"   "200.5"  
## [141] "532.0"   "1,842.5" "1,842.5" "3,035.0" "3,035.0" "1,000.4" "800.1"   "371.0"   "371.0"   "651.1"

Lets first fix those prices - Attempt 1

pp$price %>% as.numeric() %>% head(150)
## Warning in function_list[[i]](value): NAs introduced by coercion
##   [1] 360.0   6.0  12.0   6.0   6.0   9.0  12.0  12.0  24.0   6.0   6.0   1.3   1.3   1.3  12.0 100.0 100.0
##  [18]  12.0  12.0  12.0   6.0  12.0  12.0  12.0 315.0 315.0   1.2   1.2   1.2   1.2   1.2  59.0  36.0  41.0
##  [35]   6.0   3.0   3.0   3.0   3.0  16.3  16.3  16.3  16.3 298.0 240.0 427.0 120.0  63.0 160.0 500.0 141.0
##  [52] 340.0 151.0 150.0 300.0  27.0   9.0  48.0  68.0  68.0  24.0  24.0   8.0  38.0  24.0  24.0   9.0   9.0
##  [69]  24.0  36.0   9.0  37.0  36.0  24.0  24.0 145.0  51.0  24.0  24.0  36.0  36.0  11.7  11.7  11.7  11.7
##  [86]  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7  11.7
## [103]  11.7  11.7  11.7 120.0  72.0 140.0  75.0  75.0  30.0  30.0  60.0  12.0  24.0   3.0 312.0    NA 312.0
## [120] 360.0  40.1 150.0 502.0 166.0 595.0 370.0 240.5  25.0  25.0 152.0 230.0 350.0  83.0  72.0  60.0  45.0
## [137]  45.0 440.0 200.5 200.5 532.0    NA    NA    NA    NA    NA 800.1 371.0 371.0 651.1

Lets first fix those prices - Attempt 2

The commas are causing us problems, lets replace , with `` (blank) before coercing:

pp$price %>% str_replace(",", "") %>% as.numeric() %>% head(150)
##   [1]  360.0    6.0   12.0    6.0    6.0    9.0   12.0   12.0   24.0    6.0    6.0    1.3    1.3    1.3   12.0
##  [16]  100.0  100.0   12.0   12.0   12.0    6.0   12.0   12.0   12.0  315.0  315.0    1.2    1.2    1.2    1.2
##  [31]    1.2   59.0   36.0   41.0    6.0    3.0    3.0    3.0    3.0   16.3   16.3   16.3   16.3  298.0  240.0
##  [46]  427.0  120.0   63.0  160.0  500.0  141.0  340.0  151.0  150.0  300.0   27.0    9.0   48.0   68.0   68.0
##  [61]   24.0   24.0    8.0   38.0   24.0   24.0    9.0    9.0   24.0   36.0    9.0   37.0   36.0   24.0   24.0
##  [76]  145.0   51.0   24.0   24.0   36.0   36.0   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7
##  [91]   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7   11.7
## [106]  120.0   72.0  140.0   75.0   75.0   30.0   30.0   60.0   12.0   24.0    3.0  312.0 1000.0  312.0  360.0
## [121]   40.1  150.0  502.0  166.0  595.0  370.0  240.5   25.0   25.0  152.0  230.0  350.0   83.0   72.0   60.0
## [136]   45.0   45.0  440.0  200.5  200.5  532.0 1842.5 1842.5 3035.0 3035.0 1000.4  800.1  371.0  371.0  651.1

Time to actually fix things

pp = pp %>% mutate(price = str_replace(price, ",", "") %>% as.numeric())
str(pp$price)
##  num [1:3393] 360 6 12 6 6 9 12 12 24 6 ...

Modeling the relationship between variables

Models as functions

  • We can represent relationships between variables using function

  • A function is a mathematical concept: the relationship between an output and one or more inputs.
    • Plug in the inputs and receive back the output
    • Example: the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\), when \(x\) is \(5\), the output \(y\) is \(22\)

Height as a function of width

ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point() +
  stat_smooth(method = "lm") # lm for linear model

Vocabulary

  • Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)

  • Explanatory variables: Other variables that you want to use to explain the variation in the response variable, usually on the x-axis (and or color, shape, etc.) (independent variables)

  • Model / Predicted value: Output of the model function
    • The model function usually gives the expected value of the response variable conditioning on the explanatory variables
  • Residuals: How far each observation is from its predicted value
    • \(residual = observed~value - predicted~value\)

Residuals

What does a negative residual mean? Which paintings on the plot have have negative residuals?

Best fit line?

How do we decide which line to use for our linear model?

Model fit?

We can get a rough sense of model fit via visual inspection. How well do you think a painting’s width explains its height?

Just for reference…

Here is the code for the plot from the previous slide

# points not colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point(alpha = 0.2) +
  stat_smooth(method = "lm", se=FALSE)

Models - upsides and downsides

  • Models can sometimes reveal patterns that are not evident in a graph of the data, particularly in high dimensions where simple visual inspection of the data is not possible.

  • There is a real risk that a model is imposing structure that is not really there on the data. A skeptical approach is always warranted.

xkcd #1725



Unexplained variation …

is just as important as the model, if not more so!

Statistics is the explanation of variation in the context of what remains unexplained.

  • The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.

  • Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We’ll talk more about this later.)

How do we use models?

  1. Explanation: Characterize the relationship between \(y\) and \(x\) via slopes for numerical explanatory variables or differences for categorical explanatory variables

  2. Prediction: Plug in \(x\), get the predicted \(y\)

Characterizing relationships with models

Relationship between height and width

lm(Height_in ~ Width_in, data = pp)
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808

\[ \widehat{Height} = 3.62 + 0.78~Width \]

  • Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.

  • Intercept: Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
    • Does this make sense?

Relationship between height and landscape features

lm(Height_in ~ factor(landsALL), data = pp)
## 
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
## 
## Coefficients:
##       (Intercept)  factor(landsALL)1  
##            22.680             -5.645

\[ \widehat{Height} = 22.68 - 5.65~landsALL \]

  • Slope: Paintings that have some landscape features are expected, on average, to be 5.65 inches shorter than paintings that don’t have landscape features.
    • Compares the baseline level (landsALL = 0) to the other level (landsALL = 1).
  • Intercept: Paintings that don’t have landscape features are expected, on average, to be 22.68 inches tall.

Relationship between height and school

lm(Height_in ~ school_pntg, data = pp)
## 
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
## 
## Coefficients:
##     (Intercept)  school_pntgD/FL     school_pntgF     school_pntgG     school_pntgI     school_pntgS  
##          14.000            2.329           10.197            1.650           10.287           30.429  
##    school_pntgX  
##           2.869
  • When the categorical explanatory variable has many levels, they’re encoded using dummy variables.

  • Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.

FYI - linear models and means

pp %>% 
    select(Height_in,school_pntg) %>%
    na.omit() %>%
    group_by(school_pntg) %>%
    summarize(mean = mean(Height_in))
## # A tibble: 7 × 2
##   school_pntg     mean
##         <chr>    <dbl>
## 1           A 14.00000
## 2        D/FL 16.32873
## 3           F 24.19671
## 4           G 15.65000
## 5           I 24.28671
## 6           S 44.42857
## 7           X 16.86875

Correlation does not imply causation!

http://tylervigen.com/spurious-correlations

Prediction with models

Predict height from width

On average, how tall are paintings that are 60 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]

3.62 + 0.78 * 60
## [1] 50.42

“On average, we expect paintings that are 60 inches wide to be 50.42 inches high.”

Warning: We “expect” this to happen, but there will be some variability. (We’ll learn about measuring the variability around the prediction later.)

Prediction vs. extrapolation

On average, how tall are paintings that are 400 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]

Watch out for extrapolation!

“When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.” - Stephen Colbert April 6th, 2010

Source: OpenIntro Statistics. “Extrapolation is treacherous.” OpenIntro Statistics.

Measuring model fit

Measuring the strength of the fit

  • The strength of the fit of a linear model is most commonly evaluated using \(R^2\)

  • It tells us what percent of variability in the response variable is explained by the model.

  • The remainder of the variability is explained by variables not included in the model (or just measurement error).

  • \(R^2\) is sometimes called the coefficient of determination.

Obtaining \(R^2\) in R

  • Height vs. width
m_ht_wt <- lm(Height_in ~ Width_in, data = pp) # fit and save
summary(m_ht_wt)$r.squared # extract R-squared
## [1] 0.6829468

Roughly 68% of the variability in heights of paintings can be explained by their widths.

  • Height vs. lanscape features
m_ht_land <- lm(Height_in ~ landsALL, data = pp)
summary(m_ht_land)$r.squared
## [1] 0.03456724

Application exercise

App Ex 2

See course website