Today’s agenda

Today’s agenda

  • Review App Ex from last time

  • Modeling the relationship between variables
    • Focus on linear models (but remember there are other types of models too!)
  • Application Exercise: model prices of Paris Paintings

  • Due Tuesday: Finish App Ex + Reading (you’ll receive an email with a link after class)

Prepping the data

Load packages + Paris Paintings data

library(ggplot2)
library(dplyr)
library(stringr)
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  tbl_df()

What’s going on with prices?

class(pp$price)
## [1] "character"
table(pp$price)[1:30]
## 
## 1,000.0 1,000.4 1,001.0 1,002.0 1,004.0 1,005.0 1,006.0 1,008.0 1,011.0 1,035.5 1,050.0 1,051.0 
##      20       1       4       4       1       1       2       1       1       2       4       1 
## 1,055.0 1,060.0 1,077.0 1,079.0 1,080.0 1,086.0 1,099.0 1,100.0 1,100.5 1,105.0 1,110.0 1,140.0 
##       2       1       1       1       1       1       1       6       2       1       2       1 
## 1,150.0 1,155.0 1,161.0 1,180.0 1,200.0 1,201.0 
##       4       2       1       3      14       5

Let’s first fix those prices

Replace , with `` (blank), and save the variable as numeric:

pp <- pp %>%
  mutate(price = as.numeric(str_replace(price, ",", "")))

Much better…

class(pp$price)
## [1] "numeric"

Prices

Describe the distribution of prices of paintings.
ggplot(data = pp, aes(x = price)) +
  geom_histogram(binwidth = 1000)

Modeling the relationship between variables

Models as functions

  • We can represent relationships between variables using function

  • A function is a mathematical concept: the relationship between an output and one or more inputs.
    • Plug in the inputs and receive back the output
    • Example: the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\), when \(x\) is \(5\), the output \(y\) is \(22\)

Height as a function of width

ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point() +
  stat_smooth(method = "lm") # lm for linear model

Vocabulary

  • Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)

  • Explanatory variables: Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)

  • Model value: Output of the function model function
    • The model function gives the typical value of the response variable conditioning on the explanatory variables
    • Also called the predicted value
  • Residuals: Show how far each case is from its model value
    • \(residual = actual~value - model~value\)
    • Tells how far above the model function each case is

Residuals

What does a negative residual mean? Which paintings on the plot have have negative residuals?

Multiple explanatory variables

How, if at all, the relatonship between width and height of paintings vary by whether or not they have any landscape elements?

Just for reference…

Here is the code for the two plots in the previous slide

# points colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in, color = factor(landsALL))) +
  geom_point(alpha = 0.4) +
  stat_smooth(method = "lm")
# points not colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point(alpha = 0.4) +
  stat_smooth(method = "lm")

Models - upsides and downsides

  • Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data.

  • There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.

Variation around the model…

is just as important as the model, if not more!

Statistics is the explanation of variation in the context of what remains unexplained.

  • The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.

  • Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We’ll talk more about this later.)

How do we use models?

  1. Explanation: Characterize the relationship between \(y\) and \(x\) via slopes for numerical explanatory variables or differences for categorical explanatory variables

  2. Prediction: Plug in \(x\), get the predicted \(y\)

Characterizing relationships with models

Relationship between height and width

lm(Height_in ~ Width_in, data = pp)
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808

\[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]

  • Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.

  • Intercept: Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
    • Does this make sense?

Relationship between height and landscape features

lm(Height_in ~ factor(landsALL), data = pp)
## 
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
## 
## Coefficients:
##       (Intercept)  factor(landsALL)1  
##            22.680             -5.645

\[ \widehat{Height_{in}} = 22.68 - 5.65~landsALL \]

  • Slope: Paintings that have some landscape features are expected, on average, to be 5.65 inches shorter than paintings that don’t have landscape features.
    • Compares the baseline level (landsALL = 0) to the other level (landsALL = 1).
  • Intercept: Paintings that don’t have landscape features are expected, on average, to be 22.68 inches tall.

Relationship between height and school

lm(Height_in ~ school_pntg, data = pp)
## 
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
## 
## Coefficients:
##     (Intercept)  school_pntgD/FL     school_pntgF     school_pntgG     school_pntgI  
##          14.000            2.329           10.197            1.650           10.287  
##    school_pntgS     school_pntgX  
##          30.429            2.869
  • When the categorical explanatory variable has many levels, they’re encoded to dummy variables.

  • Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.

Correlation does not imply causation!

Remember this when interpreting model coefficients

Prediction with models

Predict height from width

On average, how tall are paintings that are 60 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]
3.62 + 0.78 * 60
## [1] 50.42

“On average, we expect paintings that are 60 inches wide to be 50.42 inches high.”

Warning: We “expect” this to happen, but there will be some variability. (We’ll learn about measuring the variability around the prediction later.)

Prediction vs. extrapolation

On average, how tall are paintings that are 400 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]

Watch out for extrapolation!

“When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.” - Stephen Colbert April 6th, 2010

Source: OpenIntro Statistics. “Extrapolation is treacherous.” OpenIntro Statistics.

Measuring model fit

Measuring the strength of the fit

  • The strength of the fit of a linear model is most commonly evaluated using \(R^2\)

  • It tells us what percent of variability in the response variable is explained by the model.

  • The remainder of the variability is explained by variables not included in the model.

  • \(R^2\) is sometimes called the coefficient of determination.

Obtaining \(R^2\) in R

  • Height vs. width
m_ht_wt <- lm(Height_in ~ Width_in, data = pp) # fit and save
summary(m_ht_wt)$r.squared # extract R-squared
## [1] 0.6829468

Roughly 68% of the variability in heights of paintings can be explained by their widths.

  • Height vs. lanscape features
m_ht_land <- lm(Height_in ~ landsALL, data = pp)
summary(m_ht_land)$r.squared
## [1] 0.03456724

Attribution

Some of the material on these slides is based off of

Daniel Kaplan. Statistical Modeling: A Fresh Approach. (2009).

Application exercise

App Ex 3

See course website