Today’s agenda

  • Review App Ex from last time

  • Modeling the relationship between variables
    • Focus on linear models (but remember there are other types of models too!)
  • Application Exercise: model prices of Paris Paintings

  • Due Tuesday: Finish App Ex + Reading (you’ll receive an email with a link after class)

Prepping the data

Load packages + Paris Paintings data

library(ggplot2)
library(dplyr)
library(stringr)
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  as_tibble() # tbl_df() is deprecated in current dplyr; as_tibble() replaces it

What’s going on with prices?

class(pp$price)
## [1] "character"
table(pp$price)[1:30]
## 
## 1,000.0 1,000.4 1,001.0 1,002.0 1,004.0 1,005.0 1,006.0 1,008.0 1,011.0 1,035.5 1,050.0 1,051.0 
##      20       1       4       4       1       1       2       1       1       2       4       1 
## 1,055.0 1,060.0 1,077.0 1,079.0 1,080.0 1,086.0 1,099.0 1,100.0 1,100.5 1,105.0 1,110.0 1,140.0 
##       2       1       1       1       1       1       1       6       2       1       2       1 
## 1,150.0 1,155.0 1,161.0 1,180.0 1,200.0 1,201.0 
##       4       2       1       3      14       5

Let’s first fix those prices

Replace `,` with `""` (an empty string), and save the variable as numeric:

pp <- pp %>%
  # str_replace_all() strips every comma, not just the first one in each price
  mutate(price = as.numeric(str_replace_all(price, ",", "")))

Much better…

class(pp$price)
## [1] "numeric"
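
To see what the replacement does before applying it to the full data, it can help to try it on a few strings. The sketch below uses made-up values (`prices_chr` is a hypothetical vector, not taken from the data) and compares `str_replace()`, which replaces only the first match, with `str_replace_all()`:

```r
library(stringr)

# Made-up price strings in the same format as pp$price (illustration only)
prices_chr <- c("1,000.0", "1,035.5", "12,500.0", "1,250,000.0")

# str_replace() removes only the first comma in each string
first_only <- str_replace(prices_chr, ",", "")
first_only

# str_replace_all() removes every comma, so the result converts cleanly
prices_num <- as.numeric(str_replace_all(prices_chr, ",", ""))
prices_num
```

The two functions differ only for strings with more than one comma, such as the last value here.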

Prices

Describe the distribution of prices of paintings.
ggplot(data = pp, aes(x = price)) +
  geom_histogram(binwidth = 1000)

Modeling the relationship between variables

Models as functions

  • We can represent relationships between variables using functions

  • A function is a mathematical concept: the relationship between an output and one or more inputs.
    • Plug in the inputs and receive back the output
    • Example: the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\); when \(x\) is \(5\), the output \(y\) is \(22\)
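
The same example can be written as an R function. This is a small sketch; the name `f` is arbitrary:

```r
# The formula y = 3x + 7 as an R function: x is the input, the return value is y
f <- function(x) {
  3 * x + 7
}

f(5)           # 22
f(c(0, 1, 5))  # 7 10 22 (the function is applied to each input)
```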

Height as a function of width

ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point() +
  stat_smooth(method = "lm") # lm for linear model
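
The line drawn by `stat_smooth(method = "lm")` is a fitted linear model, and the same kind of fit can be obtained directly with `lm()`. A minimal sketch, using a small made-up data frame (`toy`) rather than the Paris Paintings data:

```r
# Made-up widths and heights in inches, for illustration only
toy <- data.frame(
  Width_in  = c(10, 20, 30, 40, 50),
  Height_in = c(12, 19, 33, 38, 52)
)

# Fit Height_in as a linear function of Width_in
m <- lm(Height_in ~ Width_in, data = toy)

# Intercept and slope of the fitted line
coef(m)
```

With the real data, the same call would use `data = pp`.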

Vocabulary

  • Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)

  • Explanatory variables: Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)

  • Model value: Output of the model function
    • The model function gives the typical value of the response variable conditional on the explanatory variables
    • Also called the predicted value
  • Residuals: Show how far each case is from its model value
    • \(residual = actual~value - model~value\)
    • Tells how far above the model function each case is
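
These definitions map directly onto a fitted model in R. A minimal sketch on made-up data (the `toy` data frame is hypothetical, not the Paris Paintings data), computing model values and residuals by hand and checking them against `residuals()`:

```r
# Made-up widths and heights in inches, for illustration only
toy <- data.frame(
  Width_in  = c(10, 20, 30, 40, 50),
  Height_in = c(12, 19, 33, 38, 52)
)

m <- lm(Height_in ~ Width_in, data = toy)

# Model (predicted) value for each case
model_value <- unname(fitted(m))

# residual = actual value - model value
resid_by_hand <- toy$Height_in - model_value

# residuals() returns the same quantities
all.equal(resid_by_hand, unname(residuals(m)))

data.frame(toy, model_value, residual = resid_by_hand)
```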

Residuals

What does a negative residual mean? Which paintings on the plot have negative residuals?

Multiple explanatory variables

How, if at all, does the relationship between width and height of paintings vary by whether or not they have any landscape elements?
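
One way to explore a question like this is to let the fitted line differ by group. The sketch below is a minimal example on made-up data; the column `landscape` and its values are hypothetical and are not the variable name used in the Paris Paintings data:

```r
# Made-up data: four paintings without and four with landscape elements
toy <- data.frame(
  Width_in  = rep(c(10, 20, 30, 40), times = 2),
  Height_in = c(13, 21, 31, 43,   # landscape = "no"
                7, 10, 15, 22),   # landscape = "yes"
  landscape = factor(rep(c("no", "yes"), each = 4), levels = c("no", "yes"))
)

# Width_in * landscape allows a separate intercept and slope for each group
m <- lm(Height_in ~ Width_in * landscape, data = toy)
coef(m)
```

In the output, `landscapeyes` and `Width_in:landscapeyes` are the differences in intercept and slope for the "yes" group relative to the "no" group.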