# Today’s agenda

• Review App Ex from last time

• Modeling the relationship between variables
• Focus on linear models (but remember there are other types of models too!)
• Application Exercise: model prices of Paris Paintings

• Due Tuesday: Finish App Ex + Reading (you’ll receive an email with a link after class)

# Prepping the data

## Load packages + Paris Paintings data

```r
library(ggplot2)
library(dplyr)
library(stringr)
```

```r
# as_tibble() replaces tbl_df(), which is deprecated in current dplyr
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
  as_tibble()
```

## What’s going on with prices?

```r
class(pp$price)
```

```
## [1] "character"
```

```r
table(pp$price)[1:30]
```

```
##
## 1,000.0 1,000.4 1,001.0 1,002.0 1,004.0 1,005.0 1,006.0 1,008.0 1,011.0 1,035.5 1,050.0 1,051.0
##      20       1       4       4       1       1       2       1       1       2       4       1
## 1,055.0 1,060.0 1,077.0 1,079.0 1,080.0 1,086.0 1,099.0 1,100.0 1,100.5 1,105.0 1,110.0 1,140.0
##       2       1       1       1       1       1       1       6       2       1       2       1
## 1,150.0 1,155.0 1,161.0 1,180.0 1,200.0 1,201.0
##       4       2       1       3      14       5
```

## Let’s first fix those prices

Replace every `,` with `""` (an empty string), and save the variable as numeric:

```r
# str_replace_all() removes every comma, not just the first one in each value
pp <- pp %>%
  mutate(price = as.numeric(str_replace_all(price, ",", "")))
```

Much better…

```r
class(pp$price)
```

```
## [1] "numeric"
```
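
To see why the commas have to go before the conversion, here is a small sketch on a few made-up price strings (the vector `raw` is hypothetical, not taken from the data). It also shows that `str_replace()` removes only the first match in each string, while `str_replace_all()` removes all of them, which matters if any value has more than one comma.

```r
library(stringr)

# Made-up price strings in the same format as the raw column
raw <- c("1,000.0", "950.5", "1,234,567.0")

# Direct conversion gives NA for any value that contains a comma
direct <- suppressWarnings(as.numeric(raw))
is.na(direct)
# TRUE FALSE TRUE

# str_replace() removes only the first comma in each string
first_only <- str_replace(raw, ",", "")
first_only
# "1000.0" "950.5" "1234,567.0"

# str_replace_all() removes every comma, so every value converts
clean <- as.numeric(str_replace_all(raw, ",", ""))
anyNA(clean)
# FALSE
```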

## Prices

Describe the distribution of prices of paintings.
```r
ggplot(data = pp, aes(x = price)) +
  geom_histogram(binwidth = 1000)
```

# Modeling the relationship between variables

## Models as functions

• We can represent relationships between variables using functions

• A function is a mathematical concept: the relationship between an output and one or more inputs.
• Plug in the inputs and receive back the output
• Example: the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\); when \(x\) is \(5\), the output \(y\) is \(22\)
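
The example formula can be written directly as an R function. This is only an illustration of "plug in the inputs and receive back the output"; the name `f` is arbitrary.

```r
# The example function y = 3x + 7
f <- function(x) {
  3 * x + 7
}

# Plug in an input, receive back the output
f(5)
# 22

# R functions are vectorized, so several inputs can be plugged in at once
f(c(0, 1, 2))
# 7 10 13
```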

## Height as a function of width

```r
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point() +
  stat_smooth(method = "lm") # lm for linear model
```
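
The line drawn by `stat_smooth(method = "lm")` is a least squares line, and the same kind of line can be fit directly with `lm()`. The sketch below uses a small made-up data frame (`toy`, with invented widths and heights) so that it runs without the Paris Paintings file; with the real data, `data = pp` would take its place.

```r
# Made-up widths and heights in inches; NOT the Paris Paintings data
toy <- data.frame(
  Width_in  = c(10, 15, 20, 25, 30, 40),
  Height_in = c(9, 12, 18, 19, 26, 31)
)

# Fit height as a linear function of width
m <- lm(Height_in ~ Width_in, data = toy)

# Intercept and slope of the fitted line
coef(m)

# Model (predicted) height for a hypothetical painting 22 inches wide
pred <- predict(m, newdata = data.frame(Width_in = 22))
pred
```

The slope says how much the model height changes, on average, for each additional inch of width.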

## Vocabulary

• Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)

• Explanatory variables: Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)

• Model value: Output of the model function
• The model function gives the typical value of the response variable conditional on the explanatory variables
• Also called the predicted value
• Residuals: Show how far each case is from its model value
• \(residual = actual~value - model~value\)
• Tells how far above the model function each case is
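
As a rough illustration of this vocabulary, the sketch below computes model values and residuals by hand on made-up data (the data frame `toy` and its columns are invented for the example) and checks them against R's `residuals()`.

```r
# Made-up data: width is the explanatory variable, height the response
toy <- data.frame(
  width  = c(10, 15, 20, 25, 30, 40),
  height = c(9, 12, 18, 19, 26, 31)
)

m <- lm(height ~ width, data = toy)

# Model value: output of the model function for each case
toy$model_value <- as.numeric(fitted(m))

# residual = actual value - model value
toy$residual <- toy$height - toy$model_value

# The hand-computed residuals agree with residuals()
all.equal(as.numeric(residuals(m)), toy$residual)

# Cases with a negative residual lie below the fitted line
below <- toy[toy$residual < 0, ]
below
```

For a least squares line with an intercept, the residuals sum to zero, so some cases fall above the line and some below.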

## Residuals

What does a negative residual mean? Which paintings on the plot have negative residuals?

## Multiple explanatory variables

How, if at all, does the relationship between width and height of paintings vary by whether or not they have any landscape elements?
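
One way to explore this question is to let the line differ by group. The sketch below uses made-up data with a hypothetical indicator column `landscape`; the name of the corresponding column in the Paris Paintings data is not given here, so treat it as an assumption. An interaction term in `lm()` gives each group its own intercept and slope.

```r
# Made-up data: four paintings without and four with landscape elements
toy <- data.frame(
  width     = c(10, 20, 30, 40, 10, 20, 30, 40),
  height    = c(12, 22, 33, 41, 7, 13, 18, 26),
  landscape = factor(rep(c("no", "yes"), each = 4))
)

# width * landscape expands to width + landscape + width:landscape
m <- lm(height ~ width * landscape, data = toy)
b <- coef(m)

# Slope of height on width within each group
slope_no  <- b[["width"]]
slope_yes <- b[["width"]] + b[["width:landscapeyes"]]

c(no = slope_no, yes = slope_yes)
```

If the two slopes are close, the relationship between width and height looks similar in both groups; a clear difference suggests it varies with the presence of landscape elements. In a ggplot2 scatterplot, mapping the indicator to `color` inside `aes()` draws one fitted line per group.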