September 17, 2015

## Today's agenda

• Review App Ex from last time

• Modeling the relationship between variables
• Focus on linear models (but remember there are other types of models too!)
• Application Exercise: model prices of Paris Paintings

• Due Tuesday: Finish App Ex + Reading (you'll receive an email with a link after class)

## Load packages + Paris Paintings data

```library(ggplot2)
library(dplyr)
library(stringr)```
```pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
tbl_df()```

## What's going on with prices?

`class(pp\$price)`
`## [1] "character"`
`table(pp\$price)[1:10]`
```##
## 1,000.0 1,000.4 1,001.0 1,002.0 1,004.0 1,005.0 1,006.0 1,008.0 1,011.0 1,035.5
##      20       1       4       4       1       1       2       1       1       2```

## Let's first fix those prices

Replace `,` with `` (blank), and save the variable as numeric:

```pp <- pp %>%
mutate(price = as.numeric(str_replace(price, ",", "")))```

Much better…

`class(pp\$price)`
`## [1] "numeric"`

## Prices

Describe the distribution of prices of paintings.
```ggplot(data = pp, aes(x = price)) +
geom_histogram(binwidth = 1000)```

## Models as functions

• We can represent relationships between variables using function

• A function is a mathematical concept: the relationship between an output and one or more inputs.
• Plug in the inputs and receive back the output
• Example: the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\), when \(x\) is \(5\), the output \(y\) is \(22\)

## Height as a function of width

```ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
geom_point() +
stat_smooth(method = "lm") # lm for linear model```

## Vocabulary

• Response variable: Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)

• Explanatory variables: Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)

• Model value: Output of the function model function
• The model function gives the typical value of the response variable conditioning on the explanatory variables
• Also called the predicted value
• Residuals: Show how far each case is from its model value
• \(residual = actual~value - model~value\)
• Tells how far above the model function each case is

## Residuals

What does a negative residual mean? Which paintings on the plot have have negative residuals?

## Multiple explanatory variables

How, if at all, the relatonship between width and height of paintings vary by whether or not they have any landscape elements?

## Just for reference…

Here is the code for the two plots in the previous slide

```# points colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in, color = factor(landsALL))) +
geom_point(alpha = 0.4) +
stat_smooth(method = "lm")```
```# points not colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
geom_point(alpha = 0.4) +
stat_smooth(method = "lm")```

## Models - upsides and downsides

• Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data.

• There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.

## Variation around the model…

is just as important as the model, if not more!

Statistics is the explanation of variation in the context of what remains unexplained.

• The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.

• Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.)

## How do we use models?

1. Explanation: Characterize the relationship between \(y\) and \(x\) via slopes for numerical explanatory variables or differences for categorical explanatory variables

2. Prediction: Plug in \(x\), get the predicted \(y\)

## Relationship between height and width

`lm(Height_in ~ Width_in, data = pp)`
```##
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
##
## Coefficients:
## (Intercept)     Width_in
##      3.6214       0.7808```

\[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]

• Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.

• Intercept: Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
• Does this make sense?

## Relationship between height and landscape features

`lm(Height_in ~ factor(landsALL), data = pp)`
```##
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
##
## Coefficients:
##       (Intercept)  factor(landsALL)1
##            22.680             -5.645```

\[ \widehat{Height_{in}} = 22.68 - 5.65~landsALL \]

• Slope: Paintings that have some landscape features are expected, on average, to be 5.65 inches shorter than paintings that don't have landscape features.
• Compares the baseline level (`landsALL = 0`) to the other level (`landsALL = 1`).
• Intercept: Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.

## Relationship between height and school

`lm(Height_in ~ school_pntg, data = pp)`
```##
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
##
## Coefficients:
##     (Intercept)  school_pntgD/FL     school_pntgF     school_pntgG     school_pntgI
##          14.000            2.329           10.197            1.650           10.287
##    school_pntgS     school_pntgX
##          30.429            2.869```
• When the categorical explanatory variable has many levels, they're encoded to dummy variables.

• Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.

## Correlation does not imply causation!

Remember this when interpreting model coefficients

## Predict height from width

On average, how tall are paintings that are 60 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]
`3.62 + 0.78 * 60`
`## [1] 50.42`

"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

Warning: We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)

## Prediction vs. extrapolation

On average, how tall are paintings that are 400 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]