September 17, 2015

Review App Ex from last time

- Modeling the relationship between variables
- Focus on *linear* models (but remember there are other types of models too!)

Application Exercise: model prices of Paris Paintings

**Due Tuesday:** Finish App Ex + Reading (you'll receive an email with a link after class)

library(ggplot2)
library(dplyr)
library(stringr)

pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>% tbl_df()

class(pp$price)

## [1] "character"

table(pp$price)[1:10]

##
## 1,000.0 1,000.4 1,001.0 1,002.0 1,004.0 1,005.0 1,006.0 1,008.0 1,011.0 1,035.5
##      20       1       4       4       1       1       2       1       1       2

Replace `,` with an empty string, and save the variable as numeric:

pp <- pp %>% mutate(price = as.numeric(str_replace(price, ",", "")))
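One detail worth checking when cleaning numbers this way: `str_replace()` replaces only the first match in each string, while `str_replace_all()` replaces every match. The sketch below uses the base R analogues `sub()` and `gsub()` so it runs without extra packages. The two price strings are made up for illustration and are not taken from `pp`.

```r
# Hypothetical price strings; the second has two thousands separators.
prices <- c("1,000.0", "1,234,567.5")

# sub() replaces only the first match, like stringr::str_replace().
first_only <- sub(",", "", prices, fixed = TRUE)

# gsub() replaces every match, like stringr::str_replace_all().
all_matches <- gsub(",", "", prices, fixed = TRUE)

first_only               # the second string still contains a comma
all_matches              # no commas left
as.numeric(all_matches)  # converts cleanly to numeric
```

If every price in the data has at most one comma, both approaches give the same result. Replacing all matches is the safer choice when that is not known in advance.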

Much better…

class(pp$price)

## [1] "numeric"

Describe the distribution of prices of paintings.

ggplot(data = pp, aes(x = price)) + geom_histogram(binwidth = 1000)

We can represent relationships between variables using **functions**.

- A function is a mathematical concept: the relationship between an output and one or more inputs.
- Plug in the inputs and receive back the output
- Example: the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\); when \(x\) is \(5\), the output \(y\) is \(22\)
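The same example can be written as an R function. This is a minimal sketch, and the name `f` is chosen here only for illustration.

```r
# The function y = 3x + 7: input x, output y.
f <- function(x) 3 * x + 7

f(5)
## [1] 22
```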

ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + stat_smooth(method = "lm") # lm for linear model

- **Response variable:** Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)
- **Explanatory variables:** Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)
- **Model value:** Output of the **model function**
  - The model function gives the typical value of the response variable *conditioning* on the explanatory variables
  - Also called the **predicted value**
- **Residuals:** Show how far each case is from its model value
  - \(residual = actual~value - model~value\)
  - Tells how far above the model function each case is
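To make the vocabulary concrete, here is a small sketch that computes model values and residuals by hand. The data frame `toy` holds five made-up paintings, not rows from `pp`, so the numbers are illustrative only.

```r
# Hypothetical widths and heights (inches) for five paintings; not from pp.
toy <- data.frame(
  Width_in  = c(10, 20, 30, 40, 50),
  Height_in = c(12, 18, 29, 33, 45)
)

m <- lm(Height_in ~ Width_in, data = toy)

model_value <- fitted(m)                    # output of the model function
residual    <- toy$Height_in - model_value  # actual value - model value

# resid() returns the same quantities as the manual calculation.
all.equal(unname(residual), unname(resid(m)))

# Rows whose residual is negative.
toy[residual < 0, ]
```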

What does a negative residual mean? Which paintings on the plot have negative residuals?

How, if at all, does the relationship between width and height of paintings vary by whether or not they have any landscape elements?

Here is the code for the two plots in the previous slide

# points colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in, color = factor(landsALL))) +
  geom_point(alpha = 0.4) +
  stat_smooth(method = "lm")

# points not colored by landsALL type
ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point(alpha = 0.4) +
  stat_smooth(method = "lm")

Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data.

There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.

The variation around the model is just as important as the model, if not more!

*Statistics is the explanation of variation in the context of what remains unexplained.*

The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.

Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.)

- Explanation: Characterize the relationship between \(y\) and \(x\) via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables
- Prediction: Plug in \(x\), get the predicted \(y\)

lm(Height_in ~ Width_in, data = pp)

##
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
##
## Coefficients:
## (Intercept)     Width_in
##      3.6214       0.7808

\[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]

- **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.
- **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
  - Does this make sense?
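The intercept and slope can be pulled out of a fitted model with `coef()`. The sketch below uses made-up data built to lie exactly on the line \(Height_{in} = 2 + 0.5~Width_{in}\), so the recovered coefficients are known in advance. These values are an assumption for illustration and are unrelated to the Paris Paintings estimates.

```r
# Hypothetical data constructed to lie exactly on Height_in = 2 + 0.5 * Width_in.
toy <- data.frame(Width_in = c(10, 20, 30, 40))
toy$Height_in <- 2 + 0.5 * toy$Width_in

m <- lm(Height_in ~ Width_in, data = toy)
b <- coef(m)        # named vector: "(Intercept)" and "Width_in"

b[["(Intercept)"]]  # expected height at a width of 0
b[["Width_in"]]     # expected change in height per extra inch of width
```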

lm(Height_in ~ factor(landsALL), data = pp)

##
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
##
## Coefficients:
##       (Intercept)  factor(landsALL)1
##            22.680             -5.645

\[ \widehat{Height_{in}} = 22.68 - 5.65~landsALL \]

- **Slope:** Paintings that have some landscape features are expected, on average, to be 5.65 inches shorter than paintings that don't have landscape features.
  - Compares the baseline level (`landsALL = 0`) to the other level (`landsALL = 1`).
- **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.
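With a single two-level explanatory variable, the coefficients are simply group means in disguise. The sketch below checks this on six made-up paintings (the `toy` data frame is hypothetical and not drawn from `pp`): the intercept matches the mean of the baseline group, and the other coefficient matches the difference between the two group means.

```r
# Hypothetical heights for paintings without (0) and with (1) landscape elements.
toy <- data.frame(
  landsALL  = c(0, 0, 0, 1, 1, 1),
  Height_in = c(20, 24, 25, 15, 18, 21)
)

m <- lm(Height_in ~ factor(landsALL), data = toy)
b <- coef(m)

# Mean height in each group, computed directly.
group_means <- tapply(toy$Height_in, toy$landsALL, mean)

b[["(Intercept)"]]                     # mean of the baseline group (landsALL = 0)
b[["factor(landsALL)1"]]               # difference: group 1 mean - group 0 mean
group_means[["1"]] - group_means[["0"]]
```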

lm(Height_in ~ school_pntg, data = pp)

##
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
##
## Coefficients:
##     (Intercept)  school_pntgD/FL     school_pntgF     school_pntgG     school_pntgI
##          14.000            2.329           10.197            1.650           10.287
##    school_pntgS     school_pntgX
##          30.429            2.869

When the categorical explanatory variable has many levels, the levels are encoded as **dummy variables**. Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.
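To see what the dummy variables look like, `model.matrix()` shows the columns R builds from a factor. The sketch below uses a made-up five-row data frame with three school codes. The values are hypothetical and chosen only so the column names resemble the output above.

```r
# Hypothetical school codes for five paintings; not rows from pp.
toy <- data.frame(school_pntg = factor(c("A", "D/FL", "F", "D/FL", "A")))

# One column per non-baseline level; the first level ("A") is the baseline.
X <- model.matrix(~ school_pntg, data = toy)

colnames(X)
X
```

Each row has a 1 in the column for its own school and 0 elsewhere. Rows from the baseline level have 0 in every dummy column, which is why the baseline is absorbed into the intercept.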

Remember this when interpreting model coefficients

On average, how tall are paintings that are 60 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]

3.62 + 0.78 * 60

## [1] 50.42

"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

**Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)
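The plug-in calculation can also be wrapped in a small function or, given a fitted model object, done with `predict()`. In the sketch below, `predict_height` is a hypothetical helper that uses the rounded coefficients from the equation above. The `toy` model is fitted to made-up data lying exactly on \(Height_{in} = 2 + 0.5~Width_{in}\), purely to show the `predict()` call.

```r
# Helper using the rounded coefficients from the fitted equation (3.62 and 0.78).
predict_height <- function(width_in) 3.62 + 0.78 * width_in

predict_height(60)
## [1] 50.42

# Same idea with predict(), on a model fitted to hypothetical data.
toy <- data.frame(Width_in = c(10, 20, 30, 40))
toy$Height_in <- 2 + 0.5 * toy$Width_in
m <- lm(Height_in ~ Width_in, data = toy)

# newdata must contain a column named like the explanatory variable.
toy_pred <- predict(m, newdata = data.frame(Width_in = 60))
toy_pred
```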

On average, how tall are paintings that are 400 inches wide? \[ \widehat{Height_{in}} = 3.62 + 0.78~Width_{in} \]