Motivating Regression Analysis

# Motivating Regression Analysis
### Dr. Maria Tackett
### 01.14.19

---

## Announcements

- Office hours start today. See course website for office hour times and locations.

- DVS Intro to R Workshop - Jan 22 at 2p
     - [https://duke.libcal.com/event/4799048](https://duke.libcal.com/event/4799048)

---

## Agenda

- Motivate regression analysis

- RStudio & R Markdown

---

## Packages and Data

```r
library(tidyverse)
library(readr) #read_csv() function
library(cowplot) #plot_grid() function
```

```r
advertising <- read_csv("data/advertising.csv")
```

---

## Example: Sales vs. Advertising

- Suppose you are a data scientist on the marketing team and the company wants to improve the sales of their premiere product

- You want to understand the association between advertising budget and total sales

- If there is an association between advertising and sales, then you can provide advice to the company about how to adjust the advertising budget to improve sales

---

## Advertising vs. Sales

```r
glimpse(advertising)
```

```
## Observations: 200
## Variables: 4
## $ tv <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6…
## $ radio <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 2…
## $ sales <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.…
```

- Observations: 200 markets

- Variables: 
 - `tv`: Spending on TV ads (in $thousands)
 - `radio`: Spending on radio ads (in $thousands)
 - `newspaper`: Spending on newspaper ads (in $thousands)
 - `sales`: total sales (in $millions)
 
---

## Terminology

- `sales` is the response variable
 - variable whose variation we want to understand / variable we wish to predict
 - also known as *outcome* or *dependent* variable

- `tv`, `radio`, `newspaper` are the predictor variables
 - variables used to account for variation in the outcome
 - also known as *explanatory*, *independent*, or *input* variables

---

## Advertising vs. Sales

```r
p1 <- advertising %>% 
 ggplot(mapping=aes(x=tv,y=sales)) + 
 geom_point(alpha=0.5) + 
 geom_smooth(method="lm", color="blue",se=FALSE)

p2 <- advertising %>% 
 ggplot(mapping=aes(x=radio,y=sales)) + 
 geom_point(alpha=0.5) + 
 geom_smooth(method="lm", color="red", se=FALSE)

p3 <- advertising %>% 
 ggplot(mapping=aes(x=newspaper,y=sales)) + 
 geom_point(alpha=0.5) + 
 geom_smooth(method="lm", color="purple", se=FALSE)

plot_grid(p1,p2,p3,ncol=3)
```

---

## Plot Data

Each line represents model we could use to predict `sales` using `tv`, `radio`, or `newspaper`

---

## Plot Data

---

## Model

- **Goal**: Estimate `$f$`

- How do we estimate `$f$`?

- Non-parametric methods: estimate `$f$` by getting as close to the data points as possible without making explicit assumptions about the functional form of `$f$`
 
--

- Parametric methods: estimate `$f$` by making an assumption about the functional form of `$f$` then using the data to fit a model based on that form 
 
---

*Non-parametric methods covered in STA 325: Machine Learning & Data Mining*

---

## General form of model

- `$Y$`: quantitative response variable

- `$\mathbf{X} = (X_1, X_2, \ldots, X_p)$`: predictor variables

- `$f$`: fixed but unknown function
    - systematic information `$\mathbf{X}$` provides about `$Y$`

- `$\epsilon$`: random error term with mean 0 that is independent of `$\mathbf{X}$` 
   
---

## How to estimate `$f$`?

In general, we will use the following steps to estimate `$f$`

1. Choose the functional form of `$f$`, i.e. choose the appropriate model given the data
 - Ex: `$f$` is a linear model
 `$$f(\mathbf{X}) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$$`

2. Use the data to fit (or train) the model, i.e. estimate the model parameters
 - Ex: Use a method to estimate the model parameters `$\beta_0, \beta_1, \ldots, \beta_p$`

---

## Why estimate `$f$`?

Suppose we have the model

`$$\text{sales} = \beta_0 + \beta_1 \times \text{tv} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$$`

---

## Why estimate `$f$`?

There are two types of questions we may be interesting in answering with our model:

- Prediction: What is the expected `$Y$` given particular values of `$X_1, X_2, \ldots, X_p$`? 
--

- Ex: What is the expected `sales` in a market if the company spends $100,000 TV ads, $30,000 on radio ads, and $10,000 on newspaper ads?

- Inference: What is the relationship between `$\mathbf{X}$` and `$Y$`. How does `$Y$` change as a function of `$\mathbf{X}$`?
--

- Ex: Which type of media advertising generates the largest boost in sales?

---

## Prediction

`$$\hat{\text{sales}} = \hat{f}(\text{tv},\text{radio}, \text{newspaper})$$`

- **Question**: How accurate is `$\hat{\text{sales}}$` as a prediction of `$\text{sales}$`?

- This depends on two quantities: reducible and irreducible error

--
 - Reducible error: error that can potentially be reduced by using a model , `$\hat{f}$`, that is a more appropriate fit for the data

--
 - Irreducible error: Error that cannot be reduced even if the model, `$\hat{f}$`, is a perfect fit. In other words, this is variability associated with `$\epsilon$`

---

.question[
In the case of the `advertising` data, what types of factors are included in the irreducible error?
]

---

## Inference

`$$\hat{\text{sales}} = \hat{f}(\text{tv},\text{radio}, \text{newspaper})$$`

- **Question**: How is `sales` affected as `tv`, `radio`, and `newspaper` change?
--

- We can break this into several questions:

--
    - What predictors are important for understanding sales? 
--
    - What is the relationship between sales and each predictor? 
--
    - Does the linear model adequately describe the relationship between `sales` and `tv`, `newspaper`, and `radio`, or is a more complex model required?

---

## Course Outline

- **Unit 1: Quantitative Response Variables**
 - Simple Linear Regression 
 - Multiple Linear Regression
 
--

- **Unit 2: Categorical Response Variable**
    - Logistic Regression 
    - Multinomial Logistic Regression

---

## Course Outline

- **Unit 3: Modeling in Practice**
 - Model Validation
 - Dealing with messy data
 
--

- **Unit 4: Looking Ahead**
    - Time Series 
    - Other models? 
 
---

## Recap

- We want to find the function that best fits the relationship between a response variable `$Y$` and some predictor variables `$X_1, \ldots, X_p$`

- We will use **parametric** methods to find the function 
    - Choose the appropriate model
    - Estimate values of the model parameters

- There are two main purposes for the model: 
    - Prediction 
    - Inference

---

## Reproducible data analysis

---

## Reproducibility checklist

**Near-term goals:**

- Are the tables and figures reproducible from the code and data?
- Does the code actually do what you think it does?
- In addition to what was done, is it clear **why** it was done? 
(e.g., how were parameter settings chosen?)

**Long-term goals:**

- Can the code be used for other data?
- Can you extend the code to do other things?
---

## Toolkit

- Scriptability `$\rightarrow$` R
- Literate programming (code, narrative, output in one place) `$\rightarrow$` R Markdown
- Version control `$\rightarrow$` Git / GitHub

---

# R and RStudio

---

## What is R/RStudio?

- R is a statistical programming language
- RStudio is a convenient interface for R (an integrated development environment, IDE)
- At its simplest:*
 - R is like a car’s engine
 - RStudio is like a car’s dashboard

---

## R/RStudio Tour

Concepts introduced:

- Console
- Using R as a calculator
- Environment
- Loading and viewing a data frame
- Accessing a variable in a data frame
- R functions

---

### R essentials (a short list)

- **Functions** are (most often) verbs, followed by what they will be applied to in parentheses:

```r
do_this(to_this)
do_that(to_this, to_that, with_those)
```

- **Columns** (variables) in data frames are accessed with `$`:

```r
dataframe$var_name
```

- **Packages** are installed with the `install.packages` function and loaded with the `library` function, once per session:

```r
install.packages("package_name")
library(package_name)
```

---

## tidyverse

- The tidyverse is an opinionated collection of R packages designed for data science. 
- All packages share an underlying philosophy and a common grammar. 
]

---

# R Markdown

---

## R Markdown

- Fully reproducible reports -- the analysis is run from the beginning each time you knit

- Simple Markdown syntax for text

- Code goes in chunks, defined by three backticks, narrative goes outside of chunks

---

## R Markdown Tour

Concepts introduced:

- Copying a project in RStudio Cloud
- Knitting documents
- R Markdown and (some) R syntax

---

## R Markdown tips

**Resources**
- [R Markdown cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf)
- Markdown Quick Reference: 
 - `Help -> Markdown Quick Reference`
 
--

**Remember**: The workspace of the R Markdown document is separate from the console

---

## Workspace vs. console

- Run the following in the console

```r
x <- 2
x * 3
```

- Then, add the following chunk in your R Markdown document and knit

```r
x * 3
```

---

## How will we use R Markdown?

- Every assignment / lab / project / etc. is an R Markdown document

- You'll always have a template R Markdown document to start with

- The amount of scaffolding in the template will decrease over the semester

---

## Recap

Can you answer these questions?

**Motivating Regression Analysis**
- Given `$Y = f(\mathbf{X}) + \epsilon$`
    - What are the two basic steps for estimating `$f$`? 
    - What are two possible reasons for estimating `$f$`?

**Computing**    
- What is a reproducible data analysis, and why is it important?
- What is R vs. RStudio?

---

## Before next class

- Reading 01 posted online - due Wed

- Accept the invite to join `STA210-Sp19` organization on GitHub

- If you have not already done so, 
    - complete "Getting to know you" survey on Sakai 
    - complete Pretest ([test info & access code](https://www2.stat.duke.edu/courses/Spring19/sta210.001/slides/lab-slides/00-lab-slides.html#8))