class: center, middle, inverse, title-slide # Motivating Regression Analysis ### Dr. Maria Tackett ### 01.14.19 --- ## Announcements - Office hours start today. See course website for office hour times and locations. - DVS Intro to R Workshop - Jan 22 at 2p - [https://duke.libcal.com/event/4799048](https://duke.libcal.com/event/4799048) --- ## Agenda - Motivate regression analysis - RStudio & R Markdown --- ## Packages and Data ```r library(tidyverse) library(readr) #read_csv() function library(cowplot) #plot_grid() function ``` ```r advertising <- read_csv("data/advertising.csv") ``` --- ## Example: Sales vs. Advertising - Suppose you are a data scientist on the marketing team and the company wants to improve the sales of their premiere product - You want to understand the association between advertising budget and total sales - If there is an association between advertising and sales, then you can provide advice to the company about how to adjust the advertising budget to improve sales --- ## Advertising vs. Sales ```r glimpse(advertising) ``` ``` ## Observations: 200 ## Variables: 4 ## $ tv <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6… ## $ radio <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2… ## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 2… ## $ sales <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.… ``` -- - <font class="vocab">Observations:</font> 200 markets -- - <font class="vocab">Variables:</font> - `tv`: Spending on TV ads (in $thousands) - `radio`: Spending on radio ads (in $thousands) - `newspaper`: Spending on newspaper ads (in $thousands) - `sales`: total sales (in $millions) --- ## Terminology - `sales` is the <font class="vocab">response variable</font> - variable whose variation we want to understand / variable we wish to predict - also known as *outcome* or *dependent* variable -- - `tv`, `radio`, `newspaper` are the <font class="vocab">predictor variables</font> - variables used to account for variation in the outcome - also known as *explanatory*, *independent*, or *input* variables --- ## Advertising vs. Sales ```r p1 <- advertising %>% ggplot(mapping=aes(x=tv,y=sales)) + geom_point(alpha=0.5) + geom_smooth(method="lm", color="blue",se=FALSE) p2 <- advertising %>% ggplot(mapping=aes(x=radio,y=sales)) + geom_point(alpha=0.5) + geom_smooth(method="lm", color="red", se=FALSE) p3 <- advertising %>% ggplot(mapping=aes(x=newspaper,y=sales)) + geom_point(alpha=0.5) + geom_smooth(method="lm", color="purple", se=FALSE) plot_grid(p1,p2,p3,ncol=3) ``` --- ## Plot Data <img src="01-regression-intro_files/figure-html/show-plot-1-1.png" style="display: block; margin: auto;" /> -- Each line represents model we could use to predict `sales` using `tv`, `radio`, or `newspaper` --- ## Plot Data <img src="01-regression-intro_files/figure-html/show-plot-2-1.png" style="display: block; margin: auto;" /> .alert[ `$$\text{sales} = f(\text{tv}, \text{radio}, \text{newspaper}) + \epsilon$$` ] --- ## Model .alert[ `$$\text{sales} = f(\text{tv}, \text{radio}, \text{newspaper}) + \epsilon$$` ] - **Goal**: Estimate `\(f\)` -- - How do we estimate `\(f\)`? -- - <font class="vocab">Non-parametric methods:</font> estimate `\(f\)` by getting as close to the data points as possible without making explicit assumptions about the functional form of `\(f\)` -- - <font class="vocab">Parametric methods:</font> estimate `\(f\)` by making an assumption about the functional form of `\(f\)` then using the data to fit a model based on that form --- class: middle, center .alert[ In this class, we will focus on <font class="vocab">parametric methods</font> ] <small> *Non-parametric methods covered in STA 325: Machine Learning & Data Mining* </small> --- ## General form of model .alert[ `$$Y = f(\mathbf{X}) + \epsilon$$` ] - `\(Y\)`: quantitative response variable - `\(\mathbf{X} = (X_1, X_2, \ldots, X_p)\)`: predictor variables - `\(f\)`: fixed but unknown function - systematic information `\(\mathbf{X}\)` provides about `\(Y\)` - `\(\epsilon\)`: random error term with mean 0 that is independent of `\(\mathbf{X}\)` --- ## How to estimate `\(f\)`? In general, we will use the following steps to estimate `\(f\)` 1. Choose the functional form of `\(f\)`, i.e. <font class="vocab">choose the appropriate model given the data</font> - Ex: `\(f\)` is a linear model `$$f(\mathbf{X}) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$$` -- 2. Use the data to fit (or train) the model, i.e. <font class="vocab">estimate the model parameters</font> - Ex: Use a method to estimate the model parameters `\(\beta_0, \beta_1, \ldots, \beta_p\)` --- ## Why estimate `\(f\)`? Suppose we have the model `$$\text{sales} = \beta_0 + \beta_1 \times \text{tv} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$$` <br><br><br> -- .question[ What are 1-2 questions you could answer using this model? ] --- ## Why estimate `\(f\)`? There are two types of questions we may be interesting in answering with our model: - <font class="vocab">Prediction: </font> What is the expected `\(Y\)` given particular values of `\(X_1, X_2, \ldots, X_p\)`? -- - Ex: What is the expected `sales` in a market if the company spends $100,000 TV ads, $30,000 on radio ads, and $10,000 on newspaper ads? -- - <font class="vocab">Inference: </font> What is the relationship between `\(\mathbf{X}\)` and `\(Y\)`. How does `\(Y\)` change as a function of `\(\mathbf{X}\)`? -- - Ex: Which type of media advertising generates the largest boost in sales? --- ## Prediction `$$\hat{\text{sales}} = \hat{f}(\text{tv},\text{radio}, \text{newspaper})$$` - **Question**: How accurate is `\(\hat{\text{sales}}\)` as a prediction of `\(\text{sales}\)`? - This depends on two quantities: reducible and irreducible error -- - <font class="vocab">Reducible error:</font> error that can potentially be reduced by using a model , `\(\hat{f}\)`, that is a more appropriate fit for the data -- - <font class="vocab">Irreducible error: </font> Error that cannot be reduced even if the model, `\(\hat{f}\)`, is a perfect fit. In other words, this is variability associated with `\(\epsilon\)` --- class: middle, center .question[ In the case of the `advertising` data, what types of factors are included in the irreducible error? ] --- ## Inference `$$\hat{\text{sales}} = \hat{f}(\text{tv},\text{radio}, \text{newspaper})$$` - **Question**: How is `sales` affected as `tv`, `radio`, and `newspaper` change? -- - We can break this into several questions: -- - What predictors are important for understanding sales? -- - What is the relationship between sales and each predictor? -- - Does the linear model adequately describe the relationship between `sales` and `tv`, `newspaper`, and `radio`, or is a more complex model required? --- ## Course Outline - **Unit 1: Quantitative Response Variables** - Simple Linear Regression - Multiple Linear Regression <br><br> -- - **Unit 2: Categorical Response Variable** - Logistic Regression - Multinomial Logistic Regression --- ## Course Outline - **Unit 3: Modeling in Practice** - Model Validation - Dealing with messy data <br><br> -- - **Unit 4: Looking Ahead** - Time Series - Other models? --- ## Recap - We want to find the function that best fits the relationship between a response variable `\(Y\)` and some predictor variables `\(X_1, \ldots, X_p\)` - We will use **parametric** methods to find the function - Choose the appropriate model - Estimate values of the model parameters - There are two main purposes for the model: - Prediction - Inference --- class: center, middle ## Reproducible data analysis --- ## Reproducibility checklist .question[ What does it mean for a data analysis to be "reproducible"? ] -- **Near-term goals:** - Are the tables and figures reproducible from the code and data? - Does the code actually do what you think it does? - In addition to what was done, is it clear **why** it was done? (e.g., how were parameter settings chosen?) -- **Long-term goals:** - Can the code be used for other data? - Can you extend the code to do other things? --- ## Toolkit <img src="img/01/toolkit.png" width="70%" style="display: block; margin: auto;" /> - Scriptability `\(\rightarrow\)` R - Literate programming (code, narrative, output in one place) `\(\rightarrow\)` R Markdown - Version control `\(\rightarrow\)` Git / GitHub --- class: center, middle # R and RStudio --- ## What is R/RStudio? - R is a statistical programming language - RStudio is a convenient interface for R (an integrated development environment, IDE) - At its simplest:<sup>*</sup> - R is like a car’s engine - RStudio is like a car’s dashboard <img src="img/01/engine-dashboard.png" width="70%" style="display: block; margin: auto;" /> .footnote[ *Source: [Modern Dive](https://moderndive.com/) ] --- ## R/RStudio Tour .center[ [DEMO] ] Concepts introduced: - Console - Using R as a calculator - Environment - Loading and viewing a data frame - Accessing a variable in a data frame - R functions --- ### R essentials (a short list) - **Functions** are (most often) verbs, followed by what they will be applied to in parentheses: ```r do_this(to_this) do_that(to_this, to_that, with_those) ``` -- - **Columns** (variables) in data frames are accessed with `$`: ```r dataframe$var_name ``` -- - **Packages** are installed with the `install.packages` function and loaded with the `library` function, once per session: ```r install.packages("package_name") library(package_name) ``` --- ## tidyverse .pull-left[  ] .pull-right[ .center[ [tidyverse.org](https://www.tidyverse.org/) ] - The tidyverse is an opinionated collection of R packages designed for data science. - All packages share an underlying philosophy and a common grammar. ] --- class: center, middle # R Markdown --- ## R Markdown - Fully reproducible reports -- the analysis is run from the beginning each time you knit - Simple Markdown syntax for text - Code goes in chunks, defined by three backticks, narrative goes outside of chunks --- ## R Markdown Tour .center[ [DEMO] ] -- Concepts introduced: - Copying a project in RStudio Cloud - Knitting documents - R Markdown and (some) R syntax --- ## R Markdown tips **Resources** - [R Markdown cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf) - Markdown Quick Reference: - `Help -> Markdown Quick Reference` <br><br> -- **Remember**: The workspace of the R Markdown document is <u>separate</u> from the console --- ## Workspace vs. console - Run the following in the console ```r x <- 2 x * 3 ``` -- - Then, add the following chunk in your R Markdown document and knit ```r x * 3 ``` -- .question[ What happens? Why the error? ] --- ## How will we use R Markdown? - Every assignment / lab / project / etc. is an R Markdown document - You'll always have a template R Markdown document to start with - The amount of scaffolding in the template will decrease over the semester --- ## Recap Can you answer these questions? **Motivating Regression Analysis** - Given `\(Y = f(\mathbf{X}) + \epsilon\)` - What are the two basic steps for estimating `\(f\)`? - What are two possible reasons for estimating `\(f\)`? **Computing** - What is a reproducible data analysis, and why is it important? - What is R vs. RStudio? --- ## Before next class - Reading 01 posted online - due Wed - Accept the invite to join `STA210-Sp19` organization on GitHub - If you have not already done so, - complete "Getting to know you" survey on Sakai - complete Pretest ([test info & access code](https://www2.stat.duke.edu/courses/Spring19/sta210.001/slides/lab-slides/00-lab-slides.html#8))