October 7, 2014

Relationship between two numerical variables

Batter up


What are some commonly used measures of a baseball team's success?

Moneyball

The movie "Moneyball" focuses on the "quest for the secret of success in baseball". It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player's ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

https://www.youtube.com/watch?v=yGf6LNWY9AI

Data

  • Data from all 30 Major League Baseball teams; we examine the linear relationship between runs scored in a season and a number of other player statistics

  • Goal: To summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team's runs scored in a season

load(url("http://www.openintro.org/stat/data/mlb11.RData"))
names(mlb11)
##  [1] "team"         "runs"         "at_bats"      "hits"        
##  [5] "homeruns"     "bat_avg"      "strikeouts"   "stolen_bases"
##  [9] "wins"         "new_onbase"   "new_slug"     "new_obs"
What does each variable mean?

Visualizing relationships

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?
library(ggplot2)
qplot(at_bats, runs, data = mlb11)

[Scatterplot of at_bats vs. runs]

Correlation

  • Correlation describes the strength of the linear association between two variables.

  • It takes values between -1 (perfect negative) and +1 (perfect positive).

  • A value of 0 indicates no linear association.

  • We use \(\rho\) to indicate the population correlation coefficient, and \(R\) or \(r\) to indicate the sample correlation coefficient.

Correlation examples

Covariance

  • We have previously discussed the variance as a measure of uncertainty of a random variable:

\[ Var(X) = \frac{1}{n}\sum_{i = 1}^n (x_i - \mu_X)^2 \]

  • In order to define correlation we first need to define covariance, which is a generalization of variance to two random variables:

\[ Cov(X,Y) = \frac{1}{n}\sum_{i = 1}^n (x_i - \mu_X) (y_i - \mu_Y) \]

  • Covariance is not a measure of uncertainty but rather a measure of the degree to which \(X\) and \(Y\) tend to be large (or small) at the same time, or the degree to which one tends to be large while the other is small.
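
As a sanity check of this formula on the mlb11 data loaded above (using sample means in place of \(\mu_X\) and \(\mu_Y\)); note that R's built-in cov() divides by \(n-1\) rather than \(n\):

x = mlb11$at_bats
y = mlb11$runs
n = length(x)

# covariance following the formula above, dividing by n
sum((x - mean(x)) * (y - mean(y))) / n

# R's cov() uses the sample version, dividing by n - 1
cov(x, y)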

Covariance, cont.

  • The magnitude of the covariance is not very informative since it is affected by the magnitude of both \(X\) and \(Y\). However, the sign of the covariance tells us something useful about the relationship between \(X\) and \(Y\).

  • Consider the following conditions:
    • If \(x_i > \mu_X\) and \(y_i > \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be positive.
    • If \(x_i < \mu_X\) and \(y_i < \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be positive.
    • If \(x_i > \mu_X\) and \(y_i < \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be negative.
    • If \(x_i < \mu_X\) and \(y_i > \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be negative.

Properties of covariance

  • \(Cov(X,X) = Var(X)\)
  • \(Cov(X,Y) = Cov(Y,X)\)
  • \(Cov(X,Y) = 0\) if \(X\) and \(Y\) are independent
  • \(Cov(X,c) = 0\)
  • \(Cov(aX,bY) = ab~Cov(X,Y)\)
  • \(Cov(X+a,Y+b) = Cov(X,Y)\)
  • \(Cov(X,Y+Z) = Cov(X,Y)+Cov(X,Z)\)
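
These properties are easy to check numerically. A minimal sketch with simulated data (the variable names and sample size are arbitrary); all but the independence property hold exactly for sample covariances, up to floating point:

set.seed(1)
x = rnorm(1000); y = rnorm(1000); z = rnorm(1000)

cov(x, x) - var(x)                      # Cov(X,X) = Var(X): exactly 0
cov(x, y) - cov(y, x)                   # symmetry: exactly 0
cov(x, rep(3, 1000))                    # Cov(X,c) = 0
cov(2 * x, 3 * y) - 6 * cov(x, y)       # Cov(aX,bY) = ab Cov(X,Y)
cov(x + 5, y - 2) - cov(x, y)           # shifts leave covariance unchanged
cov(x, y + z) - (cov(x, y) + cov(x, z)) # Cov(X,Y+Z) = Cov(X,Y) + Cov(X,Z)
cov(x, y)                               # independent X, Y: near (not exactly) 0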

Correlation

  • Since \(Cov(X,Y)\) depends on the magnitude of \(X\) and \(Y\) we would prefer to have a measure of association that is not affected by changes in the scales of the variables.

  • The most common measure of linear association is correlation which is defined as

\[ \rho(X,Y) = \frac{Cov(X,Y)}{\sigma_X \; \sigma_Y} \] \[ -1 \le \rho(X,Y) \le 1 \]

  • The magnitude of the correlation measures the strength of the linear association, and the sign indicates whether the relationship is positive or negative.
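
This rescaling is easy to verify in R on the mlb11 data:

x = mlb11$at_bats
y = mlb11$runs

cov(x, y) / (sd(x) * sd(y))  # covariance rescaled by both standard deviations
cor(x, y)                    # matches R's built-in correlation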

Correlation and independence

Given random variables \(X\) and \(Y\):

  • \(X\) and \(Y\) are independent implies \(Cov(X,Y) = \rho(X,Y) = 0\)

  • \(Cov(X,Y) = \rho(X,Y) = 0\) does not imply \(X\) and \(Y\) are independent
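
A classic example of the second point: if \(X\) is symmetric around zero and \(Y = X^2\), then \(Y\) is a deterministic function of \(X\), yet the correlation is (up to sampling noise) zero. A minimal sketch:

set.seed(42)
x = rnorm(10000)  # symmetric around 0
y = x^2           # completely determined by x

cor(x, y)         # close to 0 despite perfect dependence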

Which of the following is the best guess for the correlation between runs and at_bats?
qplot(at_bats, runs, data = mlb11)

[Scatterplot of at_bats vs. runs]

(a) 0.02 (b) 0.60 (c) -1.5 (d) 0.95 (e) -0.55

Which of the following has the strongest correlation?

[Panel of scatterplots]

Pairwise plots

#install.packages("GGally")
library(GGally)
# subset of numerical variables to plot (the column choice here is illustrative)
d = mlb11[, c("runs", "at_bats", "hits", "homeruns", "bat_avg")]
ggpairs(d)

[Pairwise scatterplot matrix]

Best fit line

Quantifying the best fit

[Scatterplot with a fitted line]

Residual

The residual is the difference between the observed and the predicted \(y\): \[ e_i = y_i - \hat{y}_i \]

[Scatterplot illustrating residuals as vertical distances to the line]

A measure for the best line

  • We want a line that has small residuals (the two criteria below are compared numerically in the sketch after this list):
    • Option 1: Minimize the sum of magnitudes (absolute values) of residuals \[ |e_1| + |e_2| + \cdots + |e_n| \]
    • Option 2: Minimize the sum of squared residuals – least squares \[ e_1^2 + e_2^2 + \cdots + e_n^2 \]
  • Why least squares?
    • Most commonly used
    • Easier to compute (by hand and using software)
    • In many applications, a residual twice as large as another is more than twice as bad
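
A minimal sketch comparing the two criteria for a couple of candidate lines through the mlb11 data; the candidate intercepts and slopes are arbitrary choices for illustration:

x = mlb11$at_bats
y = mlb11$runs

# residuals and both criteria for a candidate line y-hat = b0 + b1 * x
line_score = function(b0, b1) {
  e = y - (b0 + b1 * x)
  c(sum_abs = sum(abs(e)), sum_sq = sum(e^2))
}

line_score(-2789, 0.63)  # close to the least squares line fit below
line_score(-2000, 0.50)  # an alternative candidate for comparison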

The least squares line

\[ \hat{y} = \beta_0 + \beta_1 x \]

  • Intercept: \(\beta_0\), \(b_0\)
    • When x = 0, y is expected to equal the intercept.
  • Slope: \(\beta_1\), \(b_1\)
    • For each unit increase in x, y is expected to increase/decrease on average by the slope.
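
For simple linear regression the least squares estimates have closed forms, \(b_1 = R \frac{s_y}{s_x}\) and \(b_0 = \bar{y} - b_1 \bar{x}\). A minimal sketch computing them by hand, to compare against the lm() output below:

x = mlb11$at_bats
y = mlb11$runs

b1 = cor(x, y) * sd(y) / sd(x)  # slope
b0 = mean(y) - b1 * mean(x)     # intercept: the line passes through (x-bar, y-bar)
c(b0, b1)                       # should match the lm() coefficients below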

fit = lm(runs ~ at_bats, data = mlb11)
summary(fit)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared:  0.373,  Adjusted R-squared:  0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

Write the linear model and interpret the slope and the intercept.
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
at_bats         0.6305     0.1545   4.080 0.000339 ***

Prediction vs. extrapolation

[Scatterplot illustrating prediction within the data range vs. extrapolation beyond it]
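
A minimal sketch of the distinction, using the fit object from above; the specific at_bats values are illustrative:

range(mlb11$at_bats)  # the observed range of the predictor

# prediction at a value inside the observed range
predict(fit, newdata = data.frame(at_bats = 5500))

# extrapolation: lm() happily returns a number far outside the observed
# range, but the linear relationship may not hold out there
predict(fit, newdata = data.frame(at_bats = 8000))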

Measuring the strength of the fit

  • The strength of the fit of a linear model is most commonly evaluated using \(R^2\)

  • \(R^2\) is calculated as the square of the correlation coefficient

  • It tells us what percent of variability in the response variable is explained by the model.

  • The remainder of the variability is due to variables not included in the model or to inherent variability in the data.

  • Sometimes called the coefficient of determination.

cor(mlb11$runs, mlb11$at_bats)^2
## [1] 0.3729

summary(fit)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared:  0.373,  Adjusted R-squared:  0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

Another look at \(R^2\)

  • For a linear regression we have defined the correlation coefficient to be \[R = \text{Cor}(X,Y) = \frac{1}{n-1} \sum_i \left(\frac{x_i-\bar{x}}{s_x}\right) \left(\frac{y_i-\bar{y}}{s_y}\right)\]

  • This definition works fine for the simple linear regression case, where \(X\) and \(Y\) are numerical variables, but does not work well in some of the extensions we will see this week and next week.

  • A better definition is \(R = \text{Cor}(Y,\hat{Y})\), which will work for all regression examples we will see in this class. Additionally, it is equivalent to \(\text{Cor}(X,Y)\) in the case of simple linear regression, and it is useful for obtaining a better understanding of the meaning of \(R^2\).
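
This equivalence is easy to verify with the fit from above:

cor(mlb11$runs, mlb11$at_bats)^2  # R^2 as the square of Cor(X,Y)
cor(mlb11$runs, fitted(fit))^2    # R^2 as the square of Cor(Y, Y-hat): identical
summary(fit)$r.squared            # matches lm's Multiple R-squared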