Lecture 13 - Modeling the relationship between two numerical variables

What are some commonly used measures of a baseball team’s success?

The movie “Moneyball” focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

We use data from all 30 Major League Baseball teams to examine the linear relationship between runs scored in a season and a number of other player statistics.

Goal: To summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season

```
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
names(mlb11)
```

```
## [1] "team" "runs" "at_bats" "hits"
## [5] "homeruns" "bat_avg" "strikeouts" "stolen_bases"
## [9] "wins" "new_onbase" "new_slug" "new_obs"
```

What does each variable mean?

What type of plot would you use to display the relationship between `runs` and one of the other numerical variables? Plot this relationship using the variable `at_bats` as the predictor. Does the relationship look linear? If you knew a team’s `at_bats`, would you be comfortable using a linear model to predict the number of runs?

`qplot(at_bats, runs, data = mlb11)`

**Correlation** describes the strength of the linear association between two variables. It takes values between -1 (perfect negative) and +1 (perfect positive).

A value of 0 indicates no linear association.

We use \(\rho\) to indicate the population correlation coefficient, and \(R\) or \(r\) to indicate the sample correlation coefficient.

- We have previously discussed the variance as a measure of uncertainty of a random variable:

\[ Var(X) = \frac{1}{n}\sum_{i = 1}^n (x_i - \mu_X)^2 \]

- In order to define correlation we first need to define covariance, which is a generalization of variance to two random variables:

\[ Cov(X,Y) = \frac{1}{n}\sum_{i = 1}^n (x_i - \mu_X) (y_i - \mu_Y) \]

- Covariance is not a measure of uncertainty but rather a measure of the degree to which \(X\) and \(Y\) tend to be large (or small) at the same time, or the degree to which one tends to be large while the other is small.

The magnitude of the covariance is not very informative since it is affected by the magnitude of both \(X\) and \(Y\). However, the sign of the covariance tells us something useful about the relationship between \(X\) and \(Y\).
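The two formulas above can be sketched directly in R. The data below are made up for illustration (they are not from `mlb11`), and the helper names `pop_var` and `pop_cov` are hypothetical. The sketch treats the five observations as the whole population, so the sample mean stands in for \(\mu_X\) and \(\mu_Y\).

```r
# Toy data, invented for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Variance and covariance dividing by n, as in the formulas above
# (pop_var and pop_cov are hypothetical helper names)
pop_var <- function(v) mean((v - mean(v))^2)
pop_cov <- function(u, v) mean((u - mean(u)) * (v - mean(v)))

pop_var(x)     # 8
pop_cov(x, y)  # 3.2

# R's built-in var() and cov() divide by n - 1 rather than n,
# so they differ from the versions above by a factor of (n - 1) / n
all.equal(pop_var(x), var(x) * (n - 1) / n)
all.equal(pop_cov(x, y), cov(x, y) * (n - 1) / n)

# Changing the units of x by a factor of 100 multiplies the covariance by 100,
# even though the relationship between x and y has not changed
pop_cov(100 * x, y) / pop_cov(x, y)
```

The last line illustrates why the magnitude of the covariance is hard to interpret on its own: it changes with the measurement scale.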

- Consider the following cases for a single observation \((x_i, y_i)\):
  - If \(x_i > \mu_X\) and \(y_i > \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be positive.
  - If \(x_i < \mu_X\) and \(y_i < \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be positive.
  - If \(x_i > \mu_X\) and \(y_i < \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be negative.
  - If \(x_i < \mu_X\) and \(y_i > \mu_Y\), then \((x_i-\mu_X)(y_i-\mu_Y)\) will be negative.
- Observations on the same side of both means therefore push the covariance up, and observations on opposite sides push it down.
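As a small illustration of the four cases, the sketch below tabulates the deviations and their products for a made-up data set (not from `mlb11`). The sample means stand in for \(\mu_X\) and \(\mu_Y\).

```r
# Toy data, invented for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(1, 4, 2, 3, 5)

dx <- x - mean(x)   # deviation of each x from its mean
dy <- y - mean(y)   # deviation of each y from its mean
cross <- dx * dy    # the term that is averaged in the covariance formula

# One row per observation; sign is +1, -1, or 0 when a point sits on a mean
data.frame(x, y, dx, dy, cross, sign = sign(cross))

# Averaging the products (dividing by n) gives the covariance
mean(cross)
```

Here the second observation has \(x_i\) below its mean and \(y_i\) above its mean, so its product is negative, while the first and last observations contribute positive products. The positive products dominate, so the covariance is positive.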

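Covariance has the following properties, for random variables \(X\), \(Y\), \(Z\) and constants \(a\), \(b\), \(c\):

- \(Cov(X,X) = Var(X)\)
- \(Cov(X,Y) = Cov(Y,X)\)
- \(Cov(X,Y) = 0\) if \(X\) and \(Y\) are independent
- \(Cov(X,c) = 0\)
- \(Cov(aX,bY) = ab~Cov(X,Y)\)
- \(Cov(X+a,Y+b) = Cov(X,Y)\)
- \(Cov(X,Y+Z) = Cov(X,Y)+Cov(X,Z)\)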
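Most of these identities also hold exactly for the sample covariance, so they can be checked numerically. The sketch below uses simulated data; the seed, sample size, and constants are arbitrary choices. The independence property is left out because the sample covariance of independent draws is only approximately zero.

```r
set.seed(1)     # arbitrary seed, used only for reproducibility
x <- rnorm(50)
y <- rnorm(50)
z <- rnorm(50)
a <- 3
b <- -2

# Each comparison should return TRUE (up to floating-point tolerance)
all.equal(cov(x, x), var(x))                      # Cov(X, X) = Var(X)
all.equal(cov(x, y), cov(y, x))                   # symmetry
all.equal(cov(x, rep(5, 50)), 0)                  # covariance with a constant
all.equal(cov(a * x, b * y), a * b * cov(x, y))   # scaling
all.equal(cov(x + a, y + b), cov(x, y))           # shifting
all.equal(cov(x, y + z), cov(x, y) + cov(x, z))   # additivity
```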

Since \(Cov(X,Y)\) depends on the magnitude of \(X\) and \(Y\), we would prefer to have a measure of association that is not affected by changes in the scales of the variables.

The most common measure of *linear* association is correlation, which is defined as

\[ \rho(X,Y) = \frac{Cov(X,Y)}{\sigma_X \; \sigma_Y} \] \[ -1 \le \rho(X,Y) \le 1 \]

- The magnitude of the correlation measures the strength of the *linear* association, and the sign indicates whether the relationship is positive or negative.
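The definition can be sketched in R using the sample analogues `cov()` and `sd()`. The data are made up for illustration (not from `mlb11`), and `r_manual` is a hypothetical variable name.

```r
# Toy data, invented for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Sample correlation: covariance divided by the product of standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual                           # approximately 0.8
all.equal(r_manual, cor(x, y))     # matches the built-in cor()

# Unlike covariance, correlation is unchanged by shifting or
# rescaling either variable by a positive factor
all.equal(cor(100 * x, y / 12 + 7), cor(x, y))

# Reversing the direction of one variable flips the sign but not the magnitude
all.equal(cor(-x, y), -cor(x, y))
```

Because `cov()` and `sd()` both use the \(n - 1\) divisor, the divisors cancel and the result is the same as it would be with the \(1/n\) versions.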

Given random variables \(X\) and \(Y\):

\(X\) and \(Y\) are independent implies \(Cov(X,Y) = \rho(X,Y) = 0\)

\(Cov(X,Y) = \rho(X,Y) = 0\) does not imply \(X\) and \(Y\) are independent
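A standard illustration of the second statement is a variable that is symmetric around zero paired with its own square. The sketch below uses a small deterministic data set as a stand-in for such random variables; the values are chosen only for illustration.

```r
# x is symmetric around 0 and y is completely determined by x
x <- -3:3
y <- x^2

cov(x, y)   # 0
cor(x, y)   # 0

# Yet knowing x tells us y exactly, so the two are clearly not independent
data.frame(x, y)
```

The relationship is perfectly predictable but not linear, which is why the correlation fails to detect it.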

Which of the following is the best guess for the correlation between `runs` and `at_bats`?

`qplot(at_bats, runs, data = mlb11)`

(a) 0.02 (b) 0.60 (c) -1.5 (d) 0.95 (e) -0.55

Which of the following has the strongest correlation?