October 7, 2014
The movie "Moneyball" focuses on the "quest for the secret of success in baseball". It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player's ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
We will use data from all 30 Major League Baseball teams to examine the linear relationship between runs scored in a season and a number of other player statistics.
Goal: To summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team's runs scored in a season
load(url("http://www.openintro.org/stat/data/mlb11.RData"))
names(mlb11)
## [1] "team" "runs" "at_bats" "hits" ## [5] "homeruns" "bat_avg" "strikeouts" "stolen_bases" ## [9] "wins" "new_onbase" "new_slug" "new_obs"
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?

library(ggplot2)
qplot(at_bats, runs, data = mlb11)
Correlation describes the strength of the linear association between two variables.
It takes values between -1 (perfect negative) and +1 (perfect positive).
A value of 0 indicates no linear association.
We use \(\rho\) to indicate the population correlation coefficient, and \(R\) or \(r\) to indicate the sample correlation coefficient.
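A small R sketch (an added illustration, not from the original slides) of these bounds; the vectors here are made up for the example:

set.seed(42)
x <- rnorm(100)
cor(x,  2 * x + 3)   # exactly  1: perfect positive linear association
cor(x, -2 * x + 3)   # exactly -1: perfect negative linear association
cor(x, rnorm(100))   # near 0: no linear association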
[Figure: scatterplots illustrating a range of correlation values, from http://en.wikipedia.org/wiki/Correlation_and_dependence]
\[ Var(X) = \frac{1}{n}\sum_{i = 1}^n (x_i - \mu_X)^2 \]
\[ Cov(X,Y) = \frac{1}{n}\sum_{i = 1}^n (x_i - \mu_X) (y_i - \mu_Y) \]
The magnitude of the covariance is not very informative since it is affected by the magnitude of both \(X\) and \(Y\). However, the sign of the covariance tells us something useful about the relationship between \(X\) and \(Y\).
Since \(Cov(X,Y)\) depends on the magnitude of \(X\) and \(Y\) we would prefer to have a measure of association that is not affected by changes in the scales of the variables.
The most common measure of linear association is correlation, which is defined as
\[ \rho(X,Y) = \frac{Cov(X,Y)}{\sigma_X \; \sigma_Y} \] \[ -1 \le \rho(X,Y) \le 1 \]
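We can verify this definition with the mlb11 data already loaded, computing the sample version by hand and comparing it to R's built-in cor() (an added check, not in the original slides):

# sample covariance scaled by both sample standard deviations
cov(mlb11$runs, mlb11$at_bats) / (sd(mlb11$runs) * sd(mlb11$at_bats))
cor(mlb11$runs, mlb11$at_bats)   # same value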
Given random variables \(X\) and \(Y\):
\(X\) and \(Y\) are independent implies \(Cov(X,Y) = \rho(X,Y) = 0\)
\(Cov(X,Y) = \rho(X,Y) = 0\) does not imply \(X\) and \(Y\) are independent
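The classic counterexample behind the second point: if \(X\) is standard normal and \(Y = X^2\), then \(Cov(X,Y) = 0\) even though \(Y\) is completely determined by \(X\). A quick simulation (an added sketch, not from the original slides):

set.seed(1)
x <- rnorm(1e5)
y <- x^2      # y is a deterministic function of x
cov(x, y)     # approximately 0
cor(x, y)     # approximately 0, yet x and y are clearly dependent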
Which of the following is the best guess for the correlation between runs and at_bats?

qplot(at_bats, runs, data = mlb11)

(a) 0.02 (b) 0.60 (c) -1.5 (d) 0.95 (e) -0.55
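We can check the guess directly; the answer is also pinned down by the \(R^2 = 0.3729\) reported later in these notes, since \(\sqrt{0.3729} \approx 0.61\) and the scatterplot trends upward:

cor(mlb11$runs, mlb11$at_bats)   # about 0.61, so (b) is the best guess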
# install.packages("GGally")
library(GGally)
ggpairs(mlb11[, -1])   # all pairwise plots, dropping the team name column
The fitted linear model gives predictions \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
A residual is the difference between the observed and the predicted value of \(y\): \[ e_i = y_i - \hat{y}_i \]
fit = lm(runs ~ at_bats, data = mlb11)
summary(fit)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared:  0.373,  Adjusted R-squared:  0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
at_bats         0.6305     0.1545   4.080 0.000339 ***
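Plugging these estimates into the model gives the least squares line, which we can use for a point prediction (the 5,500 at-bats input below is just an illustrative value):
\[ \widehat{runs} = -2789.24 + 0.6305 \times at\_bats \]
For example, a team with 5,500 at-bats would be predicted to score \(-2789.24 + 0.6305 \times 5500 \approx 679\) runs.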
The strength of the fit of a linear model is most commonly evaluated using \(R^2\)
\(R^2\) is calculated as the square of the correlation coefficient
It tells us what percent of variability in the response variable is explained by the model.
The remainder of the variability is explained by variables not included in the model or by inherent randomness in the data.
Sometimes called the coefficient of determination.
cor(mlb11$runs, mlb11$at_bats)^2
## [1] 0.3729
summary(fit)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared:  0.373,  Adjusted R-squared:  0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339
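The R-squared value can also be pulled out of the summary object directly rather than read off the printout (an added convenience, not shown in the original slides):

summary(fit)$r.squared   # 0.3729, matching cor(mlb11$runs, mlb11$at_bats)^2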
For a linear regression we have defined the correlation coefficient to be \[R = \text{Cor}(X,Y) = \frac{1}{n-1} \sum_{i=1}^n \left(\frac{x_i-\bar{x}}{s_x}\right) \left(\frac{y_i-\bar{y}}{s_y}\right)\]
This definition works fine for the simple linear regression case, where \(X\) and \(Y\) are numerical variables, but it does not generalize well to some of the extensions we will see this week and next week.
A better definition is \(R = \text{Cor}(Y,\hat{Y})\), which works for all regression examples we will see in this class. It is equivalent to \(\text{Cor}(X,Y)\) in the case of simple linear regression, and it gives a better sense of what \(R^2\) means.
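A quick check (an added sketch, using the fit object from earlier) that the two definitions agree for simple linear regression:

cor(mlb11$runs, mlb11$at_bats)^2   # R^2 from Cor(X, Y)
cor(mlb11$runs, fitted(fit))^2     # R^2 from Cor(Y, Y-hat): identical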