Linear regression models the mean value of a response variable (outcome) for given levels of the predictor variables (covariates)
For example, in the previous lectures of this case study, we fitted linear regression to investigate the relationship between infants’ birth weight and a set of predictors, such as gestational age, sex and mother’s smoking status
This linear regression model estimates how, on average, these mothers’ and infants’ characteristics affect the birth weights of infants
While this model can address questions such as “is prenatal care important?”, it cannot answer an important question: “does prenatal care influence birth weight differently for infants with low birth weights than for those with average birth weights?”
What the regression curve does is give a grand summary for the averages of the distributions corresponding to the set of x’s. We could go further and compute several different regression curves corresponding to the various percentage points of the distributions and thus get a more complete picture of the set. Ordinarily this is not done, and so regression often gives a rather incomplete picture. Just as the mean gives an incomplete picture of a single distribution, so the regression curve gives a correspondingly incomplete picture for a set of distributions.
– Mosteller and Tukey (1977)
A more comprehensive picture of the effect of the predictors on the response variable can be obtained by using quantile regression
Quantile regression models the relation between a set of predictors and specific percentiles (or quantiles) of the outcome variable
For example, a median regression (the median is the 50th percentile) of infant birth weight on mothers’ characteristics specifies the changes in the median birth weight as a function of the predictors
The quantile regression parameter estimates the change in a specified quantile of the outcome corresponding to a one unit change in the covariate
This allows us to compare how some percentiles of birth weight may be more affected by certain maternal characteristics than other percentiles, which is reflected in changes in the size of the regression coefficients across quantiles
Median regression was first proposed in 1760 by Ruđer Bošković, a Jesuit priest; it was later developed by Laplace and Francis Edgeworth
The most important modern development of quantile regression is due to Roger Koenker
For a random variable \(Y\) with CDF \(F_Y(y)=\Pr(Y\leq y)\), the \(q\)th (\(q\in (0,1)\)) quantile of \(Y\) is: \[Q_Y(q)=F_Y^{-1}(q)=\inf\{y:F_Y(y)\geq q\}\] In other words, the quantile function is the inverse of the CDF
Empirically, a \(q\)th quantile, \(y_q\), is the \(y\) value that splits the data into proportions \(q\) below and \(1-q\) above: \(F(y_q)=q\) and \(y_q=F^{-1}(q)\)
Median \(q=0.5\)
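As a quick base-R sketch of this definition (my own illustration, not part of the original notes), the empirical quantile is the smallest observed value at which the empirical CDF reaches \(q\); this matches R's built-in `quantile(..., type = 1)`, which implements the inverse-ECDF definition:

```r
# q-th empirical quantile as the inverse of the empirical CDF:
# the smallest sorted value whose cumulative proportion reaches q
emp_quantile <- function(y, q) {
  ys <- sort(y)
  n <- length(ys)
  ys[which(seq_len(n) / n >= q)[1]]
}

y <- c(3, 1, 4, 1, 5, 9, 2, 6)
emp_quantile(y, 0.5)                  # sample median by the inf definition: 3
quantile(y, probs = 0.5, type = 1)    # R's inverse-ECDF quantile, also 3
```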
In most real settings, quantiles are defined conditional on a set of covariates \(X\) – this is quantile regression
Regression form: \(y_i=x_i\beta+\epsilon_i\)
Mean regression (OLS): minimize \(\sum_ie_i^2\), where \(e_i=y_i-\hat{y}_i=y_i-x_i\hat{\beta}\) are the residuals
Median regression (aka least-absolute-deviations or LAD): minimize \(\sum_i|e_i|\)
Median regression is more robust to outliers than least squares regression, and avoids assumptions about the parametric distribution of the errors (i.e., it is semiparametric)
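To make the robustness claim concrete, here is a tiny base-R toy example (numbers are my own): a single extreme outlier drags the mean far from the bulk of the data, while the median is untouched.

```r
# One extreme outlier shifts the mean dramatically but leaves the median alone
y     <- c(10, 11, 12, 13, 14)
y_out <- c(10, 11, 12, 13, 1400)   # last value replaced by an outlier

mean(y);   mean(y_out)    # 12 vs 289.2 -- mean is pulled toward the outlier
median(y); median(y_out)  # 12 vs 12   -- median is unchanged
```

The same contrast carries over to regression: a few extreme responses can move the OLS fit substantially, while the median (LAD) fit barely changes.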
Quantile regression generalizes the median regression: minimizes a sum that gives asymmetric penalties \((1-q)|e_i|\) for over-prediction and \(q|e_i|\) for under-prediction
Formally, the quantile regression estimator of the coefficients \(\beta_q\) for quantile \(q\) minimizes the following objective function (built from the check function)
\[{\small Q(\beta_q)= \sum_{i:\,y_i\geq x_i\beta_q} q|y_i-x_i\beta_q|+\sum_{i:\,y_i< x_i\beta_q} (1-q)|y_i-x_i\beta_q|}\]
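The two sums above can be written compactly with the check (pinball) loss \(\rho_q(e)=e\,(q-\mathbb{1}\{e<0\})\), which weights positive residuals by \(q\) and negative residuals by \(1-q\). A base-R sketch (the function name is my own):

```r
# Check (pinball) loss: positive residuals weighted by q, negative by (1 - q)
rho <- function(e, q) e * (q - (e < 0))

# The quantile-regression objective for candidate coefficients beta is
# sum(rho(y - X %*% beta, q)); for q = 0.5 it is half the sum of |residuals|
e <- c(-2, -1, 1, 3)
sum(rho(e, q = 0.5))   # 3.5 = 0.5 * sum(abs(e))
sum(rho(e, q = 0.9))   # 3.9 -- under-prediction (positive e) penalized more
```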
For the \(q\)th-quantile regression, the estimated coefficient of a covariate \(X\), \(\hat{\beta}_q\), is interpreted as: corresponding to a one unit change in the covariate \(X\), the \(q\)th quantile of the outcome changes by \(\hat{\beta}_q\), holding the other covariates fixed.
It is often useful to run quantile regression on a discrete set of quantiles, e.g. (.05, .1, .5, .9, .95). The higher and lower quantiles are often of particular interest and can differ substantially from the mean and median regressions.
In practice, standard errors are usually calculated using the bootstrap.
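To convey the idea without the quantreg machinery, here is a minimal base-R bootstrap of the standard error of a sample median (`summary.rq(..., se = "boot")` in quantreg does the analogous resampling for the regression coefficients); the simulated data are my own toy example:

```r
set.seed(42)
y <- rnorm(200, mean = 10, sd = 2)   # toy sample

# Nonparametric bootstrap: resample with replacement, recompute the median
B <- 1000
boot_medians <- replicate(B, median(sample(y, replace = TRUE)))
se_boot <- sd(boot_medians)
se_boot   # bootstrap standard error of the sample median
```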
Koenker developed and maintains an R package, quantreg, with a very informative vignette
In microeconomics, an Engel curve (named after German statistician Ernst Engel) describes how household expenditure on a particular good or service varies with household income
Engel collected food expenditure vs. household income for a sample of 235 nineteenth-century working-class Belgian households
Fit the Engel data: gray lines are the (.05, .1, .25, .75, .90, .95) quantile regression fits; blue is the median fit; red is the OLS (mean) fit
library(quantreg)
data(engel)
attach(engel)
plot(income, foodexp, cex = .25, type = "n",
     xlab = "Household Income", ylab = "Food Expenditure")
points(income, foodexp, cex = .5, col = "blue")
abline(rq(foodexp ~ income, tau = .5), col = "blue")  # median fit
abline(lm(foodexp ~ income), lty = 2, col = "red")    # the dreaded OLS line
taus <- c(.05, .1, .25, .75, .90, .95)
for (tau in taus) {
  abline(rq(foodexp ~ income, tau = tau), col = "gray")
}
Plot the slope and intercept of the estimated quantile regressions for the Engel data as functions of the quantile
income_c <- income - mean(income)  # center income so the intercept is interpretable
fit1 <- summary(rq(foodexp ~ income_c, tau = 2:98/100))
fit2 <- summary(rq(foodexp ~ income_c, tau = c(.05, .25, .5, .75, .95)))
plot(fit1, mfrow = c(1, 2))
Engel’s law: the poorer a family is, the larger the budget share it spends on nourishment.
Run quantile regression using our birth weight data to determine the association between quantiles of birth weight and a set of covariates including gestational age, biological sex, and, in particular, maternal smoking status
Subset the data to Durham County (CORES=32) and 2016
library(quantreg)
load(file = "data/CS1/birthweight_clean.RData")
birth2016 <- subset(bdclean, YOB == '2016')
# CORES code for Durham County is 32
durham <- subset(birth2016, CORES == '32')
As with OLS, one can also use k-fold cross-validation to compare different quantile regression models (with different sets of covariates)
The goal is to minimize an objective function over the test data
The objective function, instead of the MSE (which corresponds to the objective function in OLS), is the mean of the check function evaluated at the empirical residuals:
\[{\small MQ_q= \frac{1}{n_{test}}\left(\sum_{i:y_i\geq \hat{y}_i}^{n_{test}} q|y_i-\hat{y}_i|+\sum_{i:y_i< \hat{y}_i}^{n_{test}} (1-q)|y_i-\hat{y}_i|\right)}\] where \(\hat{y}_i=x_i\hat{\beta}_q\) is the predicted outcome for sample unit \(i\) in the test data
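The test-data objective \(MQ_q\) is just the mean of the check loss over the held-out residuals. A base-R sketch (function name and numbers are my own):

```r
# Mean check-function loss on held-out data, matching the MQ_q formula:
# positive residuals weighted by q, negative residuals by (1 - q)
mean_check_loss <- function(y, y_hat, q) {
  e <- y - y_hat
  mean(e * (q - (e < 0)))
}

# For q = 0.5 this reduces to half the mean absolute error
y     <- c(2, 4, 6, 8)
y_hat <- c(1, 5, 5, 9)
mean_check_loss(y, y_hat, q = 0.5)   # 0.5 = 0.5 * mean(abs(y - y_hat))
```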
Continue the quantile regression with the birth weight data in Durham
Perform 5-fold cross validation to compare a few models, e.g. with vs. without smoking status
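The 5-fold procedure can be sketched end to end. Since the course data are not reproduced here, the snippet below simulates a hypothetical stand-in (variable names and effect sizes are my own, not from the actual dataset), and fits each quantile regression by minimizing the check loss directly with `optim` so the sketch stays self-contained; in practice one would call `quantreg::rq` instead.

```r
set.seed(1)
# Hypothetical simulated stand-in for the Durham birth weight data
n        <- 200
gest_age <- rnorm(n, mean = 39, sd = 2)
smoker   <- rbinom(n, 1, 0.2)
bw       <- 3000 + 120 * (gest_age - 39) - 200 * smoker + rnorm(n, sd = 300)

check_loss <- function(e, q) sum(e * (q - (e < 0)))

# Base-R substitute for quantreg::rq: minimize the check loss directly,
# starting from the least-squares solution
fit_qr <- function(X, y, q) {
  optim(qr.solve(X, y), function(b) check_loss(y - X %*% b, q))$par
}

# 5-fold CV score: mean check loss on the held-out folds
folds <- sample(rep(1:5, length.out = n))
cv_loss <- function(X, y, q = 0.5) {
  mean(sapply(1:5, function(k) {
    b <- fit_qr(X[folds != k, , drop = FALSE], y[folds != k], q)
    e <- y[folds == k] - X[folds == k, , drop = FALSE] %*% b
    mean(e * (q - (e < 0)))
  }))
}

X_full <- cbind(1, gest_age, smoker)  # model with smoking status
X_red  <- cbind(1, gest_age)          # model without
cv_loss(X_full, bw)   # the model with the smaller CV loss is preferred
cv_loss(X_red, bw)
```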