Linear regression models the mean value of a response variable (outcome) for given levels of the predictor variables (covariates)
For example, in the previous lectures of this case study, we fitted linear regression to investigate the relationship between infants’ birth weight and a set of predictors, such as gestational age, sex and mother’s smoking status
This linear regression model estimates how, on average, these mothers’ and infants’ characteristics affect the birth weights of infants
While this model can address questions such as “is prenatal care important?”, it cannot answer an important question: “does prenatal care influence birth weight differently for infants with low birth weights than for those with average birth weights?”
What the regression curve does is give a grand summary for the averages of the distributions corresponding to the set of x’s. We could go further and compute several different regression curves corresponding to the various percentage points of the distributions and thus get a more complete picture of the set. Ordinarily this is not done, and so regression often gives a rather incomplete picture. Just as the mean gives an incomplete picture of a single distribution, so the regression curve gives a correspondingly incomplete picture for a set of distributions.
– Mosteller and Tukey (1977)
A more comprehensive picture of the effect of the predictors on the response variable can be obtained by using quantile regression
Quantile regression models the relation between a set of predictors and specific percentiles (or quantiles) of the outcome variable
For example, a median regression (the median is the 50th percentile) of infant birth weight on mothers’ characteristics specifies the changes in the median birth weight as a function of the predictors
The quantile regression parameter estimates the change in a specified quantile of the outcome corresponding to a one unit change in the covariate
This allows us to compare how some percentiles of birth weight may be more affected by certain maternal characteristics than other percentiles, which is reflected in changes in the size of the regression coefficients across quantiles
Median regression was first proposed in 1760 by Ruđer Bošković, a Jesuit priest; it was later developed by Laplace and Francis Edgeworth
The most important modern development of quantile regression is due to Roger Koenker
For a random variable \(Y\) with CDF \(F_Y(y)=\Pr(Y\leq y)\), the \(q\)th (\(q\in (0,1)\)) quantile of \(Y\) is: \[Q_Y(q)=F_Y^{-1}(q)=\inf\{y:F_Y(y)\geq q\}\] In other words, the quantile function is the inverse of the CDF
Empirically, a \(q\)th quantile, \(y_q\), is the \(y\) value that splits the data into proportions \(q\) below and \(1-q\) above: \(F(y_q)=q\) and \(y_q=F^{-1}(q)\)
Median \(q=0.5\)
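As a quick base-R sketch of this definition (my own illustration, not part of the original notes), the empirical quantile is the smallest observed value at which the empirical CDF reaches \(q\); this matches R's built-in `quantile(..., type = 1)`, which implements the inverse-ECDF definition:

```r
# q-th empirical quantile as the inverse of the empirical CDF:
# the smallest sorted value whose cumulative proportion reaches q
emp_quantile <- function(y, q) {
  ys <- sort(y)
  n <- length(ys)
  ys[which(seq_len(n) / n >= q)[1]]
}

y <- c(3, 1, 4, 1, 5, 9, 2, 6)
emp_quantile(y, 0.5)                  # sample median by the inf definition: 3
quantile(y, probs = 0.5, type = 1)    # R's inverse-ECDF quantile, also 3
```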
In most real settings, quantiles are defined conditional on a set of covariates \(X\) – this is quantile regression
Regression form: \(y_i=x_i\beta+\epsilon_i\)
Mean regression (OLS): minimize \(\sum_ie_i^2\), where \(e_i=y_i-\hat{y}_i=y_i-x_i\hat{\beta}\) are the residuals
Median regression (aka least-absolute-deviations or LAD): minimize \(\sum_i|e_i|\)
Median regression is more robust to outliers than least squares regression, and avoids assumptions about the parametric distribution of the errors (i.e., it is semiparametric)
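To make the robustness claim concrete, here is a tiny base-R toy example (numbers are my own): a single extreme outlier drags the mean far from the bulk of the data, while the median is untouched.

```r
# One extreme outlier shifts the mean dramatically but leaves the median alone
y     <- c(10, 11, 12, 13, 14)
y_out <- c(10, 11, 12, 13, 1400)   # last value replaced by an outlier

mean(y);   mean(y_out)    # 12 vs 289.2 -- mean is pulled toward the outlier
median(y); median(y_out)  # 12 vs 12   -- median is unchanged
```

The same contrast carries over to regression: a few extreme responses can move the OLS fit substantially, while the median (LAD) fit barely changes.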
Quantile regression generalizes the median regression: minimizes a sum that gives asymmetric penalties \((1-q)|e_i|\) for over-prediction and \(q|e_i|\) for under-prediction
Formally, the quantile regression estimator of the coefficients \(\beta_q\) for quantile \(q\) minimizes the following objective function (built from the check function)
\[{\small Q(\beta_q)= \sum_{i:\,y_i\geq x_i\beta_q} q|y_i-x_i\beta_q|+\sum_{i:\,y_i< x_i\beta_q} (1-q)|y_i-x_i\beta_q|}\]
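The two sums above can be written compactly with the check (pinball) loss \(\rho_q(e)=e\,(q-\mathbb{1}\{e<0\})\), which weights positive residuals by \(q\) and negative residuals by \(1-q\). A base-R sketch (the function name is my own):

```r
# Check (pinball) loss: positive residuals weighted by q, negative by (1 - q)
rho <- function(e, q) e * (q - (e < 0))

# The quantile-regression objective for candidate coefficients beta is
# sum(rho(y - X %*% beta, q)); for q = 0.5 it is half the sum of |residuals|
e <- c(-2, -1, 1, 3)
sum(rho(e, q = 0.5))   # 3.5 = 0.5 * sum(abs(e))
sum(rho(e, q = 0.9))   # 3.9 -- under-prediction (positive e) penalized more
```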
For the \(q\)th-quantile regression, the estimated coefficient of a covariate \(X\), \(\hat{\beta}_q\), is interpreted as: corresponding to a one unit change in the covariate \(X\), the \(q\)th quantile of the outcome changes by \(\hat{\beta}_q\), holding the other covariates fixed.
It is often useful to run quantile regression on a discrete set of quantiles, e.g. (.05, .1, .5, .9, .95). The higher and lower quantiles are often of particular interest and can differ substantially from the mean and median regressions.
In practice, standard errors are usually calculated using the bootstrap.
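To convey the idea without the quantreg machinery, here is a minimal base-R bootstrap of the standard error of a sample median (`summary.rq(..., se = "boot")` in quantreg does the analogous resampling for the regression coefficients); the simulated data are my own toy example:

```r
set.seed(42)
y <- rnorm(200, mean = 10, sd = 2)   # toy sample

# Nonparametric bootstrap: resample with replacement, recompute the median
B <- 1000
boot_medians <- replicate(B, median(sample(y, replace = TRUE)))
se_boot <- sd(boot_medians)
se_boot   # bootstrap standard error of the sample median
```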
Koenker developed and maintains an R package, quantreg, with a very informative vignette
In microeconomics, an Engel curve (named after German statistician Ernst Engel) describes how household expenditure on a particular good or service varies with household income
Engel collected food expenditure vs. household income for a sample of 235 nineteenth-century working-class Belgian households
Fit the Engel data: gray lines are the (.05, .1, .25, .75, .90, .95) quantile regression fits; blue is the median fit; red is the OLS (mean) fit
library(quantreg)
data(engel)
attach(engel)
plot(income, foodexp, cex = .25, type = "n",
     xlab = "Household Income", ylab = "Food Expenditure")
points(income, foodexp, cex = .5, col = "blue")
abline(rq(foodexp ~ income, tau = .5), col = "blue")  # median fit
abline(lm(foodexp ~ income), lty = 2, col = "red")    # the dreaded OLS line
taus <- c(.05, .1, .25, .75, .90, .95)
for (tau in taus) {
  abline(rq(foodexp ~ income, tau = tau), col = "gray")
}
Plot the slope and intercept of the estimated quantile regressions for the Engel data as functions of the quantile
income_c <- income - mean(income)  # center income so the intercept is interpretable
fit1 <- summary(rq(foodexp ~ income_c, tau = 2:98/100))
fit2 <- summary(rq(foodexp ~ income_c, tau = c(.05, .25, .5, .75, .95)))
plot(fit1, mfrow = c(1, 2))
Engel’s law: the poorer a family is, the larger the budget share it spends on nourishment.
Run quantile regression using our birth weight data to determine the association between quantiles of birth weight and a set of covariates including gestational age, biological sex, and, in particular, maternal smoking status
Subset the data to Durham County (CORES=32) and 2016
library(quantreg)
load(file = "data/CS1/birthweight_clean.RData")
birth2016 <- subset(bdclean, YOB == '2016')
# CORES code for Durham County is 32
durham <- subset(birth2016, CORES == '32')
As with OLS, one can also use k-fold cross-validation to compare different quantile regression models (with different sets of covariates)
The goal is to minimize an objective function over the test data
The objective function, instead of the MSE (which corresponds to the objective function in OLS), is the mean of the check function evaluated at the empirical residuals:
\[{\small MQ_q= \frac{1}{n_{test}}\left(\sum_{i:y_i\geq \hat{y}_i}^{n_{test}} q|y_i-\hat{y}_i|+\sum_{i:y_i< \hat{y}_i}^{n_{test}} (1-q)|y_i-\hat{y}_i|\right)}\] where \(\hat{y}_i=x_i\hat{\beta}_q\) is the predicted outcome for sample unit \(i\) in the test data
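The test-data objective \(MQ_q\) is just the mean of the check loss over the held-out residuals. A base-R sketch (function name and numbers are my own):

```r
# Mean check-function loss on held-out data, matching the MQ_q formula:
# positive residuals weighted by q, negative residuals by (1 - q)
mean_check_loss <- function(y, y_hat, q) {
  e <- y - y_hat
  mean(e * (q - (e < 0)))
}

# For q = 0.5 this reduces to half the mean absolute error
y     <- c(2, 4, 6, 8)
y_hat <- c(1, 5, 5, 9)
mean_check_loss(y, y_hat, q = 0.5)   # 0.5 = 0.5 * mean(abs(y - y_hat))
```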
Continue the quantile regression with the birth weight data in Durham
Perform 5-fold cross validation to compare a few models, e.g. with vs. without smoking status
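The 5-fold procedure can be sketched end to end. Since the course data are not reproduced here, the snippet below simulates a hypothetical stand-in (variable names and effect sizes are my own, not from the actual dataset), and fits each quantile regression by minimizing the check loss directly with `optim` so the sketch stays self-contained; in practice one would call `quantreg::rq` instead.

```r
set.seed(1)
# Hypothetical simulated stand-in for the Durham birth weight data
n        <- 200
gest_age <- rnorm(n, mean = 39, sd = 2)
smoker   <- rbinom(n, 1, 0.2)
bw       <- 3000 + 120 * (gest_age - 39) - 200 * smoker + rnorm(n, sd = 300)

check_loss <- function(e, q) sum(e * (q - (e < 0)))

# Base-R substitute for quantreg::rq: minimize the check loss directly,
# starting from the least-squares solution
fit_qr <- function(X, y, q) {
  optim(qr.solve(X, y), function(b) check_loss(y - X %*% b, q))$par
}

# 5-fold CV score: mean check loss on the held-out folds
folds <- sample(rep(1:5, length.out = n))
cv_loss <- function(X, y, q = 0.5) {
  mean(sapply(1:5, function(k) {
    b <- fit_qr(X[folds != k, , drop = FALSE], y[folds != k], q)
    e <- y[folds == k] - X[folds == k, , drop = FALSE] %*% b
    mean(e * (q - (e < 0)))
  }))
}

X_full <- cbind(1, gest_age, smoker)  # model with smoking status
X_red  <- cbind(1, gest_age)          # model without
cv_loss(X_full, bw)   # the model with the smaller CV loss is preferred
cv_loss(X_red, bw)
```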