Introduction to Quantile Regression

Linear Regression

  • Linear regression models the mean value of a response variable (outcome) for given levels of the predictor variables (covariates)

  • For example, in the previous lectures of this case study, we fitted a linear regression to investigate the relationship between infants’ birth weight and a set of predictors, such as gestational age, sex, and mother’s smoking status

  • This linear regression model estimates how, on average, these mothers’ and infants’ characteristics affect the birth weights of infants

  • While this model can address questions such as “is prenatal care important?”, it cannot answer an important question: “does prenatal care influence birth weight differently for infants with low birth weight than for those with average birth weight?”

Motivation

What the regression curve does is give a grand summary for the averages of the distributions corresponding to the set of x’s. We could go further and compute several different regression curves corresponding to the various percentage points of the distributions and thus get a more complete picture of the set. Ordinarily this is not done, and so regression often gives a rather incomplete picture. Just as the mean gives an incomplete picture of a single distribution, so the regression curve gives a correspondingly incomplete picture for a set of distributions.
– Mosteller and Tukey (1977)

Quantile Regression

  • A more comprehensive picture of the effect of the predictors on the response variable can be obtained by using quantile regression

  • Quantile regression models the relation between a set of predictors and specific percentiles (or quantiles) of the outcome variable

  • For example, a median regression (median is the 50th percentile) of infant birth weight on mothers’ characteristics specifies the changes in the median birth weight as a function of the predictors

Quantile Regression (cont’d)

  • The quantile regression parameter estimates the change in a specified quantile of the outcome corresponding to a one unit change in the covariate

  • This allows us to compare how some percentiles of birth weight may be more affected by certain maternal characteristics than others; the difference shows up as a change in the size of the regression coefficient across quantiles

  • Median regression was first proposed in 1760 by Bošković, a Jesuit Catholic priest, and later developed by Laplace and Francis Edgeworth

  • The most important modern development of quantile regression is due to Roger Koenker

Growth Curves

Quetelet (1870) pioneered the use of growth charts – conditional quantile estimation. Standard in modern pediatric practice

Quantile

  • For a random variable \(Y\) with CDF \(F_Y(y)=\Pr(Y\leq y)\), the \(q\)th quantile of \(Y\), for \(q\in (0,1)\), is: \[Q_Y(q)=F_Y^{-1}(q)=\inf\{y:F_Y(y)\geq q\}\] In other words, the quantile function is the (generalized) inverse of the CDF

  • Empirically, the \(q\)th quantile, \(y_q\), is the \(y\) value that splits the data into proportions \(q\) below and \(1-q\) above: \(F_Y(y_q)=q\) and \(y_q=F_Y^{-1}(q)\)

  • The median corresponds to \(q=0.5\)

  • In most real settings, quantiles are defined conditioning on a set of covariates \(X\) – quantile regression
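The definition \(Q_Y(q)=\inf\{y:F_Y(y)\geq q\}\) can be applied directly to a sample by replacing the CDF with the empirical CDF. The following is a minimal sketch in plain Python; the function name `empirical_quantile` and the birth weights are made up for illustration, and statistical packages typically offer several other interpolation conventions.

```python
def empirical_quantile(values, q):
    """Return inf{y : F_n(y) >= q}, where F_n is the empirical CDF of `values`."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    # At the k-th smallest value the empirical CDF is at least k/n,
    # so the first k with k/n >= q gives the smallest y with F_n(y) >= q.
    for k, y in enumerate(ordered, start=1):
        if k / n >= q:
            return y


# Hypothetical birth weights in kg (illustrative values only).
weights = [2.9, 3.4, 2.1, 3.8, 3.1, 3.6, 2.6, 3.3, 4.2, 3.0]

median = empirical_quantile(weights, 0.5)        # 5th smallest of 10 values
lower_decile = empirical_quantile(weights, 0.1)  # smallest value
upper_decile = empirical_quantile(weights, 0.9)  # 9th smallest value
print(median, lower_decile, upper_decile)
```

With an even sample size this convention returns the lower of the two middle values for the median, rather than their average.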

Mean vs. Median regression

  • Regression form: \(y_i=x_i\beta+\epsilon_i\)

  • Mean regression (OLS): minimize \(\sum_ie_i^2\), where \(e_i=y_i-\hat{y}_i=y_i-x_i\hat{\beta}\) are the residuals

  • Median regression (aka least-absolute-deviations or LAD): minimize \(\sum_i|e_i|\)

  • Median regression is more robust to outliers than least squares regression, and avoids assumptions about the parametric distribution of the error process (i.e. semiparametric)
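The robustness claim can be illustrated in the simplest case, a model with an intercept only, where the least-squares fit is the sample mean and the LAD fit is the sample median. This sketch uses made-up numbers and a crude grid search (not how either estimator is computed in practice) to show how a single outlier moves the two fits.

```python
import statistics


def sum_squared(values, c):
    """OLS objective for an intercept-only model: sum of squared residuals."""
    return sum((y - c) ** 2 for y in values)


def sum_absolute(values, c):
    """LAD objective for an intercept-only model: sum of absolute residuals."""
    return sum(abs(y - c) for y in values)


# Hypothetical birth weights in kg; the last value mimics a data-entry error.
clean = [2.9, 3.0, 3.1, 3.3, 3.4]
contaminated = clean[:-1] + [34.0]

# Crude grid search over candidate intercepts 0.00, 0.01, ..., 40.00.
grid = [i / 100 for i in range(4001)]
best_l2 = min(grid, key=lambda c: sum_squared(contaminated, c))
best_l1 = min(grid, key=lambda c: sum_absolute(contaminated, c))

print("least-squares fit:", best_l2, "sample mean:", statistics.mean(contaminated))
print("LAD fit:", best_l1, "sample median:", statistics.median(contaminated))
```

The least-squares fit is pulled far above every plausible birth weight, while the LAD fit stays at the middle observation.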

Quantile regression

  • Quantile regression generalizes median regression: it minimizes a sum that gives asymmetric penalties, \((1-q)|e_i|\) for over-prediction and \(q|e_i|\) for under-prediction

  • Formally, the quantile regression estimator of coefficients \(\beta\) for the quantile \(q\) minimizes the following objective function (i.e. the check function)

\[{\small Q(\beta_q)= \sum_{i:y_i\geq x_i\beta_q}^N q|y_i-x_i\beta_q|+\sum_{i:y_i< x_i\beta_q}^N (1-q)|y_i-x_i\beta_q|}\]

  • Optimizing the above objective function usually requires linear programming methods; nonetheless, the quantile regression estimator \(\hat{\beta}_q\) is asymptotically normal with a closed-form variance-covariance matrix.
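To make the objective function concrete, here is a minimal sketch for a single covariate plus an intercept. It assumes the standard linear-programming property that some minimizer fits at least two observations exactly, and simply tries every line through two data points. This brute-force search is for illustration on tiny data sets only and is not the algorithm used by statistical software; the function names and the data are hypothetical.

```python
from itertools import combinations


def check_loss(residual, q):
    """q * |e| for e >= 0 (under-prediction), (1 - q) * |e| for e < 0 (over-prediction)."""
    return q * residual if residual >= 0 else (q - 1) * residual


def objective(intercept, slope, xs, ys, q):
    """Sum of check losses of the residuals y_i - (intercept + slope * x_i)."""
    return sum(check_loss(y - (intercept + slope * x), q) for x, y in zip(xs, ys))


def fit_simple_quantile_regression(xs, ys, q):
    """Brute force: evaluate the objective on every line through two data points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line is not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        value = objective(intercept, slope, xs, ys, q)
        if best is None or value < best[0]:
            best = (value, intercept, slope)
    if best is None:
        raise ValueError("need at least two observations with distinct x values")
    return best[1], best[2]


# Made-up data whose spread grows with x: at each x there is one low and one high outcome.
xs = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 3.0, 7.0, 4.0, 9.0]

low = fit_simple_quantile_regression(xs, ys, 0.1)   # follows the lower points
high = fit_simple_quantile_regression(xs, ys, 0.9)  # follows the upper points
print("q=0.1 (intercept, slope):", low)
print("q=0.9 (intercept, slope):", high)
```

Because the spread of the outcome widens as x grows, the fitted slope is steeper at the upper quantile than at the lower one, which a single mean regression line cannot show.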

Interpretation

  • For the \(q\)th-quantile regression, the estimated coefficient of a covariate \(X\), \(\hat{\beta}_q\), is interpreted as follows: for a one-unit increase in \(X\), holding the other covariates fixed, the \(q\)th quantile of the outcome changes by \(\hat{\beta}_q\).

  • It is often useful to run quantile regression on a discrete set of quantiles, e.g. (.05, .1, .5, .9, .95). The higher and lower quantiles are often of particular interest, and their estimates can differ substantially from those of mean and median regression.

  • In practice, standard errors are usually calculated using the bootstrap.

  • Koenker developed and maintains an R package, “quantreg”, which has a very informative vignette

Example: Engel Curves

  • In microeconomics, an Engel curve (named after German statistician Ernst Engel) describes how household expenditure on a particular good or service varies with household income

  • Engel collected data on food expenditure and household income for a sample of 235 19th-century working-class Belgian households

  • Fit Engel data: grey (.05, .1, .25, .75, .90, .95) QR lines; blue median fit; red OLS (mean) fit

Example: Engel Curves (cont’d)

Engel Coefficient Plots

Plot the slope and intercept of the estimated quantile regression for the Engel data as a function of quantile

Engel’s Law

Engel coefficients at different quantiles

Engel’s law: the poorer a family is, the larger the budget share it spends on nourishment.

Class Exercise 1

  • Run quantile regression using our birth weight data to determine the association between quantiles of birth weight and a set of covariates including gestational age, biological sex, and, in particular, maternal smoking status

  • Subset the data to Durham County (CORES=32) and 2016

  • First run a median regression, then run at different quantiles; compare with a mean (OLS) regression

Validation of Quantile Regression

  • As with OLS, one can use k-fold cross-validation to compare different quantile regression models (with different sets of covariates)

  • The goal is to minimize an objective function over the test data

  • The objective function, instead of the MSE (which corresponds to the objective function in OLS), is the mean of the check function evaluated at the empirical residuals:

\[{\small MQ_q= \frac{1}{n_{test}}\left(\sum_{i:y_i\geq \hat{y}_i}^{n_{test}} q|y_i-\hat{y}_i|+\sum_{i:y_i< \hat{y}_i}^{n_{test}} (1-q)|y_i-\hat{y}_i|\right)}\] where \(\hat{y}_i=x_i\hat{\beta}_q\) is the predicted outcome for sample unit \(i\) in the test data
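The cross-validation procedure above can be sketched in plain Python. To keep the example free of a linear-programming solver, it compares two very simple models for which the check-loss minimizer is known: an intercept-only model (the training sample quantile) and a model with one binary covariate (a separate sample quantile per group). The data, the smoking effect, and all function names are made up for illustration; a real analysis would fit the models with a quantile regression routine.

```python
def mean_check_loss(y_true, y_pred, q):
    """MQ_q: mean of the check function over the test residuals."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        e = y - y_hat
        total += q * e if e >= 0 else (1 - q) * (-e)
    return total / len(y_true)


def sample_quantile(values, q):
    """Smallest observed value whose empirical CDF is at least q."""
    ordered = sorted(values)
    for k, y in enumerate(ordered, start=1):
        if k / len(ordered) >= q:
            return y


def fit_intercept_only(x_train, y_train, q):
    cut = sample_quantile(y_train, q)
    return lambda x: cut


def fit_by_group(x_train, y_train, q):
    groups = {}
    for g, y in zip(x_train, y_train):
        groups.setdefault(g, []).append(y)
    cuts = {g: sample_quantile(v, q) for g, v in groups.items()}
    fallback = sample_quantile(y_train, q)  # for groups unseen in training
    return lambda x: cuts.get(x, fallback)


def cross_validate(fit, x, y, q, k=5):
    """Average MQ_q over k folds; unit i is assigned to fold i % k (no shuffling)."""
    n = len(y)
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        predict = fit([x[i] for i in train_idx], [y[i] for i in train_idx], q)
        y_hat = [predict(x[i]) for i in test_idx]
        scores.append(mean_check_loss([y[i] for i in test_idx], y_hat, q))
    return sum(scores) / k


# Synthetic birth weights in grams: smokers (1) are shifted down by 300 g,
# plus a small deterministic "noise" term between -40 g and +40 g.
smoker = [i % 2 for i in range(40)]
weight = [3400 - 300 * s + ((7 * i) % 9 - 4) * 10 for i, s in enumerate(smoker)]

cv_without = cross_validate(fit_intercept_only, smoker, weight, q=0.5)
cv_with = cross_validate(fit_by_group, smoker, weight, q=0.5)
print("mean check loss without smoking status:", cv_without)
print("mean check loss with smoking status:   ", cv_with)
```

The model with the lower cross-validated mean check loss is preferred at that quantile; the comparison can be repeated for other values of q, since a covariate may matter more at some quantiles than at others.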

Class Exercise 2

  • Continue the quantile regression with the birth weight data in Durham

  • Perform 5-fold cross validation to compare a few models, e.g. with vs. without smoking status