class: center, middle, inverse, title-slide

# Quantile Regression
### Yue Jiang
### STA 440 / Spring 2024

---

### Transplants

<img src="img/transplant.png" width="80%" style="display: block; margin: auto;" />

---

### Transplants

<img src="img/transplant2.png" width="80%" style="display: block; margin: auto;" />

https://www.reddit.com/r/mildlyinfuriating/comments/x3h80z/the_bill_for_my_liver_transplant_us/

---

### Transplants

<img src="qr_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Transplants

<img src="qr_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

### Transplants

<img src="qr_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

### A linear model

```r
lm1 <- lm(owe ~ sex + age + region + any_hcc + severe + viral,
          data = dat)
round(summary(lm1)$coef, 2)
```

```
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -81174.32   24345.47   -3.33     0.00
## sexM          -17509.11    9758.68   -1.79     0.07
## age             1925.28     314.63    6.12     0.00
## regionMW       10289.50   12787.31    0.80     0.42
## regionS        47850.42   13297.56    3.60     0.00
## regionW        18519.15   12974.93    1.43     0.15
## any_hcc        -6295.16   10313.69   -0.61     0.54
## severeExtreme  28799.49   18503.07    1.56     0.12
## severeHigh     33983.09   16324.53    2.08     0.04
## severeMed      11254.29   17056.78    0.66     0.51
## viral          -8652.44   10237.71   -0.85     0.40
```

---

### Diagnostics

<img src="qr_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

### Diagnostics

<img src="qr_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

.question[
What might we do?
]

---

### Transformations?

```r
lm2 <- lm(log(owe) ~ sex + age + region + any_hcc + severe + viral,
          data = dat)
```

```r
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in 'y'
```

.question[
What happened?
]

---

### Transformations?

```r
lm3 <- lm(log(owe + 0.0001) ~ sex + age + region + any_hcc + severe + viral,
          data = dat)
round(summary(lm3)$coef, 3)
```

```
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)      5.680      0.349  16.272    0.000
## sexM             0.114      0.140   0.813    0.416
## age              0.039      0.005   8.548    0.000
## regionMW         0.569      0.183   3.103    0.002
## regionS          1.047      0.191   5.494    0.000
## regionW          0.987      0.186   5.303    0.000
## any_hcc          0.185      0.148   1.253    0.210
## severeExtreme    0.438      0.265   1.652    0.099
## severeHigh       0.399      0.234   1.704    0.088
## severeMed       -0.096      0.245  -0.392    0.695
## viral           -0.139      0.147  -0.945    0.345
```

---

### Transformations?

```r
lm4 <- lm(log(owe + 10) ~ sex + age + region + any_hcc + severe + viral,
          data = dat)
round(summary(lm4)$coef, 3)
```

```
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)      6.710      0.195  34.375    0.000
## sexM             0.000      0.078   0.001    0.999
## age              0.030      0.003  11.755    0.000
## regionMW         0.236      0.103   2.301    0.021
## regionS          0.766      0.107   7.185    0.000
## regionW          0.732      0.104   7.034    0.000
## any_hcc          0.170      0.083   2.059    0.040
## severeExtreme    0.632      0.148   4.262    0.000
## severeHigh       0.583      0.131   4.451    0.000
## severeMed        0.229      0.137   1.676    0.094
## viral           -0.193      0.082  -2.349    0.019
```

---

### Transformations?

```r
lm5 <- lm(log10(owe + 1) ~ sex + age + region + any_hcc + severe + viral,
          data = dat)
round(summary(lm5)$coef, 3)
```

```
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)      2.816      0.096  29.201    0.000
## sexM             0.010      0.039   0.269    0.788
## age              0.014      0.001  10.990    0.000
## regionMW         0.133      0.051   2.617    0.009
## regionS          0.359      0.053   6.807    0.000
## regionW          0.342      0.051   6.644    0.000
## any_hcc          0.075      0.041   1.842    0.066
## severeExtreme    0.261      0.073   3.560    0.000
## severeHigh       0.240      0.065   3.712    0.000
## severeMed        0.073      0.068   1.087    0.277
## viral           -0.079      0.041  -1.941    0.052
```

.question[
What do you notice? Is anything problematic?
]

---

### Diagnostics ... round 2

<img src="qr_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

### Diagnostics

<img src="qr_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

### A danger with transformations

Unfortunately, taking the inverse transformation of the predicted mean of the
transformed response does not result in the expectation of the untransformed
response:

`$$E(Y | X) \neq f^{-1}(E(f(Y) | X))$$`

Additionally, sometimes we can't even find a transformation that results in
constant variance being satisfied.

Finally (and this is the worst thing here), even if the model might be
"correctly specified," is its interpretation *what we actually want*?

.question[
What might we do?
]

---

### Quantile regression

Linear regression (fit with OLS):

`$$E(Y | \mathbf{X}) = \beta_0 + \beta_1\mathbf{X}_1 + \cdots + \beta_p\mathbf{X}_p$$`

`$$\min_{\beta_0, \beta_1, \cdots, \beta_p} \sum_{i = 1}^n \left(Y_i - \beta_0 - \beta_1X_{i1} - \cdots - \beta_pX_{ip} \right)^2$$`

Quantile regression (for quantile `\(\tau\)`):

`$$Q_\tau(Y | \mathbf{X}) = \beta_{0, \tau} + \beta_{1, \tau}\mathbf{X}_1 + \cdots + \beta_{p, \tau}\mathbf{X}_p$$`

`$$\min_{\beta_{0, \tau}, \beta_{1, \tau}, \cdots, \beta_{p, \tau}} \sum_{i = 1}^n \rho_\tau\left(Y_i - \beta_{0, \tau} - \beta_{1, \tau}X_{i1} - \cdots - \beta_{p, \tau}X_{ip} \right)$$`

with check loss `\(\rho_\tau(z) = \tau \max (z, 0) + (1 - \tau) \max(-z, 0)\)`.

---

### Quantile regression

.question[
What does this imply about each quantile `\(\tau\)`?
]

--

There is a different solution set of `\(\beta\)` for each quantile (for
instance, taking `\(\tau = 0.5\)` is for median regression, but we might care
about 0.1, 0.25, etc.).

---

### Quantile regression

```r
library(quantreg)
q0.50 <- rq(owe ~ sex + age + region + any_hcc + severe + viral,
            data = dat, tau = 0.5)
round(summary(q0.50)$coef, 2)
```

```
##                  Value Std. Error t value Pr(>|t|)
## (Intercept)   -4672.69     944.00   -4.95     0.00
## sexM            259.44     663.63    0.39     0.70
## age             202.31      17.25   11.73     0.00
## regionMW       -517.15     767.28   -0.67     0.50
## regionS        4848.59     896.30    5.41     0.00
## regionW        6700.79    1106.35    6.06     0.00
## any_hcc        2910.75     926.51    3.14     0.00
## severeExtreme  5530.79    1410.45    3.92     0.00
## severeHigh     5128.07     948.51    5.41     0.00
## severeMed      2303.13     910.85    2.53     0.01
## viral         -4704.20     828.61   -5.68     0.00
```

---

### Quantile regression

```r
library(quantreg)
q0.90 <- rq(owe ~ sex + age + region + any_hcc + severe + viral,
            data = dat, tau = 0.9)
round(summary(q0.90)$coef, 2)
```

```
##                   Value Std. Error t value Pr(>|t|)
## (Intercept)   -47505.91   15659.06   -3.03     0.00
## sexM           -4869.52   11945.70   -0.41     0.68
## age             2647.16     296.89    8.92     0.00
## regionMW      -12455.41   16246.67   -0.77     0.44
## regionS        87455.54   38881.62    2.25     0.02
## regionW         7829.43   11687.92    0.67     0.50
## any_hcc        31910.00   15377.95    2.08     0.04
## severeExtreme  66933.14   31111.35    2.15     0.03
## severeHigh     25942.58   15624.95    1.66     0.10
## severeMed       3550.60   13946.05    0.25     0.80
## viral         -10426.19   14432.74   -0.72     0.47
```

---

### Quantile regression

```
##                    Q10      Q25      Q50       Q75       Q90
## (Intercept)   -2201.55 -2825.78 -4672.69 -19541.12 -47505.91
## sexM           -304.58  -175.49   259.44   -399.00  -4869.52
## age              37.44    78.91   202.31    731.97   2647.16
## regionMW       1004.64   138.75  -517.15    218.76 -12455.41
## regionS        2508.50  3204.47  4848.59  15816.14  87455.54
## regionW        2358.74  3758.17  6700.79  20036.76   7829.43
## any_hcc         256.52   961.11  2910.75  13132.61  31910.00
## severeExtreme  1495.36  3115.13  5530.79  13198.55  66933.14
## severeHigh     1383.78  3527.73  5128.07   8894.08  25942.58
## severeMed       428.11  1580.87  2303.13   5376.31   3550.60
## viral            -2.96 -1461.92 -4704.20  -9421.32 -10426.19
```

.question[
How might we interpret each of these coefficients?
]

---

### Exploring across all quantiles

<img src="qr_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />
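
---

### Check loss: a small sketch

To see why minimizing the check loss `\(\rho_\tau\)` targets the `\(\tau\)`th quantile, here is a minimal base-R sketch of the intercept-only case. It is an illustration, not how `rq()` is implemented: the data are simulated (not the transplant data), and `check_loss()` and `fit_constant()` are hypothetical helper names introduced just for this slide. The brute-force search over observed values stands in for a proper optimizer.

```r
# Check loss from the slides:
# rho_tau(z) = tau * max(z, 0) + (1 - tau) * max(-z, 0)
check_loss <- function(z, tau) {
  tau * pmax(z, 0) + (1 - tau) * pmax(-z, 0)
}

# Hypothetical right-skewed toy response (simulated, NOT the transplant data)
set.seed(440)
y <- rexp(501, rate = 1 / 10000)

# Intercept-only "quantile regression": among the observed values, find the
# constant b0 that minimizes the total check loss sum_i rho_tau(y_i - b0)
fit_constant <- function(y, tau) {
  total_loss <- sapply(y, function(b0) sum(check_loss(y - b0, tau)))
  y[which.min(total_loss)]
}

taus <- c(0.1, 0.5, 0.9)
by_check_loss <- sapply(taus, fit_constant, y = y)

# Compare against the inverse-ECDF sample quantile (type = 1)
by_quantile <- unname(quantile(y, probs = taus, type = 1))

rbind(by_check_loss, by_quantile)
```

With no covariates, the minimizer of the total check loss should coincide with the sample `\(\tau\)`th quantile; adding covariates lets that quantile shift linearly with `\(\mathbf{X}\)`, which is the model fit by `rq()`. The comparison uses `type = 1` because other `quantile()` types interpolate between order statistics and would not match an observed value exactly.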