class: center, middle, inverse, title-slide # Introduction to survival analysis (2) ### Yue Jiang ### Duke University --- ### The DIG Trial <img src="img/02/dig_trial.png" width="120%" /> Investigators compared the **primary outcome** of the number of days from the start of the study to either death or hospitalization from worsening heart failure. --- ### Representing survival data Underlying data: - `\(T\)`: Failure time, a non-negative random variable - `\(C\)`: Censoring time, a non-negative random variable Observed data for individual `\(i\)`: - `\(Y_i\)`: `\((T_i \wedge C_i)\)`, the minimum of `\(T_i\)` and `\(C_i\)` - `\(\delta_i\)`: `\(1_{(T_i \le C_i)}\)`, whether we observe a failure If `\(\delta_i = 0\)`, then we have .vocab[right-censoring]: the survival time is longer than the censoring time. Commonly, we assume `\(C_i\)` are *i.i.d.* random variables with some distribution and that the censoring mechanism is *independent* of the failure mechanism. --- ### Characterizing continuous `\(T\)` - Density function: `\(f(t) = \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t)}{\Delta t}\)` - Distribution function: `\(F(t) = P(T \le t) = \int_0^t f(s)ds\)` - Survival function: `\(S(t) = P(T > t) = 1 - F(t)\)` - Hazard function: `\(\lambda(t) = \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t | T \ge t)}{\Delta t}\)` - Cumulative hazard function: `\(\Lambda(t) = \int_0^t \lambda(s)ds\)` Knowing one is equivalent to knowing the others. --- ### Hazard distributions Exponential distribution: - `\(f(t) = \lambda e^{-\lambda t}\)` (don't get the rate parameter `\(\lambda\)` confused with the hazard) - `\(F(t) = 1 - e^{-\lambda t}\)` - `\(S(t) = e^{-\lambda t}\)` - `\(\lambda(t) = \lambda\)` - `\(\Lambda(t) = \lambda t\)` .question[ What do you notice about the hazard for survival times that have an exponential distribution? ] --- ### Hazard distributions Weibull distribution: - `\(f(t) = p\lambda^p t^{p - 1}e^{-(\lambda t)^p}\)` - `\(F(t) = 1 - e^{-(\lambda t)^p}\)` - `\(S(t) = e^{-(\lambda t)^p}\)` - `\(\lambda(t) = p\lambda^p t^{p - 1}\)` - `\(\Lambda(t) = (\lambda t)^p\)` When the shape parameter `\(p\)` is 1, then we have the exponential distribution. The hazard increases monotonically over time if `\(p > 1\)` and decreases monotonically if `\(p < 1\)` (is this reasonable to assume?). --- ### Hazard distributions Plotting Weibull hazard with `\(\lambda = 1\)` and various shape parameters `\(p\)`: <img src="survival-2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- ### Hazard distributions Log-normal distribution: - `\(f(t) = \frac{1}{t \sigma}\phi\left(\frac{\log(t) - \mu}{\sigma} \right)\)` - `\(F(t) = \Phi\left(\frac{\log(t) - \mu}{\sigma} \right)\)` - `\(S(t) = 1 - F(t)\)` - `\(\lambda(t) = f(t)/S(t)\)` - `\(\Lambda(t) = \int_0^t \lambda(s)ds\)` --- ### Hazard distributions Plotting log-normal hazard with `\(\mu = 0\)` and `\(\sigma^2 = 1\)`: <img src="survival-2_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- ### Hazard distributions Log-logistic distribution: - `\(f(t) = \frac{(\beta/\alpha)(t/\alpha)^{\beta - 1}}{(1 + (t/\alpha)^\beta)^2}\)` - `\(F(t) = \frac{(t/\alpha)^\beta}{1 + (t/\alpha)^\beta}\)` - `\(S(t) = \frac{1}{1 + (t/\alpha)^\beta}\)` - `\(\lambda(t) = \frac{(\beta/\alpha)(t/\alpha)^{\beta - 1}}{1 + (t/\alpha)^\beta}\)` - `\(\Lambda(t) = \log(1 + (t/\alpha)^\beta)\)` --- ### Hazard distributions Plotting log-logistic hazards with `\(\alpha = 1\)` and various `\(\beta\)`: <img src="survival-2_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- ### Estimation Suppose we had a series of *non-censored* observations `\(t_1, t_2, \cdots, t_n\)`, and we thought that these survival times were i.i.d. and came from some distribution with density `\(f(t)\)`. .question[ How might we estimate the parameter(s) `\(\theta\)` of that distribution? ] --- ### Review: maximum likelihood estimation `\begin{align*} \mathcal{L}(\theta | T) &= f(t_1, t_2, \cdots, t_n | \theta)\\ &= f(t_1 | \theta)f(t_2 | \theta) \cdots f(t_n | \theta)\\ &= \prod_{i = 1}^n f(t_i|\theta), \end{align*}` which we can often maximize in closed form for many familiar distributions (or numerically). --- ### Review: maximum likelihood estimation For instance, if we thought `\(T \stackrel{i.i.d}{\sim} Exp(\lambda)\)`, then `\begin{align*} \mathcal{L}(\lambda | T) &= \lambda^n \exp \left(-\lambda\sum_{i = 1}^n t_n \right)\\ \log \mathcal{L}(\lambda | T) &= n\log(\lambda) - \lambda \sum_{i = 1}^n t_i,\\ \hat{\lambda}_{MLE} &= \frac{n}{\sum_{i = 1}^n t_i} \end{align*}` --- ### Estimation for censored data How might we perform maximum likelihood estimation for *censored* data? What would the likelihood look like? Suppose we have `\(n\)` i.i.d. observations with the same `\(f(t)\)`, `\(S(t)\)`, and hazard `\(\lambda(t)\)`. Consider what might happen at time `\(t_i\)`: - Case 1: Individual experiences event at `\(t_i: \delta_i = 1\)`. In this case, they contribute the density at `\(t_i\)`, which is `\(f(t_i)\)`. .question[ What would their contribution to the likelihood be if they were censored? ] -- - Case 2: Individual is still alive at `\(t_i: \delta_i = 0\)`. In this case, we know that their survival time is greater than `\(t_i\)`, which happens with probability `\(S(t_i)\)`. -- So, this individual's contribution to the likelihood is `\begin{align*} f(t_i)^{\delta_i}S(t_i)^{1 - \delta_i} \end{align*}` --- ### Estimation for censored data Note that `\begin{align*} \lambda(t) &= \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t | T \ge t)}{\Delta t}\\ &= \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t \cap T \ge t)/P(T \ge t)}{\Delta t}\\ &= \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t)}{P(T \ge t)\Delta t}\\ &= \lim_{\Delta t \to 0^+} \frac{f(t)\Delta t}{P(T \ge t)\Delta t}\\ &= \frac{f(t)}{S(t)}, \\ f(t) &= \lambda(t)S(t) \end{align*}` --- ### Estimation for censored data An individual's contribution to the likelihood is thus `\begin{align*} &\mathrel{\phantom{=}} f(t_i)^{\delta_i}S(t_i)^{1 - \delta_i}\\ &= \lambda(t_i)^{\delta_i}S(t_i)^{\delta_i}S(t_i)^{1 - \delta_i}\\ &= \lambda(t_i)^{\delta_i}S(t_i) \end{align*}` For our exponential example with `\(T \stackrel{i.i.d.}{\sim} Exp(\lambda)\)`, since the hazard is given by `\(\lambda\)` and the survival function is given by `\(e^{-\lambda t}\)`, we would thus have `\begin{align*} \mathcal{L}(\lambda | T) &= \prod_{i = 1}^n \lambda^{\delta_i}\exp(-\lambda t_i), \end{align*}` which we could then maximize using familiar methods. --- ### Covariates? .question[ How might we adjust for potential covariates that might be related to our outcome? Is there any way to create a predictive model for survival time? ] --- ### Some candidate models Could we use the following model for survival outcomes? `\begin{align*} T_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i \end{align*}` -- How about this model? `\begin{align*} \log(T_i) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i \end{align*}` --- ### An accelerated failure time model An .vocab[accelerated failure time] (AFT) model assumes `\begin{align*} \log(T_i) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i \end{align*}` where `\(\epsilon_i\)` are commonly assumed to be i.i.d. and follow some specified distribution. .question[ How might we "know" what distribution `\(\epsilon\)` takes on? ] --- ### An accelerated failure time model `\begin{align*} \log(T_i) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i \end{align*}` There is a one-to-one relationship between the distribution of `\(T\)` and the assumed error distribution in the AFT model. For instance, assuming `\(\epsilon_i\)` has a normal distribution implies `\(T_i\)` has a log-normal distribution. --- ### An accelerated failure time model Note that we can also write the AFT model as `\begin{align*} \log(T_i) &= \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i\\ T_i &= \exp\left(\beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} \right)e^{\epsilon_i} \end{align*}` .question[ How might we interpret coefficients in the AFT model? ] --- ### An accelerated failure time model `\begin{align*} T_i &= \exp\left(\beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} \right)e^{\epsilon_i}\\ &= e^{\beta_0}e^{\beta_1x_{1i}}e^{\beta_2x_{2i}}\cdots e^{\beta_px_{pi}}e^{\epsilon_i} \end{align*}` Covariates in an AFT model have a multiplicative effect on *time*. For instance, if `\(\beta_k = 0.4\)`, then `\(\exp(\beta_k) \approx 1.5\)`. Holding all else equal, an individual with covariate `\(x_k\)` one unit greater than another is expected to survive approximately 1.5 times longer than the other. The probability that they have survived to time `\(1.5t\)` is the same as the probability that the other individual is alive has survived to time `\(t\)`. --- ### Estimation in survival models Remember that an individual's contribution to the likelihood is `\begin{align*} \lambda(t)^{\delta_i}S(t), \end{align*}` and so the full likelihood function is given by `\begin{align*} \mathcal{L}(\theta | Y) = \prod_{i = 1}^n \lambda(t_i)^{\delta_i}S(t_i). \end{align*}` This likelihood function holds regardless of what type of model is being estimated (for independently right-censored data). --- ### Fitting an AFT model Recall that our model specifies a functional form between `\(Y\)`, `\(\beta\)`, and `\(\mathbf{X}\)`. For the AFT model, `\begin{align*} \log(T_i) = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i, \end{align*}` Suppose we think `\(T\)` follows an exponential distribution such that `\(T \sim Exp(\exp(\mathbf{X}^T\beta))\)`; that is, `\(\epsilon\)` has a standard extreme value distribution: `\(f(u) = \exp(u - \exp(u))\)`. Then we have: `\begin{align*} S(t | \mathbf{X}) &= \exp(-t\exp(\mathbf{X}^T\beta))\\ \lambda(t | \mathbf{X}) &= \exp(\mathbf{X}^T\beta) \end{align*}` --- ### Fitting an AFT model Thus we can maximize the likelihood for this model `\begin{align*} &\mathrel{\phantom{=}} \prod_{i = 1}^n \lambda(t_i | \mathbf{X}_i)^{\delta_i}S(t_i | \mathbf{X}_i)\\ &= \prod_{i = 1}^n \exp(\mathbf{X}_i^T\beta)^{\delta_i}\exp(-t\exp(\mathbf{X}_i^T\beta)) \end{align*}` with respect to the regression parameters of interest `\(\beta\)`. Different assumptions re: error distributions (and thus of `\(T\)`) result in different hazard and survival functions being plugged into our generic likelihood. Note that there is no closed form for this likelihood function with respect to `\(\beta\)`; choose your favorite numerical method (we'll see Newton-Raphson once again in model output on the following slides). --- ### Fitting an AFT model ```r library(survival) aft_e <- survreg(Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, data = dig, dist = "exponential") summary(aft_e) ``` ``` ## ## Call: ## survreg(formula = Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, ## data = dig, dist = "exponential") ## Value Std. Error z p ## (Intercept) 6.62850 0.07495 88.43 < 2e-16 ## TRTMT 0.29856 0.04167 7.17 7.7e-13 ## EJF_PER 0.04037 0.00239 16.92 < 2e-16 ## PREVMI -0.04978 0.04364 -1.14 0.25 ## ## Scale fixed at 1 ## ## Exponential distribution ## Loglik(model)= -20511.5 Loglik(intercept only)= -20680.3 ## Chisq= 337.51 on 3 degrees of freedom, p= 7.6e-73 ## Number of Newton-Raphson Iterations: 5 ## n= 6799 ``` --- ### Fitting an AFT model ```r library(survival) aft_w <- survreg(Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, data = dig, dist = "weibull") summary(aft_w) ``` ``` ## ## Call: ## survreg(formula = Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, ## data = dig, dist = "weibull") ## Value Std. Error z p ## (Intercept) 6.58614 0.10266 64.15 < 2e-16 ## TRTMT 0.39910 0.05768 6.92 4.5e-12 ## EJF_PER 0.05263 0.00337 15.61 < 2e-16 ## PREVMI -0.06661 0.06001 -1.11 0.27 ## Log(scale) 0.31844 0.01891 16.84 < 2e-16 ## ## Scale= 1.37 ## ## Weibull distribution ## Loglik(model)= -20351.2 Loglik(intercept only)= -20504.9 ## Chisq= 307.43 on 3 degrees of freedom, p= 2.5e-66 ## Number of Newton-Raphson Iterations: 5 ## n= 6799 ``` --- ### Fitting an AFT model ```r library(survival) aft_n <- survreg(Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, data = dig, dist = "lognormal") summary(aft_n) ``` ``` ## ## Call: ## survreg(formula = Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, ## data = dig, dist = "lognormal") ## Value Std. Error z p ## (Intercept) 6.06283 0.12297 49.30 < 2e-16 ## TRTMT 0.55329 0.06741 8.21 2.2e-16 ## EJF_PER 0.05932 0.00385 15.42 < 2e-16 ## PREVMI -0.11991 0.07059 -1.70 0.089 ## Log(scale) 0.82037 0.01636 50.14 < 2e-16 ## ## Scale= 2.27 ## ## Log Normal distribution ## Loglik(model)= -20343.2 Loglik(intercept only)= -20501 ## Chisq= 315.72 on 3 degrees of freedom, p= 3.9e-68 ## Number of Newton-Raphson Iterations: 3 ## n= 6799 ``` --- ### Fitting an AFT model ```r library(survival) aft_l <- survreg(Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, data = dig, dist = "loglogistic") summary(aft_l) ``` ``` ## ## Call: ## survreg(formula = Surv(DWHFDAYS, DWHF) ~ TRTMT + EJF_PER + PREVMI, ## data = dig, dist = "loglogistic") ## Value Std. Error z p ## (Intercept) 6.04825 0.11365 53.22 < 2e-16 ## TRTMT 0.48004 0.06243 7.69 1.5e-14 ## EJF_PER 0.05570 0.00359 15.50 < 2e-16 ## PREVMI -0.08625 0.06517 -1.32 0.19 ## Log(scale) 0.19883 0.01835 10.84 < 2e-16 ## ## Scale= 1.22 ## ## Log logistic distribution ## Loglik(model)= -20342 Loglik(intercept only)= -20498.1 ## Chisq= 312.28 on 3 degrees of freedom, p= 2.2e-67 ## Number of Newton-Raphson Iterations: 4 ## n= 6799 ``` --- ### How well are we fitting? In each of these past examples, we've assumed a distribution of survival times. How can we check if we've made an appropriate assumption? Suppose we use the following residuals: `\begin{align*} \widehat{e}_i &= \frac{\log(Y_i) - \mathbf{X}_i^T\widehat{\beta}}{\widehat{\phi}} \end{align*}` where `\(\phi\)` is the scale parameter from the model (don't worry about this). .question[ What does `\(\widehat{e}_i\)` represent? Is it appropriate to use these in some way? ] --- ### How well are we fitting? `\(\widehat{e}_i\)` examine scaled differences from `\(Y_i\)`, the *observed* data. Thus, these are *censored* residuals, which we must take into account. One way to deal with this issue is to plot a Kaplan-Meier plot of these residuals against the survival function of the assumed distribution. ```r resids <- (log(dig$DWHFDAYS) - aft_ll$linear.predictors) / (aft_ll$scale) m1 <- survfit(Surv(resids, DWHF) ~ 1, data = dig) plot(m1, xlab = "AFT Residuals (Logistic model)", ylab = "Survival Probability") exp.x <- seq(min(resids), max(resids), length = 100) exp.y <- plogis(exp.x, lower.tail = F) # F(t) lines(exp.x, exp.y, col = "red", lwd = 2) ``` --- ### How well are we fitting? <img src="survival-2_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />