Introduction to survival analysis

class: center, middle, inverse, title-slide

# Introduction to survival analysis
### Yue Jiang
### Duke University

---

### Survival data

In many studies, the outcome of interest is the amount of time from an initial 
observation until the occurrence of some event of interest.

Typically, the event of interest is called a .vocab[failure] (even if it's a 
good thing), and the time interval between a starting point and the failure is
called the .vocab[failure time], .vocab[survival time], or .vocab[event time].

---

### Digoxin

.pull-left[
<img src="img/02/foxglove.jpg" width="100%" style="display: block; margin: auto auto auto 0;" />
]

.pull-right[
- Foxgloves have been used in medicine for centuries
- Digoxin (the active ingredient) first isolated in 1930 
- Traditionally used for heart arrhythmia and heart failure
- One of the most prescribed drugs globally
]

---

### The DIG Trial

Investigators compared the **primary outcome** between treatment groups: the
number of days from the start of the study to either death or hospitalization
from worsening heart failure.

---

### The DIG Trial

.question[
How would ***you*** investigate this question, comparing the two treatment 
groups of digoxin vs. placebo?
]

---

### A naive analysis

Death or hospitalization due to worsening heart failure:

```r
dig %>% 
  select(ID, TRTMT, DWHF, DWHFDAYS) %>% 
  slice(1:10)
```

```
##    ID TRTMT DWHF DWHFDAYS
## 1   1     0    1     1379
## 2   2     0    1     1329
## 3   3     0    1      631
## 4   4     1    0     1157
## 5   5     0    1      191
## 6   6     0    0     1620
## 7   7     1    0      903
## 8   8     1    0     1369
## 9   9     0    0     1747
## 10 10     1    0     1074
```

---

### A naive analysis

```r
dig %>% 
  filter(DWHF == 1) %>% 
  t.test(DWHFDAYS ~ TRTMT, data = .)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  DWHFDAYS by TRTMT
## t = -6.153, df = 2195.4, p-value = 9.01e-10
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -133.68940  -69.06796
## sample estimates:
## mean in group 0 mean in group 1 
##        418.8768        520.2555
```

.question[
Are you convinced? What if we made some sort of regression model to account for
covariates? Would that be enough?
]

---

### A naive analysis

```r
dig %>% 
  count(TRTMT)
```

```
##   TRTMT    n
## 1     0 3403
## 2     1 3397
```

```r
dig %>% 
  filter(DWHF == 1) %>% 
  count(TRTMT)
```

```
##   TRTMT    n
## 1     0 1291
## 2     1 1041
```
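A quick check of what these counts imply for the t-test. This sketch hard-codes
the counts printed above rather than recomputing them from `dig`; the variable
names are illustrative only.

```r
# Counts copied from the two tables above, indexed by TRTMT value
n_total <- c("0" = 3403, "1" = 3397)   # everyone randomized
n_event <- c("0" = 1291, "1" = 1041)   # those with DWHF == 1

# Share of each arm that experienced the primary outcome
prop_event <- n_event / n_total
round(prop_event, 3)

# Share of all patients that the t-test actually used
prop_used <- sum(n_event) / sum(n_total)
round(prop_used, 3)
```

Roughly two thirds of patients never enter the comparison, and the excluded
share differs between the two arms.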

---

### Challenges

The unique nature of survival data is that typically not all units are observed
until their event times:
- Maybe a patient moved to Fiji and was lost to follow-up
- Maybe a patient never experienced the primary outcome at all because they got
hit by a bus
- Maybe the study was only funded to follow patients for two years after
enrollment

In these cases, observations are said to be .vocab[censored]: we know that 
they survived until at least their censoring time, but have no further
information.

Not accounting for censoring in an appropriate way leads to **biased** and/or
**inefficient** analyses.
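A minimal simulation sketch of the bias, under assumed settings that are not
from the DIG trial: exponential failure times with mean 3 years, and follow-up
that ends for everyone at 2 years.

```r
set.seed(1)

n <- 10000
# Hypothetical true failure times: exponential with mean 3 years
t_fail <- rexp(n, rate = 1 / 3)
# Hypothetical administrative censoring: follow-up ends at 2 years
c_time <- 2

# TRUE if the failure happened before follow-up ended
observed <- t_fail <= c_time

# Naive estimate: average failure time among observed failures only
naive_mean <- mean(t_fail[observed])
# What we would get if everyone were followed until failure
full_mean <- mean(t_fail)

c(naive = naive_mean, full = full_mean)
```

Here the naive mean can never exceed 2 years, whatever the true mean is.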

---

### Representing survival data

See live visualization regarding .vocab[study time] vs. .vocab[patient time].

---

### Representing survival data

Underlying data:
- `\(T\)`: Failure time, a non-negative random variable
- `\(C\)`: Censoring time, a non-negative random variable

Observed data for individual `\(i\)`:
- `\(Y_i\)`: `\((T_i \wedge C_i)\)`, the minimum of `\(T_i\)` and `\(C_i\)`
- `\(\delta_i\)`: `\(1_{(T_i \le C_i)}\)`, whether we observe a failure

If `\(\delta_i = 0\)`, then we have .vocab[right-censoring]: the survival time is
longer than the censoring time.

Commonly, we assume `\(C_i\)` are *i.i.d.* random variables with some distribution
and that the censoring mechanism is *independent* of the failure mechanism.

**Our goal is to make inferential statements about** `\(T\)`.
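A small sketch of how the observed pair `\((Y_i, \delta_i)\)` arises from the
underlying times. The distributions chosen for `\(T\)` and `\(C\)` here are
assumptions for illustration only.

```r
set.seed(2)

n <- 8
# Hypothetical latent times (in practice we never see both)
t_fail <- rexp(n, rate = 1 / 3)         # failure times T_i
c_cens <- runif(n, min = 0, max = 5)    # censoring times C_i, independent of T_i

# What the analyst actually gets to see
y     <- pmin(t_fail, c_cens)           # Y_i = min(T_i, C_i)
delta <- as.integer(t_fail <= c_cens)   # delta_i = 1 if the failure is observed

data.frame(y = round(y, 2), delta = delta)
```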

---

### Characterizing continuous `\(T\)`

- Density function: `\(f(t) = \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t)}{\Delta t}\)`

- Distribution function: `\(F(t) = P(T \le t) = \int_0^t f(s)ds\)`

- Survival function: `\(S(t) = P(T > t) = 1 - F(t)\)`

- Hazard function: `\(\lambda(t) = \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t | T \ge t)}{\Delta t}\)`

- Cumulative hazard function: `\(\Lambda(t) = \int_0^t \lambda(s)ds\)`

Knowing one is equivalent to knowing the others.
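One of these equivalences is the standard identity
`\(S(t) = \exp\{-\Lambda(t)\}\)`, which can be checked numerically. The sketch
below assumes a Weibull failure time with arbitrarily chosen parameters.

```r
shape <- 1.5   # hypothetical Weibull shape
scale <- 2     # hypothetical Weibull scale

# Closed-form Weibull hazard function
hazard <- function(t) (shape / scale) * (t / scale)^(shape - 1)

t_grid <- c(0.5, 1, 2, 4)

# Cumulative hazard by numerically integrating the hazard from 0 to t
cum_haz <- sapply(
  t_grid,
  function(t) integrate(hazard, lower = 0, upper = t)$value
)

# Survival function from R's built-in Weibull distribution
surv <- pweibull(t_grid, shape = shape, scale = scale, lower.tail = FALSE)

# The last two columns should agree up to integration error
cbind(t = t_grid, exp_neg_cum_haz = exp(-cum_haz), surv = surv)
```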

.question[
How might you express the hazard function in terms of the density function and
the survival function?
]

---

### Survival vs. hazard functions

Survival (or survivor) function:

`\begin{align*}
S(t) = P(T > t)
\end{align*}`
- Non-increasing with `\(S(0) = 1\)` and `\(\lim_{t \to \infty} S(t) = 0\)`
- For any given time `\(t\)`, a probability

Hazard function:

`\begin{align*}
\lambda(t) = \lim_{\Delta t \to 0^+} \frac{P(t \le T < t + \Delta t | T \ge t)}{\Delta t}
\end{align*}`

- Instantaneous failure rate, *given* already having survived to time `\(t\)`
- **Not** a probability (for continuous `\(T\)`)
- Non-negative for all `\(t\)`, with no upper bound
- Often has a more useful interpretation than the survival function
- Nice analytical properties under right-censoring
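
A short illustration that a hazard need not lie in `\([0, 1]\)`, using the
standard identity `\(\lambda(t) = f(t) / S(t)\)` and an assumed exponential
failure time with rate 2 (chosen only for this example).

```r
rate <- 2                       # hypothetical exponential rate
t_grid <- c(0.1, 0.5, 1, 2)

dens <- dexp(t_grid, rate = rate)                       # f(t)
surv <- pexp(t_grid, rate = rate, lower.tail = FALSE)   # S(t)
haz  <- dens / surv                                     # lambda(t) = f(t) / S(t)

cbind(t = t_grid, surv = surv, hazard = haz)
```

The survival column stays between 0 and 1 and decreases in `\(t\)`, while the
hazard is constant at 2.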

---