class: center, middle, inverse, title-slide

.title[
# Censored data
]
.author[
### Yue Jiang
]
.date[
### STA 490 / STA 690
]

---

### Customer service

<img src="img/callcenter.jpg" width="60%" style="display: block; margin: auto;" />

Your last ten tickets took 2.5, 3.1, 5.7, 8.0, 9.8, 10+, 10+, 10+, 10+, and 10+ minutes each to resolve.

.question[
- What do you estimate is your average ticket resolution time? (You really want to see if you have a good shot at the bonus.)
- What if you made a parametric assumption (let's say...exponentially distributed)?
]

---

### Digoxin

<img src="img/02/digoxin.png" width="70%" style="display: block; margin: auto;" />

.pull-left[
<img src="img/02/foxglove.jpg" width="100%" style="display: block; margin: auto auto auto 0;" />
]

.pull-right[
In the landmark DIG study (Digitalis Investigation Group), investigators compared the primary outcome, the number of days from the start of the study to death, between digoxin and placebo among almost 8,000 patients with heart failure.
]

---

### Let's take a look at the data:

```
##   ID TRTMT DWHF DWHFDAYS
## 1 21     1    0     1320
## 2 22     0    0     1333
## 3 23     1    0     1473
## 4 24     1    0      521
## 5 25     0    0     1173
## 6 26     1    1      823
## 7 27     0    1     1013
## 8 28     0    0     1039
```

.question[
- What if we only used observations that were fully observed?
- What if we used only information regarding "time"?
- What if we only looked at *whether* someone died?
- What if we waited until everyone died and then analyzed the full data?
]

---

### Challenges

The unique nature of survival data is that typically not all units are observed until their event times:

- Maybe a patient moved to Fiji and was lost to follow-up
- Maybe a patient never experienced the primary outcome at all because they got hit by a bus
- Maybe the study was only funded to follow patients for two years after enrollment

In these cases, observations are said to be .vocab[censored]: we know that they survived until at least their censoring time, but we know nothing about what happened after it.

Not accounting for censoring in an appropriate way leads to **biased** and/or **inefficient** analyses.

---

### Representing survival data

The notion of .vocab[study time] vs. .vocab[patient time] (see board)

Underlying data:

- `\(T\)`: Failure time, a non-negative random variable
- `\(C\)`: Censoring time, a non-negative random variable

Observed data for individual `\(i\)`:

- `\(Y_i\)`: `\((T_i \wedge C_i)\)`, the minimum of `\(T_i\)` and `\(C_i\)`
- `\(\delta_i\)`: `\(1_{(T_i \le C_i)}\)`, whether we observe a failure

--

If `\(\delta_i = 0\)`, then we have .vocab[right-censoring]: the survival time is longer than the censoring time.

Commonly, we assume `\(C_i\)` are *i.i.d.* random variables with some distribution and that the censoring mechanism is *independent* of the failure mechanism.

---

### Life expectancy

<img src="img/serfs.jpg" width="80%" style="display: block; margin: auto;" />

.question[
- What are some difficulties with trying to estimate moments?
- What are some difficulties with parametric methods in general?
]

---

### Left, right, and interval censoring

<img src="img/shoe.jpg" width="80%" style="display: block; margin: auto;" />

---

### Have you done this before?

---

### Back to the DIG trial...

<br>

<img src="img/truncation.png" width="100%" style="display: block; margin: auto;" />

.question[
Can you think of any potential selection biases here?
]

---

### Truncation vs. censoring

Not only did we have right censoring, but we also had .vocab[left truncation]. Patients who were enrolled in the DIG trial had not yet died by the time they entered the trial. We have *no information* on the people who died before study coordinators could get in touch with them.

.question[
In the shoe-tying study, what might be an example of a study design that included left-censoring? What about left truncation?

Can you come up with an example of a study with **right truncation** (i.e., one in which we are unable to observe *that an observation exists* unless `\(T < t\)`)?
]

---

### Truncation vs. censoring

Improper consideration of truncation can lead to biased analyses (especially terrible for observational studies, in which we might even have differential truncation between treatment groups).

Many commonly used methods, such as logistic regression, can't deal with truncation.

(With that said, we're...not going to be covering issues related to truncation in this course, though survival techniques can properly accommodate it.)
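
---

### Sketch: the ticket data as observed pairs

As a rough illustration of the `\((Y_i, \delta_i)\)` notation, the ten ticket times from the customer service slide can be coded as observed times plus failure indicators. This is only a sketch, not the course's own code: it assumes every "10+" ticket was censored at exactly 10 minutes, and it takes the exponential assumption suggested on that slide at face value. Under independent right-censoring, the exponential likelihood is `\(\prod_i \lambda^{\delta_i} e^{-\lambda Y_i}\)`, so the rate estimate is the number of observed failures divided by the total observed time. The variable names are made up for this example.

```python
# Ticket times from the customer service slide.
# Assumption: each "10+" ticket was censored at exactly 10 minutes.
observed = [2.5, 3.1, 5.7, 8.0, 9.8, 10.0, 10.0, 10.0, 10.0, 10.0]  # Y_i = min(T_i, C_i)
delta = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # delta_i = 1 if the resolution was observed

n_events = sum(delta)        # number of observed failures
total_time = sum(observed)   # total observed time across all tickets

# Two naive summaries that mishandle censoring:
# (a) drop the censored tickets entirely
mean_complete_cases = sum(y for y, d in zip(observed, delta) if d == 1) / n_events
# (b) pretend each censored time is the true resolution time
mean_ignoring_censoring = total_time / len(observed)

# Exponential MLE under independent right-censoring:
# rate = events / total observed time, so mean = total observed time / events
mean_exponential = total_time / n_events

print(f"complete cases only:        {mean_complete_cases:.2f} minutes")
print(f"censoring ignored:          {mean_ignoring_censoring:.2f} minutes")
print(f"exponential, censored MLE:  {mean_exponential:.2f} minutes")
```

With these inputs the three estimates come out to about 5.8, 7.9, and 15.8 minutes. Both naive summaries fall below the censoring-aware estimate, though the last number is only as trustworthy as the exponential assumption behind it.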