7 Working with time-to-event data

A data type that is commonly found in clinical trials is time to event data. This type of data captures the amount of time that elapses before a particular event happens. As a sub-field of statistics, survival analysis has been around for a long time, as people have thought about and worked with data like mortality records (most notably John Graunt, who used the ‘Bills of Mortality’ during the 1600s to better understand the plague and other causes of death). However, it developed rapidly during the many cancer related clinical trials of the 1960s and 1970s. In these cases, the event in question was very often death, and which is why this branch of statistics came to be known as survival analysis. However, the event can be many other things, and indeed can be a positive outcome (for example being cured of some condition). Time-to-event data also appears in other applications, such as engineering (eg. monitoring the reliability of a machine) and marketing (eg. thinking of the time-to-purchase). As well as the books already mentioned, this chapter makes use of Collett (2003b).

Usually, survival data is given in terms of time, but it can also be the number of times something happens (for example, the number of clinic appointments attended) before the event in question occurs.

Survival data is trickier to handle than the data types we have seen so far, for two main reasons. Firstly (and simply) survival data is very often skewed, so even though it is (usually) continuous, we can’t just treat it as normally distributed. Secondly (and more complicatedly, if that’s a word) with time-to-event data we don’t usually observe the full dataset.

7.1 Censored times

If a trial monitors a sample of participants for some length of time, many will experience the event before the trial is finished. However, for some of the sample we likely won’t observe the event. This could be because it doesn’t happen within the lifetime of the trial, or it could be because the participant exits the trial prematurely for some reason (eg. withdrawal), or simply stops attending follow-up appointments after a certain times. For these participants, we do know that they had not experienced the event up to some time \(t\), but we don’t know what happened next. All we know is that their time-to-event or survival time is greater than that time \(t\). These partial observations are known as censored times, and in particular as right-censored times, because the event happens after the censored time. It is possible (but less common) to have left-censored or interval-censored data, but in this course we will deal only with right-censoring.

Figure 7.1: An example of some censored data. The dashed line at time 10 indicates the end of the trial period.

If we were to treat censored times as observations, ie. as though the event had happened at time \(t\), we would bias the results of the trial very seriously. The survival times reported would be systematically shorter than the true ones. For example, in the dataset shown in Figure 7.1, we would estimate the survival probability at time 10 as 0.2, since only two of the 10 participants were still in the trial after time 10. But it may well be that some of the participants whose observations were censored before \(t=10\) were still alive at \(t=10\).

If we were to remove the censored times, and only analyse the data in which the event was observed during the lifespan of the trial, we would be losing data and therefore valuable information. This approach may well also lead to bias, for example if some subset of patients died quite soon into the trial, but the remainder lived a long time (past the end of the trial). If our analysis ignores the survivors, we are likely to underestimate the general survival time. In the dataset in Figure 7.1 there are five participants (3,6,7,8,10) whom we are no longer able to observe at time 10, but of whom none had experienced the event by the point at which they were censored.

So we know that we need to somehow include these censored times in our analysis. How we do so will depend on our approach.

7.2 The Survival Curve and the Hazard function

The field of survival analysis is relatively unusual in statistics, in that it isn’t treated predominantly parametrically. For most continuous data, it is overwhelmingly common to work with the normal distribution and its friends (eg. the student’s t distribution). Similarly binary data is dominated by the binomial distribution. Inference is therefore often focussed on the parameters \(\mu,\;\sigma\) or \(p\), as an adequate summary of the truth given whatever parameteric assumption has been made.

However, in survival analysis, it is often the case that we focus on the whole shape of the data; there isn’t an accepted dominating probability distribution. In order to be able to deal with time-to-event data, we need to introduce some key ways of working with such data.

The survival time (or time-to-event) \(t\) for a particular individual can be thought of as the value of a random variable \(T\), which can take any non-negative value. We can think in terms of a probability distribution over the range of \(T\). If \(T\) has a probability distribution with underlying probability density function \(f\left(t\right)\), then the cumulative distribution function is given by

\[F\left(t\right) = P\left(T<t\right) = \int\limits_0^t f\left(u\right)du, \] and this gives us the probability that the survival time is less than \(t\).

Definition 7.1 The survival function, \(\operatorname{S}\left(t\right)\), is the probability that some individual (in our context a participant) survives longer than time \(t\). Therefore \(\operatorname{S}\left(t\right) = 1 - F(t)\). Conventionally we plot \(\operatorname{S}\left(t\right)\) against \(t\) and this gives us a survival curve.

We can immediately say two things about survival curves:

Since all participants must be alive (or equivalent) at the start of the trial, \(\operatorname{S}\left(0\right)=1\).
Since it’s impossible to survive past \(t_2>t_1\) but not past time \(t_1\), we must have \(\frac{d\operatorname{S}\left(t\right)}{dt}\leq{0}\), ie. \(\operatorname{S}\left(t\right)\) is non-increasing.

One summary that is often used is the median survival time, the time at which half of the participants have already experienced the event, and half haven’t.

Figure 7.2 shows two survival curves, comparing different therapies. We see that the hormonal therapy reduces the survival time slightly compared to no hormonal therapy.

An example of two survival curves, taken from @syriopoulou2022standardised.

Figure 7.2: An example of two survival curves, taken from Syriopoulou et al. (2022).

Following on from the survival function, we have another (slightly less intuitive) quantity: the Hazard function \(h\left(t\right)\).

Definition 7.2 The Hazard function \(h(t)\) is the probability that an individual who has survived up to time \(t\) fails just after time \(t\); in other words, the instantaneous probability of death (or experiencing the event) at time \(t\).

If we use \(T\) to denote the random variable of survival time (or time-to-event) then \(\operatorname{S}\left(t\right)\) and \(h(t)\) are defined by

\[\begin{align*} \operatorname{S}\left(t\right)&= \operatorname{Pr}\left(T>t\right)\\ h\left(t\right) & = \lim\limits_{s\rightarrow{0+}}\frac{\operatorname{Pr}\left(t<T<t+s\mid{T>t}\right)}{s}. \end{align*}\]

Using the definition of conditional probability, we can rewrite \(h(t)\) as

\[\begin{align*} h\left(t\right) & = \lim\limits_{s\rightarrow{0+}}\frac{\operatorname{Pr}\left(t<T<t+s\mid{T>t}\right)}{s} \\ & = \lim\limits_{s\rightarrow{0+}}\left[\frac{1}{\operatorname{Pr}\left(T>t\right)}\cdot\frac{\operatorname{Pr}\left(\left(t<T<t+s\right)\cap\left(T>t\right)\right)}{s}\right] \\ & = \lim\limits_{s\rightarrow{0+}}\left[\frac{1}{\operatorname{Pr}\left(T>t\right)}\cdot\frac{\operatorname{Pr}\left(t<T<t+s\right)}{s}\right]\\ & = \frac{f\left(t\right)}{\operatorname{S}\left(t\right)}, \end{align*}\]

where \(f\left(\cdot\right)\) is the probability density of \(T\). The hazard function can take any positive value (unlike the survival function), and for this reason \(\log\left(h\left(t\right)\right)\) is often used to transform it to the real line. The hazard function can also be called the ‘hazard rate’, the ‘instantaneous death rate’, the ‘intensity rate’ or the ‘force of mortality’.

As we hinted before, there are fundamentally two ways to deal with survival data: we can go about things either parametrically or non-parametrically. Unusually for statistics in general, the non-parametric paradigm is prevalent in survival analysis. We will consider some methods from both paradigms.

7.2.1 The Kaplan-Meier estimator

The Kaplan-Meier estimator is a non-parametric estimate of \(\operatorname{S}\left(t\right)\), originally presented in Kaplan and Meier (1958). The idea behind it is to divide the interval \(\left[0,\;t\right]\) into many short consecutive intervals,

\[\left[0,\,t\right] = \bigcup\limits_{k=0}^K \left[s_k,\,s_{k+1}\right],\] where \(s_k<s_{k+1}\;\forall{k}\), \(s_0=0\) and \(s_{K+1}=t\). We then estimate the probability of surviving past some time \(t\) by multiplying together the probabilities of surviving the successive intervals up to time \(t\). No distributional assumptions are made, and the probability of surviving interval \(\left[s_k,\,s_{k+1}\right]\) is estimated by \(1-Q\), where

\[Q = \frac{\text{Number who die in that interval}}{\text{Number at risk of death in that interval}}.\] More precisely, let’s say that deaths are observed at times \(t_1<t_2<\ldots < t_n\), and that the number of deaths at time \(t_i\) is \(d_i\) out of a possible \(n_i\). Then for some time \(t\in\left[t_J,\,t_{J+1}\right)\), the Kaplan-Meier estimate of \(\operatorname{S}\left(t\right)\) is

\[\hat{\operatorname{S}}\left(t\right) = \prod\limits_{j=0}^J \frac{\left(n_j - d_j\right)}{n_j}.\]

Notice that the number of people at risk at time \(t_{j+1}\), denoted \(n_{j+1}\), will be the number of people at risk at time \(t_j\) (which was \(n_j\)), minus any who died at time \(t_j\) (which we write as \(d_j\)) and any who were censored in the interval \(\left[t_j,\,t_{j+1}\right)\). In this way, the Kaplan-Meier estimator incorporates information from individuals with censored survival times up to the point they were censored.

Greenwood (1926) derived an approximation to the variance of the Kaplan-Meier estimate of survival curve, given by

\[\hat{V}\left(t\right) = \left(\hat{\operatorname{S}}\left(t\right)\right)^2\sum\limits_{t_{i}\leq{t}} \frac{d_i}{n_i\left(n_i-d_i\right)} \]

This uses the Delta method (which we’ve seen before in the binary outcome chapters) and makes the assumption that events at time \(t_i\) are independent binomial draws from a population of size \(n_i\). We can use this to form confidence intervals for the survival curve, and indeed ggsurvfit can add these automatically to plots.

Example 7.1 Edmonson et al. (1979) conducted a trial on patients with advanced ovarian cancer, comparing cyclophosphamide (group \(C\)) with a mixture of cyclophosphamide and adriamycin (group \(T\)). Patients were monitored, and their time of death was recorded, or a censoring time if they were alive at their last observation. The data are shown in Table 7.1.

Table 7.1: Table 7.2: Ovarian cancer data. FU time gives the survival or censoring time, and FU status the type: 0 for a censored observation, 1 for death.
	FU_time	FU_status
1	59	1
2	115	1
3	156	1
22	268	1
23	329	1
24	353	1
25	365	1
26	377	0
4	421	0
5	431	1
6	448	0
7	464	1
8	475	1
9	477	0
10	563	1
11	638	1
12	744	0
13	769	0
14	770	0
15	803	0
16	855	0
17	1040	0
18	1106	0
19	1129	0
20	1206	0
21	1227	0

We see that there are 26 individuals, and we have the time of death for 12 of them. The remaining 14 observations are censored. We can use this data to calculate the Kaplan-Meier estimator of the survival curve, as shown in Table 7.3. The columns are (from left to right): time \(t_j\); number at risk \(n_j\); number of events/deaths \(d_j\); number of censorings in \(\left[t_{j-1},\,t_j\right)\); estimate of survival curve and standard error of the estimate (using Greenwood’s formula).

Table 7.3: Table 7.4: Kaplan-Meier estimator calculations for ovarian cancer dataset.
time	n_risk	n_event	n_cens	survival	SE_surv
59	26	1	0	0.9615385	0.0377146
115	25	1	0	0.9230769	0.0522589
156	24	1	0	0.8846154	0.0626563
268	23	1	0	0.8461538	0.0707589
329	22	1	0	0.8076923	0.0772920
353	21	1	0	0.7692308	0.0826286
365	20	1	0	0.7307692	0.0869893
431	17	1	2	0.6877828	0.0918815
464	15	1	1	0.6419306	0.0965213
475	14	1	0	0.5960784	0.0999261
563	12	1	1	0.5464052	0.1032094
638	11	1	0	0.4967320	0.1051027

Figure 7.3 shows the Kaplan-Meier survival curve estimate for the ovarian data. Using the package ggsurvfit we can add in a table below the \(x\) axis showing the number at risk and the number of events at some times points.

Figure 7.3: Kaplan-Meier estimate of survival curve for ovarian cancer data with a 95% confidence level.

The Kaplan-Meier estimate may seem a bit dissatisfying, since it stops changing at \(t=638\) with a probability of \(0.497\). However, this is really (in a non-parametric setting) all we can say with the data available; 10 of the participants were definitely still alive at \(t=638\), and some of the other censored participants may also have been.

For a clinical trial, we want to plot the survival curves separately for the different treatment groups. This will give a first, visual, idea of whether there might be a difference, and also of the suitability of certain models (we’ll talk about this later).

Example 7.2 Figure 7.4 shows the Kaplan Meier plots for the ovarian cancer data from Figure 7.3, this time split by treatment group.

Figure 7.4: Kaplan-Meier curves for the ovarian cancer data, split by treatment group.

The second dataset we will use throughout this chapter has been simulated based on a trial of acute myeloid leukemia (Le-Rademacher et al. (2018)) and is from the survival package Therneau (2024). The Kaplan-Meier estimate for the myeloid data is shown in Figure 7.5.

Figure 7.5: Kaplan Meier curves for the Myeloid data, split by treatment.

7.2.2 A parametric approach

In a parametric approach, we’ll assume that the survival time \(T\) follows some probability distribution, up to unknown parameters which we will estimate from the data. The simplest distribution for time-to-event data is the exponential distribution, which has density

\[f\left(t\right) = \lambda e^{-\lambda t} \text{ for }t>0,\]

survival function

\[\operatorname{S}\left(t\right) = 1- \int\limits_{0}^t{\lambda e^{-\lambda t}} = e^{-\lambda t}, \] and mean survival time \(\frac{1}{\lambda}\). The hazard function is therefore

\[h\left(t\right) = \frac{f\left(t\right)}{S\left(t\right)} = \lambda, \] that is, the hazard is constant.

Given some dataset, we want to be able to find an estimate for \(\lambda\) (or the parameters of our distribution of choice).

7.2.2.1 Maximum likelihood for time-to-event data

Suppose our dataset has \(n\) times \(t_1,\,t_2,\,\ldots,t_n\). Of these, \(m\) are fully observed and \(n-m\) are censored. We can create a set of indicators \(\delta_1,\ldots,\delta_n\), where \(\delta_i=1\) if observation \(i\) is fully observed and \(\delta_i=0\) if it is censored.

Usually, the likelihood function is computed by multiplying the density function evaluated at each data point, \(f\left(t_i\mid{\text{params}}\right)\). However, this won’t work for survival data, because for our censored times (those for which \(\delta_i=0\)) we only know that the time-to-event is greater than \(t_i\). For these observations, it is the survival function (remember that this is \(p\left(T>t\right)\)) that contributes what we need to the likelihood function.

Therefore (for any probability distribution) we have

\[\begin{equation} L = \prod\limits_{i=1}^n f\left(t_i\right)^{\delta_i} S\left(t_i\right)^{1-\delta_i}. \tag{7.1} \end{equation}\]

If we have \(T\sim Exp\left(\lambda\right)\) then the log-likelihood is

\[\begin{align*} \ell\left(\lambda \mid \text{data}\right) &= \sum\limits_{i=1}^n \delta_i\left(\log\lambda - \lambda t_i\right) - \sum\limits_{i=1}^n\left(1-\delta_i\right)\lambda t_i \\ & = m\log\lambda - \lambda\sum\limits_{i=1}^n t_i. \end{align*}\]

From this we can find the maximum likelihood estimator (MLE)

\[\hat{\lambda} = \frac{m}{\sum\limits_{i=1}^n t_i}.\] The variance of the MLE is

\[\begin{equation} \operatorname{var}\left(\hat\lambda\right) = \frac{\lambda^2}{m}, \tag{7.2} \end{equation}\]

which we can approximate by

\[\operatorname{var}\left(\hat\lambda\right) \approx \frac{m}{\left(\sum\limits_{i=1}^n t_i\right)^2}.\]

Notice that the numerator in Equation (7.2) is \(m\), the number of complete observations (rather than \(n\) the total number including censored observations). This shows that there is a limit to the amount we can learn if a lot of the data is censored.

Example 7.3 Returning to the dataset from Example 7.1, we can fit an exponential distribution to the data simply by estimating the MLE

\[\begin{align*} \hat{\lambda}_C &= \frac{m_C}{\sum\limits_{i=1}^{n_C} t_i}\\ & =\frac{7}{6725}\\ & = 0.00104 \end{align*}\]

and

\[\begin{align*} \hat{\lambda}_T &= \frac{m_T}{\sum\limits_{i=1}^{n_T} t_i}\\ & =\frac{5}{8863}\\ & = 0.00056 \end{align*}\]

mC_ov = sum((ovarian$fustat==1)&(ovarian$rx==1))
mT_ov = sum((ovarian$fustat==1)&(ovarian$rx==2))
tsum_ov_C = sum(ovarian$futime[ovarian$rx==1])
tsum_ov_T = sum(ovarian$futime[ovarian$rx==2])
m_ov = mT_ov + mC_ov
tsum_ov = tsum_ov_C + tsum_ov_T
lamhat_ov_C = mC_ov / tsum_ov_C
lamhat_ov_T = mT_ov / tsum_ov_T

Kaplan Meier estimates of survival curves for the ovarian data (solid lines), with the fitted exponential S(t) shown in dashed lines (red = group C, blue = group T).

Figure 7.6: Kaplan Meier estimates of survival curves for the ovarian data (solid lines), with the fitted exponential S(t) shown in dashed lines (red = group C, blue = group T).

We can do the same for the myeloid data. Figure 7.7 shows the fitted curves, using \(S\left(t\right)=\exp\left[-\hat{\lambda}_Xt\right]\) for group \(X\).

Kaplan Meier estimates of survival curves for the Myeloid data (solid lines), with the fitted exponential S(t) shown in dashed lines (red = group C, blue = group T).

Figure 7.7: Kaplan Meier estimates of survival curves for the Myeloid data (solid lines), with the fitted exponential S(t) shown in dashed lines (red = group C, blue = group T).

7.2.3 The Weibull distribution

Having only one parameter, the exponential distribution is not very flexible, and often doesn’t fit data at all well. A related, but more suitable distribution is the Weibull distribution.

Definition 7.3 The probability density function of a Weibull random variable is

\[ f\left(t\mid \lambda,\,\gamma\right) = \begin{cases} \lambda\gamma t^{\gamma-1}\exp\left[{-\lambda t^{\gamma}}\right] & \text{for }t\geq{0}\\ 0 & \text{otherwise}. \end{cases} \]

Here, \(\gamma\) is the shape parameter, and \(\lambda\) is the scale parameter. If \(\gamma=1\) then this reduces to an exponential distribution. You can read more about it, should you choose to, in Collett (2003b).

For the Weibull distribution, we have

\[\operatorname{S}\left(t\right) = \exp\left(-\lambda t^{\gamma}\right). \]

As with the exponential distribution, we can use Equation (7.1) for the likelihood. For the Weibull distribution this becomes

\[\begin{align*} L\left(\lambda,\gamma\mid{\text{data}}\right) & = \prod\limits_{i=1}^n\left\lbrace \lambda \gamma t_i^{\gamma-1}\exp\left(-\lambda t_i^{\gamma}\right) \right\rbrace^{\delta_i} \left\lbrace \exp\left[-\lambda t_i^{\gamma}\right] \right\rbrace ^{1-\delta_i}\\ & = \prod\limits_{i=1}^n \left\lbrace \lambda\gamma t_i^{\gamma-1} \right\rbrace^{\delta_i} \exp\left(-\lambda t_i^{\gamma}\right) \end{align*}\]

and therefore

\[\begin{align*} \ell\left(\lambda,\gamma\mid{\text{data}}\right) & = \sum\limits_{i=1}^n \delta_i \log\left(\lambda\gamma\right) + \left(\gamma-1\right)\sum\limits_{i=1}^n\delta_i \log{t_i} - \lambda \sum\limits_{i=1}^n t_i^{\gamma}\\ & = m\log\left(\lambda\gamma\right) + \left(\gamma-1\right)\sum\limits_{i=1}^n \delta_i\log t_i - \lambda\sum\limits_{i=1}^n t_i^{\gamma}. \end{align*}\]

For the maximum likelihood estimators, we differentiate (separately) with respect to \(\lambda\) and \(\gamma\) and equate to zero, to solve for the estimators \(\hat\lambda\) and \(\hat\gamma\).

The equations we end up with are

\[\begin{align} \frac{m}{\hat{\lambda}} - \sum\limits_{i=1}^n t_i^{\hat\gamma} & = 0 \tag{7.3}\\ \frac{m}{\hat{\gamma}} + \sum\limits_{i=1}^n \delta_i\log t_i - \hat{\lambda}\sum\limits_{i=1}^n t_i^{\hat\gamma} \log t_i &=0 \tag{7.4}. \end{align}\]

We can rearrange Equation (7.3) to

\[\hat{\lambda} = \frac{m}{\sum\limits_{i=1}^n t_i^{\hat\gamma}},\] and substitute this into Equation (7.4) to find

\[\frac{m}{\hat{\gamma}} + \sum\limits_{i=1}^n \delta_i\log t_i - \frac{m}{\sum\limits_{i=1}^n t_i^{\gamma}}\sum\limits_{i=1}^n t_i^{\hat\gamma} \log t_i=0. \] This second equation is analytically intractable, so numerical methods are used to find \(\hat\gamma\), and then this value can be used to find \(\hat\lambda\).

Example 7.4 We can fit Weibull distributions to our myeloid dataset, as shown in Figure 7.8.

\[S\left(t\right) = \exp\left(-\lambda t^{\gamma}\right).\]

Weibull fit to survival curve of Myeloid data, dashed lines (Kaplan Meier estimate also shown in solid lines). Red for group T, black for group C.

Figure 7.8: Weibull fit to survival curve of Myeloid data, dashed lines (Kaplan Meier estimate also shown in solid lines). Red for group T, black for group C.

We see that there is some improvement compared to the exponential fit in Figure 7.7, but it still seems not to capture the fundamental shape.

Aside: Sample size calculations for time-to-event data

There are implications here for sample size calculations, which must take into account the duration of a trial; it is important that trials monitor patients until a sufficient proportion have experienced the event (whatever it is). Sample size calculations for time-to-event data therefore have two components:

The power of the trial can first be expressed in terms of \(m\), the number of complete observations.
A separate calculation is needed to estimate the number of participants needing to be recruited, and length of trial, to be sufficiently likely to achieve that value of \(m\).

Both of these calculations rely on a number of modelling assumptions, and on previous scientific/clinical data (if available).

We will think more about how this can be used in the next section, when we come to compare treatment effects.

References

———. 2003b. Modelling Survival Data in Medical Research. 2nd ed. Texts in Statistical Science. Chapman & Hall.

Edmonson, John H, Thomas R Fleming, David G Decker, George D Malkasian, Edward O Jorgensen, John A Jefferies, Maurice J Webb, and Larry K Kvols. 1979. “Prognosis in Advanced Ovarian Carcinoma Versus Minimal Residual.” Cancer Treatment Reports 63 (2): 241–47.

Greenwood, Major. 1926. “The Natural Duration of Cancer.” Reports on Public Health and Medical Subjects 33: 1–26.

Kaplan, Edward L, and Paul Meier. 1958. “Nonparametric Estimation from Incomplete Observations.” Journal of the American Statistical Association 53 (282): 457–81.

Le-Rademacher, Jennifer G, Ryan A Peterson, Terry M Therneau, Ben L Sanford, Richard M Stone, and Sumithra J Mandrekar. 2018. “Application of Multi-State Models in Cancer Clinical Trials.” Clinical Trials 15 (5): 489–98.

Syriopoulou, Elisavet, Tove Wästerlid, Paul C Lambert, and Therese M-L Andersson. 2022. “Standardised Survival Probabilities: A Useful and Informative Tool for Reporting Regression Models for Survival Data.” British Journal of Cancer 127 (10): 1808–15.

Therneau, Terry M. 2024. A Package for Survival Analysis in r. https://CRAN.R-project.org/package=survival.