D Computer practical 4 - Analysis for Binary and Survival data

R practicalities - click here

There are many, many packages in R that implement methods for designing and analysing clinical trials (see a list at CRAN task view). We will look at some of these, and will also write our own code for some tasks. Remember that to install a package, you can do

install.packages("<packagename>")

If you have problems running R on your laptop, or on the university machines, the most foolproof way might be to use Github codespaces (thanks to Louis Aslett, who developed this). You may already be familiar with this from previous modules.

An advantage of this is that you can open the same codespace (the same instance of R) from any computer, so if you plan to work on things (for example your summative assignment, which will involve some R) from more than one computer, this might be ideal.

Here is a guide to getting set up.

Direct link to codespace

In this practical we’ll focus on the analysis of binary and survival data. We’ll make use of several new packages. Installing them all now should hopefully prevent it from disrupting your flow! We’ll load them as we go along.

Notice that where there are code snippets that you want to copy directly, you can hover your cursor in the top right of the code box and a ‘copy’ icon will appear.

install.packages(c("tidyverse", "HSAUR", "HSAUR3", "pROC", "survival", "ggsurvfit", "survminer", "car"))

D.1 Binary outcome data

The dataset we’ll use to explore Binary outcome data is the respiratory dataset, from the package HSAUR. In this trial, patients across two centres were randomly assigned to receive treatment or placebo, and their respiratory status was recorded as either ‘poor’ or ‘good’. Measurements were recorded at months 0, 1, 2, 3, 4, but we will keep only month 0 (as a baseline) and month 4 (as a final outcome). We process the data below:

library(HSAUR)
data("respiratory")
# Keep only months 0 and 4
resp_04 = respiratory |>
  filter(month %in% c(0,4))

## convert to wider to have one row per participant
resp_df = pivot_wider(
  data = resp_04, 
  id_cols = c("subject", "centre", "treatment", "sex", "age"),
  names_from = "month",
  names_prefix = "status",
  values_from = "status"
)

For each patient, we have the following baseline covariates:

  • sex
  • age
  • treatment centre (centre 1 or centre 2)
  • symptom status (poor or good).

The outcome variable is whether the status of the patient’s symptoms are poor or good after four months of the trial. Note that unusually we do have the baseline observation of this quantity. Since a lot of the earlier methods we covered in Chapter 6 involve simple calculations with the data (and working with survival data is much less familiar), we will skip straight logistic regression.

D.1.1 Logistic regression

Logistic regression give us a way to account for the baseline covariates. To fit a logistic regression in R we use the function glm. This fits all kinds of generalised linear models, and so we specify that we want a logistic regression by choosing family = binomial(link='logit').

The code is included in all examples in Sections 6.3 and 6.4, so if you aren’t sure how to do something they would be a good place to look.

Exercise D.1 Fit a logistic regression model to the resp_df data set, being careful to check for any issues with the data.

Click for solution

Solution. The first model we fit involves all covariates linearly:

model1 = glm(status4 ~ centre + treatment + sex + age + status0, 
             family = binomial(link='logit'), data=resp_df)
summary(model1)
## 
## Call:
## glm(formula = status4 ~ centre + treatment + sex + age + status0, 
##     family = binomial(link = "logit"), data = resp_df)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.83921    0.67498  -1.243 0.213757    
## centre2             1.27397    0.48546   2.624 0.008684 ** 
## treatmenttreatment  1.08498    0.47415   2.288 0.022122 *  
## sexmale             0.33480    0.60171   0.556 0.577924    
## age                -0.02978    0.01805  -1.650 0.099035 .  
## status0good         1.72562    0.47024   3.670 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 153.44  on 110  degrees of freedom
## Residual deviance: 119.34  on 105  degrees of freedom
## AIC: 131.34
## 
## Number of Fisher Scoring iterations: 4

and we can inspect the variance inflation factor to check for multicollinearity:

##    centre treatment       sex       age   status0 
##  1.171418  1.120405  1.213698  1.194655  1.066666

We can also try one with age squared, since the effect might not be linear (the age range is 11 to 68 which is quite a wide range). While we’re at it we’ll remove sex from the model since it appears to have no effect

model2 = glm(status4 ~ centre + treatment + age + I(age^(2)) + status0, 
             family = binomial(link='logit'), data=resp_df)
summary(model2)
## 
## Call:
## glm(formula = status4 ~ centre + treatment + age + I(age^(2)) + 
##     status0, family = binomial(link = "logit"), data = resp_df)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.517788   1.512818   1.003 0.315723    
## centre2             1.268735   0.486975   2.605 0.009178 ** 
## treatmenttreatment  0.996219   0.465348   2.141 0.032290 *  
## age                -0.173598   0.087107  -1.993 0.046271 *  
## I(age^(2))          0.002000   0.001167   1.714 0.086530 .  
## status0good         1.671810   0.471304   3.547 0.000389 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 153.44  on 110  degrees of freedom
## Residual deviance: 116.44  on 105  degrees of freedom
## AIC: 128.44
## 
## Number of Fisher Scoring iterations: 4

This is an improvement in terms of AIC and deviance, so we’ll use this model (for now).

Exercise D.2 Perform some diagnostics on the model you fitted in Exercise D.1, focussing on the model’s ability to discriminate between outcomes, and also on how well the model calibrates to the observed data.

Click for solution

Solution. Firstly we’ll use ROC analysis to assess the model in terms of how well it discriminates the participants in terms of their respiratory status at 4 months.

library(pROC)
fit_resp = fitted(model2)   # Fitted values from model3
out_resp = resp_df$status4   # outcome values (1 or 2)
roc_resp_df = data.frame(fit = fit_resp, out = out_resp)
roc_resp = roc(data=roc_resp_df, response = out, predictor=fit)
roc_resp
## 
## Call:
## roc.data.frame(data = roc_resp_df, response = out, predictor = fit)
## 
## Data: fit in 52 controls (out poor) < 59 cases (out good).
## Area under the curve: 0.8155

The area under the curve (AUC) is around 0.8, which is generally considered quite good.

We can also plot the ROC curve

ggroc(roc_resp, legacy.axes=T) + geom_abline(slope=1, intercept=0, type=2)

Next we’ll assess the model in terms of calibration.

## Group the observations into age groups (I've chosen 10 year bins)
resp_df$age10 = round(resp_df$age, -1)
# find mean status (minus 1 because factor levels are 1 and 2) and number of obs
# for each combination of factor/age group levels
resp_sum = resp_df %>% 
  group_by(age10, centre, treatment, status0) %>%
  summarise(
    mean = mean(as.numeric(status4))-1, 
    n=length(age10), 
    .groups="keep")

# Plot the observations, using facet_wrap to deal with two of the factors 
 
obs_plot = ggplot(data=resp_sum, aes(x=age10, col=treatment)) +
  geom_point(aes(y=mean, size=n), pch=16) + 
  theme_bw()+
  facet_wrap(c("centre", "status0")) + theme(legend.position = "bottom")

## To include the estimate and SD from the model, we use the original dataset with continuous age,
# fit model 2 including an estimate of SE, and use geom_line and geom_ribbon to add the fitted model with 95% intervals

fit_resp = predict(model2, newdata=resp_df, se.fit=T, type="response")

resp_df$fit = fit_resp$fit
resp_df$fit_se = fit_resp$se.fit

obs_plot + 
  geom_line(data=resp_df, aes(x=age, y=fit)) +
  geom_ribbon(
    data=resp_df, 
    aes(x=age, ymin = fit - 1.96*fit_se, ymax = fit + 1.96*fit_se, fill = treatment), 
    alpha=0.3
    )
Observations and fitted model for combinations of centre (1: top and 2: bottom) for baseline status (poor: left and good: right)

Figure D.1: Observations and fitted model for combinations of centre (1: top and 2: bottom) for baseline status (poor: left and good: right)

Overall, the model does not appear to be a great fit, but there is no systematic cause for concern.

These plots also give us an idea of the model’s fit for different categories of patient. For example, the probability of good respiratory symptoms at 4 months is much higher for a patient in the treatment group who had good symptoms at month zero, especially if they are young and belong to centre 2. The model’s output is much more bleak for a patient in group \(C\) at centre 1 who had poor symptoms at month zero.

Having performed some diagnostics, we can proceed to use our model to provide information about the effect of the treatment to improve respiratory symptoms.

Exercise D.3 Compute a 95% CI for the odd ratio for this trial, using the model you built in Exercise D.1.

Click for solution

Solution. We can find the details of the coefficients via

summary(model2)
## 
## Call:
## glm(formula = status4 ~ centre + treatment + age + I(age^(2)) + 
##     status0, family = binomial(link = "logit"), data = resp_df)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.517788   1.512818   1.003 0.315723    
## centre2             1.268735   0.486975   2.605 0.009178 ** 
## treatmenttreatment  0.996219   0.465348   2.141 0.032290 *  
## age                -0.173598   0.087107  -1.993 0.046271 *  
## I(age^(2))          0.002000   0.001167   1.714 0.086530 .  
## status0good         1.671810   0.471304   3.547 0.000389 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 153.44  on 110  degrees of freedom
## Residual deviance: 116.44  on 105  degrees of freedom
## AIC: 128.44
## 
## Number of Fisher Scoring iterations: 4

and find that the estimate of the log OR is 0.996, with an SE of 0.465, and therefore a 95% CI for the log OR is

est_logOR = summary(model2)$coefficients[3,1]
se_logOR = summary(model2)$coefficients[3,2]
logOR_CI = c(est_logOR - qnorm(0.975)*se_logOR, est_logOR + qnorm(0.975)*se_logOR)

Finally we can use this to find a CI for the OR itself

exp(logOR_CI)
## [1] 1.087796 6.741514

We call this the adjusted odds ratio, since we have adjusted for other covariates. Since this confidence interval does not contain the null value (1 for the OR), we see that the treatment effect is significant.

Note that these values are still quite difficult to interpret - this is an important issue in binary outcome RCTs in particular, as you’ll see in Formative 2.

D.2 Analysis for Survival data

The packages we need for this section are survival, ggsurvfit and survminer, so load those now.

The first dataset we will use is aml - this dataset is from a trial investigating whether the standard course of chemotheraphy should be extended for some additional cycles (‘maintenance’) for patients with Acute Myelogenous Leukemia.

Exercise D.4 Look at the help file for the aml dataset and make sure you understand what each variable is doing. R might ask which aml dataset you want - if it does, choose the one from the survival package.

Click for solution

Solution. The data set aml has three columns

  • time - survival or censoring time for each patient. This is the last time at which that patient was recorded.
  • status - censoring status. By convention, this is 1 if a death is observed (ie. for complete data) and 0 for censored data (ie. the time in the time column was the last time that patient was seen, and they were still alive)
  • x - maintenance chemotherapy given? This is the treatment variable

The first step is to combine the first two columns into a form we can use. We do this using the Surv function in the package survival, which creates a ‘survival object’ that we can then use in various other functions. This object contains the times and the information about which observations are censored.

To create a survival object from some dataset dataframe containing a time variable and a censoring status variable, the general form is

surv_obj = with(data=<dataframe>, Surv(<time variable>, <censoring status variable>))

or if you prefer,

surv_obj = Surv(<dataframe$time variable>, <dataframe$censoring status variable>)

Exercise D.5 Use the Surv function now on the aml data. The output will contain some notation you probably haven’t seen before - can you work out what it means?

Click for solution

Solution. We create the survival object for aml by

surv_aml = with(data=aml, Surv(time,status))
surv_aml
##  [1]   9   13   13+  18   23   28+  31   34   45+  48  161+   5    5    8    8   12   16+  23   27 
## [20]  30   33   43   45

This is a vector of the time values, in the same order as in aml. You’ll notice that some have a ‘+’ attached. This denotes the censored observations (the notation reflects the fact that the true time of death/the event will be greater than this).

D.2.1 Fitting a survival curve

The next thing we probably want to do is to estimate the survival function and plot the survival curve. The first method we’ll use is the Kaplan-Meier estimator.

D.2.1.1 Kaplan-Meier

To fit a Kaplan-Meier survival curve, we use the function survfit, which is specified using a formula, much like lm or glm. To fit a Kaplan-Meier estimate with a data frame split by treatment effect, the general form is

survfit(<surv_obj> ~ <treatment var>, data = <dataframe>)

We can then use summary to see the intermediary calculations at each step, and plot (for base plot) or ggsurvfit (from ggplot2 and ggsurvfit) to plot the curves.

Exercise D.6 Fit a Kaplan-Meier estimator to the aml data. View the table using summary. Plot the curves.

Click for solution

Solution. To fit the Kaplan-Meier estimator we use

km_aml = survfit(surv_aml ~ x, data=aml)

We can then look at the summary table and plot the data by

summary(km_aml)
## Call: survfit(formula = surv_aml ~ x, data = aml)
## 
##                 x=Maintained 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     9     11       1    0.909  0.0867       0.7541        1.000
##    13     10       1    0.818  0.1163       0.6192        1.000
##    18      8       1    0.716  0.1397       0.4884        1.000
##    23      7       1    0.614  0.1526       0.3769        0.999
##    31      5       1    0.491  0.1642       0.2549        0.946
##    34      4       1    0.368  0.1627       0.1549        0.875
##    48      2       1    0.184  0.1535       0.0359        0.944
## 
##                 x=Nonmaintained 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     12       2   0.8333  0.1076       0.6470        1.000
##     8     10       2   0.6667  0.1361       0.4468        0.995
##    12      8       1   0.5833  0.1423       0.3616        0.941
##    23      6       1   0.4861  0.1481       0.2675        0.883
##    27      5       1   0.3889  0.1470       0.1854        0.816
##    30      4       1   0.2917  0.1387       0.1148        0.741
##    33      3       1   0.1944  0.1219       0.0569        0.664
##    43      2       1   0.0972  0.0919       0.0153        0.620
##    45      1       1   0.0000     NaN           NA           NA
km_aml %>%  ggsurvfit() +
  add_censor_mark() +
  add_risktable() +
  add_confidence_interval()

We can see from the table that the lower curve is the non-maintained arm - there is only one survivor of this group, and the data finish at \(t=45\).

The function add_risktable() adds information about the numbers at risk and numbers of events. The function add_confidence_interval() adds a confidence interval, and we see that with the aml data the uncertainty is huge.

D.2.1.2 Fitting an exponential distribution

To fit an exponential distribution, we need to estimate \(\hat\lambda_C\) and \(\hat\lambda_T\), using

\[\hat\lambda_X = \frac{m_X}{\sum\limits_{i=1}^{n_X} t_i} = \frac{m_X}{t^+_X}, \] where \(n_X\) is the number of observations \(t_1,\ldots,t_{n_X}\) in group \(X\), of which \(m_X\) are censored.

Exercise D.7 Fit an exponential distribution for each treatment group to the aml data and plot the resulting estimated survival curves, along with the Kaplan Meier estimators from Exercise D.6 (for comparison).

Click for solution

Solution. To calculate \(\hat\lambda_C\) and \(\hat\lambda_T\) we need to find \(m_C,\;\,m_T\;,t^+_C\) and \(t^+_T\).

mC_aml = sum((aml$status==1)&(aml$x=="Nonmaintained"))
mT_aml = sum((aml$status==1)&(aml$x=="Maintained"))
tsum_aml_C = sum(aml$time[aml$x=="Nonmaintained"])
tsum_aml_T = sum(aml$time[aml$x=="Maintained"])
lamhat_aml_C = mC_aml / tsum_aml_C
lamhat_aml_T = mT_aml / tsum_aml_T

We can then plot the Survival curves using geom_function

# Define survival function for exponential density

exp_st = function(t, lambda){exp(-lambda*t)}

km_aml %>% ggsurvfit() + ylim(0,1) + theme_bw() +
  add_censor_mark() +
  geom_function(fun=exp_st, args=list(lambda = lamhat_aml_C), col="darkturquoise") +
  geom_function(fun=exp_st, args=list(lambda = lamhat_aml_T), col="red") 

D.2.2 Comparing survival curves

Having found the MLEs for the aml dataset, assuming an exponential distribution, we can now immediately conduct a likelihood ratio test.

D.2.2.1 Likelihood ratio test

Recall that our test statistic (which we found in Section 8.1) is

\[ \lambda_{LR} = 2\left(m_C\log\left(\frac{m_C}{t^+_C}\right) + m_T\log\left(\frac{m_T}{t^+_T}\right) - m\log\left(\frac{m}{t^+}\right)\right),\]

which we then refer to a \(\chi^2_1\) distribution.

Exercise D.8 Conduct a likelihood ratio test for the aml data, with the null hypothesis that the survival curves are the same for both treatment groups. Before you calculate the answer, think about what you expect to see.

Click for solution

Solution. We already have the \(m_X\) and \(t^+_X\) from Exercise D.7, and we can easily find \(t^+\) and \(m\) from these.

m_aml = mT_aml + mC_aml
tsum_aml = tsum_aml_T + tsum_aml_C

and we can use these to compute \(\lambda_{LR}\):

LRstat_aml =  2*(mC_aml*log(mC_aml/tsum_aml_C) + mT_aml*log(mT_aml/tsum_aml_T) - m_aml*log(m_aml/tsum_aml))
LRstat_aml
## [1] 4.061349

Finally, we refer this to \(\chi^2_1\)

1-pchisq(LRstat_aml, df=1)
## [1] 0.04387544

We find that we have just enough evidence to reject \(H_0\) at the 95% level.

D.2.2.2 Log-rank test

Because our computer practicals are early in the week we haven’t covered this material yet in lectures. So, it might be sensible to return to it at a later date, and instead to continue to the extension question if you still have time in the session.

The log-rank test is most easily found using the function survdiff. This function has an argument rho that controls the type of test. If we set rho=0 then it performs a log-rank test.

The general form is similar to survfit:

survdiff(<surv_obj> ~ <treatment variable>, data=<dataframe>, rho=0)

Exercise D.9 Use survdiff to do a log rank test on the aml data. Do you expect the results to be similar to your results from Exercise D.8?

Click for solution

Solution. To conduct a log rank test we use

survdiff(surv_aml~x, data=aml, rho=0)
## Call:
## survdiff(formula = surv_aml ~ x, data = aml, rho = 0)
## 
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained    11        7    10.69      1.27       3.4
## x=Nonmaintained 12       11     7.31      1.86       3.4
## 
##  Chisq= 3.4  on 1 degrees of freedom, p= 0.07

This is not that close to the result of the likelihood ratio test, probably reflecting the less-than-perfect fit of the exponential survival curve.

D.2.2.3 Cox regression

Finally, we will fit the Cox regression model. This is done using the function coxph in the package survival. Look at the help file and the examples in Section 8.3.2 to understand how to use this function.

Exercise D.10 Fit a Cox proportional hazards model to the data. Do you think this is an appropriate model to use? How influential is the treatment?

Click for solution

Solution. In the aml dataset there are no baseline covariates, so our only dependent variable is the treatment group variable x.

cox_aml = coxph(formula = Surv(time, status)~x, data=aml)
cox_aml
## Call:
## coxph(formula = Surv(time, status) ~ x, data = aml)
## 
##                  coef exp(coef) se(coef)     z      p
## xNonmaintained 0.9155    2.4981   0.5119 1.788 0.0737
## 
## Likelihood ratio test=3.38  on 1 df, p=0.06581
## n= 23, number of events= 18

We’ve already seen that the survival curves don’t cross, so we can be reasonably comfortable fitting this model. To see how the model performs, we can use the summary function:

summary(cox_aml)
## Call:
## coxph(formula = Surv(time, status) ~ x, data = aml)
## 
##   n= 23, number of events= 18 
## 
##                  coef exp(coef) se(coef)     z Pr(>|z|)  
## xNonmaintained 0.9155    2.4981   0.5119 1.788   0.0737 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## xNonmaintained     2.498     0.4003    0.9159     6.813
## 
## Concordance= 0.619  (se = 0.063 )
## Likelihood ratio test= 3.38  on 1 df,   p=0.07
## Wald test            = 3.2  on 1 df,   p=0.07
## Score (logrank) test = 3.42  on 1 df,   p=0.06

To check this more carefully we can use the log-log plot

and we see that the curves are close to parallel.

Although the coefficient of x isn’t quite significant at the 0.05 level, it appears there is reasonable evidence that the treatment has an effect (though probably not enough to make a clinical decision).

Exercise D.11 The dataset colon, also from the survival package, contains data from a trial of colon cancer patients, comparing three treatments: observation (obs), levamisole (Lev) and levamisole + 5-FU (Lev+5FU). To simplify things, we will restrict the data to those patients on Obs or Lev+5FU. The main report of this trial is Moertel et al. (1990).

colondf = colon[colon$rx!="Lev",]
colondf$rx = as.factor(as.character(colondf$rx)) # Removes Lev factor level

For this data

  1. Look at the help file and make sure you understand what the columns mean
  2. Fit Kaplan-Meier estimators to the survival curves for the two groups.
  3. Perform some tests to see whether the treatment has a significant effect on the outcome. What do you find?
Click for solution

Solution. From the help file we see that rx is the treatment group, stop is the time variable, status gives the censoring status.

str(colondf)
## 'data.frame':    1238 obs. of  16 variables:
##  $ id      : num  1 1 2 2 3 3 4 4 5 5 ...
##  $ study   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ rx      : Factor w/ 2 levels "Lev+5FU","Obs": 1 1 1 1 2 2 1 1 2 2 ...
##  $ sex     : num  1 1 1 1 0 0 0 0 1 1 ...
##  $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
##  $ obstruct: num  0 0 0 0 0 0 1 1 0 0 ...
##  $ perfor  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ adhere  : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ nodes   : num  5 5 1 1 7 7 6 6 22 22 ...
##  $ status  : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ differ  : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ extent  : num  3 3 3 3 2 2 3 3 3 3 ...
##  $ surg    : num  0 0 0 0 0 0 1 1 1 1 ...
##  $ node4   : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ time    : num  1521 968 3087 3087 963 ...
##  $ etype   : num  2 1 2 1 2 1 2 1 2 1 ...

We can therefore fit (and plot) the Kaplan-Meier estimator split by treatment group.

km_colon = survfit(Surv(time, status) ~ rx, data=colondf)
km_colon %>% ggsurvfit + ylim(0,1) +
  add_censor_mark() +
  add_confidence_interval()

Next we can find the MLE for each treatment group

mC_colon = sum((colondf$status==1)&(colondf$rx=="Obs"))
mT_colon = sum((colondf$status==1)&(colondf$rx=="Lev+5FU"))
tsum_colon_C = sum(colondf$time[colondf$rx=="Obs"])
tsum_colon_T = sum(colondf$time[colondf$rx=="Lev+5FU"])
lamhat_colon_C = mC_colon / tsum_colon_C
lamhat_colon_T = mT_colon / tsum_colon_T

We can then plot the Survival curves using geom_function

# Define survival function for exponential density

km_colon %>% ggsurvfit() +
    add_censor_mark() +
  add_risktable() +
  add_confidence_interval() + 
  ylim(0,1) + theme_bw() +
  geom_function(fun=exp_st, args=list(lambda = lamhat_colon_C), col="darkturquoise") +
  geom_function(fun=exp_st, args=list(lambda = lamhat_colon_T), col="red") 
Survival curves for groups C and T in colon study, fitted assuming an exponential distribution. Kaplan-Meier estimates also shown.

Figure D.2: Survival curves for groups C and T in colon study, fitted assuming an exponential distribution. Kaplan-Meier estimates also shown.

We see that this fit is quite poor.

m_colon = mT_colon + mC_colon
tsum_colon = tsum_colon_T + tsum_colon_C

and we can use these to compute \(\lambda_{LR}\):

LRstat_colon =  2*(mC_colon*log(mC_colon/tsum_colon_C) + mT_colon*log(mT_colon/tsum_colon_T) - m_colon*log(m_colon/tsum_colon))
LRstat_colon
## [1] 35.0111

Finally, we refer this to \(\chi^2_1\)

1-pchisq(LRstat_colon, df=1)
## [1] 3.278309e-09

Highly significant, but because of the very poor fit in Figure D.2 not especially trustworthy.

Next, we can perform a log-rank test:

survdiff(Surv(time, status)~rx, data=colondf, rho=0)
## Call:
## survdiff(formula = Surv(time, status) ~ rx, data = colondf, rho = 0)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## rx=Lev+5FU 608      242      306      13.5      28.2
## rx=Obs     630      345      281      14.7      28.2
## 
##  Chisq= 28.2  on 1 degrees of freedom, p= 1e-07

We see that the test statistic, although still very significant, is much lower than for the likelihood ratio test.

Finally, we fit a Cox regression model. In the first instance we can do this with just the treatment group as a covariate:

coxph(formula = Surv(time, status)~rx, data=colondf)
## Call:
## coxph(formula = Surv(time, status) ~ rx, data = colondf)
## 
##          coef exp(coef) se(coef)     z        p
## rxObs 0.44213   1.55602  0.08394 5.267 1.38e-07
## 
## Likelihood ratio test=28.25  on 1 df, p=1.068e-07
## n= 1238, number of events= 587

But we can also include other baseline covariates

coxph(formula = Surv(time, status)~rx + sex + age + obstruct + nodes, data=colondf)
## Call:
## coxph(formula = Surv(time, status) ~ rx + sex + age + obstruct + 
##     nodes, data = colondf)
## 
##               coef exp(coef)  se(coef)      z        p
## rxObs     0.470039  1.600056  0.085522  5.496 3.88e-08
## sex      -0.155126  0.856308  0.083835 -1.850   0.0643
## age      -0.002940  0.997064  0.003400 -0.865   0.3872
## obstruct  0.072027  1.074685  0.105813  0.681   0.4961
## nodes     0.104607  1.110274  0.008493 12.317  < 2e-16
## 
## Likelihood ratio test=146  on 5 df, p=< 2.2e-16
## n= 1214, number of events= 574 
##    (24 observations deleted due to missingness)

and we see that as well as the treatment arm, the number of lymph nodes with detectable cancer (given by nodes) is highly significant, and sex is also close to significant.

We can visualise this by further subsetting the Kaplan-Meier estimator

km_colon_bl = survfit(Surv(time, status) ~ rx + sex, data=colondf)
km_colon_bl %>% ggsurvfit() +
  add_censor_mark() 

nodes is a numeric output, so one way to get a visual impression of its effect on the survival curve is to bin it. For example, we can choose \(\texttt{nodes}\leq 4\) and \(\texttt{nodes}>4\).

colondf$nodes4 = sapply(1:nrow(colondf), function(i){ifelse(colondf$nodes[i]>4, 1, 0)})
km_colon_bl = survfit(Surv(time, status) ~ rx + nodes4, data=colondf)
km_colon_bl %>% ggsurvfit()

The separation in the lines fits with the strong effect of nodes in the model.

To check our model more carefully we can use the log-log plot for the categorical variables:

km_colon = survfit(Surv(time, status) ~ rx + sex, data=colondf)
ggsurvplot(km_colon, fun = "cloglog")

and the Schoenfeld residuals for the continuous residuals (after removing the insignificant variables):

cox_colon = coxph(formula = Surv(time, status)~rx + sex + nodes, data=colondf)

ggcoxzph(cox.zph(cox_colon), var = "nodes")

and we see that there is insufficient evidence to reject the null hypothesis (of proportional hazards).

D.3 Extension problem: sample size by simulation (part II)

We now combine the work we’ve done, to implement power simulation for a survival outcome trial.

To simulate survival / time-to-event data as though from a clinical trial, we need to consider two things:

  1. The distribution of the observation/event times (if everyone were observed perfectly)
  2. The censoring mechanism

For the first, we are likely to use a probability distribution like the exponential (which we saw in Section 7.2.2) or the Weibull (which we looked at briefly in Section 7.2.3).

For the second, there are two aspects:

  1. Anyone who is still alive (or hasn’t experienced the event) by the end of the trial will be censored at that point
  2. There will be some censoring during the lifetime of the trial, and this should also be randomly generated.

In this practical we’ll use an Exponential distribution. If you want to read more about this approach, you can read Jiang et al. (2012), who use a Weibull distribution.

D.3.1 Minimum detectable effect size

One of the first practical issues we need to tackle is the concept of the difference between the treatment and control group outcomes, and the idea of the minimum difference we want to be able to detect. Ideally this should be something clinically meaningful. We also need to work out how to factor this into the simulation.

Adapting the approach followed by Jiang et al. (2012), our steps will be:

  1. Specify the median survival time for the control and treatment groups, \(M_C\) and \(M_T\). We’re assuming that ‘the event’ is bad, and so we’ll assume \(M_T>M_C\). This is how the MDES concept enters the simulation, so \(M_T\) should be the smallest value of \(M_T\) we want to be able to detect (ie. creating the smallest clinically worthwhile improvement from \(M_C\), which we assume to be well understood).
  2. Use \(M_C,\,M_T\) to find the parameters \(\lambda_C,\,\lambda_T\), which then give us our observation time distribution function for each group.

D.3.2 Censoring

Censoring will happen in two ways:

  1. An endpoint \(T_{max}\) is specified as part of the trial design, and any observation after \(T_{max}\) is censored at \(T_{max}\)
  2. Every simulation is censored with a specified drop-out probability \(\pi\), using a Bernoulli distribution. The value of the censored observation is kept at the value simulated from the exponential distribution.

There are other ways to model drop-out, but this relatively simple method works nicely.

Exercise D.12 Write code to simulate a survival trial dataset as described above. Using this function, simulate data for a trial where \(M_T=7,\,M_C=5,\,\pi=0.1, t_{max}=10\), with approximately 100 participants in each group.

  • Plot the Kaplan-Meier estimated survival curve of your data
  • What are some of the modelling assumptions/simplifications that have been made?

Some technical things to note:

  • For the Surv function to interpret the time and status as right censored, it seems that the status variable has to be numeric (not factor, which would seem more logical)
Click for hint 1 (about finding \(\lambda_C,\,\lambda_T\))

The CDF of the exponential distribution is

\[ P\left(X \leq x\right) = 1 - e^{-\lambda x}.\] You can find the median value of \(x\) by setting the CDF to 0.5, which gives a median value of

\[\frac{\ln{2}}{\lambda}.\]

Click for hint 2 (about setting up the simulated data)

Your simulated participant data frame should have three columns (they don’t need to have these exact names):

  • Time
  • Status (censored = 0, complete = 1)
  • Group (Control or Treatment)
Click for solution

Solution. As we did in Section C we’ll write functions so that we can do all of this lots of times. Firstly, we rearrange the median equation in the hint above to find the \(\lambda_T,\,\lambda_C\):

\[\lambda_X = \frac{\ln2}{M_X},\] where \(X\) is \(C\) or \(T\).

Therefore we have \(\lambda_C = 0.139,\,\lambda_T = 0.099\).

## THIS IS DEFNITELY NOT GIVING THE RIGHT DATA!
surv_data = function(
    npart, # number of participants per group (roughly)
    MC,    # median survival time - group C
    MT,    # median survival time.- group T
    p_cens, # probability of censoring for each observation
    t_max   # end point of trial
){
  rateC = log(2)/(MC)
  rateT = log(2)/(MT)

  dat_mat = matrix(NA, nrow=2*npart, ncol=3)
  dat_df = as.data.frame(dat_mat)
  names(dat_df) = c("time", "status", "group")
  
  for (i in 1:(2*npart)){
    # Using SRS here, but could use something else
    group_i = sample(c("C", "T"), size=1)
    # Generate time accordingly
    if (group_i == "C"){
      time_i = round(rexp(1, rate = rateC))
    } else if (group_i == "T"){
      time_i = round(rexp(1, rate = rateT))
    }
    # Determine whether observation i is censored
    status_i = sample(c(0,1), size=1, prob = c(p_cens, 1-p_cens))
    if(time_i>t_max){
      time_i = t_max
      status_i = 0
    }
    # Having a zero time observation messes things up so we fudge it slightly
    if(time_i == 0){
      time_i = 1
      status_i = 1
    }
    
    dat_df[i,] = c(time_i, status_i, group_i)
  }
  # Round to 0 dp for a hint of reality
  dat_df$time = as.numeric(dat_df$time)
  dat_df$status = as.numeric(dat_df$status)
  dat_df$group = as.factor(dat_df$group)
  
  return(dat_df)
}

We can use the function to simulate the data described above

surv_df1 = surv_data(npart=100, MC=5, MT=7, p_cens=0.1, t_max=10)
with(data = surv_df1, Surv(time, status, type = "right"))
##   [1]  3   1  10+  3   1  10  10+ 10+ 10+  6   9   6+  3   2   9   5   1  10+  2   1   1  10+  6   1 
##  [25]  9   5   4   8   8  10+  6   6   3  10+  1  10+  4   2   2   1   6+  5+ 10+  4   9   9  10+  3 
##  [49]  1  10+  4  10  10+  9   6   6   2  10   4   7  10   2   1   1   7   1   2  10+  2   1  10+  1 
##  [73] 10+  4  10   2   5+ 10+  9   3   1   6   2  10+ 10+ 10+  7  10+  4  10+  3  10   5   3  10+  3 
##  [97] 10+  9   1+  1  10+  3   1  10+  1  10+  2   1   1+ 10+ 10+  1   2  10+ 10+ 10+  4  10  10+ 10+
## [121]  1   8   3  10   4   1  10+ 10+  4   7   1   1   7  10+  4   2  10+  5   3  10+  5   9  10  10+
## [145]  2   4   6+  5  10+ 10+ 10+  3   6   5  10+ 10+ 10+ 10+  9  10+ 10+  1+ 10+  6   2  10+  4  10+
## [169] 10+ 10+ 10+  7   7   7   9  10+  2   5  10+  4   1  10+  5  10+  2   2   8   6   1   9   6  10 
## [193] 10+  1+ 10   4   4  10+  7+ 10+

and plot the estimated survival curve

km_df1 = survfit(Surv(time, status) ~ group, data=surv_df1)
autoplot(km_df1, conf.int = F)

In this sampling scheme it seems it’s common to get some very low values, which is probably quite unrealistic.

Some other assumptions / simplifications we’re making are:

  • Censorings during the trial are completely at random (simulated via a Bernoulli distribution). In reality this is very unlikely; it’s far more likely that people become lost to follow-up for a reason, and that within the dataset the instances of this are not independent.
  • Censored data keep the same time as was generated by the exponential distribution. This will slightly skew our results.

Having set up a function to simulate data for one trial, we can run it many times to estimate the power of this trial set up. The extra pieces of information we need for this are:

  • The significance level \(\alpha\) - we will assume \(\alpha = 0.05\)
  • The test to be used - we will use the log rank test (as in Section D.2.2.2).

Exercise D.13 Using the function you wrote in Exercise D.12 and the log-rank test, write a function to estimate the power of a trial set-up.

Assuming the same parameters as in Exercise D.12 and a significance level of \(\alpha=0.05\):

  • Estimate the number of participants you need per trial group to achieve a power of 80%.
  • Suppose the clinicians decided to extend the trial so that \(t_{max}=20\). How many participants do you now need (approxiamtely) to achieve a power of 80%?
Click for solution

Solution.

surv_sim = function(
    nsim,  # number of simulations to run
    alpha, # the significance level of the test
    npart, # number of participants per group (roughly)
    MC,    # median survival time - group C
    MT,    # median survival time.- group T
    p_cens, # probability of censoring for each observation
    t_max   # end point of trial
){
  H0_reject_vec = rep(NA, nsim)
  
  for (i in 1:nsim){
    df_i = surv_data(npart=npart, MC=MC, MT=MT, p_cens=p_cens, t_max=t_max)
    log_rank_i = survdiff(Surv(time, status) ~ group, data=df_i, rho=0)
    p_val = log_rank_i$pvalue
    H0_reject_vec[i]= ifelse(p_val < alpha, 1, 0)
  }
  power.est = mean(H0_reject_vec)
  power.est
}

We can now use this to find the sample size we need by experimenting with some values.

surv_sim(nsim=100, alpha=0.05, npart=250,MC=5, MT=7, p_cens=0.1, t_max=10)
## [1] 0.82
survsim_vec1 = rep(NA, 20)
for (i in 1:20){
  survsim_vec1[i] = surv_sim(nsim=100, alpha=0.05, npart=250,MC=5, MT=7, p_cens=0.1, t_max=10)
}
ggplot(mapping = aes(survsim_vec1)) + geom_histogram(bins=10)

If we extend the trial duration to \(t_{max}=20\), the power given by a particular number of participants is now larger, because there are fewer censored observations and therefore more information. Some more experimentation with sample size would enable us to find a suitable number for this new scenario.

There are of course lots of ways we could refine and extend this, many of them similar to what we discussed in Section C, for example:

  • Running more simulations (100 is really not enough, but was chosen so that the file would compile - feel free to increase it)
  • Incorporating a better sampling scheme than SRS
  • Improving the drop-out / censoring modelling
  • Allowing the shape parameter to vary (though being careful of the implications on analysis)

References

Jiang, Zhiwei, Ling Wang, Chanjuan Li, Jielai Xia, and Hongxia Jia. 2012. “A Practical Simulation Method to Calculate Sample Size of Group Sequential Trials for Time-to-Event Data Under Exponential and Weibull Distribution.”
Moertel, Charles G, Thomas R Fleming, John S Macdonald, Daniel G Haller, John A Laurie, Phyllis J Goodman, James S Ungerleider, et al. 1990. “Levamisole and Fluorouracil for Adjuvant Therapy of Resected Colon Carcinoma.” New England Journal of Medicine 322 (6): 352–58.