A Computer Practical 1 - Allocation

WARNING

This practical is designed to help you investigate the behaviour of the various allocation methods we’ve been looking at. There are lots of things we do here that you couldn’t do in a real trial. For example, you wouldn’t be able to plot the baseline covariates, since you wouldn’t have them all at the start of the allocation process. You also wouldn’t be able to re-do the allocation lots of times, because in most cases the participants join the trial (and are allocated) sequentially.

You will need a lot of the techniques covered in this practical for your summative assignments, so consider it a sort of informal formative assignment to finish it if you don’t in class. Or, at the very least, you might need to return to it while working on your summative assignments.

There will be a mixture of full solutions, examples of possible solutions and example code to adapt. If you’re not sure how to do something, please ask!

In the example code, I have used the same names for most objects. In order to store your results and avoid confusion, it will be sensible to name things intelligently! For example, suffix each allocation data frame so that you know which allocation method you used. Create an R file with the code you use, so that you can easily replicate your work.

A.1 R practicalities

There are many, many packages in R that implement methods for designing and analysing clinical trials (see a list at CRAN task view). We will look at some of these, and will also write our own code for some tasks. Remember that to install a package, you can do

install.packages("<packagename>")

If you have problems running R on your laptop, or on the university machines, the most foolproof way might be to use Github codespaces (thanks to Louis Aslett, who developed this for Data Science and Statistical Computing II). You may be familiar with this approach if you did Bayesian Computational Modelling III.

An advantage of this is that you can open the same codespace (the same instance of R) from any computer, so if you plan to work on things (for example your summative assignment, which will involve some R) from more than one computer, this might be ideal.

This requires you to have a github account (you can sign up for free here) and there is a short guide to creating a github account here.

Direct link to codespace

Instructions for how to use codespace

A.2 Allocation

We’ll make use of several packages in this section. Installing them all now should hopefully prevent it from disrupting your flow! We’ll load them as we go along.

Notice that where there are code snippets like this one that you want to copy directly, you can hover your cursor in the top right of the code box and a ‘copy’ icon will appear.

install.packages(c("medicaldata", "ggplot2", "gtsummary", "Minirand", 
                   "blockrand", "dplyr", "randomizeR"))

A.2.1 Licorice gargle dataset

To work with allocation, we will use the licorice_gargle dataset from the package medicaldata, which you can find by

library(medicaldata)
data("licorice_gargle")

You can find out about the dataset by looking at the help file

?licorice_gargle

and at this website, and if you’re feeling really keen you can even read the original paper: Ruetzler et al. (2013) (but please don’t do any of that now!).

Exercise A.1 Of the 19 columns, how many are baseline characteristics that we could use in the allocation?

Click for solution

Solution. In this dataset the baseline characteristics are helpfully prefixed by preOp, so there are seven baseline characteristics.

## [1] "preOp_gender"     "preOp_asa"        "preOp_calcBMI"    "preOp_age"        "preOp_mallampati"
## [6] "preOp_smoking"    "preOp_pain"

In order to be able to work with the dataset, we need to convert several of the columns to factor variables (you can see by looking at the structure str(licorice_gargle) that all columns are numeric to begin with), so run the following code to convert the necessary columns to factors, and to restrict the dataset to only the columns we’re interested in.

lic_garg = licorice_gargle[ ,1:8]
# vector of names of columns to be coerced to factor
cols <- c("preOp_gender", "preOp_asa",  
"preOp_mallampati", "preOp_smoking", "preOp_pain", "treat")
# convert each of those columns to factors
lic_garg[cols] <- lapply(lic_garg[cols], factor) 

# Check the result:
str(lic_garg)

## 'data.frame':    235 obs. of  8 variables:
##  $ preOp_gender    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ preOp_asa       : Factor w/ 3 levels "1","2","3": 3 2 2 2 1 2 3 2 3 3 ...
##  $ preOp_calcBMI   : num  33 23.7 26.8 28.4 30.4 ...
##  $ preOp_age       : num  67 76 58 59 73 61 66 61 83 69 ...
##  $ preOp_mallampati: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 1 3 1 2 1 2 ...
##  $ preOp_smoking   : Factor w/ 3 levels "1","2","3": 1 2 1 1 2 1 1 1 1 3 ...
##  $ preOp_pain      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ treat           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

A.2.1.1 Summary tables

For every allocation, we’ll want to see a summary table of how the patients are distributed between the groups. In this practical we’ll use the function tbl_summary from gtsummary, but there are plenty of other ways to create nice tables around clinical trials (for example the packages table1 and atable).

Notice that the licorice gargle data contains the allocation used in the trial, so we can create a table summarising the participants in each group.

tbl_summary(lic_garg, by = "treat")

Characteristic	0 N = 117¹	1 N = 118¹
preOp_gender
0	73 (62%)	69 (58%)
1	44 (38%)	49 (42%)
preOp_asa
1	19 (16%)	22 (19%)
2	67 (57%)	67 (57%)
3	31 (26%)	29 (25%)
preOp_calcBMI	26.1 (22.4, 28.1)	25.7 (22.7, 28.6)
preOp_age	63 (45, 68)	61 (48, 68)
preOp_mallampati
1	31 (26%)	39 (33%)
2	69 (59%)	66 (56%)
3	16 (14%)	13 (11%)
4	1 (0.9%)	0 (0%)
preOp_smoking
1	45 (38%)	45 (38%)
2	36 (31%)	36 (31%)
3	36 (31%)	37 (31%)
preOp_pain
0	115 (98%)	118 (100%)
1	2 (1.7%)	0 (0%)
¹ n (%); Median (Q1, Q3)

Notice that we can also save this as an object and access the individual sub-tables, for example

rb_tab = randbalance(
  trt = lic_garg$treat, 
  covmat = lic_garg[,-8], 
  ntrt=2, 
  trtseq = c("0", "1"))
rb_tab$preOp_gender

##    
## trt  0  1
##   0 73 44
##   1 69 49

There are also packages with functions to output these demographic tables formatted for use in latex documents, for example atable.

A.2.1.2 Binning continuous variables

Before we can use any of our allocation methods, we’re going to need to bin the two numeric variables preOp_age and preOp_calcBMI. In a real trial, we’d have to decide on the bins before seeing the data. The sort of information we’d be likely to get pre-trial (the point at which we need to decide on the bins) would be

The study was open to all adults (aged 18+), but preOp_age was likely to centre around aged 60.
preOp_BMI was expected to have mean 26 and standard deviation around 4.5.

Exercise A.2 Using the informationn above, bin the two variables preOp_age and preOp_calcBMI to convert them into factor variables BMI and age.

Investigate how the bins you have chosen split the actual data - are you pleased with your choices?

In the solutions we’ll create a new data frame lg_df with the two new factor variables instead of the original two numeric variables. It might be helpful to you if you make sure your objects and columns are named the same (or else you may need to alter code further into the practical)

Click for solution

Solution. These solutions show you one way to create such factor variables, but if you choose different bins that’s fine!

Based on the information given, one reasonable split for preOp_age would be \(<50,\;50-70\) and \(>70\), in which case we could do

lic_garg$age[lic_garg$preOp_age < 50] <- "Under 50"
lic_garg$age[lic_garg$preOp_age >= 50 & lic_garg$preOp_age < 70] <- "50 to 70"
lic_garg$age[lic_garg$preOp_age >= 70] <- "70 plus"
lic_garg$age = factor(lic_garg$age, levels = c("Under 50", "50 to 70", "70 plus"))

We can see how the data fall into these categories:

## 
## Under 50 50 to 70  70 plus 
##       67      122       46

These aren’t equal but we have a decent number in each group.

With the BMI measurement, we know that there are predefined boundaries between categories, so it might be sensible to use those. The boundary between ‘medium’ and ‘high’ is usually given as 25, so since that is near our given mean we could use that as the boundary.

lic_garg$BMI[lic_garg$preOp_calcBMI < 25] <- "medium_or_low"
lic_garg$BMI[lic_garg$preOp_calcBMI >= 25] <- "high"
lic_garg$BMI = factor(lic_garg$BMI, levels = c("medium_or_low", "high"))

Again, we can see how the participants fall into these bins

## 
## medium_or_low          high 
##            99           136

and the split is fairly even.

Finally, we can select the columns we want from lic_garg to create lg_df

lg_df = lic_garg[,c(1,2,5,6,7,9,10,8)]

If you’ve chosen different bins, that’s fine! But it will help you if the column names and data frame names are the same as on this page.

A.2.2 Measures of imbalance

It will be useful to have a simple numerical summary of how imbalanced an allocation is. First, we will use the imbalance measure defined in lectures,

\[D(n) = \lvert N_T\left(n\right) - N_C\left(n\right) \rvert \]

This is pretty simple to code in R:

imbalance = function(
    df,   # participant data frame with allocation column included
    alloc # name of allocation column
    ){
  alloc_vec = as.factor(df[ ,names(df)==alloc])
  alloc_lev = levels(alloc_vec) # how the treatment groups are coded
  n1 = nrow(df[df[alloc]==alloc_lev[1],])
  n2 = nrow(df[df[alloc]==alloc_lev[2],])
  abs(n1-n2)
}

Exercise A.3 Use the imbalance function above to find the imbalance in the allocation recorded in the lic_garg dataset. Do you find this helpful as a summary of the allocation?

Click for solution

Solution.

imbalance(df = lic_garg, alloc = "treat")

## [1] 1

All this tells us is the difference in overall numbers. This is a tiny bit useful, but it would be nice to have some information about how the allocation balances participants with respect to their characteristics - this would give a better understanding of how similar the two groups are to one another.

To see how well balanced our allocations are in terms of each covariate, we will define a function summing the marginal imbalance

marg_imbalance = function(
    df,  # participant data frame, including allocation and all factor variables
    alloc, # name of allocation column
    factors # names of prognostic factors to be included
    ){
  df = as.data.frame(df) # deals with tibbles
  n_fact = length(factors) # the numbers of factors
  imb_sum=0                # a running total of imbalance
  for (i in 1:n_fact){     # loop through the factors 
    ind_i = (1:ncol(df))[names(df)==factors[i]]
    col_i = as.factor(df[ ,ind_i])
    levels_i = levels(col_i)
    nlevels_i = length(levels_i)
    for (j in 1:nlevels_i){ # loop through the levels of factor i
      # df_ij contains just those entries with level j of factor i
      df_ij = df[df[ ,ind_i]==levels_i[j] , ] 
      imb_ij = imbalance(df=df_ij, alloc=alloc) # find the imbalance for the sub-data-frame in which factor i has level j
      imb_sum = imb_sum + imb_ij
    }
  }
  imb_sum
}

For example, to find the marginal imbalance over the gender and age factors, use

marg_imbalance(df=lg_df, alloc="treat", factors = c("preOp_gender", "age"))

## [1] 20

Note that the larger the total number of factor levels, the larger the marginal imbalance will be, so if you’re comparing between methods, make sure you’re including all the same factors!

A.2.3 Allocation methods

To mimic the way that participants are recruited sequentially in a trial (which is generally the case), the code for each type of allocation will work its way through the participant data frame, even though this might not be the most efficient way to produce the end result. For the more complex methods we’ll use packages. Feel free to write your own code for the simpler methods if you want to!

We’ll start by performing each method once with the whole dataset, and then go on to include baseline variables, and finally we’ll perform a simulation study.

A.2.3.1 Simple random allocation

In simple random allocation, each participant is allocated to one of the two trial arms with equal probability.

srs = function(
    df, # DF should be the participant data frame. 
        # A column 'treat' will be added
    levels = c("0", "1") # Levels of treat factor
){
  n = nrow(df) # number of rows / participants
# Create a new column 'treat'
  df$treat = rep(NA, n)
# work through the rows, randomly allocating patients with probably 1/2
  for (i in 1:n){
    df$treat[i] = sample(levels, size=1, prob = c(0.5, 0.5))
  }
  df$treat = as.factor(df$treat)
  df
}

Exercise A.4 Use the function srs above to allocate the patients in the licorice_gargle dataset to groups T or C.

Generate the balance table and imbalance and comment on them.

Click for solution

Solution. To allocate the patients, use

lg_srs = srs(
  df = lg_df[,-8],
  levels = c("T", "C")
)

And to display the balance table

tbl_summary(lg_srs, by = "treat")

Characteristic	C N = 111¹	T N = 124¹
preOp_gender
0	65 (59%)	77 (62%)
1	46 (41%)	47 (38%)
preOp_asa
1	19 (17%)	22 (18%)
2	68 (61%)	66 (53%)
3	24 (22%)	36 (29%)
preOp_mallampati
1	32 (29%)	38 (31%)
2	67 (60%)	68 (55%)
3	11 (9.9%)	18 (15%)
4	1 (0.9%)	0 (0%)
preOp_smoking
1	45 (41%)	45 (36%)
2	32 (29%)	40 (32%)
3	34 (31%)	39 (31%)
preOp_pain
0	110 (99%)	123 (99%)
1	1 (0.9%)	1 (0.8%)
age
Under 50	23 (21%)	44 (35%)
50 to 70	69 (62%)	53 (43%)
70 plus	19 (17%)	27 (22%)
BMI
medium_or_low	46 (41%)	53 (43%)
high	65 (59%)	71 (57%)
¹ n (%)

To find the total imbalance, the command is

imbalance(lg_srs, alloc = "treat")

## [1] 13

and we can find the marginal imbalance using

lg_factors_all = c("preOp_gender", "preOp_asa", "preOp_mallampati", "preOp_smoking", 
"preOp_pain", "age", "BMI")
marg_imbalance(lg_srs, alloc = "treat", factors = lg_factors_all)

## [1] 129

Notice that we’ve collected the names of the factors into the vector lg_factors_all. This saves time, and makes it easier to compare the same factors each time.

Since srs uses the sample function, and we didn’t set a seed, you may well have got a different allocation (and therefore imbalance).

A.2.3.2 Randomly permuted blocks

We will do this with the package blockrand.

library(blockrand)

The function for generating RPB designs is also called blockrand. The default for the function blockrand is that it will randomly vary block length within \(\left\lbrace 2,\,4,\,6,\,8 \right\rbrace\).

Exercise A.5 Try playing around with the function blockrand, for example starting with

blockrand(n=100)

Can you generate an allocation for the licorice_gargle data (remember our data frame is now lg_df) and examine the balance?

Click for solution

Solution. There are 235 rows/participants in lg_df, and to keep consistency we would like the levels of our treatment to be "T" and "C". Therefore we can do

rpb_lg = blockrand(n=235, levels = c("T", "C"))

Notice that this doesn’t have 235 rows: the blockrand function will always finish after a whole block.

Let’s add this to our participant data to create a new data frame lg_rpb:

# create the new data frame, a copy of lg_df
lg_rpb = lg_df  

# Replace the original treat column with the RPB treatment column
# Using only the first 235 allocations
lg_rpb$treat = rpb_lg$treatment[1:235]

Then we can generate the demographic table as before:

tbl_summary(lg_rpb, by = "treat")

and calculate the imbalance

## [1] 1

and marginal imbalance

## [1] 113

The package blockrand contains a function plotblockrand, which outputs PDFs of randomization cards, ready to be printed and put into envelopes!

A.2.3.3 Biased coin designs

We can write code for a biased coin design by adapting the srs function above, setting \(p=\frac{2}{3}\) by default as per Efron (1971).

biased_coin = function(
    data,
    levels = c("T", "C"),
    p=2/3
){
  Dn = 0 # starting value of imbalance
  n = nrow(data)
  alloc = rep(NA, n)
  
  for (i in 1:n){
    if (Dn==0){ # equally balanced
      alloc[i] = sample(levels, size=1, prob=c(0.5, 0.5) )
    } else if(Dn<0){ # More allocations to levels[2] up to this point
      alloc[i] = sample(levels, size=1, prob=c(p, 1-p) )
    } else if(Dn>0){ # More allocations to levels[1] up to this point
      alloc[i] = sample(levels, size=1, prob=c(1-p, p) )
    }
    # Compute imbalance at this stage
    alloc_to_n = alloc[1:i]
    Dn = sum(alloc_to_n==levels[1]) - sum(alloc_to_n == levels[2])
  }
  data$treat = as.factor(alloc)
  data
}

Exercise A.6 Use the function biased_coin above to allocate patients to the two groups. Produce the demographic table and calculate the imbalance. Try this for some different values of p.

Click for solution

Solution. To create a biased coin design with \(p=0.9\) (for example), we enter

lg_bc1 = biased_coin(
  data=lg_df[,-8],
  p=0.9
)

and to produce a balance table:

tbl_summary(lg_bc1, by = "treat")

Finally we can find the imbalance:

imbalance(lg_bc1, alloc="treat")

## [1] 1

and marginal imbalance

marg_imbalance(lg_bc1, alloc="treat", factors = lg_factors_all)

## [1] 93

A.2.3.4 Urn designs

For the urn design, we will use the function udPar from the package randomizeR. We wrote this in lectures as \(UD\left(r,s\right)\), where

\(r\) is the number of balls for each treatment group in the urn to begin with
\(s\) is the number of balls added per treatment after each allocation

Exercise A.7 Look at the help file for udPar. Which arguments correspond to \(r\) and \(s\) (as we called them in Section 3.2.3?

The function udPar creates a model object, storing the parameters for the particular Urn design. To generate sequences of allocations, we use the function genSeq

Exercise A.8 Create an urn design object using udPar with \(r=3,\;s=1\), set to create enough allocations for the licorice gargle data.

Use the function genSeq to generate a sequence of allocations.

Click for solution

Solution.

library(randomizeR)
ud_31 = udPar(235, 3, 1, c("0", "1"))
seq_ud31 = genSeq(ud_31)

The allocation vector is stored as a row in the matrix M which is a slot in the genSeq object (part of the object). You can access it by

seq_ud31$M[1,]

##   [1] 0 0 0 1 1 1 0 0 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 1 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 1
##  [49] 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 0 1 0 1 0 0 0
##  [97] 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1
## [145] 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 1 1 1 0 1
## [193] 1 1 1 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 0 1 1 0 0 0 1 1 1

The argument r allows you to generate r sequences, in which case the matrix M has r rows.

A.2.4 Stratifying the dataset

The first way we thought about to account for baseline measurements was to use stratified sampling. For this, we split the dataset into a number of strata, within each of which the participants have the same levels of the stratifying factors. For example, in some hypothetical dataset one stratum might contain only women, aged 50-65, with a history of smoking.

Once we have stratified the dataset, we can apply any of the methods above by simply treating the different strata as separate ‘mini-trials’.

It’s not uncommon for some of the baseline data to be missing, in which case the following algorithms won’t work without some adjustment. Two possible approaches in that case would be

to impute a ‘dummy’ value for missing data, chosen using some available information about likely values
to not involve that covariate in the allocation method (especially if it’s thought not to be likely to be informative).

An obvious problem with applying stratification to this dataset is the number of strata. We have (if you have the same numbers of levels for age and BMI as I do)

\[2\times{3}\times{4}\times{3}\times{2}\times{3}\times{2} = 864 \] strata. Even by collecting levels together and so on, we are not going to get down to a sufficiently small number of strata. Therefore we will choose just a couple of covariates to stratify by. For example, gender and smoking status:

library(dplyr)
# Add an ID variable so that we can keep track of the order of participants
lg_df$ID = 1:nrow(lg_df)
# split the data frame according to levels of factors
strat_gen_sm <- lg_df %>%
  group_split(preOp_gender, preOp_smoking)

will create a list of data frames, one for each combination of preOp_gender and preOp_smoking. In this case there are six data frames, and for example the first (accessed by strat_gen_sm[[1]] contains all participants with preOp_gender=0 and preOp_smoking=1. You can choose different factors to stratify by if you want to!

Exercise A.9 Split your group using the code above, choosing two factors to stratify by. How many participants are in each stratum?

Click for solution

Solution. We’ll stick with the group as above. To find the numbers of participants in each group we do

group_sizes = sapply(
  1:length(strat_gen_sm),
  function(i){
    nrow(strat_gen_sm[[i]])
  }
)
group_sizes

## [1] 63 47 32 27 25 41

or if you prefer dplyr

lg_df |>
  group_by(preOp_gender, preOp_smoking) |>
  summarise(count = n())

## # A tibble: 6 × 3
## # Groups:   preOp_gender [2]
##   preOp_gender preOp_smoking count
##   <fct>        <fct>         <int>
## 1 0            1                63
## 2 0            2                47
## 3 0            3                32
## 4 1            1                27
## 5 1            2                25
## 6 1            3                41

Exercise A.10 Choose a couple of methods from Section A.2.3, and use them with your stratified dataset. Use the marg_imbalance function to find the marginal imbalance. Try this for just the factors you stratified by, and for other collections of factors. What do you expect to see?

Click for solution

Solution. The code for this will vary, but the basic idea is to work through the individual data sets individually, and apply the functions from Section A.2.3. One way to do this is by creating a for loop. The code below shows how this would work for simple random sampling, but you can change the function to use whichever method you prefer.

# This command creates an empty list, which we will fill with allocation data frames as we go through
alloc_list = list()
# The loop works through the stratified data frames, applies SRS to allocate patients
# and stores them in alloc_list
for (i in 1:length(strat_gen_sm)){
  alloc_list[[i]] = srs(strat_gen_sm[[i]])
}
# bind all the data frames back together again
alloc_full= dplyr::bind_rows(alloc_list)
# re-order according to ID variable
alloc_full[order(alloc_full$ID),]

## # A tibble: 235 × 9
##    preOp_gender preOp_asa preOp_mallampati preOp_smoking preOp_pain age      BMI           treat    ID
##    <fct>        <fct>     <fct>            <fct>         <fct>      <fct>    <fct>         <fct> <int>
##  1 0            3         2                1             0          50 to 70 high          0         1
##  2 0            2         2                2             0          70 plus  medium_or_low 0         2
##  3 0            2         2                1             0          50 to 70 high          0         3
##  4 0            2         2                1             0          50 to 70 high          1         4
##  5 0            1         1                2             0          70 plus  high          0         5
##  6 0            2         3                1             0          50 to 70 high          1         6
##  7 0            3         1                1             0          50 to 70 high          0         7
##  8 0            2         2                1             0          50 to 70 high          1         8
##  9 0            3         1                1             0          70 plus  medium_or_low 0         9
## 10 0            3         2                3             0          50 to 70 high          1        10
## # ℹ 225 more rows

It would be silly though to do this with SRS - why?

Once you’ve performed the allocation, you can find the demographic tables, imbalance and marginal imbalance as before (with alloc_full, or whatever your is called, as the data frame)

A.2.5 Minimisation

If we want to try to achieve balance for all prognostic factors, minimisation is a more suitable method. The function Minirand in the package Minirand implements the minimisation algorithm.

Much like we did in lectures (and like we would in a real trial), the function Minirand works from the point of view of having already allocated \(j-1\) particpants, and being presented with a \(j^{th}\).

This example code is copied from the function’s help file, but with some extra comments to help you to follow it. It first creates a participant data frame, then allocates the particpants to treatment groups using Minimisation.

## Information about the treatment
ntrt <- 3 # There will three treatment groups
trtseq <- c(1, 2, 3) # the treatment groups are indexed 1, 2, 3
ratio <- c(2, 2, 1)  # the treatment groups will be allocated in a 2:2:1 ratio

## The next few rows generate the participant data frame
nsample <- 120 # we will have 120 participants
c1 <- sample(seq(1, 0), nsample, replace = TRUE, prob = c(0.4, 0.6)) 
c2 <- sample(seq(1, 0), nsample, replace = TRUE, prob = c(0.3, 0.7))
c3 <- sample(c(2, 1, 0), nsample, replace = TRUE, prob = c(0.33, 0.2, 0.5)) 
c4 <- sample(seq(1, 0), nsample, replace = TRUE, prob = c(0.33, 0.67)) 
covmat <- cbind(c1, c2, c3, c4) # generate the matrix of covariate factors for the subjects
# label of the covariates 
colnames(covmat) = c("Gender", "Age", "Hypertension", "Use of Antibiotics") 
covwt <- c(1/4, 1/4, 1/4, 1/4) # equal weights/importance applied to each factor

## Applying the algorithm - start here if you already have participant data!

res <- rep(NA, nsample) # Generate a vector to store the results (the allocations)
# generate treatment assignment for the 1st subject
res[1] = sample(trtseq, 1, replace = TRUE, prob = ratio/sum(ratio)) 
# work through the remaining patients sequentially
for (j in 2:nsample)
  {
  # get treatment assignment sequentially for all subjects
  # The vector res is updated and so all previous allocations are accounted for
  # covmat is the data frame of participant data
    res[j] <- Minirand(
      covmat=covmat, j, covwt=covwt, ratio=ratio, ntrt=ntrt, trtseq=trtseq, method="Range", result=res, p = 0.9
      )
}
## Store the allocation vector 'res' as 'trt1'
trt1 <- res

# Display the number of randomized subjects at covariate factors
balance1 <- randbalance(trt1, covmat, ntrt, trtseq) 
balance1

## $Gender
##    
## trt  0  1
##   1 31 17
##   2 32 16
##   3 16  8
## 
## $Age
##    
## trt  0  1
##   1 34 14
##   2 34 14
##   3 17  7
## 
## $Hypertension
##    
## trt  0  1  2
##   1 22 11 15
##   2 23 11 14
##   3 12  4  8
## 
## $`Use of Antibiotics`
##    
## trt  0  1
##   1 31 17
##   2 31 17
##   3 15  9

# Calculate the total imbalance of the allocation
totimbal(trt = trt1, covmat = covmat, covwt = covwt, 
ratio = ratio, ntrt = ntrt, trtseq = trtseq, method = "Range")

## [1] 1.375

Exercise A.11 Adapt the code above to apply the minimisation algorithm to the licorice gargle data. Investigate how balanced the data set is.

Click for solution

Solution. Notice in the code above that we only need to start about half way down if we already have a dataset (which we do).

nsample = nrow(lg_df)
res = rep(NA, nsample)
res[1] = sample(c(0,1), 1, replace = TRUE, prob = c(0.5,0.5)) 
# work through the remaining patients sequentially
for (j in 2:nsample){
  # get treatment assignment sequentially for all subjects
  # The vector res is updated and so all previous allocations are accounted for
  # covmat is the data frame of participant data - including only the covariates
    res[j] <- Minirand(
      covmat=lg_df[ ,1:7], j, covwt=rep(1,7)/7, ratio=c(1,1), ntrt=2, trtseq=c(0,1), method="Range", result=res, p = 0.9
      )
}

lg_df$treat = res

## You can now investigate the balance of the design as usual

A.2.6 Extension: A simulation experiment!

In the allocation section we used each method once and produced summaries like the marginal imbalance. Because they are inherently random, to understand the general behaviour of each method, we’ll conduct a simulation experiment.

The general process for this is:

Choose a numerical summary (probably either the imbalance or the marginal imbalance)
Choose some large number \(n_{sim}\) and create a vector of zeroes or NAs that length
Run through the following process \(n_{sim}\) times:
- Perform the allocation
- Find the numerical summary
- Store the numerical summary in the vector you created in step 2.
Plot (or otherwise summarise) the vector of summaries.

Exercise A.12 Perform the simulation experiment for some of the methods above. Make sure you store the summary vectors by different (and intelligible!) names.

Based on your results, comment on how plausible it is that Ruetzler et al. (2013) used each method in their trial.

Hint: think of the simulated summaries as approximations of probability distributions

For the licorice dataset, Ruetzler et al. (2013) say that “Randomization (1:1) to licorice or placebo was assigned by a Web-based system that was accessed just before treatment by an independent researcher who was not involved in data collection; no stratification was used.”

References

Efron, Bradley. 1971. “Forcing a Sequential Experiment to Be Balanced.” Biometrika 58 (3): 403–17.

Ruetzler, Kurt, Michael Fleck, Sabine Nabecker, Kristina Pinter, Gordian Landskron, Andrea Lassnigg, Jing You, and Daniel I Sessler. 2013. “A Randomized, Double-Blind Comparison of Licorice Versus Sugar-Water Gargle for Prevention of Postoperative Sore Throat and Postextubation Coughing.” Anesthesia & Analgesia 117 (3): 614–21.