B Computer Practical 2 - Analysis

Preliminaries

You will need a lot of the techniques covered in this practical for your summative assignments, so if you don’t finish it in class, treat finishing it as a sort of informal formative assignment. At the very least, you are likely to return to it while working on your summative assignments.

There will be a mixture of full solutions, examples of possible solutions and example code to adapt. If you’re not sure how to do something, please ask!

In the example code, I have used the same names for most objects. In order to store your results and avoid confusion, it will be sensible to name things intelligently! For example, suffix each data frame so that you know which method produced it. Create an R file containing the commands you use, so that you can easily replicate your work.

R practicalities

There are many, many packages in R that implement methods for designing and analysing clinical trials (see the list at the CRAN task view). We will look at some of these, and will also write our own code for some tasks. Remember that to install a package, you can do

install.packages("<packagename>")

If you have problems running R on your laptop, or on the university machines, the most foolproof option might be to use GitHub Codespaces (thanks to Louis Aslett, who developed this for Data Science and Statistical Computing II). You may be familiar with this approach if you did Bayesian Computational Modelling III.

An advantage of this is that you can open the same codespace (the same instance of R) from any computer, so if you plan to work on things (for example your summative assignment, which will involve some R) from more than one computer, this might be ideal.

This requires you to have a GitHub account (you can sign up for free here); there is also a short guide to creating a GitHub account here.

Direct link to codespace

Instructions for how to use codespace

B.1 Analysis

For our analysis section, we will work with real datasets, and use them to explore and compare some of the different analysis methods we’ve covered in lectures.

B.1.1 Polyps data

The dataset we’ll use concerns a small trial in which patients with familial adenomatous polyposis (FAP) were treated either with Sulindac (group T) or a placebo (group C). People with FAP tend to develop polyps (small growths) in their colon. Although the polyps themselves are not dangerous or harmful, they can turn into colon cancer. You can read about the study in Giardiello et al. (1993).

Exercise B.1 First of all let’s explore the dataset:

  • Look at the help file for polyps
  • Which columns should you use for baseline and outcome variables?
  • Plot the data in these columns. Do you see any issues? Can you suggest a way to address them? What effect will this have on our estimator of the treatment effect?
Click for solution

Solution. First of all, to access the data, enter

data(polyps, package = "medicaldata")

You can then view the help file by entering ?polyps.

To investigate the columns in polyps we can use the structure function, str(polyps).

We have the baseline number of polyps for each patient in the column baseline, and a sensible column to use for the outcome would be number12m. Other baseline variables are sex and age.

To plot the baseline and outcome variables we can do

library(ggplot2) # if not already loaded
ggplot(data=polyps, aes(x=baseline)) + geom_histogram()

ggplot(data=polyps, aes(x=number12m)) + geom_histogram()

These are both very right-skewed, and exclusively positive (which makes sense, given that they are counts). A sensible thing to do is therefore to take the log of these two variables. We will use logs to base 10, so that our results are easier to interpret. If you’d prefer to use natural logs that’s fine; some of your numbers will just be different.

polyps$log_baseline = log10(polyps$baseline)
polyps$log_number12m = log10(polyps$number12m)

Since we used the logarithms (base 10) of the numbers of polyps, our treatment effect is \[\log_{10}\left(X_T\right) - \log_{10}\left(X_C\right) = \log_{10}\left(\frac{X_T}{X_C}\right).\]

We will work through the tests as we did in lectures, to see how the results differ.

Exercise B.2 Using just the outcome variable (which should be log_number12m or something similar; see the solution above if you aren’t sure why), test the hypothesis that Sulindac has had some effect.

Click for solution

Solution. We can either do this the long way or the short way! For sanity’s sake, we’ll do it the long way once, then in subsequent questions I’ll only show the short way in the solutions.

The long way

The first thing to do is to calculate means for each group:

polyps_df = na.omit(polyps) # two participants (001 and 018) don't have data for 12m, so remove these
mean_T = mean(polyps_df$log_number12m[polyps_df$treatment=="sulindac"]) 
sd_T = sd(polyps_df$log_number12m[polyps_df$treatment=="sulindac"])
mean_C = mean(polyps_df$log_number12m[polyps_df$treatment=="placebo"])
sd_C = sd(polyps_df$log_number12m[polyps_df$treatment=="placebo"])
# There are 11 patients on Placebo (group C) and 9 on Sulindac (group T)
# The pooled standard deviation 
pooled_sd_polypsX = sqrt((10*sd_C^2 + 8*sd_T^2)/(10+8))
# Finally we find the test statistic
test_stat = (mean_T - mean_C)/(pooled_sd_polypsX*sqrt(1/11 + 1/9))

Under \(H_0\) (that there is no treatment effect) the test statistic follows a \(t_{18}\) distribution, and therefore we find our \(p\)-value by

2*pt(test_stat, df=18)
## [1] 0.001046571

Note that if the test statistic were positive, we’d have to do

2*(1-pt(test_stat, df=18))

Thus \(p=0.001\) and we conclude that the treatment effect is significant. Our confidence interval is given by

estimate = mean_T - mean_C
error = qt(0.975, df=18)*pooled_sd_polypsX*sqrt(1/11 + 1/9)
c(estimate - error, estimate + error)
## [1] -1.2264748 -0.3678718

The short way

R has a t-test function built in, so we can simply use that. Notice that it gives us everything our heart might desire (on a fairly mundane day), including a confidence interval!

t.test(
  x=polyps_df$log_number12m[polyps_df$treatment == "sulindac"],
  y=polyps_df$log_number12m[polyps_df$treatment == "placebo"],
  alternative = "two.sided",
  var.equal=T, # this makes the method use pooled variances, as we did in lectures
  conf.level = 0.95 # note that this is 1-alpha
)
## 
##  Two Sample t-test
## 
## data:  polyps_df$log_number12m[polyps_df$treatment == "sulindac"] and polyps_df$log_number12m[polyps_df$treatment == "placebo"]
## t = -3.9012, df = 18, p-value = 0.001047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.2264748 -0.3678718
## sample estimates:
## mean of x mean of y 
## 0.6671373 1.4643106

Either way, our 95% confidence interval is \(\left(-1.226,\;-0.368\right)\) (to 3 d.p.). Therefore, the confidence interval for \(\frac{X_T}{X_C}\) is

\[\left(10^{-1.226},\;10^{-0.368}\right) = \left(0.0594,\;0.429\right).\] That is, the number of polyps at 12 months for someone in group \(T\) is likely to be somewhere between \(0.0594 X_C\) and \(0.429 X_C\).
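If you’d rather let R do that back-transformation, a minimal one-liner (using the interval printed by t.test above) is

10^c(-1.2264748, -0.3678718) # back-transform the log10-scale CI to the ratio scale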

We’ll now move on to comparing the differences between baseline and outcome.

Exercise B.3 Perform another \(t\)-test, this time using the difference between outcome and baseline. Hint: because we’ve taken logs, you’ll have to think about which variable to use, and perhaps experiment with some possibilities.

Click for solution

Solution. The first step is to calculate a difference column. This is somewhat complicated by the fact that we have taken logs of the measurements. Taking logs of the differences would not work, since some of them are negative. One possibility is to work with the (unlogged) differences themselves:

polyps_df$diff = polyps_df$number12m - polyps_df$baseline
ggplot(data=polyps_df, aes(x=diff, fill=treatment)) + geom_histogram(position="dodge", bins=10)

But the outliers look potentially problematic, and the central bulk of each distribution doesn’t look very normal. We could also try the difference of the logged measurements:

polyps_df$diff_log = polyps_df$log_number12m - polyps_df$log_baseline
ggplot(data=polyps_df, aes(x=diff_log, fill=treatment)) + geom_histogram(position="dodge", bins=10)

This looks a lot better - no more outliers and closer to normal-looking. Obviously with so few observations we won’t have a nice normal curve.
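If you’d like a slightly more formal visual check of normality than a histogram, one optional extra is a normal Q-Q plot (a minimal sketch using ggplot2’s geom_qq; points close to the line suggest approximate normality):

ggplot(data=polyps_df, aes(sample=diff_log, col=treatment)) + geom_qq() + geom_qq_line()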

To do a t-test, we can use R’s in-built function

t.test(
  x=polyps_df$diff_log[polyps_df$treatment == "sulindac"],
  y=polyps_df$diff_log[polyps_df$treatment == "placebo"],
  alternative = "two.sided",
  var.equal=T, # this makes the method use pooled variances, as we did in lectures
  conf.level = 0.95 # note that this is 1-alpha
)
## 
##  Two Sample t-test
## 
## data:  polyps_df$diff_log[polyps_df$treatment == "sulindac"] and polyps_df$diff_log[polyps_df$treatment == "placebo"]
## t = -3.5142, df = 18, p-value = 0.002477
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0242756 -0.2578053
## sample estimates:
##   mean of x   mean of y 
## -0.58476481  0.05627565

Finally we can try fitting an ANCOVA model to the data, using the function lm.

Exercise B.4 Fit an ANCOVA model to the polyps data, and interpret your results. If you aren’t familiar with the function lm, it might be a good idea to look at the solutions.

Click for solution

Solution. In the ANCOVA model we saw in lectures, we fit a linear model with the outcome as the dependent / target variable, and the trial group and baseline as the independent / explanatory variables.

lm_polyp1 = lm(log_number12m ~ treatment + log_baseline, data=polyps_df)
summary(lm_polyp1)
## 
## Call:
## lm(formula = log_number12m ~ treatment + log_baseline, data = polyps_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69628 -0.15530  0.04155  0.17288  0.68553 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.6291     0.2798   2.249 0.038084 *  
## treatmentsulindac  -0.7046     0.1675  -4.205 0.000595 ***
## log_baseline        0.5932     0.1824   3.251 0.004698 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3673 on 17 degrees of freedom
## Multiple R-squared:  0.6659, Adjusted R-squared:  0.6266 
## F-statistic: 16.94 on 2 and 17 DF,  p-value: 8.973e-05

Based on this, the effect of the drug sulindac is significant (\(p=0.000595\), lower than in our previous tests). The 95% CI for the treatment effect (which is still \(\log_{10}\left(\frac{X_T}{X_C}\right)\)) is now

c(-0.7046 - qt(0.975, df=17)*0.1675, -0.7046 + qt(0.975, df=17)*0.1675 ) 
## [1] -1.0579941 -0.3512059

We could create a nicer-looking table of the coefficients using tbl_regression (from gtsummary):
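library(gtsummary) # load gtsummary (install it first if needed)
tbl_regression(lm_polyp1)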

Characteristic    Beta     95% CI¹        p-value
treatment
    placebo
    sulindac      -0.70    -1.1, -0.35    <0.001
log_baseline      0.59     0.21, 0.98     0.005

¹ CI = Confidence Interval

but note that this doesn’t show all the information that summary does.

Unlike in the \(t\)-test where we can only compare measurements like-for-like, in ANCOVA we fit a coefficient to the baseline covariate. This means we are no longer limited to comparing the outcome variable to variables on the same scale, but can also include other baseline variables.

Exercise B.5 Fit another linear model, this time including the other baseline variables.

Click for solution

Solution.

lm_polyp2 = lm(log_number12m ~ treatment + log_baseline + sex + age, data = polyps_df)
summary(lm_polyp2)
## 
## Call:
## lm(formula = log_number12m ~ treatment + log_baseline + sex + 
##     age, data = polyps_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59539 -0.32008  0.09788  0.17374  0.62089 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.888170   0.503801   1.763 0.098266 .  
## treatmentsulindac -0.730464   0.177936  -4.105 0.000936 ***
## log_baseline       0.495560   0.216900   2.285 0.037307 *  
## sexmale            0.197442   0.201555   0.980 0.342821    
## age               -0.009412   0.012205  -0.771 0.452620    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3779 on 15 degrees of freedom
## Multiple R-squared:  0.688,  Adjusted R-squared:  0.6048 
## F-statistic:  8.27 on 4 and 15 DF,  p-value: 0.0009895

In this case it appears that the other two baseline covariates, age and sex, don’t have a significant effect on the outcome.
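If you’d like a single formal comparison of the two nested models, rather than looking at the individual coefficients, one option (an optional extra, not something we did in lectures) is an \(F\)-test using R’s anova function:

# F-test of whether adding sex and age improves on the first model
anova(lm_polyp1, lm_polyp2)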

Exercise B.6 Inspect some plots of the residuals, to check whether these models have any systematic problems. Does there appear to be any multicollinearity in the data?

Click for solution

Solution. These solutions will demonstrate some methods with the first model. First of all, we can add columns with the residuals and fitted values from the model.

polyps_df$resid1 = resid(lm_polyp1)
polyps_df$fitted1 = fitted(lm_polyp1)
ggplot(data=polyps_df, aes(x=fitted1, y=resid1, col=treatment)) + geom_point()

ggplot(data=polyps_df, aes(x=log_baseline, y=resid1, col=treatment)) + geom_point()

ggplot(data=polyps_df, aes(x=resid1, fill=treatment)) + geom_histogram(bins=20)

None of these rings huge alarm bells, but because the dataset is so small it’s quite hard to tell! Arguably the residuals for the sulindac group are more spread out than those for the placebo group, but there are no obvious systematic trends.

We can find the variance inflation factors using the function vif from the car package:

library(car)
vif(lm_polyp1)
##    treatment log_baseline 
##     1.029768     1.029768

and these indicate no problems. Another sign that the ANCOVA model is well fitted is that the standard errors of the coefficients are all reasonably small. If there were a problem, for example multicollinearity, some of these would blow up.

B.1.2 Extension: Treatment for maternal periodontal disease

In this section we look at a larger and more complex (therefore more realistic) dataset. The opt data are in the package medicaldata, which you should already have loaded.

The aim of the study was to find out whether treating women for periodontal disease during the first 21 weeks of pregnancy affected the rates of low birth weight and pre-term birth.

The groups are stored in the group variable

  • group T: those who received periodontal treatment, and tooth polishing at their follow-ups
  • group C: those who received only brief oral exams

There are 823 participants and 171 variables - many of these are baseline covariates, but there are also quite a few interim measurements.

The study’s primary outcome variable is gestational age at end of pregnancy, but birth weight was also measured, as well as some clinical measures of periodontal disease and some microbiological and immunological outcomes.

To make this more manageable, we will concentrate on the outcome Birthweight.

Exercise B.7 Plot the variable Birthweight. Do we need to use any transformations before we model it?

Click for solution

Solution.

ggplot(data=opt, aes(x=Birthweight, fill=Group)) + geom_histogram(position="dodge")

This variable looks fairly normal (although there is a bit of a fat tail to the left), so we won’t transform it.

There are lots of issues with the opt data, many of which are common in real datasets.

We’ll reduce the dataset to a more manageable size, by removing all outcomes except Birthweight and reducing the number of covariates.

One issue with the opt dataset is missingness. While some of the missing data are genuinely missing, quite a lot of it can be logically filled in. This is explained in the code below.

opt_red = opt[ ,c(1:22,72)] # keep ID, group and baseline covariates (columns 1-22) plus the outcome Birthweight (column 72)
# Change NAs to "None" for diabetic
# since in opt the non-diabetic people are coded as NA (and therefore excluded from the model)
diab = as.character(opt_red$BL.Diab.Type)
diab[is.na(diab)] = "None"
opt_red$BL.Diab.Type = as.factor(diab)
# similar problem with smokers and how many cigarettes per day

# If people are non-smokers and have missing for number of cigarettes per day
# change their number of cigarettes to zero
sm = opt_red$Use.Tob
cigs = opt_red$BL.Cig.Day
cigs[(is.na(cigs)&(sm=="No "))] = 0
opt_red$BL.Cig.Day = cigs

# Same for alcohol and drinks per day

alc = opt_red$Use.Alc
dr = opt_red$BL.Drks.Day
dr[(is.na(dr)&(alc=="No "))] = 0
opt_red$BL.Drks.Day = dr

# If a participant hasn't had a previous pregnancy, her N.prev.preg should be zero (not NA)

pp = opt_red$Prev.preg
npp = opt_red$N.prev.preg
npp[pp=="No "] = 0
opt_red$N.prev.preg = npp

When we use lm to fit an ANCOVA model, all rows with an NA in will be ignored. It’s therefore important to try to eradicate any NAs where we actually do know the value!
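A quick way to see how much genuine missingness remains after these fixes (a minimal base-R sketch) is to count the NAs in each column:

# number of remaining NAs in each column of the reduced dataset
colSums(is.na(opt_red))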

In reality we’d also investigate the patterns of missingness, to check whether this might have introduced bias into the trial. There’s a whole extra computer practical on this, C, which I’ve included on the website but which we aren’t going to do (and so it won’t be assessed!). Feel free to have a look.

Exercise B.8 Perform a \(t\)-test on the outcome Birthweight. What do you find? Can you also perform a \(t\)-test using difference from baseline?

Click for solution

Solution. To perform the \(t\)-test we can use the inbuilt R function

# Check SDs are fairly close before proceeding
sd(opt_red$Birthweight[opt_red$Group == "T"], na.rm=T)
## [1] 636.82
sd(opt_red$Birthweight[opt_red$Group == "C"], na.rm=T)
## [1] 727.4854
t.test(
  x=opt_red$Birthweight[opt_red$Group == "T"],
  y=opt_red$Birthweight[opt_red$Group == "C"],
  alternative = "two.sided",
  var.equal=T, # this makes the method use pooled variances, as we did in lectures
  conf.level = 0.95 # note that this is 1-alpha
)
## 
##  Two Sample t-test
## 
## data:  opt_red$Birthweight[opt_red$Group == "T"] and opt_red$Birthweight[opt_red$Group == "C"]
## t = 0.74585, df = 807, p-value = 0.456
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -58.49266 130.18492
## sample estimates:
## mean of x mean of y 
##  3216.670  3180.824

We find that the difference in Birthweights between the groups is not even close to significant (\(p=0.456\)).

We can’t do a \(t\)-test with differences because there is no comparable baseline measurement.

Now we will move on to ANCOVA.

Exercise B.9 Before fitting the model, think about what you expect to find. Bear in mind:

  • the result above
  • the description of risk factors of preterm birth / low birth weight in the help file for opt
  • which variables you will include in your model
Click for solution

Solution. The \(p\)-value from the \(t\)-test was very high, so it seems unlikely that we’ll find a significant treatment effect using ANCOVA unless there are some significant interactions going on between covariates.

The help file says (under ‘Background’):

Many risk factors for preterm birth have already been identified, including maternal age, drug use, and diabetes. However, such factors are exhibited in only about half of preterm birth mothers, highlighting a need to expand our understanding of what contributes to preterm birth risk.

Therefore, if we include related factors in our model (many of which we have in our dataset) we should expect to see significant coefficients. But, it sounds like there is a lot that isn’t well understood, so our model is likely not to explain a huge proportion of the variance.

Exercise B.10 Fit an ANCOVA model. What do you find?

Click for solution

Solution. To fit a linear model including every variable as a covariate (apart from the target variable), we can do

lm_full = lm(Birthweight ~ ., data = opt_red[ ,-1]) # don't include the ID column!
summary(lm_full)
## 
## Call:
## lm(formula = Birthweight ~ ., data = opt_red[, -1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3019.46  -233.35    46.89   382.82  1935.06 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3906.439    702.835   5.558 3.91e-08 ***
## ClinicMN                -57.521     73.391  -0.784  0.43345    
## ClinicMS               -173.068     88.189  -1.962  0.05011 .  
## ClinicNY               -146.548    100.180  -1.463  0.14397    
## GroupT                   28.780     49.914   0.577  0.56440    
## Age                      -1.686      5.417  -0.311  0.75572    
## BlackYes               -158.055    127.807  -1.237  0.21664    
## WhiteYes                -47.195    100.791  -0.468  0.63976    
## Nat.AmYes              -224.757    112.946  -1.990  0.04699 *  
## AsianYes               -142.693    301.298  -0.474  0.63594    
## HispNo                  133.343     75.097   1.776  0.07624 .  
## HispYes                 126.498    115.009   1.100  0.27177    
## EducationLT 8 yrs       -43.324     70.513  -0.614  0.53915    
## EducationMT 12 yrs       23.011     66.301   0.347  0.72864    
## Public.AsstceYes       -144.623     62.886  -2.300  0.02176 *  
## HypertensionY          -372.548    148.033  -2.517  0.01207 *  
## DiabetesYes             443.811    170.343   2.605  0.00938 ** 
## BL.Diab.TypeType I     -425.019    318.490  -1.334  0.18249    
## BL.Diab.TypeType II          NA         NA      NA       NA    
## BMI                       2.124      3.857   0.551  0.58206    
## Use.TobYes              -17.794    134.088  -0.133  0.89446    
## BL.Cig.Day              -23.101     11.560  -1.998  0.04607 *  
## Use.AlcYes             -159.827    223.653  -0.715  0.47509    
## BL.Drks.Day              28.460     24.851   1.145  0.25252    
## Drug.AddNo             -498.971    659.330  -0.757  0.44944    
## Drug.AddYes            -475.778    806.745  -0.590  0.55555    
## Prev.pregYes             85.355     70.825   1.205  0.22856    
## N.prev.preg             -17.281     18.228  -0.948  0.34345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 653.6 on 686 degrees of freedom
##   (110 observations deleted due to missingness)
## Multiple R-squared:  0.06774,    Adjusted R-squared:  0.03241 
## F-statistic: 1.917 on 26 and 686 DF,  p-value: 0.004183

We see from the model summary that our model is terrible: the adjusted \(R^2\) is \(0.03241\), which means it is explaining only around 3% of the variance in Birthweight (even the unadjusted \(R^2\) is below 7%).
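If you want to extract these numbers programmatically (useful if you’re comparing several models), they are stored in the summary object:

summary(lm_full)$r.squared      # multiple R-squared
summary(lm_full)$adj.r.squared  # adjusted R-squared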

If you want to use only certain terms, you can include them in the formula, for example

lm_eg = lm(Birthweight ~ Group + Age + Hypertension, data=opt_red)

We see that, as we expected, the Group variable is not significant (\(p=0.5644\)). However, some terms are significant, for example whether or not the participant has diabetes, and how many cigarettes a mother smokes per day - this isn’t surprising given the contextual information we had.

Exercise B.11 Perform some diagnostic checks on your model. Do you have any reason to suspect it isn’t adequate?

Click for solution

Solution. One thing to notice from the linear model summary is that many coefficients have proportionally quite large standard errors. This could be because of a lack of data (e.g. if there are very few participants in some category) or because of multicollinearity (in which case the model cannot know which of the linked variables to attribute the effect to). Combining some categories or carefully removing some covariates could help improve the model.
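As one illustration (a sketch, not the only reasonable choice), we could use update to refit the model without BL.Diab.Type, whose Type II level was aliased in the full model:

# refit without BL.Diab.Type, which produced an aliased (NA) coefficient
lm_full2 = update(lm_full, . ~ . - BL.Diab.Type)
summary(lm_full2)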

We can also study the residuals of our model, to check that they appear to be homoskedastic and approximately normal with mean 0.

The first step is to extract the residuals and the fitted values (which are also useful). We will create a new data frame called opt_diag containing these, so that we can plot things easily without polluting our original dataset (in case we want to fit any more models).

opt_diag = na.omit(opt_red) # lm only fits where all variables are present
opt_diag$resid = resid(lm_full)
opt_diag$fitted = fitted(lm_full)

Some examples of plots are

ggplot(data=opt_diag, aes(x=resid, fill=Group)) + geom_histogram(position="dodge")

ggplot(data=opt_diag, aes(x=fitted, y=resid, col=Group)) + geom_point()
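Base R’s built-in diagnostics for lm objects are also useful here:

# residuals vs fitted values (which=1) and a normal Q-Q plot of the residuals (which=2)
plot(lm_full, which=1:2)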

Exercise B.12 Given the results of your ANCOVA model, what do you think the risks would be if the study had been much smaller, or if the allocation had not been well-balanced?

Click for solution

Solution. In this case, baseline factors that turned out to be significant (such as diabetes or smoking) could end up unevenly distributed between the groups, and that imbalance could make it appear that the treatment had a significant effect. ANCOVA adjusts for these covariates, so it guards against this; a \(t\)-test, in which the other covariates aren’t considered, would be vulnerable to it.
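One quick way to check for this kind of imbalance (a sketch using Diabetes as an example, since it was significant above) is to tabulate the significant baseline factors against the treatment groups:

# is a significant baseline factor balanced across the two groups?
table(opt_red$Group, opt_red$Diabetes)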

References

Giardiello, Francis M, Stanley R Hamilton, Anne J Krush, Steven Piantadosi, Linda M Hylind, Paul Celano, Susan V Booker, C Rahj Robinson, and G Johan A Offerhaus. 1993. “Treatment of Colonic and Rectal Adenomas with Sulindac in Familial Adenomatous Polyposis.” New England Journal of Medicine 328 (18): 1313–16.