A Computer Practical 1 - Missing data

In this computer practical we focus on an important issue in clinical trials (and most real world projects!): missing data. There will be a bit more reading than in future practicals, to give you the necessary theory and background. We will only have time to skim the surface of working with missing data, but if you want to find out more, some good references are Little and Rubin (2019) and chapter 17 of Gelman, Hill, and Vehtari (2021).

Consider it a sort of informal formative assignment to finish this practical in your own time if you don’t in class. Or, at the very least, you might need to return to it while working on future assignments. For this reason, it would be sensible to keep track of everything you do in an .R file that you can return to in the future.

There will be a mixture of full solutions, examples of possible solutions and example code to adapt. If you’re not sure how to do something, please ask!

R practicalities

There are many, many packages in R that implement methods for designing and analysing clinical trials (see the list at the CRAN task view). We will look at some of these, and will also write our own code for some tasks. Remember that to install a package, you can do

install.packages("<packagename>")

and to then load the package (it doesn’t load automatically on install) enter

library(<packagename>)

If you have problems running R on your laptop, or on the university machines, the most foolproof way might be to use GitHub Codespaces (thanks to Louis Aslett, who developed this for Data Science and Statistical Computing II). You may be familiar with this approach if you did Bayesian Computational Modelling III.

An advantage of this is that you can open the same codespace (the same instance of R) from any computer, so if you plan to work on things (for example your summative assignment, which will involve some R) from more than one computer, this might be ideal.

Codespaces requires you to have a GitHub account (you can sign up for free here); there is a short guide to creating a GitHub account here.

Direct link to codespace

Instructions for how to use codespace

A.1 Missing data

In [almost] any real-world study or trial, there will be some missing data. This can happen for a whole host of reasons: perhaps some participants dropped out completely, or couldn’t make some appointments. Perhaps some data were lost or corrupted. Perhaps some participants decided not to provide certain details, or didn’t comply properly with the study.

Why does this matter? Or, why do we need to think properly about this? Well, a huge amount of work in the planning and design of an RCT goes into the random allocation: we want to eliminate all sources of bias (including those we can’t observe, or aren’t even aware of) by randomly balancing the participants between the trial arms. If some of these participants’ data are missing, we can no longer be confident that we still have this balance.

Dealing with missing data boils down to two main tasks:

  • Understanding the pattern(s) of missingness
  • Processing the data to mitigate against the effect of the missingness

so this is what we’ll be working on now. Before that though, we need some data to work with!

A.1.1 Datasets

The datasets we’ll work with come from the packages medicaldata and smdi, so you’ll need to install those if you don’t have them. We’ll also be using the packages naniar and visdat, which contain lots of methods for handling missing data; mice, whose md.pattern function tabulates missingness patterns; and rstanarm, which will enable us to easily sample from regression models. Other packages are available, as you can see from the CRAN task view on Missing Data. We’ll also use tidyverse for various plotting and data handling tasks.

library(medicaldata)
library(naniar)
library(smdi)
library(tidyverse)
library(visdat)
library(rstanarm)
library(mice)      # for md.pattern

Throughout this practical we’ll work with three datasets.

A.1.1.1 Supraclavicular

The first dataset is the supraclavicular data. This is a fairly simple dataset (at least in terms of missingness) which I’ll use to demonstrate techniques before asking you to work on the more complex data!

Exercise A.1 Load the supraclavicular data.

  • What is the study about?
  • What is the primary outcome?
  • What are the baseline covariates? (Remember these must be measured before allocation.)

Create a data frame containing just the baseline covariates, the allocation and the primary outcome variable. Make sure these are of the correct type.

Click for solution

Solution. The details of the study can be found using ?supraclavicular.

The primary outcome variable is time to 4-nerve sensory block, which has the column name onset_sensory.

The baseline covariates are gender, BMI and age. To create our data frame, we need

sup_df = supraclavicular[ ,c(2:5, 9)]
sup_df = sup_df %>% mutate(across(c(group, gender), factor))

A.1.1.2 Lung cancer dataset

This is the smdi_data dataset from the package smdi. The intervention group variable is exposure. The outcome data are survival (or ‘time-to-event’) data: in this study, the ‘event’ is death and the follow-up period is five time units (probably years, but the documentation doesn’t say!). Some participants are still alive at the end of the follow-up period (so they have status=0) and the data are therefore right censored. We will learn more about this type of data later in the course.
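
As a tiny illustration of right censoring (hypothetical values, not rows of smdi_data):

# Participant A died at time 2.3 (status = 1: the event was observed).
# Participant B was still alive when follow-up ended at time 5 (status = 0),
# so we only know that B's survival time exceeds 5.
toy = data.frame(id = c("A", "B"), eventtime = c(2.3, 5.0), status = c(1, 0))
toy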

For now, the only preparation to do is to convert status to a factor variable:

smdi_data$status = factor(smdi_data$status)

This is a synthetic (or possibly synthetically altered) dataset designed for use with missing data exploration, so we have the advantage that the help file tells us about the missingness mechanisms (more on that soon).

A.1.1.3 Obstetrics and Periodontal Therapy

The final dataset we’ll work with is opt, which is about the treatment of maternal periodontal disease and its effect on birthweight and term.

Exercise A.2 Use ?opt to find out about this trial.

There are a lot of columns here, but we won’t keep all of them. Our dataset (for now) will be

opt_df = opt[ ,c(1:4, 10:22, 72)]

which retains most of the baseline covariates, the intervention group and the outcome variable (birthweight). There is some ‘messiness’ in the data: for some variables, missing values are recorded as strings of blank spaces rather than as NA. We’ll sort that out now by running the following command

opt_df = opt_df %>% 
  mutate(across(c(Use.Tob, Use.Alc, Drug.Add), 
                ~ gsub("^ +$", NA, .x))) %>%
  mutate(across(c(Use.Tob, Use.Alc, Drug.Add), factor))

A.2 Understanding the patterns of missingness

The first thing we’ll do is explore the data to find out what the pattern of missingness is. First of all, we can visualise which data are missing.

A.2.1 Visualising missingness

The function vis_dat (from visdat) gives an excellent first look at a dataset, colouring data by type (and missing data in grey). The default is to sort the columns by type, but you can see below that I’ve made it keep to the original order.

vis_dat(sup_df, sort_type=F)

The functions gg_miss_var, gg_miss_case and gg_miss_upset (all from naniar) allow us to quickly see how much missing data there is, either by variable, by case or by intersection. Look at the help files to find out what the arguments do. We see that in supraclavicular there are three missing entries for BMI, and all other variables are complete.

gg_miss_var(sup_df)

Plotting by cases shows that each of these missing values belongs to a different participant, so there are three participants with one missing item each.

gg_miss_case(sup_df)

The function md.pattern (from the mice package we loaded earlier) tabulates how many rows have each pattern of missingness:

md.pattern(sup_df)

##     group gender age onset_sensory bmi  
## 100     1      1   1             1   1 0
## 3       1      1   1             1   0 1
##         0      0   0             0   3 3

Exercise A.3 Make these plots for the remaining datasets, smdi_data and opt_df. How does the missingness compare? Which plots do you find more useful and why?

Click for solution

Solution. Lung cancer

We see there is missingness for three variables:

vis_dat(smdi_data)

The naniar functions can show us how this is spread across the variables and participants.

gg_miss_var(smdi_data, show_pct=T)

gg_miss_case(smdi_data, order_cases = T)

The function md.pattern shows us how many participants have each combination of missing data:

md.pattern(smdi_data, rotate.names = T)

##     exposure age_num female_cat smoking_cat physical_cat alk_cat histology_cat ses_cat
## 795        1       1          1           1            1       1             1       1
## 479        1       1          1           1            1       1             1       1
## 446        1       1          1           1            1       1             1       1
## 263        1       1          1           1            1       1             1       1
## 151        1       1          1           1            1       1             1       1
## 176        1       1          1           1            1       1             1       1
## 93         1       1          1           1            1       1             1       1
## 97         1       1          1           1            1       1             1       1
##            0       0          0           0            0       0             0       0
##     copd_cat eventtime status pdl1_num ecog_cat egfr_cat     
## 795        1         1      1        1        1        1    0
## 479        1         1      1        1        1        0    1
## 446        1         1      1        1        0        1    1
## 263        1         1      1        1        0        0    2
## 151        1         1      1        0        1        1    1
## 176        1         1      1        0        1        0    2
## 93         1         1      1        0        0        1    2
## 97         1         1      1        0        0        0    3
##            0         0      0      517      899     1015 2431

Obstetrics and Periodontal Therapy

Using vis_dat we see that there is a lot of missing data in the opt dataset, and that some variables are particularly badly affected.

vis_dat(opt_df)

Now the naniar functions really do help us to understand what we’re dealing with!

gg_miss_var(opt_df, show_pct=T)

gg_miss_case(opt_df, order_cases = T, show_pct=T)

Not many individual participants have more than about 25% of their data missing, but there are some variables that are missing for nearly all participants (more on this later).
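
We could also look at the intersections of missingness using gg_miss_upset, which was mentioned earlier but not yet used. A minimal sketch, where nsets (passed through to UpSetR) controls how many variables are shown:

# UpSet plot of which combinations of variables are missing together
gg_miss_upset(opt_df, nsets = 5)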

A.2.2 Summary tables

We may also want to visualise or summarise missingness in a table, and there are various ways to do this, for example miss_case_summary, miss_case_table and miss_var_summary.

Exercise A.4 Use the three table generating functions listed above on our datasets. Do they give you any new information compared to the plots?

Click for solution

Solution. Working through the list of functions (just for opt_df), we have:

miss_case_summary(opt_df)
## # A tibble: 823 × 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1   423      9     50  
##  2    80      8     44.4
##  3    93      8     44.4
##  4   477      8     44.4
##  5   539      8     44.4
##  6    11      7     38.9
##  7    21      7     38.9
##  8    43      7     38.9
##  9    54      7     38.9
## 10   150      7     38.9
## # ℹ 813 more rows

This shows how many variables (both in number and as a percentage of the total number of variables) are missing, for each case (in decreasing order of missingness).

miss_case_table(opt_df)
## # A tibble: 9 × 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              1       7     0.851
## 2              2      87    10.6  
## 3              3     462    56.1  
## 4              4     224    27.2  
## 5              5      17     2.07 
## 6              6      12     1.46 
## 7              7       9     1.09 
## 8              8       4     0.486
## 9              9       1     0.122

miss_case_table tabulates the number of cases with \(n\) missing variables, for \(n=1,\ldots\).

The remaining functions look at things from the point of view of variables:

miss_var_summary(opt_df)
## # A tibble: 18 × 3
##    variable      n_miss pct_miss
##    <chr>          <int>    <num>
##  1 BL.Drks.Day      810    98.4 
##  2 BL.Diab.Type     799    97.1 
##  3 BL.Cig.Day       731    88.8 
##  4 N.prev.preg      217    26.4 
##  5 BMI               73     8.87
##  6 Use.Alc           27     3.28
##  7 Drug.Add          27     3.28
##  8 Use.Tob           26     3.16
##  9 Birthweight       14     1.70
## 10 PID                0     0   
## 11 Clinic             0     0   
## 12 Group              0     0   
## 13 Age                0     0   
## 14 Education          0     0   
## 15 Public.Asstce      0     0   
## 16 Hypertension       0     0   
## 17 Diabetes           0     0   
## 18 Prev.preg          0     0

Now we see how many (and what percentage) of cases are missing for each variable.

There are many, many possible ways to visualise or tabulate missing data, and we have used but a few. If you want to find more, you can look in the help file for naniar, or indeed at another package designed for working with missing data.

A.2.3 Mechanisms of missingness

So far, things have been a bit ad-hoc. We may have noticed some patterns or trends, but we haven’t subjected these to any scrutiny, or used any statistical methodology to understand the missingness.

When working with missing data, it’s important to think about the mechanism that is causing some data to be missing. Is there some systematic reason why some variables are missing?

Broadly speaking, there are three types of missing data. These terms were coined by Rubin (1976).

A.2.3.1 Missing completely at random (MCAR)

If data are MCAR, then the probability of missingness is the same for all units: missingness is statistically independent of all covariates (observed and unobserved), of the treatment and of the outcome. If each participant decided whether or not to provide a certain piece of information by rolling a die (e.g. 1 = don’t provide), then the resulting missingness would be MCAR.

This type of missingness is the simplest to deal with, because removing the missing units is unlikely to cause any bias. However, it’s extremely rare in real life, and it’s difficult to even think of an example where it would definitely hold!
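
As a quick illustration (hypothetical data, not one of our trial datasets), here is a minimal sketch of simulating the die-rolling MCAR mechanism in R:

# Each participant 'rolls a die' and withholds their value on a 1,
# independently of the value itself and of everything else
set.seed(42)
bmi_complete = rnorm(100, mean = 28, sd = 5)
roll = sample(1:6, 100, replace = TRUE)
bmi_mcar = ifelse(roll == 1, NA, bmi_complete)
mean(is.na(bmi_mcar))   # empirical missingness rate, roughly 1/6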

A.2.3.2 Missing at random (MAR)

For data to be MAR, the probability of missingness must depend only on available information. In this case, we can use the available information to model the missingness.

MAR data are somewhat more common (or at least the mechanism is more commonly assumed than MCAR), and are thankfully still moderately simple to deal with. If missingness is ‘at random’, then adjusting for the relevant covariates in the analysis will avoid bias.

A.2.3.3 Missing not at random (MNAR)

This is the trickiest (and probably most common) mechanism, in which the missingness depends on unobserved data. It is likely then that the unobserved (and therefore unavailable) information also predicts the value that is missing, and possibly also the outcome.

For example, suppose that the treatment in a clinical trial causes some unwanted side effect, such as pain. In this case, a participant in the intervention group is more likely to drop out of the study than a participant in the control group. Unless we measure this side effect (for example, a pain score), the resulting missingness is not at random.

The most difficult case of MNAR is when the missingness depends on the (potentially missing) variable itself. For example, in a trial of a weight loss intervention, participants with greater weights might be more reluctant to reveal them, especially if they feel the trial is going badly.

MNAR mechanisms can be [imperfectly] modelled, or else mitigated by including more covariates in the data, which brings the mechanism closer to MAR. It is vital to work with the subject experts, who will have a much better idea of the sorts of mechanisms that might be causing missingness in their study.

We can never prove which of these is the case, since by definition we don’t have the information we would need to establish that the mechanism is (or isn’t) MNAR. All we can do is study patterns in the data and find evidence one way or the other.
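
To make the distinction concrete, here is a small simulation sketch (hypothetical variables, loosely based on the weight loss example above) showing the same outcome made missing by a MAR mechanism (missingness depends only on an observed covariate) and by an MNAR mechanism (missingness depends on the unobserved value itself):

set.seed(1)
n = 1000
pain = rbinom(n, 1, 0.3)                          # observed covariate
weight = rnorm(n, mean = 80 + 5 * pain, sd = 10)  # variable of interest

# MAR: missingness depends only on the observed covariate 'pain'
p_mar = ifelse(pain == 1, 0.4, 0.1)
weight_mar = ifelse(runif(n) < p_mar, NA, weight)

# MNAR: heavier participants are more likely to withhold their weight
p_mnar = plogis((weight - 90) / 5)
weight_mnar = ifelse(runif(n) < p_mnar, NA, weight)

# The naive observed means are biased in both cases, but under MAR the
# bias can be removed by adjusting for 'pain'; under MNAR it cannot
mean(weight)
mean(weight_mar, na.rm = TRUE)
mean(weight_mnar, na.rm = TRUE)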

A.3 Exploring the relationship between missingness and other variables

In a clinical trial, ultimately what we care about is whether the missingness has changed our understanding of the outcome variable(s). This is most likely to happen if the missingness is related to the outcome variable. This could happen either directly, or because the missingness is correlated to a variable that affects the outcome.

Again, we will explore things visually before using some more rigorous methods. Now, instead of simply understanding how much missing data there is, and which variables and/or cases are involved, we want to look at the relationships between missingness and the values of the other variables.

In the package naniar you can create a copy of the data frame containing just NA and !NA, indexing exactly where the missing values are.

as_shadow(sup_df)
group_NA gender_NA bmi_NA age_NA onset_sensory_NA
!NA !NA !NA !NA !NA
!NA !NA !NA !NA !NA
!NA !NA !NA !NA !NA
...
!NA !NA NA !NA !NA
...
(103 rows in total; the three rows with bmi_NA = NA are the participants with missing BMI)

To avoid duplication of the original names, the column names are suffixed with _NA. This means the shadow can be appended to the original data frame to create what the authors of naniar call a nabular object, which we’ll store as sup_nab for use below.

sup_nab = nabular(sup_df)
sup_nab
group gender bmi age onset_sensory group_NA gender_NA bmi_NA age_NA onset_sensory_NA
1 0 41.15 52 0 !NA !NA !NA !NA !NA
2 0 25.22 54 7 !NA !NA !NA !NA !NA
2 0 34.14 46 24 !NA !NA !NA !NA !NA
1 0 41.57 54 4 !NA !NA !NA !NA !NA
1 1 27.17 41 30 !NA !NA !NA !NA !NA
...
2 0 NA 51 50 !NA !NA NA !NA !NA
...
(103 rows in total; bmi_NA is NA exactly where bmi is NA)

This is useful because we can investigate the values of the actual data while conditioning on the missingness. We can summarise the outcome distribution according to whether a variable is missing or observed:

sup_nab %>%
  group_by(bmi_NA) %>%
  summarise_at(.vars = "onset_sensory",
               .funs = c("mean", "sd", "var", "min", "max"),
               na.rm = TRUE)
## # A tibble: 2 × 6
##   bmi_NA  mean    sd   var   min   max
##   <fct>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 !NA       13  11.4  131.     0    50
## 2 NA        24  23.1  532      6    50

and visualise the outcome distribution for missing and non-missing values of a covariate:

ggplot(sup_nab,
       aes(x = onset_sensory,
           fill = bmi_NA)) + 
  geom_histogram()

We could also explore whether the missingness is related to the values of the other covariates, for example gender (in this data 0=female, 1=male)

ggplot(sup_nab,
       aes(x = gender,
           fill = bmi_NA)) + 
  geom_bar()

This isn’t a great example because there are only three missing BMI values, but you can probably guess what you’ll be doing next…

Exercise A.5 For the lung cancer and obstetric periodontal treatment datasets, investigate whether / how the outcome distribution appears to be affected by the missing values.

  • For the smdi_data dataset, remember that the outcome variable is the combination of eventtime and status. Can you think of a way to visualise the data that combines these pieces of information?
  • For opt, ignore the variables BL.Diab.Type, BL.Cig.Day, BL.Drks.Day and N.prev.preg. You can create a temporary opt data frame to work with
opt_tmp = opt_df[ ,-c(9,12,14,17)]

There are many, many plots you could make here, but try not to spend too long on this - why not divide up tasks between you and your neighbours?

Click for solution

Solution. We’ll look here in some detail at the lung cancer dataset (smdi_data) - these aren’t really solutions as such, just some more detail of things you could do.

smdi_nab = nabular(smdi_data)

First by ecog_cat. Since there is a lot more missing data here, I’ve chosen to compare the distributions side-by-side using facet_wrap. This also allows me to colour by status (although it is sort of obvious from the large peak at eventtime=5 that those people are still alive). Setting scales = "free_y" makes it easier to compare the overall shape of the distribution. You may have found a different way to display the data that may be equally (or more!) informative - if so that’s fine!

smdi_nab %>%
  group_by(ecog_cat) %>%
  summarise_at(.vars = "eventtime",
               .funs = c("mean", "sd", "var", "min", "max"),
               na.rm = TRUE)
## # A tibble: 3 × 6
##   ecog_cat  mean    sd   var       min   max
##   <fct>    <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 0         2.21  1.80  3.25 0.0000215     5
## 2 1         2.07  1.79  3.22 0.000754      5
## 3 <NA>      2.22  1.86  3.46 0.00248       5
ggplot(smdi_nab,
       aes(x = eventtime,
           fill = status)) + 
  geom_histogram() + facet_wrap(~ecog_cat_NA, scales = "free_y")

Next by egfr_cat:

smdi_nab %>%
  group_by(egfr_cat) %>%
  summarise_at(.vars = "eventtime",
               .funs = c("mean", "sd", "var", "min", "max"),
               na.rm = TRUE)
## # A tibble: 3 × 6
##   egfr_cat  mean    sd   var       min   max
##   <fct>    <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1 0         1.96  1.79  3.22 0.000754      5
## 2 1         2.94  1.91  3.66 0.0115        5
## 3 <NA>      2.15  1.76  3.10 0.0000215     5
ggplot(smdi_nab,
       aes(x = eventtime,
           fill = status)) + 
  geom_histogram() + facet_wrap(~egfr_cat_NA, scales = "free_y")

and finally by pdl1_num. (Note that because pdl1_num is continuous, grouping by its values as below gives one group per distinct value, which isn’t very informative; grouping by pdl1_num_NA from the nabular object would give a more useful two-group summary. The histogram that follows does condition on pdl1_num_NA.)

smdi_nab %>%
  group_by(pdl1_num) %>%
  summarise_at(.vars = "eventtime",
               .funs = c("mean", "sd", "var", "min", "max"),
               na.rm = TRUE)
## # A tibble: 1,540 × 6
##    pdl1_num  mean    sd   var   min   max
##       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     6.14 0.655    NA    NA 0.655 0.655
##  2    12.4  0.463    NA    NA 0.463 0.463
##  3    12.9  0.679    NA    NA 0.679 0.679
##  4    13.6  0.358    NA    NA 0.358 0.358
##  5    15.8  0.290    NA    NA 0.290 0.290
##  6    16.7  1.71     NA    NA 1.71  1.71 
##  7    17.4  0.245    NA    NA 0.245 0.245
##  8    17.8  1.52     NA    NA 1.52  1.52 
##  9    18.7  4.91     NA    NA 4.91  4.91 
## 10    19.0  2.03     NA    NA 2.03  2.03 
## # ℹ 1,530 more rows
ggplot(smdi_nab,
       aes(x = eventtime,
           fill = status))+
  geom_histogram() + facet_wrap(~pdl1_num_NA, scales = "free_y")

It appears that pdl1_num values are likely to be MNAR, since the outcome distribution looks different (proportionally more early deaths) for the missing values.

We could make many more plots of this type: for example, by plotting the distributions of other variables rather than the outcome.

For example, we could look at how the missingness of each variable relates to smoking category:

library(gridExtra)
ecog_sm = ggplot(smdi_nab,
       aes(x = smoking_cat,
           fill = ecog_cat_NA))+
  geom_bar() 


egfr_sm = ggplot(smdi_nab,
       aes(x = smoking_cat,
           fill = egfr_cat_NA))+
  geom_bar() 

pdl1_sm = ggplot(smdi_nab,
       aes(x = smoking_cat,
           fill = pdl1_num_NA))+
  geom_bar()

grid.arrange(ecog_sm, egfr_sm, pdl1_sm, nrow=1)

So, it looks likely that the missingness of egfr_cat depends on smoking status, but less likely that the missingness of the other two variables does.

Looking into all the variables one by one like this would take a very long time, and wouldn’t even allow us to make any (useful) conclusions. Really what we want to know is whether there is a significant difference in the observed data compared to the missing data.

A.3.1 Statistical summaries of the effect of missingness

According to the framework introduced by Rubin (1976), if the data are MAR (missing at random) then the missingness can be explained by (some of) the observed covariates, so we would expect the characteristics of participants with missing values to differ from those of participants without. If the missingness is MCAR (missing completely at random) we would expect no significant difference. Similarly, if the data are MNAR (missing not at random) due to some unobserved variable that is independent of all observed variables, we would expect no pattern. In reality this is almost never the case (the confounding variables are usually linked to some observed variables), so really we are testing against the data being MCAR.

A.3.1.1 Hotelling’s multivariate t-test

This test examines the differences in the other covariates between the cases where some partially observed variable is observed and the cases where it is missing. The test statistic is derived by assuming both groups are drawn from the same multivariate normal distribution, and so a high value of the test statistic (equivalently, a low p-value) suggests that there are significant differences between the groups. The test is based on a generalisation of Student’s \(t\)-test due to Hotelling (1931).
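
In symbols (a standard form of the statistic, stated here for reference): if the \(n_1\) cases with the variable observed have mean vector \(\bar{x}_1\) for the other covariates, the \(n_2\) cases with it missing have mean vector \(\bar{x}_2\), and \(S\) is the pooled sample covariance matrix, then

\[ T^2 = \frac{n_1 n_2}{n_1 + n_2} \left(\bar{x}_1 - \bar{x}_2\right)^{\mathsf{T}} S^{-1} \left(\bar{x}_1 - \bar{x}_2\right). \]

Under the null hypothesis that both groups share the same mean vector, a suitably scaled version of \(T^2\) follows an \(F\) distribution, from which the p-value is computed.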

To perform Hotelling’s multivariate \(t\)-test on the sup_df data we enter

smdi_hotelling(sup_df)
##   covariate hotteling_p
## 1       bmi       0.340

For this data, we find that there is insufficient evidence to reject the hypothesis that the BMI data are MCAR.

Exercise A.6 Perform Hotelling’s multivariate \(t\)-test on the other two datasets, opt_tmp and smdi_data. What do you find?

Click for solution

Solution. First we’ll look at smdi_data.

smdi_hotelling(smdi_data)
##   covariate hotteling_p
## 1  ecog_cat       0.783
## 2  egfr_cat       <.001
## 3  pdl1_num       <.001

It appears that ecog_cat may be MCAR, but for egfr_cat and pdl1_num there is sufficient evidence to suggest MAR or MNAR.

Now for opt_tmp

smdi_hotelling(opt_tmp)
##     covariate hotteling_p
## 1         BMI       <.001
## 2     Use.Tob       <.001
## 3     Use.Alc       <.001
## 4    Drug.Add       <.001
## 5 Birthweight       <.001

These all appear to have a significant departure from MCAR.

Caution: The power of this test (and others like it) can be strongly influenced by sample size, so it is sensible to combine it with a more detailed approach.

A.3.1.2 Absolute standardised mean difference

The absolute standardised mean difference (ASMD) gives a measure of how different the values of the observed covariates are for missing versus observed values of each partially observed covariate. For every partially observed covariate, there is an ASMD for each other covariate.

Label the partially observed covariate \(X_M\), and suppose \(X_1,\ldots,X_K\) are the other covariates. The dataset is split into those cases with \(X_M\) observed and those with \(X_M\) missing. For each covariate \(X_1,\ldots,X_K\) we find the absolute value of the difference in means, and divide this by the standard deviation of that covariate. The ASMD is therefore always non-negative, and should not be affected by sample size.
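
In symbols, for each of the other covariates \(X_k\),

\[ \mathrm{ASMD}_k = \frac{\left| \bar{x}_k^{\,\mathrm{obs}} - \bar{x}_k^{\,\mathrm{miss}} \right|}{s_k}, \]

where \(\bar{x}_k^{\,\mathrm{obs}}\) and \(\bar{x}_k^{\,\mathrm{miss}}\) are the means of \(X_k\) among the cases with \(X_M\) observed and missing respectively, and \(s_k\) is the standard deviation of \(X_k\) (exactly how \(s_k\) is computed, e.g. pooled across the two groups, is an implementation detail of the package).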

A general rule of thumb is that ASMD values over 0.1 are cause for concern, though again this can be vulnerable to small sample sizes.

The function smdi_asmd from smdi creates an asmd object, which has several parts to it. Note that we set includeNA=T so that missingness of the other variables (as well as their observed values) is included in the comparison. This won’t make a difference for sup_df, but it will in the exercise.

asmd_sup = smdi_asmd(sup_df, includeNA=T)

Firstly there is a summary table, which shows the median, min and max of ASMD for each partially observed covariate.

asmd_sup
## # A tibble: 1 × 4
##   covariate asmd_median asmd_min asmd_max
## * <chr>     <chr>       <chr>    <chr>   
## 1 bmi       0.471       0.343    0.604

There is also a Table 1, so called because a summary table of this nature should always be included when summarising a dataset in terms of the difference between two groups. This is formatted a little strangely, as it is designed for use in printed works. The table includes the result of a statistical test (by default a chi-squared test) showing whether the differences are statistically significant. (The function kable, from the knitr package, is just formatting the table for HTML; it has nothing to do with missing data.)

kable(asmd_sup$bmi$asmd_table1)
                            0              1               p      test  SMD
n                           100            3
group = 2 (%)               50 (50.0)      1 (33.3)        1.000        0.343
gender = 1 (%)              44 (44.0)      2 (66.7)        0.850        0.468
age (mean (SD))             48.10 (13.30)  41.00 (16.46)   0.367        0.474
onset_sensory (mean (SD))   13.00 (11.44)  24.00 (23.07)   0.114        0.604

Finally there is a plot showing each ASMD

asmd_sup$bmi$asmd_plot

We see that although the ASMD values are quite large (much bigger than the advised 0.1), because there are so few missing values the differences are not statistically significant.

Exercise A.7 Investigate the ASMD for our datasets smdi_data and opt_tmp. Of the partially observed covariates, which seem to be most strongly related to other covariates? Do any seem to be MCAR?

Make sure you remove any participant ID variables, since we don’t want to include those in our analysis! Again, it might be a good idea to pair up!

Click for solution

Solution. We will do this just for the opt_tmp data in the solutions.

asmd_opt = smdi_asmd(opt_tmp[ ,-1], includeNA=T)
asmd_opt
## # A tibble: 5 × 4
##   covariate   asmd_median asmd_min asmd_max
## * <chr>       <chr>       <chr>    <chr>   
## 1 BMI         0.195       0.041    1.717   
## 2 Use.Tob     0.093       0.004    39.900  
## 3 Use.Alc     0.099       0.004    8.204   
## 4 Drug.Add    0.073       0.004    7.229   
## 5 Birthweight 0.340       0.004    2.556

The maximum ASMD values are all well above 0.1 (and for BMI and Birthweight the medians are too), so it seems unlikely that the data are MCAR.

We can look in a little more detail, first at BMI:

asmd_opt$BMI$asmd_plot

kable(asmd_opt$BMI$asmd_table1)
                           0                 1                 p       test  SMD
n                          750               73
Clinic (%)                                                     <0.001         1.717
  KY                       205 (27.3)        6 ( 8.2)
  MN                       237 (31.6)        10 (13.7)
  MS                       192 (25.6)        0 ( 0.0)
  NY                       116 (15.5)        57 (78.1)
Group = T (%)              375 (50.0)        38 (52.1)         0.832          0.041
Age (mean (SD))            25.91 (5.54)      26.68 (5.87)      0.256          0.136
Education (%)                                                  0.137          0.229
  8-12 yrs                 441 (58.8)        38 (52.1)
  LT 8 yrs                 134 (17.9)        20 (27.4)
  MT 12 yrs                175 (23.3)        15 (20.5)
Public.Asstce = Yes (%)    563 (75.1)        38 (52.1)         <0.001         0.492
Hypertension = Y (%)       25 ( 3.3)         0 ( 0.0)          0.220          0.263
Diabetes = Yes (%)         24 ( 3.2)         0 ( 0.0)          0.235          0.257
Use.Tob (%)                                                    0.153          0.269
  No                       636 (84.8)        68 (93.2)
  Yes                      89 (11.9)         4 ( 5.5)
  NA                       25 ( 3.3)         1 ( 1.4)
Use.Alc (%)                                                    0.583          0.146
  No                       709 (94.5)        71 (97.3)
  Yes                      15 ( 2.0)         1 ( 1.4)
  NA                       26 ( 3.5)         1 ( 1.4)
Drug.Add (%)                                                   0.434          0.161
  No                       720 (96.0)        71 (97.3)
  Yes                      4 ( 0.5)          1 ( 1.4)
  NA                       26 ( 3.5)         1 ( 1.4)
Prev.preg = Yes (%)        555 (74.0)        56 (76.7)         0.715          0.063
Birthweight (mean (SD))    3194.08 (684.92)  3247.29 (669.27)  0.529          0.079

We see that there appears to be a strong relationship between clinic and missingness of BMI. In particular, a disproportionately high number of missing BMI values seem to be from the New York (NY) clinic. There also appears to be less missingness for those with Public.Asstce = Yes (those for whom the government paid for the delivery).

Next we look at Use.Alc

kable(asmd_opt$Use.Alc$asmd_table1)
                           0                 1                 p       test  SMD
n                          796               27
Clinic (%)                                                     <0.001         0.874
  KY                       207 (26.0)        4 (14.8)
  MN                       239 (30.0)        8 (29.6)
  MS                       191 (24.0)        1 ( 3.7)
  NY                       159 (20.0)        14 (51.9)
Group = T (%)              399 (50.1)        14 (51.9)         1.000          0.035
Age (mean (SD))            25.96 (5.59)      26.37 (4.99)      0.710          0.077
Education (%)                                                  0.936          0.070
  8-12 yrs                 464 (58.3)        15 (55.6)
  LT 8 yrs                 149 (18.7)        5 (18.5)
  MT 12 yrs                183 (23.0)        7 (25.9)
Public.Asstce = Yes (%)    582 (73.1)        19 (70.4)         0.924          0.061
Hypertension = Y (%)       23 ( 2.9)         2 ( 7.4)          0.438          0.206
Diabetes = Yes (%)         24 ( 3.0)         0 ( 0.0)          0.738          0.249
BMI (mean (SD))            27.70 (7.16)      26.88 (6.18)      0.568          0.122
Use.Tob (%)                                                    <0.001         8.204
  No                       704 (88.4)        0 ( 0.0)
  Yes                      92 (11.6)         1 ( 3.7)
  NA                       0 ( 0.0)          26 (96.3)
Drug.Add (%)                                                   <0.001         7.079
  No                       790 (99.2)        1 ( 3.7)
  Yes                      5 ( 0.6)          0 ( 0.0)
  NA                       1 ( 0.1)          26 (96.3)
Prev.preg = Yes (%)        591 (74.2)        20 (74.1)         1.000          0.004
Birthweight (mean (SD))    3198.27 (677.67)  3225.88 (949.61)  0.873          0.033
asmd_opt$Use.Alc$asmd_plot

This time the two most strongly related covariates are Use.Tob and Drug.Add. This is perhaps not surprising because they are quite similar variables. Notice that clinic still has a strong effect (the \(x\)-axis goes a lot higher on this plot than the one for BMI). From the table, we see that again a disproportionately high number of missing values come from the NY clinic, and that missingness in Use.Alc is much more likely if Use.Tob and/or Drug.Add are missing.

A.3.2 Modelling missingness

One of the most common ways to model missingness is using logistic regression. You might already have come across this in your degree, and we’ll cover it in more detail later, but for now the box below tells you all you need to know.

Logistic regression is a type of generalised linear model that is used to model a binary categorical variable \(Y\) (often labelled 0 and 1). In our case the two values are NA and !NA, i.e. ‘missing’ and ‘not missing’.

The logistic regression model has the form

\[ \operatorname{logit}\left(p\right) = \log \left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_K x_K,\] where \(p = P\left(Y=1\right)\) is the probability of an outcome of 1, the \(x_i\) are explanatory variables (in our case these will be the baseline covariates and the trial arm) and the \(\beta_i\) are coefficients.

The logit function rescales the probability (which can only be in \(\left[0,1\right]\)) to the real line, so that it works with linear combination on the right hand side. Conversely, applying the inverse logit function to the RHS gives a value in \(\left[0,1\right]\).
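
As a quick numerical check of this (base R’s qlogis and plogis are the logit and inverse logit functions):

qlogis(0.25)           # logit(0.25) = log(0.25 / 0.75), about -1.10
plogis(qlogis(0.25))   # the inverse logit recovers 0.25
plogis(c(-5, 0, 5))    # any real value is mapped back into (0, 1)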

Fitting a logistic regression model in R is very similar to fitting a linear regression. We use the function glm, which is a general function for generalised linear models. To specify that we want logistic regression, we must include the argument family = binomial(link = "logit").

We’ll use our nabular objects for this, since we already have variables denoting missingness. For example, we can see whether missingness of BMI in the supraclavicular dataset relates to any of the other variables.

sup_nab = nabular(sup_df)
sup_glm = glm(bmi_NA ~ group + gender + age + onset_sensory, 
              data = sup_nab, 
              family = binomial(link = "logit"))
summary(sup_glm)
## 
## Call:
## glm(formula = bmi_NA ~ group + gender + age + onset_sensory, 
##     family = binomial(link = "logit"), data = sup_nab)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -2.45041    2.26927  -1.080    0.280
## group2        -1.46042    1.38616  -1.054    0.292
## gender1        0.99119    1.28605   0.771    0.441
## age           -0.04954    0.04609  -1.075    0.282
## onset_sensory  0.06444    0.04116   1.566    0.117
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.128  on 102  degrees of freedom
## Residual deviance: 22.854  on  98  degrees of freedom
## AIC: 32.854
## 
## Number of Fisher Scoring iterations: 7

In this case none of the other variables are significant, so it would be reasonable to proceed as though the missingness of BMI is MCAR (unless there is expert knowledge to suggest that missingness might be linked to some other observed factor). This agrees with what we found in Section A.3.1.
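
As an aside (not needed for the MCAR conclusion above): the fitted coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret:

exp(coef(sup_glm))   # multiplicative change in the odds of missingness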

A word of caution: the default in most R functions, particularly for plotting or fitting models, is to remove all rows with any missingness. In a situation where there are missing values for multiple variables, particularly if the missingness is related, this could in itself introduce bias.
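
A quick way to check whether (and how many) rows were silently dropped is to compare the number of observations used in the fit with the size of the data frame; for sup_glm nothing is lost, because all of the covariates in that model are fully observed:

nobs(sup_glm)    # rows actually used in the fit
nrow(sup_nab)    # rows in the data frame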

Exercise A.8 Model the patterns of missingness for smdi_data and opt (use the opt_tmp data for this, without the severely missing variables). What do you find?

To avoid typing out all the column names for the formula, you can copy and paste the output from

paste(names(smdi_data), collapse = " + ")

and delete the terms you don’t want.

Click for solution

Solution. Lung cancer data

We can fit a few models, to see what we find. First we’ll tackle ecog_cat_NA.

Missingness in ecog_cat doesn’t appear to be associated with any other baseline covariates:

ecog_NA_glm1 = glm(ecog_cat_NA ~ exposure + age_num + female_cat + smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + copd_cat, 
                data=smdi_nab, family = binomial(link="logit"))
summary(ecog_NA_glm1)
## 
## Call:
## glm(formula = ecog_cat_NA ~ exposure + age_num + female_cat + 
##     smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + 
##     copd_cat, family = binomial(link = "logit"), data = smdi_nab)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -0.546409   0.226805  -2.409    0.016 *
## exposure        -0.089400   0.088616  -1.009    0.313  
## age_num         -0.002654   0.003034  -0.875    0.382  
## female_cat1     -0.034202   0.086890  -0.394    0.694  
## smoking_cat1     0.101202   0.097823   1.035    0.301  
## physical_cat1    0.137491   0.087074   1.579    0.114  
## alk_cat1        -0.049371   0.257094  -0.192    0.848  
## histology_cat1  -0.002256   0.104115  -0.022    0.983  
## ses_cat2_middle  0.128238   0.113992   1.125    0.261  
## ses_cat3_high    0.098773   0.113447   0.871    0.384  
## copd_cat1       -0.017240   0.098015  -0.176    0.860  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3265.9  on 2499  degrees of freedom
## Residual deviance: 3259.5  on 2489  degrees of freedom
## AIC: 3281.5
## 
## Number of Fisher Scoring iterations: 4

Now including the other variables with missing values:

ecog_NA_glm2 = glm(ecog_cat_NA ~ exposure + age_num + female_cat + smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + copd_cat + egfr_cat + pdl1_num, 
                data=smdi_nab, family = binomial(link="logit"))
summary(ecog_NA_glm2)
## 
## Call:
## glm(formula = ecog_cat_NA ~ exposure + age_num + female_cat + 
##     smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + 
##     copd_cat + egfr_cat + pdl1_num, family = binomial(link = "logit"), 
##     data = smdi_nab)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -0.063844   0.418284  -0.153    0.879
## exposure        -0.094567   0.140956  -0.671    0.502
## age_num         -0.005729   0.004484  -1.278    0.201
## female_cat1     -0.215749   0.133629  -1.615    0.106
## smoking_cat1     0.036868   0.139759   0.264    0.792
## physical_cat1    0.149057   0.130846   1.139    0.255
## alk_cat1         1.050366   0.741154   1.417    0.156
## histology_cat1  -0.092049   0.173989  -0.529    0.597
## ses_cat2_middle  0.215523   0.166471   1.295    0.195
## ses_cat3_high    0.179477   0.166802   1.076    0.282
## copd_cat1        0.106252   0.137255   0.774    0.439
## egfr_cat1        0.082752   0.148229   0.558    0.577
## pdl1_num        -0.008005   0.006214  -1.288    0.198
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1620.9  on 1240  degrees of freedom
## Residual deviance: 1606.6  on 1228  degrees of freedom
##   (1259 observations deleted due to missingness)
## AIC: 1632.6
## 
## Number of Fisher Scoring iterations: 4

None of the variables are significant, so it is probably reasonable to suppose that ecog_cat is MCAR (indeed the help file tells us it is).

Now for egfr_cat:

egfr_NA_glm1 = glm(egfr_cat_NA ~ exposure + age_num + female_cat + smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + copd_cat, 
                data=smdi_nab, family = binomial(link="logit"))
summary(egfr_NA_glm1)
## 
## Call:
## glm(formula = egfr_cat_NA ~ exposure + age_num + female_cat + 
##     smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + 
##     copd_cat, family = binomial(link = "logit"), data = smdi_nab)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.776612   0.253687 -10.945  < 2e-16 ***
## exposure         0.855880   0.093967   9.108  < 2e-16 ***
## age_num          0.014198   0.003271   4.340 1.42e-05 ***
## female_cat1      0.781194   0.092458   8.449  < 2e-16 ***
## smoking_cat1     0.626782   0.102280   6.128 8.89e-10 ***
## physical_cat1    0.513986   0.092421   5.561 2.68e-08 ***
## alk_cat1         1.116563   0.271932   4.106 4.03e-05 ***
## histology_cat1   0.807125   0.108896   7.412 1.25e-13 ***
## ses_cat2_middle -0.253823   0.119805  -2.119   0.0341 *  
## ses_cat3_high   -0.165891   0.118853  -1.396   0.1628    
## copd_cat1        0.579084   0.102993   5.623 1.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3376.8  on 2499  degrees of freedom
## Residual deviance: 2949.9  on 2489  degrees of freedom
## AIC: 2971.9
## 
## Number of Fisher Scoring iterations: 3
egfr_NA_glm2 = glm(egfr_cat_NA ~ exposure + age_num + female_cat + smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + copd_cat + ecog_cat + pdl1_num, 
                data=smdi_nab, family = binomial(link="logit"))
summary(egfr_NA_glm2)
## 
## Call:
## glm(formula = egfr_cat_NA ~ exposure + age_num + female_cat + 
##     smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + 
##     copd_cat + ecog_cat + pdl1_num, family = binomial(link = "logit"), 
##     data = smdi_nab)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -7.111213   0.592182 -12.008  < 2e-16 ***
## exposure         1.003371   0.164023   6.117 9.52e-10 ***
## age_num          0.023939   0.005328   4.493 7.02e-06 ***
## female_cat1      0.886902   0.149387   5.937 2.90e-09 ***
## smoking_cat1     1.048884   0.165025   6.356 2.07e-10 ***
## physical_cat1    1.089550   0.150277   7.250 4.16e-13 ***
## alk_cat1         3.727345   0.719142   5.183 2.18e-07 ***
## histology_cat1   1.294128   0.173948   7.440 1.01e-13 ***
## ses_cat2_middle -0.072720   0.189281  -0.384    0.701    
## ses_cat3_high   -0.123102   0.189587  -0.649    0.516    
## copd_cat1        0.994580   0.164698   6.039 1.55e-09 ***
## ecog_cat1        0.939474   0.152432   6.163 7.13e-10 ***
## pdl1_num         0.040946   0.007493   5.465 4.63e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1686.9  on 1273  degrees of freedom
## Residual deviance: 1209.5  on 1261  degrees of freedom
##   (1226 observations deleted due to missingness)
## AIC: 1235.5
## 
## Number of Fisher Scoring iterations: 5

This time there is a definite relationship between missingness and the values of the other covariates, suggesting an MAR or MNAR mechanism.

Finally, let’s model missingness in pdl1_num:

pdl1_NA_glm1 = glm(
  pdl1_num_NA ~ exposure + age_num + female_cat + smoking_cat + physical_cat + 
    alk_cat + histology_cat + ses_cat + copd_cat, 
  data=smdi_nab, family = binomial(link="logit"))
summary(pdl1_NA_glm1)
## 
## Call:
## glm(formula = pdl1_num_NA ~ exposure + age_num + female_cat + 
##     smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + 
##     copd_cat, family = binomial(link = "logit"), data = smdi_nab)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -1.337969   0.271141  -4.935 8.03e-07 ***
## exposure        -0.707361   0.112571  -6.284 3.31e-10 ***
## age_num          0.002670   0.003621   0.737 0.460901    
## female_cat1      0.115354   0.102830   1.122 0.261947    
## smoking_cat1     0.070899   0.117623   0.603 0.546663    
## physical_cat1   -0.057181   0.105602  -0.541 0.588177    
## alk_cat1         0.880597   0.260178   3.385 0.000713 ***
## histology_cat1  -0.106808   0.127451  -0.838 0.402014    
## ses_cat2_middle -0.072022   0.135175  -0.533 0.594168    
## ses_cat3_high   -0.034611   0.133995  -0.258 0.796175    
## copd_cat1        0.094061   0.118031   0.797 0.425500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2548.4  on 2499  degrees of freedom
## Residual deviance: 2488.3  on 2489  degrees of freedom
## AIC: 2510.3
## 
## Number of Fisher Scoring iterations: 4
pdl1_NA_glm2 = glm(pdl1_num_NA ~ exposure + age_num + female_cat + smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + copd_cat + ecog_cat + egfr_cat, 
                data=smdi_nab, family = binomial(link="logit"))
summary(pdl1_NA_glm2)
## 
## Call:
## glm(formula = pdl1_num_NA ~ exposure + age_num + female_cat + 
##     smoking_cat + physical_cat + alk_cat + histology_cat + ses_cat + 
##     copd_cat + ecog_cat + egfr_cat, family = binomial(link = "logit"), 
##     data = smdi_nab)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.834316   0.566570  -6.768 1.31e-11 ***
## exposure        -0.431716   0.221450  -1.950   0.0512 .  
## age_num          0.015682   0.007177   2.185   0.0289 *  
## female_cat1      0.358442   0.197460   1.815   0.0695 .  
## smoking_cat1     0.300876   0.220606   1.364   0.1726    
## physical_cat1    0.500724   0.201918   2.480   0.0131 *  
## alk_cat1         3.890278   0.673079   5.780 7.48e-09 ***
## histology_cat1   0.479733   0.245695   1.953   0.0509 .  
## ses_cat2_middle -0.102158   0.261232  -0.391   0.6958    
## ses_cat3_high    0.125765   0.253598   0.496   0.6199    
## copd_cat1        0.882138   0.225914   3.905 9.43e-05 ***
## ecog_cat1        0.307018   0.195834   1.568   0.1169    
## egfr_cat1        0.301910   0.227109   1.329   0.1837    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 830.66  on 945  degrees of freedom
## Residual deviance: 737.55  on 933  degrees of freedom
##   (1554 observations deleted due to missingness)
## AIC: 763.55
## 
## Number of Fisher Scoring iterations: 5

This time there are far fewer significantly related variables, but we can again be confident that the mechanism isn’t MCAR.

Frustratingly, the only way we could determine whether the mechanisms for egfr_cat and pdl1_num were MAR or MNAR would be to measure (or otherwise procure) some of the missing data. We simply do not have the necessary information to work out which is the case. If this were a real trial, we would now talk at length with the experts/clinicians, who will have a much better understanding of the probable causes of missingness.

Obstetrics and Periodontal Therapy data

Now we’ll do the same with opt_tmp. We have missingness in several variables: BMI, Use.Tob, Use.Alc, Drug.Add, Birthweight. A model fit to the data in R will remove cases with any of these missing (which for this dataset might mean a lot are removed!). So, as well as building models with all variables, we can build models using only the fully observed data, and use the _NA variables from nabular as covariates:

glm_opt_BMI_allvar = glm(
  BMI_NA ~ Clinic + Group + Age +  
    Education + Public.Asstce + Hypertension + Diabetes + Use.Tob + Use.Alc + Drug.Add + Prev.preg, 
  data = nabular(opt_tmp),
  family = binomial(link = "logit")) 
summary(glm_opt_BMI_allvar)
## 
## Call:
## glm(formula = BMI_NA ~ Clinic + Group + Age + Education + Public.Asstce + 
##     Hypertension + Diabetes + Use.Tob + Use.Alc + Drug.Add + 
##     Prev.preg, family = binomial(link = "logit"), data = nabular(opt_tmp))
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.69872    0.86773  -4.263 2.02e-05 ***
## ClinicMN              0.38878    0.58170   0.668   0.5039    
## ClinicMS            -15.63977  755.63175  -0.021   0.9835    
## ClinicNY              3.02941    0.51105   5.928 3.07e-09 ***
## GroupT                0.15793    0.28816   0.548   0.5837    
## Age                  -0.00288    0.02650  -0.109   0.9135    
## EducationLT 8 yrs     0.67635    0.35763   1.891   0.0586 .  
## EducationMT 12 yrs    0.23558    0.37844   0.622   0.5336    
## Public.AsstceYes     -0.18766    0.30399  -0.617   0.5370    
## HypertensionY       -15.53379 1839.84596  -0.008   0.9933    
## DiabetesYes         -15.18021 1942.81416  -0.008   0.9938    
## Use.TobYes           -0.24898    0.59964  -0.415   0.6780    
## Use.AlcYes            0.60910    1.49054   0.409   0.6828    
## Drug.AddYes           0.09438    1.58954   0.059   0.9527    
## Prev.pregYes          0.02862    0.36074   0.079   0.9368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.11  on 794  degrees of freedom
## Residual deviance: 330.17  on 780  degrees of freedom
##   (28 observations deleted due to missingness)
## AIC: 360.17
## 
## Number of Fisher Scoring iterations: 18

Using only the fully observed covariates:

glm_opt_BMI_comp = glm(
  BMI_NA ~ Clinic + Group + Age +  
    Education + Public.Asstce + Hypertension + Diabetes + Prev.preg, 
  data = nabular(opt_tmp),
  family = binomial(link = "logit")) 
summary(glm_opt_BMI_comp)
## 
## Call:
## glm(formula = BMI_NA ~ Clinic + Group + Age + Education + Public.Asstce + 
##     Hypertension + Diabetes + Prev.preg, family = binomial(link = "logit"), 
##     data = nabular(opt_tmp))
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.522e+00  8.233e-01  -4.279 1.88e-05 ***
## ClinicMN            1.960e-01  5.469e-01   0.358   0.7200    
## ClinicMS           -1.582e+01  7.539e+02  -0.021   0.9833    
## ClinicNY            2.752e+00  4.724e-01   5.825 5.72e-09 ***
## GroupT              1.292e-01  2.777e-01   0.465   0.6416    
## Age                -4.618e-03  2.615e-02  -0.177   0.8599    
## EducationLT 8 yrs   6.861e-01  3.442e-01   1.993   0.0463 *  
## EducationMT 12 yrs  2.811e-01  3.674e-01   0.765   0.4442    
## Public.AsstceYes   -2.250e-01  2.980e-01  -0.755   0.4502    
## HypertensionY      -1.633e+01  1.764e+03  -0.009   0.9926    
## DiabetesYes        -1.522e+01  1.938e+03  -0.008   0.9937    
## Prev.pregYes        9.714e-02  3.527e-01   0.275   0.7830    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 493.01  on 822  degrees of freedom
## Residual deviance: 348.69  on 811  degrees of freedom
## AIC: 372.69
## 
## Number of Fisher Scoring iterations: 18

Using the missingness indicators (the _NA variables) of the incomplete variables as covariates:

glm_opt_BMI_NA = glm(
  BMI_NA ~ Clinic + Group + Age +  
    Education + Public.Asstce + Hypertension + Diabetes + Use.Tob_NA + Use.Alc_NA + Drug.Add_NA + Prev.preg, 
  data = nabular(opt_tmp),
  family = binomial(link = "logit")) 
summary(glm_opt_BMI_NA)
## 
## Call:
## glm(formula = BMI_NA ~ Clinic + Group + Age + Education + Public.Asstce + 
##     Hypertension + Diabetes + Use.Tob_NA + Use.Alc_NA + Drug.Add_NA + 
##     Prev.preg, family = binomial(link = "logit"), data = nabular(opt_tmp))
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.493e+00  8.286e-01  -4.215 2.49e-05 ***
## ClinicMN            2.307e-01  5.472e-01   0.422   0.6733    
## ClinicMS           -1.582e+01  7.547e+02  -0.021   0.9833    
## ClinicNY            2.825e+00  4.740e-01   5.960 2.52e-09 ***
## GroupT              1.310e-01  2.800e-01   0.468   0.6398    
## Age                -4.480e-03  2.613e-02  -0.171   0.8639    
## EducationLT 8 yrs   6.498e-01  3.462e-01   1.877   0.0605 .  
## EducationMT 12 yrs  2.335e-01  3.705e-01   0.630   0.5286    
## Public.AsstceYes   -2.047e-01  3.006e-01  -0.681   0.4959    
## HypertensionY      -1.563e+01  1.824e+03  -0.009   0.9932    
## DiabetesYes        -1.527e+01  1.945e+03  -0.008   0.9937    
## Use.Tob_NANA        1.737e+01  1.532e+04   0.001   0.9991    
## Use.Alc_NANA       -3.129e+00  1.091e+04   0.000   0.9998    
## Drug.Add_NANA      -1.604e+01  1.075e+04  -0.001   0.9988    
## Prev.pregYes        7.120e-02  3.553e-01   0.200   0.8412    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 493.01  on 822  degrees of freedom
## Residual deviance: 343.93  on 808  degrees of freedom
## AIC: 373.93
## 
## Number of Fisher Scoring iterations: 18

From these three models it appears that BMI is much more likely to be missing for those from the NY clinic (we already knew this from our previous investigations!):

ggplot(data=nabular(opt_tmp), aes(x=Clinic, fill = BMI_NA)) +
  geom_bar()

Perhaps the most concerning missing data in the opt dataset is in the outcome Birthweight.

glm_opt_BW_allvar = glm(
  Birthweight_NA ~ Clinic + Group + Age +  
    Education + Public.Asstce + Hypertension + Diabetes + BMI + Use.Tob + Use.Alc + Drug.Add + Prev.preg, 
  data = nabular(opt_tmp),
  family = binomial(link = "logit")) 
summary(glm_opt_BW_allvar)
## 
## Call:
## glm(formula = Birthweight_NA ~ Clinic + Group + Age + Education + 
##     Public.Asstce + Hypertension + Diabetes + BMI + Use.Tob + 
##     Use.Alc + Drug.Add + Prev.preg, family = binomial(link = "logit"), 
##     data = nabular(opt_tmp))
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)            33.729  16752.867   0.002    0.998
## ClinicMN             -133.713  20792.481  -0.006    0.995
## ClinicMS             -103.928  39728.644  -0.003    0.998
## ClinicNY              137.894  40148.666   0.003    0.997
## GroupT                -85.621   6320.331  -0.014    0.989
## Age                   -14.213    877.730  -0.016    0.987
## EducationLT 8 yrs      38.459  39853.578   0.001    0.999
## EducationMT 12 yrs    234.150  38187.193   0.006    0.995
## Public.AsstceYes     -196.390  12389.447  -0.016    0.987
## HypertensionY          37.256  48814.207   0.001    0.999
## DiabetesYes           424.730  25942.553   0.016    0.987
## BMI                     2.829    615.616   0.005    0.996
## Use.TobYes            -41.710  44071.964  -0.001    0.999
## Use.AlcYes             26.362  75102.687   0.000    1.000
## Drug.AddYes           369.274 101563.415   0.004    0.997
## Prev.pregYes         -152.840   9926.365  -0.015    0.988
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3.8896e+01  on 722  degrees of freedom
## Residual deviance: 9.5715e-07  on 707  degrees of freedom
##   (100 observations deleted due to missingness)
## AIC: 32
## 
## Number of Fisher Scoring iterations: 25
glm_opt_BW_comp = glm(
  Birthweight_NA ~ Clinic + Group + Age +  
    Education + Public.Asstce + Hypertension + Diabetes + Prev.preg, 
  data = nabular(opt_tmp),
  family = binomial(link = "logit")) 
summary(glm_opt_BW_comp)
## 
## Call:
## glm(formula = Birthweight_NA ~ Clinic + Group + Age + Education + 
##     Public.Asstce + Hypertension + Diabetes + Prev.preg, family = binomial(link = "logit"), 
##     data = nabular(opt_tmp))
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          -2.39389    1.55326  -1.541   0.1233  
## ClinicMN            -17.39271 1787.56149  -0.010   0.9922  
## ClinicMS             -1.66204    1.20073  -1.384   0.1663  
## ClinicNY              1.24364    0.72519   1.715   0.0864 .
## GroupT               -0.11738    0.56324  -0.208   0.8349  
## Age                  -0.04719    0.05873  -0.804   0.4217  
## EducationLT 8 yrs    -0.87213    1.07459  -0.812   0.4170  
## EducationMT 12 yrs    0.35302    0.66105   0.534   0.5933  
## Public.AsstceYes     -0.18665    0.66256  -0.282   0.7782  
## HypertensionY       -16.92650 5484.29344  -0.003   0.9975  
## DiabetesYes           2.48332    1.25710   1.975   0.0482 *
## Prev.pregYes         -0.49570    0.62136  -0.798   0.4250  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141.83  on 822  degrees of freedom
## Residual deviance: 116.01  on 811  degrees of freedom
## AIC: 140.01
## 
## Number of Fisher Scoring iterations: 20
glm_opt_BW_NA = glm(
  Birthweight_NA ~ Clinic + Group + Age +  
    Education + Public.Asstce + Hypertension + Diabetes + BMI_NA + Use.Tob_NA + Use.Alc_NA + Drug.Add_NA + Prev.preg, 
  data = nabular(opt_tmp),
  family = binomial(link = "logit")) 
summary(glm_opt_BW_NA)
## 
## Call:
## glm(formula = Birthweight_NA ~ Clinic + Group + Age + Education + 
##     Public.Asstce + Hypertension + Diabetes + BMI_NA + Use.Tob_NA + 
##     Use.Alc_NA + Drug.Add_NA + Prev.preg, family = binomial(link = "logit"), 
##     data = nabular(opt_tmp))
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -2.018e+00  2.652e+00  -0.761   0.4465  
## ClinicMN           -2.355e+01  2.073e+03  -0.011   0.9909  
## ClinicMS           -3.526e+00  2.002e+00  -1.761   0.0782 .
## ClinicNY           -2.028e-01  1.090e+00  -0.186   0.8524  
## GroupT              3.653e-01  9.552e-01   0.382   0.7021  
## Age                -2.149e-01  1.176e-01  -1.827   0.0677 .
## EducationLT 8 yrs   1.192e+00  1.840e+00   0.648   0.5170  
## EducationMT 12 yrs  4.508e+00  2.026e+00   2.225   0.0261 *
## Public.AsstceYes   -1.239e+00  1.108e+00  -1.118   0.2638  
## HypertensionY      -2.131e+01  7.405e+03  -0.003   0.9977  
## DiabetesYes         4.688e+00  1.901e+00   2.466   0.0137 *
## BMI_NANA           -2.385e-01  1.384e+00  -0.172   0.8632  
## Use.Tob_NANA       -1.047e+01  6.859e+04   0.000   0.9999  
## Use.Alc_NANA        8.377e+00  4.876e+04   0.000   0.9999  
## Drug.Add_NANA       1.078e+01  4.824e+04   0.000   0.9998  
## Prev.pregYes       -9.335e-02  9.792e-01  -0.095   0.9240  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141.830  on 822  degrees of freedom
## Residual deviance:  42.686  on 807  degrees of freedom
## AIC: 74.686
## 
## Number of Fisher Scoring iterations: 21

It appears from this that missing values of Birthweight aren’t associated with missingness in the other variables, but that Birthweight is more likely to be missing for patients with diabetes. (One caveat: in the first model, with all covariates included, the residual deviance is essentially zero and the standard errors are enormous. That model separates the data perfectly, so its individual coefficients should not be over-interpreted.)

ggplot(data = nabular(opt_tmp), aes(x=Diabetes, fill = Birthweight_NA)) +
  geom_bar()

A.4 What to do about missing data?!

Having established that missing data can be a problem, we now need to do something about it. Methods for handling missing data fall into one of two categories:

  1. Discard some (non-missing) data
  2. Add in (‘impute’) some synthetic data

We’ll look at a few versions of these methods now, and use them on our practice datasets.

A.4.1 Discarding some (non-missing) data

A.4.1.1 Complete case analysis

The very simplest thing we can do is to discard the data for any participant who has some missing data. This is called a complete-case analysis, because we only analyse data for participants whose data are complete.

There are two main problems with this:

  • If the missing data are not MCAR, then this can induce bias.
  • This approach can drastically reduce the amount of data

Exercise A.9 For each of our datasets, how many complete cases are there? Would you recommend a complete case analysis for any of these datasets? Hint: you can use na.omit to remove all rows with at least one NA from a data frame.

Click for solution

Solution. All we need to do is find the number of complete rows in each dataset, and compare it to the number of participants.

nrow(na.omit(sup_df))
## [1] 100
nrow(sup_df)
## [1] 103

If we perform a complete case analysis on the sup_df data, we lose data from three participants.

nrow(na.omit(smdi_data))
## [1] 795
nrow(smdi_data)
## [1] 2500

A complete case analysis of the smdi_data would leave us with only 795 (about 32%!) of the participants.

nrow(na.omit(opt_tmp))
## [1] 720
nrow(opt_tmp)
## [1] 823

If we use a complete case analysis on opt_tmp (remember we’re ignoring the variables with the most missingness for now!), we’d lose 103 cases (out of 823).
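As an aside, the naniar package has convenience functions for exactly these summaries, so we didn’t strictly need na.omit; for example, n_case_complete and pct_complete_case give the number and percentage of complete rows:

n_case_complete(smdi_data)
pct_complete_case(smdi_data)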

There are some other methods that involve discarding some data, but for the remainder of this practical we’ll focus on methods that involve imputing synthetic data.

A.4.2 Imputing data

In order to keep all the data we have, even for those cases with some missing variables, we will need to impute (add in) some new data. Let’s assume we have no way of actually measuring the true value now, and our only option is to choose some value that seems appropriate.

One of our main reasons for imputing data is to avoid the bias that would result from discarding incomplete cases, but we could inadvertently introduce bias if we aren’t careful (or if we are careful and unlucky) while imputing synthetic data.

We’ll look at a few methods, ranging from the very simple to the moderately complex.

A.4.2.1 Mean imputation

In this method we simply replace each missing value by the mean of the observed values for that variable. This method is not uncommon in practice, but it can have a number of undesirable effects:

  • If the data are not MCAR, bias is introduced
  • The sample standard deviation is reduced
  • Relationships between this and other variables are distorted
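Since we won’t be using mean imputation seriously, here is just a minimal sketch of it, applied to the BMI column of opt_tmp (opt_mean_imp is a throwaway copy for illustration). Comparing standard deviations before and after shows the second effect directly:

opt_mean_imp = opt_tmp
opt_mean_imp$BMI[is.na(opt_mean_imp$BMI)] = mean(opt_tmp$BMI, na.rm = TRUE)

# The imputed column is less variable than the observed values alone
sd(opt_tmp$BMI, na.rm = TRUE)
sd(opt_mean_imp$BMI)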

A.4.2.2 Imputing using logic

Sometimes there is missingness in a dataset that we can fill in using information about how that variable relates to other variables. For example, in the opt data, consider the two columns Use.Tob and BL.Cig.Day:

Use.Tob: Self-reported participant history of tobacco use, factor; Yes, No; Blank = Missing

BL.Cig.Day: Self-reported number of cigarettes per day for those with tobacco use history, numeric, range: 1-30; Blank = Missing (variable 16 = Yes or blank) or non-smoker (variable 16 = No)

The first, Use.Tob, is a binary variable indicating whether or not the participant uses tobacco. The second, BL.Cig.Day, is a numerical variable indicating how many cigarettes per day the participant smokes. As the data stand, there are a lot of missing values for BL.Cig.Day. However, if for a particular participant we have Use.Tob = No, then we know that BL.Cig.Day is zero, and we can impute that value.

Exercise A.10 Using the information in the help file, impute values for the columns BL.Cig.Day, BL.Drks.Day and N.prev.preg in the opt_df data. Make a new version of opt_df so that we can compare before and after.

Note the annoying space in some of the responses - often we have "No "!

Click for solution

Solution. One important thing to remember is that if the associated categorical variable is missing, then the value for the variable we’re imputing will also be missing. We therefore need to condition on the ‘parent’ variable, rather than replace all missing values of the ‘child’ variable. For example:

opt_df_imp = opt_df
opt_df_imp$BL.Cig.Day[opt_df_imp$Use.Tob=="No "] = 0

By doing this we have ‘fixed’ 704 missing values.
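If you want to verify that count, naniar’s n_miss does the job:

# How many missing values did we just fix?
n_miss(opt_df$BL.Cig.Day) - n_miss(opt_df_imp$BL.Cig.Day)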

Similarly, we can fix those for BL.Drks.Day and N.prev.preg:

opt_df_imp$BL.Drks.Day[opt_df_imp$Use.Alc=="No "] = 0
opt_df_imp$N.prev.preg[opt_df_imp$Prev.preg=="No "] = 0

We can see that this has very much improved our situation! We could do something similar for BL.Diab.Type too, since it is linked to Diabetes in the same way.

vis_dat(opt_df, sort_type = F)

vis_dat(opt_df_imp, sort_type = F)

A.4.2.3 Imputation using a regression model

In this section we’re going to use the Stan regression functions from the package rstanarm. The syntax is very similar to that of the base R lm and glm functions, but random sampling from the fitted model is much simpler. The rstanarm functions use MCMC to generate samples from the posterior distribution of the regression model’s parameters. There is a lot of output about chains (which you can ignore in this practical), and there are no p-values associated with the coefficients; instead, we work with summaries of the posterior distributions.

We’ve already looked at using logistic regression to understand patterns of missingness, and so it may not come as a surprise that we can use regression models to choose appropriate values for imputation. This won’t be the same model, since we’re now interested in the value, rather than the missingness. Which type of regression model we use depends on the type of the variable we’re imputing values for. If the variable is continuous, linear regression is likely to work well. If the variable is binary, we should try logistic regression. There are plenty of other types of model we could use (as well as a whole host of machine learning type models!) but in this practical we’ll stick to those two.

Let’s suppose we want to use regression to impute values for pdl1_num:

pdl1_lm = stan_glm(
  pdl1_num ~ exposure + age_num + female_cat + smoking_cat + physical_cat + alk_cat + 
    histology_cat + ses_cat + copd_cat + eventtime + status + ecog_cat + egfr_cat, 
  data = smdi_data
  )
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.001392 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 13.92 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.066 seconds (Warm-up)
## Chain 1:                0.095 seconds (Sampling)
## Chain 1:                0.161 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 6e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.049 seconds (Warm-up)
## Chain 2:                0.094 seconds (Sampling)
## Chain 2:                0.143 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 4e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.04 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.055 seconds (Warm-up)
## Chain 3:                0.092 seconds (Sampling)
## Chain 3:                0.147 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 4e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.04 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.054 seconds (Warm-up)
## Chain 4:                0.094 seconds (Sampling)
## Chain 4:                0.148 seconds (Total)
## Chain 4:
summary(pdl1_lm)
## 
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      pdl1_num ~ exposure + age_num + female_cat + smoking_cat + physical_cat + 
##     alk_cat + histology_cat + ses_cat + copd_cat + eventtime + 
##     status + ecog_cat + egfr_cat
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 795
##  predictors:   15
## 
## Estimates:
##                   mean   sd    10%   50%   90%
## (Intercept)      36.9    2.7  33.4  36.8  40.5
## exposure          8.8    0.7   7.9   8.8   9.7
## age_num           0.1    0.0   0.0   0.1   0.1
## female_cat1      -0.9    0.7  -1.8  -0.9   0.0
## smoking_cat1     -0.6    0.8  -1.6  -0.6   0.4
## physical_cat1    -0.8    0.8  -1.8  -0.8   0.2
## alk_cat1         -4.7    5.5 -11.9  -4.7   2.4
## histology_cat1    0.4    1.0  -0.8   0.4   1.7
## ses_cat2_middle   0.7    0.9  -0.5   0.6   1.8
## ses_cat3_high     1.0    0.9  -0.1   1.0   2.2
## copd_cat1         0.6    0.8  -0.4   0.6   1.6
## eventtime         0.5    0.3   0.1   0.5   1.0
## status1          -1.7    1.4  -3.5  -1.7   0.2
## ecog_cat1         0.4    0.7  -0.5   0.4   1.3
## egfr_cat1        -1.5    0.9  -2.7  -1.4  -0.3
## sigma             9.4    0.2   9.1   9.4   9.7
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 44.4    0.5 43.8  44.4  45.0 
## 
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##                 mcse Rhat n_eff
## (Intercept)     0.0  1.0  4046 
## exposure        0.0  1.0  5843 
## age_num         0.0  1.0  5425 
## female_cat1     0.0  1.0  6996 
## smoking_cat1    0.0  1.0  5452 
## physical_cat1   0.0  1.0  6963 
## alk_cat1        0.1  1.0  6964 
## histology_cat1  0.0  1.0  8206 
## ses_cat2_middle 0.0  1.0  4445 
## ses_cat3_high   0.0  1.0  4443 
## copd_cat1       0.0  1.0  5772 
## eventtime       0.0  1.0  3760 
## status1         0.0  1.0  3807 
## ecog_cat1       0.0  1.0  7711 
## egfr_cat1       0.0  1.0  6011 
## sigma           0.0  1.0  6827 
## mean_PPD        0.0  1.0  4953 
## log-posterior   0.1  1.0  1617 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Exercise A.11 This approach is going to fail! Can you tell why? Can you suggest what to do about it?

Click for solution

Solution. The model involves predictor variables that themselves have missingness (ecog_cat and egfr_cat), and so for any case where either of those is missing, we won’t be able to impute a value.

There are a couple of possibilities:

  • Use the missingness of those values as an input (by creating a nabular object)
  • Remove those variables from the model
  • Work iteratively, generating temporary imputed values and cycling round the variables with missingness

Because all our investigations suggest that ecog_cat is more-or-less unrelated to anything, we will remove it from the model. However, egfr_cat does look related to pdl1_num, so rather than dropping it we will keep it in the model in the form of its missingness indicator, egfr_cat_NA.

smdi_nab = nabular(smdi_data)

pdl1_lm2 = stan_glm(
  pdl1_num ~ exposure + age_num + female_cat + smoking_cat + physical_cat + alk_cat + 
    histology_cat + ses_cat + copd_cat + eventtime + status + egfr_cat_NA, 
  data = smdi_nab
  )
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 1.3e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.046 seconds (Warm-up)
## Chain 1:                0.157 seconds (Sampling)
## Chain 1:                0.203 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 6e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.047 seconds (Warm-up)
## Chain 2:                0.159 seconds (Sampling)
## Chain 2:                0.206 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 6e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.051 seconds (Warm-up)
## Chain 3:                0.159 seconds (Sampling)
## Chain 3:                0.21 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 1.9e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.061 seconds (Warm-up)
## Chain 4:                0.156 seconds (Sampling)
## Chain 4:                0.217 seconds (Total)
## Chain 4:
summary(pdl1_lm2)
## 
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      pdl1_num ~ exposure + age_num + female_cat + smoking_cat + physical_cat + 
##     alk_cat + histology_cat + ses_cat + copd_cat + eventtime + 
##     status + egfr_cat_NA
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 1983
##  predictors:   14
## 
## Estimates:
##                   mean   sd   10%   50%   90%
## (Intercept)     39.7    1.7 37.6  39.7  41.9 
## exposure         8.0    0.5  7.4   8.0   8.7 
## age_num          0.0    0.0  0.0   0.0   0.1 
## female_cat1     -0.4    0.5 -1.0  -0.4   0.2 
## smoking_cat1    -0.5    0.5 -1.2  -0.5   0.2 
## physical_cat1   -1.5    0.5 -2.2  -1.5  -0.9 
## alk_cat1        -1.8    1.5 -3.7  -1.8   0.2 
## histology_cat1  -0.4    0.6 -1.1  -0.4   0.3 
## ses_cat2_middle  0.1    0.6 -0.7   0.1   0.8 
## ses_cat3_high    1.2    0.6  0.4   1.2   1.9 
## copd_cat1       -0.9    0.5 -1.6  -0.9  -0.2 
## eventtime        0.6    0.2  0.3   0.6   0.8 
## status1         -1.6    0.9 -2.7  -1.6  -0.5 
## egfr_cat_NANA    4.1    0.5  3.4   4.1   4.8 
## sigma            9.8    0.2  9.6   9.8  10.0 
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 46.0    0.3 45.6  46.0  46.4 
## 
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##                 mcse Rhat n_eff
## (Intercept)     0.0  1.0  4788 
## exposure        0.0  1.0  5888 
## age_num         0.0  1.0  5881 
## female_cat1     0.0  1.0  6882 
## smoking_cat1    0.0  1.0  5807 
## physical_cat1   0.0  1.0  6488 
## alk_cat1        0.0  1.0  7446 
## histology_cat1  0.0  1.0  6248 
## ses_cat2_middle 0.0  1.0  4656 
## ses_cat3_high   0.0  1.0  4302 
## copd_cat1       0.0  1.0  6487 
## eventtime       0.0  1.0  4531 
## status1         0.0  1.0  4847 
## egfr_cat_NANA   0.0  1.0  5651 
## sigma           0.0  1.0  6860 
## mean_PPD        0.0  1.0  4764 
## log-posterior   0.1  1.0  1358 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

We can see from the means and standard deviations that most of the coefficients are not even close to ‘significant’, but if we want to investigate some more closely we can visualise them using the plot function (separately in this case, because the two estimates are on very different scales).

plot(pdl1_lm2, plotfun="areas", prob = 0.95, pars = c("exposure"))

plot(pdl1_lm2, plotfun="areas", prob = 0.95, pars = c("age_num"))

We see that for both variables, zero is not in the central 95% of the distribution. This isn’t directly relevant here but it’s nice to know about!
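If you’d rather have numbers than pictures, rstanarm also provides posterior_interval, which returns central posterior intervals for chosen parameters:

# Central 95% posterior intervals for the two coefficients of interest
posterior_interval(pdl1_lm2, prob = 0.95, pars = c("exposure", "age_num"))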

We will now use this second model to impute values for pdl1_num.

## Split the data according to whether pdl1_num is missing

smdi_pdl1_comp = smdi_nab[!is.na(smdi_nab$pdl1_num),]
smdi_pdl1_miss = smdi_nab[is.na(smdi_nab$pdl1_num),]

## Use the regression model to predict the missing values of pdl1_num

pdl1_imp_lm = predict(object = pdl1_lm2, 
                      newdata = smdi_pdl1_miss)
                
smdi_pdl1_miss$pdl1_num = pdl1_imp_lm

## Join the data back together again (in a different order, but it doesn't matter)

smdi_imp = rbind(smdi_pdl1_miss, smdi_pdl1_comp)
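As a quick sanity check, the rejoined data should have no missing values left in pdl1_num (n_miss is from naniar):

n_miss(smdi_imp$pdl1_num)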

This might seem sensible, but actually it isn’t a very realistic use of a regression model.

Compared to the observed values, the imputed values have a strong central tendency. This is because we have imputed the mean fitted value for each point, ignoring the measure of spread / uncertainty in the model. We can see this by comparing the observed values of pdl1_num to their fitted values (which we would have imputed had they been missing).

smdi_pdl1_comp$pdl1_imp = predict(pdl1_lm2, newdata = smdi_pdl1_comp)
# The next line just recreates the imputation so that we can bind the datasets together
smdi_pdl1_miss$pdl1_imp = predict(pdl1_lm2, newdata = smdi_pdl1_miss)

smdi_pdl1_imp = rbind(smdi_pdl1_comp, smdi_pdl1_miss)

ggplot(data = smdi_pdl1_imp, aes(x=pdl1_imp, y=pdl1_num, col = pdl1_num_NA)) + 
  geom_point() +
  xlab("Regression fit") + 
  ylab("Observed value / imputed value") +
  theme_bw()

A more realistic approach would be to sample one value from the posterior predictive distribution for each point (this is why we are using rstanarm!).

smdi_pdl1_comp$pdl1_imp_rand = smdi_pdl1_comp$pdl1_num

## This time draw one value at random from the posterior predictive distribution for each participant
## The output is a draws x observations matrix (here 1 x n), which we'll convert to a vector
pdl1_rand_draw = posterior_predict(pdl1_lm2, newdata = smdi_pdl1_miss, draws=1)
smdi_pdl1_miss$pdl1_imp_rand = as.numeric(pdl1_rand_draw)

## Now we can bind the two dataframes together and plot the randomly imputed / observed
## data against the deterministically fitted data

smdi_pdl1_imp = rbind(smdi_pdl1_comp, smdi_pdl1_miss)

ggplot(data = smdi_pdl1_imp, aes(x=pdl1_imp, y=pdl1_imp_rand, col = pdl1_num_NA)) + 
  geom_point() +
  xlab("Regression fit") + 
  ylab("Observed value / randomly imputed value") +
  theme_bw()

These randomly imputed values look much more representative of the actual dataset.
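We can also quantify this. Among the cases we imputed, the single posterior draws should be roughly as spread out as the observed values, whereas the deterministic fits are much less variable:

sd(smdi_pdl1_miss$pdl1_imp)       # deterministic fits (imputed cases only)
sd(smdi_pdl1_miss$pdl1_imp_rand)  # single posterior draws (imputed cases only)
sd(smdi_pdl1_comp$pdl1_num)       # observed values, for comparison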

Exercise A.12 Use regression imputation to impute values for BMI in opt_tmp.
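No solution is given for this one, but as a possible starting point to adapt, here is a sketch along the lines of the pdl1_num example: fit a linear model for BMI using the fully observed covariates, then draw one posterior predictive value per missing case.

opt_nab = nabular(opt_tmp)

# Linear model for BMI using only fully observed covariates
bmi_lm = stan_glm(
  BMI ~ Clinic + Group + Age + Education + Public.Asstce +
    Hypertension + Diabetes + Prev.preg,
  data = opt_nab)

# One posterior predictive draw for each case with missing BMI
opt_bmi_miss = opt_nab[is.na(opt_nab$BMI), ]
opt_nab$BMI[is.na(opt_nab$BMI)] = as.numeric(
  posterior_predict(bmi_lm, newdata = opt_bmi_miss, draws = 1))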

A.4.3 Multiple imputation

Having reached the point of acknowledging the randomness needed in imputation, a natural next step would be to draw multiple values from the posterior distribution, rather than just one. This leads to a method called multiple imputation, which is (arguably) the most widely used approach to imputing missing data, though sadly we don’t have time to go into it in this practical.
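We can at least glimpse the idea with the tools we already have: posterior_predict will happily return several draws per missing case, and each row of the result is one complete set of imputed values. In full multiple imputation we would complete the dataset once per row, analyse each completed dataset, and pool the results; see Little and Rubin (2019) for details.

# Five alternative imputations for each missing pdl1_num value
pdl1_mi = posterior_predict(pdl1_lm2, newdata = smdi_pdl1_miss, draws = 5)
dim(pdl1_mi)  # 5 rows, one column per missing case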

References

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2021. Regression and Other Stories. Cambridge University Press.
Hotelling, Harold. 1931. “The Generalization of Student’s Ratio.” The Annals of Mathematical Statistics 2 (3): 360–78.
Little, Roderick JA, and Donald B Rubin. 2019. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92.