Diagnostics for multivariate imputations: getting inside the black box

Random imputation is a flexible and useful way to handle missing data (see chapter 25 for a quick overview), but it’s typically taken as a black box. This is partly a result of confusion over statistical theory. Structural assumptions such as “missingness at random” cannot be checked from data (this is a fundamental difficulty), but this does not mean that imputations cannot be checked. In our recent paper, Kobi Abayomi, Mark Levy, and I do the following:

We consider three sorts of diagnostics for random imputations: displays of the completed data, which are intended to reveal unusual patterns that might suggest problems with the imputations, comparisons of the distributions of observed and imputed data values, and checks of the fit of observed data to the model that is used to create the imputations. We formulate these methods in terms of sequential regression multivariate imputation, which is an iterative procedure in which the missing values of each variable are randomly imputed conditionally on all the other variables in the completed data matrix. We also consider a recalibration procedure for sequential regression imputations. We apply these methods to the 2002 environmental sustainability index, which is a linear aggregation of 64 environmental variables on 142 countries.
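To make the procedure in the abstract concrete, here is a minimal sketch in Python (using NumPy) of sequential regression imputation followed by the simplest of the three diagnostics, a comparison of observed and imputed values for each variable. This is not the code from the paper. The function names, the choice of ordinary linear regression with normal errors for every variable, and the starting values (observed column means) are all assumptions made for illustration, and the sketch ignores uncertainty in the regression coefficients.

```python
import numpy as np


def sequential_regression_impute(X, n_iter=10, seed=0):
    """Illustrative sequential regression imputation (hypothetical helper).

    X is an (n, p) array with np.nan marking missing entries.
    Returns the completed matrix and the boolean missingness mask.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    completed = X.copy()

    # Assumption: start by filling each gap with the observed column mean.
    col_means = np.nanmean(X, axis=0)
    completed[missing] = np.take(col_means, np.nonzero(missing)[1])

    n, p = X.shape
    for _ in range(n_iter):
        for j in range(p):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            obs_j = ~miss_j

            # Regress variable j on all other variables in the completed matrix,
            # fitting only on rows where variable j was actually observed.
            others = np.delete(completed, j, axis=1)
            design = np.column_stack([np.ones(n), others])
            beta, *_ = np.linalg.lstsq(design[obs_j], completed[obs_j, j], rcond=None)

            # Residual standard deviation from the observed rows.
            resid = completed[obs_j, j] - design[obs_j] @ beta
            dof = max(int(obs_j.sum()) - design.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)

            # Random imputation: prediction plus a normal draw.
            noise = rng.normal(0.0, sigma, int(miss_j.sum()))
            completed[miss_j, j] = design[miss_j] @ beta + noise

    return completed, missing


def observed_vs_imputed(completed, missing):
    """Per-variable summary: (column, observed mean, imputed mean, observed sd, imputed sd)."""
    rows = []
    for j in range(completed.shape[1]):
        miss_j = missing[:, j]
        if not miss_j.any():
            continue
        obs = completed[~miss_j, j]
        imp = completed[miss_j, j]
        rows.append((j, obs.mean(), imp.mean(), obs.std(), imp.std()))
    return rows
```

A large gap between the observed and imputed summaries for a variable is a reason to look more closely at that variable, not proof that the imputations are wrong: under missingness at random the two distributions can legitimately differ.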

The article has some pretty pictures (and some ugly pictures too; hey, we’re not perfect). I don’t know how directly useful these methods are; I think of them as providing a “proof of concept” that model checking for imputations is possible at all, and I’m hoping this will spur lots of work by many researchers in the area. Ultimately I’d like people (or computer programs) to check their imputations just as they currently check their regression models.

2 Comments

  1. Hadley says:

    There's a chapter on this topic in the ggobi book – http://ggobi.org/book

  2. Hadley says:

    I'm worried by the histograms in Figure 6: inconsistent scales, inconsistent bin widths (presumably, you didn't report what bin width you actually used and so they're probably just R defaults), unnecessarily repeated axes and an ugly fill! Also, why not colour consistently with all the other plots in the paper?

    Figures 7 and 8 are better but still suffer from inconsistent bin widths, and why not label the histograms instead of putting labels in the caption?