When should you worry about imputed data?

Majid Ezzati writes:

My research group is increasingly focusing on a series of problems that involve data that either have missingness or measurements that may have bias/error. We have at times developed our own approaches to imputation (as simple as interpolating a missing unit and as sophisticated as a problem-specific Bayesian hierarchical model) and at other times, other groups impute the data.

The outputs are being used to investigate the basic associations between pairs of variables, Xs and Ys, in regressions; we may or may not interpret these as causal. I am contacting colleagues with relevant expertise to suggest good references on whether having imputed X and/or Y in a subsequent regression is correct or if it could somehow lead to biased/spurious associations. Thinking about this, we can have at least the following situations (these could all be Bayesian or not):

1) X and Y both measured (perhaps with error)
2) Y imputed using some data and a model and X measured
3) Y measured and X imputed using some data and a model
4) Y and X imputed with separate models, with no common covariate
5) Y and X imputed with separate models, with a common covariate
6) Y and X imputed and imputed X a covariate in the model for imputing Y
7) Y and X imputed and imputed Y a covariate in the model for imputing X
8) Y and X imputed in a joint model – multiple imputation

Any suggestions you may have on readings on whether such regressions could lead to spurious or biased associations or if they are OK to do would be appreciated.

My reply:

The short answer is that if your imputation model is correct (that is, if the likelihood you are using corresponds to the process by which the data and missingness were generated) then it should be fine to impute randomly conditional on all available information. But in real life the model won’t be correct so there can be problems. I don’t have any general thoughts on the 8 situations above. The 1994 paper by Xiao-Li Meng on congenial inference may be useful in developing your understanding.
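To make the mechanics concrete, here's a toy sketch of multiple imputation with Rubin's rules for a missing outcome — purely illustrative, not from any particular application, and deliberately "improper" in that it ignores parameter uncertainty in the imputation draws:

```python
# Toy sketch: multiply impute a partially missing outcome Y, fit the
# regression of interest on each completed dataset, and pool the slope
# estimates with Rubin's rules. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)     # true slope = 2
miss = rng.random(n) < 0.3                 # 30% of Y missing at random
y_obs = np.where(miss, np.nan, y)

def fit_ols(x, y):
    """Return OLS coefficients (intercept, slope) and the slope's standard error."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(cov[1, 1])

M = 20
slopes, ses = [], []
obs = ~np.isnan(y_obs)
for _ in range(M):
    # Impute missing Y by drawing from the predictive distribution of a
    # regression fit to the observed cases (parameter uncertainty is
    # ignored here, which "proper" multiple imputation would include).
    beta, _ = fit_ols(x[obs], y_obs[obs])
    sigma = np.std(y_obs[obs] - (beta[0] + beta[1] * x[obs]))
    y_imp = y_obs.copy()
    y_imp[~obs] = beta[0] + beta[1] * x[~obs] + rng.normal(0, sigma, (~obs).sum())
    b, se = fit_ols(x, y_imp)
    slopes.append(b[1])
    ses.append(se)

# Rubin's rules: total variance combines within- and between-imputation variance.
qbar = np.mean(slopes)
within = np.mean(np.square(ses))
between = np.var(slopes, ddof=1)
total_se = np.sqrt(within + (1 + 1 / M) * between)
print(f"pooled slope = {qbar:.2f} +/- {total_se:.2f}")
```

The point of the pooling step is that the between-imputation variance carries the extra uncertainty from the missingness; reporting a single completed-data standard error would understate it.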

Also, if you’re planning to check the fit of your imputation model (as I think you should), my paper with Kobi could be a useful starting point.
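One simple check in that spirit — again just a sketch, with made-up stand-in numbers — is to compare the distribution of imputed values against the observed values of the same variable; a large gap is a warning flag for the imputation model (though it can also reflect genuine not-at-random missingness, so it's a flag rather than proof):

```python
# Sketch of a basic imputation diagnostic: the standardized difference
# in means between observed values and imputed draws of a variable.
import numpy as np

rng = np.random.default_rng(1)
y_observed = rng.normal(0.0, 1.0, size=300)  # stand-in for observed data
y_imputed = rng.normal(0.3, 1.0, size=100)   # stand-in for imputation draws

def mean_gap(obs, imp):
    """Standardized difference in means: (mean of imputed - mean of observed) / pooled sd."""
    pooled_sd = np.sqrt((np.var(obs, ddof=1) + np.var(imp, ddof=1)) / 2)
    return (np.mean(imp) - np.mean(obs)) / pooled_sd

print(f"standardized mean gap: {mean_gap(y_observed, y_imputed):.2f}")
```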

It’s frustrating that there’s no general answer but I think that’s the way it often is in statistics, that much depends on the reasonableness of the model.

2 Comments

  1. Joe says:

    Hi Andrew,

    I wonder if I can ask a follow-up question about the variables used in the imputation. I’m trying to sort out an imputation on data dealing with racist attitudes that has produced a dataset with, very roughly, 30%-50% missingness on some key variables (probably NMAR, because some people don’t like admitting to being racist – though the non-responders are probably more ‘middle ground’; in my experience, hard-core racists will tell you so at every opportunity).

    If I use Amelia, or some Stata equivalent, what variables should be included, in the general sense? I mean, suppose I want to have a model later that says X (measured with almost no missingness) causes Y (our variable with 30% missingness) – should I include X as a variable in the imputation model? It should be a good predictor of Y if it is a cause, but will it bias future models?

    Moreover, say variable Z is a consequence of Y (the variable with missingness): should Z be included to maximize variance explained, even though it is weird in causal terms?

    Finally, would you take a ‘maximal’ approach to imputation, and just throw everything in that has some significant relationship with the (observed) cases on Y, or is it better to be more selective?
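    To make the first worry concrete, here’s a toy simulation of what I mean (just a sketch with made-up data, true slope 2): leaving X out of the imputation model for Y attenuates the X–Y slope, while including X roughly recovers it.

    ```python
    # Toy simulation: imputing Y without the analysis variable X
    # attenuates the estimated X-Y slope; including X recovers it.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)   # true slope = 2
    obs = rng.random(n) > 0.3          # 30% of Y missing completely at random

    def slope(x, y):
        """OLS slope of y on x."""
        return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

    # Imputation ignoring X: draws around the observed mean of Y.
    y_no_x = y.copy()
    y_no_x[~obs] = rng.normal(y[obs].mean(), y[obs].std(), (~obs).sum())

    # Imputation using X: draws from a regression of Y on X fit to observed cases.
    b = slope(x[obs], y[obs])
    a = y[obs].mean() - b * x[obs].mean()
    sd = (y[obs] - a - b * x[obs]).std()
    y_with_x = y.copy()
    y_with_x[~obs] = a + b * x[~obs] + rng.normal(0, sd, (~obs).sum())

    print(f"slope, X excluded from imputation: {slope(x, y_no_x):.2f}")   # attenuated
    print(f"slope, X included in imputation:  {slope(x, y_with_x):.2f}")  # near 2
    ```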

    Any help would be greatly appreciated,
    best wishes.

  2. K? O'Rourke says:

    The multiple bias analysis (MBA) by Sander Greenland may be helpful.

    My current favourite is “Relaxation Penalties and Priors” (Statistical Science, 2009).

    Also, I gave a talk on MBA this past summer at the Montreal Epi Congress – my main point was that though MBA is in everyone’s long-term best interests, it is a burdensome hardship for any individual group of investigators. After a lot of careful work, the result will likely be just a better understanding of how much uncertainty remains.

    Not the best route to quick visible publications!