Majid Ezzati writes:
My research group is increasingly focusing on problems involving data that either have missing values or include measurements that may be biased or subject to error. At times we have developed our own approaches to imputation (as simple as interpolating a missing unit and as sophisticated as a problem-specific Bayesian hierarchical model), and at other times other groups impute the data.
The outputs are being used to investigate the basic associations between pairs of variables, Xs and Ys, in regressions; we may or may not interpret these as causal. I am contacting colleagues with relevant expertise to ask for good references on whether using imputed X and/or Y in a subsequent regression is valid or whether it could lead to biased or spurious associations. Thinking about this, we can have at least the following situations (these could all be Bayesian or not):
1) X and Y both measured (perhaps with error)
2) Y imputed using some data and a model and X measured
3) Y measured and X imputed using some data and a model
4) Y and X imputed with separate models, with no common covariate
5) Y and X imputed with separate models, with a common covariate
6) Y and X imputed and imputed X a covariate in the model for imputing Y
7) Y and X imputed and imputed Y a covariate in the model for imputing X
8) Y and X imputed in a joint model – multiple imputation
Any suggestions you may have on readings on whether such regressions could lead to spurious or biased associations or if they are OK to do would be appreciated.
My reply:
The short answer is that if your imputation model is correct (that is, if the likelihood you are using corresponds to the process by which the data and missingness were generated) then it should be fine to impute randomly conditional on all available information. But in real life the model won’t be correct so there can be problems. I don’t have any general thoughts on the 8 situations above. The 1994 paper by Xiao-Li Meng on congenial inference may be useful in developing your understanding.
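To make "impute randomly conditional on all available information" concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not something from Majid's data: X and Y are bivariate normal with a true slope of 1, half of the X values are missing completely at random, and the helper names `slope` and `simulate` are hypothetical. It compares three ways of filling in X before regressing Y on X: drawing from the observed X values while ignoring Y, plugging in the fitted conditional mean of X given Y, and drawing randomly from the fitted distribution of X given Y. It uses a single imputation and only looks at the point estimate; proper multiple imputation would also propagate uncertainty in the imputation model.

```python
import numpy as np


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate(n=20000, beta=1.0, p_missing=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    miss = rng.random(n) < p_missing  # X missing completely at random (assumption)

    x_obs, y_obs, y_mis = x[~miss], y[~miss], y[miss]
    m = int(miss.sum())

    # Imputation model for X given Y, fitted on the complete cases.
    b = slope(y_obs, x_obs)  # slope of X regressed on Y
    a = x_obs.mean() - b * y_obs.mean()
    resid_sd = np.std(x_obs - (a + b * y_obs), ddof=2)

    imputations = {
        "ignores Y": rng.choice(x_obs, size=m),
        "conditional mean, no noise": a + b * y_mis,
        "random draw given Y": a + b * y_mis + rng.normal(scale=resid_sd, size=m),
    }

    results = {"complete data": slope(x, y)}
    for name, x_imp in imputations.items():
        x_filled = x.copy()
        x_filled[miss] = x_imp
        results[name] = slope(x_filled, y)
    return results


results = simulate()
for name, est in results.items():
    print(f"{name:28s} slope of Y on X = {est:.2f}")
```

Under these assumptions, a back-of-the-envelope calculation says the slope should be attenuated to about 0.5 when the imputation ignores Y, inflated to about 1.33 when the conditional mean is plugged in without noise, and close to the true value of 1 when X is drawn randomly given Y. That last case works here only because the imputation model matches how the data were simulated, which is exactly the condition that fails in real life.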
Also, if you’re planning to check the fit of your imputation model (as I think you should), my paper with Kobi could be a useful starting point.
It’s frustrating that there’s no general answer, but I think that’s often the way it is in statistics: much depends on the reasonableness of the model.