Shira writes:
This came up from trying to help a colleague of mine at Human Rights Watch.
He has several completely observed variables X, and a variable Y with 29% of its values missing. He wants a histogram (and other descriptive statistics) of a “filled in” Y.
He can regress Y on X, and impute missing Y’s from their fully observed X values (from the posterior predictive distribution). If he wants a histogram of the “filled in” Y, what would you recommend to him? Is there a good way to display this, taking the uncertainty in the imputed Y’s into account?
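For concreteness, here is a minimal sketch of what drawing imputations from the posterior predictive distribution could look like. It assumes a single predictor, normal errors, and a flat prior on the regression coefficients and log sigma; none of this comes from the HRW data, and the function name `draw_imputations` and the simulated values are purely illustrative.

```python
import math
import random

def draw_imputations(x_obs, y_obs, x_mis, rng):
    """One posterior predictive draw of the missing y values.

    Assumed model (illustrative only): y = a + b*(x - xbar) + noise,
    normal errors, flat prior on (a, b, log sigma).
    """
    n = len(x_obs)
    xbar = sum(x_obs) / n
    ybar = sum(y_obs) / n
    sxx = sum((x - xbar) ** 2 for x in x_obs)
    b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(x_obs, y_obs)) / sxx
    rss = sum((y - ybar - b_hat * (x - xbar)) ** 2 for x, y in zip(x_obs, y_obs))

    # sigma^2 | y is scaled inverse chi-square with n - 2 degrees of freedom:
    # draw a chi-square variate (Gamma with shape k/2, scale 2) and invert.
    chi2 = rng.gammavariate((n - 2) / 2.0, 2.0)
    sigma = math.sqrt(rss / chi2)

    # With x centered, intercept and slope are independent given sigma.
    a = rng.gauss(ybar, sigma / math.sqrt(n))
    b = rng.gauss(b_hat, sigma / math.sqrt(sxx))

    # Predictive draw: regression line plus fresh residual noise.
    return [rng.gauss(a + b * (x - xbar), sigma) for x in x_mis]

# Simulated stand-in data with roughly 29% of Y missing (hypothetical values).
rng = random.Random(0)
x_all = [rng.uniform(0, 10) for _ in range(200)]
y_all = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in x_all]
missing = [rng.random() < 0.29 for _ in x_all]
x_obs = [x for x, m in zip(x_all, missing) if not m]
y_obs = [y for y, m in zip(y_all, missing) if not m]
x_mis = [x for x, m in zip(x_all, missing) if m]

# 100 completed-data draws; each is a full set of imputed Y values.
imputations = [draw_imputations(x_obs, y_obs, x_mis, rng) for _ in range(100)]
```

Each element of `imputations` is one plausible set of filled-in values, so the spread across the 100 sets reflects both parameter uncertainty and residual noise.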
My reply:
http://www.stat.columbia.edu/~gelman/research/published/biom_031010.pdf
It’s not clear to me what the author really wants. Is it:
1) A histogram which represents approximately the histogram of the full set of Y values?
2) A way to model/quantify/display the uncertainty in the histogram described in (1) given that some of the data is imputed?
3) Something else?
Part 2 is a very Bayesian question, as it asks essentially for a (Bayesian probability) distribution over frequency distributions.
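One rough way to get at a distribution over frequency distributions is to compute the histogram once per completed dataset and then summarise each bin across the imputations. The sketch below assumes the imputed values are already available as a list of lists; the stand-in data, the bin edges, and the names `bin_counts` and `histogram_bands` are hypothetical.

```python
import random

def bin_counts(values, edges):
    """Count values in [edges[i], edges[i+1]); out-of-range values are dropped."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def histogram_bands(y_obs, imputations, edges, lo=0.05, hi=0.95):
    """Per-bin (low, median, high) counts across completed datasets."""
    # One histogram per completed dataset: observed values plus one imputation.
    per_draw = [bin_counts(y_obs + imp, edges) for imp in imputations]
    bands = []
    for i in range(len(edges) - 1):
        col = sorted(d[i] for d in per_draw)
        m = len(col)

        def pick(q):
            # Crude empirical quantile: index into the sorted counts.
            return col[min(m - 1, int(q * m))]

        bands.append((pick(lo), pick(0.5), pick(hi)))
    return bands

# Stand-in data: 142 observed values and 100 imputations of 58 missing values.
# In practice the imputations would come from the fitted model, not from here.
rng = random.Random(1)
y_obs = [rng.gauss(5, 2) for _ in range(142)]
imputations = [[rng.gauss(5, 2) for _ in range(58)] for _ in range(100)]
edges = [-5, 0, 2, 4, 6, 8, 10, 15]

bands = histogram_bands(y_obs, imputations, edges)
for left, right, (low, med, high) in zip(edges, edges[1:], bands):
    print(f"[{left:>3}, {right:>3}): median {med:3d}, 5%-95% range {low}-{high}")
```

The median counts could be drawn as the usual bars, with the low and high values shown as an interval on top of each bar.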
+1
I’m having a hard time understanding the exact goal too. What makes this exercise different from a run-of-the-mill regression or missing value imputation?
The HRW researcher wants a histogram of Y, displaying uncertainty in the imputations. Your option (2). The paper Andy linked to shows histograms side by side for a few different imputations, but this is probably not satisfactory for an HRW report. Probably a different display is needed, one that is not discussed in the linked paper. Ideas?
Modify the histogram: take your usual histogram with blocks, and for the Y block, don’t show a block but show a vertical scatterplot of the point-estimates from, say, 100 imputations. Now you’ve graphed the uncertainty, and the reader can mentally draw a line left/right from the other levels to see how much they believe that Y is bigger/smaller/as-predicted compared to the others.
Thanks! This seems reasonable. What about something like this?
http://www.velir.com/blog/index.php/2013/07/11/visualizing-data-uncertainty-an-experiment-with-d3-js/
How does (parametric) multiple imputation compare with (non-parametric) random forest imputation?
http://www.ncbi.nlm.nih.gov/pubmed/24589914