Elena Grewal writes:
I am currently using the iterative regression imputation model as implemented in the Stata ICE package. I am using data from a survey of about 90,000 students in 142 schools, and my variable of interest is parent level of education. I want only this variable to be imputed, with as little bias as possible, as I am not using any other variable. So I scoured the survey for every variable I thought could possibly predict parent education. The main variable I found is parent occupation, which explains about 35% of the variance in parent education for the students with complete data on both. I then included the 20 other variables I found in the survey in a regression predicting parent education, which explains about 40% of the variance in parent education for students with complete data on all the variables.
My question is this: many of the other variables I found have more missing values than the parent education variable, and, although they are statistically significant predictors in the complete-case sample, they have very small coefficients. Is this a problem? Is my method of including all the variables that were statistically significant predictors a valid strategy for deciding what to include in the imputation model?
My reply: Your imputation plan seems reasonable. To check it, you can do some cross-validation: randomly remove, say, one-fifth of the observed values of your variable of interest, rerun the imputation algorithm, then compare the held-out values to the random imputations. We did some of this in our 1998 paper but I still haven’t gotten around to formalizing the method.
The cross-validation check won’t save you if you have serious nonignorable missingness (for example, large values more likely than small values to be misreported), but it can be thought of as a minimal check.
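Here is a minimal sketch of that hold-out check in Python, using only the standard library. It is not the ICE algorithm: as a stand-in, it imputes from a single-predictor linear regression plus a random normal residual, and it ignores uncertainty in the fitted coefficients. The data are simulated, and the names `fit_line`, `holdout_check`, `occupation`, and `education` are hypothetical, chosen only for illustration. The simulated predictor explains roughly a third of the variance in the outcome, loosely echoing the figure quoted in the question.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return a, b, sd


def holdout_check(x, y, frac=0.2, seed=0):
    """Hide a random fraction of observed y, impute it, compare to the truth."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_held = int(frac * len(idx))
    held, kept = idx[:n_held], idx[n_held:]

    # Fit the imputation model only on the cases that were not held out.
    a, b, sd = fit_line([x[i] for i in kept], [y[i] for i in kept])

    truth = [y[i] for i in held]
    # Random imputation: predicted value plus a random residual draw.
    imputed = [a + b * x[i] + rng.gauss(0, sd) for i in held]
    # Share of held-out values inside a rough 95% predictive interval.
    inside = sum(abs(y[i] - (a + b * x[i])) <= 1.96 * sd for i in held)

    return {
        "true_mean": statistics.fmean(truth),
        "imputed_mean": statistics.fmean(imputed),
        "true_sd": statistics.stdev(truth),
        "imputed_sd": statistics.stdev(imputed),
        "coverage": inside / n_held,
    }


# Simulated stand-in data (an assumption, not the survey in the question).
rng = random.Random(1)
occupation = [rng.gauss(0, 1) for _ in range(5000)]
education = [12 + 1.5 * z + rng.gauss(0, 2) for z in occupation]

result = holdout_check(occupation, education)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

If the imputation model is adequate, the imputed values should resemble the held-out values in mean and spread, and coverage should be near 0.95. Because the values here are hidden completely at random, passing this check says nothing about nonignorable missingness.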