
One quick tip for building trust in missing-data imputations?

Peter Liberman writes:

I’m working on a paper that, in the absence of a single survey that measured the required combination of variables, analyzes data collected by separate, uncoordinated Knowledge Networks surveys in 2003. My co-author (a social psychologist who commissioned one of the surveys) and I obtained from KN unique id numbers for all of the respondents in the surveys, and found that 363 participated in both of the main ones.

The resultant dataset resembles a panel survey with ~90% attrition and sample refreshment. Rather than just analyze the overlap between the samples, to improve statistical power I used some of the wave-nonresponse cases and imputed the missing data using MI. (Using them all results in far too much missing data for the imputation model to converge.) I’m not a methodologist and did not use rigorous criteria, much less state-of-the-art ones, in choosing how many and which wave-nonresponse cases to add and in dealing with selection effects (survey acquiescence having some impact on appearing in the overlapping samples).

Might you be able to suggest publications, besides this paper by Gelman, King, and Liu and this paper by Si, Reiter, and Hillygus, that might provide useful guidance for analyzing this type of data? Would you be interested in providing advice on the attached paper, or even collaborating on it, or can you think of someone with the relevant expertise who might be?

The goal would not just be to provide more rigorous analysis of the research questions in this paper, but to provide more methodologically sound direction for other researchers wanting to use this novel type of data. I say “novel” because I have not yet found a previous example of its use, and executives at GfK/KN and a YouGov/Polimetrix project director told me that nobody has requested such data before. I can imagine researchers often wanting to conduct secondary analysis of variables measured only in separate surveys. Sample size presents an obvious limitation, and my particular study benefitted from surveys that had unusually large original samples (each with Ns >3,000). But usable overlap might be quite common among surveys using specialized sampling frames (e.g., political science surveys being fielded to online respondent panelists from whom political data already has been collected). Given the accumulation of data sitting in online survey companies’ archives, this could represent a significant untapped resource for testing post-hoc hypotheses specific to certain time periods.

My reply: I have no great answers here. I think the problem of building trust in imputations is important, and I’ve written two papers on the topic, one with Kobi Abayomi and Marc Levy, and one with Yu-Sung Su, Jennifer Hill, and Masanao Yajima. But much more needs to be done. Our original plan with our multiple imputation package mi (available on CRAN for use in R) was to include all sorts of diagnostics by default. We do have a few diagnostics in mi (see the above-linked paper by Su et al.) but we have not really integrated them into our workflow.
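To make the idea of an imputation diagnostic concrete, here is a minimal Python sketch of one generic check: put the observed values and the imputed values of a variable side by side and compare their summaries. This is not the mi package or its API (mi is an R package), and everything in it is an assumption made for illustration: the simulated data, the hypothetical helper names `simulate`, `impute_once`, and `observed_vs_imputed`, and the single stochastic regression imputation, which is a simplification of proper multiple imputation because it does not propagate uncertainty in the regression coefficients.

```python
import random
import statistics


def simulate(n, rng):
    """Simulate x and y = 1 + 2x + noise; y is missing more often when x > 0."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = []
    for xi in x:
        yi = 1 + 2 * xi + rng.gauss(0, 1)
        p_missing = 0.8 if xi > 0 else 0.2  # missingness depends only on observed x
        y.append(None if rng.random() < p_missing else yi)
    return x, y


def impute_once(x, y, rng):
    """Fill missing y (None) from a least-squares fit of y on x, plus random noise."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = statistics.fmean(xi for xi, _ in obs)
    mean_y = statistics.fmean(yi for _, yi in obs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs) / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard deviation, used to add noise to each imputed value.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in obs)
    sigma = (rss / (len(obs) - 2)) ** 0.5
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0, sigma)
        for xi, yi in zip(x, y)
    ]


def observed_vs_imputed(y, completed):
    """Summarize the observed and the imputed values of y separately."""
    observed = [c for yi, c in zip(y, completed) if yi is not None]
    imputed = [c for yi, c in zip(y, completed) if yi is None]
    return {
        "n_observed": len(observed),
        "n_imputed": len(imputed),
        "mean_observed": statistics.fmean(observed),
        "mean_imputed": statistics.fmean(imputed),
        "sd_observed": statistics.stdev(observed),
        "sd_imputed": statistics.stdev(imputed),
    }


rng = random.Random(2003)  # fixed seed so the sketch is reproducible
x, y = simulate(2000, rng)
completed = impute_once(x, y, rng)
for name, value in observed_vs_imputed(y, completed).items():
    print(f"{name}: {value:.3f}")
```

In this simulation the imputed values have a higher mean than the observed ones, and that is the correct behavior, because y is missing mostly where x (and therefore y) is large. A gap between observed and imputed values is thus a prompt to look at the imputation model, not automatic evidence that the imputations are wrong.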

P.S. In case you’re interested, here’s the abstract to the research paper by Liberman and Linda Skitka:

This paper examines the role of revenge in U.S. public support for the Iraq War. Citizens who mistakenly blamed Iraq for 9/11 felt relatively strongly that it would satisfy a desire for revenge, and such feelings significantly predicted war support after controlling for security incentives, beliefs about the costs of war, and political orientations. But many of those who said Iraq was not involved also expected war would satisfy a desire for revenge, which we interpret as a foreign policy analogue of displaced aggression. This research helps us understand how the Bush Administration was able to bring the nation to war against a country having nothing to do with 9/11, testifies to the roles of emotion and moral motivation in public opinion, and demonstrates the feasibility of utilizing independently conducted online surveys in secondary data analysis.

One Comment

  1. Garnett McMillan says:

    The role of trust is absolutely crucial in any aspect of data analysis. I notice that an investigator’s ‘trust’ in a data analysis model is directly related to statistical significance. A reviewer’s ‘trust’ is familiarity. New things, particularly in data analysis with a history of ‘better statistics’ being promoted, are not to be trusted.
