Someone writes in with an interesting question:
I’d like to speak with you briefly to get your thoughts on the imputation of missing data in a new online web-survey technique I’m developing.
Our survey uses Split Questionnaire Design. The total length of the survey will vary across customers, but will generally be between 5 and 15 questions.
We ask only a set of 3 questions from the question pool, consisting of the following: an overall customer satisfaction question that is asked of all customers/visitors, and 2 questions selected at random from the pool of questions.
The goal of collecting this data is to run a linear regression of overall satisfaction (the dependent variable) on the eight variables (independent variables). However, in order to do this, the missing data in the survey must be imputed so the data set is “rectangular”. This is important, as different imputation techniques are ideal for different intended outcomes; the goal here is to produce a data set that allows a regression on the dependent variable, overall satisfaction.
The motivation was that people won’t want to answer 5 or 10 questions but maybe you can get them to answer 2 or 3.
At first I was stuck—will these subsets give you enough information to estimate the regression?—but after thinking about it for a few moments I realized that yes, you can do it. You can’t estimate the full joint distribution with just the subsets, but a regression model with no interactions is living on a lower-dimensional space.
With a sample in which you ask different people random pairs of questions, every pair of variables is observed together for some respondents, so you can estimate the covariance matrix for all your variables, and from this you can reconstruct the linear regression. Perhaps the easiest way to do this is not through imputation but rather to do it directly by constructing a regularized estimate of the covariance matrix.
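Here is a minimal sketch of that direct approach in Python with NumPy. Everything specific in it is an assumption made for illustration: the function names, the simulated data (8 equally correlated predictors, satisfaction asked of everyone, 2 predictors asked at random), and the use of pairwise-complete covariances with an optional ridge term as one simple stand-in for a "regularized estimate." The slopes come from solving S_xx b = s_xy, where S_xx holds the covariances among the predictors and s_xy their covariances with satisfaction.

```python
import numpy as np

def simulate_split_survey(n, beta, rng, k=2, rho=0.3):
    """Hypothetical data: y is asked of everyone, k of the p predictors at random."""
    p = len(beta)
    common = rng.normal(size=(n, 1))
    # Predictors with unit variance and pairwise correlation rho (an assumption).
    X = np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    # Each row of the argsort is a random permutation, so "< k" marks k random columns.
    asked = rng.random((n, p)).argsort(axis=1) < k
    return np.where(asked, X, np.nan), y

def pairwise_cov(Z):
    """Covariance matrix where each entry uses the rows that observe both variables."""
    q = Z.shape[1]
    S = np.empty((q, q))
    for a in range(q):
        for b in range(a, q):
            both = ~np.isnan(Z[:, a]) & ~np.isnan(Z[:, b])
            za, zb = Z[both, a], Z[both, b]
            S[a, b] = S[b, a] = np.mean((za - za.mean()) * (zb - zb.mean()))
    return S

def fit_from_pairs(X_obs, y, ridge=0.0):
    """Intercept and slopes reconstructed from means and pairwise covariances."""
    p = X_obs.shape[1]
    Z = np.column_stack([X_obs, y])
    S = pairwise_cov(Z)
    means = np.nanmean(Z, axis=0)
    # ridge > 0 adds a simple diagonal regularization to the predictor block.
    slopes = np.linalg.solve(S[:p, :p] + ridge * np.eye(p), S[:p, p])
    intercept = means[p] - means[:p] @ slopes
    return intercept, slopes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_beta = np.linspace(-1.0, 1.0, 8)
    X_obs, y = simulate_split_survey(100_000, true_beta, rng)
    intercept, slopes = fit_from_pairs(X_obs, y)
    print(np.round(slopes, 2))
```

With only 2 of 8 predictors asked per person, any given pair of predictors is seen together for roughly 1 respondent in 28, so the estimated covariance matrix can be noisy (and need not be positive definite) in small samples. That is where a nonzero `ridge`, or a more careful regularized estimate, would earn its keep.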
The short story is that there are four concerns:
1. Do you have enough data so that you can mathematically “identify” (that is, estimate) the regression given the available data? It turns out (and it’s easy to show mathematically) that, yes, if you ask at least 2 questions of each person, giving different questions randomly to different groups of people, you do have enough information to estimate the regression.
2. How do you actually do the estimation? There are various ways to estimate the regression. Various missing-data imputation programs will do the job. You can pretty much pick one based on which software you are comfortable with.
3. Efficiency. How many questions should you ask each person (it should be at least 2, but you could ask 3 or 4, for example)? How do you do the design? Do you want some questions included more often than others? For this you could just make up a design, or it should be possible to do some simulation studies to design something more optimal.
4. Practical issues that will come up, for example unintentional missing data (not everyone will answer every question). This can probably be handled easily enough using whatever missing-data imputation algorithm you are using.
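The simulation study mentioned under point 3 could be sketched along the following lines. This is a toy version under loudly stated assumptions, not a recommended design procedure: the data-generating model (8 equally correlated predictors, unit-variance noise), the sample size, the small ridge term, and all function names are made up for illustration. It compares designs that ask 2, 3, or 4 random questions per person by the root mean squared error of the estimated slopes.

```python
import numpy as np

def simulate(n, k, beta, rng, rho=0.3):
    """Hypothetical survey: y asked of everyone, k random predictors per person."""
    p = len(beta)
    common = rng.normal(size=(n, 1))
    X = np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    asked = rng.random((n, p)).argsort(axis=1) < k  # exactly k columns per row
    return np.where(asked, X, np.nan), y

def fit_slopes(X_obs, y, ridge=1e-3):
    """Slopes from a pairwise-complete covariance matrix (requires k >= 2)."""
    p = X_obs.shape[1]
    Z = np.column_stack([X_obs, y])
    observed = ~np.isnan(Z)
    # Center each column at its observed mean; missing cells contribute zero.
    Zc = np.where(observed, Z - np.nanmean(Z, axis=0), 0.0)
    counts = observed.astype(float).T @ observed.astype(float)
    S = (Zc.T @ Zc) / counts  # entry (a, b) averages over rows observing both
    return np.linalg.solve(S[:p, :p] + ridge * np.eye(p), S[:p, p])

def slope_rmse(k, n=2000, p=8, reps=30, seed=0):
    """Root mean squared error of the slopes over repeated simulated surveys."""
    rng = np.random.default_rng(seed)
    beta = np.linspace(-1.0, 1.0, p)
    sq_errors = []
    for _ in range(reps):
        X_obs, y = simulate(n, k, beta, rng)
        sq_errors.append((fit_slopes(X_obs, y) - beta) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))

if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, "questions per person: slope RMSE =", round(slope_rmse(k), 3))
```

A real study would also need to account for the fact that longer surveys may lower response rates, which is the tradeoff that motivates the split design in the first place; this sketch holds the number of respondents fixed and so only shows the statistical side of the comparison.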
I thought it would make sense to blog this since other people might be interested in doing this sort of thing too.