Someone writes in with an interesting question:

I’d like to speak with you briefly to get your thoughts on the imputation of missing data in a new online survey technique I’m developing.

Our survey uses Split Questionnaire Design. The surveys will vary in length for different customers, but will generally be between 5 to 15 questions in length.

We ask only a set of 3 questions from the question pool, consisting of the following: an overall customer satisfaction question that is asked of all customers/visitors, and 2 questions selected at random from the pool of questions.

The goal of collecting this data is to run a linear regression of the eight variables (independent variables) against overall satisfaction (the dependent variable). However, in order to do this, the missing data in the survey must be imputed so the data set is “rectangular”. This is important, as different imputation techniques are ideal for different intended outcomes; the goal here is to produce a data set that allows for a regression against the dependent variable, overall satisfaction.

The motivation was that people won’t want to answer 5 or 10 questions but maybe you can get them to answer 2 or 3.

At first I was stuck—will these subsets give you enough information to estimate the regression?—but after thinking about it for a few moments I realized that yes, you can do it. You can’t estimate the full joint distribution with just the subsets, but a regression model with no interactions is living on a lower-dimensional space.

With a sample in which you ask random pairs of questions to different people, you can estimate the covariance matrix for all your variables, and from this you can reconstruct the linear regression. Perhaps the easiest way to do this is not through imputation but rather to do it directly by constructing a regularized estimate of the covariance matrix.
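Here is a minimal sketch of that covariance-based route (my own illustration, not code from the post; the simulated survey, the equicorrelated covariance, and all variable names are made up): estimate each entry of the covariance matrix from whichever respondents happened to answer both of the relevant questions, then read the slopes off the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8            # predictors in the question pool (the "eight variables")
n = 100_000      # respondents

# Simulate the full data we never actually observe.
Sigma_x = 0.5 * np.ones((K, K)) + 0.5 * np.eye(K)   # equicorrelated pool
x = rng.multivariate_normal(np.zeros(K), Sigma_x, size=n)
beta_true = np.linspace(-1.0, 1.0, K)
y = x @ beta_true + rng.normal(size=n)

# Each respondent answers the satisfaction question y plus 2 predictors
# chosen at random; obs marks which predictor columns were seen.
obs = np.zeros((n, K), dtype=bool)
for i in range(n):
    obs[i, rng.choice(K, size=2, replace=False)] = True

# Available-case ("pairwise") estimates of the covariance pieces.
S_xx = np.empty((K, K))
s_xy = np.empty(K)
for j in range(K):
    s_xy[j] = np.cov(x[obs[:, j], j], y[obs[:, j]])[0, 1]
    for k in range(K):
        both = obs[:, j] & obs[:, k]
        S_xx[j, k] = np.cov(x[both, j], x[both, k])[0, 1]

# Slopes from the covariance matrix: beta = S_xx^{-1} s_xy.
beta_hat = np.linalg.solve(S_xx, s_xy)
print(np.round(np.abs(beta_hat - beta_true).max(), 3))  # error is small at this n
```

Note that the pairwise-assembled matrix is not guaranteed to be positive definite in small samples, which is where the regularization mentioned above would come in.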

The short story is that there are four concerns:

1. Do you have enough data to mathematically “identify” (that is, estimate) the regression given what’s available? It turns out (and it’s easy to show mathematically) that, yes, if you ask at least 2 questions of each person, giving different questions randomly to different groups of people, you do have enough information to estimate the regression.

2. How do you actually do the estimation? There are various ways to estimate the regression. Various missing-data imputation programs will do the job. You can pretty much pick it based on which software you are comfortable with.

3. Efficiency. How many questions should you ask each person (it should be at least 2, but you could ask 3 or 4, for example)? How do you do the design? Do you want some questions included more often than others? For this you could just make up a design, or it should be possible to do some simulation studies to design something more optimal.

4. Practical issues that will come up, for example unintentional missing data (not everyone will answer every question). This can probably be handled easily enough using whatever missing-data imputation algorithm you are using.

I thought it would make sense to blog this since other people might be interested in doing this sort of thing too.

Interesting post! Could you please give more details on how to show that 2 questions is enough? Also, is it possible to determine the number of people that need to answer in order for the estimate to be reliable, as a function of the number of questions?

I’d be happy with any pointer on how to do this sort of calculation :-)

@Andre: If there are K questions and 1 response, what you need is the complete (K+1) x (K+1) covariance matrix. The entries of that matrix can be estimated pairwise. Now I should be less lazy and work out the algebra for the regression coefficients from the covariance matrix.
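For what it’s worth, that algebra is the standard partitioned-covariance identity (a sketch; this is not worked out anywhere in the thread):

```latex
% Partition the (K+1) x (K+1) covariance matrix of (x, y):
%   Sigma = [ Sigma_xx   sigma_xy ]
%           [ sigma_xy'  sigma_yy ]
% Then the regression y = alpha + x'beta + eps has
\beta = \Sigma_{xx}^{-1}\,\sigma_{xy},
\qquad
\alpha = \mu_y - \mu_x^{\top}\beta,
\qquad
\operatorname{Var}(\varepsilon) = \sigma_{yy} - \sigma_{xy}^{\top}\,\Sigma_{xx}^{-1}\,\sigma_{xy}.
```

Every quantity on the right is an entry (or block) of the covariance matrix plus the marginal means, all of which are estimable pairwise.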

@Sam Mason: Yes, Stan should be able to do this no problem. If somebody sends me the formula for the regression coefficients or even better, a working Stan implementation, I can drop it in our next manual. It’s a neat example of the kind of estimation we can do with multivariate random variables like covariance matrices. It’d even be possible to code up with different people answering different numbers of questions — that’s just a different slice of the full covariance matrix.

Talk to folks like Fred Reichheld and he’ll tell you there’s only one question you need to ask (a la the infamous, and *trademarked* Net Promoter Score).

If what several other commenters say is correct though (a larger bucket of questions with only a subset viewable by any one respondent), I would be curious to understand this survey design. I can imagine that the overall sample size alone, just on the basis of having ‘enough’ information in the collected data to draw inferences, would need to be quite large. I don’t see how this scenario doesn’t really call for a research design around a stratified sample (lots of landmines there).

Naive question: How can “will generally be between 5 to 15 questions in length” and “We ask only a set of 3 questions from the question pool” both be true?

I think what he’s saying is that he only asks three sets of questions, and that each set of questions has multiple questions in it, leading to each person seeing only three sets of questions but having to answer 5 to 15 individual questions.

I’m pretty sure what he means is that he has a total of 5 to 15 questions in his pool and only three are shown to a user at any time.

An acquaintance of mine did his dissertation on this very topic in a panel survey context, mostly focused on how to intelligently determine who gets which questions so as to get the most efficient estimates. http://drum.lib.umd.edu/bitstream/1903/13171/1/Gonzalez_umd_0117E_13431.pdf

Thanks for publicizing this. One of the common problems in commercial surveys is that they are TOO DARNED LONG. (consider, for example, many of the “register receipt” surveys)

Not only is this easier than imputation, it’s also easier to explain and avoids the need to justify whichever set of assumptions are used in the imputation.

Curious, though. You mention “random” a couple of times in the post, but I would think any scheme (say, systematic) that ensured a decent number of pairwise observations would work.

And thanks to awm for that link.

Sometimes I think commercial surveys are too long because the people actually designing the survey seem to be paying lip service to the concept of a survey and are the least interested in getting good data.

e.g. I buy Enterprise Hardware from Dell and keep getting surveys often. The first darned page always wants to know what I buy from Dell. Shouldn’t they know that already? Or is it some form of stupid anonymization that prevents them from accessing that info?

Or (more likely) the designer deemed it too much work to query the separate accounts database (probably a more secure one)? Instead burdening the user & wasting five more minutes of my time is fine by them.

I work for a large software company that sends out *hundreds of thousands* of surveys quarterly. The issue is that companies like ours–I’ve verified this with peers in other companies–are GRASPING to find something that matters; that’s often why the surveys are so long.

Re asking what you bought, you’d be amazed how in most companies–even those with millions of dollars spent on research and marketing (and databases)–the data about who’s who and who owns/bought what is so disparate that putting it together is sometimes impossible and sometimes too costly (time and money).

That’s my point though: If you send out hundreds of thousands of surveys every quarter, even if you save one minute per respondent by not asking him senseless, redundant questions (e.g. which products did you buy from us) isn’t the total saved time far more than what it’d take you to compile that info from internal databases?

Or is it just a version of the externality problem? i.e. programmer time gets billed, but respondent time is “free”?

Sounds a lot like compressed sensing to me; there should be some methods there for efficiently estimating the missing values.

If you can get away from imputing a rectangular matrix you may have more joy; I’m sure Stan would be good for this sort of thing!

John Graham and colleagues have written about these “planned missing data designs” in psychology:

http://www.ncbi.nlm.nih.gov/pubmed/17154750

Yes, and there is also a link to more ‘classical’ fractional factorial designs – which are another form of planned missingness.

Started to read through Little and Rubin on when the missing-data mechanism is ignorable for likelihood inference – then remembered this paper http://arxiv.org/pdf/1306.2812.pdf

My guess is it would be one question per person, given enough people (in terms of the algebra: given log-quadratic likelihoods, recall generalised inverses).

As a non-statistician I am confused. Asking 100 persons 10 questions each seems to have more information content than asking 100 persons 2 questions each. The question to me seems to be how sparse you can go without having a model too crappy for your purpose.

So, can this question be answered a priori? Shouldn’t it depend on the quality, noisiness, etc. of any particular data set? Alternatively, to get the same model “quality” from this procedure, isn’t it possible that we’d need to increase the respondent number?

Another point that confuses me: if you are using imputation for the un-asked questions, are those eight variables truly independent variables? Aren’t you implicitly assuming that your independent variables are correlated?

Yes, you get less information per person if you ask fewer questions. That’s what Andrew meant by saying you could simulate the design. You can use simulation for what’s traditionally lumped under “power calculations” in statistics; these simulations will depend on particulars of the data set such as noise. You can simulate for multiple values, but you often have an idea before you start, so you can still ballpark.
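Here’s the kind of simulation I mean (a hypothetical sketch, not from the thread; the data-generating process and all names are made up): vary how many questions each respondent answers and watch the error in the slopes recovered from the pairwise covariance matrix.

```python
import numpy as np

def max_slope_error(k, n=40_000, K=8, seed=1):
    """Simulate k questions per respondent; return the max abs error in
    the slopes recovered from the pairwise covariance matrix."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 * np.ones((K, K)) + 0.5 * np.eye(K)  # equicorrelated pool
    x = rng.multivariate_normal(np.zeros(K), Sigma, size=n)
    beta = np.linspace(-1.0, 1.0, K)
    y = x @ beta + rng.normal(size=n)
    # Each respondent sees k of the K predictor questions at random.
    obs = np.zeros((n, K), dtype=bool)
    for i in range(n):
        obs[i, rng.choice(K, size=k, replace=False)] = True
    # Pairwise ("available case") covariance estimates.
    S = np.empty((K, K))
    s = np.empty(K)
    for j in range(K):
        s[j] = np.cov(x[obs[:, j], j], y[obs[:, j]])[0, 1]
        for m in range(K):
            both = obs[:, j] & obs[:, m]
            S[j, m] = np.cov(x[both, j], x[both, m])[0, 1]
    return np.abs(np.linalg.solve(S, s) - beta).max()

for k in (2, 4, 8):
    print(k, round(max_slope_error(k), 3))  # error typically shrinks with k
```

Running designs like this against plausible noise levels is how you’d trade off questions-per-person against the number of respondents needed.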

Nobody assumes regression predictors are independent — “independent variables” is a bad name.

Your comment “living on a lower-dimensional space” provides the answer (e.g., when the 5 or 10 questions measured a unidimensional political ideology such as one would find at the Voteview.com website). Such a low-dimensional manifold is the basis for compressed sensing, recommendation systems, item response theory, cultural consensus theory, statistical analysis of roll call voting, and missing data by design (see Jake Westfall comment).

What Andrew said was that a regression *without interaction terms* is living on a lower-dimensional space. When you add interactions, you then need the covariance of the interactions and so on. With only basic predictors corresponding to the inputs, you can estimate a full-rank covariance matrix from pairwise data.

Estimating a sparse matrix is another model, though perhaps one that’d be fruitful to employ in this context if there were a very large number of survey questions which were assumed to be nearly linearly dependent.

Can we at least agree that it is an empirical issue whether the items fall along a low-dimensional manifold? Even a brief review of the customer satisfaction literature will reveal the common complaint that our ratings are highly correlated. Perhaps you have heard of halo effects? Interestingly, we see a similar phenomenon in political ideology with a single dimension explaining much of the variation in Supreme Court decisions and roll call votes. In fact, we find low-dimensional solutions in all the examples I provided, which is why I offered them as examples.

Absolutely. I was just trying to point out that it’s not a necessary assumption for being able to fit the model.

On the other hand, if you add interaction terms to the regression, then you need more than pairwise observations of the inputs (Andrew and Jennifer in their regression book use “inputs” for the basic inputs and “predictors” for possibly, but not necessarily, interacted basic inputs).
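To spell that out (a sketch, not from the thread): once an interaction enters as a predictor, the normal equations involve higher-order moments, for example

```latex
% With z = x_1 x_2 added as a predictor, the normal equations need
\operatorname{Cov}(x_j,\, x_1 x_2) \;=\; E[x_j x_1 x_2] \;-\; E[x_j]\,E[x_1 x_2],
% a third-order moment: x_j, x_1, and x_2 must be observed jointly on
% the same respondents, so pairwise data alone cannot identify it.
```

which is why two-questions-per-person designs stop being enough.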

And lots of the low-dimensional manifold work I’m aware of in machine learning was about finding non-linear submanifolds. And lots of the factor models I’ve seen are technically full rank, but very very poorly conditioned — those last few eigenvalues/vectors or singular values/vectors don’t buy you much.

Bob:

There is this _pregnant assumption_ that linear effects are the same everywhere for anyone answering, or within interactions.

Checking that becomes the concern?

Also might be better to think beyond quadratics?

By the way, thanks for the Stan-ning!

Hunting for the oldest related reference:

Maximum Likelihood Estimation with Incomplete Multivariate Data

Irene Monahan Trawinski and R. E. Bargmann, 1964

http://projecteuclid.org/euclid.aoms/1177703562

Or perhaps this?

Lord, F. M. (1962). Estimating norms by item-sampling. Educational and Psychological Measurement, 22, 259-267

I didn’t fully investigate relevance, but I grabbed it from Mislevy et al (1992) Estimating population characteristics from sparse matrix samples of item responses

Remember Andrew’s principle….when in doubt check the psychometric literature.