“Anything worth doing is worth doing shittily”: missing-data edition

In another example of the paradox of importance, a colleague writes:

In other news, I am about to use the “hot deck” method to do some imputation. I considered using one of the more sophisticated and generally better methods instead, but hey, I’m on a deadline, plus there are many other sources of error that will be larger than the ones I’m introducing. It’s the same old story/justification for using linear models, normal models, assuming iid errors, etc.

At least he feels bad about it. That’s a start.

3 thoughts on ““Anything worth doing is worth doing shittily”: missing-data edition”

  1. I'm sorry I gave the impression that I feel bad about it. I don't! I also don't feel bad about (most of) the cases in which I have fit linear models, assumed iid errors, assumed normality, etc.

    For most real-world problems, any reasonable model is likely to be demonstrably "incorrect" in one way or another…and for the ones that aren't _demonstrably_ incorrect, we can't make the demonstration because we don't have the data to prove that they are incorrect, rather than because the model is "correct".

    I'm actually pretty comfortable using the "hot deck" procedure for my problem. So, let me flip this around:
    (1) I have about 40 more hours that I can spend analyzing my data (at least, that they'll pay me for);
    (2) I have a heckuva lot of stuff I have to do in that time;
    (3) the data that I need to impute will feed into a function that is very "gentle", in the sense that even if they are in error by quite a bit, this will have only a minor influence on the output of the function (which includes several other data values, which I don't need to impute); and
    (4) I think the "hot deck" procedure will give me answers that are almost as good as what I could get from any other approach.

    Convince me that I should "feel bad" about using the hot deck procedure. And then convince me that I should feel so bad about it that I should do something else instead.

    –Phil Price

  2. Phil,

    When I've done missing data imputation, I've usually used model-based approaches: basically, regressing the variable with missingness on the other variables (including, in the X's, variables that come causally "after" the variable being imputed), then imputing X*beta + error for each missing value.
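[A minimal sketch of the regression-based imputation described above, in Python rather than R purely for illustration; the variable names and toy data are made up. The idea: fit the regression on complete cases, then fill each missing value with X*beta plus a random draw from the residual distribution.]

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_impute(y, X, rng):
    """Impute missing values of y by regressing y on X over the complete
    cases, then filling in X @ beta + a random residual-scale error."""
    obs = ~np.isnan(y)
    # Add an intercept column.
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1[obs], y[obs], rcond=None)
    resid = y[obs] - X1[obs] @ beta
    sigma = resid.std(ddof=X1.shape[1])  # residual standard deviation
    y_imp = y.copy()
    # Adding the random error term preserves the spread of y; imputing the
    # bare prediction X @ beta would understate the variance.
    y_imp[~obs] = X1[~obs] @ beta + rng.normal(0.0, sigma, (~obs).sum())
    return y_imp

# Toy data: y depends linearly on x, with 20 of 100 values missing.
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=100)
y[rng.choice(100, 20, replace=False)] = np.nan
y_filled = regression_impute(y, x.reshape(-1, 1), rng)
```

[For a real analysis one would use multiple imputation — several such draws, not one — which is what the mice package automates.]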

    It might be that hot deck will work as well or better for you, I don't know. My only direct experience with hot-deck was in playing with some imputation methods for an example in our forthcoming book. In this example, regression worked much better than hot-deck.

    Now that you've done the imputations, it might not be worth spending more time on them. If you hadn't started, I'd suggest starting with Mice (it's an R package, downloadable from CRAN).

  3. I'll take a look at Mice. But in case you're interested, here's my situation:

    I'm analyzing data from a survey that asked about people's ventilation. Mostly, what we care about is how many hours per day people have no, low, medium, or high ventilation, where those categories are defined according to both the number of windows open and how widely they're open. To a lesser extent, we care about the temporal pattern of window-opening: if someone has their windows open for three hours in the morning and three hours at night, that's a bit better (from the standpoint of avoiding poor indoor air quality) than having them open six hours at night and closed for the rest of the day. The efficiency issue is a pretty minor modifier, varying by at most about a factor of two for any given number of hours-with-windows-open, and you only get the factor of two if you compare the most extreme cases (consecutive hours with windows open, versus hours perfectly evenly spread throughout the day).

    We asked people, for each season, how many hours in different time periods they had windows open in: the kitchen, any bathroom, any bedroom, any other room. We also asked, for each season, how many hours per day (on average) they have no/low/medium/high ventilation. The idea is to combine all of this information into a "ventilation effectiveness" metric.

    Here's the problem: some people gave answers that are inconsistent (e.g. they only account for 8 hours of ventilation in the temporally-detailed section, but they said on the other question that they have 12 hours of medium ventilation), and quite a few people skipped the temporally-detailed questions altogether. (Maybe not surprising: it's a total of 72 season × time-period × room combinations we were asking about.)

    For the people who skipped the temporally-detailed section, we need some way to impute a temporal pattern of window use. This was pretty easy to do using a hot-deck method, selecting from surveys that gave consistent answers to both questions, were similar to the subject in hours of no/low/medium/high ventilation, and were similar to the subject in reported hours that the house is unoccupied during the week and on weekends. I think that even specifying a reasonable model would be quite difficult. I'll take a look at Mice, but it's hard to picture that it will give substantially better results OR take a comparable amount of time to implement, much less both.
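[The donor-matching step Phil describes might be sketched as follows; this is an illustration in Python with hypothetical field names, not his actual code. Each subject with missing detail is matched to the nearest donor among the consistent, complete respondents, and the donor's temporal pattern is copied over.]

```python
import random

def hot_deck_impute(recipient, donors, match_keys, missing_keys, rng):
    """Fill recipient's missing_keys by copying them from the donor whose
    match_keys values are closest (sum of absolute differences); ties are
    broken at random among the nearest donors."""
    def dist(donor):
        return sum(abs(recipient[k] - donor[k]) for k in match_keys)
    best = min(dist(d) for d in donors)
    pool = [d for d in donors if dist(d) == best]
    donor = rng.choice(pool)
    return {**recipient, **{k: donor[k] for k in missing_keys}}

rng = random.Random(0)
# Hypothetical records: match on hours of medium ventilation and hours the
# house is unoccupied; impute the temporally-detailed window-open hours.
donors = [
    {"med_vent_hrs": 6, "unocc_hrs": 9, "morning_hrs": 2, "evening_hrs": 4},
    {"med_vent_hrs": 12, "unocc_hrs": 4, "morning_hrs": 5, "evening_hrs": 7},
]
subject = {"med_vent_hrs": 11, "unocc_hrs": 5}
filled = hot_deck_impute(subject, donors, ["med_vent_hrs", "unocc_hrs"],
                         ["morning_hrs", "evening_hrs"], rng)
```

[One appeal of hot-deck here is that copied values come from a real respondent, so the imputed temporal pattern is automatically internally consistent — something a regression model would have to be carefully constructed to guarantee.]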

    –Phil
