Percentage of missing observations, leading to a suggestion for a research paper based on grabbing available datasets

Jacob Felson writes,

Here is a question I was hoping you might address (on your blog?). It has to do with the distribution of missing data in a dataset. What is the relationship between the total percentage of missing observations and the number of observations left after listwise deletion? Given the percentage of missing observations in the data matrix, what is the expected number of observations after listwise deletion?

To take an arbitrary example, say you have 10 variables and 100 cases, 1000 total observations. 20% of the observations are missing. What is the expected number of observations after listwise deletion?

I was wondering whether anyone does research on this kind of thing. For example, what would a specific pattern of missing data (say, measured as the number of listwise-deleted cases relative to the expected number of such cases, given total missing observations) say about the administration of a dataset, and what would it say about the probability that the data are missing at random?

My reply: I wasn’t familiar with the term “listwise deletion,” but it seems to be the same thing as “complete-case analysis” (a more descriptive term, I think). Anyway, I’m not quite sure why the question arises since I’d think you could answer it directly with any particular dataset. For your example, if you have 100 units with 10 observations each and 20% missing, one extreme is that the same 20 units are missing all the observations (in which case you’d lose 20% with listwise deletion); in another extreme, everybody is missing at least one variable, in which case you can’t do listwise deletion at all! It depends on the pattern of what’s missing.
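To put numbers on the two extremes, here is a minimal Python sketch for the 100-by-10 example. The function names are hypothetical, and the third number rests on an assumption that is not in the question: that each cell is missing independently with the same probability. Real datasets will rarely behave that way, so treat it as a reference point, not an answer.

```python
import math


def complete_case_range(n_units, n_vars, n_missing):
    """Smallest and largest possible number of complete units
    for a fixed count of missing cells."""
    # Best case: missing cells are packed into as few units as possible.
    max_complete = n_units - math.ceil(n_missing / n_vars)
    # Worst case: missing cells are spread out, one unit hit per missing cell.
    min_complete = max(0, n_units - n_missing)
    return min_complete, max_complete


def expected_complete_independent(n_units, n_vars, p_missing):
    """Expected complete units ASSUMING every cell is missing
    independently with probability p_missing (an illustrative assumption)."""
    return n_units * (1 - p_missing) ** n_vars


lo, hi = complete_case_range(n_units=100, n_vars=10, n_missing=200)
baseline = expected_complete_independent(n_units=100, n_vars=10, p_missing=0.2)
print(lo, hi, round(baseline, 1))  # prints: 0 80 10.7
```

Under the independence assumption only about 11 of the 100 units would survive listwise deletion, far closer to the bad extreme than to the good one.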

If you want a more quantitative answer, one approach would be to trawl the internet for a few thousand datasets and, for each, count the number of units, number of observations, number of completely observed units, and number of missing observations, and then see what patterns arise empirically. It could make for an interesting paper.
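The per-dataset bookkeeping could look something like the sketch below. It assumes each dataset has already been loaded as a list of equal-length rows with `None` marking a missing value; the function name and the `expected_if_independent` field are my own illustration, the latter using independent cell-level missingness as a reference point.

```python
def missingness_summary(rows):
    """Count units, cells, missing cells, and complete units for one dataset.

    `rows` is a list of equal-length lists; None marks a missing value.
    """
    n_units = len(rows)
    n_vars = len(rows[0]) if rows else 0
    n_cells = n_units * n_vars
    n_missing = sum(value is None for row in rows for value in row)
    n_complete = sum(all(value is not None for value in row) for row in rows)
    p_missing = n_missing / n_cells if n_cells else 0.0
    return {
        "n_units": n_units,
        "n_cells": n_cells,
        "n_missing": n_missing,
        "n_complete": n_complete,
        # Reference point only: assumes cells go missing independently.
        "expected_if_independent": n_units * (1 - p_missing) ** n_vars,
    }


toy = [
    [1, 2, 3],
    [None, 2, 3],
    [None, None, 3],
    [1, 2, 3],
]
summary = missingness_summary(toy)
print(summary)
```

Comparing `n_complete` with `expected_if_independent` across many datasets is one way to summarize how clustered the missingness tends to be in practice.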

3 thoughts on “Percentage of missing observations, leading to a suggestion for a research paper based on grabbing available datasets”

  1. Andrew,

    I felt the driving force behind his question was an attempt to find out the "probability that the data are missing at random."

    That strikes me as a strange question, since you never really know whether the missing data are MAR. No matter how many datasets you collect, how could you empirically determine a relationship between the probability of MAR and the percentage of missing observations versus the number of cases deleted in a complete-case analysis?

    Of course, we would like to know how certain we can be in making the assumption of MAR, but how could we possibly address this question in general?

  2. Oh, yeah, I didn't notice that. I agree with you that it doesn't make sense to ask what is the probability that the data are missing at random. In practice, that probability is zero.

  3. I think the logic behind the question was this: If there is a relationship between two variables, and data is missing not at random in both variables, then there should be a relationship between the two variables with respect to whether data is missing or not.

    One might be tempted to say that if there doesn't appear to be a relationship between two variables with respect to whether data is missing, and there is a relationship between the two variables themselves, that means data is missing at random. But I can think of several ways this could happen:

    1. Data is missing at random in both variables

    2. Data is missing at random on one variable, but not the other

    3. There is a complex relationship between the two variables and whether data is missing that mimics the independence one would expect if 1 were true.
