Robustness checks are a joke

Someone pointed to this post from a couple years ago by Uri Simonsohn, who correctly wrote:

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks.

Simonsohn followed up with an amusing story:

To demonstrate the problem I [Simonsohn] conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness,” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined if respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01 (STATA code)

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.

Simonsohn titled his post, “P-hacked Hypotheses Are Deceivingly Robust,” but really the point here has nothing to do with p-hacking (or, more generally, forking paths).

Part of the problem is that robustness checks are typically done for the purpose of confirming one’s existing beliefs, and that’s typically a bad game to be playing. More generally, the statistical properties of these methods are not well understood. Researchers typically have a deterministic attitude, identifying statistical significance with truth (as for example here).
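
To see the general mechanism at work, here is a minimal simulation sketch (made-up data, not the GSS, and not Simonsohn’s Stata code): the “treatment” is a randomly assigned indicator with no real effect, so whatever coefficient chance hands you barely moves when covariates are added, because the covariates are independent of the indicator by construction.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1500
odd_id = rng.integers(0, 2, n)                 # randomly assigned; true effect is zero
covs = rng.normal(size=(n, 5))                 # stand-ins for age, education, etc.
y = (covs @ np.array([0.5, 0.3, 0.0, 0.2, 0.1])
     + rng.normal(size=n) > 0).astype(float)   # outcome ignores odd_id entirely

for k in [0, 2, 5]:                            # add more and more covariates
    X = sm.add_constant(np.column_stack([odd_id, covs[:, :k]]))
    fit = sm.OLS(y, X).fit()
    print(f"{k} covariates: coef on odd_id = {fit.params[1]:+.3f}, R2 = {fit.rsquared:.3f}")

In any single simulated dataset, the coefficient on odd_id is whatever noise produced, and it stays essentially unchanged as covariates are added even as R2 climbs, which is the pattern Simonsohn describes.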

25 thoughts on “Robustness checks are a joke”

  1. I understand the problem with this kind of robustness check (I wouldn’t expect that adding covariates would wipe out a relationship that is by definition a random pattern in the data), but it seems that what people call robustness checks often encompasses what I would think of as good statistical practice as well — looking at a more complete universe of possible modeling and data-processing decisions you could make, and showing how sensitive your findings are to those decisions. Of course that’s a problem if you’re only presenting the ones that show your findings are robust (or defining ‘sensitive’ as jumping back and forth over an arbitrary p-value threshold), but the general idea of kicking the tires on an analysis seems useful – or am I missing something? Also, this isn’t to discount Simonsohn’s preferred solution – I like the idea of conceptual replications; it just seems like an “and” as opposed to an “instead”.

  2. But isn’t a multiverse analysis (which I think you obviously endorse) just one large robustness check? If so, how do we distinguish robustness checks that are a “joke” from those that aren’t?

    • Any robustness check that shows that p remains less than 0.05 under an alternative specification is a joke. Any Bayesian posterior that shows the range of possibilities *simultaneously* for all the unknowns, and/or that includes alternative specifications compared *simultaneously* with one another, is not a joke.

      The basic function of p values is to manufacture certainty: p is less than 0.05 for the hypothesis that x=0, therefore we treat x as if it were equal to its mean value in the sample (a delta function). This is just how p values are *actually* used, whether it’s what they were intended to do or not.
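
      As a small illustration, with made-up numbers rather than anything from the GSS: the same estimate looks very different as a “significant, so treat it as fixed” summary than as a posterior that keeps the uncertainty.

      import numpy as np
      from scipy import stats

      est, se = 0.11, 0.04                                  # made-up estimate and standard error
      print("p =", 2 * stats.norm.sf(abs(est / se)))        # < 0.05, so x "is" 0.11 from here on

      # A flat-prior normal posterior for the same quantity keeps the whole range:
      draws = np.random.default_rng(0).normal(est, se, 10_000)
      print("posterior 95% interval:", np.percentile(draws, [2.5, 97.5]))
      print("Pr(effect < half the point estimate):", (draws < est / 2).mean())

      The point estimate clears the threshold, but a nontrivial share of the posterior sits on effects half that size or smaller.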

    • Anon:

      It’s all about the goal. What makes robustness checks a joke is that they’re done for the purpose of protecting a claim or confirming a hypothesis. They’re not done in an open-minded spirit of wanting to understand uncertainty.

  3. Can “soft” sciences ever be made robust? Hard sciences impose hard constraints on what hypotheses are plausible. While the above example shows an implausible hypothesis, psychology and other social science disciplines seem to impose very loose constraints on what hypotheses are plausible. The foundations of these disciplines just seem soft. Could it be due to the inherent malleability of the subject matter?

  4. I second Eric Rasmusen’s point.

    I think of robustness checks as the authors pre-answering questions that may be occurring to a reader. “You may be wondering if the results depend on the log transformation. They don’t. Table 34 shows the results are little changed when the untransformed variable is used instead.”

    By the time you’ve presented a paper a few times and responded to referees, most questions about the paper have already been raised. Why not answer them right in the paper, in a section that includes a grab bag of alternate specifications? Sometimes such sections point out important limitations.

    I don’t see how robustness checks can be criticized in general terms, because they are usually a hodgepodge of questions that depend on the specific topic.

    You might be able to critique classes of robustness checks. Simonsohn is kind of doing that here. He is showing that adding other explanatory variables does not always get rid of a certain class of bad analysis. This is useful knowledge. On the other hand, this type of robustness check can be important when you think there may be an omitted variable that explains the result. For instance, the Simonsohn data may have been gathered in two different locations. The first assigned odd IDs while the second assigned even IDs, and the first was next door to the World Astrological Society.
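
    As a sketch of that hypothetical confound (made-up numbers, with imperfect rather than perfect overlap between site and ID parity so that both coefficients stay identified): the site next to the astrologers mostly hands out odd IDs, and adding site as a covariate removes the spurious odd-ID “effect.”

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 2000
    site = rng.integers(0, 2, n)                              # 0 = first location, 1 = second
    odd_id = rng.binomial(1, np.where(site == 0, 0.9, 0.1))   # ID parity mostly set by site
    y = rng.binomial(1, np.where(site == 0, 0.40, 0.25))      # site 0 reads more horoscopes

    for cols, label in [((odd_id,), "odd_id only"), ((odd_id, site), "odd_id + site")]:
        fit = sm.OLS(y, sm.add_constant(np.column_stack(cols))).fit()
        print(f"{label:15s}: coef on odd_id = {fit.params[1]:+.3f}")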

  5. This is a generalization far too far.

    “Robustness checks” are alternate specifications that relax or modify assumptions that should not substantially affect conclusions. It’s not a “joke” to report such checks; it’s an important part of empirical research. To choose an example completely at random, Gelman, Fagan, and Kiss (2007) report multiple alternate specifications of their main model:

    “In addition to fitting model (1) as described earlier, we consider two forms of alternative specifications: first, fitting the same model but changing the batching of precincts, and second, altering the role played in the model by the previous year’s arrests. We compare the fits under these alternative models to assess sensitivity to details of model specification.”

    Giving an example, such as Simonsohn’s, of a specific case in which a specific type of robustness check fails to reveal a specific type of problem in no way invalidates robustness checks generally — any statistical method can be abused.

    • Chris:

      Fair enough, and good catch finding that example in one of my own papers. The title of the above post is an exaggeration. A more accurate title would be, “Robustness checks can be a joke, especially if they are used for confirmation rather than exploration.” I do think, though, that this is more than just the general problem that statistical methods can be abused. The problem as I see it is that robustness checks are supposed to be for exploration but are typically used for confirmation.

      Maybe another way to put it is: As long as we recognize that robustness checks are typically used for confirmation, we can interpret them in that way. Thus, instead of taking a robustness check as evidence that a claimed finding is robust, we should take a robustness check as providing evidence on particular directions the model can be perturbed without changing the main conclusions.

      In any case, thanks for keeping me honest on this one!

      • Put more cynically, a robustness check that succeeds tells us that the researchers were capable of finding some alternative specification that sounded different enough to a reviewer that it counted as “robustness” while maintaining the main conclusion in the way that we knew for sure it would ahead of time.

        p(robustness check succeeds | paper is published) = 1

        • I was going to put a qualifier in “robustness check as providing [weak?] evidence on particular directions the model can be perturbed without changing the main conclusions” but you pointed out cases where it would be 0 (post-selection).

          It’s primarily the intentions that matter, and we are usually just stuck guessing them.

      • Almost twenty years ago (!), Andrew and I wrote a paper, with co-authors David Krantz and Chia-Yu Lin, in which we made some assumptions about the relationship between radiation exposure and cancer risk. Our default assumption was the standard linear-no-threshold relationship, but some researchers think low levels of radiation exposure may be good for you, or at least not bad for you, so we also calculated the numbers under the assumption that there is no risk below a specific threshold, and worked out how that would change the decisions we were analyzing. That’s a very useful kind of robustness check, but, notably, we were not trying to support a specific scientific hypothesis; we were just trying to check the sensitivity of our recommendations to one of our less-certain assumptions.
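
        In that spirit, here is a minimal sketch (entirely made-up numbers, not those from the paper) of how an act-or-not decision can flip between the linear-no-threshold assumption and a no-risk-below-a-threshold assumption:

        import numpy as np

        exposure = np.array([0.5, 1.0, 2.0, 4.0, 8.0])              # made-up exposure levels
        slope, threshold = 0.002, 2.0                               # made-up dose-response inputs
        act_if = 0.005                                              # made-up risk level that triggers action

        risk_lnt = slope * exposure                                 # linear, no threshold
        risk_thr = slope * np.clip(exposure - threshold, 0, None)   # zero risk below the threshold

        for e, r1, r2 in zip(exposure, risk_lnt, risk_thr):
            print(f"exposure {e:3.1f}: act under LNT = {r1 > act_if}, "
                  f"act under threshold model = {r2 > act_if}")

        In this toy version the decision only changes over a middle range of exposures, which is the kind of thing such a sensitivity check is meant to reveal.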

    • “Giving an example, such as Simonsohn’s, of a specific case in which a specific type of robustness check fails to reveal a specific type of problem in no way invalidates robustness checks generally — any statistical method can be abused.”

      Jolly good sentence there m’boy. Well done, I say!

  6. I think what you mean to say is that robustness checks that are not done in good faith are a joke. But then, so are all statistical methods.

    Andrew, you’ve written articles that said “we tried method X, we felt the model was poor, so we switched to method Y.” As a statistician, I think it’s really good to be honest with your methods, so I applaud. I also think it totally makes sense that maybe your first model didn’t pan out as you expected; it’s good to go back and check whether maybe you were looking at the data in the wrong way (why waste huge piles of data because things didn’t fall out exactly as you had hoped?).

    But at the same time, the second I doubt your intentions in doing so, faith in the analysis goes out the window.

  7. I agree with the commenters who say that robustness (or sensitivity) checks are generally good…but I would not call what Simonsohn has done a robustness check. The model is causal (or “explanatory”), right? Priming someone with an odd ID triggers some neural pathway that compels them to report reading horoscopes (or the assignment makes the “report reading horoscope” neural pathway more likely to be stimulated).

    The only correct model for estimating the effect of odd-ID assignment is the model with everything that causally affects “reporting reading horoscopes.” Every other model results in a biased estimate of the effect of odd-ID assignment. Tinkering around with covariates is not a robustness check. Adding a covariate can move the expected (biased) value of the coefficient either toward the true value or away from it (depending on the covariance structure and the effects of the covariates, etc.). Just because some tiny fraction of the confounders has been added to the model doesn’t mean there is not a missing confounder that wildly changes the coefficient. I’d even say that, given this is a hypothesis with effectively zero understanding of the causal system, and with an effectively infinite number of confounders, adding a few covariates does nothing to decrease the probability that there is a missing confounder that would wildly change the coefficient.
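
    A quick made-up simulation of the toward-or-away point: with an unobserved confounder u still in play, adding an observed covariate z can move the estimated effect of x toward the truth (zero here) or away from it, depending on what z is.

    import numpy as np
    import statsmodels.api as sm

    def fit_coef(y, *cols):
        """OLS coefficient on the first column, after a constant."""
        X = sm.add_constant(np.column_stack(cols))
        return sm.OLS(y, X).fit().params[1]

    rng = np.random.default_rng(3)
    n = 100_000
    u = rng.normal(size=n)                        # unobserved confounder
    e = lambda: rng.normal(size=n)                # fresh noise on each call

    # Scenario A: z is a noisy proxy for the confounder, so controlling for it helps.
    z = u + e(); x = u + e(); y = 2 * u + e()     # true effect of x on y is zero
    print("A:", round(fit_coef(y, x), 2), "->", round(fit_coef(y, x, z), 2))

    # Scenario B: z causes x but not y (instrument-like), so controlling for it amplifies the bias.
    z = e(); x = u + z + e(); y = 2 * u + e()
    print("B:", round(fit_coef(y, x), 2), "->", round(fit_coef(y, x, z), 2))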

  8. I feel the term “Robustness” is used quite vaguely in applied statistics. It sometimes refers to data perturbation, or to influence functions in the mathematical sense, or to input attacks in the machine-learning sense, and sometimes to prior specification or model construction. OK, a prior is often just more data, so these views overlap. I think these concepts will be better understood in a predictive paradigm.

  9. No paper has ever been published with a failed robustness check. Failed checks are either (a) suppressed; (b) converted into some less preferred hypothesis; or (c) referred to with the weasel words “qualitatively similar results.” Thus, non-joke robustness tests are reduced to those that actually stopped a paper from being published or that substantially modified the method. Given that papers generally present final results, and not the path by which the authors got there, we can’t know how often this happens.

  10. My favourite type of “robustness check” is the significance test for non-normality. The logic is: 1) we tested our result using a t-test and got significance; 2) we want to make sure the assumptions of the test are satisfied, so we tested the null hypothesis that the assumptions were satisfied, did not get significance, and concluded they were satisfied. It is funny because if you got significance in the first place while measuring a noisy underlying effect with a small sample, then you likely failed to reject normality simply because you had too few observations to detect a deviation from it. I think I have heard this somewhere (maybe here?) called the “Law of Small Numbers”: our sample wasn’t big enough to show non-normality, so it must have been normal!
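
    A quick simulation of that circularity, with made-up data: for small samples from a clearly non-normal distribution, the normality pre-test usually “passes,” not because the data are normal but because n is small.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    for n in [10, 30, 100, 1000]:
        rejections = sum(
            stats.shapiro(rng.gamma(4, size=n)).pvalue < 0.05   # skewed, non-normal data
            for _ in range(1000)
        )
        print(f"n = {n:4d}: normality rejected in {rejections / 10:.0f}% of samples")

    The rejection rate climbs with n for the very same distribution, so the pre-test is mostly measuring sample size, not normality.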

  11. I do a lot of estimation work for which robustness checks are absolutely essential. For instance, I recently did a report in which I estimated the potential impact of AIDS morbidity on child labor at national, regional and global levels via the pathway of excessive household chores. I will be the first to admit this is a murky enterprise, based on many relationships that are only dimly understood. Nevertheless, there was a demand for these numbers, and it’s also true that there is a range of likely effects, with estimates below or above that range being unlikely. It’s not a pointless exercise. But how to incorporate uncertainty? Just on the AIDS-chore effect, we have a number of studies based on household surveys, some with time-use questions. They come up with different impacts, and there is a lot of uncertainty over their reliability and generalizability. I am very uncomfortable with, say, any global point estimate drawing on this literature. But robustness comes to the rescue: I can do my estimates using different impacts from local studies and then report how the estimates change depending on which studies are given more weight, which countries we are generalizing to, etc.

    For me, the most interesting result is how the pattern of estimates of interest is related to the pattern of various parameter assumptions. That’s more valuable than any individual point estimate. The hardest part is to get acceptance of uncertainty-preserving study output.

    It is true that reporting multiple parameter options within a given specification is not the same as varying the specification, and in many circumstances true robustness demands that as well. It is also true that cherry-picking particular robustness checks for the purpose of building a moat around your desired result is bad practice.
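
    A minimal sketch of that reporting style, with all numbers made up (the real inputs would come from the local studies and the coverage decisions): carry the whole grid of assumption choices through to the final figure and report the pattern, rather than a single point estimate.

    import itertools

    study_effect = {"study_A": 0.8, "study_B": 1.5, "study_C": 2.4}   # made-up hours/week per affected child
    coverage = {"narrow": 0.4, "broad": 0.7}                          # made-up share of countries generalized to
    affected_children = 5.0e6                                         # made-up headcount

    totals = []
    for (study, eff), (cov_name, cov) in itertools.product(study_effect.items(), coverage.items()):
        total = eff * cov * affected_children
        totals.append(total)
        print(f"{study} x {cov_name:6s}: {total / 1e6:5.1f}M hours/week")

    print(f"range across assumptions: {min(totals) / 1e6:.1f}M to {max(totals) / 1e6:.1f}M")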

    • Peter:

      I agree that it can be a good idea to examine the sensitivity of conclusions to analysis choices. The “robustness checks” that I don’t like are the ones that are done not to explore but to back up a conclusion that has already been made.

  12. I don’t think Uri’s example is as damning as it is being portrayed. The robustness check actually gives you valuable information: the relationship between respondent ID and horoscope reading is not being spuriously generated by one particular specification of the estimating function. Oftentimes, that is important information!

    The takeaway should be that papers (and reviewers) should be careful to ensure that a robustness check addresses a specific purpose. Oftentimes the purpose is left ambiguous to give the impression that the results are ‘robust’ to every possible objection (e.g., “..and the relationship is robust across a range of specifications”). I agree that can be misleading, but not every empirical research problem is p-hacking.
