7 ways to separate errors from statistics

Betsey Stevenson and Justin Wolfers have been inspired by the recent Reinhart and Rogoff debacle to list “six ways to separate lies from statistics” in economics research:

1. “Focus on how robust a finding is, meaning that different ways of looking at the evidence point to the same conclusion.”

2. Don’t confuse statistical with practical significance (see the sketch at the end of this list).

3. “Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists.”

4. “Don’t fall into the trap of thinking about an empirical finding as ‘right’ or ‘wrong.’ At best, data provide an imperfect guide.”

5. “Don’t mistake correlation for causation.”

6. “Always ask ‘so what?’”
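
To illustrate #2 with a quick sketch of my own (not from Stevenson and Wolfers): with a big enough sample, even a practically negligible effect comes out highly statistically significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A practically negligible difference in means: 0.01 standard deviations.
n = 1_000_000
control = rng.normal(loc=0.00, scale=1.0, size=n)
treated = rng.normal(loc=0.01, scale=1.0, size=n)

# At this sample size the tiny effect is nonetheless "statistically significant."
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"estimated difference: {treated.mean() - control.mean():.4f}")
print(f"p-value: {p_value:.1e}")
```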

I like all these points, especially #4, which I think doesn’t get said enough. As I wrote a few months ago, high-profile social science research aims for proof, not for understanding—and that’s a problem.

My addition to the list

If you compare my title above to that of Stevenson and Wolfers, you’ll find two differences. First, I changed “lies” to “errors.” I have no idea who’s lying, and I’m much more comfortable talking about errors. Second, I think they missed an even better, more general way to find mistakes:

7. Make your data and analysis public.

This is the best approach, because now you can have lots of strangers checking your work for free! This advice is also particularly appropriate for Reinhart and Rogoff because, according to various reports (see here and here), it was years before they made their data available to outsiders. Nearly three years ago (!), Dean Baker wrote a column entitled, “It Would Be Helpful if Rogoff and Reinhart Made Their Data Available.”

Perhaps “the risk of forced disclosure” (as Keith O’Rourke puts it) will motivate researchers to be more careful in the future.

Your additions?

I told Wolfers I was going to link to his list and add my own #7. He replied that we’re probably missing #8, 9, and 10. In the comments, feel free to add your favorite ways to separate errors from statistics. Phil already gave some here.

36 thoughts on “7 ways to separate errors from statistics”

  1. Strange that they missed 7, since they discuss finding an error in the code. Maybe they thought it goes without saying. Does your 7 include making code public?

  2. #8 Whenever possible, use code, packages, and techniques with a large user base and/or a long history. Use stable versions, not the latest beta.

    An esoteric, specialized package may have a state-of-the-art method, speed, fancy visualization, or some such “cool” attraction. But resist it: you are less likely to get burnt by unknown software bugs. How much can you trust code that has, say, only a hundred users or has been around for only a year?

    • Like Stan? :-)

      I would modify this to say something like “Try to replicate the findings using an alternative software package, especially if your primary package is new and fancy but less debugged.” (A small cross-check sketch appears below.)

      Sometimes the advantages of a new package are so overwhelming that a result just couldn’t be gotten without it. I’m thinking of things like Lattice Boltzmann fluid dynamics packages that can do complex geometries, multi-phase fluid flow, and model non-Newtonian fluids with heat transfer, surface wetting, etc. It’s just not reasonable to run Navier-Stokes solvers on some of these problems.

      In the realm of more standard statistics maybe you come up with a really high quality statistical model that nevertheless winds up being multi-modal and mixes very slowly unless you have the fancy HMC sampler from Stan.

      If Stan or LB or whatever is 6 orders of magnitude faster than the alternative, you really don’t have any alternative!
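
      A minimal sketch of that cross-check idea, using made-up data rather than anything from this thread: fit the same simple regression with two independent routines and confirm they agree.

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)

      # Simulated data with known intercept 2 and slope 3.
      x = rng.normal(size=500)
      y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=500)

      # Route 1: least squares via numpy's lstsq.
      X = np.column_stack([np.ones_like(x), x])
      coef, *_ = np.linalg.lstsq(X, y, rcond=None)

      # Route 2: the same regression via an independent scipy routine.
      fit = stats.linregress(x, y)

      # The two fits should agree to numerical precision; a mismatch would
      # point to a bug in the data handling or in one of the routines.
      print("lstsq:      intercept=%.4f  slope=%.4f" % tuple(coef))
      print("linregress: intercept=%.4f  slope=%.4f" % (fit.intercept, fit.slope))
      ```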

  3. I am curious to hear more about *when* to make a specific data set publicly available. Note that I absolutely believe that it should be public once published.

    Are there issues of getting “scooped” in the working paper stage? Maybe less likely for a working paper, where one can show that their version was out first. But with so many researchers seeing the publication process as a game and competition, what is the proper balance between being able to provide one’s own findings AND one’s data for others to check publicly?

    For the “good of the world”, I would say as soon as the data is created. But this is obviously impractical (and a rule no one will follow: after spending months putting a data set together, not many people are going to send it along to others before being able to use it).

    As an extreme example, my wife, a behavioral neuroscientist, tells me that most presenters at her conferences only present previously published papers for fear of being scooped. She has already read the paper, so she finds little added value in attending many of these presentations. There are special small conferences, held with agreements that no one will talk about what they saw, that encourage more sharing of results before publication. She particularly enjoys these, but they are less common (and not everyone presents *new* findings).

    With those sorts of backward incentives in research and publication, what is the best way to proceed?

    (Maybe this is best for a different post, but still relevant here. I think you may have a previous post about this–something along the lines of being well ahead of anyone else on the given research once you make the data public, so they should not be able to catch up.)

    • Personally, I think once published is a good time to release your data.

      The working paper stage ought to be much shorter anyway. I don’t know why economics has this aberration of papers spending years in limbo at the working-paper stage.

        • I am curious whether the long working paper stage is a result of slow journal turnaround or of authors just throwing up garbage before it is finished (perhaps even so the paper doesn’t get scooped and one can have an “I did it first” claim).

        • I think it is the former. Most Working Papers definitely don’t seem like garbage. If I saw one cold I probably wouldn’t guess it was a working paper and not the final product.

          • Well, I was of course exaggerating by using the word “garbage”. My experience with the econ journals is very mixed. I always find it interesting that this is the standard, yet other disciplines have never even heard of a working paper.

          • Different fields seem to have different implicit rules. A statistician will publish 20 papers a year. An engineer might publish 100 a year. A psychologist might publish 5. An economist might publish just one or two, but of course puts much more effort into each one.

  4. Like charts-n-things, I’d love to see more disclosure of false starts and dead ends. In the course of producing “the paper”, what else did the authors find (or not find) that shaped their conclusions?

    • Yeah, I would also really love to see a more honest description of the exploration process, rather than the assert-and-defend approach that Andrew decries. Oh well.

  5. My title:

    7 ways to separate errors from scientific inference (because not all science is statistics).

    My suggestions:

    8. Simplicity is a virtue. Keep it simple.
    9. Everyone understands research design; not everyone understands statistics. Design trumps analysis.

    PS
    Point 1 is very popular, but fishy in my view. There are robust spurious correlations. Here is an alternative:

    1. Write down your assumptions and test their implications (I think this is a more internally coherent notion of robustness). Use DAGs.
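
    A toy sketch of what “test their implications” can look like (my example, not Fernando’s): simulate data from the chain X -> Z -> Y and check the conditional independence of X and Y given Z that this DAG implies.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n = 50_000

    # Simulate from the assumed chain DAG:  X -> Z -> Y.
    x = rng.normal(size=n)
    z = 0.8 * x + rng.normal(size=n)
    y = 0.8 * z + rng.normal(size=n)

    def partial_corr(a, b, given):
        """Correlation of a and b after linearly regressing each on `given`."""
        resid_a = a - np.polyval(np.polyfit(given, a, 1), given)
        resid_b = b - np.polyval(np.polyfit(given, b, 1), given)
        return np.corrcoef(resid_a, resid_b)[0, 1]

    # The DAG implies X and Y are correlated marginally ...
    print("corr(X, Y)     = %.3f" % np.corrcoef(x, y)[0, 1])
    # ... but conditionally independent given Z; a partial correlation far
    # from zero would be evidence against the assumed structure.
    print("corr(X, Y | Z) = %.3f" % partial_corr(x, y, z))
    ```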

    • 8: as simple as possible, but not simpler ;-)
      (usually attributed to Einstein, but see here http://quoteinvestigator.com/2011/05/13/einstein-simple/)

      9: I don’t think “everyone understands research design” is true at all. Maybe everyone *thinks* they understand it. Good research design should have statistical content: randomization, subsampling, controls, measurement limitations, replication. These are all routinely screwed up in biology, as far as I can tell.

  6. One of the biggest problems in social research is how to turn theoretical concepts into quantitative empirical indicators. For instance, well-being and income are not the same thing, although income is frequently used as a measure of well-being.
    I would add:

    #8 Don’t confound your data and indicators with the complex social phenomena that were theorized.

  7. To reinforce your #7, the data set used by R&R in 2010 had errors. The same data set was used by the recent paper analyzing R&R’s 2010 paper. Overall, the data look more like the critical paper’s version, but much of the brouhaha would have been avoided if R&R had simply noted on their website what they left out.

    Mark Thoma noted in a post on his blog that he tried to correct calculation errors in a journal, but they refused to do this because the original author (who Thoma notes was much better known) said the errors weren’t sufficiently material to the overall findings. But that means people will cite this particular section, use this particular work … and will be wrong, because there is no correction noted in the public record.

    • A commenter to Mark Thoma’s blog noted that the number of errata published in AER was two (over some unspecified time period) whereas the number published in Science (one presumes over a comparable period but it’s not stated) was 2788, with a comparable number in Nature.

      There’s no good excuse for this, unless you presume that papers in AER are virtually error-free, and we know this not to be the case given the egregious example of R&R and the example that Mark Thoma tried to correct.

      • I didn’t want to say that as bluntly. Thanks.

        I’ve tried to point out that economics has a professional problem with non-disclosure of later results, etc. Given our electronic era, it should be a professional standard to note when your data has changed or when you or someone else has caught an error in calculation. But mention this and you get accused of attacking the people or, as with R&R, the basic conclusions, so you get accused of saying idiotic things like “debt doesn’t matter.” They need to develop professional standards.

        I can’t imagine people not disclosing new or changed data in my areas of expertise. That would be considered near or actual fraud depending on the context. Corrections are supposed to help push things forward.

  8. Regarding errors, I offer a simple nonparametric test:
    count the number of errors and classify them as helping a hypothesis, being neutral, or weakening it.
    Real errors seem likely to be distributed roughly evenly in either direction.
    Of the missing/wrong data in this case, what’s the mix?
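
    A sketch of how that could be run once the errors have been classified by hand (the counts below are invented for illustration): under the null hypothesis of innocent mistakes, the helping/hurting split should look like fair coin flips.

    ```python
    from scipy import stats  # binomtest requires SciPy >= 1.7

    # Invented counts: of the errors found, how many pushed results toward
    # the paper's hypothesis and how many pushed against it? (Neutral
    # errors are set aside.)
    helping, hurting = 9, 1

    # Under the null of honest errors, each error's direction is a coin flip.
    result = stats.binomtest(helping, n=helping + hurting, p=0.5)
    print(f"two-sided p-value: {result.pvalue:.3f}")
    # A small p-value suggests the errors lean suspiciously in one direction.
    ```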

  9. My no. 8: if you are analysing data you did not generate, then try to talk to the person who did generate the data, as their intention in recording the data is not always the same as your interpretation in the analysis.

    I realise this is not in the spirit of understanding other people’s analyses, but it is worth remembering: it is very easy to think that a particular figure is an empirical result, only to find later that it was a prediction from a model which is not included in the data.

    • Tom:

      I think that applies. I use the phrase “what I really need to understand is how the data came to be: how it was generated”.

      How would one do a sensible analysis without a (mis)understanding of how the data came to be?

      (This is one of the things DAGs help capture, in some sense: how nature generated the Xs that are usually assumed fixed and known.)

  10. n+1. Draw pictures and include confidence intervals.

    n+2 (similar to Tom’s no. 8). Get people who have intimate knowledge of your data to look at and criticize your finding.

    n+3. I don’t know how to phrase this as a caution, but I’ve often read reports that take two time series, call one a numerator and another a denominator, call their quotient a “rate,” and make inappropriate inferences.
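
    One made-up example of how that can go wrong (my illustration, not the commenter’s): if this period’s events are really generated by last period’s denominator, the naive quotient understates the true rate whenever the denominator is growing.

    ```python
    import numpy as np

    # Hypothetical series: events in month t come from the population counted
    # in month t-1, with a true per-capita rate of 2%.
    true_rate = 0.02
    population = np.array([1000.0, 2000.0, 4000.0, 8000.0, 16000.0])
    events = true_rate * np.concatenate([[np.nan], population[:-1]])

    # The naive "rate": this month's numerator over this month's denominator.
    naive_rate = events / population

    # During growth the naive quotient is biased low by the growth factor,
    # which invites the wrong inference about the underlying rate.
    print("naive rate by month:", np.round(naive_rate, 4))
    print("true rate:          ", true_rate)
    ```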

  11. My number whatever is: don’t accept a statistical result without one or more well-conducted case studies to illuminate the mechanism the statistics are testing for. In the case of R&R, any broad correlation between debt ratios and GDP growth, whether interpreted as causation from the first to the second, from the second to the first or something else, should be investigated in the experience of a single country which is believed to share pertinent characteristics with the countries people care about currently. Look for the mechanisms at work; don’t fall back on unexamined assumptions.

    Of course, case studies in themselves only document mechanisms in specific cases; without larger samples to test on, you can’t tell whether they generalize or not.

    (Hedge: this suggestion doesn’t always work! In the case of public health, sometimes you have to go with epi data before you have clinical/toxicological/etc. case studies that reveal anything. Effects may be too small at the case level to pick up until you know a lot about what you’re looking for.)

  12. * If you use a model, explain which parts of the model cause the result to appear. As far as possible, explain how your modelling assumptions influence your result.

  13. I would emphasize the flaws and fallacies that can arise through a host of “selection effects”: cherry-picking, data-dependent subgroups, data-dredging, hunting with a shotgun, multiple testing, multiple modeling, p-value hacking, looking for the pony, double-counting, ad hoc saves, etc. (A toy simulation of the multiple-testing point appears below.) http://errorstatistics.com/2013/03/01/capitalizing-on-chance/

    I was impressed at how good a list of flaws arose from the Tilburg report on Stapel (~p. 57):
    http://www.tilburguniversity.edu/nl/nieuws-en-agenda/finalreportLevelt.pdf
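
    A toy simulation of the multiple-testing point (my own sketch, not from the linked posts): with no real effect anywhere, scanning enough subgroups still turns up “significant” findings at roughly the nominal rate.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    # 100 subgroup comparisons in which the null is true in every single one.
    n_tests, n_per_group = 100, 50
    p_values = np.array([
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_tests)
    ])

    # Cherry-picking the "significant" subgroups still yields findings.
    print("tests with p < 0.05:", int((p_values < 0.05).sum()), "of", n_tests)
    print("smallest p-value:   ", round(float(p_values.min()), 4))
    ```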

    • @Mayo

      I went looking for the pony in the linked blog entry. Alas I did not find one.

      I was hoping to find a definition of each of the terms you give. They sound interesting.

  14. So it seems like the linked list is about things that non-statistics people can do when evaluating a study they see. In that case, I think that Fernando’s suggestions are key: figure out what they did, and think about what it is in the world that this research is actually measuring. If you break down how people are estimating things, smart readers can engage critically even if they don’t know a lot of stats.

    But a lot of these comments seem like suggestions for what researchers can do to make sure they don’t make mistakes in the first place. For that, I’d just add:

    Replicate previous work. Most of the (admittedly few) projects I’ve worked on involve using datasets other people have used for projects somewhat similar to mine. So if I am looking at the effects of, say, infant mortality on household investment decisions, I’ll make sure that the infant mortality rates I estimate for my right hand side match (or almost match) published numbers that used the same data. It’s like a mini-replication to make sure I’ve got the data cleaned/organized right. If you are extending a previous analysis or bringing a new estimator to the data, be sure you can replicate the previous findings.

    If covariates/fixed-effects/modelling assumptions change your point estimates (that is, if they are not robust), make sure you can explain why that is happening. What is your RHS variable of interest correlated with such that inclusion/exclusion of those covariates is changing your point estimate? If all you can say is “adding regional fixed effects makes the point estimate change sign and become significant”, you should probably be able to explain why that is and why the “within” estimator (or whatever the FEs do) is to be preferred. But I’ve found that papers that throw a few different smart estimators at something, note the difference in point estimates, and smartly discuss what we learn from these differences are often really good papers.
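
    A toy sketch of that last point (my own made-up setup): when the variable of interest is correlated with a covariate that also affects the outcome, adding or dropping that covariate moves the point estimate, and the simulation shows exactly why.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000

    # Hypothetical setup: x (the variable of interest) is correlated with a
    # covariate w, and w independently affects the outcome y.
    w = rng.normal(size=n)
    x = 0.7 * w + rng.normal(size=n)
    y = 1.0 * x + 2.0 * w + rng.normal(size=n)  # true effect of x is 1.0

    def ols(columns, outcome):
        """Least-squares coefficients for a design with an intercept."""
        X = np.column_stack([np.ones(len(outcome))] + list(columns))
        return np.linalg.lstsq(X, outcome, rcond=None)[0]

    # Omitting w loads part of its effect onto x; including w recovers ~1.0.
    print("coef on x, w omitted:  %.3f" % ols([x], y)[1])
    print("coef on x, w included: %.3f" % ols([x, w], y)[1])
    ```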

  15. “3. Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists. If the author can’t explain what they’re doing in terms you can understand, then you shouldn’t be convinced.”

    This is exactly the trick Steven D. Levitt pulled when he and I debated his popular abortion-cut-crime theory in Slate in 1999. I pointed out that if he had done a simple reality check of looking at 14-17-year-old homicide offending rates year by year, he would have seen that the first cohort born after legalized abortion had homicide rates triple those of the last cohort born before legalization. His response was, in effect: Well, I did a complex study on all 50 states and you just looked in a simple fashion at national data, so I win:

    http://www.slate.com/articles/news_and_politics/dialogues/features/1999/does_abortion_prevent_crime/_2.html

    And, hey, it worked great for Levitt, and he rode it to becoming a celebrity six years later. Of course, six months after “Freakonomics” hit the bestseller charts, Christopher Foote and Christopher Goetz demonstrated that Levitt had messed up his statistical programming, which was why his state-level analysis couldn’t be reconciled with the national-level analysis. But even that didn’t hurt the Freakonomics brand much.

    • Steve:

      I just read the exchange on Slate. At one point, Levitt writes, “It is so refreshing to have someone challenge our study based on the facts instead of the knee-jerk reactions I have been hearing and reading about in the press the last few weeks.” But I don’t know if he’d have such a discussion anymore; at least I don’t see that sort of give-and-take on his blog. The uncharitable view of this is that now that he’s a celebrity, he doesn’t need to engage with criticism. But the charitable view is that, over the years, he’s realized that on average he gets better feedback from his colleagues than from strangers. I can sympathize with such a view. I value my blog commenters and learn a lot from them, but I also waste a lot of time explaining basics to people (for example, the commenter the other day who seemed to think Bayesian inference is some sort of fraud or mass delusion). By not engaging his critics anymore, Levitt loses various opportunities to learn, but maybe on balance he’s learning more from his daily interactions with his University of Chicago colleagues. Meanwhile, what happens when I spend 15 minutes following a link and writing this blog comment? I learn something (I develop some epsilon more insight into scientific interactions), but that’s 15 minutes I’m not spending on research or in hallway conversations with professors of statistics and political science.

  16. I would add: “Look for simple reality checks that you can perform on your conclusions, or encourage others to perform them (in the manner of Einstein listing ways to falsify his General Theory of Relativity).”

    The big problem with Malcolm Gladwell, for instance, is that he’s largely incapable of coming up with reality checks on the theories that promoters feed him.

  17. I’m not sold on #3. By any standard, using a Bayesian aggregate of individual state-by-state polls to predict the winner of the presidential election is a “high-powered statistical technique”, much more so than looking at national public opinion polls. But I’m probably going to continue to get my election forecasts from Nate Silver.

    • Gray:

      Take a look at the second part of item #3: “as a bludgeon to silence critics who are not specialists.” I agree with Stevenson and Wolfers that those of us who use high-powered statistical techniques should do the research required to make the results of these methods as clear as possible. If you look at my own work, I put a lot of effort into understanding what my methods are doing with the data. I don’t just throw regression coefficients and posterior distributions at people and try to stun them into submission.

      • It can be awfully hard to judge intent, though. I don’t think that the article they linked to fits that description; it was flawed (deeply flawed, even) but, I thought, sincere in trying to get at causality and used standard tools for that literature.

        I agree with you about what we should do (the high-powered statistics users), but don’t think that it translates into a useful decision rule for, say, our undergraduate students; especially since I think it will be interpreted as, “don’t trust fancy statistics.”

          • True, and so I think it best never to openly suggest that, but always to suspect it (or pretend it is true) until it’s ruled out.

          I can assure you it happens often, and (unfortunately) I have been involved, though always arguing against it (not always successfully).

          A couple of actual examples:
          Question: Would an unpaired t.test have less power here?
          Answer: Yes, a lot less.
          Bad decision: We will use that for the test the annoying journal reviewer requested.

          Question: Do you know a reference for using heterogeneity tests to decide if a random effect is needed?
          Answer: Yes, so and so, but you know the reasoning is flawed, right?
          Bad decision: Yes, but I just want something to get the paper past a journal reviewer, and that would work.
          (Unfortunately, it looks like they afterwards set an example that was followed in a wide community of clinical researchers for years.)

          And one I had no involvement in but heard of afterwards: “So those references you gave in response to X’s questions in the department seminar were not real?” “Yup, just made them up on the spot. I think he will give up looking for them; the journal names I gave are not in this university’s library.”

