Enough with the replication police

Can’t those shameless little bullies just let scientists do their research in peace?

If a hypothesis test is statistically significant and a result is published in a real journal, that should be enough for any self-styled skeptic.

Can you imagine what might happen if any published result could be questioned—by anybody? You’d have serious psychology research being grilled by statisticians, and biology research being called into question by . . . political scientists?

Where did that come from? I don’t ask my hairdresser to check my math calculations, I don’t ask my house cleaner to repair my TV, and I sure as heck wouldn’t trust a political scientist to vet a biology paper.

When something’s published in the Journal of Theoretical Biology, pal, it’s theoretical biology. It’s not baseball statistics. Don’t send an amateur to do a pro’s job.

Simply put, peer review is a method by which scientists who are experts in a particular field examine another scientist’s work to verify that it makes a valid contribution to the evidence base. With that assurance, a scientist can report his or her work to the public, and the public can trust the work.

And then there’s multiple comparisons. Or should I say, “p hacking.” Christ, what a load of bull. You replication twits are such idiots. If I publish a paper with 9 statistically significant results, do you know what the probability of this is, if the null hypothesis were true? It’s (1/20) to the 9th power. Actually much lower than that: you’d get (1/20)^9 if all the p-values were .05, but actually some of these p-values are even lower, like .01 or .001. Anyway, even if it’s just (1/20)^9, do you know how low that is?

Probably not, you innumerate git.

So let me tell you, it’s 1.953125e-12, that’s 0.00000000000195312. Got that? No way any amount of multiple comparisons can cover that one. If I find 9 statistically significant results, my result is real. Period. I don’t care how many people can’t replicate it. If they can’t replicate it, it’s their problem.

p less than 0.00000000000195312. You can take that one to the bank, pal.
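
For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes, as the calculation above does, nine independent tests that each come in at exactly p = .05; the variable names are only for illustration.

```python
from fractions import Fraction

# Assumption carried over from the text: each of the 9 results has p = 1/20,
# and the tests are treated as independent, so the probabilities multiply.
p_each = Fraction(1, 20)
n_results = 9

# Exact joint probability as a fraction: (1/20)^9 = 1 / 20^9.
joint = p_each ** n_results

print(joint)          # exact fraction
print(float(joint))   # same value in scientific notation
```

Using `Fraction` keeps the value exact until the final conversion to a float, so the printed number does not depend on rounding in the intermediate multiplications.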

OK, let’s be systematic. Suppose I do a study and it is statistically significant and I publish it—it’s hard to get a paper published, dontcha know?—and then some little Dutch second-stringers raise some pissant objection on some blog, and then they sandbag me with some lame-a\$\$ “replication.” OK, fine. There are two possibilities, then:

1. My study replicates. Good. So shut the f^&#!@ up. Or,

2. The so-called replication fails. This doesn’t mean squat. All it tells us is that the world is complicated. We already knew that.

Science is about exploration, not criticism. Let’s be open-minded. Personally, I’m open-minded enough to believe that women’s political preferences change by 20 percentage points during their monthly cycle. Why not? What are you, anti-science? OK, ok, I’m not so sure that Daryl Bem found ESP—but I think we’re a damn sight better off giving him the benefit of the doubt, than censoring any result that doesn’t fit our high-and-mighty idea of what is proper science.

Jean Piaget never did a preregistered replication. Nor, for that matter, did B. F. Skinner or Sigmund Freud or Barbara Fredrickson or that Dianetics guy or all the other leading psychologists of the last two centuries.

What did Piaget and the rest of those guys do? They did what all the best scientists did: they ran questionnaires on Mechanical Turk, they found p less than .05, and they published in Psychological Science. If it was good enough for Jean Piaget and B. F. Skinner and William James and Daryl Bem and Satoshi Kanazawa, it’s good enough for me.

So take those replications and stick ’em where the sun don’t shine, then crawl back under the rock where you came from, you little twerp. The rest of us won’t even notice. Why? Cos, while you’re sniping and criticizing and replicating and blogging, we’re busy in our labs. Doing science.

P.S. Someone asked me where some of the above quotes came from. Here are some sources:

“Replication police” and “shameless little bullies” here

“Simply put, peer review is a method by which scientists who are experts in a particular field examine another scientist’s work to verify that it makes a valid contribution to the evidence base. With that assurance, a scientist can report his or her work to the public, and the public can trust the work” here

“Second stringers” here

“Little twerp” here (ok, it’s nothing to do with the topic at hand, but the phrase fit in so well that I included it).

Other lines above are generally based on things I’ve read but are not exact quotes.

1. Daniel Gotthardt says:

“Can you imagine what might happen if any published result could be questioned—by anybody?” – Doesn’t sound like police to me. ;-)

2. fool says:

1 April 2015, 9:00 am

3. Nick says:

A friend once described the views of Mitchell, Gilbert, etc. as “Harry Potter science”. My p’s are real, yours are mistaken, and if you can’t replicate my study it’s because you’re a muggle.

4. Mayo says:

Andrew: Then surely you’ll be astounded at this new multi-million dollar award by the U.S. Govt! http://errorstatistics.com/2015/04/01/are-scientists-really-ready-for-retraction-offsets-and-precautionary-withdrawals/

• Dale Lehman says:

well, it is April Fool’s Day

• Andrew says:

Mayo:

Those replication offsets are a great idea! As an author of two published retractions (one false theorem and one analysis of miscoded data), I’m in good shape.

5. Dale Lehman says:

This theme again – given how often statistical analysis is either done wrong, intentionally or otherwise, and how hard it is to get replication valued, I will be the last to complain that you are blogging about this again. And again.

But, I think this is either unfocused or distracting. The problem is not that people are poor statisticians (I’ll be the first to admittedly join that group). The problem is the incentives within our disciplines – whether it be biology or economics or…. Other professions have standards and attempt to self-police, whether we are talking physicians, lawyers, engineers, pilots, or whatever. And they often fall short, but at least they try. When it comes to academics we don’t even pretend to try. We count publications. We resist admitting mistakes. When errors are discovered we only admit them when forced to – and then we change our assumptions so that our conclusions were really correct to begin with.

And – and here is my point – our colleagues and employers do not penalize us and, in fact, promote us for our research prowess. Given those incentives, it is not surprising that this item needs to be blogged over and over again. Why is it so hard for academics to step up and judge (yes JUDGE) the merits of each others’ work? And, when it comes to teaching, we are even worse. We’ll use student evaluation numbers and rarely visit each others’ classes and never offer critical commentary. We must have really thin skins after all.

I’m not sure who or what to blame. Tenure? Nothing really at stake? Poor editorial practices? I’m an economist and my inclination is to first look at incentives – and the incentives are very poor, for all these reasons and others. Change will only occur when those at the top (e.g., journal editors, professional organizational leaders, etc.) stand behind it – otherwise, it looks like the academic failures complaining about their lack of success. But what incentives do those on top have to dismantle the structures that they are on top of? That is why things don’t change much and blog posts like this are needed, again and again.

I can’t help it – I grew up in NY and the glass is always one-third full.

• Keith O'Rourke says:

> Why is it so hard for academics to step up and judge (yes JUDGE) the merits of each others’ work?
Because they simply don’t have to and one would have to abandon much self interest to contribute to changing that.

Random audits (like with tax returns) would _fairly_ bring out mistakes so that all could be aware of them and many could be fixed – many more would be avoided (I believe) if folks were aware of possible audits and resources could be diverted to carefulness (e.g. the cost of double data entry means we can’t all attend conference X).

• Andrew says:

Keith:

The tax audit analogy works on many levels. I’ve heard that the US tax authorities are pretty sure they could raise a lot more money by increasing enforcement—that is, the amount of unpaid tax they’d collect would be much higher than the cost of paying a bunch more IRS agents. But they don’t, in part I assume because a lot of tax cheats are influential people, and Congress doesn’t want to piss them off. Also there are ideological reasons to minimize enforcement—some people hate taxes and want to reduce the size of government by any means. And there’s also the economic motivation that a bit of unpaid tax is no big deal, whereas hiring agents is a deadweight loss.

All these arguments map pretty well to scientific error and fraud: some big mistakes have been done by prominent scholars who’d have a lot to lose from exposure; some researchers appear to oppose criticism on principle; and others make the argument that the societal costs of compliance (making replication materials available, etc.) are greater than the costs of publishing occasional bad research.

• Keith O'Rourke says:

Agree, funding agencies will be concerned about documenting apparent carelessness on the part of those they entrusted with funds, universities with documenting apparent carelessness of faculty, some apparently very successful researchers will abruptly stop doing research, some folks will figure out how to evade carefulness and not get caught …

On the other hand, Keith Baggerly is currently mandating certain minimal levels of reproducibility at MD Anderson.

Part of it is who goes first and reaps a benefit rather than a loss. Maybe some university that sees this as its competitive edge or even someone who is setting up a new university (hey, it’s not as if someone does not get offered a few hundred million to do this – seriously.)

6. David Johnson says:

Happy April 1st, Andrew.

7. Rahul says:

The policemen of the world feel slighted.

8. Anon says:

Let us not forget that just because a result replicates does not mean it was interpreted correctly. A stable measurement is necessary but not sufficient for reliable inference. For example, a cancer treatment (eg a monoclonal antibody) could be thought to improve survival times by targeting some misbehaving cell type. Say there are four large blinded RCTs showing a similar improvement, that sounds convincing right?

Hold on, what if instead the treatment works by reducing appetite, leading to caloric restriction? If the latter is the mechanism there are cheaper ways of achieving the goal with fewer side effects.

• Keith O'Rourke says:

Anon:

I agree.

“If it’s real, with sufficient experimental effort it _would_ replicate” does not imply “if it replicates it’s real” or even “It _will_ replicate”.

9. It’s probably wisest to just stay off the computer on April Fool’s Day (for both writers and readers). Thanks to the above commenters for noting the date.
But whether as a joke or not, I don’t think Piaget and Skinner should be in the same category, or even in the same sentence.

10. Richard H. Serlin says:

Outside parties, aside from the peer reviewer, should look at things like how the data is gathered and culled, how the optimization is done, is it global, is it local, what’s the burn-in, what’s the starting point, are there problems with the computer code. These things can easily happen, and a peer reviewer who may only spend a few hours or days reviewing may not even begin to look at these things that can be hugely consequential.

11. Richard H. Serlin says:

Yes, April 1st, but seriously, where you start a local optimization, for example, can be huge, and from what I’ve seen in finance, peer reviewers don’t usually even begin to spend the time to look at something like that, and then there can be all kinds of programming problems and numerical issues. You really need outside replicators to look at all of this very carefully with influential papers.

12. D.O. says:

Too obvious. If Prof. Gelman posted a frequentist analysis and then said that he doesn’t know how to do as well with Bayes, we would have had a nice April 1st on our hands…

13. matus says:

“Jean Piaget never did a preregistered replication. Nor, for that matter, did B. F. Skinner or Sigmund Freud or Barbara Fredrickson or that Dianetics guy or all the other leading psychologists of the last two centuries.”

Interestingly, the same can be said of inferential statistics. Piaget, Skinner or Freud rarely if ever used inferential statistics. In this respect Wundt, Fechner and others are better suited as role models for modern experimental psychology. One more similarity: the 19th-century experimental psychologists also dabbled in psychic research :)