
Enough with the replication police


Can’t those shameless little bullies just let scientists do their research in peace?

If a hypothesis test is statistically significant and a result is published in a real journal, that should be enough for any self-styled skeptic.

Can you imagine what might happen if any published result could be questioned—by anybody? You’d have serious psychology research being grilled by statisticians, and biology research being called into question by . . . political scientists?

Where did that come from? I don’t ask my hairdresser to check my math calculations, I don’t ask my house cleaner to repair my TV, and I sure as heck wouldn’t trust a political scientist to vet a biology paper.

When something’s published in the Journal of Theoretical Biology, pal, it’s theoretical biology. It’s not baseball statistics. Don’t send an amateur to do a pro’s job.

Simply put, peer review is a method by which scientists who are experts in a particular field examine another scientist’s work to verify that it makes a valid contribution to the evidence base. With that assurance, a scientist can report his or her work to the public, and the public can trust the work.

And then there’s multiple comparisons. Or should I say, “p hacking.” Christ, what a load of bull. You replication twits are such idiots. If I publish a paper with 9 statistically significant results, do you know what the probability of this is, if the null hypothesis were true? It’s (1/20) to the 9th power. Actually much lower than that: you’d get (1/20)^9 if all the p-values were .05, but actually some of these p-values are even lower, like .01 or .001. Anyway, even if it’s just (1/20)^9, do you know how low that is?

Probably not, you innumerate git.

So let me tell you, it’s 1.953125e-12, that’s 0.00000000000195312. Got that? No way any amount of multiple comparisons can cover that one. If I find 9 statistically significant results, my result is real. Period. I don’t care how many people can’t replicate it. If they can’t replicate it, it’s their problem.

p < 0.00000000000195312. You can take that one to the bank, pal.
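For anyone keeping score at home, a quick check of our hero's arithmetic in Python. (Note that the calculation assumes the nine tests are independent and that they were the only comparisons run — which is, of course, exactly what the multiple-comparisons critique disputes.)

```python
# Probability that 9 independent tests all reach p < .05
# if every null hypothesis were true.
p_single = 1 / 20            # the p = .05 threshold
p_all_nine = p_single ** 9   # assumes independence and no hidden comparisons
print(p_all_nine)            # roughly 2 in a trillion
```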

OK, let’s be systematic. Suppose I do a study and it is statistically significant and I publish it—it’s hard to get a paper published, dontcha know?—and then some little Dutch second-stringers raise some pissant objection on some blog, and then they sandbag me with some lame-a$$ “replication.” OK, fine. There are two possibilities, then:
1. My study replicates. Good. So shut the f^&#!@ up. Or,
2. The so-called replication fails. This doesn’t mean squat. All it tells us is that the world is complicated. We already knew that.

Science is about exploration, not criticism. Let’s be open-minded. Personally, I’m open-minded enough to believe that women’s political preferences change by 20 percentage points during their monthly cycle. Why not? What are you, anti-science? OK, ok, I’m not so sure that Daryl Bem found ESP—but I think we’re a damn sight better off giving him the benefit of the doubt, than censoring any result that doesn’t fit our high-and-mighty idea of what is proper science.

Jean Piaget never did a preregistered replication. Nor, for that matter, did B. F. Skinner or Sigmund Freud or Barbara Fredrickson or that Dianetics guy or all the other leading psychologists of the last two centuries.

What did Piaget and the rest of those guys do? They did what all the best scientists did: they ran questionnaires on Mechanical Turk, they found p<.05, and they published in Psychological Science. If it was good enough for Jean Piaget and B. F. Skinner and William James and Daryl Bem and Satoshi Kanazawa, it’s good enough for me.

So take those replications and stick ‘em where the sun don’t shine, then crawl back under the rock where you came from, you little twerp. The rest of us won’t even notice. Why? Cos, while you’re sniping and criticizing and replicating and blogging, we’re busy in our labs. Doing science.

The round of 8 begins: Mark Twain (4) vs. Miguel de Cervantes (2); Carlin advances

For yesterday’s contest I really really really wanted to pick John Waters. For one thing, of all the 64 people in the bracket, he’s the one I think I’d like to hear the most. For another, he’s still alive and just might conceivably be amused enough by this whole contest to come up from Baltimore and give a talk. Which would just be amazing.

But . . . the rules are the rules. And there was only one relevant comment in the thread, from Z:

Andrew and Carlin could debate whether it’s rational to vote. George Carlin makes some arguments that it’s not, while we all know that Andrew thinks it is.

Not the most persuasive argument but it’s what we have.

And today the round of 8 begins with a face-off between the two greatest comic novelists of all time! (I’m partial to Peter De Vries, but he’s clearly a minor figure compared to these two.) Who’s it gonna be? Huck Finn or Don Quixote?

Time-release pedagogy??


Mark Palko points to this report and writes:

Putting aside my concerns with the “additional years of learning” metric (and I have a lot of them), I have the feeling that there’s something strange here or I’m missing something obvious. That jump from 3-year impact to 4-year seems excessive.

The press release links to a full report that might answer the questions but I can’t get it to come up.

George Carlin (2) vs. John Waters (1); Updike advances

The bard of the suburbs wins yesterday’s bout with another fine turn of phrase, this time brought to us in comments by Ethan:

“Drinking a toast to the visible world, his impending disappearance from it be damned.”

Updike, from “My Father’s Tears.” I want to hear from someone who can write like that about things like that.

And today’s match pits a pioneering comic against one of the great storytellers of our time. Either could be a strong contender for the championship. Carlin might have more to say about statistics—I googled *John Waters statistics* and just found this—but this doesn’t need to be a technical seminar.

P.S. As always, here’s the background, and here are the rules.

How is ethics like logistic regression?

Ethics decisions, like statistical inferences, are informative only if they’re not too easy or too hard.

For the full story, read the whole thing.

Friedrich Nietzsche (4) vs. John Updike; Austen advances

I chose yesterday’s winner based on the oddest comment we’ve received so far in the competition, from AC:

I’d love to see what Jane Austen thought of late Regency dresses, which were basically the exact opposite sensibility from her early Regency dress style. It’s an astonishingly quick reversal, from narrow and prim to a sort of walking wedding cake in twenty years. I imagine she’d have had some interesting thoughts, but she died too early to see it.

You go girl.

As for today, I’m surprised the man from Shillington has survived this far—his opponents were Buddha and Bertrand Russell, neither of whom is a tomato can. He got by on his good turns of phrase. Meanwhile, the angry German philosopher is coming up on the outside. If we could get them both to speak, we could have a spirited debate about God. Again, though, only one can advance.

P.S. As always, here’s the background, and here are the rules.

Regression: What’s it all about? [Bayesian and otherwise]

Regression: What’s it all about?

Regression plays three different roles in applied statistics:

1. A specification of the conditional expectation of y given x;

2. A generative model of the world;

3. A method for adjusting data to generalize from sample to population, or to perform causal inferences.

We could also include prediction, but I prefer to see that as a statistical operation that is implied for all three of the goals above: conditional prediction as a generalization of conditional expectation, prediction as the application of a linear model to new cases, and prediction for unobserved cases in the population or for unobserved potential outcomes in a causal inference.

I was thinking about the different faces of regression modeling after being asked to review the new book, Bayesian and Frequentist Regression Methods, by Jon Wakefield, a statistician who is known for his work on Bayesian modeling in pharmacology, genetics, and public health. . . .

Here is Wakefield’s summary of Bayesian and frequentist regression:

For small samples, the Bayesian approach with thoughtfully well-specified priors is often the only way to go because of the difficulty in obtaining well-calibrated frequentist intervals. . . . For medium to large samples, unless there is strong prior information that one wishes to incorporate, a robust frequentist approach . . . is very appealing since consistency is guaranteed under relatively mild conditions. For highly complex models . . . a Bayesian approach is often the most convenient way to formulate the model . . .

All this is reasonable, and I appreciate Wakefield’s effort to delineate the scenarios where different approaches are particularly effective. Ultimately, I think that any statistical problem that can be solved Bayesianly can be solved using a frequentist approach as well (if nothing else, you can just take the Bayesian inference and from it construct an “estimator” whose properties can then be studied and perhaps improved) and, conversely, effective non-Bayesian approaches can be mimicked and sometimes improved by considering them as approximations to posterior inferences. More generally, I think the most important aspect of a statistical method is not what it does with the data but rather what data it uses. That all said, in practice different methods are easier to apply in different problems.

A virtue—and a drawback—of Bayesian inference is that it is all-encompassing. On the plus side, once you have model and data, you can turn the crank, as the saying goes, to get your inference; and, even more importantly, the Bayesian framework allows the inclusion of external information, the “meta-data,” as it were, that come with your official dataset. The difficulty, though, is the requirement of setting up this large model. In addition, along with concerns about model misspecification, I think a vital part of Bayesian data analysis is checking fit to data—a particular concern when setting up complex models—and having systematic ways of improving models to address problems that arise.

I would just like to clarify the first sentence of the quote above, which is expressed in such a dry fashion that I fear it will mislead casual or uninformed readers. When Wakefield speaks of “the difficulty in obtaining well-calibrated frequentist intervals,” this is not just some technical concern, that nominal 95% intervals will only contain the true value 85% of the time, or whatever. The worry is that, when data are weak and there is strong prior information that is not being used, classical methods can give answers that are not just wrong—that’s no dealbreaker, it’s accepted in statistics that any method will occasionally give wrong answers—but clearly wrong, obviously wrong. Wrong not just conditional on the unknown parameter, but conditional on the data. Scientifically inappropriate conclusions. That’s the meaning of “poor calibration.” Even this, in some sense, should not be a problem—after all, if a method gives you a conclusion that you know is wrong, you can just set it aside, right?—but, unfortunately, many users of statistics take p < 0.05 or p < 0.01 comparisons as “statistically significant” and use these as motivation to accept their favored alternative hypotheses. This has led to such farces as recent claims in leading psychology journals that various small experiments have demonstrated the existence of extra-sensory perception, or huge correlations between menstrual cycle and voting, and so on.
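To make the concern concrete, here is a small simulation (with made-up numbers: a true effect of 0.1 measured with standard error 1) showing how conditioning on statistical significance exaggerates effect sizes when the data are weak:

```python
import random

random.seed(0)
true_effect, se = 0.1, 1.0  # weak data: the noise dwarfs the effect

# Simulate many studies; keep only the "statistically significant" ones.
sig = []
for _ in range(100000):
    est = random.gauss(true_effect, se)
    if abs(est) > 1.96 * se:   # two-sided p < 0.05
        sig.append(est)

avg_abs = sum(abs(e) for e in sig) / len(sig)
# Significant estimates greatly exaggerate the true effect,
# and some of them even have the wrong sign.
print(avg_abs / true_effect)
```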

In delivering this brief rant, I am not trying to say that classical statistical methods should be abandoned or that Bayesian approaches are always better; I’m just expanding on Wakefield’s statement to make it clear that this problem of “calibration” is not merely a technical issue; it’s a real-life concern about the widespread exaggeration of the strength of evidence from small noisy datasets supporting scientifically implausible claims based on statistical significance.

Frequentist inference has the virtue and drawback of being multi-focal, of having no single overarching principle of inference. From the user’s point of view, having multiple principles (unbiasedness, asymptotic efficiency, coverage, etc.) gives more flexibility and, in some settings, more robustness, with the downside being that application of the frequentist approach requires the user to choose a method as well as a model. As with Bayesian methods, this flexibility puts some burden on the user to check model fit to decide where to go when building a regression.

Regression is important enough that it deserves a side-by-side treatment of Bayesian and frequentist approaches. The next step is to take the level of care and precision that is applied when considering inference and computation given the model, and to apply this same degree of effort to the topics of building, checking, and understanding regressions. There are a number of books on applied regression, but connecting the applied principles to theory is a challenge. A related challenge in exposition is to unify the three goals noted at the beginning of this review. Wakefield’s book is an excellent start.

Stewart Lee vs. Jane Austen; Dick advances

Yesterday’s deciding arguments came from Horselover himself.

As quoted by Dalton:

Any given man sees only a tiny portion of the total truth, and very often, in fact almost . . . perpetually, he deliberately deceives himself about that precious little fragment as well.


We ourselves are information-rich; information enters us, is processed and is then projected outwards once more, now in an altered form. We are not aware that we are doing this, that in fact this is all we are doing.

Wow—Turing-esque (but I can’t picture Dick running around the house).

And, as quoted by X:

“But—let me tell you my cat joke. It’s very short and simple. A hostess is giving a dinner party and she’s got a lovely five-pound T-bone steak sitting on the sideboard in the kitchen waiting to be cooked while she chats with the guests in the living room—has a few drinks and whatnot. But then she excuses herself to go into the kitchen to cook the steak—and it’s gone. And there’s the family cat, in the corner, sedately washing its face.”

“The cat got the steak,” Barney said.

“Did it? The guests are called in; they argue about it. The steak is gone, all five pounds of it; there sits the cat, looking well-fed and cheerful. ‘Weigh the cat,’ someone says. They’ve had a few drinks; it looks like a good idea. So they go into the bathroom and weigh the cat on the scales. It reads exactly five pounds. They all perceive this reading and a guest says, ‘Okay, that’s it. There’s the steak.’ They’re satisfied that they know what happened, now; they’ve got empirical proof. Then a qualm comes to one of them and he says, puzzled, ‘But where’s the cat?’”

Fat wins the thread.

Today’s contest matches up two surprisingly strong unseeded speaker candidates. Jane Austen cuts to the bone, but with discretion; Stewart Lee lets it all hang out. So how do we like our social observations: subtle, or like a refrigerator to the side of the head?

P.S. As always, here’s the background, and here are the rules.

The publication of one of my pet ideas: Simulation-efficient shortest probability intervals

In a paper to appear in Statistics and Computing, Ying Liu, Tian Zheng, and I write:

Bayesian highest posterior density (HPD) intervals can be estimated directly from simulations via empirical shortest intervals. Unfortunately, these can be noisy (that is, have a high Monte Carlo error). We derive an optimal weighting strategy using bootstrap and quadratic programming to obtain a more computationally stable HPD, or in general, shortest probability interval (Spin). We prove the consistency of our method. Simulation studies on a range of theoretical and real-data examples, some with symmetric and some with asymmetric posterior densities, show that intervals constructed using Spin have better coverage (relative to the posterior distribution) and lower Monte Carlo error than empirical shortest intervals. We implement the new method in an R package (SPIn) so it can be routinely used in post-processing of Bayesian simulations.

This is one of my pet ideas but it took a long time to get it working. I have to admit I’m still not thrilled with the particular method we’re using—it works well on a whole bunch of examples, but the algorithm itself is a bit clunky. I have a strong intuition that there’s a much cleaner version that does just as well while preserving the basic idea, which is to get a stable estimate of the shortest interval at any given probability level (for example, 0.95) given a bunch of posterior simulations. Once we have this cleaner algorithm, we’ll stick it into Stan, as there are lots of examples (starting with the hierarchical variance parameter in the famous 8-schools model) where a highest probability density interval (or, equivalently, shortest probability interval) makes a lot more sense than the usual central interval.
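The “empirical shortest interval” that Spin stabilizes is easy to sketch. This toy version is not the paper’s algorithm—the bootstrap weighting and quadratic programming are exactly what the paper adds to reduce the Monte Carlo noise—but it shows the basic idea:

```python
import math
import random

def empirical_shortest_interval(draws, prob=0.95):
    """Shortest interval containing a `prob` fraction of the draws:
    sort the draws, slide a window covering the required number of
    draws, and keep the narrowest window."""
    s = sorted(draws)
    n = len(s)
    k = math.ceil(prob * n)  # number of draws the interval must cover
    best = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[best], s[best + k - 1]

random.seed(1)
# Skewed posterior draws (think of a variance parameter): lognormal.
draws = [random.lognormvariate(0, 1) for _ in range(10000)]
lo, hi = empirical_shortest_interval(draws, 0.95)
# For a skewed density this interval is shorter than the usual
# central 95% interval, which is the point of using it.
print(lo, hi)
```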

Mohandas Gandhi (1) vs. Philip K. Dick (2); Hobbes advances

All of yesterday’s best comments were in favor of the political philosopher. Adam writes:

With Hobbes, the seminar would be “nasty, brutish, and short.” And it would degenerate into a “war of all against all.” In other words, the perfect academic seminar.

And Jonathan writes:

Chris Rock would definitely be more entertaining. But the chance to see a speaker who knew Galileo, basing his scientific worldview on him, and could actually find weak points in the proofs of the best mathematicians of the day (even if he couldn’t do any competent math himself) should not be squandered. . . .

I love Chris Rock, but you can see him on HBO. Let Hobbes have the last word against Wallis.

Also, Hobbes could talk about the implications of bullet control.

And, now, both of today’s contestants have a lot to talk about, and they’re both interested in the real world that underlies what we merely think is real. Gandhi was a vegetarian, but Dick was a cat person, which from my perspective is even better. Which of these two culture heroes is ready for prime time??

P.S. As always, here’s the background, and here are the rules.