OK, here’s a paper with a true theorem but then some false corollaries.

First the theorem:

The above is actually ok. It’s all true.

But then a few pages later comes the false statement:

This is just wrong, for two reasons. First, the relevant reference distribution is discrete uniform, not continuous uniform, so the normal CDF thing is at best just an approximation. Second, with Markov chain simulation, the draws theta_i^(l) are dependent, so for any finite L, the distribution of q_i won’t even be discrete uniform.
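The second point is easy to see in a toy simulation. Below is a sketch (not from the paper; the AR(1) "chain", rho, and L are illustrative assumptions standing in for correlated MCMC output): with independent draws, the quantile statistic q hits exactly 0 about 1/(L+1) of the time, but with strongly correlated draws the mass piles up at the extremes.

```python
import numpy as np

# Toy illustration: the quantile statistic q = fraction of draws below an
# independent "true" value theta, computed from L independent draws vs. from
# L correlated AR(1) draws with the same N(0,1) marginal.
rng = np.random.default_rng(0)
L, rho, reps = 100, 0.95, 2000

def ar1_chain(L, rho, rng):
    """Stationary AR(1) with a standard-normal marginal distribution."""
    x = np.empty(L)
    x[0] = rng.standard_normal()
    for t in range(1, L):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    return x

q_indep, q_mcmc = [], []
for _ in range(reps):
    theta = rng.standard_normal()                      # independent "true" draw
    q_indep.append(np.mean(rng.standard_normal(L) < theta))
    q_mcmc.append(np.mean(ar1_chain(L, rho, rng) < theta))

# Under independence, P(q == 0) = 1/(L+1), about 0.01 here; with strong
# serial correlation, q == 0 happens far more often.
print(np.mean(np.array(q_indep) == 0.0))
print(np.mean(np.array(q_mcmc) == 0.0))
```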

The theorem is correct because it’s in the limit as L approaches infinity, but the later statement (which I guess if true would be a corollary, although it’s not labeled as such) is false.

The error wasn’t noticed in the original paper because the method happened to work out on the examples. But it’s still a false statement, even if it happened to be true in a few particular cases.

Who are these stupid-ass statisticians who keep inflicting their errors on us??? False theorems, garbled data analysis, going beyond scientific argument and counterargument to imply that the entire field is inept and misguided, methodological terrorism . . . where will it all stop? Something should be done.

**P.S.** For those of you who care about the actual problem of statistical validation of algorithms and software which this is all about: One way to fix the above error is to compare to a discrete uniform distribution on a binned space, with the number of bins set to something like the effective sample size of the simulations. We’re on it. And thanks to Sean, Kailas, and others for revealing this problem.
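The binning idea can be sketched as follows (a minimal sketch only: the bin count, the function names, and the chi-square summary are illustrative assumptions, not the final method from the forthcoming paper):

```python
import numpy as np

def bin_counts(ranks, L, n_bins):
    """Counts of rank statistics (integers in {0, ..., L}) in n_bins
    equal-width bins, for comparison to a discrete uniform reference."""
    edges = np.linspace(0, L + 1, n_bins + 1)
    return np.histogram(ranks, bins=edges)[0]

# Sanity check: ranks that really are discrete uniform on {0, ..., L}
# should spread evenly across the bins.
rng = np.random.default_rng(1)
L, reps = 99, 10_000
ranks = rng.integers(0, L + 1, size=reps)   # discrete uniform ranks
n_bins = 20                                 # stand-in for "something like Neff"
counts = bin_counts(ranks, L, n_bins)
expected = reps / n_bins
chi2 = np.sum((counts - expected) ** 2 / expected)
print(counts.sum(), chi2)
```

With genuinely uniform ranks, the chi-square statistic should look like a draw from a chi-square distribution with n_bins - 1 degrees of freedom; miscalibration or unaccounted-for autocorrelation inflates it.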

“Who are these stupid-ass statisticians who keep inflicting their errors on us??? False theorems, garbled data analysis, going beyond scientific argument and counterargument to imply that the entire field is inept and misguided, methodological terrorism . . . where will it all stop? Something should be done.”

Being rather hard on yourself today? :P

Or maybe he’s just being a smart ass?

The stupid ass statisticians link isn’t working, but I see it’s got a future date in the URL. Something for us to look forward to?

Ben:

Yah, see here.

I think Michael’s recommendation was to use a number of bins equal to the square root of the effective sample size. I think that’s just for the visual tests (aka chi-by-eye). Though I suppose this would affect any binned chi-square test, too.

Was there an issue with repeated Markov chain states throwing things off, too?

I still don’t understand why there would be discretization spike at zero, so I’m looking forward to the paper.

Bob:

The simulation draws are correlated so if your separate independent draw from the target distribution happens to be very low, it’s more likely to be in position 1 (lower than all L of the MCMC draws) than in position 2. Similarly on the high end. So you’d expect some piling up at 0 and 1, even if the MCMC was programmed fine and even if it had reached convergence.

Serial correlation, plus “turning around” when you get out deep in the tail, ensures that you leave a few samples near each other at the turn-around point, the way a ball appears to hang in the air at the peak of its trajectory. So both the left and right tails should have a little extra weight in the near tail and a little less weight in the deep tail, until you get extremely large sample sizes. On the other hand, if autocorrelation is small because you’re thinning substantially, this should go away. So one way to handle this issue is perhaps to run 10x as many samples as you need and thin by 10, or something like that. Binning into bins is also kind of like thinning.

Personally I liked Aki’s suggestion to use q-q plots, but perhaps q-q plots after thinning.

In his Q-Q plot you can see that the independent normal draws go deeper into the tails than the simulation. Why? Because the simulation has to “turn around”; it can’t just jump from the tail back to the core the way independent draws do.

With longer and longer simulations, I’d expect the location where this shortfall occurs to move deeper and deeper into the tail, but it would still always occur: the independent draws would always have a deeper tail. When you thin, I’d expect these tails to be closer to flat. Imagine you throw out 9 of every 10 draws in Aki’s graph; then the kinked tail would have only 1 or maybe 2 draws in it. It would get noisier, but also less biased, I think.
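The tail shortfall itself is simple to demonstrate (an illustrative sketch; the AR(1) chain, rho, and L are assumptions): with the same N(0,1) marginal, L correlated draws reach less deep into the tails than L independent draws, because the chain has to walk out and turn around rather than jump.

```python
import numpy as np

# Compare how deep into the upper tail L correlated vs. L independent
# standard-normal draws get, averaged over many replications.
rng = np.random.default_rng(4)
L, rho, reps = 200, 0.95, 500
s = np.sqrt(1 - rho**2)

max_mcmc, max_indep = [], []
for _ in range(reps):
    x = np.empty(L)
    x[0] = rng.standard_normal()
    for t in range(1, L):
        x[t] = rho * x[t - 1] + s * rng.standard_normal()
    max_mcmc.append(x.max())
    max_indep.append(rng.standard_normal(L).max())

# The independent draws go deeper into the tail on average.
print(np.mean(max_indep), np.mean(max_mcmc))
```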

References for those who don’t follow the Stan list:

http://discourse.mc-stan.org/t/cook-et-al-spike-at-0/1378/58

It seems the way to go is to run the simulation for as much time as you have to get a big sample, calculate Neff, and then thin by round(N/Neff + .5) or the like, so that by the end Neff ~ N. At that point your sample should act a lot more like an independent sample, I’d expect, and a Q-Q plot should be pretty straight. That’s my intuition, at least.
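That recipe can be sketched like this (a toy version: the AR(1) chain is an illustrative assumption, and the Neff estimate here uses only the lag-1 autocorrelation, a crude stand-in for a proper effective-sample-size estimator):

```python
import numpy as np

# Estimate Neff with an AR(1) approximation, then thin by ~ N/Neff.
rng = np.random.default_rng(3)
N, rho = 50_000, 0.9
x = np.empty(N)
x[0] = rng.standard_normal()
for t in range(1, N):
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

r1 = np.corrcoef(x[:-1], x[1:])[0, 1]     # sample lag-1 autocorrelation
n_eff = N * (1 - r1) / (1 + r1)           # AR(1)-based Neff approximation
k = int(N / n_eff + 0.5)                  # thinning factor, round(N/Neff + .5)
thinned = x[::k]
print(k, len(thinned))
```

After thinning by k, the remaining draws should be close to uncorrelated, so Neff of the thinned sample is close to its length.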

In other bigshot news, Eric Robinson posted a preprint today criticizing various Wansink articles that formed the basis of “Smarter Lunchrooms”:

https://peerj.com/preprints/3137/

I don’t really know much about the details of this government program, but it seems like a lot of money went into it and we’ve been contacted by various major news organizations asking us if we’ve found serious problems with these papers. Typically the papers had large sample sizes so we couldn’t do granularity testing, and of course the data wasn’t publicly available.

You know, we need to be careful here. Just because we found a few typos here and there it doesn’t mean everything touched by Wansink is wrong…nah, who am I kidding, all this data is probably complete shit.

In all seriousness, just think how much Brian Wansink, Susan Fiske, Satoshi Kanazawa, Amy Cuddy, etc etc, could progress in their research if they’d just acknowledge their errors openly rather than keep trying to dodge criticism and hide their mistakes.

Maybe, but their errors kind of crush their work. It’s one thing to have a big body of work with something like an epsilon fraction of errors; it’s another thing to have a big body of work that is 1-epsilon errors.

Daniel:

Sure, but all these people have a lot of energy. If they’d set up their work to be open to criticism from Day 1, then maybe by now they’d be doing useful things.

Agreed, open to criticism from day 1 is necessary to do good science. Also, it helps if people in your field actually think critically and aren’t all doing the same lemming thing.

Daniel:

Yup. The saddest thing of all in these stories is not the lifers such as Wansink, Bargh, Fiske, etc., but rather those members of the younger generation of scientists who have already, Alex P. Keaton-style, bought into the worldview of their elders. I’m thinking in particular of the ovulation-and-clothing researchers who pushed aside tons of feedback and continue, as far as I know, not to recognize any of the problems in their work. It’s just sad, as they potentially have decades of research ahead of them yet they remain on the far side of the exploration/exploitation divide.

Day 1 needs to be at least as early as first year grad school — ideally, as undergraduates. Openness to criticism is part of intellectual maturity.

All credit to Sean Talts who spent a few weeks trying to implement Cook et al, discovering the spikes that wouldn’t go away, and then doing all of the validation work to verify that it was indeed the discretization of the quantiles coupled with the MCMC standard error.

+1

Once the problem was clearly spelled out, it wasn’t so hard to understand what had gone wrong, but we only got to that point as a result of seeing things fail in a real problem. After all, yes, my colleagues and I made a stupid mistake in our paper, but it sat there for over a decade without us realizing the problem. It took Sean’s careful work to isolate the problem so that we could see it for the obvious-in-retrospect error that it was.

The stupid ass statisticians hyperlink (“https://andrewgelman.com/2017/12/28/stupid-ass-statisticians-dont-know-goddam-confidence-interval/”) leads to a 404.

+ 1

Ouch..lol

See here.

I’m looking forward to a write-up on the correction. A month ago I tried the method in this paper on an MCMC sampler I wrote and got confusing results. In the end, I gave up (on the method, not the sampler). Any chance BayesValidate will get updated in the near future?