I read more carefully the news article linked to in the previous post, which describes a forking-pathed nightmare of a psychology study, the sort of thing that was routine practice back in 2010 or so but which we’ve mostly learned to at least try to avoid.

Anyway, one thing I learned is that there’s something called “terror management theory.” Not as important as embodied cognition, I guess, but it seems to be a thing: according to the news article, it’s appeared in “more than 500 studies conducted over the past 25 years.”

I assume that each of these separate studies had p less than 0.05, otherwise they wouldn’t’ve been published, and I doubt they’re counting unpublished studies.

So that would make the combined p-value less than 0.05^500.

Ummm, what’s that in decimal numbers?

> 500*log10(0.05)
[1] -650.515
> 10^(-0.515)
[1] 0.3054921

OK, so the combined result is p less than 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000031.

I guess terror management theory must be real, then.

The news article, from a Pulitzer Prize-winning reporter (!), concludes:

Score one for the scientists.

That’s old-school science writing for ya.

As I wrote in my previous post, I feel bad for everyone involved in this one. Understanding of researcher degrees of freedom and selection bias has only gradually percolated through psychology research, and it stands to reason that there are still lots of people, young and old, left behind, still doing old-style noise-mining, tea-leaf-reading research. I can only assume these researchers are doing their best, as is the journalist reporting these results, with none of them realizing that they’re doing little more than shuffling random numbers.

The whole thing is funny, and it’s sad, but I hope we’re moving forward. The modern journalists are getting clued in, and I expect the traditional science journalists will follow. There remains the problem of selection bias, that the credulous reporters write up these stories while the skeptics don’t bother. But I’m hoping that, one by one, reporters will figure out what’s going on.

After all, nobody wants to be the last one on the sinking ship. I guess you wouldn’t be completely alone, as you’d be accompanied by the editor of Perspectives on Psychological Science and the chair of the Association for Psychological Science publications board. But who really wants to hang out with them all day?

Now that we’ve eliminated those 500 random number generators from the possibilities… there are only (∞ – 500) other random number generators to test before conclusively ruling out that these results come from random number generators.

On the other hand, we could just cut out the middle man and directly study random number generators: http://noosphere.princeton.edu/

Though I kind of prefer the old school tarot cards.

Daniel:

If we’re going to study random number generators, I’d go old school and use the classic Rand volume of a million random digits.

Within 1 minute of when Hillary Clinton conceded the US election I opened my copy of RAND’s table of 1 Million random digits to page 331 where I closed my eyes and put my finger down on the number 0005 which clearly represents 0.005 and is less than 0.05 thereby proving once again that global consciousness is able to alter the outcome of random number generators.

Wow! Well, “Score one for the scientists.”

Multiplying p values does not give you a p value.

A p value is a random variable that has a uniform (0,1) distribution, under the assumption that the null hypothesis is true. And the product of two independent uniform (0,1) random variables is not a uniform (0,1) random variable.
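The standard fix for this, not spelled out in the thread, is Fisher’s method: under the null, -2 times the sum of the logs of k independent p-values follows a chi-squared distribution with 2k degrees of freedom, which gives a combined p-value that really is uniform under the null. A sketch in Python (the thread’s code is R; this uses the closed-form chi-squared survival function for even degrees of freedom rather than any particular library):

```python
import math

def fisher_combined_p(pvals):
    """Fisher's method: under the null, -2 * sum(log p_i) follows a
    chi-squared distribution with 2k degrees of freedom."""
    k = len(pvals)
    stat = -2.0 * sum(math.log(p) for p in pvals)
    # Survival function of chi-squared with 2k df; for even df it has a
    # closed form: P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total

# A single p-value passes through unchanged:
print(fisher_combined_p([0.05]))        # 0.05
# Ten unremarkable p-values combine to an unremarkable result:
print(fisher_combined_p([0.5] * 10))
# Ten borderline-significant p-values combine to strong evidence:
print(fisher_combined_p([0.04] * 10))
```

The point of the method is exactly the commenter’s: the raw product of uniform p-values is stochastically much smaller than uniform, so it must be recalibrated through the chi-squared distribution before being read as a p-value.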

Dan:

The p-value is defined as the probability of seeing something as extreme as the data, or more so, conditional on the null hypothesis: in math, Pr(T(y_rep) >= T(y)), where y is the data, T(y) is the test statistic, and the probability distribution of y_rep is evaluated over the null. In this example, the “test statistic” that I had in mind is the number of statistically significant results in 500 studies. Call this T(y) = sum_{i=1}^{500} T_i(y), where in the data at hand T_i(y)=1 for i=1,…,500. So T(y)=500. Under the null, the probability of seeing something statistically significant is at most 0.05, and the studies are, I assume, independent; thus the p-value, which is Pr(T(y_rep) >= 500), is less than or equal to 0.05^500. I think it’s safe to assume that some of those 500 studies involve discrete data, for which Pr(statistical significance) < 0.05 (it won’t be 0.05 exactly), hence the p-value of the set of 500 studies is less than 0.05^500. P-values are horrible but it’s good to be able to work these things out from first principles!
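Under these assumptions the count of significant results is (at most) Binomial(500, 0.05), and the tail probability can be computed exactly. A sketch in Python with rational arithmetic, so no floating-point underflow gets in the way (this is my own illustration, not code from the thread):

```python
from math import comb
from fractions import Fraction

def binom_tail(n, k, p):
    """Exact Pr(T >= k) for T ~ Binomial(n, p), computed with rationals."""
    p = Fraction(p)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n, p = 500, Fraction(1, 20)  # 500 studies, each significant w.p. 0.05 under the null

# Even 40 out of 500 significant results would already be a rare event:
print(float(binom_tail(n, 40, p)))
# All 500 significant is exactly the single term comb(500,500) * p^500:
print(binom_tail(n, 500, p) == p**500)  # True
```

The last line is the first-principles version of the post’s bound: when every one of the 500 terms in the tail sum collapses to the single j=500 term, the p-value is exactly 0.05^500.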

Andrew:

But you already stated that you’ve conditioned on p less than 0.05, otherwise no publication.

In that case, the distribution of your p-values is actually Unif(0, 0.05) under the null, so you’ve actually only calculated the normalizing constant, not the p-value…or maybe you don’t want to condition on publication, but then you have to consider all the truncated p-values (i.e. those greater than 0.05), which you did not.

Or maybe you’re subtly implying that no researcher has ever found p greater than 0.05 in a study, which is quite pessimistic…but sadly probably not that far from reality. The rate of finding p less than 0.05 is at least statistically indistinguishable from 100%, anyway :)
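The truncation point is easy to simulate: if every null is true and journals publish only p < 0.05, the published p-values are Unif(0, 0.05), and their product is astronomically small even though every “study” is pure noise. A hypothetical Monte Carlo sketch in Python (the thread itself uses R; the function name and setup here are mine):

```python
import math
import random

random.seed(0)

def published_pvals(n_published, alpha=0.05):
    """Simulate the publication filter: draw null p-values ~ Unif(0,1)
    and keep only those below alpha (the ones that get published)."""
    kept = []
    while len(kept) < n_published:
        p = random.random()
        if p < alpha:
            kept.append(p)
    return kept

ps = published_pvals(500)
# Under the null, each published p-value is Unif(0, 0.05), mean ~0.025:
print(sum(ps) / len(ps))
# The log10 of the product is hugely negative despite there being no effect:
print(sum(math.log10(p) for p in ps))
```

Each published p-value contributes about -1.7 to the log10 product on average (E[ln p] = ln(0.05) - 1 for Unif(0, 0.05)), so 500 null studies give a “combined p-value” around 10^-868: smaller than Andrew’s 0.05^500 bound, and equally meaningless once you condition on publication.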

For the record, the distribution of published p-values under 0.05 is definitely not uniform:

http://www.p-curve.com

jrc:

Of course, but none of us claimed that.

If the distribution of all published p-values was, in fact, Uniform(0, 0.05) (assuming that all peer review is nothing more than a p less than 0.05 filter), that would imply things were pretty dismal.

A:

The p-value bound I’m computing is not serious. It’s computed based on the exact same assumption used in almost every p-value ever computed (including the paper cited above and, I’m guessing, the 500 others in that list), which is the assumption that there is no selection on statistical significance. Yes, my p-value of 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000031 is not serious. And it is not serious in the exact same way that the p-values in that published paper are not serious, in the exact same way that the p-values in all those PPNAS papers are not serious, in the exact same way that the p-values in the power pose paper are not serious, in the exact same way that the p-values in the fat-arms-and-voting paper are not serious, in the exact same way that the p-values of Daryl Bem are not serious, etc. Conversely, anyone such as Susan T. Fiske who buys the p-values in all those papers should also buy the 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000031 value here.

Andrew, the 31 on the end is a nice touch. Glad I decided to scroll over.

At least a scrollbar here so that I can *see* how many zeros there are…

Hey, I provided reproducible code! It was fun to stop and think for a moment about how to compute the decimal expansion of 0.05^500.
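The trick behind that computation: 0.05^500 = (1/20)^500 = 5^500 / 10^1000, so exact integer arithmetic gives the whole decimal expansion without any floating point. A sketch in Python (the thread’s code is R; this is an independent reconstruction):

```python
# 0.05^500 = (1/20)^500 = 5^500 / 10^1000, so its decimal expansion is
# just the digits of 5^500, left-padded with zeros to 1000 places.
digits = str(5 ** 500).zfill(1000)
expansion = "0." + digits

print(len(str(5 ** 500)))      # 5^500 has 350 digits...
print(expansion[:12])          # ...so the expansion begins with 650 zeros
print(digits.lstrip("0")[:4])  # leading significant digits: 3054
```

This matches the R transcript above: 500*log10(0.05) = -650.515, i.e. 650 leading zeros followed by 10^(-0.515) ≈ 0.305.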

Andrew:

If your current argument is to be considered serious (I believe it is), you are implying that p-values should not be trusted because the peer-review process filters out all data that does not support a given hypothesis and so when we see a published paper, we are seeing nothing but an analysis on a truncated dataset (i.e. all data that didn’t support our hypothesis never made it to publication), without accounting for truncation.

I’m not going to argue whether that is true or not (I could be picky about things, but at the very least, it’s a point definitely requiring consideration), but I will point out that with this issue, it’s not just p-values that cannot be trusted. It’s also Bayesian analyses. And basic descriptive statistics. In other words, we should have no faith in any type of published data.

In that case, what is it that you propose we do to assess the validity of a theory? To be honest, I do agree with this view and have always thought of first papers on an idea as not much more than exploratory data analysis, no matter what the authors may claim. If we have other people who can build on it reliably, then I start to trust it. In truth, the p-value is really only relevant to the researcher, who presumably is seeing their data *before* the statistical significance filter, and thus holds more meaning.

A:

I do think that you can look at p-values from non-preregistered studies in the same way that you can generalize to a population from a non-probability sample and the same way you can make causal inferences from observational data; see this discussion. You just need to make some assumptions.

Here’s the difference: When making inferences from nonrandom samples or observational data, the norm in science is to be aware of the potential pitfalls and to do your best to justify your assumptions. But when people compute p-values from non-preregistered studies, they typically don’t think about these issues at all.

Regarding your claim that the same issues arise with all statistical methods: Not quite. Unlike many other methods, p-values are specific statements about what would have been seen, had the data been different. So p-values are particularly vulnerable to forking paths.

But, yes, in general, selection is a problem which is why I recommend analyzing all the data and being clear about choices that are made in data processing and analysis, which I think gets us closer to being able to make generalizable probability statements from data.

The probability of getting simultaneously N p-values, all of which are less than 0.05, under the assumption of an independent uniform(0,1) distribution on each p-value, is the volume of the hypercube [0,0.05]^N, which is 0.05^N.

Of course, there’s nothing independent about these 500 “random” p values.

Incidentally, I wrote a blog post on a related topic in the last few days: http://models.street-artists.org/2017/09/18/on-the-lack-of-lebesgue-measure-on-countably-infinite-dimensional-spaces-and-nonstandard-analysis/

As the dimension of the space increases, the Lebesgue measure of a region confined to a subset of [0,1] in each coordinate goes to zero dramatically quickly. This is why you can’t have Lebesgue measure in infinite dimensions: as the dimension increases without bound, anything but the unit cube itself ends up with measure zero or infinity.
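The hypercube-volume picture is easy to see numerically: the exact volume 0.05^N collapses as N grows, and in low dimension a Monte Carlo estimate agrees with it. A quick sketch in Python (my own illustration, not Daniel’s code):

```python
import random

def cube_volume(alpha, n):
    """Volume of the hypercube [0, alpha]^n inside the unit cube [0, 1]^n."""
    return alpha ** n

# The measure collapses fast as the dimension grows:
for n in [1, 5, 10, 50]:
    print(n, cube_volume(0.05, n))

# Monte Carlo check in dimension 3: the fraction of uniform draws from the
# unit cube landing in [0, 0.05]^3 should approach 0.05^3 = 1.25e-4.
random.seed(0)
draws = 1_000_000
hits = sum(
    1 for _ in range(draws)
    if all(random.random() < 0.05 for _ in range(3))
)
print(hits / draws)
```

Already at N = 10 the volume is below 10^-13, which is why the Monte Carlo check is only feasible in very low dimension: by N = 500 you would wait far longer than the age of the universe for a single hit.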

If you are looking for replication failures for terror management theory, here’s one from the Reproducibilty Project: Psychology:

https://osf.io/uhnd2/

With papers like the one your original post is referring to, there’s bound to be more to come.

Perhaps the shortcomings of Terror Management Theory can be explained by Error Management Theory:

https://en.wikipedia.org/wiki/Error_management_theory

And I thought Lord & Novick were the originators of error management in science…

I would like someone to p-curve the TMT literature already. There are existing meta-analyses out there; I just don’t have the time to convert the table of effect sizes into p-values and then p-curve it. I will gladly help if other people want to do this, though.