We had a couple recent discussions regarding questionable claims based on p-values extracted from forking paths, and in both cases (a study “trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses,” and a salami-slicing exercise looking for public opinion changes in subgroups of the population), I recommended fitting a multilevel model to estimate the effects in question. The idea is that such a model will estimate a distribution of treatment effects that is concentrated near zero, and the resulting inferences for the individual effects will be partially pooled toward zero, with the anticipated result in these cases that none of the claims will be so strong any more.

Here’s a simple example:

Suppose the prior distribution, as estimated by the hierarchical model, is that the population of effects has mean 0 and standard deviation of 0.1. And now suppose that the data-based estimate for one of the treatment effects is 0.5 with a standard error of 0.2 (thus, statistically significant at conventional levels). Also assume normal distributions all around. Then the posterior distribution for this particular treatment effect is normal with mean (0/0.1^2 + 0.5/0.2^2)/(1/0.1^2 + 1/0.2^2) = 0.10, with standard deviation 1/sqrt(1/0.1^2 + 1/0.2^2) = 0.09. Based on this inference, there’s an 87% posterior probability that the treatment effect is positive.
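For readers who want to check the arithmetic, here is a minimal sketch of the precision-weighted calculation above, assuming normal distributions all around as stated. The function name `normal_posterior` and its argument names are my own labels for illustration, not anything from a particular library.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Combine a normal prior with a normal data estimate by precision weighting."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / std_error**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mean * prior_precision + estimate * data_precision) / post_precision
    post_sd = 1.0 / sqrt(post_precision)
    return post_mean, post_sd


# Numbers from the example: prior 0 +/- 0.1, data estimate 0.5 +/- 0.2
post_mean, post_sd = normal_posterior(prior_mean=0.0, prior_sd=0.1, estimate=0.5, std_error=0.2)

# Posterior probability that the effect is positive: 1 - CDF(0)
prob_positive = 1.0 - NormalDist(post_mean, post_sd).cdf(0.0)

print(f"posterior mean = {post_mean:.2f}, sd = {post_sd:.2f}, P(effect > 0) = {prob_positive:.2f}")
```

Swapping in a different `prior_sd` is all it takes to try other priors with the same data.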

We could expand this hypothetical example by considering possible alternative prior distributions for the unknown treatment effect. Uniform(-inf,inf) is just too weak. Perhaps normal(0,0.1) is also weakly informative, and maybe the actual population distribution of the true effects is something like normal(0,0.05). In that case, using the normal(0,0.1) prior as above will under-pool, that is, the inference will be anti-conservative and be too susceptible to noise.

With a normal(0,0.05) prior and normal(0.5,0.2) data, you’ll get a posterior that’s normal with mean (0/0.05^2 + 0.5/0.2^2)/(1/0.05^2 + 1/0.2^2) = 0.03, with standard deviation 1/sqrt(1/0.05^2 + 1/0.2^2) = 0.05. Thus, the treatment effect is likely to be small, and there’s a 73% chance that it is positive.
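It may help to see the two calculations as instances of one formula. With a prior mean of zero, the posterior mean is just the data estimate multiplied by a pooling weight. The symbols here are my own notation, not part of the original example: θ is the true effect, y the data estimate, σ its standard error, τ the prior standard deviation, and λ the weight on the data.

```latex
\[
\lambda = \frac{\tau^2}{\tau^2 + \sigma^2}, \qquad
\mathrm{E}[\theta \mid y] = \lambda\, y, \qquad
\mathrm{sd}[\theta \mid y] = \sqrt{\lambda}\,\sigma
\]
\[
\tau = 0.1,\ \sigma = 0.2:\quad
\lambda = \frac{0.01}{0.01 + 0.04} = 0.2, \quad
\mathrm{E}[\theta \mid y] = 0.2 \times 0.5 = 0.10, \quad
\mathrm{sd}[\theta \mid y] \approx 0.09
\]
\[
\tau = 0.05,\ \sigma = 0.2:\quad
\lambda = \frac{0.0025}{0.0025 + 0.04} \approx 0.059, \quad
\mathrm{E}[\theta \mid y] \approx 0.059 \times 0.5 \approx 0.03, \quad
\mathrm{sd}[\theta \mid y] \approx 0.05
\]
```

Halving the prior standard deviation cuts the weight on the data from 20% to about 6%, which is why the estimate drops so sharply.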

Also, all this assumes zero bias in measurement and estimation, which is just about never correct but can be an ok approximation when standard errors are large. Once the standard error becomes small, then we should think about including an error term to allow for bias, to avoid ending up with too-strong claims.

**Regularization vs. discovery?**

The above procedure is an example of *regularization* or smoothing, and from the Bayesian perspective it’s the right thing to do, combining prior information and data to get probabilistic inference.

A concern is sometimes raised, however, that regularization gets in the way of *discovery*. By partially pooling estimates toward zero, are we reducing our ability to discover new and surprising effects?

My answer is no, there’s *not* a tradeoff between regularization and discovery.

How is that? Consider the example above, with the 0 ± 0.05 prior and 0.5 ± 0.2 data. Our prior pulls the estimate to 0.03 ± 0.05, thus moving the estimate from clearly statistically significant (2.5 standard errors away from 0) to not even close to statistical significance (less than 1 standard error from zero).

So we’ve lost the opportunity for discovery, right?

No.

There’s nothing stopping you from gathering more data to pursue this possible effect you’ve discovered. Or, if you can’t gather such data, you just have to accept this uncertainty.

If you want to be more open to discovery, you can pursue more leads and gather more and higher quality data. That’s how discovery happens.

B-b-b-but, you might say, what about discovery by luck? By regularizing, are we losing the ability to get lucky? Even if our hypotheses are mere lottery tickets, why throw away tickets that might contain a winner?

Here, my answer is: If you want to label something that might likely be wrong as a “discovery,” that’s fine by me! No need for a discovery to represent certainty or even to represent near-certainty. In the above example, we have a 73% posterior probability of seeing a positive effect in an exact replication study. Call that a discovery if you’d like. Integrate this discovery into your theoretical and practical understanding of the world and use it to decide where to go next.

**P.S.** The above could be performed using longer-tailed distributions if that’s more appropriate for the problem under consideration. The numbers will change but the general principles are the same.

Dr. Gelman makes an appearance in this NYT Magazine story as the not nice man who hurts the feelings of the Power Posing lady:

https://www.nytimes.com/2017/10/18/magazine/when-the-revolution-came-for-amy-cuddy.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=mini-moth&region=top-stories-below&WT.nav=top-stories-below

Steve:

I think the NYT article is fair, given the inevitable space limitations. I wouldn’t’ve chosen to have written an article about Amy Cuddy—I think Eva Ranehill or Uri Simonsohn would be much more interesting subjects. But, conditional on the article being written largely from Cuddy’s perspective, I think it portrays the rest of us in a reasonable way. As I said to the reporter, I don’t have any personal animosity toward Cuddy. I just think it’s too bad that the Carney/Cuddy/Yap paper got all that publicity and that Cuddy got herself tangled up in defending it. It’s admirable that Carney just walked away from it all. And it’s probably a good call of Yap to pretty much have avoided any further involvement in the matter.

The only thing that really bugged me about the NYT article is when Cuddy is quoted as saying, “Why not help social psychologists instead of attacking them on your blog?” and there is no quoted response from me. I remember this came up when the reporter interviewed me for the story, and I think I responded right away that I *have* helped social psychologists! I’ve given several talks to psychology departments and at professional meetings, and I’ve published several papers in psychology and related fields on how to do better applied research.

So, I think the article was basically fair but I do wish they hadn’t let that particular false implication by Cuddy go unchallenged. Then again, I also don’t like it that Cuddy attacked the work of Simmons and Simonsohn without supplying any evidence. It seems that the rule is that it’s “bullying” to criticize Cuddy, even with ample evidence, but it’s just fine for Cuddy or her advisor Susan Fiske to lash out with false criticisms for which no evidence is presented.

I find it really interesting to read the comments on that NYT piece. Lots of people there seem to assume and/or take the view that Cuddy was bullied by the critics named in the piece, and even invoke things like sexism.

What i find interesting is that i have a hard time finding anything in the story that could be perceived as bullying (according to a reasonable definition) by the critics. If, and that’s a big if, there has been any bullying, my understanding is that it was done on blog posts, Twitter, etc. by people other than those named in the story. But i can’t even find any examples of that in the piece. The “worst” to me is the quote “I’ve wondered whether some of Amy Cuddy’s mistakes are due to the fact that she suffered severe head trauma as the result of a car accident some years ago”, which i think can be considered to be bullying, but can also be seen as a bad joke or a sincere question. Regardless, if we start policing those kinds of remarks, it all becomes a little too close to censorship.

Perhaps it might be useful for those that think there is a bullying-problem to:

1) make a difference between possible bullying by fellow scientists and by the general public. I think it’s reasonable to try and make this distinction when possible, as i view possible bullying by fellow scientists or the general public as separate things.

2) make clear what is seen as bullying, which can then be followed by a discussion of why this is the case, and why this is/is not a valid view in light of the scientific enterprise.

I still have not seen any evidence in the Cuddy case of what i think is bullying (according to a reasonable definition) by a fellow-scientist.

+1

To nearly all women in present Academia, who have been treated like hothouse flowers all their lives, any criticism constitutes “bullying.” That’s how far we have sunk….

Peter:

I don’t think this sort of comment is very helpful! How can anybody respond to it? Analyses of individual cases are fine (for example, one can argue about whether my blog posts are a form of “bullying”; I don’t think so but Cuddy does; I guess I’d like to hear why she thinks so) and statistics are fine too (one could do a survey of academics and ask about bullying experiences, what constitutes bullying, etc.), but I don’t see what’s gained by empty generalities.

+1

Steve:

“Up-and-coming social psychologists, armed with new statistical sophistication, picked up the cause of replications, openly questioning the work their colleagues conducted under a now-outdated set of assumptions.”

Steve Goodman raised this as a necessary? step in changing the methodology a field embraces https://ww2.amstat.org/meetings/ssi/2017/onlineprogram/AbstractDetails.cfm?AbstractID=304001

That was my experience in clinical research in the 1980s and 1990s: newer, ambitious clinicians took senior roles away from those who had them, using randomized trial methods, statistics and health economics. There will be very personal losses and gains, so politics necessarily enters – those “whether I embrace your principles or mistress” exchanges. But the economy of research (Peirce, C. S. (1879). Note on the theory of the economy of research.) takes no prisoners, nor can it afford to.

Note also, the soon to be published meta-analysis https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3054952 has mentioned three levels of evidence: evidential value, clear evidential value and remarkably strong evidential value. The full paper is yet to be available, but in my 20+ years working in meta-analyses of clinical research I don’t remember ever getting clear evidential value except perhaps for side effects.

The way I try to explain this to people concerned about Type II errors is: the only reason anyone ever thought p < 0.05 should count as "discovery" in the first place is that it's supposed to be hard to get by luck. If you do things that make it easy to get by luck then the whole justification for caring at all about small p-values evaporates, and you may as well be testing hypotheses using a magic 8 ball.
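To put a rough number on "easy to get by luck," here is a small sketch. It makes a simplifying assumption that is not in the comment above: the forking paths are treated as independent tests of true null effects, each checked at the same level. Real forking paths are usually correlated, so take this as an illustration, not a calibration. The function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests of a true null gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    chance = prob_at_least_one_false_positive(n_tests)
    print(f"{n_tests:>3} tests: P(at least one p < 0.05 by luck) = {chance:.3f}")
```

Under these assumptions, with 20 looks at pure noise the chance of at least one "significant" result is already about 64%.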

To use the "some of these lottery tickets might be winners" analogy: you bought those lottery tickets because you thought the winning ones would be worth something. But it turns out you were playing a special kind of lottery, in which the value of the winning tickets is reduced in proportion to the quantity of non-winning tickets. And now that "three-of-a-kind" scratch card you're holding onto can't be converted into cash after all.

+1 to first paragraph. (Second paragraph seems kinda contrived.)

I agree, I immediately liked my analogy less as soon as I submitted it. The “winning ticket” analogy that Andrew references refers to real discoveries in a pile of false discoveries, and so refers to the truth that the method is supposed to detect. My analogy refers to how trustworthy the method’s claim to have detected something is, rather than the truth of the thing itself.

In defense of psychology, to paraphrase Greg Cochran’s observation, a lot of fields have similar problems, but the research psychologists tend to have enough of a conscience to feel somewhat bothered by their problems.

I wonder if the focus on psychology is also a function of the publicity given to some of these studies. We know about power posing and himmicanes and all that because they get written about in the popular press, making them easy targets for skeptics (and making the skeptics’ criticism also press-worthy). If people are happily p-hacking away in some unsexy field that never produces TED talks and clickbaity articles, who’s gonna spot it besides those in the field?

+1

I’m glad the focus has been on us psychologists. Research is a human activity and we’re the ones who’ve taken upon ourselves the task of building the science of human behavior. If we psychologists aren’t willing to take the science of science seriously, well, we should just hand in our degrees and take up retail or something.

+1

On those concerned with the trade-off between regularization and discovery:

It seems this is confusing the estimation approach with the value function. Ideally, the threshold at which we label something a ‘discovery’ would be based on the expected distribution of outcomes conditional on whether or not we apply this label. Even if one accepts that a fixed threshold will be used (silly idea, but whatever), why not define it in the domain that is seemingly of interest, ie the posterior probability that the quantity we are estimating is above some line? If you want more discoveries, just move the threshold.
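As a sketch of what "define the threshold in the posterior domain" could look like, the snippet below reuses the numbers from the post (a normal(0,0.05) prior with 0.5 ± 0.2 data) and labels a result a discovery when the posterior probability of the effect exceeding some line passes a chosen threshold. The function names and the two thresholds (0.70 and 0.95) are hypothetical choices for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def prob_effect_above(line, prior_sd, estimate, std_error, prior_mean=0.0):
    """Posterior probability that the effect exceeds `line` (normal prior, normal data)."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / std_error**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mean * prior_precision + estimate * data_precision) / post_precision
    post_sd = 1.0 / sqrt(post_precision)
    return 1.0 - NormalDist(post_mean, post_sd).cdf(line)


def is_discovery(probability, threshold):
    """Label a result a 'discovery' when its posterior probability clears the threshold."""
    return probability >= threshold


p_positive = prob_effect_above(line=0.0, prior_sd=0.05, estimate=0.5, std_error=0.2)

# The estimate is the same either way; only the label depends on the threshold.
for threshold in (0.70, 0.95):
    print(f"threshold {threshold:.2f}: P(effect > 0) = {p_positive:.2f}, "
          f"discovery = {is_discovery(p_positive, threshold)}")
```

The point of separating the two functions is that the estimation step never changes; moving the threshold only changes what you choose to call the result.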

That’s a sensible solution for someone already converted to HB estimates. If you want more discoveries, just move the threshold. But I don’t think the gripe is with the threshold per se. The gripe is with the estimates themselves, that they’re not MLEs. The gripe isn’t that the threshold has shifted but that the estimates have. I think it all stems from not fully understanding just what hierarchical modeling does and the reasons behind some estimates shrinking more than others. Perhaps the method is perceived as selectively shrinking only the most interesting or discovery-worthy metrics, or that the HB shrinkage is happening in some ad hoc manner.

For example, take something like the principle of a Bonferroni correction when making multiple statistical comparisons. I’ve not seen students or researchers put up a fuss over revising the threshold for discovery, that is, tightening the p-value at which one calls something significant when, say, one’s doing hundreds of statistical tests. Somehow correcting for multiple comparisons (shrinking the p-value) isn’t objectionable. The gripe is with the other side of the coin.

In an odd coincidence, the very day of this post by Andrew (10/19) I was presenting HB methods and Stan to a group of about 100 practitioners at a data science conference geared for the tech industry (Apple, Google, FB, etc. attendees). I used Stan to demonstrate how HB estimation methods will almost always boost the accuracy of one’s predictions (lower MAPEs, etc.) when fitting models to multiple groups. I did this using a real-world data example. We had a very lively discussion. (As an aside, I couldn’t make the deadline for submitting this to StanCon 2018, but that would’ve been a nice outlet as well.) Some of these otherwise experienced data science practitioners had a very hard time with the concept. They were truly convinced that I was sacrificing discovery, that I was over-shrinking my results and distorting my estimates. Such is the entrenched belief in the superiority of MLEs. For this audience, and in most cases, I don’t think that a theoretical discussion or set of proofs is convincing at all. The way through this–and the path I followed–is to clearly demonstrate, using hold-out tests and starting with very basic examples (the early season baseball batting averages example is a good one!), that HB estimates will almost always produce lower MAPEs, fewer egregious outliers, and more accurate predictions than the MLEs. One needs to demonstrate this in black and white. One can even show with simulated data how the least stable groups are the ones receiving the greatest shrinkage, not the more stable ones.

It helps to acknowledge that, yes indeed, for *some* groups, the HB/Stan estimates can get moved further away from their “true” values. However, for far, far more groups, the HB/Stan estimates are actually the ones *closer* to their corresponding true values, and the MAPE is lower than with MLEs.
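A simulated-data version of this argument can be sketched in a few lines. To be clear about the assumptions, none of which come from the real-world example described above: true group effects are drawn from normal(0, 0.1), each group gets its own standard error between 0.1 and 0.4, and the population standard deviation `TAU` is treated as known instead of being estimated by a hierarchical fit in Stan. The sketch reports mean absolute error in place of MAPE, because percentage errors blow up when true effects sit near zero.

```python
import random

random.seed(2017)  # fixed seed so the sketch is reproducible

TAU = 0.1        # assumed sd of true effects across groups (taken as known here)
N_GROUPS = 5000


def pooling_weight(std_error, tau=TAU):
    """Share of the raw estimate that survives partial pooling toward zero."""
    return tau**2 / (tau**2 + std_error**2)


abs_err_raw, abs_err_pooled, n_pooled_closer = 0.0, 0.0, 0
for _ in range(N_GROUPS):
    true_effect = random.gauss(0.0, TAU)
    std_error = random.uniform(0.1, 0.4)         # less stable groups have larger std errors
    raw = random.gauss(true_effect, std_error)   # unpooled estimate for this group
    pooled = pooling_weight(std_error) * raw     # partially pooled toward the mean of 0
    abs_err_raw += abs(raw - true_effect)
    abs_err_pooled += abs(pooled - true_effect)
    n_pooled_closer += abs(pooled - true_effect) < abs(raw - true_effect)

mae_raw = abs_err_raw / N_GROUPS
mae_pooled = abs_err_pooled / N_GROUPS
share_closer = n_pooled_closer / N_GROUPS

print(f"mean absolute error, unpooled: {mae_raw:.3f}")
print(f"mean absolute error, pooled:   {mae_pooled:.3f}")
print(f"share of groups where pooled estimate is closer to truth: {share_closer:.2f}")
print(f"weight kept at std error 0.1: {pooling_weight(0.1):.2f}, at 0.4: {pooling_weight(0.4):.2f}")
```

The last line shows the selective shrinkage directly: the noisiest groups keep the smallest share of their raw estimate, while the more stable groups are moved much less.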

In the end, the trade-off really isn’t between discovery and regularization at all. The real trade-off is whether one wants estimates closer to the truth for more groups or for fewer groups. Regularization doesn’t get in the way of discovery; it more often saves you the embarrassment of making some really, really bad claims!