Blake McShane sent me some material related to a paper of his (McShane et al., 2016; see reference list below), regarding various methods for combining p-values for meta-analysis under selection bias. His remarks related to some things written by Uri Simonsohn and his colleagues, so I cc-ed Uri on the correspondence. After some back and forth, I decided the best approach would be to read through the articles in question, write my summary, and then invite McShane and Simonsohn to append their thoughts.

So here goes:

The p-curve (Simonsohn, Nelson, and Simmons, 2014a, b) and p-uniform (van Assen, van Aert, and Wicherts, 2015) methods for meta-analysis under selection bias can be seen as alternative estimators of the model proposed by Hedges (1984). Hedges used maximum likelihood; p-curve and p-uniform use slightly different estimation methods. As a statistician, my usual inclination would be to start with maximum likelihood and, for sparse data, to move to a Bayesian approach. Simonsohn et al. and van Assen et al., with backgrounds not in statistics but in experimental psychology, came up with their own data-based estimates which perform similarly to the maximum likelihood estimate of Hedges but with some loss of efficiency and robustness in small samples and under the model misspecifications considered by McShane et al. (2016). In a review of the p-curve and p-uniform methods, van Aert, Wicherts, and van Assen (2016) write, “Hedges’s method and its performance are not further examined in this article because it is currently not applied in practice.” I don’t know enough about practice in psychological methods to be able to evaluate that remark.
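To make the estimation idea concrete, here is a minimal sketch of maximum likelihood under selection on statistical significance. It is a deliberately simplified stand-in for the Hedges (1984) setup, not a reimplementation of it. It assumes every study reports a z-statistic distributed N(delta, 1) with one common delta, and that a study is observed only if z exceeds the 1.96 threshold. The data in `z_obs` are hypothetical, and the function names are invented for illustration.

```python
import math

CRIT = 1.959964  # assumed selection threshold on the z scale

def norm_logpdf(x):
    # Log density of a standard normal.
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def norm_sf(x):
    # Upper-tail probability of a standard normal; erfc keeps accuracy in the tail.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def truncated_loglik(delta, z_values, crit=CRIT):
    # Each z is N(delta, 1) but is observed only when z > crit,
    # so each density is renormalized by P(z > crit | delta).
    log_norm_const = math.log(norm_sf(crit - delta))
    return sum(norm_logpdf(z - delta) - log_norm_const for z in z_values)

def mle_delta(z_values, lo=-10.0, hi=10.0, tol=1e-8):
    # Ternary search: the truncated-normal log-likelihood is concave in delta.
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if truncated_loglik(m1, z_values) < truncated_loglik(m2, z_values):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

# Hypothetical z-statistics from studies that all passed the significance filter.
z_obs = [2.1, 2.3, 2.0, 2.8, 2.2, 3.1, 2.05, 2.4]
naive = sum(z_obs) / len(z_obs)   # ignores selection
adjusted = mle_delta(z_obs)       # accounts for selection
print(naive, adjusted)
```

In this toy setting the selection-adjusted estimate is far smaller than the naive average, which is the basic phenomenon all three methods are trying to capture. The sketch ignores heterogeneity, unequal standard errors, and any selection mechanism other than a hard threshold.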

Simonsohn et al. wrote a response to McShane et al. in which they (Simonsohn et al.) agreed that, when it comes to estimating effect size, all three methods described above can be viewed statistically as three similar estimators of the same model. Simonsohn et al. continue to recommend p-curve for applied work, but I don’t see that they would object to researchers instead using the p-uniform or Hedges estimates when estimating the underlying distribution of effect sizes or p-values. Simonsohn also points out several contributions of their p-curve papers beyond the estimate of effect size, which puts their approach within their larger plan for reproducible science.

One specific disagreement among these authors is the performance of the various estimates when combining studies whose underlying effect sizes differ. There’s an interesting mini-debate here which I won’t try to adjudicate but will briefly mention: McShane et al. argue that effect sizes can differ a lot, even among studies that attempt to be exact replications; Simonsohn et al. argue, based on an analysis of the ManyLabs replication paper, that the evidence for varying effects is not so clear. The related dispute is how these estimates perform in the presence of varying effects: McShane et al. point out a bias in the estimated average effect size across studies, while Simonsohn et al. argue that average population effect size is not so important, and they care more about average power. Simmons, Nelson, and Simonsohn (2018) also claim that their estimate is unbiased for the goal of estimating the average effect size of the studies included in the analysis. I started to lose interest in this part of the argument because I’m not so interested in average effect sizes without some clear sense of what’s being averaged over, and I’m _really_ not interested in average power.

On a different point, one thing I did not see mentioned in any of these discussions is that p-values are not in general uniformly distributed even under the null hypothesis, for two reasons: (a) discrete data and (b) composite null hypotheses in which the distribution of the test statistic depends on nuisance parameters. The amount of deviation from uniformity arising from these complexities depends on the particular problem, and I suspect the deviations are small for the simple models with relatively large sample sizes in the examples being considered (as noted in supplement 4 of Simonsohn, Nelson, and Simmons, 2014a); still, I’m uncomfortable with the casual statements by all these authors implying the uniformity of the distribution of p-values under the null hypothesis, as I fear that they are furthering a common misconception.
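The discreteness point can be checked with a small exact calculation. The sketch below assumes an exact two-sided binomial test of the null p = 0.5 with n = 20 trials (an illustrative choice, not an example taken from any of the papers) and computes the probability, under the null, that the p-value falls at or below various thresholds. If p-values were uniform, each probability would equal its threshold.

```python
from math import comb

def binom_pmf(k, n, p):
    # Probability of exactly k successes in n trials.
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p(k, n):
    # Exact two-sided p-value for H0: p = 0.5.
    # The null is symmetric, so double the smaller tail and cap at 1.
    tail = sum(binom_pmf(j, n, 0.5) for j in range(0, min(k, n - k) + 1))
    return min(1.0, 2.0 * tail)

def null_cdf_of_p(alpha, n):
    # P(p-value <= alpha) when the null is true.
    return sum(binom_pmf(k, n, 0.5)
               for k in range(n + 1)
               if two_sided_p(k, n) <= alpha)

n = 20
for alpha in (0.01, 0.05, 0.10, 0.50):
    print(alpha, round(null_cdf_of_p(alpha, n), 4))
```

With n = 20, the probability that the p-value is at most 0.05 is about 0.041, and it is exactly the same for a threshold of 0.10, because no achievable p-value lies between those two cutoffs. The deviation from uniformity shrinks as n grows, consistent with the suspicion that it is small in the large-sample examples under discussion.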

The use of any of these techniques in practice should be considered as only approximate for all but the simplest models, even in the absence of forking paths or p-hacking, and even if the selection model is correct. The prevalence of forking paths and p-hacking in published p-values makes any of these selection estimates even more suspect. A few years ago, I expressed these same concerns about an attempt by Jager and Leek to estimate a “science-wide false discovery rate” using a model fit to a distribution of published p-values (Gelman and O’Rourke, 2013).

This attitude of mine—a skepticism about any attempt to estimate effect sizes or aggregate properties using distributions of published p-values—is consistent with statements of McShane et al. (2016):

“In theory, more general selection methods can be designed to account for all of these issues. However, the population average effect size estimates produced by selection methods can be highly sensitive to the data model and the selection model assumed (particularly the latter), and more realistic data models and selection models typically cannot be well estimated without a large amount of data. Moreover, even if such more general selection methods were practically tractable, they would still fail to account for issues of selection resulting from the availability and accessibility of studies; such selection is (a) likely to be as important or more important than selection resulting from the size, direction, and statistical significance of study results and (b) far more difficult to model (even conceptually). . . . Consequently, given the idealistic model assumptions underlying selection methods and the sensitivity of population average effect size estimates to them, we advocate that selection methods should be used less for obtaining a single estimate that purports to adjust for publication bias ex post and more for sensitivity analysis.”

And of Simonsohn (2018):

“Given a set of studies, we can ask what is the average effect of those studies. We have to worry, of course, about publication bias, p-curve is just the tool for that. If we apply p-curve to a set of studies it tells us what effect we expect to get if we run those same studies again. To generalize beyond the data requires judgment rather than statistics. Judgment can account for non-randomly run studies in a way that statistics cannot.”

I don’t completely agree with that last statement from Simonsohn. Or, I should say, I agree that to generalize beyond the data requires judgment, but I think that here, as in many other settings, statistics can help guide that judgment. We almost never work with random samples, and we use our judgment all the time to extrapolate. But when using such judgment, I see no reason to turn the statistics off and start guessing.

That said, at some point the extrapolation gets so difficult as to be close to impossible, and that’s how I feel about methods of trying to draw inference about an underlying effect, or population of effects, from the distribution of published p-values. There’s just too much selection, both within and between papers. As I wrote above, I think that publication bias is the least of it. So I agree with both McShane et al. and Simonsohn et al. that these methods should be thought of as methods of demonstrating how bad the selection bias can be in a literature, under best-case assumptions, rather than as a method of estimating underlying effect sizes.

Thus, I can see how the observed distribution of p-values can be helpful to look at, if for no other reason than to reveal problems with naive interpretations of published p-values; as Simonsohn et al. demonstrate, even simple displays of theoretical and empirical distributions of p-values can give some insight. One reason I like the term “p-curve” (setting aside any disagreements about which estimator to use) is that it emphasizes the distribution of estimates. Simonsohn writes, “p-curve is, in our view, the best diagnostic tool for assessing whether a set of statistically significant findings as a whole rule out selective reporting as the only explanation for the set of findings.” Setting aside the question of what tool is best, I think this statement by Simonsohn is consistent with my general view that all these tools are most useful as a sort of rhetorical approach to show how bad things can be, even in the best-case scenario.
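As an illustration of the kind of theoretical display mentioned above, the following sketch computes the expected share of statistically significant p-values falling in each of five bins between 0 and 0.05. It assumes a two-sided z-test with z distributed N(delta, 1) and no p-hacking or other selection beyond the 0.05 cutoff; `p_curve` and `prob_p_below` are hypothetical helper names, not functions from the p-curve software.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal

def prob_p_below(a, delta):
    # P(two-sided p-value <= a) when z ~ N(delta, 1).
    z_a = STD.inv_cdf(1.0 - a / 2.0)
    return (1.0 - STD.cdf(z_a - delta)) + STD.cdf(-z_a - delta)

def p_curve(delta, edges=(0.0, 0.01, 0.02, 0.03, 0.04, 0.05)):
    # Share of significant p-values in each bin, conditional on p <= edges[-1].
    total = prob_p_below(edges[-1], delta)
    shares = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        below_lo = prob_p_below(lo, delta) if lo > 0 else 0.0
        shares.append((prob_p_below(hi, delta) - below_lo) / total)
    return shares

for delta in (0.0, 1.0, 2.0, 3.0):
    print(delta, [round(s, 3) for s in p_curve(delta)])
```

Under these assumptions a null effect (delta = 0) gives a flat curve with 20% in each bin, while a nonzero effect piles significant p-values up near zero. Comparing an observed histogram with such curves is the descriptive step; it does not by itself address discreteness, composite nulls, or forking paths.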

I get concerned, though, if people take these methods too literally. Consider the classic file-drawer-effect paper by Rosenthal, which I assume was written to demonstrate how serious this selection problem can be, but is sometimes twisted around to give the opposite meaning (by doing the calculation of how many papers would need to have been discarded to be consistent with a particular pattern of published results, and then claiming that since no such massive “file drawer” exists, the published claims should be accepted). I wouldn’t want researchers to take p-curve, or the Hedges approach, as evidence that a literature of uncontrolled p-values is approximately just fine.
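For readers unfamiliar with the file-drawer calculation, here is a minimal sketch of the usual "fail-safe N" formula commonly attributed to Rosenthal. It assumes the published studies are combined by summing z-statistics and dividing by the square root of the number of studies, that the unpublished studies average z = 0, and that the one-sided cutoff is 1.645. The input values are hypothetical.

```python
import math

def fail_safe_n(z_values, z_crit=1.645):
    # Number of unseen studies averaging z = 0 needed to pull the combined
    # z-statistic, sum(z) / sqrt(k + n), down to z_crit.
    k = len(z_values)
    return (sum(z_values) / z_crit) ** 2 - k

z_published = [2.0] * 10  # hypothetical: ten published studies, each with z = 2
n_hidden = fail_safe_n(z_published)
print(n_hidden)
```

The resulting number is only as meaningful as its assumptions, in particular that the hidden studies average exactly zero and that the published z-statistics are themselves free of forking paths, which is why a large value should not be read as reassurance.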

As is often the case, I find myself more convinced by the demonstration of bias than by the attempted bias correction. In that sense, I see the Hedges procedure, or p-curve, or p-uniform, as being comparable to Type M and Type S errors (Gelman and Tuerlinckx, 2000) as a way of quantifying some effects of selection bias in statistical inference, but the desired solution is to go back to the original, unselected, data. All these methods can be useful in giving us a sense of the scale of bias arising in idealized situations.
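The Type S and Type M quantities can be computed in closed form in the simplest case. The sketch below assumes a normally distributed estimate with known standard error, a positive true effect, and a two-sided 5% significance filter; `delta` is the true effect divided by the standard error, and the function name `type_s_m` is made up for this illustration.

```python
from statistics import NormalDist

STD = NormalDist()            # standard normal
Z_CRIT = STD.inv_cdf(0.975)   # two-sided 5% threshold

def type_s_m(delta):
    # delta = true effect / standard error, assumed positive.
    p_right = 1.0 - STD.cdf(Z_CRIT - delta)   # significant, correct sign
    p_wrong = STD.cdf(-Z_CRIT - delta)        # significant, wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power                  # P(wrong sign | significant)
    # E[|z| ; significant], using the truncated-normal mean formulas.
    mean_abs_sig = (delta * p_right + STD.pdf(Z_CRIT - delta)
                    - delta * p_wrong + STD.pdf(-Z_CRIT - delta))
    exaggeration = mean_abs_sig / power / delta   # Type M: E[|z| | sig] / delta
    return power, type_s, exaggeration

for delta in (0.5, 1.0, 2.0, 5.0):
    power, type_s, exaggeration = type_s_m(delta)
    print(delta, round(power, 3), round(type_s, 4), round(exaggeration, 2))
```

When the true effect is small relative to its standard error, significant estimates overstate the effect several-fold and occasionally get the sign wrong; when the signal is strong, the exaggeration ratio approaches 1. This conveys the scale of the bias in an idealized setting rather than a correction to apply to real data.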

References

Gelman, A., and Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics 15, 373-390.

Hedges, L. V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational and Behavioral Statistics 9, 61–85.

McShane, B. B., Böckenholt, U., and Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science 11, 730-749.

Simmons, J., Nelson, L. D. and Simonsohn, U. (2018). P-curve handles heterogeneity just fine. Data Colada blog, 8 Jan. http://datacolada.org/67

Simonsohn, U. (2015). “The” effect size does not exist. Data Colada blog, 9 Feb. http://datacolada.org/33

Simonsohn, U., Nelson, L. D., and Simmons, J. P. (2014b). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science 9, 666–681.

van Aert, R. C. M., Wicherts, J. M., and van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science 11, 713-729.

van Assen, M. A. L. M., van Aert, R. C. M., and Wicherts, J. M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods 20, 293-309.

**Comment by Blakeley McShane, Ulf Böckenholt, and Karsten Hansen**

McShane et al. write:

We thank Andrew for his thoughtful discussion of our paper. We largely agree with his post—particularly in our mutual skepticism of using the p-curve or any other method to obtain a definitive adjustment for publication bias—as it is consistent with many issues discussed in our paper. We have a few brief comments; a more extensive set of comments can be found here [http://blakemcshane.com/mbh.response2.pdf].

[1] As Andrew notes, Simonsohn, et al. reinvented the wheel: the p-curve employs a model for the observed data that is identical to that of Hedges (1984). However, the p-curve uses an ad hoc improvised estimation procedure rather than maximum likelihood as in Hedges (1984) and thus yields inferior estimates.

[2] Meta-analytic research has for decades focused on estimating the population average effect size (PAES) and estimating it accurately as measured by mean square error or similar quantities. However, when effect sizes are heterogeneous, the p-curve yields upwardly-biased and highly inaccurate estimates of the PAES.

[3] While we agree with Andrew that heterogeneity is one among many difficulties in modeling selection, we view it as a central one because heterogeneity is the norm in psychological research. It is rife and large in comprehensive meta-analyses of psychological studies (Stanley et al., 2017; van Erp et al., 2017). More interesting, it persists—and to a reasonable degree—even in large-scale replication projects where rigid, vetted protocols with identical study materials are followed across labs in a deliberate attempt to eliminate it (Klein et al., 2014; Eerland et al., 2016; Hagger et al., 2016; McShane, Böckenholt, & Hansen, 2016).

Given [2] and [3], the p-curve cannot be relied upon to provide valid or definitive adjustments for publication bias.

[4] In defending the p-curve [http://datacolada.org/67], Simonsohn, et al. avoid confronting [2] and [3]: contra the norm in meta-analytic research and contra their own paper’s title / topic (“p-Curve and Effect Size”), they selectively report a novel estimand (average historical power rather than the PAES) and a novel evaluation metric (bias rather than accuracy).

We share Andrew’s complete lack of interest in average power! However, even if one is interested in average power, one must evaluate a proposed estimator with regards to accuracy. Bias alone is uninteresting. In our more extensive comments [http://blakemcshane.com/mbh.response2.pdf], we show the p-curve produces highly inaccurate (if low bias) estimates of average power in the setting proposed by Simonsohn, et al. Therefore, one should be skeptical about using the p-curve to estimate not only the PAES but also average power.

[5] In downplaying heterogeneity [http://datacolada.org/63], Uri misses the point: regardless of the degree of heterogeneity found in large-scale replication projects, that there is any whatsoever in this setting where every effort is taken to eliminate it is both substantively interesting and strong evidence that it simply cannot be eliminated. Indeed, to argue for his point that homogeneity is the norm in psychological research, it is not sufficient to argue or provide evidence that there is low or even no heterogeneity where one would not expect to find it. Instead, one must show evidence that there is low or no heterogeneity precisely where one would expect to find it.

[6] Along with Andrew and contrary to Simonsohn, et al., we are skeptical of the ability of any method to obtain a definitive adjustment for publication bias and we certainly are not advocating for “our” method. However, the p-curve is sufficiently flawed in practice that it has no place even in sensitivity analysis or for “rhetoric.” Instead, the more sophisticated selection methods that evolved from Hedges (1984) should be the starting point for this purpose (for a review, see Hedges and Vevea, 2005).

Additional references

Hedges, L. V., & Vevea, J. L. (2005). Selection method approaches. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment and adjustments (pp. 145–174). Chichester, England: John Wiley & Sons.

Klein, R. A., Ratliff, K., Nosek, B. A., et al. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology 45, 3, 142–152.

Stanley, T. D., Carter, E. C., & Doucouliagos, H. (2017). What Meta-Analyses Reveal about the Replicability of Psychological Research. Deakin Laboratory for the Meta-Analysis of Research, Working Paper, November 2017.

van Erp, S., Verhagen, A. J., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2017). Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990-2013. Journal of Open Psychology Data, in press.

**Comment by Uri Simonsohn**

Simonsohn writes:

I really appreciate that Andrew reached out to us for comments, and that he was responsive to our feedback. In the 200 words that follow I briefly direct readers to our original positions on some of the key issues.

1) What are we trying to learn about?

Most meta-analysts claim to want to estimate the average of all possible experiments on a research question. We believe that this population, and hence its average, does not exist (http://datacolada.org/33). P-curve informs us “only” about the studies included in p-curve (http://datacolada.org/67). It tells you what you expect to get if you were to rerun those study designs again.

2) Discreteness.

Andrew is concerned about tests with discrete distribution of p-values. We worried about this a few years ago also. In Supplement 4 of our first p-curve paper we explain why we are not worried anymore.

3) Compound tests.

I am not sure I understand Andrew’s concern with compound tests. My best guess is that it involves the reasonable concern that some scientific hypotheses are hard to map onto a single or handful of statistical tests that can then be included in p-curve. But, in practice, having p-curved or seen p-curves of 100s of studies, I have not come across a single one for which this is an actual problem.

4) Heterogeneity of effect size across studies.

a) Heterogeneity is less severe among observably identical studies than usually feared (http://datacolada.org/63)

b) Heterogeneity does not affect the performance of p-curve (http://datacolada.org/67)

**P.S.** Marcel van Assen, an author of the paper on the p-uniform method, posted a response in comments which I’m appending here:

I am trained as a mathematical psychologist, which provides a decent training in mathematical statistics, but of course this training does not compare to that of a mathematical statistician.

One reason NOT to start with ML is that the likelihood of conditional probabilities is very skewed and the approximations are therefore far off in some conditions. However, these problems can be solved using likelihood profiling.

McShane never contacted us to ask why we implemented p-uniform as we did, and just implemented ML without wondering why one may consider not using it…

McShane also assumes heterogeneity is the norm in psychology. Looking at Manylabs studies and RRRs, the data say otherwise. In published meta-analyses there often IS considerable heterogeneity; this is not as much because effects vary all over the place, but because psychologists tend to combine quite unrelated studies.

I would just add here that I’d expect heterogeneity to be a big deal in almost any topic in which effect sizes are large. It may be that the Many Labs studies didn’t find a lot of heterogeneity, but I’m guessing that the main effects are small too, and that the data are consistent with variation in treatment effects that’s the same order of magnitude as the average effects.

I do not understand the following:

a) Heterogeneity is less severe among observably identical studies than usually feared (http://datacolada.org/65)

The link is to a blog post regarding the supposed benefits of volunteering on health. There is no mention of heterogeneity in the whole post. What am I missing?

Can’t be certain, but I think it’s supposed to be blog #63 not #65.

http://datacolada.org/63

Makes sense, thank you. Hopefully Simonsohn will confirm :)

Oops. Sorry my bad.

Andrew, could you please fix the link?

Fixed.

“I don’t know enough about practice in psychological methods to have a sense of “

Andrew:

Are there words, maybe sentences missing after this??

D’oh! I patched it.

“they would still fail to account for issues of selection resulting from the availability and accessibility of studies; such selection is (a) likely to be as important or more important than selection resulting from the size, direction, and statistical significance of study results and (b) far more difficult to model (even conceptually)”

I don’t understand this. In this day and age, any study published in the last several decades is readily accessible around the world through the internet; at worst it is behind a paywall. So what is the accessibility issue apart from publication bias, or perhaps the limited accessibility of very old research, or studies that are still in press or only published in the last few weeks? These strike me as, in practice, negligible sources of error. What am I missing here?

I think the point is maybe there are many small studies that are done but not published? Not sure.

That any study published in the last several decades is on the internet was refuted by Dr. Ioannidis. He claims that within a few years their record disappears.

I’m curious if taking an approach like this would be reasonable in this setting:

1) Fit a bunch of Hedges-type models (different selection sub-models, assuming heterogeneity or not, etc) using Bayesian estimation instead of MLE.

2) Combine the models using stacking to produce an estimate of the effect size.

That seems a recommended approach in other settings where we don’t know the true model (and even assume none of the candidate models are the true model), so why not here?

If we do not know the true extent of publication bias (plus p hacking) how can we use any of these methods to estimate the effect size? The problem with all of these is not what we know, but what we know we don’t know: the extent to which researchers hide and fiddle with their results.

Yup, the insurmountable uncertainty here is primarily knowing what is causing the extra-sampling variation being observed (or even not being observed). Is it due to treatment variation with subject features (interaction), identifiable variation of treatments being given or study conduct and reporting quality (whatever it is that detracts from validity)? This may vary in a reproducibility project versus usual study publication of similar studies.

If it is just treatment variation with subject features, then for a fixed mixture of patient features the average effect will be fixed and well defined. That will enable one to generalize to a population with the same mix or post-stratify to another mix.

That was the topic and I believe the source of disagreement with the author in this post https://andrewgelman.com/2017/11/01/missed-fixed-effects-plural/ Specifically, I believe it is very rare to be fairly sure that the extra sampling variation is primarily being driven by treatment variation with subject features; the author does not. Furthermore, I believe the author failed to make it clear to readers that this certainty about there being only treatment variation with subject features was required, while the author seemed to believe the limitations and caveats they gave were adequate.

Hi, I am one of the authors of p-uniform.

I am trained as a mathematical psychologist, which provides a decent training in mathematical statistics, but of course this training does not compare to that of a mathematical statistician.

One reason NOT to start with ML is that the likelihood of conditional probabilities is very skewed and the approximations are therefore far off in some conditions. However, these problems can be solved using likelihood profiling.

McShane never contacted us to ask why we implemented p-uniform as we did, and just implemented ML without wondering why one may consider not using it…

McShane also assumes heterogeneity is the norm in psychology. Looking at Manylabs studies and RRRs, the data say otherwise. In published meta-analyses there often IS considerable heterogeneity; this is not as much because effects vary all over the place, but because psychologists tend to combine quite unrelated studies.

Cheers!

Marcel

Marcel:

Thanks. I’ve appended your comment to the above post.

> NOT to start with ML is that the likelihood of conditional probabilities is very skewed and the approximations are therefore far off in some conditions. However, these problems can be solved using likelihood profiling

Excellent point – one of the things I tried to elucidate here https://andrewgelman.com/wp-content/uploads/2011/05/plot13.pdf “[ML] plots omit important features of the likelihoods and priors they are likely to be less informative or even outright misleading – especially when the curves are not very quadratic”

Perhaps two things to note here – 1. For many applied problems quadratic (ML +/- SE) methods/plots are adequate. 2. It is hard to appreciate how reluctant (unprepared) folks are to move on to working with full likelihood functions – even mathematical statisticians.

In physics, it’s fairly common for researchers to publish papers that show a new measurement of some physical quantity that has already been measured. They usually are trying to improve on previous results, or to compare a new method against older ones. They report the mean and some kind of range. It isn’t a question of finding a “significant” difference, since the new result is often – but not always – within the error bars of the older ones. They may be looking for better precision, or for more confirmation that the actual value is within some range of values.

Here’s an example:

http://g2pc1.bu.edu/lept06/GabrielseTalk.pdf

See especially slides 13, 15, 77, and 78 for graphic comparisons with previous work.

How refreshing to see after reading so many unfortunate examples in some other fields that Andrew has been finding!

I think there are a few things that contribute to this:

– These experiments tend to be Really Big, especially in particle physics (my field), and though the analysis is far from “pre-registered” at least the community generally knows about the experiments and wants to know what they found.

– Physics can be _really_ weird, so a lot of experiments are checking the nooks and crannies of what we think we already know, just in case we find something strange there. Most of the time we just find the status quo, but it’s important to let other people know so they can go looking somewhere else instead.

– Physics tends to have “theorists” and “experimentalists”. Since physics can be so weird, theorists are generally glad to have any information we experimentalists can give them, whether or not we found anything surprising, so they can put better constraints on their models.

– Most physics experiments need a lot of theory to interpret the results, and those theories need measured input. So measuring just about anything with better precision can reduce the systematic uncertainties in other experiments, sometimes by quite a lot.

So basically I think it comes down to (a) physics is in the fortunate position that experiments can be largely theory-driven, and (b) money.

“Physics can be _really_ weird”

Agreed. But human behavior can also be really weird. I suspect that one very big difference between physics and the social sciences is that physicists acknowledge that physics can be really weird, but social scientists rarely acknowledge that human behavior can be really weird; they seem to have some magical (to me) belief that human behavior can obey the simple theories that they “test” for.

Martha:

Also, to bring in my pet topic, “measurement”:

Physicists have been thinking seriously for centuries—millennia!—about measurement. They understand that measurement is a challenge, and they’ve put huge investments into measurements. They’ve also placed huge investments into theories.

In contrast, social science research, including some of the work that gets the most publicity, is often based on vague theories and sloppy measurement. Even when the experiment has some creativity (the bottomless soup bowl, etc.), the actual data collection often seems to have little thought going into it, and often experiments are done measuring people only once and not following up (thus losing the chance to control for variation among people).

Yup.

One incident that often comes to mind is a math ed dissertation committee I was on, where the candidate was doing research on “stereotype threat”, so one of the co-advisors was a psychologist who had worked in the area. His reaction to some situation (I don’t recall the details, but I think it was something like people responding to questionnaires in ways that were counter to his expectations) was “People are schizophrenic!”. Sounded to me like a) he expected people to be simpler (in particular, less variable) than they are, and b) it didn’t occur to him that his “instrument” (i.e., the questions asked of subjects) might not be very well-thought-out.

I don’t think controlling for individual variation is so important compared to observing how a behavior/whatever changes over time. For example, I doubt there is any phenomenon involving humans that doesn’t have at least circadian and seasonal cycles (mortality, birth, illness, hormones, cardiovascular params, blinking, mood, etc).

Another example:

Try to find out how long it takes the pulse of people of various ages and fitness to return to baseline after strenuous exercise. For some reason the research is dominated by checking only one minute after exercise instead of reporting a curve until resting heart rate is regained… Who would run one of these studies only to check one minute after in the end? This is NHST brain damage that leads to wasting unimaginable amounts of time/money.

Anon:

Good point. Controlling for variation among people, and observing how things vary over time: these are two things you get from taking repeated high-quality measurements on each person.

I am not an experimental psychologist nor statistician. But for some reason I like delving into subjects that I don’t understand. So apologies for naive comments and questions.

From Hedges (1984): “The biggest disadvantage of maximum likelihood estimation is that it requires a specialized computer program to compute the likelihood and its maxima.” Alas I thought, too bad Richard McElreath’s Statistical Rethinking book/videos and R weren’t around, I wonder what he’d have made of them. And earlier in the paper, though no single sentence captures it, he makes an Ioannidis-like argument for a literature severely distorted by dichotomania. I imagined he was looking down from Heaven saying “I told you so”. So I looked him up. Man did he get off to a fast start. From the organization of the paper and apparent deep knowledge of historical recognition of the root of the problem I assumed it came from wisdom but he was in his 20s when he started writing on these topics. Has he stated anywhere what he thinks about e.g. this: https://www.buzzfeed.com/stephaniemlee/brian-wansink-cornell-p-hacking?utm_term=.tyLzgQEEz#.hhRND2eeN ?

P.S. Thanks for hosting this p-curve debate.

Thanatos:

Don’t think Richard could have put together his course/videos in 1984. The two stage sampling idea of Bayes given in Rubin, DB (1984). “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician” was only put forward as conceptual given computation at the time. Furthermore, WG Cochrane and his staff were unable to get maximum likelihood estimation to actually work for things like the eight schools meta-analysis example in 1980 or 82.

Additionally, a lot of Hedges and Olkin’s work was in meta-analysis areas that did not support well defined likelihoods so they were instead focusing on effect estimates and p-values. In line with Sander’s comment here https://andrewgelman.com/2018/02/20/zoologist-slams-statistical-significance/#comment-670877 “P can be computed easily in many cases where likelihood can’t (or computing likelihood requires additional assumptions that may be dodgy).”

Good points. Thanks.

One thing I appreciate about p-curve is the intuitiveness of the method. It’s a useful teaching tool for illustrating a serious problem, and in my judgement the “too many p-values too close to 0.05” argument resonates with people who are otherwise intimidated by or resistant to discussions of replicability / publication bias / researcher degrees of freedom.

I know Andrew and many others here are not interested in p-values and power and I understand why. But from an outreach point of view, showing the distribution of the p-value under various realistic scenarios is an effective way of reaching people who are overly attached to their p-values. It takes away some of the magic.
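For anyone who wants to try this kind of demonstration, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: each study is a two-sided z-test on `n` unit-variance observations, and the function names, effect sizes, and sample sizes are hypothetical, not taken from any of the papers discussed. It compares how the significant p-values are spread under a zero effect and under a real one.

```python
import random
from statistics import NormalDist

def simulate_p_values(effect, n, n_studies, seed=0):
    """Two-sided z-test p-values for n_studies simulated studies.

    Assumes each study has n observations from Normal(effect, 1), so its
    z statistic is distributed Normal(effect * sqrt(n), 1).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_studies):
        z = rng.gauss(effect * n ** 0.5, 1.0)
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

def share_of_significant_below(p_values, cut=0.025, alpha=0.05):
    """Among p-values under alpha, the fraction that fall under cut."""
    significant = [p for p in p_values if p < alpha]
    return sum(p < cut for p in significant) / len(significant)

# Illustrative settings only: a zero effect versus a moderate effect.
null_p = simulate_p_values(effect=0.0, n=20, n_studies=20000)
real_p = simulate_p_values(effect=0.5, n=20, n_studies=20000)

null_share = share_of_significant_below(null_p)
real_share = share_of_significant_below(real_p)
print(f"zero effect: share of significant p-values below 0.025 = {null_share:.2f}")
print(f"real effect: share of significant p-values below 0.025 = {real_share:.2f}")
```

With a zero effect the p-values are uniform, so roughly half of the significant ones land below 0.025. With a real effect they pile up near zero. A set of significant results that instead clusters just under 0.05 fits neither picture, which is the intuition behind the “too close to 0.05” argument.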

Ben:

I’d separate p-curve into two ideas: (1) looking at the distribution of the p-curve and comparing it to what might be seen under various theoretical models, (2) specific methods of estimation. I expect McShane et al. are correct that step (2) is better done using the method of Hedges (1984) or some similar approach, but I do think the name “p-curve” is valuable in drawing attention to step (1), which I think is the more important idea.
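To make idea (2) concrete, here is a rough sketch of a selection-model likelihood in the spirit of Hedges (1984). It is not his actual estimator. It assumes a deliberately simplified setup: every study reports a z statistic distributed Normal(mu, 1), and only studies with z above 1.96 are published. The names `truncated_loglik` and `fit_mu`, the grid search, and the simulated data are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = 1.96  # assumed publication rule: only z > Z_CRIT is observed

def truncated_loglik(mu, zs):
    """Log-likelihood of published z statistics under one-sided selection.

    Each published z has density phi(z - mu) / (1 - Phi(Z_CRIT - mu)),
    i.e. a normal truncated to z > Z_CRIT. Constants not involving mu
    are dropped.
    """
    log_tail = math.log(1 - STD_NORMAL.cdf(Z_CRIT - mu))
    return sum(-0.5 * (z - mu) ** 2 for z in zs) - len(zs) * log_tail

def fit_mu(zs, lo=-2.0, hi=6.0, steps=401):
    """Maximize the truncated likelihood over a plain grid of mu values."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda mu: truncated_loglik(mu, zs))

# Simulated literature: true mean z of 2.0, with only significant
# results "published".
rng = random.Random(1)
TRUE_MU = 2.0
all_z = [rng.gauss(TRUE_MU, 1.0) for _ in range(2000)]
published = [z for z in all_z if z > Z_CRIT]

naive_estimate = sum(published) / len(published)
selection_estimate = fit_mu(published)
print(f"naive mean of published z: {naive_estimate:.2f}")
print(f"truncated-likelihood estimate: {selection_estimate:.2f}")
```

The naive average of the published z statistics is biased upward because the nonsignificant studies are missing. The truncated likelihood corrects for that by conditioning on the selection rule. Real applications would need study-specific standard errors and probably a less rigid selection rule than this sketch assumes.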

“Simonsohn et al. and van Assen et al., with backgrounds not in statistics but in experimental psychology, came up with their own data-based estimates…”

Just critique the methods, not the credentials.

Ignazio:

Huh? This is a description, not a critique. And I’m talking about backgrounds, not credentials. The different backgrounds are relevant when understanding how the different research groups ended up using different estimation methods.

And, for that matter, I think it’s a plus, not a minus, for a researcher to have an applied background!

Alright then, my bad, I misinterpreted this.

Ah, I’m so glad to see a post on this! I’ve wondered for a while whether to recommend the p-curve to other people but information on this subject was hard to come by. It’s good to have another source of information about this.

Incidentally, there were some pretty big effect sizes in Many Labs. The most variation in effect sizes was found in these big ones (notably anchoring). Also notably, this variation was largely located on one side of zero (that is, the question was basically whether an individual lab would show a large positive effect size or a huge positive effect size). On the other hand, the effects that failed to obtain significance when summed across labs (that is, two priming studies) were clustered pretty tightly around zero rather than being spread out. At least, that’s how it was with the effects I remember well – I’d have to double check to comment on the study as a whole.

Austin:

It makes sense to me that large effects will have large variation. This is something I’ve often said, although it’s tricky to come up with a theoretical argument for it as a general principle.

Intuitively, how can there be heterogeneity if the “main effect” is zero (as there is not much to vary, if there is no effect)?

This seems to be less likely than heterogeneity if the “main effect” (average effect) is strongly different from zero.

Assessing and explaining/understanding heterogeneity is one of the most difficult enterprises in the social sciences (lack of precision, even if one has, say, 50 studies), and one of the most important enterprises. Fortunately, many ManyLab studies and registered replication reports are underway, providing evidence on these important enterprises.

Marcel:

Yes, exactly, if the main effect is near zero, you’d not expect much to vary. It could be that various interactions just happen to cancel out in the population so you could have a near-zero main effect in the presence of large interactions, but this sort of thing should be unlikely.

And I agree completely that interactions are hard to estimate.

My point with respect to ManyLabs was just that ManyLabs is studying many different effects, many of which are near-zero and would be expected to have tiny interactions, some of which are farther from zero and could have large interactions. Even if it’s the case that most of the ManyLabs studies have small interactions, that would not necessarily be the case for large effects of interest.

One of the difficulties in discussing p-curve etc., is that these methods are most interesting in cases of large effects, but they seem to be most commonly applied in settings such as power pose where effects are small and swamped by noise and systematic error.

Ah, incidentally, it was unclear (at least in the case of anchoring) whether the between-labs heterogeneity was more than would be expected from within-lab heterogeneity, because the distribution of data completely wrecked the statistics Many Labs was using. Uri Simonsohn did a thing on this:

http://datacolada.org/63