“Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades”

Kevin Lewis sends along this article by Laura Pritschet, Derek Powell, and Zachary Horne, who write:

Some effects are statistically significant. Other effects do not reach the threshold of statistical significance and are sometimes described as “marginally significant” or as “approaching significance.” Although the concept of marginal significance is widely deployed in academic psychology, there has been very little systematic examination of psychologists’ attitudes toward these effects. Here, we report an observational study in which we investigated psychologists’ attitudes concerning marginal significance by examining their language in over 1,500 articles published in top-tier cognitive, developmental, and social psychology journals. We observed a large change over the course of four decades in psychologists’ tendency to describe a p value as marginally significant, and overall rates of use appear to differ across subfields. We discuss possible explanations for these findings, as well as their implications for psychological research.

The common practice of dividing data comparisons into categories based on significance levels is terrible, but it happens all the time (as discussed, for example, in this recent comment thread about a 2016 Psychological Science paper by Haimowitz and Dweck), so it’s worth examining the prevalence of this error, as Pritschet et al. do.

Let me first briefly explain why categorizing based on p-values is such a bad idea. Consider, for example, this division: “really significant” for p less than .01, “significant” for p less than .05, “marginally significant” for p less than .1, and “not at all significant” otherwise. And consider some typical p-values in these ranges: say, p=.005, p=.03, p=.08, and p=.2. Now translate these two-sided p-values back into z-scores, which we can do in R via qnorm(1 - c(.005, .03, .08, .2)/2), yielding the z-scores 2.8, 2.2, 1.8, 1.3. The seemingly yawning gap in p-values between the “not at all significant” p-value of .2 and the “really significant” p-value of .005 corresponds to a difference in z-scores of only 1.5. Indeed, if you had two independent experiments with these z-scores and with equal standard errors and you wanted to compare them, you’d get a difference of 1.5 with a standard error of 1.4—completely consistent with noise. This is the point that Hal Stern and I made in our paper from a few years back.
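
Here is a minimal R sketch of that arithmetic, using the same four p-values as above:

p <- c(.005, .03, .08, .2)
z <- qnorm(1 - p/2)    # two-sided p-values translated back to z-scores: 2.8, 2.2, 1.8, 1.3
z[1] - z[4]            # gap between the .005 and .2 results: a z-score difference of about 1.5
sqrt(1^2 + 1^2)        # standard error of that difference for two independent estimates: about 1.4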

From a statistical point of view, the trouble with using the p-value as a data summary is that the p-value is only interpretable in the context of the null hypothesis of zero effect—and in psychology studies, nobody’s interested in the null hypothesis. Indeed, once you see comparisons between large, marginal, and small effects, the null hypothesis is irrelevant, as you want to be comparing effect sizes.

From a psychological point of view, the trouble with using the p-value as a data summary is that this is a kind of deterministic thinking, an attempt to convert real uncertainty into firm statements that are just not possible (or, as we would say now, just not replicable).

P.S. Related is this paper from a few years ago, “Erroneous analyses of interactions in neuroscience: a problem of significance,” by Sander Nieuwenhuis, Birte Forstmann, and E. J. Wagenmakers, who wrote:

In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience. We discuss scenarios in which the erroneous procedure is particularly beguiling.

It’s a problem.
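
To make the correct procedure concrete, here is a small R sketch with made-up numbers (not from the paper): two effects with equal standard errors, one conventionally significant and one not, whose difference is nowhere near significant.

est1 <- 0.28; se1 <- 0.10   # hypothetical effect 1: z = 2.8, conventionally "significant"
est2 <- 0.13; se2 <- 0.10   # hypothetical effect 2: z = 1.3, "not significant"
z_diff <- (est1 - est2) / sqrt(se1^2 + se2^2)   # test the difference directly
2 * pnorm(-abs(z_diff))                         # about 0.29: no evidence the two effects differ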

P.S. Amusingly enough, just a couple days ago we discussed an abstract that had a “marginal significant” in it.

28 thoughts on ““Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades””

  1. When I read the abstract I was surprised not to see there at least a qualitative statement about which direction the trend took. In case you’re curious too, this from the link:

    Articles published in 2010 were 2.47 times more likely (95% confidence interval, or CI = [1.88, 3.22]; odds ratio = 3.61) to describe a result as marginally significant than articles published in 1970.

  2. Great point about being more interested in the null hypothesis. I liked this article, http://science.sciencemag.org/content/353/6303/989, because they seemed to be more interested in finding a way of maintaining the null hypothesis, and the model that the null hypothesis represents. That model is the one that represents the stable system and had a lot of weight in terms of past research. The data they examined was considered from the point of whether the model needed to change, or not. The null hypothesis was an anchor, not a trivial ‘alternate’ hypothesis. Change, in the context of climate change, is not ‘attractive’; this is indeed the opposite of social psychology where the next new thing is the thing that supersedes the other old thing, leaving behind a trail of unwanted and sad null hypotheses!

    Loved these two paragraphs: “Given the uncertainties surrounding dynamical aspects of climate change, a reasonable null hypothesis would be that climate change is dominated by its thermodynamic aspects. The unusual behavior seen in recent decades would then reflect natural variability. The contrary hypothesis is that the accelerated warming of the Arctic is part of the climate-change signal and has changed the weather patterns in midlatitudes through changes in the tropospheric polar vortex. Such a hypothesis is not far-fetched: There are general grounds for expecting that the dynamical response to climate change will resemble the modes of internal variability (4). Unfortunately, this expectation makes it difficult to separate the signal from the noise, because they have similar spatial patterns.
    One aspect of the scientific debate has focused on whether the observed changes associated with particular hypotheses are statistically significant. This is rather beside the point, because the definition of statistical significance is arbitrary (5). A lack of statistical significance does not mean that the effect is not there, and a positive finding does not imply any attribution to climate change. It is also extremely challenging to accurately characterize the low-frequency noise from the limited observational record… “

  3. “…whether the observed changes associated with particular hypotheses are statistically significant. This is rather beside the point…”

    Beautiful. I wouldn’t have come up with “beside the point”; it’s a minimally aggressive term yet remains accurate. Great stuff.

  4. The Pritschet et al article describes their method for finding descriptions of results as “marginally significant” as follows:

    “Articles were searched using Adobe Reader for all instances of the strings “margin” and “approach.” L. Pritschet then judged whether these instances were being used to label a result as marginally significant.” (p. 2)

    However, https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/ documents many other ways of saying “marginally significant” that do not use the strings “margin” or “approach.” Thus the actual use of the “marginally significant” meme is probably more widespread than what Pritschet et al. found in their study.
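
    As a toy illustration (the phrases here are just a few examples in the spirit of the linked post, not its actual list), a search keyed only to "margin" and "approach" misses other common wordings:

    phrases <- c("marginally significant",
                 "approached significance",
                 "a trend toward significance",
                 "nearly significant",
                 "failed to reach significance")
    grepl("margin|approach", phrases)   # TRUE TRUE FALSE FALSE FALSE: the last three would be missed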

  5. As usual, you make some good points, but as now seems standard when P-values are mentioned, there is a polemical mis-statement: “the trouble with using the p-value as a data summary is that the p-value is only interpretable in the context of the null hypothesis of zero effect”. The falsity of that statement is readily demonstrated.

    Say I know the null is false but I collect a data sample anyway. If your statement were correct then the P-value calculated from those data would tell me nothing, but in fact it tells me something about the data that is often quite useful. The P-value tells me how strange those data would be according to the statistical model set to the null hypothesis; it tells me something about the evidence in the data. The null hypothesis serves as a landmark in parameter space, or an anchor, but the P-value is a summary of the data, not of the null hypothesis or the parameter of interest.

    If I did not already know that the null hypothesis was false then a small P-value might push me towards an opinion regarding the falsity of the null hypothesis, and of other hypothetical values of the parameter of interest close to the null hypothesis, and it might be a factor in a decision, but it does so by being an index of evidence.

    I agree wholeheartedly that the “deterministic thinking” approach of converting P-values into firm statements is a serious impediment to good science. However, the solution to problems like the use of “marginal significance” lies in encouraging scientists to base their conclusions on scientific argument and reasoning. Killing off a particular data summary may seem like a shortcut towards such an outcome, but as the saying goes, shortcuts make long delays.

    • Michael:

      In just about all the problems I’ve ever worked on, I know the null hypothesis is false. I even think the null hypothesis is false in power pose! I’m sure that power pose has effects, I just think these effects are highly variable, can go in all directions, and are highly context dependent.

        • Michael:

          In situations where I have no interest in the null hypothesis (that is, almost all situations I’m involved in), I also don’t have much use for a measure of distance from that null hypothesis. I just don’t see the point. I’d rather attack the problem directly.

      • As a Bayesian, it’s possible to say “these effects are highly variable…” etc., because we can talk about the probability of a single unrepeatable event. But most people’s stats backgrounds require “the effect” to mean *the average effect* (or median, or something like that). The average over a whole population isn’t a variable thing; it’s just a fixed number (for a fixed population), so I think there’s an aspect of talking past each other implicit here.

        “The effect” for a Bayesian can mean something like “the effect that doing X had on this particular person at this particular time and place” while “the effect” for someone working in a Frequentist NHST context automatically means “the average effect over a population of interest”.

        • Hey Daniel, we discussed a variant of this idea on your blog a while back, and I guess I’m still not sure this is quite right. All statistics is about generalizing from a sample to a population, and I don’t think it’s right to say that the population being generalized to necessarily is different for a frequentist than a Bayesian. Yes, Bayesian methods give us a direct approach to Pr[E], via summarizing uncertainty with some probability distribution – whereas frequentism requires a *hypothetical* long-run of repeated sampling trials to even define probability, and thus naturally to estimate uncertainty. However, that long-run of trials is just hypothetical, and in particular it does not imply that the underlying population has to literally be capable of admitting thousands (or whatever) re-sampling events. No?
          At any rate, this is not to justify NHST, which is a different beast, and I totally buy into Andrew Gelman’s criticism of it.

        • I think you’re going wrong when you say “All statistics is about generalizing from a sample to a population”. Bayesian statistics is perfectly fine with estimating a single quantity with uncertainty. For example, what was the mass of my car at 6AM Pacific Time on October 17th 2016?

          A Frequentist won’t admit a probabilistic answer here. There’s *no* notion of repetition. A Bayesian can use probability to discuss which values of the mass are more likely vs less likely to have been true at that moment in time.

        • Consider a pill designed to, say, reduce cholesterol. I give it to 100 people. The reduction in cholesterol happens to be given exactly by the R code:

          set.seed(1234509876)
          redpre = rnorm(100, 0, 25)              # each person's draw from the population
          reduction = redpre - mean(redpre)       # center so the average reduction is exactly zero
          redmeas = reduction + rnorm(100, 0, 1)  # noisy measurement of each person's reduction

          which means reduction has exactly zero mean in this population of 100 people. But, even though there is exactly zero reduction on average, some people had reductions of 20 or 40 points, whereas some people had increases of 20 or 40 points.

          Let’s not get into whether it’s possible for a Frequentist to have opinions on whether a particular person had a particular reduction or not… Let’s just say that in practice, when a Frequentist sits down to analyze this problem in terms of a probability model, they will be thinking of *the effect* as the actual average effect over the 100 people, with the 100 people being the population over which the repeated sampling is well defined (or they may like to extrapolate to some larger population of which the 100 people are a fixed sample). Nevertheless, in any case, “the effect” is a single number, which happens to be exactly zero here. Whereas a Bayesian has no problem thinking of “the effect” as a 100-dimensional vector of numbers, with effect sizes varying from one person to another. The mathematical probability object can be easily defined over these 100-dimensional vectors.

          So, mathematically, they’re talking about two totally different objects: one number versus a 100-dimensional vector.
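
          A minimal sketch, not part of the comment above: if we treat the normal distribution with sd 25 from the snippet as the prior for each person's true reduction, and the measurement noise sd of 1 as given, the standard normal-normal conjugate formulas give a posterior for each of the 100 individual effects, that is, the 100-dimensional object a Bayesian can reason about directly.

          prior_sd <- 25; noise_sd <- 1
          shrink <- prior_sd^2 / (prior_sd^2 + noise_sd^2)   # weight given to each person's measurement
          post_mean <- shrink * redmeas                      # posterior mean of each of the 100 individual effects
          post_sd <- sqrt(shrink) * noise_sd                 # posterior sd for each individual effect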

    • “The P-value tells me how strange it would be”. Hmm. The P-value is some point on the surface above a 3D space with one axis being distance from the null, one axis being variance in the sample, and one axis being sample size, but I would associate “strangeness” with only one of these axes – the distance from the null. So the association of the 74 loci in the “Genome-wide association study [that] identifie[d] 74 loci associated with educational attainment” (http://www.nature.com/nature/journal/v533/n7604/abs/nature17671.html) is not at all strange or surprising, because the effect is so tiny (the R^2 was less than 0.01) and the sample size was so large (greater than 200,000)!

        • Ugggh. Clearly commenting doesn’t like the greater/less than symbols! The last bit should be: the R^2 was less than 0.01 and the sample size was greater than 200,000.
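
          A quick R illustration of Jeff's point, with invented numbers rather than the actual study: with a couple hundred thousand observations, an effect explaining well under 1% of the variance can still have a vanishingly small p-value.

          set.seed(2)
          n <- 200000
          x <- rnorm(n)
          y <- 0.05 * x + rnorm(n)                 # true R^2 of roughly 0.0025
          fit <- summary(lm(y ~ x))
          fit$r.squared                            # about 0.0025: a tiny effect
          fit$coefficients["x", "Pr(>|t|)"]        # yet the p-value is astronomically small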

      • Jeff, I assume you are pointing to a perceived shortcoming of P-values, but the problem is that you are assuming that the low P-value will be interpreted in isolation from the rest of the information about the study in question. If you have a study that contains many, many comparisons of gene loci with associated P-values, then you should expect that some of the P-values will be small even where their null hypotheses are true. That’s not a problem if you do not adopt the mindless “deterministic thinking” because you know that there will be lots of relatively small P-values.

        Even in that circumstance the P-values can be useful because you can rank the interestingness of the gene loci using their associated P-values. The genes with the smallest P-values are the genes that the data say are worth further investigation or scientific consideration.

        • Certainly in some papers I’ve read (this is outside my field) this seems to be how p-values are used – simply to rank the genes for further investigation. But then why not just rank on effect size? It would seem that a p-value (even one corrected by FDR or whatever) is beside-the-point. Unless NONE of the p-values are small. But then that raises the ultimate question in gene association/expression studies – what is the physiological consequence of effect size? What is a biologically small vs. big effect? I wish I knew the field better for example of research groups pursuing this because this would seem to be the ultimate goal. But most of the high profile papers that I see (like the Nature paper that I cited in my post) are all about “discovering” associations.

        • Jeff, that is indeed how P-values are often used, unfortunately. That does not mean that we should assume that the utility of P-values is restricted to that mode of mis-use.

          Your point about effect sizes is good. There are many situations in which there is little point in calculating the P-value. Gene association studies are one such situation, I suppose.

        • The “effect size” in gene studies is complex. There can be genes that are part of a feedback control system which are tightly regulated, and small perturbations from the regulated transcriptional rate cause “big” biological effects… other genes you could double or quadruple their transcription rates and not have massive effects… Since things like RNA-seq and microarrays are measuring in some sense how much messenger RNA there is, and not “how much biological consequence there is,” effect sizes are not directly observable. So, observing that there is a difference between the transcription levels of a gene in condition A vs B and that this difference exceeds what is normally seen in just the control condition A alone… tells you that something unusual is going on at the transcription level at least. There is no absolute effect size at the transcription level, so we need a dimensionless ratio, and a Z score essentially does the job.
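
          A toy sketch of that dimensionless comparison, with invented numbers: expression of one gene in control replicates versus treated replicates, scaled by the variability seen among the controls.

          set.seed(1)
          a <- rnorm(6, mean = 100, sd = 10)   # expression of one gene in control (A) replicates
          b <- rnorm(6, mean = 125, sd = 10)   # expression of the same gene in treated (B) replicates
          (mean(b) - mean(a)) / sd(a)          # z-like score: difference relative to control variability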

  6. “From a statistical point of view, the trouble with using the p-value as a data summary is that the p-value is only interpretable in the context of the null hypothesis of zero effect—and in psychology studies, nobody’s interested in the null hypothesis.”

    Why is that? Psychology is not my field, but how is that *not* interesting? Why is the interest in the null hypothesis discipline dependent?

    • David:

      It’s not just psychology. For example, in a medical study, different treatments will be effective for different people. Treatment A might not be consistently better than Treatment B, but that isn’t the same thing as the null hypothesis of zero effect. I’ve almost never been interested in the null hypothesis in anything I’ve ever looked at. However, I’ve heard rumors that the null hypothesis can be approximately true in certain fields such as genetics, so I’m avoiding making any global statements in this regard.

      • Do you mean something like these examples from genetic research: testing whether there is any difference in the level of gene expression between conditions/treatments, or whether there is a difference in the frequency of a single-nucleotide polymorphism (a change in nucleotide content) of interest between a patient cohort and a normal cohort?

        In both scenarios there will be some difference, but the focus is on whether these differences are statistically AND (this one gets lost mostly) biologically meaningful. The majority of the publications evaluate such questions with p-values/NHST. I would rather it were more acceptable to interpret the gradient/effect size and proceed further to the biology.

  7. What you’re getting at here is a big idea that’s hard to put into terms people generally understand. I think you’re making progress and I know it’s really hard. Here’s an example of what I mean: a piece in Slate about Anita Hill, which relates two bits from the last 25 years, Anita Hill’s testimony and women now coming forward about Trump, and which then connects the intervening 25 years as a causal chain that is meant, to a degree, to explain the course of changes for women over that period. This is, as you know, a typical form of analysis, and it appears in these casual forms of social analysis and in the semi-casual forms of social research.

    It’s very difficult for people to grasp that all these events can be described as a set, and that here are two points in that set, and that the set is big so you’re just picking 2 bits, and those 2 bits can be set up as a chain, but that chain is nothing more than a relationship within the set, the one you happen to choose in the set you happen to choose, and that it may be neat but it’s likely, indeed overwhelmingly likely in complex sets, that the relationship is effectively coincidence. You can show them images of the shapes of distributions and the potential pathways within them and even bring out big functions like zeta functions that show how barely controlled complexity is, but it’s hard to get.

    I think so much of the confusion, abstracted to an analytical level, comes from the difficulty in realizing that the definition of the set restricts what you consider as outcomes, that a definition excludes (a garden of forking paths in some cases), and that the set can’t be presumed to accurately reflect anything more than whatever conclusion you can fish out (p-hacking, etc.), and that thus the entire exercise is always going to be a low-powered look, much as if we considered the early telescopes versus Hubble. My point is that we’re drawn to this kind of thinking because we can infer, just as we infer the shape of our galaxy as a result, but it’s not just that most inferences are going to be wrong but that most methods of inference are wrong too.

    It’s tough to understand that a test or trial can be evaluated, and that you can also evaluate something else, but that then comparing result significance can be like arguing sports-bar questions. It’s significant that Ty Cobb was a great hitter and that Ted Williams was a great hitter, and we can get closer to comparing them because they played the same basic game, but in clearly different eras with different expectations about results, with some different physical attributes, etc. It’s tough enough comparing Peyton Manning to Tom Brady when they played at the same time in the same position in games against each other, when by “tough enough” I mean a method which generates a stable result when passed values. I can flip the Peyton/Tom discussion binarily just by slightly changing the attributes considered of the set, just as I can flip the Anita Hill article binarily in a bunch of ways, including the fact that Bill Clinton won women’s votes the next year though he was accused of actual rape, and that it nearly always takes a person to come forward, a fact to come out, and then other people come forward and other facts come out, so the selection of these bits of women coming forward has no special significance in either effect or method.

    I know that’s what you’re talking about here, but man it’s a big challenge to explain to people why this and not just “here are your mistakes”. I thought of this at Yom Kippur because the rabbi mentioned the concept that people have to balance telling someone “you’re doing it wrong”. I never realized before the degree to which statistics connects to the methods of reality, that it’s getting at the fundamental truths in a different way than other mathematics, logical and heuristic systems.

  8. > I never realized before the degree to which statistics connects to the methods of reality
    Maybe that needs to be emphasized more in statistical writing and teaching.
