
“and, indeed, that my study is consistent with X having a negative effect on Y.”

David Allison shares this article:

Pediatrics: letter to the editor – Metformin for Obesity in Prepubertal and Pubertal Children A Randomized Controlled Trial

and the authors’ reply:

RE: Clarification of statistical interpretation in metformin trial paper

The authors of the original paper were polite in their response, but they didn’t seem to get the point of the criticism they were purportedly responding to.

Let’s step back a moment

Forget about the details of this paper, Allison’s criticism, and the authors’ reply.

Instead let’s ask a more basic question: How does one respond to scientific criticism?

It’s my impression that, something like 99% of the time, authors’ response to criticism is predicated on the assumption that they were completely correct all along: the idea is that criticism is something to be managed. Tactical issues arise—Should the authors sidestep the criticism or face it head on? Should they be angry, hurt, dismissive, deferential, or equanimous?—but the starting point is the expectation of zero changes in the original claims.

That’s a problem. We all make mistakes. The way we move forward is by learning from our mistakes. Not from denying them.

Here was my response to Allison: you think that’s bad; check out this journal-editor horror story. These people are actively lying.

Admitting and learning from our errors

Allison responded:

We (meaning the scientific community in its broadest form) definitely have a long way to go in learning how to adhere scrupulously to truthfulness, to give and respond to criticism constructively and civilly, and how to admit mistakes and correct them.

I like this line from Eric Church: “And when you’re wrong, you should just say so; I learned that from a three year old.”

I wish more people would be willing to say:

You’re right. I made a mistake. My study does not show that X causes Y. I may still believe that X causes Y, but I acknowledge that my study does not show it.

We do occasionally get folks to write that in response to our comments, but it is all too rare.

Anyway, right now I have been looking at papers that make unjustified causal inferences because of neglecting (or not realizing) the phenomenon of regression to the mean. Regression to the mean really seems to confuse people.

And I replied: You write:

I wish more people would be willing to say:

You’re right. I made a mistake. My study does not show that X causes Y. I may still believe that X causes Y, but I acknowledge that my study does not show it.

I’d continue with, “and, indeed, that my study is consistent with X having a negative effect on Y. Or, more generally, having an effect that varies by context and is sometimes positive and sometimes negative.”

Also, I think that the causal discussion can mislead, in that almost all these issues arise with purely correlational studies. For example, the silly claim that beautiful parents are more likely to have daughters. Forget about causality; the real point is that there’s no evidence supporting the idea that there is such a correlation in the population. There’s a tendency of people to jump from the “stylized fact” to the purported causal explanation, without recognizing that there’s no good evidence for the stylized fact.
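
Allison’s point about regression to the mean is easy to see in a simulation. The sketch below is a minimal illustration under assumed numbers, not an analysis of any paper discussed here: each unit has a stable true level plus independent measurement noise at baseline and at follow-up, nothing is done to anyone, and we look only at the units selected for a high baseline. The function name, sample size, cutoff, and unit variances are all hypothetical choices.

```python
import random
import statistics


def simulate_regression_to_mean(n=10_000, cutoff=1.0, seed=0):
    """Return (mean baseline, mean follow-up) for units selected on a high baseline.

    No treatment is applied; the two measurements differ only by noise.
    """
    rng = random.Random(seed)
    baseline, followup = [], []
    for _ in range(n):
        true_level = rng.gauss(0.0, 1.0)                   # stable trait (assumed sd = 1)
        baseline.append(true_level + rng.gauss(0.0, 1.0))  # noisy first measurement
        followup.append(true_level + rng.gauss(0.0, 1.0))  # noisy second measurement

    # Select on the baseline only, as a study enrolling "high scorers" would.
    selected = [i for i in range(n) if baseline[i] > cutoff]
    mean_before = statistics.fmean(baseline[i] for i in selected)
    mean_after = statistics.fmean(followup[i] for i in selected)
    return mean_before, mean_after


before, after = simulate_regression_to_mean()
print(f"selected group, baseline mean:  {before:.2f}")
print(f"selected group, follow-up mean: {after:.2f}")
```

With equal signal and noise variance, as assumed here, the expected follow-up excess is half the baseline excess, so the selected group appears to “improve” even though no treatment exists. Reading that drop as a causal effect is the unjustified inference Allison describes.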

11 Comments

  2. Carlos Ungil says:

    Their reply doesn’t seem so bad to me, at least from a quick glance. I was curious about what a typical paper in the field looks like. The first recent reference from Dr Allison returned by Google Scholar (“Adiposity and Reproductive Cycling Status in Zoo African Elephants”, how cool is that?) doesn’t look much better from the statistical inference point of view.

    > BCS was strongly correlated with age (ρ = 0.603; P = 0.003), weight (ρ = 0.759; P < 0.0001; Figure 3A), FFM (ρ = 0.702; P = 0.001; Figure 3B), and FM (ρ = 0.583; P = 0.007; Figure 3C) but not with relative fat (ρ = 0.256; P = 0.276; Figure 3D) or percent body fat (ρ = 0.337; P = 0.146).

    Ok. If p < 0.05 we say there is correlation and if p > 0.05 (p=0.146 and p=0.276) we say there is no correlation.

    > The correlation between FM and relative fat with glucose (ρ = 0.379, P = 0.100; ρ = 0.555, P = 0.011, respectively; Figure 4A-4B), insulin (ρ = 0.369, P = 0.110; ρ = 0.352, P = 0.128, respectively; Figure 4C-4D), and leptin (ρ = 0.384, P = 0.095; ρ = 0.399, P = 0.081, respectively; Figure 4E-4F) nearly reached significance.

    Why is not the conclusion that FM and relative fat were not correlated with glucose, insulin and leptin?

    Fine, let’s say that if 0.05 < p < 0.10 correlations “nearly reach significance”.

    But wait, the correlations between FM and relative fat and insulin (p=0.110 and p=0.128) did also nearly reach significance? Where is the threshold? Somewhere between 0.128 and 0.146?

    And why don’t they conclude that there is correlation between relative fat and glucose (ρ = 0.555, p = 0.011)? Aren’t we using p=0.05 as threshold for significance?

    > FM, adjusted for FFM, was correlated with glucose (ρ = 0.520; P = 0.022)

    Good, p < 0.05 is statistically significant and they say that there is correlation.

    > and trended toward significance with insulin (ρ = 0.371; P = 0.118) and leptin (ρ = 0.403; P = 0.087).

    Is “trended toward significance” synonymous with “nearly reached significance”? It seems so looking at the numbers.

    > FM was not correlated with SAA (ρ = 0.007; P = 0.979; Figure 4G) or TNF-α (ρ = −0.0353; P = 0.883; Figure 4H).

    Ok, those p-values are high and those correlation estimates are very low.

    > Weight was not correlated with water turnover rate (ρ = 0.357; P = 0.123).

    Why do they say that weight is not correlated with water turnover rate (ρ = 0.357; P = 0.123) but results like (ρ = 0.369, P = 0.110), (ρ = 0.371; P = 0.118), (ρ = 0.352, P = 0.128) deserve to be described as “trended toward significance” or “nearly reached significance”?

    > Glucose was correlated with insulin (ρ = 0.430; P = 0.046).

    Ok, so glucose was correlated with insulin (ρ = 0.430; P = 0.046) and FM, adjusted for FFM, was correlated with glucose (ρ = 0.520; P = 0.022). So why did they say then that the correlation between glucose and relative fat (ρ = 0.555, p = 0.011) nearly reached significance?
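
The inconsistency described in this comment can be checked mechanically. The sketch below is a minimal illustration: the P values and wording are copied from the quoted results, but the three-level ranking of the labels is an assumption, as is treating “trended toward significance” and “nearly reached significance” as the same label. It lists every pair where the result with the smaller P value got the weaker verbal description.

```python
from itertools import permutations

# Strength ordering of the verbal labels (an assumption made for this sketch).
RANK = {"not correlated": 0, "nearly significant": 1, "correlated": 2}

# (pair, P value, verbal label), copied from the quoted results.
RESULTS = [
    ("BCS vs relative fat", 0.276, "not correlated"),
    ("BCS vs percent body fat", 0.146, "not correlated"),
    ("FM vs glucose", 0.100, "nearly significant"),
    ("relative fat vs glucose", 0.011, "nearly significant"),
    ("FM vs insulin", 0.110, "nearly significant"),
    ("relative fat vs insulin", 0.128, "nearly significant"),
    ("FM vs leptin", 0.095, "nearly significant"),
    ("relative fat vs leptin", 0.081, "nearly significant"),
    ("adjusted FM vs glucose", 0.022, "correlated"),
    ("adjusted FM vs insulin", 0.118, "nearly significant"),
    ("adjusted FM vs leptin", 0.087, "nearly significant"),
    ("FM vs SAA", 0.979, "not correlated"),
    ("FM vs TNF-α", 0.883, "not correlated"),
    ("weight vs water turnover", 0.123, "not correlated"),
    ("glucose vs insulin", 0.046, "correlated"),
]


def find_inversions(results):
    """Return pairs (a, b) where a has the smaller P value but the weaker label."""
    return [
        (a, b)
        for a, b in permutations(results, 2)
        if a[1] < b[1] and RANK[a[2]] < RANK[b[2]]
    ]


for a, b in find_inversions(RESULTS):
    print(f"{a[0]} (P = {a[1]}) is '{a[2]}', "
          f"but {b[0]} (P = {b[1]}) is '{b[2]}'")
```

Any fixed threshold rule would produce no such pairs, so each printed line is a place where the wording cannot be derived from the numbers alone.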

    • Anoneuoid says:

      The same quotes with the clutter removed:

      > BCS was strongly correlated with age, weight (Figure 3A), FFM (Figure 3B), and FM (Figure 3C) but not with relative fat (Figure 3D) or percent body fat.

      > The correlation between FM and relative fat with glucose (Figure 4A-4B), insulin (Figure 4C-4D), and leptin (Figure 4E-4F) nearly reached significance.

      > FM, adjusted for FFM, was correlated with glucose

      > and trended toward significance with insulin and leptin.

      > FM was not correlated with SAA (Figure 4G) or TNF-α (Figure 4H).

      > Weight was not correlated with water turnover rate.

      > Glucose was correlated with insulin.

      I really think all those numbers are included just to make it seem more “sciency” and imposing. I don’t see what they add.

      Then of course we see what we are left with is a list of “this is correlated with that”. Why present this info as prose at all? A table would be much better. If you really want to convey the max information then scatter plots of each pair are obviously best though.

      • bxg says:

        I think these criticisms of the elephant paper are missing an important point
        about the structure of the paper.

        The paper has a ‘results’ section, where the quotes above come from.
        But the results section is largely just a laundry list of summary statistics
        and comparisons, with pretty much everything on display (weak and strong).
        It could almost be computer generated.

        Yes, while presenting a comprehensive list of statistics, the results
        section does comment about things being significant or not – which is
        perhaps annoying and unnecessary. But they aren’t filtering what they
        show us (either by dropping comparisons or dropping the underlying
        numbers), so this is mostly harmless. For instance, when they say in
        this section that something with p = 0.08 is almost significant – well,
        perhaps they shouldn’t do that – but arguably 0.08 is ‘almost’ 0.05 so
        as a verbal description of the numbers it’s not actually wrong or confusing.

        It’s a later step of research, when one states conclusions – the things you
        present to the reader as things to take away from the research – that
        gets tricky. Once I’m selecting what results to present as findings,
        and (e.g.) play loose with whether an ‘almost’ significant result is
        included, I deserve harder criticism.

        But the elephant paper didn’t find much in its data, and draws few and weakly
        stated conclusions. (It is even hesitant to draw negative ones, acknowledging
        lower power from its sample). The main (only?) strongly stated conclusion
        seems to be touting the efficacy of a measurement technique.

        Their results section would be doing the reader a grave wrong – relative to its purpose –
        if it dropped the actual statistics and only gave us the boolean results after a 0.05 cutoff test.
        Or if it were less complete (less list-like). And I don’t think it’s that useful to get
        hyper-pedantic about what they call significant or not _in this section_.

        • Carlos Ungil says:

          > For instance, when they say in this section that something with p = 0.08 is almost significant – well, perhaps they shouldn’t do that – but arguably 0.08 is ‘almost’ 0.05 so as a verbal description of the numbers it’s not actually wrong or confusing.

          It does get confusing when p=0.011 is described as “nearly reached significance” or when p=0.123 is described as “not correlated” while p=0.128 is described as “nearly reached significance”.

          > Once I’m selecting what results to present as findings, and (e.g.) play loose with whether an ‘almost’ significant result is included, I deserve harder criticism.

          Does the “showed a trend for association with leptin” mention in the abstract deserve some criticism?

          > I don’t think it’s that useful to get hyper-pedantic about what they call significant or not _in this section_.

          I didn’t think of my comment as being particularly useful either. But I’m curious: do you think that getting hyper-pedantic about the following statement _in the discussion section_ would have been more useful?

          “FM, adjusted and unadjusted, almost reached significance with insulin levels. There was one insulin outlier, and although there was no reason to exclude it, running the analyses without this data point resulted in a significant correlation between unadjusted and adjusted FM and insulin.”

          • bxg says:

            Fair enough. It would be a bit futile to debate the problems with these statements, since – while I probably view them as a bit more benign, in context, than you do – your implied complaints are clearly valid.

            I stand by (but acknowledge that I was reacting a bit more to Anoneuoid’s suggested rewrite than to your notes) my belief that the ‘results’ section should be treated in the spirit in which it is given, and that while we could really tear it apart if treated as a “laundry list of conclusions” that’s not really fair or productive. My belief is about this section of the paper, and this only, but this section seems to be what your initial comment drew on exclusively.

            Personally, and _relative_ to what goes on out there, I like the modesty of the discussion and conclusions for this paper (nontrivial data collection; the pressure to get a big ‘result’ would be high) but it’s not at all beyond (statistical) criticism. As you show now, and I would not disagree with.

            • Carlos Ungil says:

              To be clear, it’s not that I don’t see this as benign. I just found the lack of consistency in how to abuse p-values quite amusing. I don’t really have a problem with that paper, I just forced myself to find some other issues after you replied. For better or worse, few papers are beyond statistical criticism…

              My comment was on the correlations because there were a lot of them (a laundry list as you said) and the description didn’t make much sense. I agree with Anoneuoid that a table would be better (but I don’t think that removing the numbers altogether would be better). In fact the inconsistency problem I highlighted was due to trying to describe them in blocks and not looking at each p-value separately.

  3. Hence says:

    “Oopsy. I didn’t know / have overlooked that. Thank you.” not a common response, right?

    A few timely coincidences here, felt impelled to comment.

    Psychologically, one problem is that acknowledging a mistake is usually a loss of face. It’s truth versus social standing.

    If the correctee grants the existence of the mistake, the audience reevaluates the social hierarchy in favor of the corrector. How is the corrector perceived, if not as someone smarter and more discriminating about the topic than the correctee? If the correctee would lose social points, there’s resistance.

    (“Correctee”? Correct me if this word doesn’t exist. If you do and you are correct, then what have I become?)

    A working solution should have all parties win. Corrector and audience can make it clear that the correctee’s acceptance of the correction is actually praiseworthy, not a loss.

    This is why I recently wrote this as a complement to those letters to authors. It clarifies the intentions of the corrections coming from the funny-looking site (about ending NHST).

    I referenced McShane & Gelman 2017 (“Abandon Statistical Significance”) in one of the letters.

    Now, this is just the corrector speaking. Have a look. Would you also agree that their acceptance would be nothing but praiseworthy?

    • Keith O'Rourke says:

      A young sociologist I overheard on a train many years ago put it very concisely: “when people ask questions at academic conferences they’re essentially saying ‘I am smart and I know important things'”.

      Later it occurred to me that the usual responses are “that’s either not true or not important or both and you actually don’t seem smart”.

      (Got to this point of your comments and realized I need another cup of coffee ;-)
      > Any similarities between this fictional story and actual mammals, currencies, organizations, caloric snacks, and agricultural crops is entirely coincidental (p = 0.047)

  4. Hence says:

    Keith,

    I would agree that an appropriate degree of caffeination seems recommended before reading that page. :)
