“I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.”

Florian Wickelmaier writes:

I’m writing to tell you about my experiences with another instance of “the difference between significant and not significant.”

In a lab course, I came across a paper by Costa et al. [Cognition 130 (2) (2014) 236-254, http://dx.doi.org/10.1016/j.cognition.2013.11.010]. In several experiments, they compare the effects in two two-by-two tables by comparing the p-values rather than by a test of the interaction, a mistake very much like the one you describe in the Gelman and Stern (2006) paper.

I felt that this should be corrected and, mainly because I had told my students that such an analysis is wrong, I submitted a comment to Cognition (http://arxiv.org/abs/1506.07727). This comment got rejected. The main argument of the editor seems to be that he is convinced the effect is real, so who needs a statistical test? I compiled the correspondence with Cognition in the attached document.

In the end, with the help of additional quotes from your paper, I persuaded the editor to at least have the authors write a corrigendum (http://dx.doi.org/10.1016/j.cognition.2015.05.013), in which they report a meta-analysis pooling all the data and find a significant interaction.

I think it is a partial success that this corrigendum now is published so readers see that something is wrong with the original paper. On the other hand, I’m unhappy that it is impossible to check this new analysis since the raw data are not accessible. Moreover, this combined effect does not justify the experiment-by-experiment conclusions presented before.

I’d like to thank you very much for your paper. Without it, I’m afraid, my complaints would not have been heard. It seems people are still struggling to understand the problem even when it is pointed out to them, and even in a major journal in psychology.

Lots of interesting things here. The editor wrote:

In the end, given that I’m convinced the effect is real, I’m just not sure that the community would benefit from this interchange.

To which Wickelmaier replied:

My comment is not about whether or not the effect exists (it may or may not), my comment is about the missing statistical test. I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.

I agree. Or, to put it another way, I have no objection to a journal publishing a claim without strong statistical evidence, if the result is convincing for some other reason. Just be open about what those reasons are. For example, the article could say, “We see the following result in our data . . . The comparison is not statistically significant at the 5% level but we still are convinced the result is real for the following reasons . . .”

There was indeed progress. The editor responded to Wickelmaier’s letter as follows:

You have convinced me that there’s a serious problem with Costa et al.’s analysis. But I also remain convinced by his subsequent analyses that he has a real effect. I think your suggestion to have him write an erratum (or a corrigendum) was an excellent one. . . .

This hits the nail on the head. It’s ok to publish weak evidence if it is interesting in some way. The problem comes when there is the implicit requirement that the evidence from each experiment be incontrovertible, which leads to all sorts of contortions when researchers try to show that their data prove their theory beyond a reasonable doubt.
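
To make the statistical point concrete, here is a minimal sketch in Python of the difference between comparing two p-values and testing the difference directly. The numbers are made up for illustration; they are not Costa et al.'s data, which, as Wickelmaier notes, are not publicly available.

```python
import numpy as np
from scipy import stats

# Hypothetical effect estimates and standard errors from two experiments
# (made-up numbers for illustration only).
est_a, se_a = 0.25, 0.10   # experiment A: z = 2.5, "significant"
est_b, se_b = 0.10, 0.10   # experiment B: z = 1.0, "not significant"

p_a = 2 * stats.norm.sf(abs(est_a / se_a))
p_b = 2 * stats.norm.sf(abs(est_b / se_b))

# The wrong move: declare the experiments different because p_a < 0.05 < p_b.
# The right move: test the difference between the two effects directly.
diff = est_a - est_b
se_diff = np.hypot(se_a, se_b)   # assumes the two experiments are independent
p_diff = 2 * stats.norm.sf(abs(diff / se_diff))

print(f"p_a = {p_a:.3f}, p_b = {p_b:.3f}, p for the difference = {p_diff:.3f}")
```

In this toy example one estimate clears the 5% threshold and the other does not, yet the test of their difference is nowhere near significant, which is exactly the point of the 2006 paper.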

As always, we must accept uncertainty and embrace variation.

13 thoughts on ““I do not agree with the view that being convinced an effect is real relieves a researcher from statistically testing it.””

  1. The obvious put obviously (I hope.) – “The problem comes when there is the implicit requirement that the evidence from each experiment be incontrovertible, which leads to all sorts of contortions when researchers try to show that their data [in their individual, island on its own study] prove their theory beyond a reasonable doubt.”

  2. The data:
    “Recently, a serious financial crisis has started. Without any action, the company you manage will lose 600,000 euros. In order to save this money, two types of actions are possible.

    Gain version:
    If you choose Action A, 200,000 euros will be saved.
    If you choose Action B, there is a 33.3% chance that 600,000 euros will be saved and a 66.6% chance that no money will be saved.
    Which action do you choose?

    Loss version:
    If you choose Action A, 400,000 euros will be lost.
    If you choose Action B, there is a 33.3% chance that no money will be lost and a 66.6% chance that 600,000 euros will be lost.
    Which action do you choose?”

    They report that when these questions were posed in the native language, 71% of people in the gain version (49/69) chose Action A, vs. 56% in the loss version (40/71). In the non-native language it was 67% (47/70) vs. 61% (43/70). That is a difference of only about 10 people between the groups. Is it possible that 1/7 students just read the first sentence and say “save 200k, sure. Lose 400k, no way. They aren’t paying me enough to do math”?
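
    For what it’s worth, here is a quick sketch of the interaction test that was missing, using only the four cell counts quoted above; it is not the authors’ analysis and can only approximate what a full model on the raw data would give.

```python
# Interaction test on the counts quoted above: does the framing effect
# (gain vs. loss) differ between the native and the foreign language?
import numpy as np
from scipy import stats

def log_odds(chose_a, n):
    """Log-odds of choosing Action A, with an approximate standard error."""
    return np.log(chose_a / (n - chose_a)), np.sqrt(1 / chose_a + 1 / (n - chose_a))

native_gain, se1 = log_odds(49, 69)
native_loss, se2 = log_odds(40, 71)
foreign_gain, se3 = log_odds(47, 70)
foreign_loss, se4 = log_odds(43, 70)

# Framing effect within each language, then the difference between them.
framing_native = native_gain - native_loss
framing_foreign = foreign_gain - foreign_loss
interaction = framing_native - framing_foreign
se_interaction = np.sqrt(se1**2 + se2**2 + se3**2 + se4**2)

z = interaction / se_interaction
p = 2 * stats.norm.sf(abs(z))
print(f"framing effect (log-odds): native {framing_native:.2f}, foreign {framing_foreign:.2f}")
print(f"interaction = {interaction:.2f}, z = {z:.2f}, p = {p:.2f}")
```

    Whichever way the numbers come out, it is this difference between the two framing effects that needs the test, not the two p-values side by side.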

  3. There’s a lovely tongue-in-cheek paper which points out that parachutes are a medical intervention, and they have never had a randomised controlled trial. The authors suggest therefore that there is no evidence that parachutes work to stop injury if you jump out of aircraft, and that an RCT is needed – and helpfully suggest that supporters of RCTs volunteer for such a trial.

    I think the same principle applies here. I don’t need a statistical test to tell me that jumping out of an aircraft with no parachute is a bad idea. I’m happy to rely on my very strong prior.

    • What if you grew up on an asteroid mining colony and were ten generations removed from contact with anyone living on a planet? Someone could tell you the “myth” about places where you get hurt if you fall.

        • I was thinking of “2001: A Space Odyssey”:
          ‘Floyd stared in fascination at the self-assured little lady, noting the graceful carriage and the unusually delicate bone structure. “It’s nice to meet you again, Diana,” he said. Then something – perhaps sheer curiosity, perhaps politeness – impelled him to add: “Would you like to go to Earth?”

          Her eyes widened with astonishment; then she shook her head. “It’s a nasty place; you hurt yourself when you fall down. Besides, there are too many people.”’

    • There’s a big difference between a prior which is the posterior distribution of some enormous body of evidence, and a prior that … well… isn’t.

      We have a tight posterior distribution for the use of the following ODE for predicting velocity at impact:

      ma = -mg + Drag(v, A) + Buoyancy(V)

      with particularly tight distributions on g and the drag coefficient, etc., all of which are informed by enormous bodies of previous experimental work (including measuring the terminal velocity of parachutists in various positions prior to opening the ’chute).

      The beautiful thing about the parachute example is that it shows how important *using valid information in a prior* is. The take-away should be, but I’m afraid may not always be: “Enormous bodies of evidence shouldn’t be ignored when analyzing a new situation”. It’s a little too easy to take away from that parachute example “We don’t really need RCTs”.
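
      As a toy illustration of how tight that physics-informed prior is, here is a rough sketch; all parameter values below are assumptions on my part, and buoyancy is negligible for a body falling in air, so I drop it.

```python
# Toy sketch: terminal velocity from the drag-balance condition
# m*g = 0.5 * rho * Cd * A * v^2  (buoyancy in air is negligible here).
# All parameter values are rough assumptions for illustration only.
import math

def terminal_velocity(mass_kg, drag_coeff, area_m2, rho_air=1.2, g=9.81):
    """Speed at which drag balances weight for a falling body."""
    return math.sqrt(2 * mass_kg * g / (rho_air * drag_coeff * area_m2))

v_no_chute = terminal_velocity(80, drag_coeff=1.0, area_m2=0.7)   # belly-to-earth
v_chute = terminal_velocity(80, drag_coeff=1.5, area_m2=25.0)     # open canopy

print(f"no parachute: ~{v_no_chute:.0f} m/s; with parachute: ~{v_chute:.0f} m/s")
```

      Under these assumed values that works out to roughly 43 m/s without a parachute versus about 6 m/s with one, which is the prior doing all the work before any trial is run.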

    • Being part of the group of people who thought we had sorted _most of this out_ for the clinical research community in the 1980s, this really is scary.

      (Not those promoting CBT but those struggling over how to bring meta-analysis to bear on the issues – thoughtfully.)
