The statistical crisis in science: How is it relevant to clinical neuropsychology?

[cat picture]

Hilde Geurts and I write:

There is currently increased attention to the statistical (and replication) crisis in science. Biomedicine and social psychology have been at the heart of this crisis, but similar problems are evident in a wide range of fields. We discuss three examples of replication challenges from the field of social psychology and some proposed solutions, and then consider the applicability of these ideas to clinical neuropsychology. In addition to procedural developments such as preregistration and open data and criticism, we recommend that data be collected and analyzed with more recognition that each new study is a part of a learning process. The goal of improving neuropsychological assessment, care, and cure is too important to not take good scientific practice seriously.

15 thoughts on “The statistical crisis in science: How is it relevant to clinical neuropsychology?”

  1. For what it’s worth, I think this paper comes dangerously close to equating “statistical crisis” with “replication crisis”. Although the replication crisis seems to have occurred because of misunderstanding statistics (if an effect is statistically significant it is “real”, thus no need to check again), there is no need to use any statistics to resolve it.

    To fix the replication crisis, researchers just need to figure out what conditions are necessary to get similar results each time. I guess calculating means and SDs may still be required for certain designs, but, as (the commenter here) Glen Sizemore likes to note, you can run studies even without calculating those rudimentary statistics.

    Also, a general theme of recent papers on this topic is that we all just need to “move forward” with new/improved methodology. I understand where this sentiment comes from, but the truth is that for decades many fields were not even trying to replicate each other’s work, and at the same time nearly everyone misunderstood the method they used to draw conclusions. All this work needs to be redone properly so we can clear up any misconceptions that have made it into textbooks. The damage has been done (and it is likely unbelievably huge), so this needs to be addressed before we can be confident again.

  2. Anon:

    Yes, as we’ve often discussed, all the replication in the world won’t help if the underlying designs are too noisy. And we see a feedback loop, where researchers do hopelessly noisy, dead-on-arrival studies, but succeed (in the sense of statistical significance), which encourages hopelessly noisy follow-up studies, etc. One possible benefit of replication is indirect: if researchers, wanting to succeed (in the sense of getting positive results in replications), are motivated to get better measurements. But I think this will only work if the connection between measurement and success is made more clear.

    • “But I think this will only work if the connection between measurement and success is made more clear.”

      It almost seems trivial. A paper should be considered successful if it amounts to “If you do a, b, c as I described here, you should measure x +/- y.” It need have nothing to do with “tests” or “hypotheses” at all (although surely there is one somewhere motivating the measurement).

  3. This line seems unfair to Dana Carney:

    “From a science-communication perspective, a major challenge is that the researchers on the original study (Carney, Cuddy, & Yap, 2010) seem to refuse to accept that their published findings may be spurious…”

    I see that the article was submitted before Carney published her statement on power poses in late September, so maybe there was no chance to amend it. I do think that Carney deserves a huge amount of credit for laying out publicly the methodological flaws of the 2010 paper. We don’t often get to see a detailed testimony of exactly how a flawed study and flawed data analysis come into being.

      • Marcus, how so? She hadn’t been the center of attention in the power pose controversy. I assume she could have easily stayed quiet about it, just as Andy Yap has.

        I’d also think that if she were just responding to pressure, she could have released a statement that went into far less detail than the one she released. To me she stands out as the exception to the usual cases Andrew draws attention to on this blog, where researchers refuse to admit error or engage seriously with criticism. Her statement on power poses looks damn near heroic when compared to Wansink or Bargh or Durante etc.

        • Ben:

          I agree. Lots of people get big nudges but still dig in. Even Mark Hauser never admitted that he cheated, as far as I know—and Hauser is used on the NIH website as an example of research misconduct!

        • “I’d also think that if she were just responding to pressure, she could have released a statement that went into far less detail than the one she released.”

          In particular, she volunteered the information that there was lots of data-peeking — which I don’t think had been claimed by critics.

      • @Marcus: Were you told so, or is this public information?

        Also, even if Dana Carney’s statement was triggered by a nudge, what stands out is that those-whose-research-did-not-replicate appear insensitive from the start to even the slightest of nudges.

        • I had fairly extensive discussions with Dana about all this via e-mail prior to her public statements but I agree that she acted with great integrity in the end.
          In other news – our multiverse analyses of the Carney et al. and Ranehill et al. data just got accepted for publication.

  4. Hi Andrew,

    I notice that in this paper you mention the Reliable Change Index when discussing the challenge of translating group findings to individual patients. Funny, I’d have expected that the RCI wouldn’t really be your cup of tea – it tests a null hypothesis that’s implausible, and one that we aren’t usually actually interested in (i.e., that the “true scores” are equal at two time points). And its power is so low in realistic scenarios that it’s often in the weighing-a-feather-while-the-kangaroo-is-jumping category.
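    For anyone who hasn’t met it, the standard Jacobson & Truax (1991) formulation of the RCI is simple enough to sketch in a few lines of Python. The scores, SD, and reliability below are made-up numbers for illustration, not from any real instrument:

    ```python
    import math

    def rci(x1, x2, sd_baseline, reliability):
        """Jacobson & Truax (1991) Reliable Change Index."""
        sem = sd_baseline * math.sqrt(1 - reliability)  # standard error of measurement
        s_diff = math.sqrt(2 * sem ** 2)                # SE of a difference score
        return (x2 - x1) / s_diff

    # Hypothetical: a 6-point gain on a test with SD 10 and test-retest reliability .80
    z = rci(x1=50, x2=56, sd_baseline=10, reliability=0.80)
    print(round(z, 2))  # ~0.95, well short of the conventional 1.96 cutoff
    ```

    That toy example is the power problem in miniature: a gain of more than half a baseline SD still doesn’t count as “reliable change.”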

    I wrote about some of this stuff here: http://thepathologicalscience.blogspot.com/2017/01/lots-of-problems-with-reliable-change.html

    Would be interested to hear your thoughts on the RCI at some point!

    • Matt’s Blog: “Now collecting large samples of data for trials of psychotherapy interventions is a pretty expensive and painful process, so single case experimental designs are still reasonably popular in clinical psych (well, at least here in New Zealand they are). And to be able to do a single case experimental design and yet still run a proper objective inferential statistic has its appeal – it feels much more objective than all that business of just qualitatively interpreting plots of single case data, doesn’t it?”

      GS: Two issues are raised here: First, there is the implication that (what I’ll call) single-subject designs (SSDs) are to be turned to only when it is inconvenient to do a “real” study. Second, of course, is the reference to “subjectivity” (a topic frequently muddled, and no less so here on Gelman’s blog). Now…admittedly, it is hard to see exactly what Matt is arguing, since “…it feels much more objective than all that business of just qualitatively interpreting plots of single case data, doesn’t it?” can be taken two ways: 1.) it could mean that perhaps “subjectivity” is not to be scoffed at after all, or 2.) that, as silly as the RCI is, the case could be made that something must be done about the “subjective” evaluation of the data. I’m guessing Matt meant the latter.

      And Matt raises the issue of “subjectivity” again, later:

      “Collect lots of data points. If we can see how variable an individual participant’s scores are within the baseline and intervention periods, this gives a much more direct (albeit subjective) impression of whether random fluctuation in scores could account for an observed difference between baseline and treatment. And with a very large number of data points collected at regular intervals you may even be able to use methods designed for time series data (e.g., change point regression) to detect a change in behaviour at the time of the intervention.”
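      (As an aside: the simplest version of that change-point idea fits in a few lines – a single mean shift located by least squares, on simulated data. This is a toy sketch, not anything from Matt’s post; real change-point methods also have to worry about autocorrelation in single-case series.)

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      # Simulated single-case series: 20 baseline points, then a level shift
      y = np.concatenate([rng.normal(10, 2, 20), rng.normal(14, 2, 20)])

      def change_point(y):
          """Index k minimizing within-segment sum of squares for a two-mean model."""
          sse = [np.sum((y[:k] - y[:k].mean()) ** 2) + np.sum((y[k:] - y[k:].mean()) ** 2)
                 for k in range(2, len(y) - 1)]
          return int(np.argmin(sse)) + 2

      print(change_point(y))  # should land at or near 20, where the shift was simulated
      ```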

      As to the issue of the desirability of SSDs: Psychology and much of medicine have as their subject matter aspects of the functioning of individuals. Period. It is between-subject group designs that are inferior, and to use them when a question is amenable to SSDs should be considered a breach of good scientific practice. And where they must be used, but the subject matter is some aspect of an individual’s functioning, the emphasis, WRT data analysis, should be on not obscuring the individual data. If I were to write a paper geared toward early grad students on alternatives to NHST (when between-subject designs must be used), I would discuss visual display of the data – and if one is using a lot of subjects (and one wonders when this would be the case when NHST is out of the picture and you are looking for meaningful effects), at least show the damn distributions (so they may be “visually inspected”)! For more reasonable experiments – say, 5-10 subjects in each group – one should plot the datum points for each subject in a vertical line, open circles for “control,” filled for “experimental.” No doubt other visual displays would be appropriate – but the issue is that I would encourage visual inspection of the data…so…
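      Something like the following, for instance – a minimal matplotlib sketch of that display, with made-up data (8 subjects per group, 5 observations each; every number here is hypothetical):

      ```python
      import numpy as np
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(1)
      n_subj, n_obs = 8, 5  # within the "5-10 per group" range above
      control = rng.normal(50, 8, size=(n_subj, n_obs))
      experimental = rng.normal(58, 8, size=(n_subj, n_obs))

      fig, ax = plt.subplots()
      for i in range(n_subj):
          # each subject's observations stacked in a vertical line at one x-position
          ax.plot([i] * n_obs, control[i], 'o', mfc='none', mec='black')      # open circles: control
          ax.plot([n_subj + 1 + i] * n_obs, experimental[i], 'o', c='black')  # filled: experimental
      ax.set_xlabel('subject')
      ax.set_ylabel('response')
      plt.show()
      ```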

      …on to “subjectivity.” First, what Matt is talking about doesn’t necessarily raise the issue of “subjectivity” at all – at least, not in the sense of “private events.” After all, anyone may look at the data to which the researcher is responding – if the responses of all who look at the data involve “subjectivity,” then ALL of perception “involves subjectivity” (indeed, this is – arguably – the most popular philosophical and “scientific” view of all time; i.e., “representationalism” or “indirect realism,” which is as silly as it is popular). [Incidentally, the alternative is that perception is a form of action…not reproduction.]

      Imagine Leeuwenhoek’s contemporaries…should they believe in “animalcules”? Why…they can build a microscope themselves and see things similar to what Leeuwenhoek saw. But that would, you know, only be “subjective.” Just “visual inspection*.” I’m guessing that most here would say that the visual-inspection-of-animalcules and the visual-inspection-of-data are fundamentally different – the former “objective,” the latter “subjective.” But, again, there is nothing “subjective” about the data or the animalcules. And, not to belabor the point too much, if you say, “well, it is the *interpretation* of the data that is ‘subjective,’” I can say that seeing the animalcules involves “interpretation” as well – indeed, that IS the widely-accepted view (but not my view)! The issue is really that everybody “makes the same interpretation” when the microorganisms are involved, but possibly not when the data are inspected. But the actual behavioral fact is that what is really being alluded to (poorly) when one talks about “interpretation” is the history of the researcher. It is the researcher’s history that makes him or her respond in particular ways to visual stimuli, whether the stimuli are data or microorganisms. That was a bit of a long discussion (but a lot of folks around here insist on raising philosophical issues they are ill-equipped to deal with); the main point, again, is that the person reporting “there’s an effect” or “I’ll be damned! There are little animals!” does so because of, and under the control of, their history and the current stimuli.

      But maybe all of that philosophical stuff (which is introduced constantly around here) is overkill; we’re not worried about “subjectivity” and deep philosophical issues when someone reports seeing what Leeuwenhoek saw, and we should not, similarly, be worried about someone reporting an effect when the ranges of the (subjectively-apparent!) stable states in two conditions do not overlap in a properly-done SSD-type experiment. So…you could say that there is a hard-and-fast, non-subjective calculation relevant to SSDs – there is an effect if the ranges do not overlap! (A toy version of this check is sketched after the footnote below.) But, of course, often we are interested in the function relating independent and dependent variables, and there the ranges of nearby parameter values will overlap – but seeing the shape of the function as, say, a bitonic, inverted-U-shaped function is as philosophically noncontroversial as seeing a microorganism through a microscope. OK…I’m done for now…

      *But the issue *is* tricky. Some responses to stimuli may, ultimately, be responses to private events, even when some particular publicly-observable stimulus occasions the response – and this can be, indeed, related to the sense of “subjective” that Matt is using.
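      For what it’s worth, the non-overlap rule is simple enough to state as code – a toy check on hypothetical stable-state data:

      ```python
      def ranges_overlap(a, b):
          """True if the (min..max) ranges of two sets of observations overlap."""
          return min(a) <= max(b) and min(b) <= max(a)

      # Hypothetical stable-state data from two conditions of an SSD-type experiment
      baseline = [12, 14, 13, 15, 14]
      treatment = [21, 19, 22, 20, 21]
      print(not ranges_overlap(baseline, treatment))  # True -> "there is an effect" by the rule
      ```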
