The syllogism that ate social science

I’ve been thinking about this one for a while and expressed it most recently in this blog comment:

There’s the following reasoning, which I’ve not seen explicitly stated but which is, I think, how many people think. It goes like this:
– Researcher does a study which he or she thinks is well designed.
– Researcher obtains statistical significance. (Forking paths are involved, but the researcher is not aware of this.)
– Therefore, the researcher concludes that the sample size and measurement quality were sufficient. After all, the purpose of a large sample size and good measurements is to get your standard error down. If you achieved statistical significance, the standard error was by definition low enough. Thus, in retrospect, the study was just fine.

So part of this is self-interest: It takes less work to do a sloppy study and it can still get published. But part of it is, I think, genuine misunderstanding, an attitude that statistical significance retroactively solves all potential problems of design and data collection.

Type M and S errors are a way of getting at this, the idea that just because an estimate is statistically significant doesn’t mean it’s any good. But I think we need to somehow address the above flawed reasoning head-on.
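
Here’s a quick toy simulation of that point (made-up numbers, nothing from any particular study): take a small true effect measured with a big standard error and look only at the draws that happen to reach statistical significance. Those significant estimates are badly exaggerated (Type M) and occasionally have the wrong sign (Type S).

import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # hypothetical small true effect
se = 0.5            # standard error of a noisy, underpowered study
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates) > 1.96 * se     # "statistically significant" at the 5% level
sig = estimates[significant]

print(f"share significant (power):  {significant.mean():.1%}")
print(f"Type M, exaggeration:       {np.abs(sig).mean() / true_effect:.1f}x")
print(f"Type S, wrong-sign rate:    {(sig < 0).mean():.1%}")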

40 thoughts on “The syllogism that ate social science”

  1. Andrew,

    Don’t you think that, by now, students/researchers are sufficiently aware that NHST is being questioned?

    Who is still a hardliner? I’m posting to the Psychological Methods Discussion Group. I expected to get a hammering. Surprisingly, it has been a relatively pleasant experience. Some are discussing Valentin Amrhein’s PeerJ article. Not much pushback yet.

    Maybe after my partying tonight, with a couple of beers under my belt, I might get a little rowdier. LOL just kidding. I am hoping that one of those guys will come forward.

    Meanwhile they are focused on Robert Sternberg, I gather.

    • Sameera Daniels: The controversy over null hypothesis testing is nothing new. It has been going on in psychology since the 1960s. One of the problems with the Psychological Methods Discussion Group on Facebook is that the participants are by and large so young that they have no historical perspective on this or any other statistical issue.

      This 1999 article by David Krantz gives a bit of an overview:

      https://amstat.tandfonline.com/doi/pdf/10.1080/01621459.1999.10473888?needAccess=true

      • Thank you for posting the article. I’m reading it now and will comment when done, hopefully tomorrow, as I have to go out this evening.

        Re: Controversies in Statistics, I assure you that I’m pretty well versed in the NHST controversy, as I was working on an article on p-values. I keep in touch with some well-versed statisticians and physicians who have written articles about it.

        Re: Facebook Psychological Methods Discussion Group, I wasn’t sure of the demographics of the 17,000-plus members of the Facebook group. I am guessing they are in the 30-40 age range, but I recognize some members who are active in the Open Science Network. The problem is that blogs, as even Andrew points out, are not always satisfying because people sometimes have short-term interests. It’s hard to get into depth on a subject.

        An epidemiologist called my attention to the Facebook group. I didn’t really examine the postings until I returned from my trip last week. I’m enjoying them so far. Very sweet guys.

        I just don’t like it when a group acts as a clique. I shy away from that kind of environment, and I am apt to defend an underdog. Public criticism of Robert Sternberg’s self-citations is valid, though I speculate that hardly anyone has read his work. I have. So I wanted to point out that we should also focus on the substance of any one author’s work. His observations about the lack of creativity in academia are not new either. All in all, he has interesting observations to share.

        Be back tomorrow.

        • What is being done, if anything, to revamp statistics textbooks? I guess Geoff Cumming’s Introduction to the New Statistics is the latest, which is definitely an improvement. But I think it’s worth conceiving another one, or should I say, designing a series of course texts. I wonder, for example, which statistics textbooks are being used at different universities. Has there been any systematic appraisal of this? Is it even warranted? These questions have cropped up for me recently, particularly as I attended Yale’s symposium.

        • Sameera Daniels: David Krantz is about 80 years old by now, I’d guess. Top of the line intellect. (See the two books on measurement by Krantz, Luce, Suppes, and Tversky.)

          Does the world really need yet another article on p-values?

        • Anonymous,

          We do not need another article on p-values.

          What I would say is that a good percentage of articles, and statistics books, include too few examples of misuses and abuses of p-values, confidence intervals, etc., if any at all.

          At a Yale symposium recently, I emphasized that students have to be exposed to a much wider curriculum that includes cognitive psychology perspectives, logic, and the history of the controversies in science/statistics: preferably at the undergraduate level, as prerequisites to an introductory statistics class or included in an introductory statistics textbook. Additionally, I mentioned to John Ioannidis that I’d been acutely aware of how the sociology of expertise plays out over time. What I meant was that ‘devil’s advocacy’ is still prominent in these controversies.

          Based on casual conversations with experts, I have surmised that nearly all universities are teaching traditional statistics despite the extent of discussion of, and attention to, abuses and misuses of NHST in the biomedical enterprise and the social sciences.

          I have been reflecting on the article you posted. Sometimes I spend a day or two before I make a comment. So patience please, LOL. I almost purchased the book Krantz reviewed in the article. I am reading another book: The Significance Test Controversy by Morrison and Henkel.

          Moreover, I’m glad that you mentioned the Krantz books, because I see the three volumes of Foundations of Measurement for sale on Amazon. Intermittently, I have weighed the value of having a background in it. I’m a novice at all this; I rely on the good graces of experts for book recommendations. So thank you. I’ll check them out this spring.

          Lastly, I have been dissuaded from the Benjamin et al. position on p-values. I do understand John’s reasoning, and despite being sympathetic to it in some respects, I think it also has some serious drawbacks. I think John Ioannidis’s more recent evaluation of the current controversies is actually very good.

          I favor Sander Greenland’s work as well.

        • > nearly all universities are teaching traditional statistics despite
          You are likely right about that, and there is an awful amount of inertia to get over. The last undergrad intro stats exam answer sheet I had access to (2014) gave the “correct” answer to “what is a p-value?” as the probability that the null is true (and that was in a math and stats department).

          On the other hand, the fact that thousands of university lecturers and tutors give roughly the same worn-out, ineffective courses year after year after year is bizarre.

          Likely, more needs to be written. Mike Evans makes a clear separation between choosing and checking a model for validity versus applying the rules of a theory of inference to the “chosen and checked model” that is kept. Sander Greenland makes a separation between assessing incompatibility of the data with a general model (using -log2(p) or compatibility intervals) versus assessing relative compatibility within a specific model (using likelihood or posterior intervals), perhaps suggesting that non-experts just stick to assessing incompatibility. Andrew’s Bayesian workflow emphasizes a similar separation. These ideas seem somewhat new, or at least need to be written about more.
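
          To spell out the -log2(p) piece for anyone who hasn’t seen it: Greenland’s S-value is s = -log2(p), which rescales a p-value into bits of information against the test model, interpretable as roughly that many heads in a row from a fair coin. A toy sketch:

          import math

          def s_value(p):
              # Greenland's S-value: bits of information against the test model
              return -math.log2(p)

          for p in (0.25, 0.05, 0.005):
              s = s_value(p)
              print(f"p = {p:<6} ->  S = {s:4.1f} bits "
                    f"(about as surprising as {round(s)} heads in a row from a fair coin)")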

  2. Andrew, I don’t think it is fair to say that this problem is limited to the social sciences. Lots of suboptimal practices happen in clinical research too. That said, I think this is exactly right. And there’s a flip side to it too: Regardless of whether we are talking about politics or research, I tell my students that when we encounter information that is contrary to our expectations, it sets in motion processes that make it more likely that we will retain our beliefs and that make us feel like we have been evenhanded in our approach. So:

    – Researcher does a study which he or she thinks is well designed.
    – Researcher does not obtain statistical significance.
    – Researcher looks for data entry errors, checks assumptions, looks at measurement things like coefficient alpha, all of which involve multiple forking paths
    – Researcher finds that transforming the data, and/or Winsorizing (or deleting) “outliers”, and/or dropping a “bad” item on a scale, etc., yields statistical significance
    – Researcher is pleased with him or herself for practicing good science.

    This is one reason that detailed preregistration will help – if done well it will commit researchers to a particular set of approaches, and they will presumably have to justify any deviations from the preregistered plan.

    • Jeff:

      Yes, but at least in bioscience there’s often some sort of mechanism. In social science there’s typically nothing like a mechanism—indeed, just about any effect can go in either direction. With less constraint from theory, social sciences are more dependent on statistical reasoning—and statistical fallacies.

      • “Yes, but at least in bioscience there’s often some sort of mechanism. ”

        Well, sometimes. Not as often as you might think. Living organisms have plenty of homeostatic mechanisms and feedback loops. Sometimes you do an intervention that you think should have an effect in some direction, and it starts to, but then that sets in motion a “corrective response” that “over-reacts” and the net effect is the other way, which might in turn trigger a third mechanism that does who knows what. And sometimes a molecule that in cell cultures has one effect, when administered in vivo gets metabolized into something that has the opposite effect, or no discernible effect.

        Social science may lack mechanisms, but in bioscience there are so many competing mechanisms that perhaps there might as well be none.

    • Jeff

      I am all for preregistration. But I think hearing Feyerabend during my teens probably influenced me more than I sometimes concede. I view the scientific enterprise through its sociology of expertise, which is a fascinating experience. It stands to reason, because my dad took me to several symposiums every year. I listened to what they said to each other, and what they thought of each other. LOL

  3. The most important thing I learned from Tversky in grad school was something along the lines of, “The mind looks for shortcuts. Knowing this does not stop it from looking for shortcuts. Just like knowing how an optical illusion works doesn’t stop you from seeing the illusion.”

    My summary was, “We’re wired for cargo cult thinking.” So it seems like a reasonable hypothesis that researchers will naturally internalize the NHST process as a ceremony that produces Valid Results. Add in a dash of good old cognitive dissonance once a researcher becomes Respected and it’s an enormous mental effort to question the ceremony.

  4. A more direct statement of the false syllogism:
    1. You need statistical significance or your effect might be just noise.
    2. You have statistical significance
    3. Therefore, your effect is not noise.

    I trust the logical flaw in this syllogism is easy to spot. Hint: it has the same form as:
    1. One needs to be 35 years old or more to be President of the United States.
    2. I am over 35.
    3. Therefore, I am Donald Trump.

    • Hey—I’ve heard Trump is on twitter so maybe I shouldn’t be so surprised to see him here in the blog comments. This is even better than when we had Scott Adams commenting here!

    • I wish I had had that cartoon when I was dealing with biophysicists. Because they were good at using formulas, they could not grasp what they would gain from working with me. Once they tested a main effect when they were only interested in a particular interaction. It took half a dozen meetings for me to discern that, and largely by luck. I stopped working with them after that. Sad, actually.

  5. All statistical models are false.

    Yours is a statistical model.

    Therefore your model is false.

    QED.

    *** but how is approximation factored in? It, approximation, humbly, makes no false claims to truth but rather suggests we tack back and forth towards truth. But what of those working at the sails’ coffee grinders? Heroes, or goats? Or both?

    So it goes out here upon the vasty deep. Always.

  6. One way of addressing this flawed reasoning head-on would be taking frequentism seriously. Type M and S errors are the consequence of not taking seriously the errors that come from using rules in repeated application.

  7. my sleight of hand sense is tingling!
    ‘sample size and measurement quality was _sufficient_.’
    ‘Thus in retrospect the study was _just fine_.’

    In my experience, researchers usually have a decent understanding of the value of their evidence. They certainly lie to themselves a bit and, on top of that, oversell when writing the paper, but in general there’s a very robust correlation between strength of evidence and confidence. I’ve yet to meet anyone who has ‘an attitude that statistical significance retroactively solves all potential problems of design and data collection’. So I think this latter statement is an incorrect belief falsely attributed to a non-trivial number of people in order to avoid thinking harder about the problem.

    Trying to come up with an account that could empirically be distinguished from the proposed syllogism, I note that the syllogism appears tautological: ‘Researcher does a study which he or she thinks is well designed’ overlaps with ‘researcher thinks that the sample size and measurement quality was sufficient’. So I’m wondering whether it is actually saying anything beyond something like ‘When things work out, people usually don’t question their mental models and procedure, and when things don’t work out they usually first check for errors (instead of questioning the mental model).’ Which is at best a very useful heuristic, at worst something like confirmation bias. So what exactly is new here?

    • Markus:

      When I referred to “an attitude that statistical significance retroactively solves all potential problems of design and data collection,” I was indeed exaggerating.

      What I should’ve said was, “an attitude that statistical significance retroactively solves all potential problems of uncertainty in design and data collection.”

      People do understand that statistical significance doesn’t solve problems with bias. The problem is that people think that statistical significance renders moot any problems with variance and uncertainty. Our upcoming post scheduled for 5 Nov has a very clear example.

      Regarding your last question (“So what exactly is new here?”): What’s new here is the specific reasoning that is made (often implicitly but sometimes explicitly) by researchers, that once they have statistical significance they retroactively don’t need to worry about quality of measurement. I think this is a very common attitude, and it’s wrong.

      • There are subject-matter areas where significance does “not render moot any problems with variance and uncertainty”. There is still hypothesis testing, but not always two-sided tests of difference. Many devices and biologic treatments use non-inferiority testing with confidence intervals: the confidence interval for the difference may include 0, but if its lower bound falls below the non-inferiority margin the result will not be accepted. Also, in bioassay, the use of two-sided equivalence bounds reduces significance findings when it is very easy to find any two-sided difference significant. And capability analysis is a way to assure that variance is reduced as a process improves.
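
        A toy sketch of that non-inferiority logic, with simulated data and an arbitrary margin of -0.3 (a real trial would pre-specify the margin and typically report a one-sided 97.5% bound, which matches the lower end of a two-sided 95% interval):

        import numpy as np

        rng = np.random.default_rng(1)
        diff = rng.normal(loc=0.05, scale=1.0, size=200)  # hypothetical differences: new treatment minus reference

        margin = -0.3                                     # assumed non-inferiority margin, made up for the sketch
        mean = diff.mean()
        se = diff.std(ddof=1) / len(diff) ** 0.5
        lo, hi = mean - 1.96 * se, mean + 1.96 * se       # two-sided 95% CI for the mean difference

        print(f"95% CI for the difference: ({lo:.3f}, {hi:.3f})")
        # The interval may include 0; what matters is the lower bound versus the margin.
        print("non-inferiority shown" if lo > margin else "non-inferiority not shown")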

      • I do think that concrete examples would be very helpful. I was reviewing the chapter ‘What’s Wrong with Statistical Tests-And Where Do We Go From Here’ in Beyond Significance Testing by Rex Kline. The explanations of the fallacies entailed in misinterpretations of p-values are clear enough, but it would have been clearer if concrete examples were given too. He does a fine job insofar as examples of data-analysis practice go.

        A prerequisite for the book is an undergraduate course in behavioral science statistics, and I don’t have a clue whether such a prerequisite is adequate.

  8. What is ‘retroactive’ about this?
    Everyone I know runs their studies thinking their measurement quality is good enough. It often isn’t, but in my limited experience people then either don’t realize there is a problem or decide this is the best they can do with the limited time and money available to them. When these studies get significant results in line with theory, there is no update of beliefs about measurement quality; researchers continue to believe what they believed before, namely that it’s good enough.
    If someone were to argue explicitly along the lines of the syllogism, that would be a bad argument, but even then: everything working as planned (from the POV of the researcher after the article was published) can’t be evidence _for_ insufficient measurement quality.
