Statistics is like basketball, or knitting

I had a recent exchange with a news reporter regarding one of those silly psychology studies. I took a look at the article in question—this time it wasn’t published in Psychological Science or PPNAS so it didn’t get saturation publicity—and indeed it was bad, laughably bad. They didn’t just have the garden of forking paths: they very clearly ran a series of analyses until they finally reached something statistically significant, then they stopped, made some graphs, and presented their conclusions.

OK, fine. There’s a lot of incompetent research out there. It’s easier to do bad research than to do good research, so if the bad research keeps getting published and publicized, we can expect to see more of it.

But what about these specific errors, which we keep seeing over and over again? I can’t imagine these researchers are making these mistakes on purpose!

The only reasonable inference to conclude here is that applied statistics is hard. Doing a statistical analysis is like playing basketball, or knitting a sweater. You can get better with practice.

How should we think about all this? To start with, I think we have to accept statistical incompetence not as an aberration but as the norm. The norm among researchers, thus the norm among journal referees, thus the norm among published papers.

Incompetent statistics does not necessarily doom a research paper: some findings are solid enough that they show up even when there are mistakes in the data collection and data analyses. But we’ve also seen many examples where incompetent statistics led to conclusions that made no sense but still received publication and publicity.

How should we react to this perspective?

Statisticians such as myself should recognize that the point of criticizing a study is, in general, to shed light on statistical errors, maybe with the hope of reforming future statistical education.

Journalists who are writing about quantitative research should not hold the default belief that a published analysis is correct.

Researchers and policymakers should not just trust what they read in published journals.

Finally, how can I be so sure that statistical incompetence is the norm, not an aberration? The answer is that I can’t be so sure. The way to study this would be to take a random sample of published papers, or perhaps a random sample of publicized papers, and take a hard look at their statistics. I think some people have done this. But from my own perspective, seeing some of these glaring errors that survived the journal reviewing process, and seeing them over and over, gives me the sense that we’re seeing the statistical equivalent of a bunch of saggy sweaters knitted by novices.

Also, this: statistical errors come from bad data analysis, but also from bad data collection and data processing (as in that notorious paper that defined days 6-14 as peak fertility). One message I keep sending is that we should all be thinking about data quality, not just data analysis.

The story

OK, and now here’s the story that motivated these thoughts.

I received this email from Alex Kasprak:

Hello Dr Gelman,

I am a science writer at BuzzFeed . . . writing a brief round up of all the ‘scientific’ claims made about beards in 2015. I was wondering if you (or someone you know) would be able to comment on how rigorous and/or problematic the statistical methods are in three such studies:

The Association Between Men’s Sexist Attitudes and Facial Hair (Oldmeadow-2015.pdf)
– A sample of 500 men from America and India (nowhere else) shows a significant relationship between sexist views and the presence of facial hair.

A lover or a fighter? Opposing sexual selection pressures on men’s vocal pitch and facial hair (Saxton-2015.pdf)
– 20 men and 20 women rated the attractiveness and perceived masculinity of 6 men at different stages of facial hair development.

The Role of Facial and Body Hair Distribution in Women’s Judgments of Men’s Sexual Attractiveness (Dixson-2015.pdf)
– ~3000 women ranked the attractiveness of 20 men with varying degrees of body and facial hair coverage (in paired choices) to determine whether fuller beards and less body hair were preferred.

Please let me know if you might have time to speak with me about this. . . .

I responded as follows:

Hi Alex. I do not recommend taking this stuff seriously at all. I don’t have the energy to read all of this, but I took a quick look at the first paper and found this:

“Since a linear relationship has been found between facial hair thickness and perceived masculinity . . . we explored the relationship between facial hair thickness and sexism. . . . Pearson’s correlation found no significant relationships between facial hair thickness and hostile or benevolent sexism, education, age, sexual orientation, or relationship status.”

And then this: “We conducted pairwise comparisons between clean-shaven men and each facial hair style on hostile and benevolent sexism scores. . . . For the purpose of further analyses, participants were classified as either clean-shaven or having facial hair based on their self-reported facial hair style . . . There was a significant Facial Hair Status by Sexism Type interaction . . .”

So their headline finding appeared only because, after their first analysis failed, they shook and shook the data until they found something statistically significant. All credit to the researchers for admitting that they did this, but it was poor practice on their part to present their result in the abstract of their paper without making this clear, and too bad that the journal got suckered into publishing this. Then again, every paper can get published somewhere, and I’ve seen work just as bad that’s been published in prestigious journals such as Psychological Science or the Proceedings of the National Academy of Sciences.

As long as respectable journals will publish this work, and as long as news outlets will promote it, there will be a motivation for researchers to continue going through their data to find statistically significant comparisons.

I should blog this—is it ok if I quote you?

Alex replied:

Would you mind holding off on blogging until we’ve published our piece, by any chance?

And I responded:

Yes, sure, no problem at all. Right now our blog is backlogged until March 2016!

And, indeed, here we are.

44 thoughts on “Statistics is like basketball, or knitting”

  1. The only reasonable inference to conclude here is that applied statistics is hard

    Compared to what? Quantum Field Theory or storming the beaches at Normandy? The only reasonable inference to conclude here is that people have lots of fundamental beliefs which aren’t true. Classical statistics isn’t merely misapplied, it’s wrong in very fundamental ways. It’s an error. Can you name one instance in which people consistently screwed up for a century where it turned out there weren’t fundamental problems?

    Did bleeding fail to cure medical patients in the middle ages because doctoring is hard? Did the predicted world revolution of the proletariat fail to materialize because communism wasn’t tried aggressively enough? Was the only problem with phlogiston that it hadn’t been taught well?

  2. A major reason for this, I think, is that many researchers – perhaps the majority – never receive any formal statistics training at all. I certainly never did, and I at least have a background in mathematics and computer science. Instead, statistics usually seems to be treated much like computer science and programming: something you’re expected to pick up piecemeal “on the job” without any understanding of the underlying theory or best practices.

    Needless to say, the quality of both the statistics and the software you get from that mindset is similarly hideous. And yes, a lot of research software is so riddled with issues that the data are near worthless even if the underlying model and the statistical analysis are sound.

  3. You’re giving people (and other disciplines) a lot of credit with this “statistics is hard” thing. Sure it’s hard, but when I see people do bad statistics it’s because they would really really really like their poorly planned experiment or low N study to work out to something exciting so they can graduate/get tenure/etc… I have a lot of sympathy for the individuals involved but I can’t help but look at the situation and see dismal (non-statistical) experimental training and hopeful yet bumbling advising.

    I would reorder the solutions as 1) improve advising about project planning (at all career levels); 2) improve experimental design training; and 3) statistics is hard and you need practice.

    • Or, you know, consult a statistician. It really bothers me that academic statistics has (for the most part) long since drifted away from being a service field where it could actually help scientists instead of focusing on insular theoretical work of limited applied benefit.

      • The thing is, most people go off, design a study, collect the data, and come ask the statistician… too late!!!

        Also, I wonder how much of academic stats drifting off to do other stuff is because, when people came to them to design studies, they said "Oh, you need to do X, Y, Z to make this work right…", which turned out to cost 18x as much as the researcher really wanted to spend (time, money, whatever), so the researcher decides to just go off and run some p values until they get < 0.05 and then publish it like their colleagues do, since that's all they really wanted in the first place anyway.

        Seriously though, if most of your colleagues publish weak stuff costing $5000 every 2 or 3 months, and get tenure and grants, what do you think when you decide to "do it right" and go consult a statistician and they tell you the truth? That to do your study and make it really believable costs $5 million and 6 years' time? (or whatever, a couple orders of magnitude more than what your colleagues do).

        • I say this because it’s been my experience that sometimes academic people I know pretty well come to me with a project idea and I give them advice about how to approach it, and they will often say “well, that sounds really cool, but we’d never be able to do all of that, can you just tell me how to analyze this particular narrow question with the data I already have…” (answer might well be “you can’t, it’s fundamentally confounded with XYZ”, followed by “well, can we just get a p-value to tell us whether Q occurred by random chance?”)

          I’ve even heard people say things like “we don’t need to put anything in the grant budget for data analysis, we’ll just get some collaborator to run some standard software on it” while they’re writing the grants! This in a couple-year-long laboratory experiment which will produce a large bioinformatics-type dataset… and they don’t think they even need to mention the cost of analysis in the grant!

          If no one else is putting $40,000 for several months of PhD-level data analysis in their grant, then when you put it in, your grant will be marked down for wasting resources or whatever (ironically, it’s the people who just run out and collect un-analyzable data who are the ones wasting the resources… sigh).

          The people who analyze grants understand why you’d need salary for a postdoc biologist, animal housing costs, lab supplies, partial salary for the PI, money for some special apparatus, a grad-student stipend… But if they see $40,000 for a specialist in data analysis, the response is likely to be:

          “what, just to calculate some p values with a couple t-tests? this isn’t a human clinical trial for a drug or something, why do they need that?” or “there’s software for free that will do this for you” or “this institution has a site license for *insert canned software here* so they can just plug it into that…”

          It doesn’t help that companies want to sell canned software, and people want to believe that they should buy canned software. Unless you’re doing a highly repetitive task, canned software can’t possibly satisfy your needs. And, if you’re doing a highly repetitive task, then you’re not doing research. Sometimes canned software can be useful for some sub-task (like data reduction, mainly), but when it comes to something like “does this substantive scientific explanation explain the data well?” you can’t possibly do it with canned software. You have to build a model… fitting the model is something “canned” (Thank you Stan developers!) but *building* it requires translating substantive concepts into quantitative descriptions.

          Perhaps this is the real reason we have so much straw-man NHST. You can automate straw-man NHST without knowing anything about the substantive hypothesis! yay, efficiency!

      • No, the problem is that they consult one statistician, then another, then another, until they get an answer they like. It doesn’t help.

      • But who wants to play second fiddle their whole life? You seem to suggest that statisticians provide IT-like technical support for scientists, and little more. This may actually be true for those who perform certain kinds of carefully designed, regulatory-agency-certified analyses of RCTs. However, for many of us in Public Health, statistics is more about modeling and less about estimation (which often can be done via canned software). In other words, the statisticians often function as the scientists, and not just the assistants to them.

        And modeling is all about innovation and context. Nobody can consistently and accurately simply compute the answer, as the models are all imperfect (“All models are wrong, some are useful”), and they are loaded with contextual assumptions and innovative constructs.

        I, as a tenured biostatistician at an R1 research university, actually think there is a valid argument for getting rid of the concept of the IT-support-style consulting statistician and combining Statistics (in this case Biostatistics) with more substantive fields such as Epidemiology. Rather than offering degrees in Biostatistics, with little substantive training, or in Epidemiology, with little real statistical training, why not offer degree programs in something like Statistical Epidemiology, which would combine rigorous statistical training with substantive expertise?

        In my experience, the statisticians with little to no understanding of the science underlying the problem at hand are as dangerous as the context experts with no data-analytic skills to guide them.

  4. “The only reasonable inference to conclude here is that applied statistics is hard. Doing a statistical analysis is like playing basketball, or knitting a sweater. You can get better with practice.” -AG

    Except that with applied statistics there has been virtually no feedback loop like there is with basketball and knitting. In fact, just the opposite, the institutional feedback in academia seems to mostly reinforce the bad statistics.

      • The interesting thing is that the feedback loop for applied statistics fails, mainly because the only available feedback is the opinion of peers (there is no direct feedback from reality like in, say, engineering).

        But the same could be said about Mathematics, and yet it works!

        I wonder what the cause is and what we can learn from it.

        • That is interesting (mine below was on lack of feedback from peers).

          Now, I think you do get some feedback from reality if you are involved in repeated studies that address the same or similar questions. That rarely happened in my career, and I thought it was distinctive of Andrew’s career (e.g., repeated election polls).

          Now math is very different, as the reality there is simply the abstractions you have made, and the feedback on that reality usually happens very fast – try publishing a false theorem or getting the derivations wrong in a submitted paper!

    • Let me add one way this happens to me (something I am hoping the ASA might help with).

      I work with scientists with the usual stats training background and they will believe anything I say about stats (understandable but bad).

      When we collaborate with outside groups, they usually have one statistician, and in any difference of opinion/approach they might say “I am a statistician too and I disagree – I think, for instance, that the p-value being > .05 is a good indication that there is no effect,” and given that their scientists will agree with them (bad, but understandable), it ends there.

      Pointing to the statistical literature is no help as I think you can find support for almost anything there (except basic math errors), at least if you go into the statistics in field X literature.

      Many practicing statisticians never get out of that no-feedback loop, or do so only very seldom.

      So maybe the ASA statement with 20 separate papers (about 5 degrees of freedom?) in the supplement might help – I’m hoping.

  5. Have you noticed how resentful investigators get when you bring these sorts of things up? They often brush it off as techno-statistical mumbo jumbo.

    • Little has changed since Meehl wrote his great papers. It comes through very clearly in them that (some of) his colleagues considered him a boring pedant.

  6. > So their headline finding appeared only because, after their first analysis failed, they shook and shook the data until they found something statistically significant. All credit to the researchers for admitting that they did this, but it was poor practice on their part to present their result in the abstract of their paper without making this clear, and too bad that the journal got suckered into publishing this. Then again, every paper can get published somewhere, and I’ve seen work just as bad that’s been published in prestigious journals such as Psychological Science or the Proceedings of the National Academy of Sciences.

    I haven’t read the paper you’re characterizing as bad. However: they did a series of analyses, reported them, and published the results. Why isn’t that just regular exploratory statistics? Even if the authors don’t see it as exploratory, is there a problem with us just reading it that way? — as a reasonable, if potentially noisy, exploration of the data, suggesting testable hypotheses for future analysis?

    From this perspective, it’s not bad statistics: it’s just exploratory not confirmatory.

    What about this view?

    • Jeff:

      Exploration is fine. The point is that the evidence for their particular claims is much weaker than they seem to think. If the study is exploratory, it would be appropriate to graph all the possible differences and comparisons of interest, not to focus on one of the comparisons which happens to be statistically significant.

      To put it another way: I like exploration, but to pull out statistically significant comparisons is a terrible way to do exploration.

      • *can be* a terrible way to do exploration.

        If you’re exploring for ideas, for models, for figuring out how the world works, p values are pretty meaningless.

        On the other hand, if you’re building a “detector” of events, then p values are like a “squelch” knob on your old-fashioned CB radio: when the “background” model is calibrated well, the p value “lets through” observations that are “different” from the norm. In that sense, they are a good (ok?) way to explore for where/when “unusual” things happened.

        examples:

        Trying to find a sunken ship? Drag a sonar buoy, build a “background” model, and then continue dragging it, marking locations on the map where the echo has p < 0.001. Now you can probe those spots more carefully on the next go-round.

        Trying to detect where forest fires are starting before they get big? Fly your satellite over the forest, build a background model for infrared emission, and then fly your satellite over each day and report any locations where the infrared has a small p value in a one-tailed test…

        Both of these are situations where 1) you have a *real* model, and 2) you're looking for where it's violated.
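
        As a rough illustration of this “squelch knob” idea, here is a minimal Python sketch (assuming numpy and scipy are available): it fits a deliberately simple Gaussian background model to made-up baseline readings and flags new observations whose one-tailed p value falls below a threshold. The simulated data, the helper one_tailed_p, and the 0.001 cutoff are all illustrative assumptions, not anything from the studies discussed in the post.

        ```python
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)

        # "Background" data: e.g., routine sonar echo strength in quiet water.
        background = rng.normal(loc=10.0, scale=2.0, size=5000)

        # A deliberately simple Gaussian background model (a real detector
        # would need a better-calibrated model, as the comment notes).
        mu = background.mean()
        sigma = background.std(ddof=1)

        def one_tailed_p(x):
            # P(background >= x): small values mean "louder than the usual noise."
            return stats.norm.sf(x, loc=mu, scale=sigma)

        # A new survey pass: mostly ordinary readings, plus a few unusual spots.
        new_readings = np.concatenate([
            rng.normal(10.0, 2.0, size=200),  # ordinary locations
            rng.normal(20.0, 2.0, size=3),    # something worth a second look
        ])

        p_values = one_tailed_p(new_readings)
        flagged = np.flatnonzero(p_values < 0.001)  # the "squelch" threshold

        print(len(flagged), "locations flagged for closer inspection:", flagged)
        ```

        The p value here acts purely as a detection threshold against an explicit background model, which is the sense in which the comment calls it a reasonable exploratory tool.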

  7. As long as we are blogging about knitting and statistics, see

    http://www.purplekittyyarns.com/info/knitting-needle-conversion.html

    KNITTING NEEDLE CONVERSION
    Metric U.S. UK/Canada
    2.0 0 14
    2.25 1 13
    2.75 2 12
    3.0 – 11
    3.25 3 10
    3.5 4 –
    3.75 5 9
    4.0 6 8
    4.5 7 7
    5.0 8 6
    5.5 9 5
    6.0 10 4
    6.5 10 1/2 3
    7.0 – 2
    7.5 – 1
    8.0 11 0
    9.0 13 00

    We observe that the metric diameter is in mm, followed by the U.S. designation and then, finally, the UK/Canadian designation. As the needle diameter increases, the U.S. designation increases while the Canadian one decreases.
    Of the 17 needle sizes, nine have U.S. and UK/Canada designations that (mysteriously?) sum “pairwise” to 14 (plus one instance where the pairwise sum is 13.5). As always, there is the annoyance of missing data: three missing entries for the U.S. and only one for UK/Canada.
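
    As a quick, throwaway check of that pairwise-sum arithmetic (a sketch only; the table above is transcribed by hand, the “–” entries are treated as missing, and the UK “00” is read as 0):

    ```python
    rows = [  # (metric mm, US, UK/Canada); None marks the "-" entries
        (2.0, 0, 14), (2.25, 1, 13), (2.75, 2, 12), (3.0, None, 11),
        (3.25, 3, 10), (3.5, 4, None), (3.75, 5, 9), (4.0, 6, 8),
        (4.5, 7, 7), (5.0, 8, 6), (5.5, 9, 5), (6.0, 10, 4),
        (6.5, 10.5, 3), (7.0, None, 2), (7.5, None, 1), (8.0, 11, 0),
        (9.0, 13, 0),
    ]

    sums = [us + uk for _, us, uk in rows if us is not None and uk is not None]
    print("pairs summing to 14:", sum(s == 14 for s in sums))          # 9
    print("other pairwise sums:", sorted(s for s in sums if s != 14))  # [11, 13, 13, 13.5]
    print("missing US entries:", sum(us is None for _, us, _ in rows))         # 3
    print("missing UK/Canada entries:", sum(uk is None for _, _, uk in rows))  # 1
    ```

    Running this reproduces the counts in the comment: nine pairs summing to 14, one summing to 13.5, with three missing U.S. entries and one missing UK/Canada entry.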

  8. “There’s a lot of incompetent research out there.”

    On a slightly related note, I was just watching the movie “The Big Short” and at about the 1:26 mark they have Richard Thaler (father of Behavioral Economics) explaining to Selena Gomez (no joke), quote, “this is a classic error; in basketball it’s called the hot hand fallacy.”

    oops.

  9. I am a 4th year graduate student studying behavioral neuroscience (during which time I have taken at least one graduate-level statistics course/year), and I did my undergraduate degree in cognitive science (during which time I took 2 statistics courses). A major problem, I think, is that textbooks/courses often leave room for substantial ambiguity about how to properly apply statistics in many of the real-world research scenarios where the data, phenomena, and questions are not as clean or clear-cut as the examples provided in the textbook. I have had professors (and other resources) give contradictory advice about such things as multiple comparison correction, how to appropriately handle violations of assumptions for various tests, what constitutes a “family” of inferences, etc. I have also had arguments with my PI over things like multiple comparisons correction in follow-up tests for ANOVAs or using degrees of freedom adjustment for violations of sphericity (trying to convince them that it should be done). My work is in human functional/structural MRI, and it is a field that is obsessed with statistics, but that seldom agrees on what is appropriate or when. I also know plenty of MR physicists and engineers that understand statistical concepts but that consistently misapply inferential statistics. Basically, my point is that I think one of the major problems facing scientists is that they are rarely given consistent, helpful advice on how to handle anything outside of the textbook examples that are substantially more straightforward than the questions that they are actually addressing. Asking a statistician is also a gamble, because as stated before, you often get different (or multiple alternate) answers to the same question.

    • I think that one major part of the problem is that people dispense advice and recommendations. You won’t find statisticians who know how to do inference agreeing with each other; why should researchers using statistics in their work agree on one procedure? There is an element of subjectivity and opinion in what one should or should not do. The effect that recommendations and advice have is to create cookie-cutter-like automated procedures for doing statistics (“just tell me which button to press”). There is no substitute for understanding. The basic theory is not that hard to understand; it’s definitely easier than quantum mechanics or launching an invasion of Normandy.

  10. I’m sorry if this is an inappropriate place to ask, but what texts in applied statistics would you suggest to graduate students in psychology or cognitive science?

  11. I think a big part of the problem in psychology is that the people drawn to many parts of the field usually have little interest in math or applied math. They have interesting questions about how the mind or brain works, and often design very clever experiments, but have so little interest in the logic behind statistics that they don’t notice even gross errors, and may not understand the advice they receive from a statistician. And given how little so many psychologists know about stats, it doesn’t take much to appear as an expert statistician to them, so they are often seeking advice from those who know only a little more about stats than they do. As a psych stats textbook author myself, I’m very aware of how poorly vetted stats texts are in the field of psychology. Publishers tend to think that anyone with a PhD in any area of psychology who teaches stats at a college can be trusted to write a good stats text, or evaluate one. By comparison, imagine trying to become a competent physicist, even though you are not comfortable with mathematical concepts. Of course, there are some very quantitatively sophisticated psychologists out there, but they often seek out the parts of psych that are closer to the natural sciences, and when they don’t, their work can be easily drowned out by the sheer volume of research published by those who are much less mathematically inclined.

    • I agree with much of what you say, but would add a couple of related things:

      1. There is a phenomenon (I sometimes call it “the game of telephone effect,” after the kid’s game where people sit in a circle, one person whispers a phrase to the person on one side, that person whispers it to the next person, and so on until after it’s gone around the circle it has typically been changed dramatically and often comically) where one person figures out what seems to be a good way to explain something statistical, someone else agrees but changes it a little, and so on until the explanation is far from the reality.

      2. I would expand your comment “that the people drawn to many parts of the field usually have little interest in math or applied math” and say that the people drawn to many parts of the field don’t care about precise language, but tend to think in broad terms, which just doesn’t work when you’re talking about statistics.

      • I would also add that the current American culture encourages an interest in psychology generally, but actively discourages interest in math and science. Those who are interested in, and excel at, math and science are called “geeks.” Therefore, many who go into psych research, even when they have a great aptitude for math and/or science, have already been drained of their motivation to study and understand statistical concepts.

  12. Statistics isn’t really like knitting a sweater. A sweater is usually made to be worn, and if it doesn’t fit, that is the “fail” for the sweater.

    But for statistics, the person doing the statistics, and publishing the papers, doesn’t generally have to “wear” statistical results that clearly don’t fit.

    I’m sure if Steven Pinker had to have his career path and promotions impeded by some of the horrible and unfounded inferences he’s made about the cognitive inferiority of women, he might be a little more careful in his statements. Alas, he doesn’t.

  13. I think many of us who try to do rigorous experimental psychology research receive, as a matter of course, basic stats training (regression, t-tests, correlations, ANOVA, etc.). Many of us also seek out significant training beyond the basic sequence in most programs. And, as a result, many of us in the psychology field are just as horrified by papers published along those lines as you are. We’re equally horrified by the warping of conclusions in our own studies into better sounding/selling stories by journalists, so I was happy to see that you went to the original paper to see what it really claimed. But, also, one big issue that I’ve noticed in my career is a somewhat inexplicable language barrier between stats, as taught in experimental psych, and stats, as taught in stats departments. I have much reason to be especially knowledgeable in stats by virtue of conversations with my brother, who received his PhD in stats several years ago (and who pointed me to your blog). But, very early on in his training, we discovered how little overlap there was in the language used and in the designs focused on in psych vs. math stats. His knowledge of time series analyses dwarfed mine within his first year but, even post graduation, he knows far less about means comparisons than I (think I) do. The best of us do not, as some commenters think, rely on outside statisticians or only think about stats after collecting data. The best of us are very eager to learn as much about stats as possible to improve as statisticians and scientists. If you know of ways around the language barriers, you could immeasurably improve the discussions at my family’s Thanksgiving table (or immeasurably harm the discussions, if we let anyone else in the family have a vote).

  14. I think it’s worse than you make it out to be in part because of your (apparent) focus on academia. In the rest of the world non-statisticians perform all sorts of analyses AND make critical decisions about how to build a bridge, what substances go into food packaging and content, how water quality is measured, how forests are managed, and how data is collected (not just how much data but how it is selected). Everyone with a computer and R (or SAS or SPSS or for those of us old enough to remember BMDP) doesn’t think twice about making global inferences from data and analyses they don’t know they don’t understand.

    (I recognize there are plenty of wildlife biologists, physicists, and hydrologists that are better at Statistics than I am – and certainly better at constructing sentences. There are really good ones out there.)

    Researchers tuck their research inferences away in journals. But much of the rest of the world puts their inferences (right or wrong) into practice.
