The aching desire for regular scientific breakthroughs

This post didn’t come out the way I planned.

Here’s what happened. I cruised over to the British Psychological Society Research Digest (formerly on our blogroll) and came across a press release entitled “Background positive music increases people’s willingness to do others harm.”

Uh oh, I thought. This sounds like one of those flaky studies, the sort of thing associated in recent years with Psychological Science and PPNAS.

But . . . the British Psychological Society, that’s a serious organization. And the paper isn’t even published in one of their own journals, so presumably they can give it a fair and unconflicted judgment.

At the same time, it would be hard to take the claims of the published paper at face value—we just know there are too many things that can go wrong in this sort of study.

So this seemed like a perfect example for taking what might be considered a moderate stance: to say that this paper looks interesting and isn’t subject to obvious flaws, but that, for the reasons so eloquently explained by Nosek et al. in their “50 shades of gray” study, it really calls out for a preregistered replication.

So I went to the blog and opened a new post dated 25 Dec (yup, it’s Christmas here in blog-time) entitled, “Here’s a case for a preregistered replication.”

And I started to write the post, beginning by constructing a long quote from the British Psychological Society’s press release:

A new study published in the Psychology of Music takes this further by testing whether positive music increases people’s willingness to do bad things to others.

Naomi Ziv at The College of Management Academic Studies recruited 120 undergrad participants (24 men) to take part in what they thought was an investigation into the effects of background music on cognition. . . .

The key test came after the students had completed the underling task. With the music still playing in the background, the male researcher made the following request of the participants:

“There is another student who came especially to the college today to participate in the study, and she has to do it because she needs the credit to complete her course requirements. The thing is, I don’t feel like seeing her. Would you mind calling her for me and telling her that I’ve left and she can’t participate?”

A higher proportion of the students in the background music condition (65.6 per cent) than the no-music control condition (40 per cent) agreed to perform this task . . .

A second study was similar but this time the research assistant was female, she recruited 63 volunteers (31 men) in the student cafeteria . . . After the underling task, the female researcher made the following request:

“Could I ask you to do me a favor? There is a student from my class who missed the whole of the last semester because she was very sick. I promised her I would give her all the course material and summaries. She came here especially today to get them, but actually I don’t feel like giving them to her after all. Could you call her for me and tell her I didn’t come here?”

This time, 81.8 per cent of the students in the background music condition agreed to perform this request, compared with just 33 per cent of those in the control condition. The findings are all the more striking given that the researchers’ requests in both experiments were based on such thin justifications (e.g. “I don’t feel like giving them to her after all”).

Shoot, this is looking pretty bad. I clicked through to the published paper and it seems to have many of the characteristics of a classic “Psychological Science”-style study: small samples; a focus on interactions; multiple comparisons reported in the research paper, along with many other potential comparisons that could’ve been performed had the data pointed in other directions; comparisons of statistical significance with non-significance; and an overall too-strong level of assurance.

I could explain all the above points but at this point I’m getting a bit tired of explaining, so I’ll just point you to yesterday’s post.
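For a quick taste of the multiple-comparisons problem, though, here’s a toy simulation of my own (nothing from the paper): give yourself small groups and a handful of outcomes to choose from, and pure noise will regularly hand you at least one statistically significant comparison.

```python
# Toy simulation (my illustration, not from the paper): many pure-noise
# "studies," each with five independent comparisons between two small
# groups and zero true effect everywhere.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, n_comparisons = 10_000, 30, 5
hits = 0
for _ in range(n_sims):
    pvals = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_comparisons)
    ]
    hits += min(pvals) < 0.05  # "success": at least one significant result
print(f"pure-noise studies with a significant result: {hits / n_sims:.0%}")
```

With five independent looks, the false-alarm rate is already about 1 − 0.95^5 ≈ 23%, and the garden of forking paths makes the effective number of comparisons much larger than whatever appears in the published tables.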

And, to top it all off, when you look at the claims carefully, they don’t make a lot of sense. Or, as it says in the press release, “The findings are all the more striking.” “More striking” = surprising = implausible. Or, to put it another way, this sort of striking claim puts more of a burden on the data collection and analysis to be doing what the researchers claim is being done.

Also this: “no previous study has compared the effect of different musical pieces on a direct request implying harming a specific person.” OK, then.

When you think about it, even the headline claim seems backwards. Setting aside any skepticism you might feel about background music having any consistent effect at all, doesn’t it seem weird that “positive music increases people’s willingness to do others harm”? I’d think that positive music would, if anything, make people nicer!

And the reported effects are huge. Background music changing the frequency of a particular behavior from 33% to 80%? Even Michael LaCour didn’t claim to find effects that large.
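Even taking the published numbers at face value, a back-of-the-envelope calculation shows how wobbly an estimate based on 63 people is. Here’s a minimal sketch (the per-condition group sizes are my guess; only the total is quoted above):

```python
# Rough uncertainty check on study 2 (group sizes are my assumption;
# the press release reports only 63 participants in total).
import numpy as np

agreed = np.array([27, 10])            # ~81.8% of 33, ~33% of 30
n = np.array([33, 30])
p = agreed / n
diff = p[0] - p[1]
se = np.sqrt(np.sum(p * (1 - p) / n))  # standard error of the difference
print(f"difference: {diff:.2f}, 95% CI roughly "
      f"({diff - 1.96 * se:.2f}, {diff + 1.96 * se:.2f})")
```

The interval spans something like 27 to 70 percentage points. Estimates that noisy, filtered through statistical significance, are exactly the setting in which published effect sizes come out way too large (type M errors).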

As is unfortunately common in this sort of paper, the results from these tiny samples are presented as general truths; for example,

The results of Study 1 thus show that exposure to familiar, liked music leads to more compliance to a request implying harming a third person. . . .

Taken together, the results of the two studies clearly show that familiar and liked music leads to more compliance, even when the request presented implies harming a third person.

Story time!

Where are we going here?

OK, so I wrote most of the above material, except for the framing, as part of an intended future post on a solid study that I still wasn’t quite ready to believe, given that we’ve been burned so many times before by seemingly solid experimental findings.

But, as I wrote it, I realized that I don’t think this is a solid study at all. Sure, it was published in Psychology of Music, which I would assume is a serious journal, but it just as well could’ve appeared in a “tabloid” such as Psychological Science or PPNAS.

So where are we here? One more criticism of a pointless study in an obscure academic journal. What’s the point? If the combined efforts of Uri Simonsohn, E. J. Wagenmakers, Kate Button, Brian Nosek, and many others (including me!) can’t convince the editors of Psychological Science, the #1 journal in their field, to clean up its act regarding hype of noise, it’s gotta be pretty hopeless for me to expect or even care about changes in the publication policies of Psychology of Music.

So what’s the point? To me, this is all an interesting window into what we’ve called the hype cycle, which encompasses not only researchers and their employers but also the British Psychological Society, which describes itself as “the representative body for psychology and psychologists in the UK,” and an entirely credulous article by Tom Jacobs in the magazine Pacific Standard.

I have particular sympathy for Jacobs here, as his news article is part of a series:

Findings is a daily column by Pacific Standard staff writer Tom Jacobs, who scours the psychological-research journals to discover new insights into human behavior, ranging from the origins of our political beliefs to the cultivation of creativity.

A daily column! 365 new psychology insights a year, huh? That’s a lot of pressure.

The problem with the hype cycle is not just with the hype

And this leads me to the real problem I see with the hype cycle. Actual hype doesn’t bother me so much. If an individual or organization hypes some dodgy claims, fine: They shouldn’t do it, but, given the incentives out there, it’s what we might expect. You or I might not think Steven Levitt is a “rogue economist,” but if he wants to call himself that, well, we have to take such claims in stride.

But what’s going on with the British Psychological Society seems in some way more troubling. I don’t think the author of that post was trying to promote or hype anything; rather, I expect it was a sincere, if overly trusting, presentation of what seemed on the surface to be solid science (p less than 0.05, published in a reputable journal, some plausible explanations in the accompanying prose). And similarly at Pacific Standard.

The hype cycle doesn’t even need conscious hype. All it needs is what John Tukey might call the aching desire for regular scientific breakthroughs.

You don’t have to be Karl Popper to know that scientific progress is inherently unpredictable, and you don’t need to be Benoit W. Mandelbrot to know that scientific breakthroughs, at whatever scale, do not occur on a regular schedule. But if you want to believe in routine breakthroughs, and you’re willing to not look too closely, you can find everything you need this week—every week—in psychological science.

And that is how the hype cycle continues, even without anyone trying to hype.

The disclaimer

OK, here we are, at that point in the blog post. Yes, some or all of the claims in this paper could in fact represent true claims about the general population. And even if many or most of the claims are false, this work could still be valuable in motivating people to think harder about the psychology of music. I mean, sure, why not?

As always, the point is that the statistical evidence is not what is claimed, either in the published paper or the press release.

If someone wants, they can try a preregistered replication. But given that the author herself says that these results confound expectations, I don’t know that it’s worth the effort. It’s not where I’d spend my research dollars. In any case, as usual I am not intending to single out this particular publication or its author. There’s nothing especially wrong with it, compared to lots of other papers of its type. Indeed, what makes it worth writing about is its very ordinariness: this paper represents business as usual in the world of quantitative research.

Those explanations

As always, we get stories that I can’t take seriously because they assume the truth of population statements that haven’t actually been demonstrated. For example:

Why should positive background music render us more willing to perform harmful acts? Ziv isn’t sure – she measured her participants’ mood in a questionnaire but found no differences between the music and control groups. She speculates that perhaps familiar, positive music fosters feelings of closeness among people through a shared emotional experience. “In the setting of the present studies,” she said, “measuring connectedness or liking to the experimenter would have been out of place, but it is possible that a social bond was created.”

Both the researcher and the publicist forgot the alternative explanation that maybe they are just observing variation in some small group that does not reflect any general patterns in the population. That is, maybe no explanation is necessary, just as we don’t actually need to crack open our physics books to explain why Daryl Bem managed to find some statistically significant interactions in his data.
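If you want to see how far “just variation” can go, here’s a little simulation of my own (again, not from the paper): two groups of 30 drawn from the exact same population, compared over and over.

```python
# My illustration: with no true effect at all, how big does the observed
# gap between two groups of 30 look, purely from sampling variation?
import numpy as np

rng = np.random.default_rng(1)
n_sims, n = 10_000, 30
a = rng.binomial(n, 0.5, n_sims) / n  # observed rate, group 1
b = rng.binomial(n, 0.5, n_sims) / n  # observed rate, group 2 (same true rate)
gaps = np.abs(a - b)
print(f"median gap: {np.median(gaps):.0%}, "
      f"95th percentile: {np.percentile(gaps, 95):.0%}")
# Expect a median gap near 9 points and a 95th percentile near 25 points.
```

Gaps of 20-plus percentage points show up with some regularity even when nothing whatsoever is going on, which is why “we found a difference” in a small sample is not, by itself, news.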

The aching desire for regular scientific breakthroughs

Let me say it again, with reference to the paper by Ziv that got this all started. On one hand, sure, maybe it’s really true that “background positive music increases people’s willingness to do others harm,” even though the author herself writes that “a large number of studies examining the effects of music in various settings have suggested” the opposite.

But here’s the larger question. Why should we think that a little experiment on 200 college students provides convincing evidence overturning much of what we might expect about the effects of music? Sure, it’s possible—but just barely. What I’m objecting to here is the idea (encouraged, I fear, by lots and lots of statistics textbooks, including my own) that you can routinely learn eternal truths about human nature via these little tabletop experiments.

Yes, there are examples of small, clean paradigm-destroying studies, but they’re hardly routine, and I think it’s a disaster of both scientific practice and scientific communication that everyday noisy experiments are framed this way.

Discovery doesn’t generally come so easily.

This might seem to be a downbeat conclusion, but in many ways it’s an optimistic statement about the natural and social world. Imagine if the world as presented in “Psychological Science” papers were the real world. If so, we’d routinely be re-evaluating everything we thought we knew about human interactions. Decades of research on public opinion, smashed by a five-question survey on 100 or so Mechanical Turk participants. Centuries of physics overturned by a statistically significant p-value discovered by Daryl Bem. Hundreds of years of data on sex ratios of children, all needing to be reinterpreted because of a pattern some sociologist found in some old survey data. Etc.

What a horrible, capricious world that would be.

Luckily for us, as social scientists and as humans trying to understand the world, there is some regularity in how we act and how we interact, a regularity enforced by the laws of physics, the laws of biology, and by the underlying logic of human interactions as expressed in economics, political science, and so forth. There are not actually 365 world-shaking psychology findings each year, and the strategy of run-an-experiment-on-some-nearby-people-and-then-find-some-statistically-significant-comparisons-in-your-data is not a reliable way to discover eternal truths.

And I think it’s time for the Association for Psychological Science and the British Psychological Society to wake up, and to realize that their problem is not just with one bad study here and one bad study there, or even with misapplication of certain statistical methods, but with their larger paradigm, their implicit model for scientific discovery, which is terribly flawed.

And that’s why I wrote this post. I couldn’t care less about the effect of pleasant background music on people’s propensities to be mean. But I do care about how we do science.

45 thoughts on “The aching desire for regular scientific breakthroughs”

  1. This post really gets to the heart of the matter. But why do these breakthroughs occur all the time in the social sciences instead of, say, physics (except for the occasional cold fusion or perpetual-motion-machine crank)? Here’s a thought: it’s a combination of 1) people are generally tough to predict, so small sample sizes can give great variability, and 2) to the extent that human behavior is at all predictable, 50,000 years of societal evolution has helped us to pretty well understand our fellow human beings. So the research conclusion “Comfortable environment causes people to smile more” is never made, because we already know it and it therefore doesn’t “advance the science.”

  2. Good post.

    The thing that stands out to me is that the researcher’s request is so broad. It’s a favor to do a bad thing. They should have also tested favors to do a good thing. That way you could isolate whether people are just more willing to do favors when there’s music.

    • I was coming here to say something similar. I think Andrew’s surprise:

      “doesn’t it seem weird that “positive music increases people’s willingness to do others harm”?
      I’d think that positive music would, if anything, make people nicer!”

      This seems like a classic case of potential misinterpretation of results. Most likely under my prior, positive music in this case made the subject much more likely to do a favor for the experimenter even if that favor was something that wasn’t very nice to a nebulous third party.

      • Daniel:

        I should clarify. My point is that what they found could be interpreted as natural and even obvious, or counterintuitive and exciting. I was going with the researchers’ claim that their finding was counterintuitive, and saying that, in that case, maybe it should not be trusted. But I agree that you could flip it around and say that it makes sense. Indeed, the authors of the paper in question seem to be flipping back and forth between describing their finding as natural (they hypothesized it ahead of time) and, at the same time, running counter to their literature. The theory is just so vague that just about any result could be considered both a confirmation and a surprise. In that way it’s like the fecundity-and-clothing and fecundity-and-voting studies, where it’s easy to come up with a story to fit any possible pattern in the data.

        • Sure, that makes sense, but I was trying to point out that although such a large effect size seems “unlikely” under the given hypothesis (that nice music makes people more willing to harm others) there is a very simple hypothesis (that nice music makes people more willing to do favors for the experimenters) which makes such a large effect size seem perfectly plausible. If they had just dropped a pencil on the floor and asked the student to bend down and pick it up and 80% did in the nice music condition and 30% in the other condition… no-one would have spent 2 seconds looking at this paper.

  3. Totally agree. Scientists have generated their own personal hell, one little cheat at a time. This includes letting funding agencies take credit for work performed before the grant started, letting them imagine that they can get a breakthrough in 18 months of 3 people at 25% salary support, and so on. Bad money drives out the good, there is no free lunch. The only way to redress this is to degrade the understanding of nations and societies about the price of science and its payoff. They will pay more and get less – because they have been getting much less than they thought, all along.

  4. > , I fear, by lots and lots of statistics textbooks, including my own, that you can routinely learn eternal truths about human nature via these little tabletop experiments.

    That would be my guess at what former students of most stats courses take away (I always hate seeing “in this course students will learn how to _correctly_ analyze data (and even draw conclusions)”).

    Very good post. I look forward to when posts like this are no longer needed (like in about 2 generations).

  5. Regarding the study: So listening to music you like improves your mood, and when you’re in a good mood, you’re more inclined to help someone who requests something of you. To me, these don’t seem like revolutionary or even interesting claims. Abstracts and titles tend to state stylized facts in the most striking way possible, which is understandable since we all know that’s part of the game. I would love it if journals required authors to submit 10 or 20 titles and abstracts and editors would pick the most boring, least counter-intuitive ones. Or even just have some intern write the title and abstract (or your greatest rival!). Instead of “nice music makes you do harm,” this article would then have been described as “nice music makes you nice to a requester,” or something equally boring, and we could avoid a lot of the problems associated with the hype surrounding some of these papers.

  6. Great post.

    In one of my early publication bias talks, I expressed similar ideas: that we should not expect so many groundbreaking findings in rapid succession. An audience member near the back commented that the excitement of generating such findings was the whole reason he wanted to become a scientist; and I suspect that attitude is typical for many scientists in psychology. They love the rush of discovery (so do I), and it seems that they can get it on a regular basis.

    My response to the audience was to share an anecdote from a colleague, Howie Zelaznik. As Howie put it, over the course of a research career a scientist publishes lots of papers, but you should really aim for just 5. Obviously, you need more than 5 papers to get promoted and to satisfy grants and so forth, but many of those papers are just scientists marking time. Realistically, most of us can only hope for about 5 papers that really matter, and part of our job is to recognize when we have accumulated enough understanding about a topic to write one of those 5 papers.

    There is nothing wrong with publishing papers that are largely just marking time; it is the typical state of a scientist, and sometimes those papers slowly move in a meaningful direction or matter quite a bit to some small audience. I think I have 2 of my 5 really meaningful papers published, and I might be working on my third.

    • The thing about 5 meaningful papers is something that I also wrote recently in a blog post here, and then deleted because the associated rant was too harsh. But the thing about there really being only 5 papers that matter (I’d say plus or minus 2 or 3) is really true. And more to the point, there are plenty of scientists who will publish ZERO, and I’m afraid many of those ZEROs are published by people with huge, well-funded labs and lots of papers coming out each year.

      • Donald Geman wrote an opinion piece in 2007, “Ten Reasons Why Conference Papers Should be Abolished”. A few quotes,

        “Community-wide, do we really believe that every few months there are several hundred advances worthy of our attention? How many good ideas does the typical researcher (or anybody for that matter) have in a lifetime? A Poisson distribution with a small parameter?”

        “Are we making progress? Are we steadily (if slowly) building on solid foundations? It is difficult to know. Given all the noise due to the sheer volume of papers, the signal, namely the important, lasting stuff, is awfully difficult to detect.”

        “My own (half-serious) suggestion is to limit everybody to twenty lifetime papers. That way we would think twice about spending a chip. And we might actually understand what others are doing – even be eager to hear about so-and-so’s #9 (or #19).”

        http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/Ten_Reasons.pdf

        Robin

  7. As some people have commented, insofar as the studies indicated any effect, that effect is probably that people who are listening to positive background music are more willing to do what they are asked. That effect seems reasonable to me, although I am skeptical about this effect size.

    As for making people more willing to do harm, it could easily be that music makes people less willing to do harm, but that the above effect swamps it.

  8. Two thoughts.

    1. Andrew, this is not at all a criticism! It is a feature. But these posts remind me of the joke about the old comedians who tell each other jokes by numbers. “47.” “Yeah, that was a good one.” At this point you could quote an abstract and cite a list of your perennial points: “1-5, 7, 10, 12, with extra hype in the journal press release …”

    2. I’ve said it before, but the one person who is best situated to promote or break the hype cycle for the general educated public is NPR’s Shankar Vedantam. The poor guy’s whole beat is the psych hype machine, and he’d do millions of listeners a great service if he’d learn these lessons and just report with suitable skepticism.

    • Very:

      Yup, I have to admit that I’m getting a bit bored of Tol. He doesn’t really have anything interesting to say. And, unlike Weggy, he’s not in the statistics profession so it seems easy to ignore him.

  9. You missed a trick. According to Wikipedia (which cites his autobiography, which I haven’t checked), Benoit Mandelbrot’s middle initial was B, not W. The B, of course, stood for “Benoit B. Mandelbrot”. :-)

    Back to the study: I cannot see how the author reaches the conclusion that the students were “prepared to harm someone”, without having first controlled for their willingness to help the researcher (with a task that didn’t harm anyone). With the researcher sitting there, and the person to be harmed /a/ not present and /b/ being described as a bit of a loser, I’m not surprised that people complied. Until we see that control, we’re potentially into “music puts you into a good mood, and prepared to help people in the same room”, which seems a lot less implausible.

    • Steve:

      Yes, it’s well known and well accepted that music can affect behavior. If the paper in question had simply said, “Music can affect behavior,” it would be uncontroversial and, indeed, unpublished. What I’m saying is that there’s not much evidence for the specific claims made in this paper.

  10. I don’t understand why you insist on giving research such as this the benefit of the doubt. I am more inclined to see it as conscious hyping and fraud, rather than an accidental mistake that we should be tolerant of. Presumably sometimes accidental mistakes occur, but I think this is likely only true in a small minority of cases. I think that papers such as this are the result of predictable mistakes that anyone with half a brain and a basic modicum of honesty would anticipate and avoid. I want to go into full rebellion mode, and call out people who are doing terrible work as terrible. I feel like no one will ever have much incentive to respond to polite criticisms such as yours, and you need to add more punch to your remarks. Maybe I’m just frustrated, impatient, and young. I certainly want to avoid the quick-judgement failure modes of Twitter mobs and Tumblr activists. But still…

    • 27:

      Not too many people call me “polite,” so this comment makes my day!

      Seriously, though, perhaps the issue is that, as a political scientist, I’ve been trained to think about the motivations of others, even when I disagree with them.

      Thus:

      Yes, I’m pretty sure that Weggy and Chrissy know that they’ve copied material from others without attribution, but I’m guessing that Weggy thinks it’s ok because he’s a man of many responsibilities, and he didn’t want to let people down by telling them he wasn’t up to the job, and I’m guessing that Chrissy thinks that everybody does it and, in any case, he’s just trying to entertain.

      Yes, I’m pretty sure that Marc Hauser knows that he was cheating with his analysis of the monkey tapes, but I’m guessing that he thinks/”knows” his theories are true and so he doesn’t want to waste everyone’s time on minor technicalities such as certain monkeys not behaving the way he knows they should.

      As for the author of the paper described above: I have no idea, but my guess is that she’s doing research the way she’s been trained to do and that she genuinely feels she’s making important discoveries. Statistics is hard. Recall from earlier posts that even great scientists such as Alan Turing and Daniel Kahneman have overstated the strength of evidence. Uncertainty is difficult to deal with, and it seems that people will take just about any excuse to move toward certainty.

      • I somehow share the sentiment of 27chaos, but from my current perspective as a PhD student in psychology, it seems to me that the underlying problem is that many psychologists, even the ones who strive to turn the field into a respectable science, can’t help but violate Feynman’s first principle, which I quote here:

        “The first principle is that you must not fool yourself—and you are the easiest person to fool. So you have to be very careful about that. After you’ve not fooled yourself, it’s easy not to fool other scientists. You just have to be honest in a conventional way after that” ― Richard Feynman.

        As the evolutionary biologist Robert Trivers has argued, self-deception has proved adaptive throughout human evolution, so I am not really sure how we can control such a potent innate bias when many in the field choose to give economic security and fame a higher priority than the pursuit of truth. If anything, my experience has taught me that the structure of psychology graduate training tends to filter out the idealistic knowledge seekers and reward those who are willing to accept that there is an inevitable, substantial business-like aspect to doing research and publishing in psychology (I can’t speak of other fields from my limited experience).

        But that is only one facet of the institutional problems that plague the field, some of which John P. A. Ioannidis describes in his paper ‘Why Most Published Research Findings Are False’. He argues, for example, that a research finding is less likely to be true when

        1. the studies conducted in a field are smaller
        2. effect sizes are smaller
        3. there is a greater number and lesser preselection of tested relationships
        4. there is greater flexibility in designs, definitions, outcomes, and analytical modes
        5. there is greater financial and other interest and prejudice
        6. more teams are involved in a scientific field in chase of statistical significance

        Probably there are more one can think of, but as long as there are no top-down structural improvements, I am afraid change will be very slow, proceeding at the rate of biological evolution!

        At any rate, here is an example I recently came across of a minor structural change that can take us in the right direction: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4527093/

  11. Nosek seems to see a bright side to failed replications that’s quite ridiculous.

    Perhaps the incentives generated by his very own study failing replication cause him to take this stand.

    Maybe we just need to drastically cut funding for these areas. I fail to see how we, as a society, gain much via a study on fat arms, red clothes, et cetera, even if the noisy measurements and other issues were ironed out.

    • Rahul:

      I can’t be sure, but given Nosek’s strong opinions (as expressed, for example, through his alternative persona, Arina K. Bones), I have a feeling that he’s just trying to be diplomatic and to hold together a loose coalition of replicators. See further discussion in this recent post.

  12. Hello Dr. Gelman,

    As an occasional reader of this blog, I often find these articles and discussions helpful in my understanding of statistics. I was wondering if you would be willing to sometimes post psychology and sociology studies that – in your opinion – are very well done. This would help give a better feel for, and provide examples of, how to do a social science study properly, in terms of how one can protect oneself against the pitfalls you describe (forking paths, small samples, poor measurement).

    Thank you.

  13. I think your point about hype from organisations like the BPS is important. If professional bodies aren’t promoting higher standards among those they represent then I’m not really sure what they’re for.
