
The hype cycle starts again


Completely uncritical press coverage of a speculative analysis.

But, hey, it was published in the prestigious Proceedings of the National Academy of Sciences (PPNAS)! What could possibly go wrong?

Here’s what Erik Larsen writes:

In a paper published in the Proceedings of the National Academy of Sciences, “People search for meaning when they approach a new decade in chronological age,” Adam L. Alter and Hal E. Hershfield conclude that “adults undertake a search for existential meaning when they approach a new decade in age (e.g., at ages 29, 39, 49, etc.) or imagine entering a new epoch, which leads them to behave in ways that suggest an ongoing or failed search for meaning (e.g., by exercising more vigorously, seeking extramarital affairs, or choosing to end their lives)”. Across six studies the authors find significant effects of being a so-called 9-ender on a variety of measures related to meaning-searching activities.

Larsen links to news articles in the New Republic, Washington Post, Salon, Pacific Standard, ABC News, and the British Psychological Society, all of which are entirely uncritical reports, and continues:

I [Larsen] show that each of the six studies in the paper contains at least one crucial deficiency hindering meaningful inferences. In several of the studies the results stem from the fact that the end digits are not comparable: 9, for example, is more likely to be found in younger age decades, as the age range in all the studies is from 25 to 64. In other words, if people are more likely to engage in an activity at a younger age compared to later in life, higher end digits are more likely to measure such differences compared to lower end digits. When controlling properly for age, the differences reported in some of the studies fail to reach statistical significance. In other studies, results were questionable due to empirical shortcomings.

Larsen conveniently made the data accessible via R code in this GitHub repository.

You can follow the link for all of Larsen’s comments but here are a few:

In Study 1, the authors use data from the World Values Survey. They conclude that “9-enders reported questioning the meaning or purpose of life more than respondents whose ages ended in any other digit”. However, if one takes age decades into consideration, the 9-ender effect fails to reach statistical significance (p=0.71) despite the sample size of 42,063.

In the replication material, some respondents from the World Values Survey are excluded. In the full data set (obtained at http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp), there are 56,145 respondents in the age range 25-64. The difference between 9-enders and the other respondents on the Meaning variable is 0.001 and is non-significant . . .

In Study 2, age decades are less relevant due to the experimental design. Here the authors find significant effects of the experimental stimuli, and the effects seem robust. However, no randomization tests are reported. One sees that the experimental group differs systematically from both the baseline and control groups on pre-randomization measured variables. People in the experimental group are, for example, on average over 6 years younger than the baseline group (p<0.001) and are further from a new decade than the control group (p=0.02), calling into question the possibility of drawing design-based inferences. . . .

In Study 3, the authors argue that 9-enders are more likely to seek extramarital affairs. What did the authors do here? They “categorized 8,077,820 male users, aged between 25 and 64 years, according to the final digit of their ages. There were 952,176 9-enders registered on the site”.

Before looking at the data, one can ask two questions. First, whether those aged 30 are more likely to report the age of 29, and so on. The authors argue that this is not the case, but I am not convinced. Second, and something the authors don’t discuss, whether age is a good proxy for the age users had when they signed up at the specific site (unless everybody signed up in the months before November 2013). . . .

In Study 4, the authors look at suicide victims across the United States between 2000 and 2011 and show that 9-enders are more likely to commit suicide. Interestingly, the zero-order correlation between suicide rate and 9-enders is non-significant. The model the authors fit shows a significant result, but when controlling properly for age, the effect of 9-enders fails to reach statistical significance (p=0.13).

In Study 5, the authors argue that runners completed marathons faster at ages 29 and 39. However, simple tests show that there is no statistical evidence that people run faster when they are 29 compared to when they are 30 (p=0.2549) or 31 (p=0.8908), and the same is the case when they are 39 compared to 40 (p=0.9285) and 41 (p=0.5254). Hence, there is no robust evidence that a 9-ender effect drives the results reported by the authors. . . .
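To make “controlling properly for age” concrete, here is a minimal R sketch of the kind of check Larsen describes (my sketch, not his actual code; the data frame and column names wvs, meaning, age, marathon, and time are hypothetical stand-ins, and the real data and code are in his GitHub repository):

# Hypothetical stand-in for the World Values Survey extract (ages 25-64)
wvs$ender9 <- as.integer(wvs$age %% 10 == 9)  # indicator for 9-enders
wvs$decade <- factor(wvs$age %/% 10)          # age decade: 20s, 30s, 40s, 50s, 60s

# Naive comparison: 9-enders vs. everyone else, ignoring age decade
summary(lm(meaning ~ ender9, data = wvs))

# Larsen-style check: does the 9-ender effect survive age-decade controls?
summary(lm(meaning ~ ender9 + decade, data = wvs))

# Study 5-style check: do 29-year-olds actually run faster than 30-year-olds?
t.test(time ~ factor(age), data = subset(marathon, age %in% c(29, 30)))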

To relay these comments is not to say there is nothing going on: people are aware of their ages and it is reasonable to suppose that they might behave differently based on this knowledge. But I think that, as scientists, we’d be better off reporting what we actually see rather than trying to cram everything into a single catchy storyline.

Of all the news articles, I think the best was this one, by Melissa Dahl. The article was not perfect—Dahl, like everybody else who reported on this article in the news media, was entirely unskeptical—but she made the good call of including a bunch of graphs, which show visually how small any effects are here. The only things that jump out at all are the increased number of people on the cheating website who say they’re 44, 48, or 49—but, as Larsen says, this could easily be people lying about their age (it would seem almost a requirement that, if you’re posting on a cheaters’ site, you round your age down by a few years)—and the high number of first-time marathoners at 29, 32, 35, 42, and 49. Here’s Dahl’s summary:

These findings suggest that in the year before entering a new decade of life, we’re “particularly preoccupied with aging and meaningfulness, which is linked to a rise in behaviors that suggest a search for or crisis of meaning,” as Hershfield and Alter write.

Which just seems ridiculous to me. From cheaters lying about their ages and people being more likely to run their first marathon at ages 29, 32, 35, 42, and 49, we get “particularly preoccupied with aging and meaningfulness”??? Sorry, no.

But full credit for displaying the graphs. Again, I’m not trying to single out Dahl here—indeed, she did better than all the others by including the graphs; it’s just too bad she didn’t take the next step of noticing how little was going on.

One problem, I assume, is the prestige of the PPNAS, which could distract a reporter from his or her more skeptical instincts.

Here, by the way, is that graph of suicide rates that appeared in Dahl’s news article:

[Graph: suicide rates by age, from Dahl’s article]

According to the graph, more people actually commit suicide at 40 than at 39, and more do it at 50 than at 49, while more people kill themselves at 58 than 59. So, no, don’t call the suicide hotline just yet.

P.S. To get an image for this post, I started by googling “hype cycle” but all I got was this sort of thing:

[Image: search results showing unidirectional “hype cycle” curves]

All these are images of unidirectional curves; none are cycles! Then I went on Wikipedia and found that “hype cycle” is “a branded graphical tool.” How tacky. If you’re going to call something a cycle, you gotta make it return to its starting point!

The hype cycle

The “cycle” of which I speak goes something like this:

1. Researcher is deciding what to work on.

2. He or she hears about some flashy paper that was published in PNAS under the auspices of Susan Fiske, maybe something like himmicanes and hurricanes. The researcher, seeing this, realizes that you can get fame and science chits by running a catchy psych experiment or statistical analysis and giving it a broad interpretation with big social implications.

3. The analysis is done, it’s submitted to PNAS.

4. PNAS sends it to Susan Fiske, who approves it.

5. The paper appears and gets huge exposure and uncritical media attention, both in general and (I’m sorry to say) within the field of psychology.

Return to 1: Another researcher is deciding what to work on. . . .

That’s the hype cycle: The press attention and prestige publication motivates more work of this kind.

P.S. Let me emphasize: I’m not opposed to this sort of work. I think the age analysis is clever (and I mean that in a good way), and I think it’s great that this sort of thing is being done. But, please, can we chill on the interpretations? And, journalists (both in general and within our scientific societies), can you please report this with a bit of skepticism? I’m not saying you need to quote an “opponent” of the study; just don’t immediately jump to the idea that the claims are generally valid simply because they happened to appear in a top journal.

Remember, PNAS published the notorious “himmicanes and hurricanes” study.

Remember, the Lancet published the notorious Iraq survey.

Remember, Psychological Science published . . . ummm, I think you know where I’m going here.

Reporting with skepticism does not require “debunking.” It’s just a matter of being open-minded about the possibility that claimed results do not really generalize to the larger populations or questions of interest. In the above case it would imply questioning a claim such as, “We know that the end of a perceived era prompts us to make big life decisions,” which is hardly implied by some data showing that a bunch of 42-year-olds and 49-year-olds are running marathons.

On deck this week

Mon: The hype cycle starts again

Tues: I (almost and inadvertently) followed Dan Kahan’s principles in my class today, and that was a good thing (would’ve even been more of a good thing had I realized what I was doing and done it better, but I think I will do better in the future, which has already happened by the time you read this; remember, the blog is on a nearly 2-month lag)

Wed: Leif and Uri need to hang out with a better class of statisticians

Thurs: Quantitative literacy is tough! Or, I had no idea that, in 1958, 96% of Americans disapproved of interracial marriage!

Fri: Arizona plagiarism update

Sat: Unstrooping names

Sun: A question about varying-intercept, varying-slope multilevel models for cross-national analysis

What do Rick Santorum and Andrew Cuomo have in common?

Besides family values, that is?

Both these politicians seem to have a problem with the National Weather Service:

The Senator:

Santorum also accused the weather service’s National Hurricane Center of flubbing its forecasts for Hurricane Katrina’s initial landfall in Florida, despite the days of all-too-prescient warnings the agency had given that the storm would subsequently strike the Gulf Coast.

The Governor:

Governor Cuomo’s attempt to scapegoat the National Weather Service for an inaccurate forecast in advance is not only completely in error—the NWS did an outstanding job—but is a disservice to the public and to the hard-working staff of this federal agency. No forecast of such an historical disaster is going to be absolutely perfect, but no one who lives here can say this event was not well forecast in advance, or that the warning headlines of its impact to come were not well explained in advance…his statement is disinformation, purposeful or ill-informed.

Hey, politicians are politicians, they have to make lots of compromises. But, as a statistician, I’m repulsed by this sort of anti-data attitude coming from either political party.

Princeton Abandons Grade Deflation Plan . . .

. . . and Kaiser Fung is unhappy. In a post entitled, “Princeton’s loss of nerve,” Kaiser writes:

This development is highly regrettable, and a failure of leadership. (The new policy leaves it to individual departments to do whatever they want.)

The recent Alumni publication has two articles about this topic, one penned by President Eisgruber himself. I’m not impressed by the level of reasoning and logic displayed here.

Eisgruber’s piece is accompanied by a photo, captioned thus:

The goal of Princeton’s grading policy is to provide students with meaningful feedback on their performance in courses and independent work.

Such a goal [writes Kaiser] is far too vague to be practical. But let’s take this vague policy at face value. How “meaningful” is this feedback when 40% of grades handed out are As, and 80% of grades are either As or Bs? (At Stanford, Harvard, etc., the distributions are even more skewed.)

Here are some data:

[Chart: grade distribution data]

My agreement with Kaiser

As a statistician, I agree with Kaiser that if you want grades to be informative, it makes sense to spread out the distribution. It’s an obvious point and it is indeed irritating when the president of Princeton denies or evades it.

I’d also say that “providing students with meaningful feedback” is one of the least important functions of course grades. What kind of “meaningful feedback” comes from a letter (or, for that matter, a number) assigned to an entire course? Comments on your homeworks, papers, and final exams: that can be meaningful feedback. Grades on individual assignments can be meaningful feedback, sure. But a grade for an entire course, not so much.

My impression is that the main functions of grades are to motivate students (equivalently, to deter them from not doing what it takes to get a high grade) and to provide information for future employers or graduate schools. For these functions, as well as for the direct feedback function, more information is better and it does not make sense to use a 5-point scale where 80% of the data are on two of the values.

One can look at this in various ways but the basic psychometric principle is clear. For more depth, go read statistician Val Johnson’s book on grade inflation.
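To put a rough number on the informativeness point, here is a small R calculation of my own (not Kaiser’s or Johnson’s), using made-up grade distributions that roughly match the 40%/80% figures in the quote above:

# Shannon entropy (in bits): a crude measure of how much a single course
# grade can tell you about a student.
entropy <- function(p) -sum(p * log2(p))

# Roughly the situation Kaiser describes: ~40% A's and ~80% A's or B's
compressed <- c(A = 0.40, B = 0.40, C = 0.15, D = 0.04, F = 0.01)

# A more spread-out (hypothetical) distribution
spread <- c(A = 0.20, B = 0.30, C = 0.30, D = 0.15, F = 0.05)

entropy(compressed)  # about 1.7 bits per grade
entropy(spread)      # about 2.1 bits per grade; a uniform spread would give log2(5), about 2.3

The numbers are crude, but they make the direction of the argument clear: the more the grades pile up on A and B, the less each grade can possibly tell an employer or graduate school.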

My discomfort

OK, now for my disagreement or, maybe I should say, my discomfort with Kaiser’s argument.

Grad school grades.

In grad school we give almost all A’s. I’m teaching a course on statistical communication and graphics, and I love the students in my class, and I might well give all of them A’s.

In other grad classes I have lots of grading of homeworks and exams and I’ll give mostly A’s. I’ll give some B’s and sometimes students will complain about that, how it’s not fair that they have to compete with stat Ph.D. students, etc.

The point is, if I really believe Kaiser’s principles, I’d start giving a range of grades, maybe 20% A, 20% B, 20% C, 20% D, 20% F. But of course that wouldn’t work. Not at all. I can’t do it because other profs aren’t doing it. But even if all of Columbia were to do it . . . well, I have no idea, it’s obviously not gonna happen.

Anyway, my point is that Princeton’s motivation may well be the same as mine: yes, by giving all A’s we’re losing the ability to give feedback in this way and we’re losing the opportunity to provide a useful piece of information to potential employers.

But, ultimately, it’s not worth the trouble. Students get feedback within their classes, they have internal motivation to learn the material, and, at the other end, employers can rely on other information to evaluate job candidates.

From that perspective, if anyone has the motivation to insist on strict grading for college students, it’s employers and grad schools. They’re the ones losing out by not having this signal, and they also lose out indirectly if students don’t learn the material as well because they’re less motivated in class.

In the meantime, it’s hard for me to get mad at Princeton for allowing grades to rise, considering that I pretty much give all A’s myself in graduate classes.

P.S. Kaiser concludes:

The final word appears to be a rejection of quantitative measurement. Here’s Eisgruber:

The committee wisely said: If it’s feedback that we care about, and differentiating between good and better and worse work, that’s what we should focus on, not on numbers.

The wisdom has eluded me.

Let me add my irritation at the implicit equating of “wisdom” with non-quantitative thinking.

Blogs > Twitter

Tweeting has its virtues, I’m sure. But over and over I’m seeing these blog vs. twitter battles where the blogger wins. It goes like this: blogger gives tons and tons of evidence, tweeter responds with a content-free dismissal.

The most recent example (as of this posting; remember we’re on an approx 2-month delay here; yes, this site is the blogging equivalent of the “slow food” movement), which I heard about on Gawker (sorry), is Slate editor Jacob Weisberg, who took a break from tweeting items such as “How Can You Tell if a Designer Bag Is Fake?” to defend serial plagiarist Fareed Zakaria:

[Screenshot: Weisberg’s tweet]

(Just to interject: to say that Zakaria is a serial plagiarist is not to say he has nothing to offer as a journalist. Plagiarists ranging from Martin Luther King to Doris Kearns Goodwin to Laurence Wayne Tribe have been functional members of society when not copying the work of others without attribution.) (OK, I’m kidding about that “Wayne” thing; I just thought Tribe needed a middle name too, to fit with his distinguished associates.)

OK, back to the story. The tweet is Weisberg’s empty defense of Zakaria’s plagiarism. OK, not completely empty, but what can you do in 140 characters? Weisberg’s point is that it’s ok to quote without attribution on TV because on TV there’s no room for footnotes. Fair enough. I can tell you right now that, on the rare occasions that I go on radio or TV, I would never use someone else’s words without attribution, but Zakaria’s a lot more busy than I am, and a lot more practiced in using other people’s words, so maybe the guy just can’t help it.

OK, back to the story. (Hey, I keep saying that! Maybe we’re having too many digressions here. Anyway . . . ) Now comes the blogger. “@blippoblappo & @crushingbort” are blogging, they have as much space as they want, indeed their only limitation is their own capacity to get bored by their own writing. OK, blogging is rarely crisp, but they have the space to make their points. They don’t need to pick and choose, they can say everything they need to say.

In particular, in their post, “Yes, the indefensible Fareed Zakaria also plagiarized in his fancy liquor columns for Slate,” they have this:

[Screenshot: excerpt from the blippoblappo & crushingbort post]

Winner: blog. It’s not even close. When it’s information vs. soundbites, information wins.

Of course this is not to say that bloggers are always correct or that every tweeter is wrong. Not at all. But I do think there’s a benefit to being able to lay out the whole story in one place.

50 shades of gray goes pie-chart

Rogier Kievit sends in this under the heading, “Worst graph of the year . . . horribly unclear . . . Even the report doesn’t have a legend!”:

[Image: the gray, legend-free pie chart in question]

My reply:

It’s horrible but I still think the black-and-white Stroop test remains the worst visual display of all time:

What’s particularly amusing about the Stroop image is that it does have color—but only in its label!

But I hate to start off the weekend on such a downer, so let me point you to a much more delightful infographic (link from Arthur Charpentier):

[Infographic: fruits]

Aahhhh, much better.

“If you’re not using a proper, informative prior, you’re leaving money on the table.”

Well put, Rob Weiss.

This is not to say that one must always use an informative prior; oftentimes it can make sense to throw away some information for reasons of convenience. But it’s good to remember that, if you do use a noninformative prior, you’re doing less than you could.
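To see the “money on the table” in a toy example of my own (not Rob’s), compare the posterior for a normal mean under a flat prior and under a weakly informative one; the conjugate algebra is standard, and the informative prior buys a visibly tighter posterior:

# Toy example: estimate a mean from n noisy observations with known sd,
# first with a flat prior, then with an informative N(0, 5^2) prior.
set.seed(123)
n     <- 10
sigma <- 10
y     <- rnorm(n, mean = 2, sd = sigma)

# Flat prior: posterior is N(ybar, sigma^2 / n)
flat_mean <- mean(y)
flat_sd   <- sigma / sqrt(n)

# Informative prior N(mu0, tau0^2): conjugate normal-normal update
mu0 <- 0; tau0 <- 5
post_prec <- 1 / tau0^2 + n / sigma^2
post_mean <- (mu0 / tau0^2 + sum(y) / sigma^2) / post_prec
post_sd   <- sqrt(1 / post_prec)

c(flat_mean = flat_mean, flat_sd = flat_sd)  # wider posterior
c(post_mean = post_mean, post_sd = post_sd)  # shrunk toward 0 and narrower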

Soil Scientists Seeking Super Model

I (Bob) spent last weekend at Biosphere 2, collaborating with soil carbon biogeochemists on a “super model.”


Model combination and expansion

The biogeochemists (three sciences in one!) have developed hundreds of competing models, and the goal of the workshop was to kick off some projects on putting some of them together into wholes that are greater than the sum of their parts. We’ll be doing some mixture (and perhaps change-point) modeling, which makes sense here because different biogeochemical processes are at work depending on system evolution and extrinsic conditions (some of which we have covariates for, and some of which can be modeled with random effects), and we’re also going to do some of what Andrew likes to call “continuous model expansion.”

Others at the workshop also expressed interest in Bayesian model averaging as well as model comparison using Bayes factors, though I’d rather concentrate on mixture modeling and continuous model expansion, for reasons Andrew’s already discussed at length on the blog and in Bayesian Data Analysis (aka BDA3, aka “the red book”).

One of the three workshop organizers, Kiona Ogle, did a great job laying out the big picture during the opening dinner / lightning-talk session and then following it up by making sure we didn’t stray too far from our agenda. This is always a tricky balance with a bunch of world-class scientists, each with his or her own research agenda.

So far, so good

We got a surprising amount done over the weekend—it was really more hackathon than workshop, because there weren’t any formal talks.

GitHub repositories: Thanks to David LeBauer, another of the workshop organizers, we have a GitHub organization, with repositories for our work so far. David and I were really into pitching version control, and in particular GitHub, for managing our collaborations. Hopefully we’ve converted some Dropbox users to version control.

Stan “Hello World”: The soil-metamodel/stan repo includes a Stan implementation of a soil incubation model with two pools and feedback, which I translated from Carlos Sierra’s SoilR, an R package implementing a vast variety of linear and non-linear differential-equation-based soil-carbon models (the scope of which is explained in this paper).

Taking Michael Betancourt’s advice, I implemented a second version with lognormal noise and a proper measurement error model (see the repo), which fits much more cleanly (higher effective sample size, less residual noise, obeys scientific constraints on positivity).
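The lognormal-noise point is easy to see even outside Stan. Here is a tiny R simulation of my own (a one-pool caricature, not the workshop’s two-pool model) showing why multiplicative lognormal noise respects the positivity constraint on a decaying carbon pool while additive normal noise does not:

# One-pool exponential decay: C(t) = C0 * exp(-k * t), always positive
t_obs <- seq(0, 50, by = 1)
C0    <- 100
k     <- 0.1
mu    <- C0 * exp(-k * t_obs)  # true pool size

set.seed(42)
y_normal    <- mu + rnorm(length(t_obs), sd = 5)         # additive noise: can go negative
y_lognormal <- mu * exp(rnorm(length(t_obs), sd = 0.2))  # multiplicative noise: always > 0

sum(y_normal < 0)     # typically a few impossible negative "measurements"
sum(y_lognormal < 0)  # always 0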

“Forward” and “Backward” Michaelis-Menten: Bonnie Waring, a post-doc, not only survived having a scorpion attached to her ankle during dinner one night, but is also leading one of the subgroups I’m involved with, on reimplementing and expanding these models in Stan. Apparently, Bonnie’s seen much worse (than little Arizona scorpions) working in Costa Rica in the lab of Jennifer Powers (the third workshop organizer), to which Bonnie’s returning to run some of the enzyme assays we need to complete the data.

I’m very excited about this particular model combination, which involves some state-of-the-art models taking into account biomass and enzyme behavior. There are two different forms of Michaelis-Menten dynamics under consideration, as they both make sense for different subsystems of the aggregate soil and organic matter biogeochemistry.

The repo for this project is soil-metamodel/back-forth-mm, the readme for which has references to some papers, including one first-authored by Steve Allison, another of the workshop participants, with colleagues: Soil-carbon response to warming dependent on microbial physiology (Nature Geoscience).

Global mapping: Steve’s actually involved with a separate group doing global mapping, using litter decomposition data. The GitHub repo is soil-metamodel/Litter-decomp-mapping.

They’ve got some stiff competition (ODE pun intended), given the recent fine-grained, animated global carbon map that NASA just put out.

Non-linear models: Kathe Todd-Brown, another post-doc, helped me (and everyone else) unpack and understand all of the models by breaking them down from narratives to differential equations. Kathe’s leading another subgroup looking at non-linear models, which I’m also involved with. I don’t see a public GitHub repo for that yet.

Science is awesome!

Right after Carlos, David, and I first arrived, we ran into a group of tourists, including some teenagers, who asked us, “Are you scientists?” We said, “Why yes, we are.” One of the teenagers replied, “That’s super awesome.” I happen to agree, but in nearly 30 years doing science, I can’t remember ever getting that reaction. So, if you’re a scientist and want to feel like a rock star, I’d highly recommend Biosphere 2.

It’s also a fun tour, what with the rain forest environment (i.e., a big greenhouse) and the 16-ton rubber-suspended “lung” for pressure equalization.

Retrospective clinical trials?

Kelvin Leshabari writes:

I am a young medical doctor in Africa who wondered if it is possible to have a retrospectively designed randomised clinical trial and yet have it be valid in a statistical sense.

This is because to the best of my knowledge, the assumptions underlying RCT methodology include that data is obtained in a prospective manner!

We are now about to design a new study in a clinical setting, and during the literature search we encountered a few studies published with such a design in mind. This has raised some confusion among the trialists as to whether we should include the findings and account for them in our study, or whether the said findings are mere products of confusion in a mathematical/statistical sense!

My reply: You can have retrospective studies with good statistical properties—if you know the variables that predict the choice of treatment, then you can model the outcome conditional on these variables, and you should be ok. I don’t think it makes sense to speak of a retrospective RCT but you can analyze retrospective data as if they were collected prospectively, if you can condition on enough relevant variables. This is the sort of thing that Paul Rosenbaum and others have written about under the rubric of observational studies.
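Here is a minimal R sketch of what “model the outcome conditional on these variables” can look like, with made-up variable names (records, outcome, treated, age, sex, severity); the real analysis would of course depend on the study, and Rosenbaum’s books cover the design issues in depth:

# Hypothetical retrospective data: 'treated' was not randomized, but suppose
# age, sex, and disease severity are the variables that drove treatment choice.
fit <- glm(outcome ~ treated + age + sex + severity,
           family = binomial, data = records)
summary(fit)  # the 'treated' coefficient, adjusted for the selection variables

# A related option in the observational-studies tradition: model the
# probability of treatment and use it as a propensity score for matching
# or weighting.
ps <- glm(treated ~ age + sex + severity, family = binomial, data = records)
records$pscore <- fitted(ps)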

Replication controversies

I don’t know what ATR is but I’m glad somebody is on the job of prohibiting replication catastrophe:

[Screenshot]

Seriously, though, I’m on a list regarding a reproducibility project, and someone forwarded along this blog post by psychology researcher Simone Schnall, whose attitudes we discussed several months ago in the context of some controversies about attempted replications of some of her work in social psychology.

I’ll return at the end to my remarks from July, but first I’d like to address Schnall’s recent blog post, which I found unsettling. There are some technical issues that I can discuss:

1. Schnall writes: “Although it [a direct replication] can help establish whether a method is reliable, it cannot say much about the existence of a given phenomenon, especially when a repetition is only done once.” I think she misses the point that, if a replication reveals that a method is not reliable (I assume she’s using the word “reliability” in the sense that it’s used in psychological measurement, so that “not reliable” would imply high variance), then it can also reveal that an original study, which at first glance seemed to provide strong evidence in favor of a phenomenon, really doesn’t provide such evidence. The Nosek et al. “50 shades of gray” paper is an excellent example.

2. Her discussion of replication of the Stroop effect also seems to miss the point, or at least so it seems to me. To me, it makes sense to replicate effects that everyone believes, as a sort of “active control” on the whole replication process. Just as it also makes sense to do “passive controls” and try to replicate effects that nobody thinks can occur. Schnall writes that in the choice of topics to replicate, “it is irrelevant if an extensive literature has already confirmed the existence of a phenomenon.” But that doesn’t seem quite right. I assume that the extensive literature on Stroop is one reason it’s been chosen to be included in the study.

The problem, perhaps, is that she seems to see the goal of replication as a goal to shoot things down. From that standpoint, sure, it seems almost iconoclastic to try to replicate (and, by implication, shoot down) Stroop, a bit disrespectful of this line of research. But I don’t see any reason why replication should be taken in that way. Replication can, and should, be a way to confirm a finding. I have no doubt that Stroop will be replicated—I’ve tried the Stroop test myself (before knowing what it was about) and the effect was huge, and others confirm this experience. This is a large effect in the context of small variation. I guess that, with some great effort, it would be possible to design a low-power replication of Stroop (maybe use a monochrome image, embed it in a within-person design, and run it on Mechanical Turk with a tiny sample size?), but I’d think any reasonable replication couldn’t fail to succeed. Indeed, if Stroop weren’t replicated, this would imply a big problem with the replication process (or, at least with that particular experiment). But that’s the point, that’s one reason for doing this sort of active control. The extensive earlier literature is not irrelevant at all!

3. Also I think her statement, “To establish the absence of an effect is much more difficult than the presence of an effect,” misses the point. The argument is not that certain claimed effects are zero but rather that there is no strong evidence that they represent general aspects of human nature (as is typically claimed in the published articles). If an “elderly words” stimulus makes people walk more slowly one day in one lab, and more quickly another day in another lab, that could be interesting but it’s not the same as the original claim. And, in the meantime, critics are not claiming (or should not be claiming) an absence of any effect but rather they (we) are claiming to see no evidence of a consistent effect.

In her post, Schnall writes, “it is not about determining whether an effect is ‘real’ and exists for all eternity; the evaluation instead answers a simple question: Does a conclusion follow from the evidence in a specific paper?”—so maybe we’re in agreement here. The point of criticism of all sorts (including analysis of replication) can be to address the question, “Does a conclusion follow from the evidence in a specific paper?” Lots of statistical research (as well as compelling examples such as that of Nosek et al.) has demonstrated that simple p-values are not always good summaries of evidence. So we should all be on the same side here: we all agree that effects vary, none of us is trying to demonstrate that an effect exists for all eternity, none of us is trying to establish the absence of an effect. It’s all about the size and consistency of effects, and critics (including me) argue that effects are typically a lot smaller and a lot less consistent than are claimed in papers published by researchers who are devoted to these topics. It’s not that people are “cheating” or “fishing for significance” or whatever, it’s just that there’s statistical evidence that the magnitude and stability of effects are overestimated.

4. Finally, here’s a statement of Schnall that really bothers me: “There is a long tradition in science to withhold judgment on findings until they have survived expert peer review.” Actually, that statement is fine with me. But I’m bothered by what I see as an implied converse, that, once a finding has survived expert peer review, it should be trusted. OK, don’t get me wrong, Schnall doesn’t say that second part in this most recent post of hers, and if she agrees with me—that is, if she does not think that peer-reviewed publication implies that a study should be trusted—that’s great. But her earlier writings on this topic give me the sense that she believes that published studies, at least in certain fields of psychology, should get the benefit of the doubt: that, once they’ve been published in a peer-reviewed publication, they should stand on a plateau and require some special effort to be dislodged. So when Study 1 says one thing and pre-registered Study 2 says another, she seems to want to give the benefit of the doubt to Study 1. But I don’t see that.

Different fields, different perspectives

A lot of this discussion seems somehow “off” to me. Perhaps this is because I do a lot of work in political science. And almost every claim in political science is contested. That’s the nature of claims about politics. As a result, political scientists do not expect deference to published claims. We have disputes, sometimes studies fail to replicate, and that’s ok. Research psychology is perhaps different in that there’s traditionally been a “we’re all in this together” feeling, and I can see how Schnall and others can be distressed that this traditional collegiality has disappeared. From my perspective, the collegiality could be restored by the simple expedient of researchers such as Schnall recognizing that the patterns they saw in particular datasets might not generalize to larger populations of interest. But I can see how some scholars are so invested in their claims and in their research methods that they don’t want to take that step.

I’m not saying that political science is perfect, but I do think there are some differences in that poli sci has more of a norm of conflict whereas it’s my impression that research psychology has more of the norms of a lab science where repeated experiments are supposed to give identical results. And that’s one of the difficulties.

If scientist B fails to replicate the claims of scientist A who did a low-power study, my first reaction is: hey, no big deal, data are noisy, the patterns in the sample do not generally match the patterns in the population, certainly not if you condition on “p less than .05.” But a psychology researcher trained in this lab tradition might not be looking at sampling variability as an explanation—nowhere in Schnall’s blog posts did I see this suggested as a possible source of the differences between original reports and replications—and, as a result, they can perceive a failure to replicate as an attack on the original study, and then it’s natural for them to attack the replication. But once you become more attuned to sampling and measurement variation, failed replications are to be expected all the time: that’s what it means to do a low-power study.
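That last point is easy to demonstrate. Here is a small R simulation of my own: a small true effect, a low-power two-group design, and then the estimates that happen to reach p < .05, which are the ones that get published:

# True effect is small; n = 20 per group with sd = 1, so power is very low.
set.seed(1)
true_effect <- 0.1
sims <- replicate(10000, {
  x <- rnorm(20, 0, 1)
  y <- rnorm(20, true_effect, 1)
  c(est = mean(y) - mean(x), p = t.test(y, x)$p.value)
})

signif_est <- sims["est", sims["p", ] < 0.05]
mean(sims["p", ] < 0.05)  # power: only a few percent
mean(signif_est)          # the average "significant" estimate is several times 0.1
mean(signif_est < 0)      # and a noticeable fraction even have the wrong sign

Under those conditions a careful preregistered replication will usually "fail," not because anyone did anything wrong, but because the original, statistically significant estimate was never a good guide to the size (or even the sign) of the effect.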