11 thoughts on "Quality control problems at the New York Times"

  1. A lot of psychologists seem to take this pretty seriously (although hardly anyone seems to believe it). The guy has a solid reputation as an experimenter, the design of the experiments is thoughtful and clean, and it's being published in a top APA journal.
    – some people have raised some interesting concerns
    http://www.ruudwetzels.com//articles/Wagenmakerse
    and initial replications have not been successful
    http://ssrn.com/abstract=1699970

    But I think this is definitely in the realm of science. It's highly likely it's not actually psi, but in that case it has serious implications for experimental psychology more broadly – he followed a particularly careful version of standard protocol, after all.

    Now, the quantum stuff seems just silly, but apart from that, why _don't_ you think this should be in the science section?

  2. Sebastian:

    A discussion of the experiment could be in the science section. But the uncritical, gee-whiz reporting . . . that doesn't seem so serious to me, and certainly not up to the high standards that are (usually) upheld by the NYT science section. At the very least, the concerns of experts should be raised.

    Steven Levitt (the face of Freakonomics) is one of the nation's leading experts on quantitative social science. But I don't think any expertise or original thinking is being applied here. Speaking more generally, diluting the Freakonomics brand by associating it with reports of ESP doesn't seem so wise to me.

    Let me rephrase this. The cited research may very well be correct. But if Levitt et al. want to make that claim, they should make the claim! They could acknowledge that many researchers are skeptical but explain why they themselves are not (reasons they would then have to give). That would make it interesting. Reporting a press release, not so much. I'm pretty sure the NYT science section wouldn't run something this thoughtless.

  3. "The statistics are simple enough and the sample sizes small enough that it would make for a thought provoking introductory statistics lab to try to replicate the findings."

    But note: The findings may be less significant than they seem, for reasons noted in the Wagenmakers et al. article.
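
    For anyone who actually wants to run that lab, here is a minimal sketch in Python of what a single null replication looks like. The 100 subjects match the reported design; the 12 relevant trials per subject and everything else are stand-in numbers, not Bem's exact protocol:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n_subjects, n_trials = 100, 12   # stand-in numbers, not the exact design

        # Under the null, each guess is a 50/50 coin flip. Average within subject,
        # then run a one-sample t-test of the per-subject hit rates against 0.5.
        hit_rates = rng.binomial(n_trials, 0.5, size=n_subjects) / n_trials
        t, p = stats.ttest_1samp(hit_rates, 0.5)
        print(f"mean hit rate = {hit_rates.mean():.3f}, t({n_subjects - 1}) = {t:.2f}, p = {p:.3f}")

    Run it a few times and students get a feel for how often a "hit rate" above 50% shows up by chance alone.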

  4. There was a very credulous report on this in the usually reliable BPS blog: http://bps-research-digest.blogspot.com/2010/11/d… Ah, I just looked: that's the blog that Levitt links to, so he's just copying stuff from other blogs, not press releases.

    If the article were reporting true effects, that would be incredible, really turning our understanding of cause and effect on its head. But, as the paper that Sebastian links to discusses, there are some fundamental statistical problems here (multiple testing, anyone?) suggesting we need not worry yet.
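
    To put a rough number on that multiple-testing worry, here is a toy simulation (my own sketch in Python, not anything from the paper; the five tests per "study" and the other settings are made up) of how often at least one test in a study comes out "significant" when there is no effect at all:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        n_sims, n_subjects, n_trials, k_tests = 2000, 100, 12, 5

        studies_with_a_hit = 0
        for _ in range(n_sims):
            # Each "study" reports k_tests independent subgroup/outcome tests, all null.
            pvals = []
            for _ in range(k_tests):
                hit_rates = rng.binomial(n_trials, 0.5, size=n_subjects) / n_trials
                pvals.append(stats.ttest_1samp(hit_rates, 0.5).pvalue)
            studies_with_a_hit += min(pvals) < 0.05

        # With 5 independent looks this is roughly 1 - 0.95**5, i.e. about 0.23.
        print(f"P(at least one p < .05 | no effect) ~ {studies_with_a_hit / n_sims:.2f}")

    Real subgroup analyses are usually correlated rather than independent, so this overstates things a bit, but the basic point stands.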

  5. David:

    I took a look at the comments at the BPS link. One thing the defenders of the study don't seem to realize is that all the statistical sophistication in the world won't help you if you're studying a null effect. This is not to say that the actual effect is zero – who am I to say? – just that the comments about the high-quality statistics in the article don't say much to me.

  6. It's absolute rubbish that Bem uses a high-quality or standard method for his data analysis. As Wagenmakers (very good rebuttal – http://www.ruudwetzels.com//articles/Wagenmakerse…) points out, Bem himself has proposed an exploratory data-analysis methodology that confuses confirmatory and exploratory experiments and raises serious problems for accepting the results of any of his statistical tests at face value. Wagenmakers points out that the 'Bem Exploratory Method' (BEM) leads, of necessity, to significant hypothesis tests even in the face of null effects. He then refers to Bem's BEM throughout the rest of the article. Hilarious and well played. Wagenmakers also points out that Bayesian and frequentist t-tests should converge to similar results if there are real effects – and in this case they don't.
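
    For the curious, here is a rough sketch of the kind of default Bayesian t-test used for that comparison: a JZS (Cauchy-prior) Bayes factor for a one-sample t-test, written in Python. The formula follows Rouder et al. (2009) as I understand it, and the t = 2.5 with N = 100 below is just an illustrative input of roughly the size being discussed, not a number taken from the paper:

        import numpy as np
        from scipy import integrate

        def jzs_bf10(t, n, r=1.0):
            # Bayes factor for H1 (effect size ~ Cauchy(0, r)) versus H0 (effect = 0),
            # one-sample t-test with n observations. A sketch, not a vetted implementation.
            nu = n - 1
            null_marginal = (1 + t**2 / nu) ** (-(nu + 1) / 2)

            def integrand(g):
                return ((1 + n * g) ** -0.5
                        * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                        * r / np.sqrt(2 * np.pi) * g ** -1.5 * np.exp(-r**2 / (2 * g)))

            alt_marginal, _ = integrate.quad(integrand, 0, np.inf)
            return alt_marginal / null_marginal

        # Illustrative values: a t of about 2.5 with 100 subjects gives a Bayes factor
        # somewhere near 1-2, i.e. the data barely discriminate H1 from H0 even though
        # the p-value looks impressive.
        print(f"BF10 ~ {jzs_bf10(2.5, 100):.2f}")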

    Bem has a serious and potentially terminal case of emeritus disease (a variant of the Nobel disease – http://scienceblogs.com/insolence/2010/11/luc_mon…). How this got through review is a mystery (and an indictment of the journal). It certainly wouldn't be published in Psych Review, Cog Sci, or anything like that.

    If any psychologists take this utter, awful nonsense seriously, they aren't in my department. It's hard to express how much this makes my blood boil.

  7. WRT Freakonomics – this was actually put up by one of the blog editors and not by Levitt/Dubner or one of their regular contributors (those contributions are named). I'm not sure if that makes it better or worse – but I think it's harsh to blame this one on Levitt.

    CM – well, "high quality" is in the eye of the beholder. My sense is that the methods are pretty common within psych – but I'm not a psych researcher, so I might be off.
    The experimental protocol seems pretty solid to me (the one experimental flaw that has been pointed out so far seems minor). The point is – if this isn't psi, it's an important warning to experimental researchers, and I think it'd actually be a useful article for a methods class.

  8. Sebastian, I have to agree with you about high quality being in the eye of the beholder. And actually, you might be right about this stuff being common in psych. It is not common in my department, but that could be my good luck rather than a universal in the discipline. What I am criticizing specifically is the reporting of sub-group analyses without a priori specification or motivation. Bem's BEM. I don't think you can divorce the experimental methodology from the data analysis, and this analysis looks like it has a good dose of post hoc in it.

    If I can quote Wagenmakers, quoting Bem:

    “There are two possible articles you can write: (1) the article you planned to write when you designed your study or (2) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (2).” (Bem, 2003, pp. 171-172)

    "The conventional view of the research process is that we first derive a set of hypotheses from a theory, design and conduct a study to test these hypotheses, analyze the data to see if they were confirmed or disconfirmed, and then chronicle this sequence of events in the journal article. If this is how our enterprise actually proceeded, we could write most of the article before we collected the data. We could write the introduction and method sections completely, prepare the results section in skeleton form, leaving spaces to be filled in by the specific numerical results, and have two possible discussion sections ready to go, one for positive results, the other for negative results. But this is not how our enterprise actually proceeds. Psychology is more exciting than that (…)”

    If Bem thinks maintaining a distinction between exploratory and confirmatory research is not 'exciting', then I think he doesn't have the temperament to be involved in serious work. Personally, I get pretty excited about learning about the world in a real and rigorous manner.

  9. I think they could have used this link to introduce readers to this area of research, which can make for interesting reading if given a balanced treatment. The problem is the one-sided reporting of the new article, with no attempt to present the other side (which is probably the majority view here).
    I browsed the paper quickly, plus the commentary by Wagenmakers et al. I agree that the statistics are probably too simplistic.
    One aspect of their statistical test really bothers me, and I wonder if anyone else has noticed it or has comments…
    In their first experiment, they report a t-statistic with 99 degrees of freedom and an average of 53.1% for "correctly guessing" which side the erotic image would be placed on. Elsewhere, they told us that there were 100 students in the experiment, and each student was shown 36 images, some of which were erotic. It seems to me that in order for them to have 99 df, they must be running the test at the student level, which means they must have aggregated each student's multiple trials into an average. If this is so, then the 53.1% cannot be interpreted as the proportion of students who guessed the images correctly. You'd think a random effect is needed, along with a recognition that trials for any given student are not independent. (A toy simulation of this point is sketched below.)
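
    Here is that sketch (Python; the numbers, including a bit of made-up between-student variation in hit probability, are purely illustrative and not the paper's data). The per-student analysis aggregates each student's trials and t-tests the 100 averages against 0.5, which is where the 99 df comes from; the pooled analysis pretends every trial is independent and ends up anti-conservative:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        n_students, n_trials, n_sims = 100, 12, 2000
        reject_per_student, reject_pooled = 0, 0

        for _ in range(n_sims):
            # Students differ a little in their underlying hit probability, but the
            # average is exactly 0.5, so there is no true effect anywhere.
            p_student = np.clip(rng.normal(0.5, 0.10, n_students), 0.0, 1.0)
            hits = rng.binomial(n_trials, p_student)

            # Student-level analysis: one-sample t-test on per-student hit rates.
            # The mean of these rates is an average hit rate, not a proportion of
            # students who "got it right".
            rates = hits / n_trials
            reject_per_student += stats.ttest_1samp(rates, 0.5).pvalue < 0.05

            # Naive pooled analysis: treat all students-times-trials guesses as one
            # big independent binomial sample.
            pooled = stats.binomtest(int(hits.sum()), n_students * n_trials, 0.5)
            reject_pooled += pooled.pvalue < 0.05

        print(f"false-positive rate, per-student t-test: {reject_per_student / n_sims:.3f}")
        print(f"false-positive rate, pooled binomial:    {reject_pooled / n_sims:.3f}")

    The per-student test stays near the nominal 5%, while the pooled test rejects too often because it ignores the within-student dependence.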

  10. "I guess there's a reason they put this stuff in the Opinion section and not in the Science section, huh?"

    I always thought the best science writing the NYT has run recently was Judson's Wild Side, which also appeared in the Opinion section.

Comments are closed.