Even though it’s published in a top psychology journal, she still doesn’t believe it

Nadia Hassan writes:

I wanted to ask you about this article.

Andrea Meltzer, James McNulty, Saul Miller, and Levi Baker, “A Psychophysiological Mechanism Underlying Women’s Weight-Management Goals: Women Desire and Strive for Greater Weight Loss Near Peak Fertility,” Personality and Social Psychology Bulletin (2015): 0146167215585726.

I [Hassan] find it kind of questionable. Fortunately, the authors use a within-subject design, but it is 22 women. Effects in evolutionary biology are small. Women’s recall is not terribly accurate. Basically, to use a phrasing you’ve used before, the authors are not necessarily wrong, but it seems as though the evidence is not as strong as they claim.

Here’s the abstract of the paper in question:

Three studies demonstrated that conception risk was associated with increased motivations to manage weight. Consistent with the rationale that this association is due to ovulatory processes, Studies 2 and 3 demonstrated that it was moderated by hormonal contraceptive (HC) use. Consistent with the rationale that this interactive effect should emerge when modern appearance-related concerns regarding weight are salient, Study 3 used a 14-day diary to demonstrate that the interactive effects of conception risk and HC use on daily motivations to restrict eating were further moderated by daily motivations to manage body attractiveness. Finally, providing evidence that this interactive effect has implications for real behavior, daily fluctuations in the desire to restrict eating predicted daily changes in women’s self-reported eating behavior. These findings may help reconcile prior inconsistencies regarding the implications of ovulatory processes by illustrating that such implications can depend on the salience of broader social norms.

Ummm, yeah, sure, whatever.

OK, let’s go thru the paper and see what we find:

This broader study consisted of 39 heterosexual women (the total number of participants was determined by the number of undergraduates who volunteered for this study during a time frame of one academic semester); however, 8 participants failed to respond correctly to quality-control items and 7 participants failed to complete both components of the within-person design and thus could not be used in the within-person analyses. Two additional participants were excluded from analyses: 1 who was over the age of 35 (because women over the age of 35 experience a significant decline in fecundability; Rothman et al., 2013) and 1 who reported a desire to lose an extreme amount of weight relative to the rest of the sample . . .

Fork. Fork. Fork.

We assessed self-esteem at each high- and low-fertility session using the Rosenberg Self-Esteem Scale (Rosenberg, 1965) and controlled for it in a supplemental analysis.

Fork. (The supplemental analysis could’ve been the main analysis.)

Within-person changes in ideal weight remained marginally negatively associated with conception risk . . . suggesting that changes in women’s current weight across their ovulatory cycle did not account for changes in women’s ideal weight across their ovulatory cycle.

The difference between “significant” and “not significant” is not itself statistically significant.
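To see why, here’s the standard numerical illustration of that point, as a minimal sketch with made-up numbers (nothing here comes from the study): one estimate is “significant,” a second is not, yet the difference between them is nowhere near significant.

```python
# Made-up numbers for illustration only (not estimates from the paper).
from math import sqrt
from scipy.stats import norm

est_a, se_a = 25.0, 10.0          # z = 2.5, two-sided p ~ 0.01: "significant"
est_b, se_b = 10.0, 10.0          # z = 1.0, two-sided p ~ 0.32: "not significant"

diff = est_a - est_b                         # 15
se_diff = sqrt(se_a**2 + se_b**2)            # ~14.1
z = diff / se_diff                           # ~1.06
p_diff = 2 * (1 - norm.cdf(abs(z)))          # ~0.29
print(f"difference: z = {z:.2f}, p = {p_diff:.2f}")
```

So reporting that one comparison reached p < .05 and another did not tells you very little about whether the two comparisons actually differ.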

Notably, in this small sample of 22 women, self-esteem was not associated with within-person changes in conception risk . . .

“Not statistically significant” != “no effect.”

consistent with the idea that desired weight loss is associated with ovulation, only naturally cycling women reported wanting to weigh less near peak fertility.

The difference between “significant” and “not significant” is not itself statistically significant.

One recent study (Durante, Rae, & Griskevicius, 2013) demonstrates that ovulation had very different implications for women’s voting preferences depending on whether those women were single or in committed relationships.

Ha! Excessive credulity. If you believe that classic “power = .06” study, you’ll believe anything.

OK, I won’t go through the whole paper.

The point is, I agree with Hassan: this paper shows no strong evidence for anything.

Am I being unfair here?

At this point, you say that I’m being unfair: Why single out these unfortunate researchers just because they happen to have the bad luck to work in a field with low research standards? And what would happen if I treated everybody’s papers with this level of skepticism?

This question comes up a lot, and I have several answers.

First, if you think this sort of evolutionary psychology is important, then you should want to get things right. It’s not enough to just say that evolution is true, therefore this is good stuff. To put it another way: it’s quite likely, if you got enough data and measured carefully enough, that the patterns in the general population could well be in the opposite direction (and, I would assume, much smaller) than what was claimed in the published paper. Does this matter? Do you want to get the direction of the effect right? Do you want to estimate the effect size within an order of magnitude? If the answer to these questions is Yes, then you should be concerned when shaky methods are being used.

Second, remember what happened when that Daryl Bem article on ESP came out? People said that the journal had to publish that paper because the statistical methods Bem used were standard in psychology research. Huh? There’s no good psychology being done anymore so we just have to fill up our top journals with unsubstantiated claims, presented as truths?? Sorry, but I think Personality and Social Psychology Bulletin can do better.

Third, should we care about forking paths and statistical significance and all that? I’d prefer not to. I’d prefer to see an analysis of all the data at once, using Bayesian methods to handle the multiple levels of variation. But if the claims are going to be based on p-values, then forking paths etc are a concern.
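For concreteness, here’s a minimal sketch of what such an all-at-once multilevel analysis could look like, written in PyMC. Everything in it is an assumption for illustration: the variable names, the priors, and the simulated placeholder data are hypothetical, and this is not the authors’ model.

```python
# Minimal multilevel sketch (hypothetical; not the authors' analysis).
# Each woman contributes repeated daily observations; the per-woman slopes of
# desired weight loss on conception risk are partially pooled rather than
# tested one "significant vs. not significant" comparison at a time.
import numpy as np
import pymc as pm

n_women, n_days = 22, 14
woman = np.repeat(np.arange(n_women), n_days)          # participant index per observation
rng = np.random.default_rng(1)
conception_risk = rng.uniform(size=n_women * n_days)   # placeholder predictor
desired_loss = rng.normal(size=n_women * n_days)       # placeholder outcome

with pm.Model():
    mu_b = pm.Normal("mu_b", 0, 1)                     # population-average slope
    sigma_b = pm.HalfNormal("sigma_b", 1)              # between-woman variation in slopes
    b = pm.Normal("b", mu_b, sigma_b, shape=n_women)   # per-woman slopes
    a = pm.Normal("a", 0, 1, shape=n_women)            # per-woman intercepts
    sigma = pm.HalfNormal("sigma", 1)                  # residual day-to-day variation

    pm.Normal("y", a[woman] + b[woman] * conception_risk, sigma,
              observed=desired_loss)
    idata = pm.sample()                                # all levels estimated at once
```

The point isn’t this particular model; it’s that the within-person comparisons, the between-person variation, and the population-level effect all sit in one fitted model, so there’s no single p-value to lean on.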

What, then?

Finally, the question will arise: What should these researchers do with this project, if not publish it in Personality and Social Psychology Bulletin? They worked hard, they gathered data; surely these data are of some value. They even did some within-person comparisons! It would be a shame to keep these data unpublished.

So here’s my recommendation: they should be able to publish this work in Personality and Social Psychology Bulletin. But it should be published in a way that is of maximum use to the research field (and, ultimately, to society):

– Post all the raw data. All of it.

– Tone down the dramatic claims. Remember Type S errors and Type M errors, and the garden of forking paths, and don’t take your p-values so seriously. (A small simulation sketch after this list shows what Type S and Type M errors look like in a noisy study.)

– Present all the relevant comparisons; don’t just navigate through and report the results that are part of your story.

– Finally, theorize all you want. Just recognize that your theories are open-ended and can explain just about any pattern in data (just as Bem could explain whatever interaction happened to show up for him).
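Here’s that Type S / Type M simulation sketch. The true effect and standard error below are assumptions picked for illustration, not estimates from the paper; the point is just what low power does to the estimates that happen to reach significance.

```python
# Quick Type S / Type M check under assumed numbers (not from the paper).
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1        # assumed true within-person effect (arbitrary units)
se = 0.5                 # assumed standard error of its estimate

est = rng.normal(true_effect, se, size=1_000_000)    # many hypothetical replications
signif = np.abs(est) > 1.96 * se                     # those reaching "p < .05"

power = signif.mean()
type_s = (np.sign(est[signif]) != np.sign(true_effect)).mean()   # wrong sign, given significance
type_m = np.abs(est[signif]).mean() / abs(true_effect)           # average exaggeration factor

print(f"power {power:.2f}, Type S {type_s:.2f}, exaggeration {type_m:.1f}x")
# With a small true effect and a large standard error, the occasional
# "significant" estimate is greatly exaggerated and its sign is unreliable.
```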

And finally, let me emphasize that I’m not saying I think the claims of Meltzer et al. are false, I just think they’re not providing strong empirical evidence for their theory. Remember 50 shades of gray? That can happen to you too.

30 thoughts on “Even though it’s published in a top psychology journal, she still doesn’t believe it”

  1. Excellent points as usual. With respect to this – “happen to have the bad luck to work in a field with low research standards” – I prefer to think of it as a distribution of research standards that unfortunately does span from occasionally very low to sometimes quite high ;)

  2. Back in the 90s I found enormous pushback against the notion that evolution by natural selection could be relevant for understanding human behavior. The level of bile coming from those within social sciences who thought that cultural/political phenomena were primary determinants of human action was astonishing. In response, many evolutionary psychologists/anthropologists dedicated their efforts to demonstrating that evolution was relevant. That was always paramount, not the specifics of each analysis. Each “success”, forking paths notwithstanding, was seen as evidence supporting an evolutionary perspective.

    I don’t work in that field anymore.

  3. My colleague JP de Ruiter has a throwaway line to the effect that “Social psychology basically tries to show that people are either assholes or zombies”. In this case, these women are presumably supposed to be driven by some unconscious desire to make themselves lose weight in order to make themselves more attractive so they get pregnant, or something like that, even though most of them have no intention of getting pregnant in the next month; hence, “zombies”. (The “assholes” part refers to those studies that show that, given half a chance, people in labs behave in ways that are *exactly* like evil, grasping, monopoly capitalists.)

    Apart from the fact that “losing weight => more attractive” is (as far as I can tell) a recent, Western, socially-constructed thing (i.e., not something one would expect to find being driven by evolution; historically, being thin meant you were less likely to survive the winter), the bigger problem I have with this kind of study is that if the effects were real, we would have anecdotal evidence of it by now. Shakespeare or Plato would have hinted at it. While anecdotes are not evidence, I sometimes wonder if we ought not to insist on (blinded) anecdotal corroboration of all of these thousands of effects that we are “discovering”, which people had no clue about.

    Incidentally, the abstract page for the article (http://psp.sagepub.com/content/early/2015/05/13/0146167215585726.abstract) states that “A more recent version of this article was published on [06-08-2015]”, but the link under “more recent” leads to a PDF file that says “revision accepted April 13, 2015” and doesn’t mention any revisions or corrigenda. I’m not sure exactly what this means. Maybe the version that’s online for the July 2015 edition of the journal is the definitive one after all.

    • Well, there’s certainly anecdotal evidence to the effect that women are more horny near maximum conception risk, so it seems reasonable to hypothesize that women will try to make themselves more attractive as short-term partners during that period. The problem I have with the specific idea of weight-loss is that weight-loss is a long-term strategy, so it doesn’t make much sense to start dieting three days before maximum conception risk.

      As for the “more recent version”, you’re probably overinterpreting this. I have also published an article with Sage, and there is also a more recent version of that article available. What happened is that they published the article before I had given them my o.k. on the proofs, I complained, and my additional corrections were then included in the “more recent” version. This was only minor stuff like wording, capitalization and so forth. (Based on my sample of 1, Sage editors appear to be idiots.) Why the journal won’t just remove the older version, I don’t know.

    • Nick:

      The “assholes or zombies” quip is well-phrased, and it relates to how, as a political scientist, I dislike a lot of this work because I think it has an implicit political message not to trust people’s opinions or decisions.

  4. Andrew, you write, “… should we care about forking paths and statistical significance and all that? I’d prefer not to. I’d prefer to see an analysis of all the data at once, using Bayesian methods to handle the multiple levels of variation.” If I read that too literally, I hear encouragement to craft one MLM with all of the possible predictors and then see what falls out. It is hard to have a forking path if the path length is exactly 1 (model).

    In http://statmodeling.stat.columbia.edu/2015/01/29/six-quick-tips-improve-regression-modeling/, you write, “Fit many models. Think of a series of models, starting with the too-simple and continuing through to the hopelessly messy. Generally it’s a good idea to start simple.” That fits my intuitive approach and what I read as Tukey’s EDA mentality, but it puts more emphasis on model checking to deal with the forking.

    Can you clarify how you see those fitting together in practice?

  5. Andrew, you wrote:

    “I’d prefer to see an analysis of all the data at once, using Bayesian methods to handle the multiple levels of variation. But if the claims are going to be based on p-values, then forking paths etc are a concern.”

    I am assuming you mean that if Bayesian methods are used, with all data used at once, forking paths is not a concern. Why not?

    • The sad part is I never remember seeing anyone do the recommended bit about “an analysis of all the data at once, using Bayesian methods to handle the multiple levels of variation” for any of the papers criticized, say, on Andrew’s blog using the forking paths critique.

      I for sure would love to see an alternative analysis of any of these allegedly bad papers.

      • This had bothered me for a while, but this is how I understand it now:

        1) Garden of forking paths is a problem mainly for inferences based on p-values, because p-values are probability statements that make strong assumptions about what the data would look like under repetition. If we ‘fork’ based on the data, those reference distributions no longer mean what they should, at least for the problem at hand.

        2) I think Prof. Gelman isn’t saying that Bayesian methods are overfitting-proof, just that the inferences do not depend on (simplistic) assumptions about what the data would look like under different replications. Or, when we do care about replications, we use a full model to generate new data points, as in predictive model checking. We can also ‘hack’ a Bayesian analysis to produce what we want, but the hacking part will be much more evident if we work with a full probability model.

        3) The final model used to draw inferences is, in many aspects, a super-model that includes the many models fitted during the initial data analysis. This way, Bayes rule is able to propagate uncertainty between parameters, giving a more realistic picture of uncertainty in estimates.

        For an example, I think Prof. Gelman’s ‘chicken brains experiment’ (which appears in BDA, Data Analysis Using Regression, and many articles) gives a gist of how to do it.

      • In my opinion most of these papers are pretty much DOA. Essentially for the reasons Feynman gives here: https://www.youtube.com/watch?v=EYPapE-3FRw . Studying humans without assessing how they change over time (trends, cycles, variance) will not result in enough information to guess at what is going on.

        Now, they are going to have trouble coming up with any quantitative theories if barely anyone in the field is taught to use tools like calculus/programming. Even those that do try will be largely ignored by the vast majority who can only “nod yes” without understanding what is being said and will not pass that info along to their students/peers. Instead, NHST takes over. This practice supplies a false sense of rigor to the testing of vague theories. These are difficult, if not impossible, to disprove and can only ever be corroborated in an extremely weak sense.

        tl;dr: There probably is nothing useful to be done beyond reporting the data and methods so that they can be used to sanity check later, more comprehensive, studies.

        • Very true. My point is that these papers are DOA no matter what. NHST vs. Bayesian and the whole garden-of-forking-paths business is just a red herring in the case of these sorts of studies.

          The core premise of these studies itself is deeply flawed.

        • >”The core premise of these studies itself is deeply flawed.”

          Yes, they are designed as if a significant p-value can provide an answer to the questions being asked, they are designed for NHST. Naive grad students accept projects that are incapable of achieving the stated goal and then must “play the game” if they wish to continue down that career path. Then they have incentive not to admit anything was wrong. It is like a gang initiation rite. This continues to happen, with no notice of possible controversy, even though the problem was realized long ago:

          “After reading Meehl [1967] and Lykken [1968] one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and thereby a semblance of ‘scientific progress’ where, in fact, there is nothing but an increase in pseudo-intellectual garbage. Meehl writes that ‘in the physical sciences, the usual result of an improvement in experimental design, instrumentation, or numerical mass of data, is to increase the difficulty of the “observational hurdle” which the physical theory of interest must successfully surmount; whereas, in psychology and some of the allied behaviour sciences, the usual effect of such improvement in experimental precision is to provide an easier hurdle for the theory to surmount’. Or, as Lykken put it: ‘Statistical significance [in psychology] is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.’ It seems to me that most theorizing condemned by Meehl and Lykken may be ad hoc. Thus the methodology of research programmes might help us in devising laws for stemming this intellectual pollution which may destroy our cultural environment even earlier than industrial and traffic pollution destroys our physical environment.”
          Lakatos, I. (1978a). Falsification and the methodology of scientific research programmes. In J. Worrall & G. Currie (Eds.), The methodology of scientific research programmes: Imre Lakatos philosophical papers (Vol. 1, pp. 8-101). Cambridge, England: Cambridge University Press.
          http://strangebeautiful.com/other-texts/lakatos-meth-sci-research-phil-papers-1.pdf

          My hope is the internet will allow those interested in a science career to at least be aware of this issue *before* embarking on such a path.

      • I have done this, but not for papers Andrew has brought up. In most cases, the effect is no longer significant. It’s really very amazing how fragile the p<0.05 is. In one experiment I analyzed, the significant effect went away when I put back the data for an item that had been removed (for reasons unknown) in the published analysis. I wonder how many studies would actually pan out if all the data and the code that led to the paper were released and peer-reviewed along with the paper. I will try this the next time I am asked to review: take a look at the data and analyses myself.

  6. The authors (perhaps not necessarily these authors, but some generic ones) assert that they’ve followed the proper procedures of science, and are protected from the dangers of forking paths, small sample sizes, etc. Andrew (perhaps not really Andrew, but an amalgamation of views put forth in posts and comments) asserts that looking for small effects amid noise is really, really dangerous, and that the methods blindly followed to deal with them are easily susceptible to intentional or unintentional manipulation. What can we do to bring these views together?

    I argued earlier that, math aside, it’s useful to develop an intuition for what noise “looks like,” and what comes out of statistical procedures when they’re fed small, noisy inputs. I illustrate this here, in an exercise in which my six-year-old and I looked for significance in the relationship between the length of twigs and their orientation in our backyard. (It was fun — I recommend trying it!) There are other, better, examples of this, e.g. the wonderful article on becoming younger by listening to the Beatles [link]. Of course, most people are not going to go out and examine twigs (though they should). What, then, can be done?

    Suppose we (a vague “we”) collected all sorts of absurd experimental datasets, and analyzed them according to conventionally-used statistical procedures, reaching suitably nonsensical conclusions. These could be collected in a Journal of Forking Paths, and one could point to its articles in arguments. “If you believed [this study], then you should believe [that one], too.” One could direct journalists to it, as well! This could turn the discussion of flawed papers away from more abstract statements of “these methods are shaky” to something more concrete.

  7. Great recommendations under ‘What should these researchers do with this project’ (no sarcasm).

    Honestly doubt it would get accepted in a standard ‘decent’ journal following that presentation though (assume you also believe this?). Best hope for that sort of format is probably as a preprint/working paper/no-page-limit open-access journal (even blog post).

  8. Perhaps we are teaching statistics backwards. Instead of teaching students to try and come up with the correct result, we could teach what it feels like to rationalize one’s way through to non-objectivity.

    A final exam question might go: This dataset consists of 5 completely uncorrelated variables — I’ve labeled the columns as ‘weight of cat’, ‘probability of attrition’, ‘color of cat [in RGB]’, ‘current age of subject’ and ‘SAT verbal score’. Find a way to make 3 statistically significant correlations and one non-significant correlation. You get an extra point for each spurious t-test you can come up with. The catch is that your entire analysis has to form part of a coherent story. Bonus points go to the 5 most concise answers.
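    A rough sketch of what one answer could look like, offered only as an illustration: the column names are the joke names above, the data are pure noise, and the “analysis” just hunts through arbitrary subgroups until something crosses p < .05.

    ```python
    # Five genuinely unrelated variables, plus a brute-force hunt for "findings".
    import numpy as np
    from itertools import combinations
    from scipy.stats import pearsonr

    rng = np.random.default_rng(42)
    n = 100
    cols = ["cat_weight", "attrition_prob", "cat_color", "subject_age", "sat_verbal"]
    data = {c: rng.normal(size=n) for c in cols}       # pure noise, no real correlations

    hits = []
    for a, b in combinations(cols, 2):
        for c in cols:                                 # fork: condition on a third variable
            if c in (a, b):
                continue
            keep = data[c] > np.median(data[c])        # arbitrary "high c" subgroup
            r, p = pearsonr(data[a][keep], data[b][keep])
            if p < 0.05:
                hits.append((a, b, f"among high {c}", round(r, 2), round(p, 3)))

    print(hits)   # forks like these turn pure noise into nominally significant correlations
    ```

    Each “hit” comes with a ready-made subgroup story, which is exactly the coherent-narrative part of the exercise.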

    • One of my favorite homework assignments I gave last year had 3 parts:

      1 – Theory: they had to discuss the ins/outs/predictions of a theoretical model with particular parameter values as it related to a particular reform.

      2 – Comparisons: then they compare data from two groups and describe only what they were seeing, as in “Test scores in the treatment group were 0.1 standard deviations higher than those in the control group” or “there was a 12%, or 1.3 percentage point, difference in labor force participation between groups”.

      3 – Speculate Wildly: This was the part where the students came up with explanations for why the comparisons were different. This had to relate back to the theoretical model, but then they also had to give alternative explanations that came from adapting the parameters in the original model. Basically the idea was to make them realize that if they picked multiple sets of reasonable-sounding parameter values or theoretical constraints, or focused on one or another set of comparisons, they could interpret the result in a number of different ways.

      But mostly I just wanted them to associate the phrase “Speculate Wildly” with the act of interpreting regression coefficients in the context of a theoretical model that has more than 2 moving parts. Because I’m learning Psychology reading this blog.

      • jrc: Neat.

        If you are up for it, there is a graphical method that was used fairly widely (https://en.wikipedia.org/wiki/Correspondence_analysis), and the SAS implementation was in error for a number of years; the error caused artifacts to be added to the plots. (I noticed the problem and, with Michael Greenacre, reported the error, which was fixed very quickly.)

        But there are all these real examples of researchers interpreting what is known to be artifacts in published papers.

        Wonder what percentage have been retracted?

        • I was just about to write a snarky response about my need to spend more of my research time pointing out problems in the literature and the exceptionally positive feedback I have gotten from editors and referees on that work… but then I googled a bit and now I’m curious.

          I’m less interested in how many papers have been retracted than I am in thinking about a way to quantify the effect of coding errors on the scientific discourse. How many citations have clearly erroneous papers had? How much grant money did they generate for follow-up research? Ooohh – maybe the change in their citation rate before/after the bug was announced and fixed? That might say something about information dissemination or the ethics of how and when we cite.

          I like this as a research setup both statistically and professionally because it is not any of the authors’ fault – I doubt anyone who used that package could’ve written the math code themselves, so making the mistake is totally uncorrelated with researcher quality (given the set of researchers who would use SAS to do Correspondence Analysis – no judging, it wouldn’t even cross my mind to code the math directly, just clarifying the population of interest). That means that, other than using this program to do this technique, these papers are just exactly like the other papers in their field. Plus you don’t have to focus on any particular paper or author. Plus plus you get examples from a bunch of different fields.

          Cool natural experiment. Someone should suggest it to a hard money tenured professor. I hear they have some time to think about stuff.

          But probably you should just not tell me any more about this. No, OK, tell me a little bit more – do you have a list of these papers?

  9. From the 538 link:
    “After all, what scientists really want to know is whether their hypothesis is true, and if so, how strong the finding is. “A p-value does not give you that — it can never give you that,” said Regina Nuzzo, a statistician and journalist in Washington, D.C., who wrote about the p-value problem in Nature last year. Instead, you can think of the p-value as an index of surprise. How surprising would these results be if you assumed your hypothesis was false?”

    I love these journalist attempts at explaining p-values. I really mean no offense to the authors. I remain mystified as to the point of these calculations, found in nearly every paper, so I attempted to understand this explanation.

    I imagine a world where my hypothesis is false, then do some math about a different (null) hypothesis to get a p-value. Next, for some reason, I am to ask myself how surprised I feel to see these results, while continuing to imagine my hypothesis to be false.

    We wish to be objective on this important matter. Luckily there is a way to do this, at least in part. Surprise at the results is unfortunately not quantified, but is at least indexed, by the p-value we have calculated. We then either look up our p-value in surprise tables found at the back of textbooks, or use statistical software which denotes the amount of surprise by placing 0-3 stars next to the p-value. (NB Do not miss the amazing power of this method. The number of stars was arrived at using calculations that contain zero information about our hypothesis.)

    In the end, all we know is how many stars’ worth of surprise we would feel at our results in a world where our hypothesis is false. Even accepting that the above procedure can provide this information, I still don’t see the point.
