Replication crisis crisis: Why I continue in my “pessimistic conclusions about reproducibility”

A couple days ago we again discussed the replication crisis in psychology—the problem that all sorts of ridiculous studies on topics such as political moderation and shades of gray, or power pose, or fat arms and political attitudes, or ovulation and vote preference, or ovulation and clothing, or beauty and sex ratios, or elderly-related words and walking speed, or subliminal smiley faces and attitudes toward immigration, or ESP in college students, or baseball players with K in their names being more likely to strike out, or brain scans and political orientation, or the Bible Code, are getting published in top journals and getting lots of publicity. Indeed, respected organizations such as the Association for Psychological Science and the British Psychological Society have promoted what I (and many others) would consider junk science.

I should emphasize that, if all that was wrong with these studies was that they were ridiculous, you could say that ridiculous is in the eye of the beholder and that sometimes ridiculous-seeming claims turn out to be true. But it’s not just that. The real problem is that the evidence people take as strong support of these theories—the evidence that their supporters take as so strong that they still hold on to these theories even after attempted replications fail and fail and fail—is that there are some comparisons with p-values less than .05. What is not well understood is that, in the presence of what Simmons, Nelson, and Simonsohn have called “researcher degrees of freedom” and what Loken and I call “the garden of forking paths,” such “statistically significant” p-values provide essentially zero information.
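To see how little information a lone "p less than .05" carries in this setting, here is a minimal simulation sketch. It uses explicit multiple comparisons as a crude stand-in for the forking paths (in the forking-paths story only one analysis is run, but which one depends on the data); the sample size and the number of candidate comparisons are arbitrary assumptions, not numbers from any of the studies above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_noise_study(n=50, n_comparisons=10):
    """A study with no true effect anywhere, where the analyst can pick
    among several outcome/subgroup comparisons after seeing the data."""
    pvals = []
    for _ in range(n_comparisons):
        treatment = rng.normal(size=n)  # pure noise
        control = rng.normal(size=n)    # pure noise
        pvals.append(stats.ttest_ind(treatment, control).pvalue)
    return min(pvals)  # report the "best" comparison

n_sims = 2000
hits = sum(one_noise_study() < 0.05 for _ in range(n_sims))
print(f"Share of all-noise studies with a 'significant' finding: {hits / n_sims:.2f}")
# With 10 candidate comparisons this is roughly 1 - 0.95**10, about 0.40,
# even though every null hypothesis here is true.
```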

In addition to all of this, a group of researchers coordinated replications of a bunch of experiments published in psychology journals and reported their results in a paper by Nosek et al. that appeared last year, along with headlines which (correctly, in my opinion) declared a replication crisis in science.

My recent post on this topic was triggered by two recent papers, one by Gilbert et al. disputing the claims made from the replication studies, and a response by Nosek et al. defending their work. As I wrote, I pretty much agree with Nosek et al., and I pointed the reader to this more thorough discussion from Sanjay Srivastava.

What’s new today?

In the discussion of my post from the other day, a commenter pointed to a further reply by Gilbert et al. who continue to hold that the replication project “provides no grounds for drawing ‘pessimistic conclusions about reproducibility’ in psychological science.” I replied in comments, explaining why I think they still miss the point:

Gilbert et al. write, “Why does the fidelity of a replication study matter? Even a layperson understands that the less fidelity a replication study has, the less it can teach us about the original study.” Again, they’re placing the original study in a privileged position. There’s nothing special about the original study, relative to the replication. The original study came first, that’s all. What we should really care about is what is happening in the general population.

I think the time-reversal heuristic is a helpful tool here. The time order of the studies should not matter. Imagine the replication came first. Then what we have is a preregistered study that finds no evidence of any effect, followed by a non-preregistered study on a similar topic (as Nosek et al. correctly point out, there is no such thing as an exact replication anyway, as the populations and scenarios will always differ) that obtains p less than .05 in a garden-of-forking-paths setting. No, I don’t find this uncontrolled study to provide much evidence.

Gilbert et al. are starting with the published papers—despite everything we know about their problems—and treating the reported statistical significance in those papers as solid evidence. That’s their problem. The beauty of having replications is that we can apply the time-reversal heuristic.

On the specific point of representativeness, I do agree with you (and Gilbert et al.) that the results of the Nosek et al. study cannot be taken to represent a general rate of reproducibility in psychological science. As Gilbert et al. correctly point out, to estimate such a rate, one would first want to define a population that represents “psychological science” and then try to study a representative sample of such studies. Sampling might not be so hard, but nobody has even really defined a population here, so, yes, it’s not clear what population is being represented.

Regarding the issue of representativeness, I disagree with Gilbert et al.’s statement that “it is difficult to see what value their [Nosek et al.’s] findings have.” Revealing replication problems in a bunch of studies does seem to me to be valuable, especially given the attitudes of people like Cuddy, Bargh, etc., who refuse to give an inch when their own studies are not replicated.

Gilbert et al. are coming up with lots of arguments, and that’s fine—as Uri Simonsohn says in his thoughtful commentary, the replication project is important and it’s good to have open discussion and criticism. So, while I think Gilbert et al. are missing the big picture and I think it’s too bad they made some mistakes in their article, I think it’s good that they’re continuing to focus attention on the replication crisis.

But right now I see two problems in psychological science:

1. Lots of bad stuff is being published in top journals and being promoted in the media.

2. Even after all this, Gilbert et al., Bargh, Cuddy, etc., don’t want to face the problems in the scientific publication process.

1 and 2 work together. Part of my “pessimistic conclusions about reproducibility” comes from the fact that, when problems are revealed, it’s a rare researcher who will consider that their original published claim may be mistaken.

If the Barghs, Cuddys, etc. would recognize their problems with their past work, then item 1 above would not be such a problem. Sure, lots of low-quality research might still be published (I mean “low quality” not just retrospectively in the sense of not getting replicated, but prospectively in the sense of being too noisy to have a chance of revealing anything useful as John Carlin and I discussed in our recent paper), but there’d be churn: Researchers would openly publish work as exploratory speculation—no more null-hypothesis-significance-testing and taking statistical significance to represent truth—and they’d recognize when their research had reached a dead end.
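Here is a rough simulation sketch of what "too noisy to reveal anything useful" means in the type M (magnitude) and type S (sign) error framing; the true effect and standard error below are invented for illustration, not values from the Gelman and Carlin paper.

```python
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.1   # small true effect (assumed)
se = 0.5            # large standard error, i.e., a very noisy study (assumed)
n_sims = 100_000

estimates = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(estimates / se) > 1.96
sig_est = estimates[significant]

print(f"Power (share of studies reaching p < .05):   {significant.mean():.2f}")
print(f"Type S rate (wrong sign among significant):  {(sig_est < 0).mean():.2f}")
print(f"Type M factor (|estimate| / true effect):    {np.abs(sig_est).mean() / true_effect:.0f}x")
# When the noise swamps the effect, the occasional "significant" estimate
# is a large overestimate, and a nontrivial share point in the wrong direction.
```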

But, as long as these researchers will not admit their mistakes, as long as they continue to hold on to every “p less than .05” claim they’ve ever published—and as long as they’re encouraged in that view by Gilbert et al.’s move-along-no-problem-here attitude—then, yes, we have a serious problem.

Daryl Bem, the himmicanes guys, Marc Hauser, Kanazawa, . . . none of these guys ever, as far as I know, acknowledged that they might be wrong. It’s taken a lot of effort to explain to people why “statistical significance” doesn’t mean what they think it means, and why publication in a top journal is not a badge of quality. The reproducibility project of Nosek et al. provides another angle on this serious problem. As Gilbert et al. correctly point out, that project is itself imperfect: like any empirical study, it is only as good as its data, and it makes sense for interested parties to examine its replications one by one. And, as Gilbert et al. also correctly point out, the studies in this project are not a representative sample of any clearly defined population, and so we should be careful in our interpretation of any replication percentages.

What continues to concern me is the toxic combination of items 1 and 2 above. You guys who feel that Nosek et al. are giving science a bad name: You should be more bothered than anybody about the behavior of researchers such as Bem, Bargh, etc., who refuse to let go of their published but fatally flawed and unreplicable results.

P.S. Someone pointed me to a new note by Gilbert et al. which takes no position on whether “some of the surprising results in psychology are theoretical nonsense, knife-edged, p-hacked, ungeneralizable, subject to publication bias, and otherwise unlikely to be replicable or true.” So I don’t know: do they really believe that women are three times more likely to wear red during days 6-14 of their cycle? I wonder if they really believe that elderly-related words make people walk more slowly? Or that Cornell students have ESP? Or that obesity is contagious? Etc. I’m sure there will always be people who will believe some or even all these things—they all got published in top peer-reviewed journals, after all, and some of them even appeared in the New York Times and in TED talks! And that’s fine, there should be a diversity of beliefs. I hope there are also people out there believing the opposite statements, which I think are also just about as well supported by the data: that women are less likely to wear red during days 6-14 of their cycle, that elderly-related words make people walk faster, that Cornell students have an anti-ESP which makes them consistently give bad forecasts (hey—that would explain those hot-hand findings!), that obesity is anti-contagious and when one of your friends gets fat, you go on a diet. All these things are possible. For now, I’d just like it if people would stop saying or acting as if the statistical evidence for these claims is “overwhelming” or that “you have no choice but to accept that the major conclusions of these studies are true.” By now it should be clear on statistical grounds alone that the evidence in favor of these various claims is much weaker than has been claimed, but I think the replication project of Nosek et al. has been valuable in showing this in another way.

P.P.S. Also this update from Nosek et al., going into some of the details on one of the replications that had been criticized by Gilbert et al. Very helpful to see this example.

64 thoughts on “Replication crisis crisis: Why I continue in my ‘pessimistic conclusions about reproducibility’”

    • I have asked Andrew to consider a blog post about this study – which I have not read yet. I think there are several things worth exploring about these replications in economics versus those in psychology (or other fields). It is important to realize that these economic studies were in experimental economics. So, the high rate of replicability (I believe) is not an indication of what should be expected in other fields of economics (especially where observational studies are more typical). Indeed, replication of those studies is almost impossible, given the lack of open access to data and methods that is common in those studies.

      It is probably worth exploring whether economic experiments have different characteristics from psychological experiments. I am more familiar with the former than the latter. In general, economists are quite careful about experimental design – random assignment and careful controls of confounding variables – possibly partially due to the fact that economic experiments have a relatively short history compared with some other disciplines. However, it is also possible that economists conduct experiments in areas where careful measurement is more easily accomplished. For example, I think it is easier to conduct a well-designed study about whether increased payoffs in an ultimatum game cause people to become more “rational” than it is to study whether “power poses” have large impacts. If it is true that economists study things where good experimental design is easier to accomplish, it is still possible that economists focus on issues that are less worth studying (I have some doubts about whether these studies have valuable implications for realistic settings – but I think much of the psychology literature also has issues regarding whether the experimental settings yield valid insights into the real situations they are supposed to be exploring).

      In any case, I’d like to hear what Andrew and others think about the possible disciplinary differences that might account for the different findings in this economics replication analysis versus that in psychology. Or is it just the n=18 in the economics replication that means they just studied noise?

    • Phil:

      The same journalist who sent me the psychology replication papers also sent me the econ replication paper. It too was embargoed so I didn’t post anything on it at the time, then I got swept up in all this other stuff.

      I would not be surprised if experimental econ has a higher rate of replication than social psychology. I don’t know enough about econ to make the comparison with any confidence, but as I said in my post the other day, I feel that many social psychologists are destroying their chances by purposely creating interventions that are minor and at times literally imperceptible. Economists perhaps are more willing to do real interventions.

      Another thing is that economists, compared to psychologists, seem more attuned to the challenges of generalizing from survey or lab to real-world behavior. Indeed, many times economists have challenged well-publicized findings in social psychology by arguing that people won’t behave these ways in the real world with real money at stake. So, just to start with, economists unlike psychologists seem aware of the generalization problem.

      • A quick glance suggests the rate for experimental economics is not dissimilar to that for cognitive psychology – which makes a certain sense given the overlap in the fields (I know several people who publish across both fields). It also makes sense that social psychology might have lower replication rates either because there are more hidden sources of variability in many such studies or because there is an indication that surprise value plays a bigger role in publication in some top journals.

      • Experimental economics is newer than experimental psychology, so it’s currently likely easier to come up with a finding that’s both new and true in economics.

        Diminishing marginal returns is a pretty standard feature of the world.

      • Andrew, Thom, Steve:

        I ran the analysis on the psych data set again with a matched set of studies (18 studies identical in the distribution of original effect sizes, testing only main effects) and the result showed that the reproducibility in psychology would be 10 out of 18 (55%) using a similar sampling approach and the same criteria as Camerer et al. did (“eighteen (18) prominently published studies[…] using between-subjects designs”).

        So I am not convinced by Camerer’s theory so far. (“One theory put forward by Dr Camerer and his colleagues to explain this superior hit rate is that economics may still benefit from the zeal of the newly converted. They point out that, when the field was in its infancy, experimental economists were keen that others should adopt their methods. To that end, they persuaded economics journals to devote far more space to printing information about methods, including explicit instructions and raw data sets, than sciences journals normally would. This, the researchers reckon, may have helped establish a culture of unusual rigor and openness”)
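        (As a side note on the n=18 issue raised above: a simple binomial interval shows how noisy a replication rate estimated from 18 studies is. This is just a back-of-the-envelope sketch using the 10-out-of-18 figure quoted above; the interval method is the standard Clopper-Pearson one.)

        ```python
        from scipy import stats

        # Replication rate estimated from a small batch of studies:
        # 10 "successes" out of 18, the matched-psychology figure quoted above.
        successes, n = 10, 18

        # Clopper-Pearson (exact binomial) 95% interval.
        lo = stats.beta.ppf(0.025, successes, n - successes + 1)
        hi = stats.beta.ppf(0.975, successes + 1, n - successes)
        print(f"Estimate: {successes / n:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
        # Roughly (0.3, 0.8): with n = 18 the data are consistent with replication
        # rates from about a third to about four fifths, so field-to-field
        # differences of the size being discussed could easily be noise.
        ```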

  1. I wonder what evidence, if any, would lead these authors to change their minds about the supposed effects that they have published? I suspect that some of the authors (e.g., the himmicanes researchers) might be more open to disconfirming evidence because it is not as if they are hitting the TED talk and lecture circuit too hard with their findings, but the priming and power posing research people would probably be far more resistant to changing their minds.

  2. “You guys who feel that Nosek et al. are giving science a bad name: You should be more bothered than anybody about the behavior of researchers such as Bem, Bargh, etc., who refuse to let go of their published but fatally flawed and unreplicable results.”

    I’m actually not too bothered that a few individuals refuse to let go of their results. I’m much more concerned that the legions of current and future researchers who want to do good science have reliable guidance on the proper use of stats. For this, many of us rely on textbooks, not blog posts or recent papers in stats journals. Gelman and Hill is a pretty recent textbook that fails to discuss most of the issues you raise here. I hope you will devote as much effort (more, actually) to revising Gelman and Hill, or writing a new textbook, as to trying to convince a few individuals that they’re wrong. Yes, the blog posts and papers of Ioannidis, Simmons, Nelson, Simonsohn, you, and many others, have done a lot to turn the tide. But the sooner all this becomes textbook material, the better.

    My two cents.

    • Ed:

      Yes, Jennifer and I have been talking a lot about this. I became aware of this replicability problem only after we wrote that book, and in the second edition we definitely want to address these issues.

  3. Sadly, the problem is really that we’re in a competitive game – which is a far cry from science. Rather than seeking truth, or some notion of it, we seem to be angling for position, power, reputation, and marketing. It would seem to me that a sincere scientist would embrace research invalidating work – even their own. If William Farr had stopped after his first (mistaken) assumptions that elevation and air quality caused cholera – well – perhaps millions would have died unnecessarily. He was open to investigating the possibility that he was mistaken. Due to reputation, top journals have so much influence it seems early work is carved in stone. Note – it was not always this way – and we academics have created this environment. Theoretically, we could change it – but that would upset the current power base. Given that we created it, only we can fix it.

  4. It’s interesting how this is turning into a political problem.

    While it is very likely true that there is a replication issue, Nosek’s paper in fact detracts from showing that.
    Had the study been better designed, it would have been a step towards exposing a replication problem.

    I am not sure of the political stances of Gilbert et al. but discrediting the validity of their criticism of Nosek’s paper ends up being a political response.

    It is important to not be fooled into thinking that Gilbert et al.’s paper is trying to further the “move-along-no-problem-here” attitude.
    If anything, it emphasizes the importance of experimental rigor and not drawing conclusions (true or not) which result from bad experimental procedure.

    • Bhav:

      No, the Nosek et al. paper is very careful. It’s just that they’re doing a huge project, coordinating lots of experiments done by many different experimenters. Lots of moving parts. One can also look at particular examples such as power pose and embodied cognition and their failed replications. Also the statistical theory and practice that helps us understand how all those p-values below .05 can easily be obtained via the garden of forking paths.

  5. In my own career, I have found only hostile reactions to my attempts to replicate other work – particularly when finding fault in the results. I have had editors write to me that replication was of no interest to them, even though the work was demonstrating that their previously published work was seriously faulty. Business scholars seem less interested – only now they are beginning to recognize that, perhaps, we should be doing some replication. It becomes political because when you embark on a project that examines a well cited work, there are considerable political barriers to overcome, with editors, reviewers, and sensibilities about ‘reputation’. Top journals have their own reputations to consider (in our field, they are not so keen to recognize or retract faulty work), editors, scholars, everyone competes in the system.

    So, back to Gilbert et al. One strategy is to argue about minute details that detract from the general findings; sometimes the details are not even critical. However, one can say ‘you see, the replication wasn’t done perfectly’. Note, no study is perfect, so why should the replication be held to a standard higher than the original publication? Andrew’s point – that there continues to be a serious problem with the publication process – is entirely valid and cannot be reduced by some technical specifications.

    • You are right that the field of business is even worse at admitting its errors than other social sciences. For example, the Journal of Management and the journal Personnel Psychology (both top tier journals in the field) both have an incredibly high proportion of papers that report findings that are demonstrably impossible – at least judging by what is posted on pubpeer. Many of the errors are pretty self-evident but nothing ever really gets corrected.

  6. Just a small point about a statement made in the recent NYT article by the authors of the original replication study. Here’s the statement: “Brian A. Nosek, a colleague of Dr. Wilson’s at Virginia who coordinated the original replication project, which took several years, countered that the critique was highly biased: ‘They are making assumptions based on selectively interpreting data and ignoring data that’s antagonistic to their point of view.'”

    Nosek should look up “antagonistic” in a dictionary. It implies bias, a negative attitude toward something. Good data doesn’t have an attitude, or express an attitude.

  7. I have read this blog for a long time. I get the impression that what bothers people is that a lot of bad research is promoted in the media, point 1 above. I suspect a lot of “bad” research has been published for years but there seems to be a promotional aspect that has been increasing. I was taught, over 40 years ago, that one study really doesn’t prove anything in the social sciences. Thus, the need for replication and extension. Am I missing something?

    • Yes, you are missing the idea that, given the current publication process, the replications and extensions that are likely to get published have a higher probability than is typically acknowledged of not being representative of the real effects of the relationships under study, for the reasons often discussed on this blog.

      • Curious,

        I don’t follow what you are saying. I would say that, given current publication practices,
        1) Published studies have a high probability of not showing the real effects of the relationships under study
        2) Replications are rarely published;
        and that these occur for the reasons often discussed in this blog.

        • I thought this paragraph in the Sanjay Srivastava piece

          “The RPP replication rate was 47%. The high-powered (N>6000) Many Labs pooled-sample replication rate was 85%.”

          from https://hardsci.wordpress.com/2016/03/03/evaluating-a-new-critique-of-the-reproducibility-project/

          was scarily similar to this from Tversky and Kahneman’s “Belief in the Law of Small Numbers”:

          “Suppose you have run an experiment on 20 subjects, and have obtained a significant result which confirms your theory (z = 2.23, p < .05, two-tailed). You now have cause to run an additional group of 10 subjects. What do you think the probability is that the results will be significant, by a one-tailed test, separately for this group?
          If you feel that the probability is somewhere around .85, you may be pleased to know that you belong to a majority group. Indeed, that was the median answer of two small groups who were kind enough to respond to a questionnaire distributed at meetings of the Mathematical Psychology Group and of the American Psychological Association.

          On the other hand, if you feel that the probability is around .48, you belong to a minority. Only 9 of our 84 respondents gave answers between .40 and .60. However, .48 happens to be a much more reasonable estimate than .85.”
          http://pirate.shu.edu/~hovancjo/exp_read/tversky.htm
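          (Incidentally, that .48 can be roughly reproduced with a short calculation: take the observed effect at face value, z = 2.23 from n = 20, and ask how often a one-tailed test on 10 new subjects would clear 1.645. A sketch of the arithmetic:)

          ```python
          import math
          from scipy import stats

          z_obs, n_orig, n_rep = 2.23, 20, 10

          # Per-subject standardized effect implied by the original result,
          # taking the observed effect at face value.
          d = z_obs / math.sqrt(n_orig)

          # Expected z in the replication with n = 10.
          z_rep_expected = d * math.sqrt(n_rep)  # = 2.23 / sqrt(2), about 1.58

          # Probability the replication exceeds the one-tailed 5% cutoff of 1.645.
          print(round(1 - stats.norm.cdf(1.645 - z_rep_expected), 2))  # ~0.47, close to the .48 in the quote
          ```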

      • Martha:

        The argument I am making is:

        If there is a bias such that the original studies that get published tend not to show the real effects, and the studies typically published are those with p < .05, then the replications likely to get published would also have p < .05, though they would also likely have substantially different effect sizes simply due to chance. This is seen in organizational psychology, for example, even though the response tendencies present there are likely to increase the correlation rather than decrease it, as are all of the statistical corrections implemented.

        Paul Barrett uses the Big Five personality constructs to demonstrate this. He shows that the 90% credibility range across published meta-analytic studies for the correlation between Conscientiousness and Stability is (.27, .73). This interval represents 4 distinct meta-analytic studies, i.e., population-level estimates of the true parameter, not simply individual small-sample studies. http://www.pbarrett.net/stratpapers/metacorr.pdf

        What would explain this?

        A few possibilities come to mind:

        1. The relationship is actually dynamic rather than static as assumed and that each study actually represents a distinct sub-population within which the relationship is estimated accurately.
        2. The replications are similarly biased toward generating statistically significant but inaccurate estimates of the population parameters.
        3. Sample size and representativeness are far greater problems in personnel psychology than is typically acknowledged and for which statistical corrections are inadequate solutions to the problem of estimating population level parameters.
        4. Measurement methods are so filled with noise that any signal of a relationship between the two constructs is simply by chance.
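        (A small simulation sketch of possibility 2, with invented numbers: if only studies reaching p < .05 on a modest true correlation get published, the published estimates come out inflated and widely scattered, which by itself could contribute to a credibility range as wide as the one cited above.)

        ```python
        import numpy as np

        rng = np.random.default_rng(2)

        true_r, n, n_studies = 0.20, 60, 20_000   # invented values for illustration

        # Simulate sample correlations via the Fisher-z approximation.
        z_true = np.arctanh(true_r)
        se = 1 / np.sqrt(n - 3)
        z_samples = rng.normal(z_true, se, size=n_studies)
        r_samples = np.tanh(z_samples)

        # "Publish" only studies significant at p < .05 (two-sided): |z| / se > 1.96.
        published = r_samples[np.abs(z_samples) / se > 1.96]

        print(f"True correlation:           {true_r:.2f}")
        print(f"Mean published correlation: {published.mean():.2f}")
        print(f"Middle 90% of published r:  ({np.percentile(published, 5):.2f}, "
              f"{np.percentile(published, 95):.2f})")
        # Selection on significance alone inflates the typical published estimate
        # and spreads the published values over a wide range.
        ```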

        • “If there is a bias towards original publication of not showing real effects and studies typically published are those with p < .05, then the replications likely to get published would also have p < .05,"

          I'm not convinced about this statement; it might be true under some circumstances, but in the current climate of interest in reproducibility in psychology, my understanding is that there are increasing numbers of replications which are published without regard to statistical significance.

          I'm not familiar enough with the subject of your second paragraph to be able to comment intelligently on the rest of your response.

        • Martha:

          Perhaps the recent zeitgeist has changed replication publishing practices. However, prior to these changes the typical way in which a replication was published was as a replication and extension of a previous study. Given this practice, my argument holds for the vast majority of meta-analytic studies in the literature.

  8. “along with headlines which (correctly, in my opinion) declared a replication crisis in science.”

    Was this specific to psychology or to science in general?
    (Put another way, as in Ioannidis’s study of medical research, sometimes the specificity of results for a well-studied field gets lost and over-generalized to “science” in general. The problems may or may not be widespread, and may manifest differently in different fields, but the imprecision leads to things like this.)

    • John:

      I don’t know about coral reefs but there does seem to be a bias in science (and in publication more generally) toward sensationalism. The particular concerns that I’ve focused on come from null hypothesis significance testing, but there is a larger problem, especially with the so-called tabloid journals like Science, Nature, and PPNAS, which will pretty much only publish papers that are presented as exciting breakthroughs, and this creates all sorts of motivations to exaggerate. Going along with this are statistical methods that are focused on “power”—that is, the probability of getting “statistical significance”—so that in many fields the standard operating procedure of research seems to be to gather a bunch of data and shake until something with “p less than .05” comes out. I’m most aware of this problem in social psychology: As discussed in the above post or my other recent post on the topic, psychologists seem to work very hard to implement their treatments in homeopathic doses, which makes noise the dominant or only thing they could possibly study. But it could well be a problem in other fields; certainly, I’ve heard some statistical horror stories about medical research.

      • Speaking as someone working in what could be called “biomedical research”, mainly human microbiome studies, I agree with much of what you say. But I also think that many times the problem is not so much with the “statistical significance” part but with a) the unscrupulous presentation of results as if they were true, verified effects/differences, and/or b) the complete lack of understanding that there is a HUGE difference between an experimental study, an observational study, and even more so an observational study that is totally exploratory.

        Although I’m aware of the problems with p-values, and especially with NHST, these are actually quite useful in my field, for a few practical/pragmatic reasons:

        1) These studies are based on assessing differential abundance variation for hundreds or thousands of outcome variables, each one of them a bacterial “species”. And since at this stage of knowledge we cannot possibly know what effect sizes are biologically meaningful, just assessing effect sizes (estimated increases and decreases in abundance together with some way of measuring uncertainty) isn’t going to be useful when you need to do that for hundreds or thousands of individual taxa at a time. NHST here is still problematic for reasons that are not worth repeating, but nevertheless it’s a useful, pragmatic approach that helps filter an unmanageable amount of results into something that you can concentrate on. When you get those few “hits” you can then look at the mean abundances of the bugs, the estimated effect sizes in terms of abundance, etc., and start considering whether those can potentially make biological sense or are just essentially noise and untrustworthy.

        2) Also, it is important to keep in mind that these studies, which are not only observational but purely exploratory, provide results that can’t be “trusted” in the way that, say, an experimental study can, one that is essentially looking at a few particular and well-defined outcomes, in the sense that you have a clear scientific hypothesis that can be translated to a clear statistical “hypothesis”. In other words, these studies are very useful BUT they essentially *generate* hypotheses.
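        As a concrete illustration of the screening workflow described in point 1, here is a sketch on simulated data (not from any real microbiome study), using Benjamini-Hochberg FDR control, which is the usual refinement of raw p < .05 filtering when testing hundreds or thousands of taxa at once:

        ```python
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)

        n_taxa, n_per_group = 2000, 30
        # Simulated log-abundances: most taxa identical across groups,
        # a small fraction truly shifted in the "disease" group (an assumption).
        truly_shifted = rng.random(n_taxa) < 0.05
        shift = np.where(truly_shifted, 1.0, 0.0)

        controls = rng.normal(0.0, 1.0, size=(n_taxa, n_per_group))
        cases = rng.normal(shift[:, None], 1.0, size=(n_taxa, n_per_group))

        pvals = stats.ttest_ind(cases, controls, axis=1).pvalue

        # Benjamini-Hochberg: flag taxa whose ordered p-values fall under the
        # adaptive threshold q * rank / m, then pass them on for follow-up.
        q = 0.10
        order = np.argsort(pvals)
        ranked = pvals[order]
        below = ranked <= q * np.arange(1, n_taxa + 1) / n_taxa
        k = np.nonzero(below)[0].max() + 1 if below.any() else 0
        flagged = order[:k]

        print(f"Taxa flagged for follow-up: {len(flagged)} of {n_taxa}")
        print(f"Of which truly shifted:     {truly_shifted[flagged].sum()}")
        ```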

        This brings me to two other points that I’d like to make. One is that many people rely on textbooks to learn beyond the inefficient statistics college courses that rarely prepare you for work on real data, especially if you work on something that can’t be solved with the typical namby-pamby textbook ANOVA. It certainly doesn’t help that almost every (all?) general textbook for 101 stats or regression or whatever fails to clearly point out the differences between experimental and observational studies, not to mention purely exploratory studies (which happen to also be very important in many fields, apparently to the astonishment of the majority of statisticians). Second, although I totally understand and agree with the problematic use of p-values, I still think that in many contexts, like the one I work in, there isn’t much of an alternative and they are certainly useful in practice AS LONG AS people understand the limitations inherent in their use.

        I think the real problem in many cases is the inability to understand the differences in the nature of the studies, and/or the dishonest stance of researchers who promote the results far beyond what the methods used can support.

        • >”Although I’m aware of the problems with p-values, and especially with NHST, these are actually quite useful in my field…these studies are based on assessing differential abundance variation for hundreds or thousands of outcome variables, each one of them a bacterial “species”…

          That is just a poorly designed study (it is probably designed with an incorrect interpretation of the p-value in mind). Instead what you want to measure is how the relative/absolute abundance changes over time under various conditions (eg temperature, nutrient levels, etc). Then look for patterns in that data you can make sense of with a theoretical model. If the same model can then fit other data, there is an indication you are learning something about the world. Scientific advances are made by discovering generalizable patterns, not differences.

          >”at this stage of knowledge we cannot possibly know what effect sizes are biologically meaningful…When you get those few “hits” then you can…start considering if those can potentially make biological sense”

          Filtering the data by p-value doesn’t provide any information about what makes biological sense. You are in the same state of ignorance regarding that both before and after the filtering step. You need to compare the data to some theoretical model to determine whether a result makes “biological sense”. If you have such a model, that should be used to filter the data, not p-values based on a null hypothesis that abundance stays exactly the same. If you don’t, you need to collect suitable data, explore it, and come up with one.

        • > “That is just a poorly designed study (it is probably designed with an incorrect interpretation of the p-value in mind). Instead what you want to measure is how the relative/absolute abundance changes over time under various conditions (eg temperature, nutrient levels, etc). Then look for patterns in that data you can make sense of with a theoretical model. If the same model can then fit other data, there is an indication you are learning something about the world. Scientific advances are made by discovering generalizable patterns, not differences.”

          No, you’re missing the point. What I want to measure depends entirely on the objectives of the study. Your example is fine but it only works in cases in which knowledge of a biological system is more advanced than is the case in many of these pilot studies. In many cases your variables are there just to control for confounding effects while you just want to measure a single effect, the presence of the disease vs the controls, as reflected by the effect of disease status on the microorganisms. I can obviously obtain the effects for the variables and I do, but that is not the main point in such exploratory work. The main point is to 1) try as much as possible to control for variables that may confound the results while looking at the effect of the presence of the disease, and 2) measure the effects associated with the confounders you’re trying to control for. Your example works fine in setups in which we actually do know quite a bit more, which is the case in many environmental microbiology studies. In most of the cases we work with, we don’t have that luxury.

          > “Filtering the data by p-value doesn’t provide any information about what makes biological sense. You are in the same state of ignorance regarding that both before and after the filtering step. You need to compare the data to some theoretical model to determine whether a result makes “biological sense”. If you have such a model, that should be used to filter the data, not p-values based on a null hypothesis that abundance stays exactly the same. If you don’t, you need to collect suitable data, explore it, and come up with one.”

          You’re missing the point again. The theoretical models in many of these cases are very crude and therefore just looking at ES estimates is not enough, especially when you’re looking at hundreds or thousands of effect sizes at the same time. You need to use some method that catches the most obvious *potential* differences. We have little knowledge of what effect size for a particular microorganism may be biologically relevant. Multiply that by hundreds or thousands and it’s a nightmare. The alternative is to use p-values. It’s not that one thinks there are really no differences. In biological data, there’s virtually nothing that is exactly the same. The point is that you can reduce a dataset of thousands or hundreds into a few cases. You’re essentially red-flagging taxa. The others are simply those cases for which the differential abundance *for all practical purposes* is indistinguishable from zero, given the power of the study. After that you can explore those particular taxa in detail, because now you have something for which you can search the literature, connect with the disease, see if the effect size is trivially large due to low count reads, whatever. Otherwise, it is not possible to look at specific taxa.

          I think that you are right about what you say in general, but your examples only apply to particular cases in microbial ecology, specifically those cases in which significant theoretical models exist. That is not the case in many of these microbiome studies. Unfortunately, a lot of these studies are overblown in the media and many are certainly low quality.

        • >”In many cases your variables are there just to control for confounding effects while you just want to measure a single effect, the presence of the disease vs the controls, as reflected by the effect of disease status on the microorganisms. I can obviously obtain the effects for the variables and I do, but that is not the main point in such exploratory work. The main point is to 1) try as much as possible to control for variables that may confound the results while looking at the effect of the presence of the disease, and b) measure the effects associated with the confounders you’re trying to control for.”

          Getting an idea of plausible values and the influence of various environmental factors is a prerequisite to estimating the effect of disease state, especially for something like relative abundance of microorganisms.

          Say a microbe divides twice an hour, r1=48 times/day with temperature 37 C, and r2=49 times/day when temperature is slightly different (eg 38 C). Assuming no cell death and equal starting abundance, the increase will follow something like: A(t,r)=A0*2^(r*t). If you check the abundance after 3 days: A(t=3,r2=49)/A(t=3,r1=48)=8, meaning there would be an 8-fold change in relative abundance just due to position in the incubator (or possibly gut) causing a small difference in division rate.
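          (The arithmetic in that example checks out; here is a two-line verification using the same assumed numbers, exponential growth with no cell death and equal starting abundance:)

          ```python
          # Assumed numbers from the example above: division rates of 48 vs 49 per day,
          # exponential growth, no cell death, equal starting abundance.
          def abundance(t_days, rate_per_day, a0=1.0):
              return a0 * 2 ** (rate_per_day * t_days)

          print(abundance(3, 49) / abundance(3, 48))  # 8.0: one extra division per day compounds to 8x in 3 days
          ```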

          So first of all, I think the claim there is no simple theoretical model to apply is wrong. Secondly, your method will result in either all sorts of spurious findings or extremely high inter-experiment variability. You need to get an idea of the confounding effects before testing the treatment (or disease state in this case) just to know how to run a good experiment.

        • Anoneuoid,

          as I said before, I do agree with most of what you say, but you seem to not be aware of the particular conditions in which many of these studies are performed. They are pilot studies, purely observational and exploratory. We suspect the involvement of microbiota in some diseases for different reasons but those reasons are mostly circumstantial and they are not amenable to detailed modelling given the lack of detailed medical knowledge. In your particular example regarding temperature, you show a lack of understanding of the limitations of these studies. We may suspect interactions with, say, the central nervous system through the vagus nerve, or other types of interactions. But due to the lack of detailed knowledge or clues that can be translated into a model you’re stuck with looking at disease stages vs controls and trying to see if some specific taxa, or metabolic pathways, or whatever are substantially over-represented in disease stages. That will, hopefully, provide some focus that can then be translated into more detailed investigations, even experimentation. But at this stage, you’re stuck with a very birds-eye view, sequencing thousands of phylogenetic marker genes and trying to control for factors that you know can have effects and measuring them (e.g. antibiotics, diet, whatever) but essentially you’re looking for overabundant taxa, or changes in community structure, or general diversity indices, or overabundant metabolic pathways, or whatever. And p-values, sucky as they are, do prove useful as a first approach to these difficult cases in which medical knowledge is severely lacking.

          > “You need to get an idea of the confounding effects before testing the treatment (or disease state in this case) just to know how to run a good experiment.”

          And as I said we do have some variables that we know have high potential as confounders. But the actual drivers of the disease are unknown and at this stage cannot be modelled due to lack of precise knowledge. It’s the disease stage that has to stand in for flagging taxa variation of potential interest while controlling for the most obvious factors. As for “running a good experiment”, you’re again missing the point. These aren’t experiments.

          > “Secondly, your method will result in either all sorts of spurious findings or extremely high inter-experiment variability.”

          Oh, but I totally agree with you. Let me make clear that I share every single reservation that Gelman and all of you have regarding these issues. That is why I’ve been a lurker for a very long time on this blog. I believe that the vast majority of “hits” will be spurious and I pity my colleagues from medicine, with whom I work, who think otherwise. I keep bringing their unrealistic expectations back to Earth. I think the Garden of Forking Paths & Friends are huge problems in this kind of research. Most of this won’t ever be replicated or verified. But sometimes you find some gold nuggets, and it does pay off. Notice that at this stage, you’re not trying to explain the variation. All you’re doing is looking at the data to see if it’s worth it to try and get funding to investigate further based on a possible effect of disease state, but you need to have some evidence that something may be going on and worth investigating, and that’s what these pilot studies are for. This is very different from what you seem to be envisioning in your head.

          If it helps, I do feel largely unsatisfied with these studies. But they do pay off occasionally, and that makes them worth it. But I don’t plan to stick around, because I’d rather be doing more satisfying modelling in fields in which we can indeed go further because more knowledge is available, like environmental microbiology or epidemiology, where we have a good idea of the interactions and factors driving abundance of taxa, metabolic pathways, etc. Anyway, my point is that, for now, those damn p-values do show some limited use, and their use in these studies is not a product (at least for the most part) of a lack of statistical acumen. It just happens that for some cases, they do prove useful and there aren’t realistic practical alternatives.

          I hope this makes it somewhat clearer and if you disagree, well, we’ll have to agree to disagree. I just hope you understand that we’re all in the same boat here and this has nothing to do with “bad experimental design” or “lack of understanding of what a p-value is”.

          Cheers.

        • PABP,

          I really expected you to argue my example of the temperature effect was physically implausible for one or another reason, at which point I would say “see we know way more about this phenomenon than you thought”. The problem is that this type of study is designed around looking for significant differences rather than collecting useful information about gut microbes. I don’t see how it is going to change the problem that there is “lack of detailed knowledge or clues”. It reminds me of this Feynman quote:

          “For example, there have been many experiments running rats through all kinds of mazes, and so on—with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.

          The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and, still the rats could tell.

          He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.

          Now, from a scientific standpoint, that is an A‑Number‑1 experiment. That is the experiment that makes rat‑running experiments sensible, because it uncovers the clues that the rat is really using—not what you think it’s using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat‑running.

          I looked into the subsequent history of this research. The subsequent experiment, and the one after that, never referred to Mr. Young. They never used any of his criteria of putting the corridor on sand, or being very careful. They just went right on running rats in the same old way, and paid no attention to the great discoveries of Mr. Young, and his papers are not referred to, because he didn’t discover anything about the rats. In fact, he discovered all the things you have to do to discover something about rats. But not paying attention to experiments like that is a characteristic of Cargo Cult Science.”

          http://calteches.library.caltech.edu/51/2/CargoCult.htm

        • PAPB: in this type of experiment, you’re using the population variation as a measurement scale for “biologically relevant”. There are many ways in which things can be biologically relevant even when they vary by less than the populational standard deviation… but you can usually guess that when some species is massively more abundant than the others, it’s probably doing *something*.

          So, from this perspective, you’re using p values to filter (which is pretty much their main logical purpose) and you’re filtering things that are “very different” from some stuff that is more “typical”. It’s one of the main uses for p values that has a logical foundation you can trust.

          That being said, you’re only going to find certain kinds of issues using this kind of filtering technique. For example, you can find big earthquakes by p-value filtering a seismometer’s output, but you can’t as easily use that sort of thing to distinguish between small earthquakes and traffic on the road outside the building.

          Searching for things to study because they’re “very different” from a bunch of other more typical things is basically the main logical use for p values applied to data.

          http://models.street-artists.org/2015/07/29/understanding-when-to-use-p-values/

  9. Let’s hope that, if nothing else, the back and forth rounds in the psych replication crisis lead to a stringent critical appraisal of all aspects of these psych experiments (the methodology, the measurements, the subjects, and the statistics). I’m not too sanguine though. My main worries with the replicationist conclusions in psychology are that they harbor many of the same presuppositions that cause problems in (at least some) psychological experiments to begin with, notably the tendency to assume that differences observed (any differences) are due to the “treatments”, and further, that they are measuring the phenomenon of interest. Even nonsignificant observed differences are interpreted as merely indicating smaller effects of the experimental manipulation, when the significance test is shouting disconfirmation, if not falsification.

    It’s particularly concerning to me in philosophy because these types of experiments are becoming increasingly fashionable in “experimental philosophy,” especially ethics. Ethicists rarely are well-versed in statistics, but they’re getting so enamored of introducing an “empirical” component into their work that they rely on just the kinds of psych studies open to the most problems. Worse, they seem to think they are free from providing arguments for a position, if they can point to a psych study, and don’t realize how easy it is to read your favorite position into the data. This trend, should it grow, may weaken the philosopher’s sharpest set of tools: argumentation and critical scrutiny. Worse still, they act like they’re in a position to adjudicate long-standing philosophical disagreements by pointing to a toy psych study. One of the latest philosophical “facts” we’re hearing now is that political conservatives have greater “disgust sensitivity” than liberals. The studies are a complete mess, but I never hear any of the speakers who drop this “fact” express any skepticism. (Not to mention that it’s known the majority of social scientists are non-conservatives–by their definition.)

    One of the psych replication studies considered the hypothesis: Believing determinism (vs free-will) makes you more likely to cheat. The “treatment” is reading a single passage on determinism. How do they measure cheating? You’re supposed to answer a math problem, but are told the computer accidentally spews out the correct answer, so you should press a key to get it to disappear, and work out the problem yourself. The cheating effect is measured by seeing how often you press the button. But the cheater could very well copy down the right answer given by the computer and be sure to press the button often so as to be scored as not cheating. Then there’s the Macbeth effect tested by unscrambling soap words and getting you to rate how awful it is to eat your just run-over dog. See this post: http://errorstatistics.com/2014/04/08/out-damned-pseudoscience-non-significant-results-are-the-new-significant-results/ I could go on and on.
    Maybe this new fad is the result of the death of logical positivism and the Quinean push to “naturalize” philosophy; or maybe it’s simply that ethics has run out of steam. Fortunately, I’m not in ethics, but it’s encroaching upon philosophical discussions and courses. It offends me greatly to see hard-nosed philosophers uncritically buying into these results. In fact, I find it disgusting.

    I went ahead and placed this as a comment on my blog which is on “repligate returns”: http://errorstatistics.com/2014/04/08/out-damned-pseudoscience-non-significant-results-are-the-new-significant-results/comment-page-1/#comment-139512

    • Mayo:

      Yes, one of the worst messages sent out by the statistics profession is that if you have statistical significance with a clean survey or experiment, you can just treat the science as a black box.

      This is the black-box model of science presented in many statistics textbooks and this is the model that seems to be followed by researchers on power pose, ESP, embodied cognition, education interventions, etc.

      Sure, in each of these cases, researchers have some motivating theory—but that theory is taken just as a justification to raise funds for the experiment, or to get the results published. The data analysis is presented as standing on its own.

      And, sure, there are some cases where this is so, for example where you see a big shift in the polls, far beyond what could be explained by sampling error, and this tells you that something is going on. Or where a controlled experiment shows a large, persistent, and replicable effect, something that can be taken as a “stylized fact” for science to try to explain.

      But my impression is that such clean cases are rare. Far more often we need substantive theory to understand our data.

      This can be taken as a plea for the use of Bayesian inference (and, indeed, I like Bayesian inference and find it useful)! But it’s really a more general point, as Carlin and I discussed in our completely non-Bayesian paper on type M and type S errors.

      I’ll have to write more on this, but for now let me just emphasize that I think the model of the theory-free experiment has led to all sorts of confusion.

  10. I’m sure there is bias towards sensationalism … and I never liked p-values in college, and being at Bell Labs didn’t make me like them more. I attend internal seminars fairly often at UCSF, and discussions often revolve around the methodologies and statistics of various studies (the current hot button being e-cigarettes). As a long-time follower of some of the parapsychology studies (like PEAR), I’ve always been doubtful of those.

    I’m unsurprised that many studies on humans are not strong, and do not replicate.

    On the other hand, I took a grad psych course in sensation and perception, and I thought that area had some solid results. Likewise, the 10% of our lab at BTL that were psychologists seemed to get good results, which had to work across large populations.

    I read Science most weeks, and it seems that most results in some fields will replicate (if the claims are such that replication makes sense), or if there is a spectacular result, people will try very hard to replicate and the result may disappear, like cold fusion.

    So, it seems to me that:
    a) There may well be a bias towards spectacular results.
    b) Some fields may have a lot of “wrong” papers, or maybe that’s specific subfields within them.

    My concern is that when it is said there is a crisis in “science”, that seems overly-general and not helpful in improving, since one needs to focus on where the problems are … just as we improve software performance by measuring where the time goes and working on the biggest consumers.

    • I’d agree that not all of science is tainted. But many of the most publicly-visible fields of science (health, pharma, climatology, business, psychology, etc) have serious issues. True, other fields that are also highly visible (physics, astronomy, etc) are still pretty rigorous, but if the front-page-science fields crash and burn, the case for public funding of science becomes even harder than it currently is. Especially since fields like physics take lots of collaborators, working over many years, with enormous and expensive experiments, while psychology studies can be spewed out on a monthly basis. (Not to mention that people, organizations, and governments are likely to act on health or psychology claims, but there’s not much they can do about quarks.)

      My background is computer science and I’ve noted an increasing publish-or-perish subculture in computer science and machine learning papers where the authors appear to choose their topic by looking for two keywords that haven’t previously appeared together in the literature. So you get a paper where someone combines algorithm A with algorithm B and finds nothing interesting — as you’d expect based on what A and B do — but somehow it still gets published and the conclusion is that the new algorithm C is “competitive” and “deserves further study”. This doesn’t usually involve significance tests, etc, so it’s probably not on Andrew’s radar, but considering that the methods are so ridiculous and mundane I’m persuaded that the authors had to stretch and bend their tests to achieve “competitive” status and I doubt that they’d replicate.

      Some fields are infamous for not sharing (publicly-funded) data for a decade or more. Except with researchers who are friends. And once the data is published any replication attempt or criticism is automatically discounted because “we’ve moved on from that”. It doesn’t matter whether findings seem like they might replicate if the culture is such that it’s impossible to even try.

      So I’d urge Andrew not to back down when talking about “science”. There are some mountain tops of rigor standing above the clouds, but…

      • Wayne, what leads you to include climatology in your list?
        I think that John Mashey is right to focus on the human-subject aspect, but it is one of four issues: (1) the complexity of human behavior and the difficulties of working with human subjects, (2) the financial interests of those sponsoring research, when they may have the ability to shape and/or suppress results (notable in pharma, bio-medical, and health generally), (3) fields where theories produce particularly fuzzy predictions and/or outcomes, and (4) large cross-disciplinary departments, notably Management, where the default assumption is that most colleagues cannot judge one another’s research, and therefore the only metric is the prestige of the journal that published it.
        If you add to one or more of these four factors the general structures of incentives facing academics and their publishers, you get a lot of malpractice. And, with a range from medicine to management, it could be taken as representing a general crisis in science.
        But that would be wrong. I take Mashey’s point to be that none of these four issues is prevalent in, e.g., climatology: the “controversy” in climatology is manufactured by people who for one reason or another do not like the findings of the careful scientific work done in that field. To call it, so broadly, a “crisis in science” is both to misdiagnose the problem, and to give climate change deniers and their ilk another tool for the promotion of doubt.

      • This isn’t actually relevant to your point, but it routinely bothers me that people conflate “particle physics” with “physics.” No, it is not the case that “fields like physics take lots of collaborators, working over many years,… [etc].” The considerable majority of physicists (e.g. me) work in areas of condensed matter physics (materials, phase transitions, pattern formation, …), biophysics, atomic physics, optics, and such areas — groups are small, experiments don’t cost billions, and (here’s the part you can disagree with) the topics are really the most fascinating things in the universe!

        • The particle physicists are the ones spending the ginormous budgets and publishing papers where the author list is 18x as long as the content… so it’s understandable that they have more recognition. But I agree with you, physics is so much more than just particle physics.

        • Daniel,

          I replied to your comment above at the bottom of the thread, because there seems to be a limit on how deeply replies can be nested.

        • Perhaps it has to do with which parts of physics are the “frontiers” in any given era, with the most novel / significant results emerging?

          At some point in the past (perhaps 200 years ago) that area happened to be mechanics, followed by thermodynamics, stat mech. etc. and the current decades just happen to be those of particle physics’ glory.

          Also, with condensed matter physics etc. a lot of stuff has already matured & been encroached upon by the applied guys from material sciences, metallurgy, etc. So also for optics. Particle physics still happens to retain its “purity”.

        • At the risk of offending my particle physics colleagues: it’s *very* debatable whether particle physics is a frontier of physics. Note that particle physics experiments generally test theories (e.g. the Standard Model) that are decades old, and that are really quite solid. If I had to pick frontiers of physics based on our not really knowing what principles are out there to be discovered, I’d pick astrophysics/cosmology or biophysics (disclaimer/bias: my field!). Of course, it is both fair and correct to point out that in science in general, a lot of stuff has matured.

  11. @Daniel Lakeland,

    It seems that there is a limit to nesting the replies. Yes, what you describe is the case: p-values as filters (very imperfect ones at that, obviously) for situations in which “deviation from the usual” is a red flag to pick up that taxon and investigate it further, including with different methods. I don’t find it satisfactory, but it seems in many cases to be the only practical option to move forward. Sometimes it pays off, sometimes it doesn’t.

    @Anoneuoid says:
    “…at which point I would say “see we know way more about this phenomenon than you thought.”

    Ignoring your condescending tone for a moment, has it crossed your mind that perhaps I know what I’m talking about way more than you do?

    • PABP: my suspicion is that Anoneuoid is in fact a molecular biologist, or at least someone with a reasonable biology background; he or she has made a lot of biology-like remarks in comments in the past, and in fact tends to talk about how poor the theoretical underpinnings are in biology, and how that’s based on a kind of “giving up” way too quickly, before even thinking about what we do know model-wise.

      In any case, there’s nothing wrong with “screening” through a crapload of data to find things that are “different” from the norm. But it’s pretty much just a first step, and in biology it can be such a large, expensive first step that people have a tendency to want to think of it as more like a complete research program, particularly if 4 or 5 years of your PhD are built on just doing screening experiments … you might want to think that it’s more than just collecting a bunch of data so that someone 10 years down the line can finally start studying something important.

      • Thanks for your reply. Yes, it’s just a first step, and only in specific, particular cases. These are pilot, very early-stage studies that will only pay off (if ever) after many years of research using different methods and approaches, but they are all part of a continuum. Nevertheless, in some contexts p-values are useful as filters, as much as one may hate them.

        As for Anoneuoid, from his/her comments he/she may very well be a molecular biologist, but that accommodates as many different subfields and work experiences as saying that one is a statistician. I understand what he/she means by his/her comments, but it’s clear that he/she doesn’t know about my particular work, and so he/she needs to entertain the possibility that – SHOCKING! – I may be right and he/she may be wrong. Otherwise, I think all of us here share the same feelings about p-values.

        Thanks for your time in replying. I appreciated your comments and blog.

        • PABP,

          I didn’t mean to be condescending; I just really think there is no reason to limit yourself to default null-hypothesis p-values (which do not tell you anything about why there is a difference). Anyway, I looked into this topic a bit and think I figured out the type of study you are referring to. If I’m wrong about that, sorry; it was interesting anyway:

          As far as I can tell the methods in this paper[1] are typical. What actually gets measured is the relative number of 16S rRNA gene copies (collected from feces, with >97% similarity to sequences in a database, etc). It appears the copy number per cell varies at least from 1-15[2], so using this method you can see a 15:1 ratio between two types of bacteria when relative abundance is actually 1:1. That method definitely needs to be improved to the point we are confident relative abundance is actually being measured (by normalizing to copy number) before testing any type of theory.

          On the other hand, I discovered that (contrary to the claim that little is known about this topic) there is a “universal law” the relative abundance follows. A phenomenon identified >100 years ago. Not only that, but there is an actual *glut* of theories/models regarding this pattern, including one Ronald Fisher came up with himself. From ref 3:

          “Species abundance distributions (SADs) follow one of ecology’s oldest and most universal laws – every community shows a hollow curve or hyperbolic shape on a histogram with many rare species and just a few common species. Here, we review theoretical, empirical and statistical developments in the study of SADs. Several key points emerge. (i) Literally dozens of models have been proposed to explain the hollow curve.
          […]
          The first theory attempting to explain the mechanism underlying hollow-curve SADs was by Motomura (1932). He pointed out that a sequential partition of a single niche dimension by a constant fraction leads to the geometric distribution. Fisher et al. (1943) argued for the logseries distribution as the limit of a Poisson sampling process from a gamma distribution (where the gamma was chosen only because of its general nature).”

          There are even papers applying these theories to gut microbiota data.[4] Unfortunately I got stuck there. In eq 6-7 of that paper they calculate the expected number of species with n individuals, but there is a strange term in the numerator: (1-b/d)^(S/b), where b=birth rate, d=death rate, and S=”immigration” rate. So according to this theory the birth rate has to always be less than the death rate, or else you get negative (or even undefined) probabilities? Can anyone figure out what is going on there?
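
          A quick numerical check of that term (with made-up values of b, d, and S) shows what I mean:

          ###
          # the (1 - b/d)^(S/b) term from eq 6-7 of ref [4]; values are made up
          term <- function(b, d, S) (1 - b/d)^(S/b)
          term(b = 0.5, d = 1.0, S = 2)  # b < d: well-defined, 0.5^4 = 0.0625
          term(b = 1.5, d = 1.0, S = 2)  # b > d: negative base, non-integer exponent -> NaN
          ###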

          [1]http://www.ncbi.nlm.nih.gov/pubmed/24336217
          [2]http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1797146/
          [3]http://www.ncbi.nlm.nih.gov/pubmed/17845298
          [4]http://www.ncbi.nlm.nih.gov/pubmed/26821617

        • Anoneuoid,

          I’m not limiting myself to anything. I’m talking about specific cases in which using p-values as filters is desirable. Just because a study uses 16S rRNA and looks at abundances doesn’t mean that all studies are the same or require the same analysis methods.

          Just some points:

          “It appears the copy number per cell varies at least from 1-15[2], so using this method you can see a 15:1 ratio between two types of bacteria when relative abundance is actually 1:1. That method definitely needs to be improved to the point we are confident relative abundance is actually being measured (by normalizing to copy number) before testing any type of theory.”

          The number of 16S copies doesn’t vary per random cell; it varies per taxon. If you are comparing the abundance *of the same taxon* in two different groups it will not be a problem, and that’s precisely what you want to do in many cases. Plus, we do know the number of 16S copies per taxon in many cases, so in fact we can correct for it bioinformatically using reference databases. The only problem that may occur is when there is too much divergence between multiple copies of the same gene in a population of cells of a given taxon. Unfortunately in those cases there ain’t much you can do, but if you look at higher taxonomic levels, like genus, the problem tends to disappear, though you lose resolution. There are also alternatives to the 16S gene, but no reliable, curated databases. Every study is different, so you have to consider the impact these things have on your study in particular.
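
          For what it’s worth, here is a minimal sketch of the bioinformatic correction I mean, assuming the per-taxon 16S copy number is available from a curated reference database (the read counts and copy numbers here are invented for illustration):

          ###
          # hypothetical observed 16S read counts for two taxa and their known copy numbers
          reads  <- c(taxonA = 100, taxonB = 100)
          copies <- c(taxonA = 1, taxonB = 4)
          cells  <- reads / copies      # copy-number-corrected (cell-level) counts
          cells / sum(cells)            # corrected relative abundances: 0.8 and 0.2
          ###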

          “On the other hand, I discovered that (contrary to the claim that little is known about this topic) there is a “universal law” the relative abundance follows.”

          I don’t know who claimed that, certainly not me. Yes, every ecologist knows about SADs and RSAs. It’s Community Ecology 101. But it’s not directly relevant for my case.

          Anoneuoid, thanks for the effort, but as I said before, the only person here that knows the specificity of the problems I work with is me, and I find it a bit tiring that you just keep insisting over and over again, as if it was against your religion that the damn p-values can be useful for something occasionally. Take a break, man.

          By the way, thanks for that last reference. I didn’t know that paper. Quite recent, too.

        • “If you are comparing the abundance *of the same taxon* in two different groups it will not be a problem, and that’s precisely what you want to do in many cases.”

          Say we have three taxa and two different groups. For each we know both absolute abundance and copy number. We are only interested in changes in the first taxon (with abundance=10). A change in the absolute abundance of the third taxon (from 2 to 6 cells) will alter the relative abundance of the first taxon (from .38 to .33). However, if what we are actually measuring is relative copy number, the change in apparent relative abundance for the first taxon would be from 0.31 to 0.21. So, besides our estimates of the relative abundance being inaccurate, even the change for taxon 1 (whose absolute abundance did not change at all) has been overestimated by 2x. Is this not what happens?

          ###
          # true cell counts for three taxa in groups 1 and 2 (only taxon 3 changes)
          Abundance1 = c(10, 14, 2); Abundance2 = c(10, 14, 6)
          # 16S copies per cell for each taxon
          CopyNum = c(1, 1, 4)

          # true relative abundances in each group
          > Abundance1/sum(Abundance1)
          [1] 0.38461538 0.53846154 0.07692308
          > Abundance2/sum(Abundance2)
          [1] 0.3333333 0.4666667 0.2000000

          # apparent relative abundances when what gets counted is 16S copies
          > (Abundance1*CopyNum)/sum(Abundance1*CopyNum)
          [1] 0.3125 0.4375 0.2500
          > (Abundance2*CopyNum)/sum(Abundance2*CopyNum)
          [1] 0.2083333 0.2916667 0.5000000
          ###

          >”Yes, every ecologist knows about SADs and RSAs. It’s Community Ecology 101. But it’s not directly relevant for my case…I find it a bit tiring that you just keep insisting over and over again, as if it was against your religion that the damn p-values can be useful for something occasionally.”

          If accurate measurements and the existence of a “universal law” constraining the expected observations are irrelevant to the design of the study, I do doubt anything useful will come of it, let alone that it is optimal (as I noted earlier, this type of study is designed around NHST rather than scientific considerations). If you want to call this objection “religious”, so be it, but such studies deviate pretty far from the method that gave us the benefits we enjoy today.

          Why not just skip the filtering and fit a couple of different SAD models? Use those parameters to estimate something like the average bacterial birth/death ratio in diseased vs normal. Using that birth/death ratio, come up with explanations for that numerical value (e.g., 1 T-cell can kill up to 10^4 bacteria/min in culture, and the disease causes inflammation which increases this T-cell indicator by 10x, so a plausible explanation would be…). Then do a new study to test these possible explanations.
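
          As a rough sketch of the kind of thing I mean, one could start from Fisher et al.’s logseries relation S = alpha*log(1 + N/alpha) and estimate alpha separately for each group (the species counts below are made up):

          ###
          # made-up individuals-per-species counts for one community
          counts <- c(120, 45, 30, 8, 5, 3, 2, 1, 1, 1)
          S <- length(counts); N <- sum(counts)
          # solve S = alpha * log(1 + N/alpha) for Fisher's alpha
          f <- function(alpha) alpha * log(1 + N/alpha) - S
          alpha_hat <- uniroot(f, interval = c(1e-3, 1e3))$root
          alpha_hat  # compare this parameter across diseased vs normal groups
          ###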

        • Anoneuoid, the RSAs are useful for whole community level assessment, and we do those, but I’m talking about particular cases in which you may be interested in flagging particular taxa out of thousands of them due to conspicuous differential abundances. RSAs won’t be of particular use here.

          Look, you have the right attitude regarding research and I value your interest, but what you imagine in your mind as the reasons why I do this or do that does not correspond to the actual practical problems. We just keep talking past each other and I have better things to do with my time. It’s Friday.

  12. Hey Andrew, I think RPP is very valuable and I agree with many of the sentiments in your post. That said, there is one tidbit in Gilbert et al. worth praising. They use the ManyLabs sets of replications to address the question of over-dispersion across a set of replication experiments. In effect, they attempt to estimate how much variance subtle methodological changes introduce across replications. This strikes me as a really good idea. Unfortunately, their approach here is naive, and I wish they had done so in a more principled fashion. I may explore the issue at some point, which I can, because Nosek and colleagues’ data are open. Best, Jeff

    • Jeff:

      That came up in meta-analyses of clinical trials – was the variation in effects due to biological differences among the studies or to methodological differences? Unfortunately these are highly conflated there, but fortunately this was realized just before an RSS statistics group (2001/2) announced that the variation was real (biological) and needed to be conveyed to clinicians.

  13. I think a big part of this issue is that there is one “original” study and one “replication” study.

    When two studies on the same topic find opposite results (I don’t just mean a “replication failure” as in Group A > Group B at p = .023 in the original and Group A > Group B at p = .187 in the replication; I mean Group A > Group B in one study and Group B > Group A in the other)–how do we know which one better reflects the effect at the population level? I’ve been working on a blog post at http://psychsci.blogspot.com on this topic, and I expect to post about it in the first few months of 2017.

    Nosek, the leader of the Open Science Collaboration replication team, is also involved in the Many Labs Project–the first MLP publication, which can be found at http://econtent.hogrefe.com/doi/pdf/10.1027/1864-9335/a000178 (or if you have trouble accessing that, you can find it at https://osf.io/ebmf8/), tested the effects of 13 studies, multiple times apiece. It’s not surprising to me that this approach yielded very different conclusions: 10 out of 13 original effects were successfully replicated, and an eleventh effect apparently showed weak support. So, if you replicate a weak effect once and the replication doesn’t meet the p < .05 criterion, does that REALLY count as a failed replication? There's no clear answer to that question, and I'm not confident that the Open Science Collaboration used the best approach to evaluate the success of a replication attempt.

    Plus, as I intimated above: when there are only two studies on a topic, and the data lead researchers to different conclusions–which study more accurately reflects the population? The best evidence is still found in multiple replication attempts…
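
    One option, rather than declaring a winner, is to pool the two estimates, e.g. with simple inverse-variance (fixed-effect) weights. A toy sketch in R, with invented estimates and standard errors:

    ###
    # two hypothetical effect estimates with opposite signs, and their standard errors
    est <- c(original = 0.40, replication = -0.10)
    se  <- c(0.18, 0.12)
    w   <- 1 / se^2                       # inverse-variance weights
    pooled    <- sum(w * est) / sum(w)    # pooled estimate (about 0.05)
    pooled_se <- sqrt(1 / sum(w))         # pooled standard error (about 0.10)
    c(pooled = pooled, pooled_se = pooled_se)
    ###

    Of course, with only two studies the pooled estimate is still very noisy, which is really the point: more replication attempts are what shrink that uncertainty.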

    • “So, if you replicate a weak effect once and the replication doesn’t meet the p < .05 criterion, does that REALLY count as a failed replication? "

      As I see it, thinking in terms of "testing" and "failed replications" is part of the problem. It is more to the point to consider things like: What is the quality of each study? Are the measures good ones? What are the "raw" effect sizes — are they practically significant? Is the phenomenon being studied really worth studying?

      • Martha:

        Yes. To put it another way: If the original study was “dead on arrival” with a much higher level of noise than signal, then we’ve learned just about nothing from the original study having a result of p less than .05. One reason I recommend the “time reversal heuristic”—using the replication as the starting point—is that I’d expect a replication to be more focused on the particular pattern being studied. A published paper will have all sorts of researcher degrees of freedom regarding how to code data, what data to include and exclude, what comparisons to study, what to control for, etc., to the extent that the published p-value tells us just about nothing. The replication will typically be more focused, so the results of the comparison of interest can be interpreted more directly.

        • “The replication will typically be more focused, so the results of the comparison of interest can be interpreted more directly.”

          “Typically” sounds too optimistic. A useful replication *should* be more focused.

        • Martha:

          My statement is comparative, not absolute. I’m not saying that all replications are focused, but I do think that most replications are much more focused than the published studies they’re replicating.

        • Martha: I absolutely agree that one must consider the quality of the evidence, and also take effect size into account. Not all studies are created equal; some provide a much higher quality of evidence than others! However, it’s still conceivable that two very similar studies could yield opposite results, given enough variance. In that case, which is “right?” Which result better indicates the truth at the population level? Or should we combine them and say that there is no effect? This question is a particularly thorny one, I think. This is especially the case since, on a practical level, researchers HAVE TO draw conclusions at some point. Or, at least, such is the norm.

          As you recommend, I advocate considering multiple measures like effect size, 95% CIs, and even Bayes Factors (see my under-review manuscript at https://osf.io/preprints/psyarxiv/hp53k/), to minimize some of the kinds of bias that are frequent topics on this blog. However, I also advocate (see the same link) using multiple experiments, as a single experiment can yield a spurious result due to a variety of factors, and statistical procedures aren’t always capable of correcting for those factors.

          I like to use simple Newtonian gravity as an analogy. We take gravity as a fact because dropped objects ALWAYS fall to the ground unless there’s a good explanation (e.g. weird wind currents or magnetic repulsion). So every time you drop something, it’s like conducting an informal experiment to test the theory–this analogy makes it very easy to see the powerful evidence that can be provided by many experiments vs. just one, or even a few!

          My overarching point here is that directly contradicting evidence from only two experiments should yield a high degree of uncertainty (a prior probability of close to 0.5, in Bayesian terms) for other researchers who investigate a particular effect. But that’s not what happened after the Open Science Collaboration study–instead, people seem to have fallen victim to a recency effect [http://psychology.wikia.com/wiki/Recency_effect], trumpeting all over the Internet that “Psychology studies don’t replicate!!! Everything we thought we knew is a lie!!! Who can we trust?!?!”

          I exaggerate, but not by much. Part of that reaction is surely driven by the desire for click-driven ad revenue among media outlets, but the end result, at least in the short term, is that the wider public distrusts academic researchers. People are taking the OSC replications as strong evidence that psychology research does not replicate! Such a conclusion may well be justified, but certainly not on the basis of a single replication that yielded p > .05! People, especially non-academics, tend to distrust what they hear about psychology or biomedical research in popular media; I believe that it is our job as scientists to dispel that distrust. And we can do that in a straightforward manner, by conducting more experiments BEFORE publishing (i.e. skepticism about our own findings), instead of trying to draw conclusions based on a single study (or even one study + one replication). I think the fact that the field continues to overlook this simple solution is a travesty! It may not be a “sexy” solution (as a new analytic technique might be), but it would be an effective one…and, I’d argue, THE MOST effective one.

          But, as it stands, we have a variety of psychology studies that yielded p < .05 (for the little bit that's worth), and then a single replication attempt for each study that often did not yield p < .05, and people are crying that psychology studies are therefore untrustworthy. But it seems to me that such a conclusion is unwarranted, just as the all-too-common conclusion that an effect is real if p < .05 (in one study) is unwarranted. See Etz & Vandekerckhove's (2016) Bayesian re-analysis of the Open Science Collaboration (2015) results [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149794], which yielded the much more optimistic conclusion that most of the replications yielded qualitatively similar, though quantitatively weaker, effects. I suppose my position is something along the lines of the Morey & Lakens paper discussed here: https://medium.com/@richarddmorey/new-paper-why-most-of-psychology-is-statistically-unfalsifiable-4c3b6126365a#.goxrnvn4j

          This is a complicated, and very interesting, question that cuts to the very heart of science.

          Andrew: I think your time-reversal heuristic is a highly useful tool! I think you're quite right about replications tending to be more focused on investigating a particular effect than the original study was, as the effect 'discovered' by the original study was much more likely to have occurred due to any of a variety of Simmons, Nelson, & Simonsohn's (2012) "researcher degrees of freedom" [http://journals.sagepub.com/doi/abs/10.1177/0956797611417632].

          The beginning of your first reply, though, reminded me of Ziliak & McCloskey (2009). See their paper at http://www.deirdremccloskey.com/docs/jsm.pdf, which makes the eye-opening point that inferential statistical techniques measure "precision" rather than "oomph," and even worse, substitute precision FOR oomph!

          The inferential procedures that we are taught in psychology (particularly t-tests, ANOVAs, & correlations) measure something like signal-to-noise ratio; they tell us the size of the effect as a function of the variability in the data. The problem is that even a huge effect can be swamped by excessive variance. Does the high variance mean the effect isn't real? Not necessarily, as Ziliak & McCloskey's diet pill example illustrates.

          I think Ziliak & McCloskey's argument is extremely underappreciated, and if people realized how fundamentally this criticism undermines much social science research, there would have to be a widespread revolution in the way social science researchers draw conclusions! To illustrate just how underappreciated their argument is, my first reaction upon reading this article was to use Cohen's d or another appropriate effect size measure to find "oomph." But, when running the numbers for a dataset to illustrate their diet pill example, I was surprised to find that Cohen's d exhibits the same problem (again, see my manuscript at https://osf.io/preprints/psyarxiv/hp53k/ and the associated Precision v oomph .csv and .jasp files at https://osf.io/qhevq/files/)!

          In retrospect, it should have been obvious from looking at the Cohen's d formula that Cohen's d still measures precision, but I suspect that something psychological was going on that was similar to Cohen's own (1994) criticism of NHST at http://www.ics.uci.edu/~sternh/courses/210/cohen94_pval.pdf: "…we want so much to know what we want to know that, out of desperation, we nevertheless believe that it does [tell us what we want to know]!"

          Thanks to the emphasis on novelty, along with the still-enforced "p < .05 rule," it's entirely possible that psychology research (and, even worse, biomedical research) is ignoring effects of practical import but high variability (Ziliak & McCloskey's "oomph"), in favor of studying tiny but precise effects (Ziliak & McCloskey's "precision"). And Cohen's d apparently doesn't help the matter much, since it will still likely lead less-statistically-sophisticated researchers into the trap of mistaking "precision" for "oomph."

          This leaves open the crucial question: on what basis SHOULD we draw inferences from our research? It seems that researchers don't want to let go of the mathematical trappings that lend us the facade of scientific credibility, so we will keep running t-tests and ANOVAs and correlations and regressions until the figurative cows come home. But adding stuff like effect-size measures and Bayes Factors only serves to further complicate the analysis, while doubling down on the emphasis on precision at the expense of oomph–thereby further impeding the progression of science by continuing to fixate on a principle (Ziliak & McCloskey's "precision") that is of questionable usefulness in the world outside of academia.

        • “This is especially the case since, on a practical level, researchers HAVE TO draw conclusions at some point. Or, at least, such is the norm.”

          I’d say we need to change the norm and accept uncertainty as an honest summary.

        • Martha: I agree with you! We need to better acknowledge the uncertainty inherent in research! Unfortunately, that is easier said than done when we’re incentivized to do the opposite [and disincentivized to conduct proper science, as the story related at the beginning of https://papers.ssrn.com/sol3/papers2.cfm?abstract_id=2062465 shows]. We are rewarded (in reputation, prestige, job offers, book deals, grant opportunities, etc.) for running to the media as soon as we have a “discovery,” even if that “discovery” is based on the flimsy foundation of p < .05 and nothing more.

          As an early-career researcher, this structure is not at all lost on me as I gear up to choose a career path. I fastidiously practice caution and behave with the utmost scientific integrity, but frankly, I'm probably shooting myself in the foot by doing so. I can build a better name for myself, and thereby solicit a better job offer, by trumpeting findings based on flimsy evidence–but telling a compelling or surprising story–to a journalist who simply doesn't know what a problem it is to rely on p < .05. If the journalist questions the strength of my evidence, I just hide behind the hand-wave of "that's what the statistical test showed." And do you really think that most journalists (with a handful of exceptions) know or care about the nuts and bolts of how sample size and variance are used to compute standard error?

          For some insightful reading on the topic of the perverse and counterproductive incentive structure in science, check out http://rsos.royalsocietypublishing.org/content/3/9/160384 and http://online.liebertpub.com/doi/pdf/10.1089/ees.2016.0223. It's some real eye-opening stuff!
