Study published in 2011, followed by successful replication in 2003 [sic]

This one is like shooting fish in a barrel but sometimes the job just has to be done. . . .

The paper, by Daryl Bem, Patrizio Tressoldi, Thomas Rabeyron, and Michael Duggan, is called “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events,” and it begins like this:

In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded (Bem, 2011). To encourage exact replications of the experiments, all materials needed to conduct them were made available on request. We can now report a meta-analysis of 90 experiments from 33 laboratories in 14 different countries which yielded an overall positive effect in excess of 6 sigma . . . A Bayesian analysis yielded a Bayes Factor of 7.4 × 10^-9 . . . An analysis of p values across experiments implies that the results were not a product of “p-hacking” . . .

Actually, no.

There is a lot of selection going on here. For example, they report that 57% (or, as they quaintly put it, “56.6%”) of the experiments had been published in peer-reviewed journals or conference proceedings. Think of all the unsuccessful, unpublished replications that didn’t get caught in the net. But of course almost any result that happened to be statistically significant would be published, hence a big bias. Second, they go back and forth, sometimes considering all replications, other times ruling some out as not following protocol. At one point they criticize internet experiments, which is fine, but again it’s more selection: if the results from the internet experiments had looked good, I don’t think we’d be seeing that criticism. Similarly, we get statements like, “If we exclude the 3 experiments that were not designed to be replications of Bem’s original protocol . . .”. This would be a lot more convincing if they’d defined their protocols clearly ahead of time.
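To see how strong that kind of selection can be, here is a minimal simulation sketch (my own toy numbers, not anything from the meta-analysis): every experiment has a true effect of exactly zero, the significant-and-positive results nearly always get published, a few of the rest leak out, and the pooled estimate of what survives looks like solid evidence.

```python
# Sketch: publication selection alone can turn a true effect of zero into an
# apparently solid pooled signal. All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_experiments = 50, 200
se = np.sqrt(2 / n_per_group)                       # SE of a standardized mean difference

d = rng.normal(0.0, se, n_experiments)              # true effect is exactly zero everywhere
z = d / se
significant_positive = z > 1.645                    # one-sided "successful" experiments
published = significant_positive | (rng.random(n_experiments) < 0.2)  # a few nulls leak out

pooled = d[published].mean()
pooled_se = se / np.sqrt(published.sum())
print("published:", int(published.sum()), "of", n_experiments)
print("pooled effect:", round(float(pooled), 3), " z =", round(float(pooled / pooled_se), 1))
```

The pooled z-score from the "published" subset comes out large even though nothing is there; that is the net the unpublished replications fell through.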

I question the authors’ claims that various replications are “exact.” Bem’s paper was published in 2011, so how can it be that experiments performed as early as 2003 are exact replications? That makes no sense. Just to get an idea of what was going on, I tried to find one of the earlier studies that was stated to be an exact replication. I looked up the paper by Savva et al. (2005), “Further testing of the precognitive habituation effect using spider stimuli.” I could not find this one, but I found a related one, also on spider stimuli. In what sense is this an “exact replication” of Bem? I looked at the Bem (2011) paper, searched on “spider,” and all I could find was a reference to Savva et al.’s 2004 work.

This baffled me so I went to the paper linked above and searched on “exact replication” to see how they defined the term. Here’s what I found:

“To qualify as an exact replication, the experiment had to use Bem’s software without any procedural modifications other than translating on-screen instructions and stimulus words into a language other than English if needed.”

I’m sorry, but, no. Using the same software is not enough to qualify as an “exact replication.”

This issue is central to the paper at hand. For example, there is a discussion on page 18 on “the importance of exact replications”: “When a replication succeeds, it logically implies that every step in the replication ‘worked’ . . .”

Beyond this, the individual experiments have multiple comparisons issues, just as did the Bem (2011) paper. We see very few actual preregistrations, and my impression is that there is still a lot of wiggle room in what counts as a successful replication: data inclusion rules, which interactions to study, etc.

Who cares?

The ESP context makes this all look like a big joke, but the general problem of researchers creating findings out of nothing, that seems to be a big issue in social psychology and other research areas involving noisy measurements. So I think it’s worth holding a firm line on this sort of thing. I have a feeling that the authors of this paper think that if you have a p-value or Bayes factor of 10^-9 then your evidence is pretty definitive, even if some nitpickers can argue on the edges about this or that. But it doesn’t work that way. The garden of forking paths is multiplicative, and with enough options it’s not so hard to multiply up to factors of 10^-9 or whatever. And it’s not like you have to be trying to cheat; you just keep making reasonable choices given the data you see, and you can get there, no problem. Selecting ten-year-old papers and calling them “exact replications” is one way to do it.
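To make the multiplication concrete, here is a toy sketch (again my own illustration, not a reanalysis of these data): pure-noise experiments in which the analyst gets to choose among a few outcome measures and subgroups and then reports the best-looking comparison.

```python
# Sketch: analyst degrees of freedom multiply. With no true effect at all, each
# simulated "study" picks its best combination of outcome and subgroup and keeps
# the smallest p-value it finds.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_outcomes, n_studies = 100, 5, 5_000    # subjects, candidate outcomes, simulated studies

min_ps = []
for _ in range(n_studies):
    group = rng.integers(0, 2, n)                    # "treatment" assignment, no true effect
    sex = rng.integers(0, 2, n)                      # a covariate to subgroup on
    y = rng.normal(size=(n, n_outcomes))             # several noisy outcome measures
    ps = []
    for j in range(n_outcomes):
        for mask in (np.ones(n, bool), sex == 0, sex == 1):   # full sample or either subgroup
            a = y[mask & (group == 1), j]
            b = y[mask & (group == 0), j]
            ps.append(stats.ttest_ind(a, b).pvalue)
    min_ps.append(min(ps))

min_ps = np.array(min_ps)
print("share of pure-noise studies with some p < 0.05:", (min_ps < 0.05).mean())
print("smallest p-value found anywhere:", min_ps.min())
```

Fifteen not-unreasonable comparisons per study is already enough to make "significant" findings routine from noise; stack ninety studies' worth of such choices on top of each other and the extreme overall numbers are not so surprising.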

78 thoughts on “Study published in 2011, followed by successful replication in 2003 [sic]”

    • +1. Actually, +100. I’m even thinking of writing a paper, right now, blaming the cult of NHBFST or Null Hypothesis Bayes Factor Significance Test. And I’ll get another publication as an ESP paper, predicting things ahead of time based on a random event (Gelman’s post).

  1. The point that disturbs me about the forking paths business is that it is so all-encompassing.

    I can’t see any good way for a researcher to protect himself from this critique. Ergo people will apply it selectively. How can a meta-analysis ever prove that the file drawer problem didn’t happen?

    • Rahul:

      > meta analysis ever prove that the file drawer problem didn’t happen?
      They can’t; however, the FDA and other agencies can, in that they can limit the evidence to studies that were pre-discussed with them and audited afterwards. (Part of why Stephen Senn thinks bona fide researchers should have access to that data.)

      What has distracted my attention this morning is this new paper by Ioannidis at http://onlinelibrary.wiley.com/doi/10.1111/eci.12171/full

      On this list of 400 top researchers, there are more than half a dozen I have had hands-on experience doing research with. Some I personally think do good research and some not as good as I think they could. Maybe I am wrong about that, but how could an independent third party ever figure this out from what’s publicly available?

      If there were audits of a random sample of their past research (with access to data and study documentation) we would at least have a reasonable chance of finding out.

    • Rahul:

      First, remember that forking paths != file drawer. These are two different problems.

      Second, I think that these problems are much more of an issue when research is summarized by a null hypothesis significance test (whether implemented as a p-value, a confidence interval, a Bayes factor, or whatever) than when research is more exploratory and open-ended. In the Red State Blue State project, for example, we did hundreds of analyses and presented dozens of these, but ultimately what we were doing was circling around the problem and addressing it in different ways; we weren’t framing things as a null hypothesis to be tested. Similarly in my other research projects: in each case we are doing some sort of measurement and making inferences, we’re not pinning our conclusions on the rejection of a null hypothesis.

      Third, as John Carlin and I discuss in our recent article, and as Button et al. discussed in theirs, effect size and measurement error are important. Many of the studies I have criticized involve tiny effects, sloppy measurements, and huge amounts of variation. The less signal and more noise you have, the more you have to worry about methodological issues. Conversely, if you’re studying a large effect with clean measurements in a low-variation setting, it hardly matters what you’re doing with your data.
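      A quick numerical illustration of that last point (a sketch in the spirit of that design-analysis argument, with made-up numbers rather than the article’s own calculations): with a tiny true effect and a noisy measurement, the estimates that happen to reach statistical significance greatly overstate the effect and occasionally even get the sign wrong.

```python
# Sketch: a tiny true effect measured noisily. Conditional on p < 0.05, the
# estimate exaggerates the effect (Type M error) and can have the wrong sign
# (Type S error). The parameter values are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_effect, sd, n, reps = 0.05, 1.0, 50, 20_000   # tiny effect, noisy measure, modest n

se = sd * np.sqrt(2 / n)                           # SE of a difference in means
est = rng.normal(true_effect, se, reps)            # sampling distribution of the estimate
sig = np.abs(est / se) > 1.96                      # the runs that reach p < 0.05

print("power (share significant):          ", round(float(sig.mean()), 3))
print("mean |estimate| among significant:  ", round(float(np.abs(est[sig]).mean()), 3), "vs true", true_effect)
print("significant results with wrong sign:", round(float((est[sig] < 0).mean()), 3))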

      • Andrew:

        When you wrote “Think of all the unsuccessful, unpublished replications that didn’t get caught in the net.” I assumed you were critiquing the meta-analysis on the basis of potential, unpublished contrary evidence.

        Isn’t that the file-drawer effect?

      • I somewhat agree with your second point, but it seems, in a way, nihilistic: not all research can be exploratory & open-ended.

        Sure, there’s a place for such work, but at some point isn’t there the need for stronger, more decisive analyses?

        • Rahul:

          Is there a need for a decisive analysis on ESP? This may never occur: in the absence of any detectable effect, there is always the possibility that something is out there that has not been discovered (hence the willingness of that psych journal to publish Bem’s article in the first place).

          Is there a need for a decisive analysis of the correlation between body odor and partisanship, or of the correlation between sex ratio and parental attractiveness, or of the effects of subliminal smiley-faces on political attitudes, etc.? I don’t think so. All these patterns (to the extent that they exist in the population and not just in the particular datasets being analyzed) are contingent on existing social settings; they are not universal truths of human nature.

          My colleagues and I have done what I consider to be important work on the effects of redistricting, and on political polarization, and on toxicology, and on many many other things. All of our analyses are ultimately provisional and contextual. But we can still learn.

          Sure, in the presence of strong, persistent effects, it’s possible to get more decisive analyses. There are some phenomena, in the physical and the human world, that are clear and robust and can be made clearer by statistical analysis. And, as various people have pointed out, it’s possible to get convincing evidence in such cases via preregistered replication.

          But many important and interesting phenomena are not so clear-cut, and I think we have to accept this rather than placing all scientific endeavor in a framework in which enduring truths can be discovered.

          P.S. Thanks for pushing me in the comments. This is an issue that comes up a lot but this is the first time I’ve been able to answer the question so coherently.

        • Andrew:

          You make it seem as if Poli Sci research is inconsequential. Not sure you think this but it is implied.

          In a world of limited resources, natural monopolies (which include political institutions), etc., we have to make choices that are prescriptive as well as predictive.

          If ESP really works, and if we can improve it in some way, then there is a tonne of innovation waiting. So it seems to me the question — Does it work or is it just an artifact? — is important because even if the effect is minute, to know that it is there is typically enough to start a process of improvement that eventually leads to much larger effects.

          My 2 cents in 2 minutes.

        • Andrew:

          Interesting points. For one, I find it unsettling to believe that whether a phenomenon can or cannot be decisively analysed would depend on our perceived “need” of the analysis.

          “Need” is subjective. Some people find it interesting to study child-names. De gustibus.

          What I’d find uncomfortable would be if someone told me that my ability to study a question was in some way fundamentally constrained by its “usefulness”. To me those are orthogonal concepts.

          I’m perfectly fine with someone saying that the ESP question cannot be decisively settled right now because we don’t have enough data, or haven’t done the right experiments or something like that. But if the inability is linked with “need” (or rather the lack of it) I find that hard to understand.

          PS. Thanks for the thanks! Sometimes I worry about being the resident gadfly here. :) Especially with my non-statistical, non-academic credentials.

        • Rahul:

          You were the one who asked, “isn’t there the need for stronger, more decisive analyses?” So I was answering your question regarding the “need.”

          If you’d asked, “isn’t it possible…,” I would’ve answered that question! My short answer to that, by the way, is that sometimes it’s possible and sometimes it isn’t, and sometimes it depends on resources.

        • I think what Prof. Gelman implies (he is capable, of course, of answering himself…) is that in social sciences basically everything depends on everything. Actually the same is true in natural sciences as well. But the trouble in social sciences is that there is often not a good candidate for one major cause or a few really important ones, but a huge number of small-scale dependencies. A research program which says “let’s pick a random possible dependence and prove that it is there” is meaningless in this context. We already know that it is there, that the effect is small, and that any observed effect would be contingent on any number of variables not under control.

        • I couldn’t disagree more. How do you know that everything is dependent on everything else? And even if that is true, it is wrong to assume that you then have to study all possible causes together. Good luck with that. There is something called a disturbance for a reason. Is your data matrix of size N x Infinity? Finally, all this stuff about social science being different. I personally think it is a little bit of an excuse.

        • OK. Smiley faces affect political attitudes. What about weather, the last name of the last person you’ve met, the type of beer you last ordered in the bar, how much spam you’ve got in your last mail, etc. etc.? The tiny significance of each of those events is compensated by their huge numbers. How to deal with it? I have no idea. That’s why a random internet commenter (that is, me) cannot be a substitute for a researcher with the substantive knowledge. As for the difference with natural sciences, sometimes we simply do not ask questions that obviously depend on too many factors, but sometimes, as with weather forecasts and thermonuclear reactors, we would very much like to make progress but alas.

        • Isn’t it fairly rare that there’s a legitimate need for a decisive analysis of _whether an effect exists_? IMO it’s generally not clear that this is even a well-defined question, so whence comes the need to decide it?

        • Do reagents A & B when mixed give C? Does smoking cause cancer? Does an epoxy coated pipe rust less than a galvanized pipe? Does bearing A fail faster than bearing B?

          Are these valid examples for determining whether an effect exists?

        • Your first question perhaps is, but would we even think of using statistics to verify it?
          For the others, I’d say no, and most definitely not for the cancer question.

          Does smoking cause cancer? Why would we care? Seriously, we may find pretty much every chemical one might be exposed to is carcinogenic with some vanishingly tiny correlation; if it’s 1/10^10, so what, and why would it be interesting to find out? Though the situation is worse than this. The statement “Chemical A has a x > 0 correlation with cancer” doesn’t _even mean anything_ for minuscule enough “x”. Because (among other objections) this depends impossibly precisely on what the exact reference class is (all people? Including those dead? Are we even limiting to humans (if not, where are the boundaries)? People yet unborn? Etc.), and unless you are willing to pin this and other details to unattainable precision, such a statement is not even wrong.

          Now “Does smoking cause cancer” as a shorthand for something about the magnitude of the connection (something like: it’s a big enough effect to cause material harm, and big enough so that the statement even makes robust sense without being too particular about the reference class) – well _that_ is a different question, very important of course. But that question is no longer merely “does the effect exist” – no matter what the superficial framing is.

        • 1. The worry about heterogeneity applies to effect sizes as well.

          2. Obviously we want to deal with important causes (which is more than effect size if the cause doesn’t move around much in practice)

          3. Knowing that X can be used to manipulate Y, even though the effect is small, can be important if at present we have no other ways of manipulating Y. Then we might look for moderators to boost the effect. Or gain a better understanding of the underlying mechanism that suggests other interventions, etc.

          4. So I don’t think everything causes everything else, but it is true that an effect can have many causes. The poison may be in the dose but causality is not like a poison.

        • I still think effect sizes are important in those questions. Does A & B when mixed give C in enough quantity to make it worth bothering? (I.e., suppose you take one mole of A and one mole of B and when mixed you get essentially one mole of D, but you do actually get 10^-6 moles of C… it’s useful to know that theoretically you can get some C, but not nearly as useful as if it produced even, say, 0.1 moles of C.)

          Does smoking cause cancer? This is interesting because the effect of smoking on cancer is large, how about the question “does salt cause cancer?” I think it’s pretty clear from millions of years of consuming salt that if it causes cancer, it’s a tiny effect.

          Does an epoxy-coated pipe rust less… Well, it had better extend the life of the pipe by at least a few percent, otherwise I don’t really care.

          Does bearing A fail faster than B? Well it had better be at least a few percent otherwise, who cares?

          and more importantly for the last two, it had better be the case that there’s some useful level of trade-off between cost and longevity, otherwise who cares??

    • EJ:

      No need for precognition here: you and I had access to the same data (this Bem et al. meta-analysis paper) and we drew similar conclusions. I too reviewed the paper for a journal several months ago but my review was much shorter than yours! The whole thing is pretty sad, but I feel that part of the blame goes to the field of statistics, which I fear often sells itself as a way of laundering uncertainty, as Eric Loken and I wrote in our article, The AAA tranche of subprime science.

    • Noah:

      I looked at the linked post by Scott Alexander and much of it is reasonable, but I think the author gives too much credit to Bem. The author of the post writes, “This is far better than the average meta-analysis. Bem has always been pretty careful and this is no exception. . . . Bem definitely picked up a signal. The only question is whether it’s a signal of psi, or a signal of poor experimental technique.”

      In contrast, I see no evidence that Bem picked up a signal, nor do I see Bem’s meta-analysis as careful in any relevant sense of the word. Alexander writes that Bem “apologizes for not using pre-registration, but says it’s okay because the studies were exact replications of a previous study.” But, as discussed above, these were not exact replications or even close to such.

      Near the end of his post, Alexander writes, “That is my best guess at what happened here – a bunch of poor-quality, peer-unreviewed studies that weren’t as exact replications as we would like to believe, all subject to mysterious experimenter effects. This is not a criticism of Bem . . .” But I think it should be a criticism of Bem. People pointed out multiple comparisons issues in his work and he brushed these criticisms aside. He did a meta-analysis and represented years-old studies as “exact replications.” He misrepresented his data and he avoided opportunities to do better work. I’m not saying that Bem is evil, but, yes, I think it’s fair to criticize a scientist for this sort of behavior, and I don’t let him off the hook just because his effect sizes are small and his measurements are noisy.

      • I agree that Scott Alexander could (and probably should) have been more willing to criticize Bem, but his overall point seems consistent with your argument above in favor of estimation and exploration rather than (or in addition to) decisive testing. Bem seems to have more or less followed most of the rules for doing a meta-analysis, with at least one major exception being the seemingly extremely important garden of forking paths.

        As is nicely illustrated in the comments to EJ’s blog post, you can make various cases for which studies to include and which to exclude, and the result can look (fairly) different depending on those choices. It seems like you can even get a very small but still positive overall effect size estimate without making a series of solely bad decisions (i.e., you can do better than Bem in selecting the studies to include and justifying your selection rule).
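        As a toy illustration of how much the pooled number can move (the effect sizes and standard errors below are invented purely for the sketch; they are not the studies in the meta-analysis or in E.J.’s table), here is a simple inverse-variance pooling under two different inclusion rules:

```python
# Sketch: the pooled estimate from a fixed-effect meta-analysis depends heavily
# on which studies are deemed admissible. All numbers are hypothetical.
import numpy as np

def pooled(d, se):
    """Inverse-variance (fixed-effect) weighted mean and its standard error."""
    w = 1 / np.asarray(se) ** 2
    return np.sum(w * d) / np.sum(w), np.sqrt(1 / np.sum(w))

# invented effect sizes/SEs: an "everything" set and a stricter subset of it
d_all  = np.array([0.25, 0.18, 0.22, 0.15, 0.30, 0.02, -0.05, 0.01, -0.03, 0.04])
se_all = np.array([0.10, 0.12, 0.09, 0.11, 0.10, 0.08,  0.09, 0.07,  0.08, 0.09])
d_strict, se_strict = d_all[5:], se_all[5:]     # say, only independent later replications

for label, d, se in [("all studies", d_all, se_all), ("strict inclusion", d_strict, se_strict)]:
    est, se_est = pooled(d, se)
    print(f"{label:16s} pooled d = {est:+.3f}   z = {est / se_est:.2f}")
```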

        What’s the solution to the garden of forking paths? Pre-registration? Open access to data and software? An actual theory that explains the (alleged) statistical effect (and has some set of properties that good theories have, whatever those might be)? Is it even reasonable to ask what the solution is?

        • > seems to have more or less followed most of the rules for doing a meta-analysis
          Those are just training-wheel defaults to get started; then you critically evaluate what sense can actually be made of all the currently available information.

          The real uncertainties (borrowing Mosteller and Tukey’s term) in meta-analysis involve how the data were selectively brought together as well as how the research was recounted as opposed to actually conducted. With studies that have already been conducted, that level of uncertainty is about the same as in non-randomised studies – is the discerned best estimate of effect too large to have arisen from bias? I don’t recall any of the meta-analyses I was involved in getting all the way to credibly establishing that.

          (Also, a substantive uncertainty is involved in discerning something common (exchangeable) that can be learned about amongst diversity while adequately allowing for that diversity.)

          To answer Rahul’s question, that is why it is not that valuable for someone to try and redo that meta-analysis _properly_ (except if it’s a big effect or for pedagogical purposes). You need to go prospective, and even then, succeeding with really small effects may be challenging, as it is hard to get everyone to do things and report them in an exact enough way.

        • Exactly – the don’t-knows you don’t know are an important part that can only be effectively reduced by going prospectively.

        • What exactly do you mean by “going prospectively”? Pre-selecting and pre-registering all studies that will go into a meta analysis? Or…?

        • Rahul:

          Read the link to E.J. given by Daniel below.

          I don’t think I have to fully read it, but I did notice this point:
          > it is that meta-analysis is a tool that is fraught with danger.

          I fully agree with that, but the worst strategy is trying to avoid doing them; it is by doing them that you find out where the dangers arise and how they can be lessened in the future. And you can’t escape doing a meta-analysis anyway, as Andrew commented on here http://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/

        • By prospective I think he means taking whatever you think you can learn from a meta-analysis and then designing a follow up experiment with pre-registered methodology to confirm that your effect is real.

      • I’d be more convinced by critiques of Bem’s meta-analysis if someone (Andrew?) published a competing meta-analysis that reached conclusions grossly different from Bem’s.

        Sure, the file drawer problem would still exist, but at least we could possibly absolve Bem of cherry-picking studies, selective data inclusion rules, etc.

        Also, I’m not sure about Andrew’s “But of course almost any result that happened to be statistically significant would be published, hence a big bias.” Post Bem’s sensational, controversial original ESP paper, any replication contradicting Bem’s results would have a fairly decent chance of acceptance. No?

        • Rahul:

          Post Bem’s sensational, controversial original ESP paper, any replication contradicting Bem’s results would have a fairly decent chance of acceptance.

          That’s actually a point E.J. is making here: http://osc.centerforopenscience.org/2014/06/25/a-skeptics-review/

          Worry 1. The authors wish to study replications of Bem’s work. This means they should only consider studies that were inspired by Bem (2011). A quick look at Table A1 shows that very many studies in the meta-analysis preceded Bem, sometimes by as much as 10 years. It is possible that the earlier studies had advance access to Bem’s protocol, but if this is the case it should be made clear from the outset. A related worry is that skeptics only got interested after the publication of Bem (2011). Hence, I believe that there may be a difference between replications “pre-Bem” (conducted and reported by proponents only) and “post-Bem” (conducted by proponents and skeptics alike). This is a factor that should be taken into account. Perhaps the size of the effect suddenly decreased after 2011?

          Indeed, when I consider only those psi replications that have been published post-Bem, I find Galak, Ritchie, Robinson, Subbottsky, Traxler, and Wagenmakers (the Hitchman studies seemed to be about creativity and luck, so I did not incorporate that study; including it does not change the pattern of results). Below is a table of their experiments and effect sizes:

          (…)

          When it comes to a proper assessment of the replication success for Bem’s studies, I think the above table is the correct one. I did not conduct the meta-analysis but from eyeballing the numbers it seems that there is nothing there whatsoever. The fact that this picture is changed by adding the other studies supports the assertion that they are contaminated by researcher bias and a lack of control over the analysis procedure.

        • I seem to have used the quote-tags in a wrong way. Everything starting with Worry 1 is from E.J. Wagenmakers, the sentence “Post Bem’s sensational, controversial original ESP paper, any replication contradicting Bem’s results would have a fairly decent chance of acceptance” is from Rahul.

      • @Andrew:

        I think Bem is an easy target for what is really a failure of the underlying statistical methods & procedures, as they are commonly understood & applied.

        If, based on the same quality of data but in a different field, (say) Bem’s conclusion had been “Yoga improves memory” instead of “ESP exists” it’d have been vetted, published and accepted, without much of a fuss at all.

        • And again: This is the point. Bem is an obvious example of the problems with the standards and publication criteria regarding methods & procedures. That’s why all the replication debate started in psychology: if Bem can (could) make these claims in accordance with our standards, our standards are probably not strict enough.

        • Overall, I think the Bem saga has had a positive impact on the field. Inadvertently, yes, but still.

          Accepted analysis processes led to ridiculous conclusions & that caused us to reexamine our approaches. Hopefully in a general way, not just restricted to ESP nor psychology.

        • Rahul:

          I don’t know about yoga but I do frequently post on the political science equivalents. I agree that there’s a problem in going after one study at a time, given that millions are published each year, but I hope that by looking at these individual cases we can develop some general principles.

  2. It’s true that the ideal approach would have been a prospective study. However, that is not always possible, and retrospective studies can be valuable. I wonder, therefore, what people think the appropriate analysis method is for retrospective analyses. It seems to me the best approach for retrospective analyses would possess two critical factors:
    1) inclusion of ALL data, unless it has a known gross defect (such as the paper was retracted)
    2) avoid averaging result sets, and instead treat each as an independent attempt. I think this would require defining many more parameters for each experiment than is normally done, thus ensuring appropriate comparisons are made.
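    A rough sketch of what point 2 could look like in practice (simulated numbers, and a deliberately simple random-effects formulation rather than anything proposed in this thread): each experiment keeps its own estimate, and the estimates are partially pooled instead of being averaged into a single number.

```python
# Sketch: treat each experiment as its own attempt and partially pool the
# estimates (a basic random-effects model with a DerSimonian-Laird estimate of
# the between-study variance). The data below are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
k = 12                                         # number of experiments
se = rng.uniform(0.05, 0.20, k)                # per-experiment standard errors (simulated)
d = rng.normal(0.0, np.sqrt(0.10**2 + se**2))  # observed effects; true mean 0, between-study sd 0.10

# method-of-moments estimate of the between-study variance tau^2
w = 1 / se**2
d_fixed = np.sum(w * d) / np.sum(w)
Q = np.sum(w * (d - d_fixed) ** 2)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# random-effects pooled mean, plus partially pooled per-experiment estimates
w_re = 1 / (se**2 + tau2)
mu_re = np.sum(w_re * d) / np.sum(w_re)
if tau2 > 0:
    shrunk = (d / se**2 + mu_re / tau2) / (1 / se**2 + 1 / tau2)
else:
    shrunk = np.full(k, mu_re)

print("fixed-effect mean:", round(float(d_fixed), 3), "  random-effects mean:", round(float(mu_re), 3))
print("estimated tau^2:", round(float(tau2), 4))
print("raw estimates:    ", np.round(d, 2))
print("partially pooled: ", np.round(shrunk, 2))
```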

      • “As we cannot assume a priori that time is linear, as we perceive it, or that God is limited by a linear time, as we are, the intervention was carried out 4-10 years after the patients’ infection and hospitalisation.”

        !!!1!

      • Keith: sometimes humor, like a court jester, can say things we all need to hear but don’t dare articulate. I’m sure Joseph/Entsophy/Anonymous will jump in here and back me up: billions or trillions of research dollars have been put forward over the last few decades with little more theory to back them up than “gee, this might work, let’s find out”. The result… cancer death rates largely flat, we don’t know what causes so much obesity, we don’t know how moderate sun exposure / vitamin D deficiency affects melanoma, cancer, or other diseases, we don’t know how to cure chronic sinus infections…. And the clinical studies that are tried are nowhere near as clean and nice as this one (large sample size, no dropouts! entirely blinded patients! high quality PRNG used for treatment assignment, etc.)

        • OK, but that study made the journal cut only because there was a p < .05 without having to admit doing more than a few comparisons.

          Randomization does work, but as implemented in the current academic context it may be understood by few and work extremely slowly.

          This was my best kick at the can http://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/ – others took similar kicks – 25 years later there seems to be a glimmer of hope, as I related to Fernando somewhere in this long comment thread. When I and others talked/wrote about this problem long ago, we got responses like "would not be of interest to a professional statistical audience".

          But my real advice, as I am sure Joseph/Entsophy/Anonymous will back me up – is don't get sick enough to need treatment.

        • Randomization is more or less a good method of measurement in the presence of unknown uncontrollable effects, and of course, measurement helps us. But it doesn’t help us avoid the need to come up with core theory that is solid and robust and predictive. And it doesn’t free us from the need to actually work on problems of consequence if we want to have solutions of consequence.

        • Yup. Randomization and controlled experimentation won’t solve problems of small effects, biased and noisy measurements, high variation, and nonrepresentativeness of samples. The sad thing is that a lot of statistics education doesn’t make this clear. The result: people-who-should-know-better like Daniel Gilbert who will put trust in a study just cos it’s (a) statistically significant and (b) published in a leading journal.

        • It’s not just the problem with small effects, biased and noisy measurement, high variation, nonrepresentativeness and all those stats issues that are problematic. It’s also that we seem to use RCT as a stand-in for a model of the world. Like our whole goal as scientists is to be able to predict the answer to a lot of individual unconnected questions of the form “is X better than Y for Z” instead of things like “scarring in epithelia occurs primarily because of an interaction between cells of type X and cells of type Y mediated by proteins in the class Z, and such scarring reduces the effectiveness of the epithelia at resisting bacterial infection…” or “the transport of gasoline vapors into the bloodstream occurs in such and such a manner and affects cognitive function for such and such a time after exposure to levels found near gasoline pumps, and this seems to cause such and such a number of car accidents each year by its effect on reaction time and alertness”

          It’s not that people don’t do any of that kind of research, but I do fear that this kind of model-building research is somehow risky for academic careers, hard to fund, and doesn’t make its way into the clinical world enough.

        • Daniel:

          I agree but I think that these concepts are related. In particular, with no model of the world you won’t have big effects or unbiased and precise measurements.

          But a model of the world isn’t enough. Often it has to be a quantitative model of the world. If not, you end up in the swamps with Nicholas Wade and the authors of the ovulation-and-voting study.

        • Yes, quantitative model of the world. I don’t even consider qualitative models of the world as models really, not yet. They’re more like ideas for what you need to model. Also, it’s no good to just run a regression and call that your model. I want a-priori causal hypotheses.

          You’re absolutely right, unless you work towards a predictive model, you’re never going to really find the large effects.

          For example, suppose you’re studying ovulation, hormones, and their effect on female attitudes towards their appearance. Running the cheap online survey is potentially great for your career, but actual high quality science would require you to sit down and start to *think* about what are the *causal* processes that could affect females’ attitudes about their appearance.

          Then, you’d think about of course hormone levels and ovulation because that’s part of what got you interested to begin with, but there’d also be age, marital or relationship status, attitudes about receptiveness to attention from the opposite sex, attitudes about fertility and pregnancy, socioeconomic factors, race and skin color, weather and associated clothing choices, career and associated social expectations for dress…

          So there’s a lot that potentially goes into your model, and you wouldn’t want to just throw it all in to a linear regression or something, you’d want to actually think about the interactions and the form of the model. Like for example, if there is little choice of what to wear (ie. you wear a uniform to work) then you’re not going to see much effect regardless of anything else. Also, if there’s a strong receptiveness to attention from the opposite sex, you’re likely to see other factors amplified than if a woman is “not looking for attention in general”… So you *build a model* over weeks, months, years, and you recruit lots of different groups to get involved in your study, and you run several kinds of trials, and you explore the landscape of what matters, and in 15 or 20 years you can actually say something useful about how women make fashion choices…. of course you will never get tenure, never get funding to do the work… you won’t have “sexy” headlines…

        • > Randomization and controlled experimentation won’t solve problems of small effects, biased and noisy
          > measurements, high variation, and nonrepresentativeness of samples.

          Agree that today it won’t, but am hopeful that with enough effort eventually more representative samples, better control of variation, less noisy and less biased measurements, etc. will adequately address the problems.

          Peirce called this a regulative assumption, not one that we know is true or assume is true but hope is true.

          “The sole immediate purpose of thinking is to render things intelligible, and to think and yet in that very act to think a thing unintelligible is a self-stultification. … Despair is insanity …. We must therefore be guided by the rule of hope.”

          This came up in the discussions below, but also the question of the need to come up with good and increasingly less wrong models of the world. Other than Peirce’s argument that we evolved to be good at that (which is very weak) I am not aware of any.

          (Also posted a response to Fernando but it went through as anonymous)

        • Grr, blog posted my response too soon.

          I don’t think you’re likely to disagree with me much on those points. But I really do worry about a world in which social and institutional and similar effects keep us plugging away at a certain kind of cargo-cult science that produces lots of publishable papers and makes it easier to get funding for projects that don’t really promise to give us fundamental and predictive models that can drive real improvements in people’s lives.

          It’s sort of a “it’s 2014, where’s my flying car?” attitude I know, but I’d be satisfied with a lot of things other than flying cars, such as:

          1) real, effective solutions to antibiotic resistant organisms
          2) cures for cystic fibrosis
          3) reducing the effect of heart disease on people under age 75 by 30%
          4) understanding major causes of “the obesity epidemic” in a real detailed way and finding effective ways to reverse it.
          5) Being able to regenerate organic replacement joint components instead of titanium hip implants etc
          6) Growing replacement kidneys
          7) A significantly more effective and long-lasting pertussis vaccine

          Is the way we are doing science today going to provide any or all of these things in the next 30 years? What are some similar order-of-magnitude things that it has provided since 1980 using current “modern” methods and funding priorities, publication priorities, tenure systems, and so forth?

        • @Daniel

          Good points. Two comments:

          1. In your comment on modeling you are essentially advocating a structural approach that uses experiments to learn about the structure, what in AI is called active learning. The key though is that this need not all be done by one person. You can combine different studies guided by the underlying structure to get something far richer than the reduced-form effects of meta-analysis. This can make the process of discovery massively parallel.

          2. I don’t think any of this will come out of Academia. They have the capability but not the incentives. My money is on the post-industrial research lab.

        • @Anon:

          Those look like interesting projects. However, I would not start by trying to solve the actual problems you want to solve using massive -omics data.

          I would start solving the simplest possible problem using small data that still captures some of the key issues you want to address. Then scale the method to solve the actual problem you want to solve.

        • Fernando:

          I was the anonymous – good advice, but I’ve been there, done that many times.
          (Also was “forced” to use more challenging math than required in my DPhil thesis.)

  3. “experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded”

    Given the subject, having the replication come eight years before the original experiment has a sort of internal consistency.

  4. Almost everyone agrees preregistration can be a big boon for these types of studies. Unfortunately I don’t think social science is ready for it.

    First, until recently there was no formal pre-registration platform in social science. We pre-registered our study in Current Controlled Trials, which is mostly for clinical trials, and very abridged. One exception is EGAP, where we posted a longer version.

    Second, protocols take a lot of work, and academic researchers will be loath to undertake them unless there is a reward of some sort. To test the waters we tried to publish our protocol. We could find no political science journal that would publish it, and a question to the POLMETH listserv did not come up with any suggestions. In the end we found a peer-reviewed outlet, BMC Public Health, where the protocol is now published.

    Third, in theory an advantage of peer review is that you get some feedback that can improve the protocol. In our case we did get very good feedback, but the process from submission to publication took 18 months. Now we have to do the analysis, and perhaps in another 18 months publish the results…

    In conclusion, I got a lot of personal satisfaction from this process, and learned a lot as a scientist. However, I would hesitate to encourage students to go through this same process if they plan to make a living as academics. Rather they should enjoy the summer fishing, and exploring to come back with a compelling story.

    • PS I read other protocols before we wrote ours and I found it an interesting experience. It is the closest science gets to suspense. You get the RQ, why it is important, how they will answer it, etc., but not the results.

      In my case I found myself googling for the results of the protocols I read. There was an element of suspense. I enjoyed it. Like a story in two parts. I don’t think journals have realized the potential here.

    • > Rather they should enjoy the summer fishing, and exploring to come back with a compelling story.
      You could have been a bit more subtle ;-)

      > I got a lot of personal satisfaction from this process, and learned a lot as a scientist.
      Don’t underestimate the value of that – understanding the scientific process better is an advantage anywhere.

      > if they plan to make a living as academics
      Life is not fair, but it may get fairer in research as preregistration, reproducibility and replication become more prevalent than enhancing one’s reputation with work for which the quality cannot be assessed. (As Don Rubin once said, “smart people do not like being repeatedly wrong,” and some smart people with resource control are starting to realise that believing much of the published research will do exactly that.)

      Folks started talking about registration of RCTs in the 1980s as the disgust at the idea of analysing multiple studies done by different investigators (especially in the statistics community) faded. Folks finally started to realise – if you could not understand a set of very similar studies, you could not possibly understand a (selected?) subset of one (so why ever bother to analyse an RCT?). Progress in scientific process is extremely slow.
      (Some more background here http://statmodeling.stat.columbia.edu/2012/02/12/meta-analysis-game-theory-and-incentives-to-do-replicable-research/#comment-73427)

  5. You write: “I question the authors’ claims that various replications are ‘exact.’ Bem’s paper was published in 2011, so how can it be that experiments performed as early as 2003 are exact replications? That makes no sense.”

    I wanted to point out a possible oversight, since the article seems to address the source of your confusion: “Although the formally-published journal report of Bem’s experiments did not appear until 2011, he began reporting results as they emerged from his project at annual meetings of the Parapsychological Association (2003, 2005, 2008) while simultaneously making materials available to those who expressed an interest in trying to replicate the experiments. Several informal reports of the experiments also appeared in the popular media prior to journal publication. As a result, several attempted replications of the experiments were conducted prior to 2011, and we have included them in our meta-analysis.”

    I do, of course, have a problem with how the researchers have classified their “replications”. However, your framing of this particular issue seems disingenuous. Have I misunderstood you?

    • Sleepy:

      I strongly doubt that these are exact replications. “Spider stimuli”? Whassup with that? It seems pretty clear to me that Bem et al. are counting an earlier experiment as an exact replication if (a) it is vaguely similar to any one of the studies in his 2011 paper and (b) he wants to count it as a replication. That doesn’t work for me. Of course, there have been some much-publicized replications after Bem’s 2011 paper and they did fail, so there’s that. At this point Bem could’ve declared failure and gone home, but instead he found old studies that fit his story. That’s not replication; indeed it’s almost the opposite of replication.

  6. Talking about the Garden of Forking Paths, check out this citation from Cournot, as cited by Shaffer (1995) in an article about Multiple Hypothesis Testing (http://www.annualreviews.org/doi/pdf/10.1146/annurev.ps.46.020195.003021):
    “… it is clear that nothing limits … the number of features according to which one can distribute [natural or social facts] into several groups or distinct categories. […] One could distinguish first of all legitimate births from those occurring out of wedlock, … one can also classify births according to birth order, according to the age, profession, wealth, or religion of the parents… usually these attempts through which the experimenter passed don’t leave any traces; the public will only know the result that has been found worth pointing out; and as a consequence, someone unfamiliar with the attempts which have led to this result completely lacks a clear rule for deciding whether the result can or cannot be attributed to chance.”

  7. To attempt to add to the discussion, remember the criticized work is an attempt to justify the finding of a small effect. It would be a substantial problem if the work was trying to justify a large effect. In other words, whether forked path or file drawer, I think it’s important to evaluate the magnitude of the claim. Maybe any claim about finding ESP is worthy of knocking about, but I doubt that would stop more work to find more dubious small effects.

    I’m much more worried about studies that lead to treatments and policy decisions because they mislead us into believing x has a material or significant effect on health, on the environment. Replication of that work is important. Honesty in presentation, including recognition of contrary evidence, is important when there are real stakes. Unless one considers having work published even if then forgotten as real stakes.

  8. Pingback: Using statistics to make the world a better place? - Statistical Modeling, Causal Inference, and Social Science

  9. The following researchers failed to reproduce Bem’s results.
    Jeff Galek of Carnegie Mellon University
    Robyn A. LeBoeuf of the University of Florida
    Leif D. Nelson of the University of California at Berkeley
    Joseph P. Simmons of the University of Pennsylvania

  10. I forgot one on the list in my last post.

    The following researchers failed to reproduce Bem’s results:

    Jeff Galek of Carnegie Mellon University
    Robyn A. LeBoeuf of the University of Florida
    Leif D. Nelson of the University of California at Berkeley
    Joseph P. Simmons of the University of Pennsylvania
    Chris French at Goldsmiths, University of London

  11. Pingback: Retractions aren’t enough: Why science has bigger problems - Retraction Watch at Retraction Watch

  12. Pingback: Low-power pose update: Ted goes all-in - Statistical Modeling, Causal Inference, and Social Science

  13. Pingback: Pushing the guy in front of the trolley « Statistical Modeling, Causal Inference, and Social Science
