The Open Science Collaboration, a team led by psychology researcher Brian Nosek, organized the replication of 100 published psychology experiments. They report:

A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.

“Despite” is a funny way to put it. Given the statistical significance filter, we’d expect published estimates to be overestimates. And then there’s the garden of forking paths, which just makes things more so. It would be meaningless to try to obtain a general value for the “Edlin factor” but it’s gotta be less than 1, so *of course* exact replications should produce weaker evidence than claimed from the original studies.

Things may change if and when it becomes standard to report Bayesian inferences with informative priors, but as long as researchers are reporting selected statistically-significant comparisons—and, no, I don’t think that’s about to change, even with the publication and publicity attached to this new paper—we can expect published estimates to be overestimates.
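To make the significance filter concrete, here is a minimal simulation sketch. The true effect, standard error, and number of studies are made-up values chosen for illustration, and "published" is assumed to mean simply "two-sided p < 0.05"; real publication decisions are messier than this.

```python
import random
import statistics

def simulate_filter(true_effect=0.2, se=0.15, n_studies=20000, z_crit=1.96, seed=1):
    """Simulate many studies of one true effect; 'publish' only the significant ones.

    All parameter values are illustrative assumptions, not taken from any real study.
    """
    rng = random.Random(seed)
    # Each study yields an unbiased, normally distributed estimate of the true effect.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # The filter: keep only estimates that are more than z_crit standard errors from zero.
    published = [e for e in estimates if abs(e) / se > z_crit]
    share = len(published) / n_studies
    return statistics.mean(estimates), statistics.mean(published), share

all_mean, pub_mean, share = simulate_filter()
print(f"mean of all estimates:       {all_mean:.3f}")
print(f"mean of published estimates: {pub_mean:.3f}")
print(f"share of studies published:  {share:.2f}")
```

In this setup the full set of estimates averages close to the true effect, while the subset that clears the threshold averages noticeably higher, so an exact replication of a "published" study would be expected to come in lower.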

That said, even though these results are no surprise, I still think they’re valuable.

As I told Monya Baker in an interview for a news article, “this new work is different from many previous papers on replication (including my own) because the team actually replicated such a large swathe of experiments. In the past, some researchers dismissed indications of widespread problems because they involved small replication efforts or were based on statistical simulations. But they will have a harder time shrugging off the latest study. The value of this project is that hopefully people will be less confident about their claims.”

Nosek et al. provide some details in their abstract:

The mean effect size of the replication effects was half the magnitude of the mean effect size of the original effects, representing a substantial decline. Ninety-seven percent of original studies had significant results. Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.

This is all fine; again, the general results are no surprise, but it’s good to see some hard numbers with real experiments. The only thing that bothers me in the above sentence is the phrase, “if no bias in original results is assumed . . .” Of course there is bias in the original results (see discussion above), so this just seems like a silly assumption to make. I think I know where the authors are coming from (they’re saying that even if there were no bias, there’d be problems), but really the no-bias assumption makes no sense given the statistical significance filter, so this seems unnecessary.

Anyway, great job! This was a big effort and it deserves all the publicity it’s getting.

Disclaimer: I am affiliated with the Open Science Collaboration. I’m on the email list, and at one point I was one of the zillion authors of the article. At some point I asked to be removed from the author list, as I felt I hadn’t done enough: I didn’t do any replication, nor did I do any data analysis; all I did was participate in some of the online discussions. But I do feel generally supportive of the project and am happy to be associated with it in whatever way that is.

If I remember it correctly, the “if no bias in original results is assumed” is there to make the assumption of the 68% result clear. Of course, the assumption does not hold, and thus the 68% is an overestimate. The 68% provides an upper bound, while the 39% provides a lower bound for reasonable beliefs about reproducibility. Where your beliefs are exactly within this range should depend on how much bias you believe there is.

But didn’t the NY Times coverage make you want to tear your hair a bit? It contained the umpteenth journalese summary of statistical significance as “a measure of the likelihood that a result would have occurred by chance.” What a blown opportunity. (Btw the Atlantic said the same thing in an article posted this week. People are wrong on the Internet!)

Looking at how frequently everyone, even technical users, gets the definition of “statistical significance” wrong, I can’t help wondering whether to blame the users or the definers of the term.

The definition really does answer a question that nobody ever is really asking.

Yes, exactly. The stats community has only itself to blame.

I’d still love to hear even the tiniest non-self-serving argument that “statistically discernable”, for all its flaws, isn’t much superior to “statistically significant” in EVERY single dimension we should care about. But suggestions that statisticians should switch to the former have been dead on arrival; can anyone make a non-cynical case for this?

Even FiveThirtyEight can’t get it right:

“A p-value is simply the probability of getting a result at least as extreme as the one you saw if your hypothesis is false.”

http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/

Maybe we can start calling it YHST instead of NHST: Your Hypothesis Significance Testing.

Ha! They also had a horrible discussion of “false positives” that completely missed the point. Worth another post, I suppose.

The only large replication studies that get through the filter are those where large numbers of papers fail or are implicated. I expect the next replication round to have a higher original-paper success rate… and it’s turtles all the way down…

Who will replicate the replicators?

What “filter” are you referring to here?

I am pretty sure that this set of replications was, in a sense, pre-registered and very well publicized before the results were known.

Can you cite a large replication study that didn’t get through the filter?

In general, investigators are in charge of a relatively small number of funded projects during their career. Accordingly, I think that they tend to view each one as an unassailable nugget of reality. Non-replicability can be easily explained away by details of the study. Compare that to a statistician’s view: we analyze thousands of datasets during our career and expect volatility!

Well put.

Dr. Gelman has discussed this elsewhere, but it seems especially apt now: social scientists and the consumers of their work must shift away from a paradigm in which a single significant result is seen as a revelation of truth rather than a point in a meta-analysis.

+10^3

Is this akin to the regression to the mean effect? Published estimates would overestimate on that basis too, right?

Yes, no surprise that large effect sizes are followed by smaller effect sizes and it is made even worse by the filtering of p values.

… and small effects would be followed by bigger effects in replication. But we never see that b/c the smaller effects don’t get published!

Garnett,

I don’t understand your point. I “get” these two mechanisms (p-value thresholding, regression to the mean) that would bias effect sizes higher, but I don’t see what mechanism would cause this small effects followed by bigger effects phenomenon.

JD

If there’s a real effect, and your data shows a small effect, then “regression to the mean” will mean that the next study will show a bigger effect, on average. It’s the same thing as the other way. If there’s some average mu and you get a sample mu +- n*sigma for a largish n, then your next sample is more likely to be nearer to mu, whether that’s an increase or decrease.

Now, if there’s no effect, and you get a small effect, then who knows
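The symmetry described above can be checked with a small simulation. This is only a sketch: the true mean `mu`, the study-to-study spread `sigma`, and the cutoff `k` are made-up values, and every study is assumed to estimate the same true effect with independent normal noise.

```python
import random
import statistics

def regression_to_mean(mu=0.3, sigma=0.1, n_pairs=50000, k=1.0, seed=2):
    """Pair each 'original' study with an independent 'replication' of the same effect.

    Returns (mean original, mean replication) for originals that came out
    unusually low and for originals that came out unusually high.
    """
    rng = random.Random(seed)
    pairs = [(rng.gauss(mu, sigma), rng.gauss(mu, sigma)) for _ in range(n_pairs)]
    low = [(a, b) for a, b in pairs if a < mu - k * sigma]    # small first result
    high = [(a, b) for a, b in pairs if a > mu + k * sigma]   # large first result

    def summarize(group):
        return (statistics.mean(a for a, _ in group),
                statistics.mean(b for _, b in group))

    return summarize(low), summarize(high)

(low_first, low_second), (high_first, high_second) = regression_to_mean()
print(f"low originals:  first {low_first:.3f} -> replication {low_second:.3f}")
print(f"high originals: first {high_first:.3f} -> replication {high_second:.3f}")
```

Both groups of replications land near `mu`. The upward movement after small originals is just as real as the downward movement after large ones; we rarely see it because the small originals tend not to be published.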

The journals want headline-grabbing results, and if the world became more Bayesian, they would start to impose thresholds on posterior probabilities for effect sizes to be large, with the same effect as currently for significance tests. The major problem here is not one of significance tests vs. Bayes.

I’m not convinced that this is an inevitability. The p-value filter is at least in part founder effect plus institutional inertia; a shift to a different paradigm opens an opportunity to establish different norms. It’s not like current Bayesian investigations are subject to a Bayes factor threshold or anything like that.

True, but there is nothing in the foundations of frequentist inference either that enforces publication bias, i.e., that studies with p>0.05 should not be published or ways (forking paths) of achieving “significance” should be found. The p-value “filter”, as far as I know, goes back to journal editors in psychology who wanted to make sure that things that are published are “substantial” as results, meaning substantial improvements over what already exists, or establishing effects of “substantial” size. With that very same motivation, they’d need to filter Bayesian analyses, too.

Christian:

The issue is not with frequentist inference, it’s with unregularized estimates. Using a prior distribution pulls estimates toward zero. Simple least-squares estimates are overestimates of magnitudes of effect sizes (E|x| is larger than |E(x)|), and the statistical significance filter just makes things worse.
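A small numerical sketch of both points, under assumptions that are mine rather than anything in the paper: the estimate is normally distributed around the true effect, and the prior is normal and centered at zero, so the posterior mean has the usual conjugate closed form. The specific numbers are illustrative.

```python
import random
import statistics

def posterior_mean(estimate, se, prior_sd, prior_mean=0.0):
    """Posterior mean for a normal likelihood combined with a normal prior."""
    weight_on_data = prior_sd**2 / (prior_sd**2 + se**2)
    return prior_mean + weight_on_data * (estimate - prior_mean)

# Point 1: the magnitude of a noisy unbiased estimate is biased upward.
rng = random.Random(5)
true_effect, se = 0.1, 0.2          # illustrative values only
draws = [rng.gauss(true_effect, se) for _ in range(50000)]
mean_abs = statistics.mean(abs(x) for x in draws)   # approximates E|x|
abs_mean = abs(statistics.mean(draws))              # approximates |E(x)|
print(f"E|x| ~ {mean_abs:.3f}   |E(x)| ~ {abs_mean:.3f}")

# Point 2: a zero-centered prior pulls a raw estimate toward zero.
raw = 0.4
shrunk = posterior_mean(raw, se=0.2, prior_sd=0.2)
print(f"raw estimate {raw:.2f} -> posterior mean {shrunk:.2f}")
```

With the prior standard deviation equal to the standard error, the data get weight one half, so the raw estimate is cut in half. How much shrinkage is appropriate in a real problem depends entirely on what prior is defensible there.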

There is nothing (rational) in the foundations of frequentist ‘inference’, period: https://telescoper.wordpress.com/2010/12/11/deductivism-and-irrationalism/

Nonetheless, Bayesian methods do have resistance to some of the problems that p-values have, e.g., optional stopping.

You totally lost me. Correct calculation of a p-value takes account of the stopping rule. You have to calculate the probability that, if the null were true, you’d observe a test statistic at least as extreme as the one observed, given the stopping rule (and the other relevant features of the data-generating process). This is very well known.

“This is very well known.”

After ~7 years in academia I still haven’t seen this happening in the wild (accounting for stopping rules and such). Sneaky-p-eaky in the middle of an experiment or topping up with some extra participants to squash that nasty p, I’ve seen, however.
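To see what unaccounted-for peeking does to the error rate, here is a rough simulation sketch. It assumes a true null effect, observations with known standard deviation 1, a z-test, and a look at the data after every batch of 10 observations up to 100; the batch size, maximum sample size, and function name are illustrative choices.

```python
import math
import random

def false_positive_rate(peek, n_sims=4000, batch=10, max_n=100, z_crit=1.96, seed=3):
    """Share of null experiments declared 'significant'.

    peek=True:  test after every batch and stop at the first |z| > z_crit.
    peek=False: test once, at the planned final sample size.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        while n < max_n:
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            z = total / math.sqrt(n)   # z-statistic for the mean, known sd = 1
            if (peek or n == max_n) and abs(z) > z_crit:
                rejections += 1
                break
    return rejections / n_sims

fixed_rate = false_positive_rate(peek=False)
peek_rate = false_positive_rate(peek=True)
print(f"single test at n=100:    {fixed_rate:.3f}")
print(f"stop at first |z|>1.96:  {peek_rate:.3f}")
```

The single planned test stays near the nominal 5%, while stopping at the first significant look pushes the rate well above it. A p-value computed with the stopping rule in mind would need a stricter threshold at each look, which is what pre-specified sequential designs provide.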

If you are concerned about getting enough evidence to make a useful inference, or with minimising false negative error rates then it can be justifiable to add fresh data after peeking. Maintenance of nominal maximal false positive error rates does not have to take precedence over other issues at all times.

Indeed, as Jeremy points out you can account for this by being careful in advance about the stopping rule and specifying in advance exactly how and when it will be applied. For example, one can design a trial of a drug so that one can peek at the results partway through and discontinue the trial if it appears that adverse results are happening or finish early if the drug shows sufficient promise early on. Both of these outcomes can be ethically demanded, the first in order to avoid actively doing harm with the drug and the second to avoid passively doing harm by withholding the drug from sick people who could benefit.

But as Rasmus notes, one seldom finds trials that are designed at the beginning to do this correctly. So while it is possible to do what Jeremy says, it seems to be rare in real life.

It is often assumed that ‘correct’ calculation of a P-value takes account of stopping rules, but that is not exactly true. The choice to take stopping rules into account is typically a choice not to condition on the sample size actually achieved. Frequentists prefer such ‘unconditional’ types of P-values in many circumstances, but the definition of a P-value as the probability of data at least as extreme, assuming the null hypothesis is true, allows other conditionings to be chosen. There are many circumstances where the conditional P-value is a useful index of evidence even if its use might lead to a higher than nominal maximal false positive error rate.

Jeremy: I take it that Bill sees that as a problem to be gotten rid of, at least that’s what his comment seems to be saying.

There is no correct calculation of a p-value: http://doingbayesiandataanalysis.blogspot.co.uk/2011/10/false-conclusions-in-false-positive.html

I think that’s basically right. There are so many subjective and unverifiable ways (see Gelman’s garden of forking paths) in which the stated p-value could be different from the “real” p-value in any real experiment that frequentists can essentially explain away any failures such as:

“Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size”

It doesn't matter if it's 36% or 3%, they'll always have an excuse ready to avoid admitting they're fundamentally wrong about anything.

Sure, that’s all you need, allow trying and trying again, being wrong with high probability.

Yes and no. If you simply switch frequentist significance tests to Bayesian significance tests and leave everything else about the way applied statisticians in the social sciences do their business unchanged, then there will be no major improvement.

But the way they’re doing their analysis is deeply un-Bayesian even if they had used Bayesian significance tests. In particular, if given some evidence E, you get P(theta|E) then every value of theta in the support of that distribution is consistent with E and should be admitted as a possibility. If you then do a significance test and conclude theta is greater than theta_0 for example, then you’ve basically contradicted your original calculation. You’re saying those values of theta are impossible when you have no new evidence which implies that.

In other words, P(theta|E) tells you what E has to say about theta. A significance test on top of that is effectively using E twice to further rule out possibilities for theta.

The correct Bayesian approach is to leave P(theta|E) as is and carry the entire distribution forward intact. The only time you can for example replace this with something like theta=0 is when P(theta|E) is so sharply peaked about theta =0 it’s effectively a delta function. This is NOT the same as setting theta=0 when it passes a significance test with theta=0 as the null. It’s radically different in fact.

The point is, if you stick with the sum/product rules of probability theory, or valid approximations to them, you’ll be fine. Current uses of significance tests do not do this even if you naively switch frequentist to analogous bayesian significance tests.

Do you have any examples of a good applied study that does use this correct Bayesian approach?

Bayesians today are often automatically doing it right. They just put distributions on everything, and propagate them intact through the analysis according to the equations of probability either analytically or by simulation.

The danger mostly comes from people who are used to writing frequentist papers where they constantly draw conclusions based on a quick hypothesis test (with quoted p-value) and then treat whichever hypothesis passes as true from then on. Or, for example, they do a test to see if some parameter equals zero, and if it passes they remove it from the model. If one of these people goes “Bayesian” by simply replacing frequentist tests with similar Bayesian ones, but otherwise doesn’t change anything, it’ll probably lead to crap. Every one of those “tests” introduces a kind of approximation error into the analysis, typically large, often even when it’s “Bayesian”.

For example, imagine this “Bayesian” test. You get a posterior for theta, and produce a 95% Credibility Interval. If theta=0 is in this interval you declare “we fail to reject theta=0” and then decide to remove theta from the model. This kind of “Bayesian” test may seem natural to a Frequentist turned Bayesian, but it’s crap in the sense that setting theta=0 isn’t a valid approximation to the sum/product rules usually.

That isn’t to say they couldn’t do something similar correctly if they were careful. For example, instead of a significance test that some theta=0, you could determine that if theta is within D of zero, then setting it equal to zero wouldn’t introduce much error. Then if Pr{|theta| less than D}=.999, it’s likely ok to set theta=0 and remove it from the model. The logic here is not that “there’s no statistically significant evidence theta differs from zero” but rather “theta is probably different from zero, but not so different that it numerically changes the answer I care about”. This is a valid approximation to the equations of probability theory because the posterior for theta is essentially a delta function about theta=0.
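The contrast between the two rules can be sketched with posterior draws. Everything below is hypothetical: the two "posteriors" are just normal samples standing in for MCMC output, and the tolerance `D` and the 0.999 cutoff are the illustrative values from the paragraph above.

```python
import random

def prob_within(draws, d):
    """Estimate Pr{|theta| < d} from posterior draws."""
    return sum(abs(t) < d for t in draws) / len(draws)

def interval_95(draws):
    """Central 95% interval from posterior draws (simple empirical quantiles)."""
    s = sorted(draws)
    return s[int(0.025 * len(s))], s[int(0.975 * len(s)) - 1]

rng = random.Random(4)
# Two made-up posteriors with the same center but very different spread.
tight = [rng.gauss(0.01, 0.02) for _ in range(20000)]
wide = [rng.gauss(0.01, 0.5) for _ in range(20000)]
D = 0.1   # assumed tolerance: |theta| below this does not change the answer we care about

for name, draws in (("tight", tight), ("wide", wide)):
    lo, hi = interval_95(draws)
    contains_zero = lo < 0.0 < hi
    print(f"{name}: 95% interval contains 0: {contains_zero}, "
          f"Pr(|theta| < D) = {prob_within(draws, D):.3f}")
```

Both intervals contain zero, so the "fail to reject theta=0" rule would drop theta in both cases. Only the tight posterior puts nearly all its mass within D of zero, which is the situation where setting theta=0 is a reasonable approximation.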

I’m not convinced that Bayesians are “automatically doing it right”. That’s why I was looking for a concrete example of a paper that you think “did it right”. A lot of these frequentist vs Bayesian debates happen in the abstract or on methods or philosophical grounds. I think it’d be productive to exactingly critique some of these doing-it-right Bayesian papers for a change.

Rahul, there’s nothing to be convinced of. This is not part of a Frequentist vs Bayesian debate. I didn’t say Bayesians aren’t making mistakes; I’m claiming they often are taking care of this one specific aspect correctly, namely that their work be based on the basic equations of probability theory. Many papers automatically are doing this correctly because they’re using Bayes’ theorem (derivable from the basic equations of probability theory) and using simulation to get the posterior or whatever without doing a bunch of “significance tests” to get intermediate conclusions.

The problem is many things that will look natural to a Frequentist turned “Bayesian” will not be equivalent to using the basic equations of probability theory, not even approximately.

Rahul, as long as journals are stuck on p-values as the gold standard, people aren’t going to use Bayesian approaches because they won’t be able to get their work published.

My own feeling is that one ought to be using decision analysis, since any outcome of a trial is going to involve decisions about actions to be taken, which in turn is going to depend on a loss function. In drug testing, the loss function might take into account, e.g., the known efficacy of drugs currently in routine use (about which much more will be known in many cases), costs, and many other factors that depend on the situation. Merely testing hypotheses is quite inadequate, in my view.

One wonders, however, whether journal editors would even be able to cope with papers submitted to them that took such an approach seriously.

I totally agree about the decision analysis bit. That’s what we really need.

It may depend on the journal. I would guess that few field-specific journals would be able to cope, so one might need to submit to a statistics journal (e.g., Annals of Applied Statistics or Technometrics), which then would (for many fields) reduce visibility in one’s field. But possibly “opinion piece with discussion” type articles in field-specific journals might pave the way for more acceptance in them.

I wrote this up in greater detail here:

http://www.bayesianphilosophy.com/biggest-mistake-bayesians-can-make/

It’s too bad the replicationistas in psych have come nowhere close to scratching the surface of their problems in becoming scientific: there is scant acknowledgment or criticism of the fact that they have no warrant for supposing statistically significant results are due to their “treatments”. (If this comes up in the current work, I’d be glad to hear about it.) At best, better (more self-critical) uses of statistics can help them in pinpointing the source of their problem, but in years of efforts, we don’t see any recognition of it; instead of correction we see a numbers game (and the psych people appear happy as clams at their poor showing!) If I weren’t already so skeptical of these psych effects (although I haven’t seen a list of what they’ve studied this round), and so dubious about the methodology being used by the replicationistas (e.g., how they determine the reported effects and power), I’d go on to mention that I find it incredible, and fittingly ironic, that the folks who know all about bias don’t consider the invitation to bias in the way they’re going about replications (even with “pre-registration”). People signing up to try and replicate my study can have no biases, no axes to grind, no interest in P-bashing (even though they aren’t using anything redolent of valid significance tests), nothing that unconsciously can seep into decisions about stopping and much else? Really? Non-significance is the new significance!

Mayo:

You seem to be angry but I don’t really see why. You use the term “replicationistas” which seems like it’s intended to be some sort of mockery or insult, but what I see is a bunch of serious scientists who took time out of their busy lives to carefully replicate 100 different published studies, in order to get an empirical handle on an important concern in science and science communication.

I think it would be a huge step forward if people recognized that published results are only provisional. You might say, sure, everybody knows this. But no, even in the NYT article today about this study, someone was quoted as saying, “with some theory-required adjustments, my original findings were in fact replicated.” Maybe. But I find it discouraging that researchers don’t like to admit that, just maybe, the effects they discovered and proudly published, are just things that happened in some small sample and do not represent general patterns in the population.

Prof,

That was a remarkable quote. The author seemed to suggest that her published result was *really* about Italians, so surveying American undergrads was not a replication. I highly doubt that her paper “owned” that limitation.

Kyle:

Yes, exactly. It would be interesting to track this one down and see. But it seems to follow the usual pattern, which is that the scientist’s theoretical model is flexible enough to allow for all sorts of potential interactions, which is how one can have a series of studies, each finding a different result, and all appearing to be part of a larger truth.

I see that Professor Mayo tracked down the abstract for her blog, and, predictably, the paper purported to test an evo-psych genetic theory about “mated” and “unmated” “men” and “women,” not about Italians or American undergrads or what have you. Of course, at that level of abstraction, no convenience sample will ever be good enough, and they can keep on testing and retesting that theory in low-power studies with minor forking refinements indefinitely.

Andrew: “Things may change if and when it becomes standard to report Bayesian inferences with informative priors, but as long as researchers are reporting selected statistically-significant comparisons…”

This suggests that the choice is either continued abuse of error statistical methods or to evaluate evidence and replication by Bayesian prior disbeliefs or the like, but this ignores the methodological critique that’s needed. If the question “has this study with this data been well run?” is going to turn on so and so’s beliefs in the effect, then even poorly supported results could be deemed replicable because the effect is believed in strongly, or the converse.

I’m not angry but highly allergic to not telling it like it is, when it comes to evidence and inference. It would be better if the “prior background” information consisted in the flaws, biases, and fallibilities of such studies, and a demand that researchers show how they’ve avoided the threats in their research. Instead we have more papering over, leaving unsaid or keeping shrouded under wraps what’s really going on (“shh, don’t admit serious doubts about these psych measurements, tests, treatments; don’t demand anyone put them to the test*, lest someone think we were saying their methodology was crossing over into pseudoscience”).

*In the questionable cases I’ve seen, there are ways, often easy ways, to reveal the presupposition of the test or measurement is just false. Most other domains expose such things.

Well, “serious scientists” should be spending time replicating pharma studies or physics studies or chemistry studies not crappy psych studies.

Psychology studies are too easy a target. Low hanging fruit. They are attacking methods using a cherry-picked weakest set of studies that uses these methods.

Rahul: I agree (and those serious sciences do spend time on replicating, but not like this), which is why there are dangers in drawing the kind of generalizable lessons some take from the low hanging fruit. It paves the way for the science violators in psych to continue their practices, without more stringent standards. Making the front pages for a poor replication showing actually raises their scientific credentials, in the eyes of many. They got into Science, didn’t they?

Mayo, you are making no sense at all. If psychologists try to replicate their studies, this paves the way for them to continue their practices of doing non-replicable studies?

Shravan: yes, because of a failure to pinpoint the causes of the problems. In fact, the replication report goes as far as saying that many of the non-significant results show the effects but to a smaller degree. They still assume that any observed difference is due to the “treatment”. There’s much else that is faulty in the replication methodology, and yet it will understandably be taken as exemplary for psych research.

Mayo, I suggest that maybe it’s worth it for you to spend a few years actually doing (and trying to replicate) psychology experiments, going from design to actually running them yourself and then analyzing the data. It’s not easy to understand that one cannot pinpoint the causes for variability in results under replication that easily.

You say: “…goes as far as saying that many of the non-significant results show the effects but to a smaller degree.” So what? If psychologists do NHST and make binary decisions to accept or reject based on p-values, statisticians get on their case (rightly). Now if they are looking at patterns of results under repeated sampling, it’s odd to turn around and complain about that too. I see the observation about seeing smaller effects under replication to be an instance of Andrew’s Type M error problem. Low power gave exaggerated effect sizes in the first instance, and the replication could not reproduce that (or maybe had higher power and got more realistic effects).

You also say: “There’s much else that is faulty in the replication methodology, and yet it will understandably be taken as exemplary for psych research.” What’s your alternative and can you show it works?

Sure, one can always do a better job. It’s easy to sit on the sidelines and criticize people actually trying to do something about the problems psych is facing. Why not get your hands dirty and show psychologists what the right way is to address scientific questions? Take a simple problem like the red-and-fertility stuff, or the power posing stuff, which you don’t need all that much specialist knowledge to study.

Shravan:

I think there’s no harm in psychologists trying to replicate their studies, although I’m not sure what’s the point: the fact that Psych studies are poorly replicable seems common knowledge. Ok, fine, they can reinforce that notion. Doesn’t hurt.

The key question is what they do once they re-discover this poor replicability of Psych studies. If they simply treat it as a meta-analysis and apply an averaging procedure (like Nosek seems to mention in his interview) then that would be pretty sad.

On the other hand, if this prompts them to go back to the drawing board and critically examine why there is so much non-repeatability in their studies and then try to fix those sources of uncontrolled variation (or actually to go in for less ambitious, less noisy studies), then that would be a happy outcome.

Yeah, let’s hope it ends up being the second scenario you spell out.

> If they simply treat it as a meta-analysis and apply an averaging procedure (like Nosek seems to mention in his interview) then that would be pretty sad.

I guess so – I am not sure why more has not been done here. I always saw the real purpose of a meta-analysis as an attempt to thoroughly investigate replication, with phrases like first assess replication and then assess uncertainties given this assessment. (The first sentence in the wiki entry for meta-analysis tries to get this across though probably does not succeed.)

If studies replicate they should be saying similar things about the unknown parameters via P(study1|a) ~ P(study2|a) which would be the assessment of the similarity between likelihoods or posteriors/priors (relative beliefs) say calculating rank correlations between. (Simple example here https://phaneron0.files.wordpress.com/2015/01/replicationplot.pdf )

The bigger problem here is that journals are a venue to make claims when all that is ever really justified is reporting what happened in your study and suggesting what it indicates – full stop.

One very positive development over the last 5 to 10 years is that most statisticians have stopped ignoring or deprecating meta-analysis and now a minority might actually teach some of their students something about it (perhaps thanks to the 8 schools example which happens to be a meta-analysis.)

@Shravan

Yeah, we can hope but I don’t think it will be the second scenario.

If this replication exercise ends up pushing people down the mindless-averaging path, then there’s a chance that we are worse off than we started?

If researchers develop an attitude that gross non-replicability is routine & can be addressed by averaging discordant estimates, I’m not sure that’s a good outcome.

I was hoping Nosek would be more embarrassed & concerned when he admitted his own study did not replicate.

Absolutely; but this replication attempt has been very informative because it provides us with a safe baseline. Precisely because these studies have no real world consequences, studying their replicability (or not) provides a useful baseline. I hope that someone will now take exactly the same studies Nosek et al chose, and try to replicate them again. Why stop at one attempt?

Aren’t there any better sets of safe *and* useful studies to replicate? Safety sounds like a red herring to justify the selection of psych.

An un-replicated drug study is already doing harm if it was wrong. It is the fallacy of safety by inaction.

I believe people are already studying useful studies too (See my link to the Scifri interview), on a much larger scale. And they are failing to replicate. So the problem seems systemic and not limited to useless studies of the sort that people like me do.

You know that Brian Nosek is a Psychologist, not a pharmaceutical chemist, right? And he specifically studies “the gap between values and practices – the difference between what is intended, desired, supposed to happen and what actually happens.” It should not surprise you that a serious scientist studying the gap between values and practices would look at the espoused values and actual practices of his own field as part of his broader research program.

Yes, but are people reading this as a commentary on Psych Research or on broader things?

They *should* read this as a comment on broader things. Listen to this interview on scifri:

http://www.sciencefriday.com/segment/08/28/2015/putting-scientific-research-to-the-test.html

> “serious scientists” should be spending time replicating pharma studies or physics studies or chemistry studies not crappy psych studies.

On that subject, “Let’s learn from mistakes” at RealClimate – http://www.realclimate.org/index.php/archives/2015/08/lets-learn-from-mistakes/ (I can’t recall where I got the link. Could have been here. Might have been Mike the Mad Biologist.)

Andrew,

Framing this replicability result in terms of Type M error, the fact that Nosek says (Science podcast) that the magnitude of the replicated effects was roughly half that of the original effect suggests, through some quick back-of-the-envelope simulations, that the power across these studies (assuming that in fact, mu != 0) is about 0.20. This is a much higher number than I expected; I was expecting something around the range of 0.05. But it is still pathetically low.
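For anyone who wants to redo that back-of-the-envelope check, here is a minimal sketch of a Type M simulation. It is not the commenter's actual code: it assumes a simple two-sided z-test with standard error 1, and the function name `type_m_ratio` and the list of power values are made up for illustration. For each assumed power it draws many estimates, keeps only the significant ones, and reports how much those overstate the true effect.

```python
import random
from statistics import NormalDist, mean

def type_m_ratio(power, alpha=0.05, n_sims=100_000, seed=1):
    """Return (simulated power, exaggeration ratio among significant results).

    Assumption: estimates are Normal(true_effect, 1), tested two-sided at alpha.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # True effect (in standard-error units) that gives roughly the requested
    # power, ignoring the small chance of significance in the wrong direction.
    true_effect = z_crit - nd.inv_cdf(1 - power)
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, 1.0) for _ in range(n_sims)]
    # Statistical significance filter: keep only |estimate| beyond the cutoff.
    significant = [abs(e) for e in estimates if abs(e) > z_crit]
    sim_power = len(significant) / n_sims
    return sim_power, mean(significant) / true_effect

for power in (0.10, 0.20, 0.50, 0.80):
    sim_power, ratio = type_m_ratio(power)
    print(f"power={power:.2f}  simulated power={sim_power:.2f}  exaggeration={ratio:.2f}")
```

Under these assumptions the exaggeration ratio at power 0.20 comes out a bit above 2, which is in line with replication effects being roughly half the size of the originals.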

A blog called the Replicability Index is computing something called “post-hoc power” and claims to adjust for the inflation due to Type M error (how can one do that unless one knows what the true power is—that’s what we are trying to get at in the first place), but his estimates are bizarrely high. Perhaps he needs to apply his method to the studies chosen by Nosek et al, compute his post-hoc power and compare them with the actually observed results for a reality check. My impression is that he has hugely overestimated power.

Brian Nosek says in his interview that non-reproducibility is inevitable because we have incomplete models. Nosek’s own study was found non-reproducible.

He seems to advocate some sort of averaging over the effect found in the two discordant studies.

I’m not sure how much I like this. Isn’t this like telling a freshman Chemistry student that it is ok for his repeat titration readings to not match, just use an average reading?

Some variation may be inevitable, but at the level we seem to see, I’d just want to discard both studies as crap.

Well, what he’s advocating (averaging) is a simplified version of meta-analysis, isn’t it? Why is meta-analysis bad? If we had 50 Brian Nosek studies, we could start looking at average effect sizes on the assumption that the studies are being generated by say Normal(\theta,\sigma^2).

The problem is probably compounded by different biases being introduced in each study. To take your chemistry example, the student may repeat his studies crappily, making different mistakes (e.g., different quantities of the reagent, or even the wrong reagent, wrong dilutions) each time, and in that case combining carelessly done studies, each introducing new sources of bias, is never going to give you a good estimate of the true value, whether by averaging or meta-analysis or whatever.

But if the chemistry student were to do careful studies repeatedly, where the sources of biases are minimized and random variability is the main source of differences in results, over many replications, wouldn’t the mean give a reasonable estimate? By assuming no bias (if I recall correctly, that assumption was made?), Nosek must be considering this scenario when he mentioned averaging.
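A rough sketch of the two scenarios, under assumptions that are mine and not Nosek's: careful studies are draws from Normal(theta, sigma^2), while sloppy studies each add their own bias drawn from a distribution whose mean is not zero. The function name `average_of_studies` and all the numbers are hypothetical.

```python
import random
from statistics import mean

def average_of_studies(theta, sigma, n_studies, bias_mean=0.0, bias_sd=0.0, seed=0):
    """Average the estimates from n_studies replications of a true value theta.

    Each study reports theta + (study-specific bias) + (random noise).
    With bias_mean = bias_sd = 0 this is the careful, unbiased scenario.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_studies):
        bias = rng.gauss(bias_mean, bias_sd)  # a new mistake in every study
        estimates.append(rng.gauss(theta + bias, sigma))
    return mean(estimates)

theta = 1.0
careful = average_of_studies(theta, sigma=0.5, n_studies=50)
sloppy = average_of_studies(theta, sigma=0.5, n_studies=50, bias_mean=0.8, bias_sd=0.5)
print(f"true value {theta:.2f}, careful average {careful:.2f}, sloppy average {sloppy:.2f}")
```

In this toy setup the careful average lands near the true value, while the sloppy average settles near theta plus the mean bias no matter how many studies are added. If the biases happened to average out to zero, the mean would still converge, just more slowly.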

PS In chemistry class back home in Delhi, I could never get the value that the textbook or the teacher said I should be getting. I was never too hot with pipettes and the like, though.

I think averaging is reasonable when replications come close. When replications are wide apart it makes more sense to search for the reasons behind the non-reproducibility and try to first identify them and then minimize them or control for them.

With typical Psych studies I think we are in the latter domain. These replications aren’t close enough to average over. GIGO.

“I think averaging is reasonable when replications come close. When replications are wide apart it makes more sense to search for the reasons behind the non-reproducibility and try to first identify them and then minimize them or control for them.”

This is probably easier to do in chemistry than in psych type stuff (identifying reasons).

“With typical Psych studies I think we are in the latter domain. These replications aren’t close enough to average over. GIGO.”

I have to reluctantly agree.

@Rahul

A lot of crappy studies can be information about what makes studies crap (e.g. systematic sources of variability).

Much of science is not just learning about nature but about the performance of research instruments and procedures.

Meta analysis and averaging throw away all this useful structural information. In my view it is a terrible use of expensive experimental data.

An example of what I mean is in the annex to this paper http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2496670

Don’t read the paper; just go to the annex and tell me: Should Claude Bernard have been averaging his results from over 10 experiments or using them to update a structural model of the outcome? Implicitly he did the latter. It is the smart thing to do.

+1

> Meta analysis and averaging throw away all this useful structural information.

I don’t see why meta-analysis is habitually thought of as ignoring useful structural information?

Badly done meta-analyses do this – but they are just badly done.

Shravan:

Yes. But Psych researchers, in my opinion, are taking the lazy, sloppy path to glory.

They conduct low-powered studies with noisy measurements and little control of confounders and try to answer questions that are far too ambitious. They try to draw bizarrely broad conclusions on shoestring budgets (undergrad surveys and Mechanical Turk).

They ought to take the slower, harder path of biting what they can chew: i.e. conduct smaller, more reasonable, more controlled experiments with well defined subject characteristics and less noisy measurements. These studies might then have a better chance of replication.

I know a lot of psychologists and the like, and work with them, and I don’t know anyone who I would describe as taking the “lazy, sloppy path to glory” (OK, maybe I know a few).

In my opinion, one major reason (not the only one) they (we) conduct low powered crappy studies is due to the way that statistics is taught in psychology and related disciplines. If they (we) can’t even understand what a p-value is, what a CI is, what a model assumption is and why it matters for inference, what hope is there of sitting down to do an experiment that makes some sense?

When I ask people why they run small sample sizes, I get answers like it would take too long to run a larger sample size. And psychologists have even published journal articles (yes, peer reviewed) in which they argue that a significant result is more convincing if it comes from a low-powered study. The education pipeline is broken; somehow the psychologists and linguists have developed their own bizarre conception of what statistics is and means, and they are running with it. Professors teaching the next generation are not helping at all because they pass on the mistakes they make to their students. This is all bad and I have no idea how this happened, but I don’t believe they sat down and came to the conclusion that it would just be easier to ignore statistical theory and take the lazy way to glory.

The replication attempt is at least a good first step, and better than the state we are in right now.

“somehow the psychologists and linguists have developed their own bizarre conception of what statistics is and means, and they are running with it.”

Not just them, it is everyone. The statistical methods widely used today are like these goggles:

https://www.youtube.com/watch?v=y91uHr-QW6A

“The education pipeline is broken; somehow the psychologists and linguists have developed their own bizarre conception of what statistics is and means, and they are running with it. Professors teaching the next generation are not helping at all because they pass on the mistakes they make to their students.”

This is what I sometimes call the “game of telephone” effect. It is analogous to the party game “telephone,” where people sit in a circle; one person whispers a phrase into the ear of the person next to them, who repeats the process to the other person next to them, and so on around the circle, until the last person says what they hear out loud and the first person reveals the actual initial phrase. The difference between input and outcome usually causes a good laugh for all.

Analogously, someone struggling to understand statistics hits on an oversimplification which makes them feel they understand when they don’t (or which they at least feel gives the idea well enough). They then pass this oversimplification on to someone else, who may add further oversimplification, and so on. We end up with vast misunderstandings of statistics, but ones that are appealing precisely because they are oversimplified. And these oversimplifications often become entrenched in textbooks.

I find Andrew’s take on this simply bizarre.

He says that “…the general results are no surprise but it’s good to see some hard numbers with real experiments.” What kind of sense does this make? I realize that there are a number of complexities in the assumptions one must make to predict how many “false positives” one would get in replicating a set of experiments which find statistically significant results and use .05 as the p value for significance. But under fairly reasonable if greatly simplistic sets of assumptions, if one were to replicate 100 such experiments, and those experiments were all proper replications, and the original experiments were themselves all selected from a properly conducted and properly reported larger set of experiments, etc., one would expect that only on the order of 5 or so wouldn’t replicate. Of course, the reality that the experiments were all different, they were selected for showing significant effects, the selection of them generally wasn’t random, etc., would alter the expected numbers somewhat — but, still, I can’t see any reason to believe that the expected number of failures would be vastly different from 5.

In fact, of course, over 60 failed to replicate. 60! This eye-popping result is “no surprise”?

And what is even the point of raising the minor tweaking in the numbers one might get from employing Bayesian techniques? How would any such change in approach get one from in the neighborhood of 5 failed replications up to over 60?

Forest for the trees, anyone?

Could Andrew perhaps explain why he doesn’t find the over 60 number surprising?

Candid:

1. As I wrote in my above post, given the statistical significance filter, we’d expect published estimates to be overestimates. And then there’s the garden of forking paths, which just makes things more so. It would be meaningless to try to obtain a general value for the “Edlin factor” but it’s gotta be less than 1, so of course exact replications should produce weaker evidence than claimed from the original studies. You can follow the links above for more details. But, in short, no, I would not expect that “only on the order of 5 or so wouldn’t replicate.” Not at all.

2. Employing Bayesian techniques is, in many examples, not a “minor tweaking” at all. If you take some of the studies with low signal and high noise, a Bayesian estimate with informative prior will be much much different than the raw-data estimate.
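To make point 2 concrete, here is a minimal sketch using the standard normal-normal conjugate update. The numbers are invented for illustration (a raw estimate two standard errors from zero, and a prior saying true effects are small); they do not come from any of the replicated studies, and `posterior_normal` is a hypothetical helper name.

```python
def posterior_normal(estimate, se, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean and sd for an effect with a Normal prior and Normal likelihood."""
    prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    # Precision-weighted average of the prior mean and the raw estimate.
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, post_var ** 0.5

# Assumed example: noisy study, estimate 0.6 with standard error 0.3 (z = 2),
# combined with an informative prior Normal(0, 0.1^2).
post_mean, post_sd = posterior_normal(0.6, 0.3, prior_mean=0.0, prior_sd=0.1)
print(f"raw estimate 0.60, posterior mean {post_mean:.2f}, posterior sd {post_sd:.3f}")
```

With these assumed inputs the posterior mean is 0.06, a tenth of the raw estimate, so the informative prior changes the answer a great deal when the signal is low and the noise is high.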

I’d like to see any kind of plausible model on which a switch to Bayesian approach would generate a figure close to 60 for expected number of failures in replication.

Let me be a little clearer about what kind of model I’d like to see.

I’d like to see a model under which the over 60 failures of replication using the standard approach become explicable because a Bayesian approach was not followed.

Candid:

See here. Filling in the Bayesian details is an exercise for the reader.

You must not be a regular reader, or you’d know why AG doesn’t think this is surprising. Here’s a primer. (Also, the p-value threshold for significance is not the probability of non-replication. Under the conditions you specify, one would expect that 12 studies or so wouldn’t replicate.)

Look, I realized that the .05 p value didn’t directly translate into the expected number of failures of replication, and the 12 figure you stipulate is not that surprising to me. The point I was making was about orders of magnitude.

But, of course, to see, instead of something like 12, a figure above 60, should be very surprising, right?

That’s the point.

“… one would expect that only on the order of 5 or so wouldn’t replicate.” It’s not an order-of-magnitude problem or an imperfections-in-real-experiments problem; it’s a problem in your understanding.

If you’re expecting something like this, you simply do not understand what a p < .05 threshold means.

Not at all.

A gross oversimplification can go like this:

We want to know the probability a published study that rejected a null hypothesis is a false positive and shouldn’t replicate. Something like:

p(H = 0 | p < 0.05)

Let's go Bayes over this, and say that

p(H = 0 | p < 0.05) = p(H = 0)*p(p < 0.05 | H=0) / p(p < 0.05)

Given all sorts of crazy theories psychologists are able to concoct, let's say that only 5% of all theories tested by NHST are true. So, p(H = 0) = 0.95.

We also know that, if the p-value is not the product of miscalculation or manipulation, it will usually follow a uniform distribution under the null hypothesis. So, p(p < 0.05 | H = 0) = 0.05.

The denominator is a little trickier. Marginalizing over H, we've got:

p(p < 0.05) = p(H = 0)*p(p < 0.05 | H=0) + p(H = 1)*p(p < 0.05 | H=1)

We have the values for the first term; how about the second one? Let's say all published findings have proper power, around 80% (hey, they've been published, after all). So, p(p < 0.05 | H = 1) = 0.8.

Plugging all values into Bayes' rule,

p(H = 0 | p < 0.05) = (0.95 * 0.05) / (0.95 * 0.05 + 0.05 * 0.8) =~ 0.54

Wow, more than 50%. Now, we could argue about the estimate that 95% of all theories are false. OK, maybe scientists are careful enough to make this number lower. But it's not easy to find studies with proper statistical power, and multiple comparisons and data dredging easily inflate type 1 errors.
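The same calculation as a small sketch, so the assumed inputs can be varied. The function name `prob_null_given_significant` is made up here, and the alternative values of the prior and the power are only illustrative guesses.

```python
def prob_null_given_significant(prior_true, power, alpha=0.05):
    """p(H = 0 | p < alpha) by Bayes' rule, treating hypotheses as simply true or false."""
    prior_null = 1.0 - prior_true
    false_pos = prior_null * alpha   # p(H = 0) * p(p < alpha | H = 0)
    true_pos = prior_true * power    # p(H = 1) * p(p < alpha | H = 1)
    return false_pos / (false_pos + true_pos)

# First row reproduces the numbers above; the others vary the assumptions.
for prior_true, power in [(0.05, 0.8), (0.05, 0.2), (0.5, 0.8)]:
    p = prob_null_given_significant(prior_true, power)
    print(f"p(H=1)={prior_true:.2f}, power={power:.1f} -> p(H=0 | p<0.05)={p:.2f}")
```

Keeping the 5% prior but dropping the power to 0.2 pushes the figure up to about 0.83, while a 50% prior with 80% power brings it down to about 0.06, so the answer depends heavily on both guesses.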

Erikson:

I like the spirit of your comment but I don’t think it’s a good idea to characterize these hypotheses as true or false; I think it’s better to think of the effects as being continuous and variable.

I guess you missed the link I gave to some of AG’s writing on the subject; it’s under the text: “Here’s a primer.” The 12 vs 5 thing is pure pedantry, which is why it’s a parenthetical comment.

The probability of a hypothesis being true given the evidence (data, non-frequency information, model assumptions) is something totally separate from the percentage of times false decisions are made about hypotheses. They are two totally different numbers, with totally different status, that serve totally different uses, and most importantly require two very different analyses to get at.

You can’t simply use one to predict the other. I know your Frequentist intuition screams at you “the probability in the single case = the frequency in multiple cases”, but it’s just not true most of the time. Frequentism, in case you hadn’t noticed, is a big ol’ steaming pile of horse poop.

The probability of a hypothesis given evidence is a measure of how strongly the evidence favors the hypothesis. In a sense, it tells you how much opportunity the given (partial) evidence allows for the hypothesis to be right. The vast majority of the time this number will not have a simple relationship to the percentage of mistakes made over multiple assessments.