This is what I sometimes call the “game of telephone” effect. It is analogous to the party game “telephone,” where people sit in a circle; one person whispers a phrase into the ear of the person next to them, who whispers what they heard to the next person, and so on around the circle, until the last person says what they heard out loud and the first person reveals the original phrase. The gap between input and outcome usually gets a good laugh from everyone.

Analogously, someone struggling to understand statistics hits on an oversimplification that makes them feel they understand when they don’t (or that they at least feel conveys the idea well enough). They then pass this oversimplification on to someone else, who may add further oversimplification, and so on. We end up with vast misunderstandings of statistics, misunderstandings that are appealing precisely because they are oversimplified. And these oversimplifications often become entrenched in textbooks.

Now, if there’s no effect, and you get a small effect, then who knows

Not just them, it is everyone. The statistical methods widely used today are like these goggles:

https://www.youtube.com/watch?v=y91uHr-QW6A

I don’t see why meta-analysis is habitually thought of as ignoring useful structural information?

Badly done meta-analyses do this – but they are just badly done.

Yeah, we can hope but I don’t think it will be the second scenario.

If this replication exercise ends up pushing people down the mindless-averaging path, then there’s a chance we end up worse off than when we started.

If researchers develop an attitude that gross non-replicability is routine & can be addressed by averaging discordant estimates, I’m not sure that’s a good outcome.

I was hoping Nosek would be more embarrassed & concerned when he admitted his own study did not replicate.

I guess so – I am not sure why more has not been done here. I always saw the real purpose of a meta-analysis as an attempt to thoroughly investigate replication, with goals like “first assess replication, and then assess uncertainties given this assessment.” (The first sentence in the wiki entry for meta-analysis tries to get this across, though it probably does not succeed.)

If studies replicate, they should be saying similar things about the unknown parameters, i.e. P(study1|a) ~ P(study2|a), which could be assessed via the similarity between the likelihoods or posteriors/priors (relative beliefs), say by calculating rank correlations between them. (Simple example here: https://phaneron0.files.wordpress.com/2015/01/replicationplot.pdf )
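A minimal sketch of that rank-correlation idea, with all numbers hypothetical: two studies are summarized as normal likelihoods for a common effect a, evaluated over a shared grid, and compared by Spearman rank correlation.

```python
import numpy as np

# Hypothetical study summaries: (point estimate, standard error) for effect a
study1 = (0.30, 0.10)
study2 = (0.25, 0.12)

# Evaluate each study's (normal) likelihood over a shared grid of effect values
grid = np.linspace(-1.0, 1.0, 401)

def likelihood(est, se, a):
    return np.exp(-0.5 * ((est - a) / se) ** 2)

L1 = likelihood(*study1, grid)
L2 = likelihood(*study2, grid)

def ranks(x):
    # positions of each value in sorted order; ties broken arbitrarily,
    # which is fine for a rough check
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

# Spearman rank correlation between the two likelihood curves:
# a value near 1 means the studies are saying similar things about a
rho = np.corrcoef(ranks(L1), ranks(L2))[0, 1]
```

With the two likelihoods peaked close together, rho comes out near 1; two discordant studies would drive it toward zero or below.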

The bigger problem here is that journals are a venue to make claims when all that is ever really justified is reporting what happened in your study and suggesting what it indicates – full stop.

One very positive development over the last 5 to 10 years is that most statisticians have stopped ignoring or deprecating meta-analysis, and a minority might now actually teach their students something about it (perhaps thanks to the 8 schools example, which happens to be a meta-analysis).

You can’t simply use one to predict the other. I know your Frequentist intuition screams at you “the probability in the single case = the frequency in multiple cases”, but it’s just not true most of the time. Frequentism, in case you hadn’t noticed, is a big ol’ steaming pile of horse poop.

The probability of a hypothesis given evidence is a measure of how strongly the evidence favors the hypothesis. In a sense, it tells you how much opportunity the given (partial) evidence allows for the hypothesis to be right. The vast majority of the time this number will not have a simple relationship to the percentage of mistakes made over multiple assessments.

I like the spirit of your comment but I don’t think it’s a good idea to characterize these hypotheses as true or false; I think it’s better to think of the effects as being continuous and variable.

A gross oversimplification can go like this:

We want to know the probability a published study that rejected a null hypothesis is a false positive and shouldn’t replicate. Something like:

p(H = 0 | p < 0.05)

Let's go Bayesian on this, and say that

p(H = 0 | p < 0.05) = p(H = 0)*p(p < 0.05 | H=0) / p(p < 0.05)

Given all sorts of crazy theories psychologists are able to concoct, let's say that only 5% of all theories tested by NHST are true. So, p(H = 0) = 0.95.

We also know that, if the p-value is not the product of miscalculation or manipulation, it will usually follow a uniform distribution under the null hypothesis. So, p(p < 0.05 | H = 0) = 0.05.
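That uniformity claim is easy to check by simulation (a sketch; the one-sample z-test setup and sample size are just illustrative choices):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# 20,000 simulated studies in which the null is exactly true:
# each computes a z statistic from n = 50 standard-normal observations
n_sims, n = 20_000, 50
z = rng.normal(0.0, 1.0, (n_sims, n)).mean(axis=1) * sqrt(n)

# Two-sided p-value under the normal distribution: p = 1 - erf(|z| / sqrt(2))
pvals = np.array([1 - erf(abs(zi) / sqrt(2)) for zi in z])

# Under the null the p-values are uniform, so about 5% fall below 0.05
print(round((pvals < 0.05).mean(), 2))  # ≈ 0.05
```

The same uniformity means any threshold alpha is crossed a fraction alpha of the time when the null holds.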

The denominator is a little trickier. Marginalizing over H, we've got: p(p < 0.05) = p(H = 0)*p(p < 0.05 | H=0) + p(H = 1)*p(p < 0.05 | H=1). We have the values for the first term; how about the second one? Let's say all published findings have proper power, around 80% (hey, they've been published, after all). So, p(p < 0.05 | H = 1) = 0.8.

Plugging all values in the bayes rule,

p(H = 0 | p < 0.05) = (0.95 * 0.05) / (0.95 * 0.05 + 0.05 * 0.8) =~ 0.54

Wow, over 50%. Now, we could argue about the estimate that 95% of all tested theories are false. OK, maybe scientists are careful enough to make this number lower. But it's not easy to find studies with proper statistical power, and multiple comparisons and data dredging easily inflate type 1 errors.
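The arithmetic above, spelled out with the same assumed numbers (95% of tested nulls true, 5% nominal false-positive rate, 80% power):

```python
# Assumed inputs from the comment above
p_h0 = 0.95            # prior probability the null is true
p_sig_given_h0 = 0.05  # P(p < .05 | H0): nominal false-positive rate
p_sig_given_h1 = 0.80  # P(p < .05 | H1): assumed power of published studies

# Bayes' rule: probability a significant result is a false positive
numer = p_h0 * p_sig_given_h0
denom = numer + (1 - p_h0) * p_sig_given_h1
posterior = numer / denom
print(round(posterior, 2))  # 0.54
```

Halving the prior to p(H = 0) = 0.90 still leaves the false-positive probability at about 0.36, so the conclusion is not very sensitive to that 95% guess.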

If you’re expecting something like this, you simply do not understand what a p < .05 threshold means.

See here. Filling in the Bayesian details is an exercise for the reader.

I’d like to see a model under which the over 60 failures of replication using the standard approach become explicable because a Bayesian approach was not followed.

But, of course, seeing a figure above 60 instead of something like 12 should be very surprising, right?

That’s the point.

1. As I wrote in my above post, given the statistical significance filter, we’d expect published estimates to be overestimates. And then there’s the garden of forking paths, which just makes things more so. It would be meaningless to try to obtain a general value for the “Edlin factor” but it’s gotta be less than 1, so of course exact replications should produce weaker evidence than claimed by the original studies. You can follow the links above for more details. But, in short, no, I would not expect that “only on the order of 5 or so wouldn’t replicate.” Not at all.

2. Employing Bayesian techniques is, in many examples, not a “minor tweaking” at all. If you take some of the studies with low signal and high noise, a Bayesian estimate with an informative prior will be very different from the raw-data estimate.
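The significance-filter point in item 1 can be sketched with a simulation. All the numbers here are assumed for illustration: a small true effect of 0.1 measured with standard error 1, with only the statistically significant estimates getting “published.”

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: small true effect, noisy measurement
true_effect, se, n_sims = 0.1, 1.0, 100_000
estimates = rng.normal(true_effect, se, n_sims)

# The filter: only statistically significant estimates get published
published = estimates[np.abs(estimates / se) > 1.96]

# Published magnitudes vastly overstate the true effect, so the
# appropriate "Edlin factor" to deflate them is well below 1
exaggeration = np.abs(published).mean() / true_effect
print(round(exaggeration, 1))  # far above 1
```

In this low-power regime the published magnitudes are roughly twenty times the true effect, which is why an exact replication should be expected to look much weaker than the original claim.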

In my opinion, one major reason (not the only one) they (we) conduct low powered crappy studies is due to the way that statistics is taught in psychology and related disciplines. If they (we) can’t even understand what a p-value is, what a CI is, what a model assumption is and why it matters for inference, what hope is there of sitting down to do an experiment that makes some sense?

When I ask people why they run small sample sizes, I get answers like it would take too long to run a larger sample size. And psychologists have even published journal articles (yes, peer reviewed) in which they argue that a significant result is more convincing if it comes from a low-powered study. The education pipeline is broken; somehow the psychologists and linguists have developed their own bizarre conception of what statistics is and means, and they are running with it. Professors teaching the next generation are not helping at all because they pass on the mistakes they make to their students. This is all bad and I have no idea how this happened, but I don’t believe they sat down and came to the conclusion that it would just be easier to ignore statistical theory and take the lazy way to glory.

The replication attempt is at least a good first step, and better than the state we are in right now.

He says that “…the general results are no surprise but it’s good to see some hard numbers with real experiments.” What kind of sense does this make? I realize that there are a number of complexities in the assumptions one must make to predict how many “false positives” one would get in replicating a set of experiments that found statistically significant results using .05 as the significance threshold. But under fairly reasonable if greatly simplified sets of assumptions, if one were to replicate 100 such experiments, and those experiments were all proper replications, and the original experiments were themselves all selected from a properly conducted and properly reported larger set of experiments, etc., one would expect that only on the order of 5 or so wouldn’t replicate. Of course, the reality that the experiments were all different, that they were selected for showing significant effects, that the selection of them generally wasn’t random, etc., would alter the expected numbers somewhat; but, still, I can’t see any reason to believe that the expected number of failures would be vastly different from 5.

In fact, of course, over 60 failed to replicate. 60! This eye-popping result is “no surprise”?

And what is even the point of raising the minor tweaking in the numbers one might get from employing Bayesian techniques? How would any such change in approach get one from in the neighborhood of 5 failed replications up to over 60?

Forest for the trees, anyone?

Could Andrew perhaps explain why he doesn’t find the over 60 number surprising?

“This is probably easier to do in chemistry than in psych type stuff (identifying reasons).”

Yes. But Psych researchers, in my opinion, are taking the lazy, sloppy path to glory.

They conduct low-powered studies with noisy measurements and little control of confounders and try to answer questions that are far too ambitious. They try to draw bizarrely broad conclusions on shoe-string budgets (undergrad surveys and Mechanical Turk).

They ought to take the slower, harder path of biting what they can chew: i.e. conduct smaller, more reasonable, more controlled experiments with well defined subject characteristics and less noisy measurements. These studies might then have a better chance of replication.

I think there’s no harm in psychologists trying to replicate their studies, although I’m not sure what the point is: the fact that Psych studies are poorly replicable seems to be common knowledge. OK, fine, they can reinforce that notion. Doesn’t hurt.

The key question is what they do once they re-discover this poor replicability of Psych studies. If they simply treat it as a meta-analysis and apply an averaging procedure (like Nosek seems to mention in his interview) then that would be pretty sad.

On the other hand, if this prompts them to go back to the drawing board and critically examine why there is so much non-repeatability in their studies and then try to fix those sources of uncontrolled variation (or actually to go in for less ambitious, less noisy studies), then that would be a happy outcome.

You say: “…goes as far as saying that many of the non-significant results show the effects but to a smaller degree.” So what? If psychologists do NHST and make binary decisions to accept or reject based on p-values, statisticians get on their case (rightly). Now if they are looking at patterns of results under repeated sampling, it’s odd to turn around and complain about that too. I see the observation of smaller effects under replication as an instance of Andrew’s Type M error problem. Low power gave exaggerated effect sizes in the first instance, and the replication could not reproduce that (or maybe had higher power and got more realistic effects).

You also say: “There’s much else that is faulty in the replication methodology, and yet it will understandably be taken as exemplary for psych research.” What’s your alternative and can you show it works?

Sure, one can always do a better job. It’s easy to sit on the sidelines and criticize people actually trying to do something about the problems psych is facing. Why not get your hands dirty and show psychologists what the right way is to address scientific questions? Take a simple problem like the red-and-fertility stuff, or the power posing stuff, which you don’t need all that much specialist knowledge to study.

On that subject, “Let’s learn from mistakes” at RealClimate – http://www.realclimate.org/index.php/archives/2015/08/lets-learn-from-mistakes/ (I can’t recall where I got the link. Could have been here. Might have been Mike the Mad Biologist.)

http://www.bayesianphilosophy.com/biggest-mistake-bayesians-can-make/

My own feeling is that one ought to be using decision analysis, since any outcome of a trial is going to involve decisions about actions to be taken, which in turn is going to depend on a loss function. In drug testing, the loss function might take into account, e.g., the known efficacy of drugs currently in routine use (about which much more will be known in many cases), costs, and many other factors that depend on the situation. Merely testing hypotheses is quite inadequate, in my view.

One wonders, however, whether journal editors would even be able to cope with paper submitted to them that took such an approach seriously.

But as Rasmus notes, one seldom finds trials that are designed at the beginning to do this correctly. So while it is possible to do what Jeremy says, it seems to be rare in real life.

A lot of crappy studies can provide information about what makes studies crap (e.g. systematic sources of variability).

Much of science is not just learning about nature but about the performance of research instruments and procedures.

Meta-analysis and averaging throw away all this useful structural information. In my view it is a terrible use of expensive experimental data.

An example of what I mean is in the annex to this paper http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2496670

Don’t read the paper; just go to the annex and tell me: should Claude Bernard have been averaging his results from over 10 experiments, or using them to update a structural model of the outcome? Implicitly he did the latter. It is the smart thing to do.

“A p-value is simply the probability of getting a result at least as extreme as the one you saw if your hypothesis is false.”

http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/

Maybe we can start calling it YHST instead of NHST: Your Hypothesis Significance Testing.

This is probably easier to do in chemistry than in psych type stuff (identifying reasons).

“With typical Psych studies I think we are in the latter domain. These replications aren’t close enough to average over. GIGO.”

I have to reluctantly agree.

With typical Psych studies I think we are in the latter domain. These replications aren’t close enough to average over. GIGO.

The problem is probably compounded by different biases being introduced in each study. To take your chemistry example, the student may repeat his studies crappily, making different mistakes (e.g., different quantities of the reagent, or even the wrong reagent, or wrong dilutions) each time, and in that case taking the average of carelessly done studies, each introducing new sources of bias, is never going to give you a good estimate of the true value by averaging or meta-analysis or whatever.

But if the chemistry student were to do careful studies repeatedly, where the sources of biases are minimized and random variability is the main source of differences in results, over many replications, wouldn’t the mean give a reasonable estimate? By assuming no bias (if I recall correctly, that assumption was made?), Nosek must be considering this scenario when he mentioned averaging.

PS In chemistry class back home in Delhi, I could never get the value that the textbook or the teacher said I should be getting. I was never too hot with pipettes and the like, though.

He seems to advocate some sort of averaging over the effect found in the two discordant studies.

I’m not sure how much I like this. Isn’t this like telling a freshman Chemistry student that it is ok for his repeat titration readings to not match, just use an average reading?

Some variation may be inevitable, but at the level we seem to see, I’d just want to discard both studies as crap.

An un-replicated drug study is already doing harm if it was wrong. It is the fallacy of safety by inaction.

http://www.sciencefriday.com/segment/08/28/2015/putting-scientific-research-to-the-test.html

The problem is many things that will look natural to a Frequentist turned “Bayesian” will not be equivalent to using the basic equations of probability theory, not even approximately.

Framing this replicability result in terms of Type M error, the fact that Nosek says (Science podcast) that the magnitude of the replicated effects was roughly half that of the original effect suggests, through some quick back-of-the-envelope simulations, that the power across these studies (assuming that in fact, mu != 0) is about 0.20. This is a much higher number than I expected; I was expecting something around the range of 0.05. But it is still pathetically low.

A blog called the Replicability Index is computing something called “post-hoc power” and claims to adjust for the inflation due to Type M error (how can one do that unless one knows what the true power is—that’s what we are trying to get at in the first place), but his estimates are bizarrely high. Perhaps he needs to apply his method to the studies chosen by Nosek et al, compute his post-hoc power and compare them with the actually observed results for a reality check. My impression is that he has hugely overestimated power.
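A rough version of that back-of-the-envelope calculation, under a simple normal model. The true effect (1.28 in standard-error units) is a hand-picked assumption chosen so that significant estimates come out about double the truth; the resulting power of roughly 0.25 lands in the same low neighborhood as the 0.20 figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed normal model: true effect measured in standard-error units,
# chosen so that the Type M exaggeration factor is about 2
true_effect, n_sims = 1.28, 200_000
estimates = rng.normal(true_effect, 1.0, n_sims)
sig = estimates[np.abs(estimates) > 1.96]

power = len(sig) / n_sims                     # ≈ 0.25
shrinkage = true_effect / np.abs(sig).mean()  # replications ≈ half the original
print(round(power, 2), round(shrinkage, 2))
```

Reading the simulation backwards: if replications recover about half the original magnitudes, the original studies were operating at power somewhere in the low twenties of percent, consistent with the comment's estimate.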
