## What to make of reported statistical analysis summaries: Hear no distinction, see no ensembles, speak of no non-random error.

Recently there has been a lot of fuss about the inappropriate interpretations and uses of p-values, significance tests, Bayes factors, confidence intervals, credible intervals and almost anything anyone has ever thought of. That is to desperately discern what to make of reported statistical analysis summaries of individual studies –  largely on their own. Including a credible quantification of the uncertainties involved. Immediately after a study has been completed, or soon after – by the very experimenters who were involved in carrying it out. Perhaps along with consultants or collaborators with hopefully somewhat more statistical experience. So creators, perpetrators, evaluators, jurors and judges all biased to a hopeful sentence of many citations and continued career progression.

Three things that do not seem to be getting adequate emphasis in these discussions of what to make of reported statistical analysis summaries are – 1. failing to distinguish what something is versus what to make of it, 2. ignoring the ensemble of similar studies (completed, ongoing and future) and 3. neglecting important non-random errors. This does seem to be driven by academic culture and so it won’t be easy to change. As Nazi Reich Marshal Hermann Goring once quipped? “Whenever I hear the word culture, I want to reach for my pistol!”.

What is meant by “what to make of” a reported statistical analysis summary, its upshot or how it should affect our future actions and thinking, as opposed to simply what it is? CS Peirce called this the pragmatic grade of clarity of a concept. To him it was the third grade, one that needed to be preceded by two other grades: the ability to recognise instances of a concept and the ability to define it. For instance with regard to p-values: the ability to recognise what is or is not a p-value, the ability to define a p-value and the ability to know what to make of a p-value in a given study. It’s the third that is primary and paramount to “enabling researchers to be less misled by the observations” and thereby discern what to make, for instance, of a p-value. Importantly, it also always remains open ended.

A helpful quote from Peirce might be “. . . there are three grades of clearness in our apprehensions of the meanings of words. The first consists in the connexion of the word with familiar experience. . . . The second grade consists in the abstract definition, depending upon an analysis of just what it is that makes the word applicable. . . . The third grade of clearness consists in such a representation of the idea that fruitful reasoning can be made to turn upon it, and that it can be applied to the resolution of difficult practical problems.” (CP 3.457, 1897)

Now almost all the teaching in statistics is about the first two, and much (most) of the practice of statistics skips over the third with the usual: this is the p-value in your study and don’t forget its actual definition. If you do, people will have the right to laugh at you. But all the fuss here is or should be about what should be made of this p-value, or other statistical analysis summary. How should it affect our future actions and thinking? Again, that will always remain open ended.

Additionally, ignoring the ensemble of similar studies makes that task unduly hazardous (except in emergency or ethical situations where multiple studies are not possible or can’t be waited for). So why are most statistical discussions framed with reference to a single solitary study, with the expectation that, if done adequately, one should be able to discern what to make of it and adequately quantify the uncertainties involved? Why, why, why? As Mosteller and Tukey put it in their chapter “Hunting Out the Real Uncertainty” of Data Analysis and Regression way back in 1977 – you don’t even have access to the real uncertainty with just a single study.

Unfortunately, when many do consider the ensemble (i.e. do a meta-analysis) they almost exclusively obsess about combining studies to get more power, paying not much more than lip service to assessing the real uncertainty (e.g. doing a horribly underpowered test of heterogeneity or thinking a random effect will adequately soak up all the real differences). Initially the first or second sentence of the wiki entry on meta-analysis was roughly “meta-analysis has the capacity to contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies.” That progressively got moved further and further down and prefaced with “In addition to providing an estimate of the unknown common truth” (i.e. in addition to this amazing offer you will also receive…). Why in the world would you want an estimate of the unknown common truth without some credible assessment that things are common?
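To make the "more power, little real uncertainty" worry concrete, here is a minimal sketch of the usual inverse-variance summary alongside Cochran's Q and the DerSimonian-Laird moment estimate of between-study variance. The four study estimates and standard errors are made up for illustration, and the function name `pooled_summary` is hypothetical; this is not a recommended analysis, just the standard textbook arithmetic.

```python
import math

def pooled_summary(estimates, std_errors):
    """Inverse-variance pooled estimate with Cochran's Q, I^2 and DL tau^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # DerSimonian-Laird moment estimate of between-study variance (floored at 0).
    c = total_w - sum(w ** 2 for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": math.sqrt(1.0 / total_w),
        "Q": q,
        "df": df,
        "tau2": tau2,
        "I2": i2,
    }

# Hypothetical studies: estimates range from 0.1 to 0.7, all with SE 0.2.
estimates = [0.10, 0.30, 0.50, 0.70]
std_errors = [0.20, 0.20, 0.20, 0.20]
summary = pooled_summary(estimates, std_errors)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these invented numbers the pooled estimate is about 0.4 with a standard error of about 0.1 (so "significantly different from zero"), while Q is about 5 on 3 degrees of freedom, below the 5% chi-square critical value of 7.815. The test fails to flag heterogeneity even though the study estimates differ sevenfold, which is the sense in which it is underpowered with few studies.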

Though perhaps most critical of all, not considering important non-random error in discerning what to make of a p-value or other statistical analysis summary makes no sense. Perhaps the systematic error (e.g. confounding) is much larger than the random error. Maybe the random error is so small relative to the systematic error that it can be safely ignored (i.e. no need to even calculate a p-value)?
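A rough way to see how systematic error can swamp random error: assume (purely for illustration) that an estimate is normally distributed around the truth plus a fixed bias, and ask how often the nominal 95% interval still covers the truth as the standard error shrinks. The bias value and the function names below are assumptions of this sketch, not anything from a real study.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ci_coverage(bias, std_error, z=1.96):
    """Actual coverage of estimate +/- z*SE when estimate ~ Normal(truth + bias, SE^2)."""
    shift = bias / std_error
    return normal_cdf(z - shift) - normal_cdf(-z - shift)

# Hypothetical fixed bias of 0.1 (e.g. from confounding); SE shrinks as n grows.
bias = 0.1
results = [(se, ci_coverage(bias, se)) for se in (0.2, 0.1, 0.05, 0.02)]
for se, coverage in results:
    print(f"SE={se:.2f}  bias/SE={bias / se:.1f}  coverage={coverage:.3f}")
```

Under these assumptions, coverage is roughly 83% when the bias equals the standard error, roughly 48% when it is twice the standard error, and well under 1% when it is five times the standard error. More data shrinks the random error but leaves the bias untouched, so the interval becomes more precisely wrong.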

Earlier I admitted these oversights are culturally driven in academia and reaching for one’s pistol is almost never a good idea. Academics really want (or even feel they absolutely need) to make something out of an individual study on its own (especially if it’s theirs). Systematic errors are just hard to deal with adequately for most statisticians and usually require domain knowledge statisticians and even study authors won’t have. Publicly discerning what to make of a p-value or other statistical analysis summary is simply too risky. It is open ended and in some sense you will always fall short and others might laugh at you.

Too bad always being wrong (in some sense) seems so wrong.

1. Marcus says:

My co-authors and I are currently fighting a meta-analytic battle that gets at some of what you describe here. We focus on a widely studied behavioral construct in our field (we have over 200 individual studies and a total sample size of around 60,000 on this construct) that is widely believed to have a universally large positive effect. Our data strongly suggests that the “large positive effect” observed in many published papers is primarily due to a powerful design confound and that any remaining positive effect is highly variable across domains in a fairly predictable manner. We’ve spent around 5 years trying to get this published but the same journals who published many of our source articles have not been convinced by our data. Reviewer comments have varied but one common refrain has been that our implied critique ignores the fact that the “average” effect is still significantly different from zero. We are getting to the point where we are unsure if this beast will ever get published.

• Solomon Kurz says:

That sounds infuriating. What are your thoughts on preprints?

• Anonymous says:

+1

5 years of possible progress wasted, let’s not make it 6 :)

• Marcus says:

Difficult in our area. Many of our journals will only consider submissions if the paper is not already available elsewhere, so posting our results online would make it impossible to ever “publish” the paper in a journal. Unfortunately all three of the authors are not yet tenured and all three of us work at universities where publishing in journals is considered the only marker of research productivity.
Another depressing experience with this paper is that we take a strong inference approach to testing competing theories and we’ve had reviewers tell us that they’ve never heard of this approach.

• Martha (Smith) says:

So sad. I come from a field (math) where preprints were normal from way back. You got a result, wrote it up, submitted it for publication, and at the same time sent (by snail mail) preprints to people you thought might be interested. (I remember when preprints were mimeographed!) You talked about it at conferences, and people asked for preprints, which you obligingly sent. Or someone heard about it from someone who had seen a preprint, so would write to ask you for one. You might publish a “research announcement” in a journal that had a section for announcing results not yet published – and people might ask for a preprint in response to that. A nice side effect: people receiving preprints might spot errors, which allowed you to correct or withdraw the paper before publication. Seems like a really good process to me – but “That’s the way we’ve always done it” is a really strong force against adopting improvements.

• Keith O'Rourke says:

Marcus

> “average” effect is still significantly different from zero

That relates to “thinking a random effect will adequately soak up all the real differences” but also not realizing a credible assessment that things are common needs to be first and foremost. If that assessment suggests what’s being studied in the different studies is different (and there is not an identified subset where it is similar) – a reasonable argument would be we can’t make sense of these studies and a reboot is needed.

However, if you think what is causing the variation is biological (or a real treatment interaction) then a random effect analysis with the average taken as a rough estimate of the center of such differing effects – can make sense. But if the variation is methodological (i.e. confounding which you said it was) then this makes little sense.

Some argue you can assume the confounding effects are symmetric about zero (then it would be sensible to treat them as random effects) but that to me is just wishful thinking. Now perhaps there could be some interest in the upper and lower bounds to give some sense of the range of un-confounded effects but that too, I think, is wishful thinking. At some point statisticians have to give up trying to revive a bunch of mostly dead studies and just pronounce all of them – multiple deaths. They all get buried as there is no credible way to assess which ones aren’t really dead.

> unsure if this beast will ever get published.
You just have to persevere – perhaps putting online somewhere.

There has been this strange push back on meta-analysis over time and notably by statisticians.

My first paper on meta-analysis https://www.ncbi.nlm.nih.gov/pubmed/3300460 was held up in review for a year and a half and it was the first paper the journal paid statistical consultants to review. L’Abbe and I were graduate students in Epidemiology and Biostatistics respectively at the time at University of Toronto and none of our faculty or fellow grad students were impressed – quite the opposite.

A couple years after the paper, I attended a meeting with the Biostatistics grad students about coming up with a talk for the Epidemiology/Biostatistics research day. The idea was to find someone who, with a bit of statistical knowledge, had really blown something because they did not have enough statistical knowledge. Then they stopped talking to me.

About a week before the talk, someone leaked to me that “Penny Brasher and Larry Wasserman were doing the talk on something they found wrong in one of my papers – sorry can’t say what – but they have run it by most of the faculty and it looks bad.” Then I find out that that’s true and regarding the L’Abbe paper but I’ll be given an abstract one hour before the talk begins – out of a sense of fairness. Rob Tibshirani could not attend the talk as he was faculty but he sat outside the room so he could find out how it went right afterwards.

L’Abbe and I found the talk amusing as it claimed to have a technical basis for claiming meta-analysis should not be attempted (publication selection effects were strictly not identified) that we had already – to some extent – countered in the published paper. (Larry claimed afterwards that he left reading the paper to Penny). Long story, and now some famous names in statistics – but the point is sometimes the discipline does not get what you get (even the really talented among them) – and you just have to persevere.

2. Thanatos Savehn says:

Marcus,

I see your issue come up all the time in what I do. Expert witnesses wave a bunch of serially flawed studies and argue that whatever their flaws the “average effect is significantly different from zero”. Judges, like the reviewers you keep running into, often assume that this represents powerful evidence of causation. Their thinking goes like this:

“Defendant’s Action either did or did not cause Bad Thing (invoking the law of the excluded middle). This issue has been studied many times and the overall average finding makes highly dubious the claim that Actions like Defendant’s do not cause Bad Thing. Therefore the only alternative, that Actions like Defendant’s Action do cause Bad Thing, is the obvious conclusion.”

But it seems to me that in cases of natural variability in response to some treatment the law of the excluded middle either doesn’t apply (because in some individuals it may be causative and in others not and so both does and does not cause cancer in the cohort) or is nonsensical – “either X causes the average to be number Z or it causes the average to be number not Z” – in that 2 and 2 are no more the “cause” of 4 than 2 and 3 of 5. My point is that when classical logic bumped into statistical inference the result was not a chocolate peanut butter cup.

If this makes sense (and again I’m just a Stats fanboi desperate to learn more because my profession has become thoroughly enthralled by statistical inference and is currently in the process of enshrining into law all the misunderstandings of p-values, confidence intervals, etc), and if anyone knows of a good philosophical treatment of reasoning about causation from statistical analyses without appeals to the modus tollens, etc. I’d be grateful.

• Terry says:

Are you comfortable being more specific about what your “profession” is? Medical malpractice law?

• Thanatos Savehn says:

Sure. I do civil litigation but no Med/Mal. Sometimes it’s commercial litigation (see example below in response to Daniel) and sometimes it’s mass torts litigation in which each side loads its respective scale with p<0.05 epi studies and the jury is left to weigh them (and it's an appalling spectacle).

• I’ve done some work on forensics in engineering and would be interested to work with you on some of these issues. See my website http://www.lakelandappliedsciences.com for contact info. I think codifying bad stats into law could be extremely damaging.

• Thanatos Savehn says:

Thanks. I may well. I assume there’s a c.v. there. I have one unfolding matter involving, of all things, the inspection paradox. Even very sophisticated clients don’t get these things. Another client is accumulating 3+ petabytes of data per day on its operations and sending it to a third party which then models things like time to failure of pumps, the implications of operations and maintenance changes, etc. and does the same for several other owners of these vast facilities. A Bad Thing happened and the question has become “Given this communal data lake, was the Bad Thing foreseeable and if so who had the duty to foresee it – the lake owner, the plant owner, or all of the above?” And think of all the other questions that arise from this set up. Who owns the inferences to be drawn from the data? How should those inferences be made? If I look out over my lake and, thanks to my model, infer that Plant A is about to experience a serious upset, may I short their stock? What a world, eh?

• Martha (Smith) says:

“But it seems to me that in cases of natural variability in response to some treatment the law of the excluded middle either doesn’t apply (because in some individuals it may be causative and in others not and so both does and does not cause cancer in the cohort) or is nonsensical – “either X causes the average to be number Z or it causes the average to be number not Z” – in that 2 and 2 are no more the “cause” of 4 than 2 and 3 of 5. My point is that when classical logic bumped into statistical inference the result was not a chocolate peanut butter cup. “

Yes, yes, yes!

• Keith O'Rourke says:

Thanatos: With regard to the legal/philo stuff a few suggestions.

First you are aware of the discussions by Sander Greenland and others on Mayo’s blog having commented there (e.g. I have paid the Piper and my need has dissipated.) Now, I think you are looking for what was once said of John Dewey – bring the best of philo to bear on practical problems rather than turn practical problems into philo problems. My take on Mayo is that she is drawing on philo to set out a meta-statistics (in general what should and should not be done in statistics, initially banning any role for priors).

Unfortunately, I am not aware of a philosopher that is knowledgeable enough about applying statistics in real world problems to play that John Dewey role. They stick with toy problems as for instances rather than real inquiries. Now I find Susan Haack’s work very informative, especially about CS Peirce but I have read little of her legal stuff.

Things might be different in the Canadian legal system but the last time I talked to someone with a multi-million dollar budget to set up an epidemiological unit to support legal work they seemed to conclude that the better epi would not have the likely impact in court – to be worth it. My sense is that in science, we cause to understand, in religion cause to believe, in politics cause to follow and in the courts cause to accept a version of events and the legal implications of that. So without the right impact on judge/jurors …

I want to address Marcus’ and others’ comments later, but quickly: Sander Greenland and I do have arguments against taking the average as being of any interest to anyone in the Meta-Analysis chapter of Modern Epidemiology (Rothman, Lash and Greenland).

• Keith O'Rourke says:

Here’s the relevant snippet from the chapter

“Because many factors (both methodologic and biologic) that could affect relative risk estimates will vary across studies, the homogeneity assumption is at best a convenient fiction. Similarly, the variation in these factors will tend to be systematic, making a random distribution for effects a fiction, and any measure of average or spread of relative risks potentially misleading.”

• Martha (Smith) says:

+1

So many people ignore (probably have never been taught) the crucial role of model assumptions in statistical inference — so they think that meta-analysis is a magic cure for between-study variability of results, blithely ignoring that meta-analysis methods themselves depend on model assumptions.

• Thanatos Savehn says:

Many thanks. Yes, that was me on ErrorStatistics but somehow an additional post didn’t make it through the filters. It made two points I’ll briefly set out here.

First, I understand that all models are wrong but some are useful, that (via Popper, paraphrasing here) all we really know is what isn’t so, and that science is like sailing beyond the edge of the map; tacking back and forth towards guesstimates of where discoveries might lie. However noble the quest may be it is of no interest to the courts. They are constituted to resolve disputes once and for all. That’s why matters are concluded with a Final Judgment. Executed prisoners cannot be unexecuted and multimillion dollar payments need not be returned even if, as in the case of bone marrow treatments denied for breast cancer patients, it is later demonstrated repeatedly that the treatment would have had no positive effect on survival (and indeed would have hastened death).

Because Final Judgments after exhausted appeals are truly Final and because the point of such judgments is to end disputes and to convince the public that justice has been done there’s a strong desire to generate an aura of certainty about the outcome and the evidence on which it was based. Thus the appeal of “significant”, “confidence” and “power” given how they are defined in Black’s Law dictionary (which, as you have already guessed, is not at all how you good people use them). Adding into the mix the enormous amount of data available in any modern litigation and the inability of the courts to understand that HARKing does not generate knowledge I came to the conclusion that the p<0.005 proposal of Valen Johnson et al was sensible in such circumstances as it at least makes (or so I imagine) it harder for hired p-miners to find nuggets of admissible evidence in caverns of noise – and that was the point I was trying to make – if we're going to hang a man for generating the wrong p-value let's at least make sure it's a moderately difficult one to generate.

Second, the Rules of Evidence outlaw speculation. That causes the following perversity. Plaintiff theorizes that hexa-methyl-scary-stuff causes glioblastoma multiforme (GM) and sues. His lawyer gets a \$750/hr epidemiologist to opine that after doing his own meta-analysis (with STATA! which some judges think of as akin to a Palantir) he has identified a previously unknown pattern in the GM epi literature: hexa-methyl-scary-stuff and proxies for exposure to it are statistically significantly associated with GM. I, Thanatos Savehn, want to demonstrate that lots of things have been associated with GM and, using the same methodology, I can find many more spurious associations. But guess what? It's not admissible to prove the existence of possible alternative causes unless I can find an expert willing to state to a reasonable degree of scientific certainty/probability that e.g. eating jalapeno cheese is not just associated with GM but is in fact a cause of GM. Thus defendants are often put in the untenable position of either letting the jury hear that there's only one known cause of GM, that plaintiff was exposed to it and then trying to argue that for some reason the only cause wasn't really the legal cause, or selling out and hiring people who will get on the misinterpretation bandwagon and swear that p<0.05 is the measure of truth and they've found it in jalapeno cheese. I have always done the former (and focused on showing that the risk was de minimis) and have lost clients over it as there are plenty of defense experts and lawyers willing to play the game for a fee. I suspect this is what gets Greenland all riled up. I've never read the report he submitted "in support of the thimerosal causation theory" in the autism vaccine case but assume/hope that it's just an attack on the government's "no causation cause the epi says so" methodology. https://scholar.google.com/scholar_case?case=2205029817979343066

As for Haack, she's the darling of the worst purveyors of junk science and whether she likes it or not (and I assume she wouldn't) is often cited as an authority for the propositions that science has no rules, that Popper was all wrong, that falsification proves nothing, and that it's perfectly valid for an expert to glean causation from 9 papers that show no causation because causal inference is just a subjective state of belief and lay people should decide whether one scientist's reasoning is better than another's. As you might imagine I have high hopes that the reproducibility crisis will one day draw the focus of the courts' skeptical attention to such arguments.

• Thanatos:

Whatever the case of misinformation and snow-jobbing and quantitative naivety in the judiciary, there is one thing that it seems lawyers and judges and so forth are plenty interested in, and this is the logic of argument. “If A then B, A, therefore C” is never going to fly in court if the fallacy is uncovered clearly enough.

So, what you need is an expert who will swear, not that Jalapeno Cheese is a cause of GM because also p less than 0.05, but that “STATA says p less than 0.05 in a meta analysis of epi literature therefore hexa-methyl-scary-stuff is the legal cause of GM in this case” is a fallacy of the form “If A then B, A, Therefore C”, at least in cases where that’s what’s going on.

In other words, not an argument that other things could cause GM, but rather that the evidence provided via STATA and etc does not establish causation, or even better, does not establish causation *in this case*, and a clear explanation of what it DOES establish instead.

For example:

Given: If (H-M-S is one of the many causes of GM, and plaintiff was exposed to H-M-S) then (H-M-S is among the possible causes of GM in the plaintiff)

and also: (If H-M-S does not cause GM or plaintiff was not exposed to H-M-S) then (H-M-S is not among the possible causes of GM in the plaintiff)

Accepted: H-M-S is one of the many causes of GM, and plaintiff was exposed to H-M-S

Therefore: H-M-S did cause GM in the plaintiff (Fallacious: does not follow from the given facts)

Therefore: H-M-S is among the possible causes of GM in the plaintiff (Correct)

To go further, we need to establish that the preponderance of the evidence suggests not just that H-M-S is among the possible causes, but that H-M-S is the cause in this case.

What we’ve probably established (at best) via the epi-research is that

p(GM | administer HS) is greater than p(GM | don’t administer HS)

in the population. You could argue that (p(GM|Administer HS) – p(GM|don’t administer HS) ) / p(GM | don’t administer HS)

is the incremental relative increase in probability of GM caused by administering HS, and only if this is greater than 1 does the preponderance of evidence prevail. That is, most of the probability of having GM comes from the increase in probability caused by administration… This is equivalent to saying that p(GM | Administer) / p(GM | don’t administer) is greater than 2. In other words, administering has to double your risk or more to get a preponderance.
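The equivalence claimed above is a one-line piece of algebra. As a sketch, write $p_1$ for p(GM | administer HS) and $p_0$ for p(GM | don’t administer HS); these symbols are shorthand introduced here, and the step assumes $p_0 > 0$ and that the difference in risk is entirely causal.

```latex
\frac{p_1 - p_0}{p_0} > 1
\;\iff\; p_1 - p_0 > p_0
\;\iff\; \frac{p_1}{p_0} > 2
\;\iff\; \frac{p_1 - p_0}{p_1} > \frac{1}{2}
```

The last form says that more than half of the risk among the exposed is in excess of background, which is the "more likely than not" reading of a doubled risk.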

Anyway, email me if you think this kind of thing could help.

• Terry says:

> is the incremental relative increase in probability of GM caused by administering HS, and only if this is greater than 1 does the preponderance of evidence prevail. That is, most of the probability of having GM comes from the increase in probability caused by administration… This is equivalent to saying that p(GM | Administer) / p(GM | don’t administer) is greater than 2. In other words, administering has to double your risk or more to get a preponderance.

Is this really true? Isn’t preponderance “more likely than not” that HS caused GM?

If so, wouldn’t you want to go Bayesian and compute the probability that GM was caused by all other possible factors and compare it to the probability that GM was caused by HS?

Could this be a general solution to the p<.05 problem of the plaintiff's expert? Finding that p<.05 establishes only that there is some probability that HS caused GM. But, that probability may be very small.

I'm guessing this is too powerful an argument because this is probably true of all life-threatening chemicals. It would also be pretty weak in a class action since it would concede that some members of the class were harmed by HS.

• The probabilities I’m writing down were Bayesian, so if exposure doubles your risk then it is more than 50 percent of the cause, under assumptions of causality in the first place.

More details would require more… Well details.

If in the background there’s a 1% lifetime risk of Foo and with exposure to bar there’s a 9% lifetime risk, and the plaintiff was exposed… Well the preponderance is clear. The borderline is the doubling of risk. 2x higher than background.
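The 1% versus 9% example can be checked with a few lines. This is only a sketch: the function name `excess_fraction` is made up here, and treating the excess fraction as the probability that exposure was the cause relies on strong assumptions (the association is causal and unconfounded, and exposure does not merely hasten cases that would have happened anyway), which need not hold in a real case.

```python
def excess_fraction(risk_exposed, risk_background):
    """Share of the exposed risk in excess of background: (p1 - p0) / p1."""
    if not (0.0 < risk_background <= 1.0 and 0.0 < risk_exposed <= 1.0):
        raise ValueError("risks must be probabilities in (0, 1]")
    return (risk_exposed - risk_background) / risk_exposed

# Numbers from the comment above: 1% background risk, 9% risk with exposure.
foo_example = excess_fraction(0.09, 0.01)
# Borderline case: exposure exactly doubles the background risk.
borderline = excess_fraction(0.02, 0.01)

print(f"risk ratio 9: excess fraction = {foo_example:.3f}")
print(f"risk ratio 2: excess fraction = {borderline:.3f}")
```

A ninefold risk puts the excess fraction at about 0.89, comfortably past one half, while a risk ratio of exactly 2 sits right on the 0.5 borderline.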

• Keith O'Rourke says:

Thanks for the legal insight

“I assume she wouldn’t”
Most likely not – here she talks about the perverse incentive in philosophy and its “reproducibility” crisis

• Sander Greenland says:

Since my name came up along with Haack’s: There is a lot to say here.

First, the personal: Defense lawyers love to mention I testified in this case without bothering to check (or mention) that I went on record as saying that I don’t think vaccines cause autism. Their behavior is unsurprising since their role allows and even encourages selective citation and impugning opposing experts, so it becomes a habit. But then, what I actually said is ignored or misquoted by almost all commentators invoking my name in the case, including two of the three adjudicators.

Now, the science: I was indeed attacking the idea pushed by Steve Goodman arguing for the government that you can prove a null with the available data and knowledge in a case like this. This idea (which is antiscientific to any falsificationist) has a formalization in using null-spiked priors for “Bayesian significance tests” (Jeffreys tests) as promoted in certain Bayesian writings for researchers, but rejected for these applications by me and many others (including I think Andrew) because there is no scientific/empirical basis for such a spike in “soft sciences” (the claimed biochemical arguments in this case based on dose were also far more soft than made out to be).

To elaborate: There isn’t an exact empirical basis for spikes even in physics (e.g., see Cartwright) – the best one can say is that the concentration about laws is well enough approximated by spikes. Until c. 1900 Newton’s laws would have been given a spike to little practical effect; today that would ruin high-accuracy GPS; and Jeffreys showed the hazard of spikes in his vehement denial of continental drift.
But then, the approximation argument does not begin to apply in any serious problem in soft sciences, which is why they are called “soft” (I am here excluding homeopathy and many other belief systems as “serious” in terms of explanatory as opposed to pathological science, as well as the pathologies seen in the “vaccine wars”).

The sad part for American tort law is that the burden of proof is supposed to be on the plaintiff to demonstrate the null is disfavored relative to the claimed alternative (discernable harm to the plaintiff), so there should be no defense need for a null spike – especially for vaccines, where harms are recognized and accepted as costs to be weighed against benefits. American law gets this twisted around because most statistical defense arguments I’ve seen can’t get beyond silly null-testing fallacies like P>0.05 implies support for the null. The best courts I’ve seen appreciate that (as with all real science) a lot of synthesis of disparate evidence streams is needed instead to determine if the plaintiffs have met their burden (the other government expert, Fombonne – a pediatric neurologist-epidemiologist – came much closer to doing this). At least in this case the court made what I thought was a good decision, i.e., for the defense – and I say that as one who was asked to testify by the plaintiffs!

Regulatory decisions are another can of worms entirely since those may involve precautionary principles raised by stakeholders. For example, a 10% probability of harm from a radiation exposure may not be seen as acceptable to the exposed (even if it does not lead to prevailing in a lawsuit claiming the harm occurred). Likewise, even a global-warming skeptic might accept regulations upon accepting a 1% chance of global catastrophe. Grotesque (and all-too-human) overconfidence on both sides (in the naked form of 100% certainty) plagues such emotionally-charged controversies.

3. Ken Rice says:

Hi Keith, Marcus

You might be interested in this recent paper on fixed-effects meta-analysis, where we try to lay out what it does estimate, and discuss whether (or when) this may be of applied interest, and how it can be used together with other methods. For random-effects, Higgins, Thompson and Spiegelhalter (2008) similarly lay out motivations for that approach – the most compelling of which, I think, is an assumption of exchangeability.

For Marcus – while I have no idea about the details of your problem – it sounds like a combination of two things going on. First, an average association that measures an uninteresting quantity; i.e. an average of parameters that tell you more about design flaws than any underlying truth. Second, an unwillingness to engage with heterogeneity, either as a quantity of interest by itself or explaining it via some form of meta-regression. Both are tricky to discuss with skeptical audiences, in my experience, but I hope the papers above – or the literature they review – help you.

4. Keith O'Rourke says:

Thanks, the abstract looks new and interesting – but I will have to wait until I am where it’s not paywalled.

This might interest you re – assumption of exchangeability http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

So much of the meta-analysis literature is such a re-hash of ideas that were fairly well worked out in the 1990s.

For instance this 2016 paper by your second author Julian Higgins, “A general framework for the use of logistic regression models in meta-analysis”, which discovers that logistic regression can be used to implement meta-analysis. That was in my first paper from 1987 that I discussed above that was held in review for so long http://annals.org/aim/article/702054

Now Julian was a co-research fellow with me in Oxford and for instance was supposed to read and comment on draft of this paper that set out all the details – O’Rourke K, Altman DG. 2005. Bayesian random effects meta-analysis of trials with binary outcomes: methods for the absolute risk difference and relative risk scales. Stat Med 24:2733-2742

I guess he either didn’t actually read the drafts or totally forgot about that!

Now some of my stuff was less accessible,

O’Rourke K. 2001. Meta-analysis: Conceptual issues of addressing apparent failure of individual study replication or “inexplicable” heterogeneity. Ahmed SE and Reid N, editors. Lecture notes in statistics: Empirical Bayes and likelihood inference. Springer

But I find it disappointing that as a field meta-analysis has seen so little progress over the past 20 to 30 years, almost as if a cartel was restraining competition ;-)

5. Keith O'Rourke says:

Ken: Unlike some of your other papers this one I find unconvincing if not even wrong headed.

Yes “answers a question that is relevant to the scientific situation at hand” is the goal but Peto and Peto like automatic/thoughtless post-stratification of (reified) fixed sub-population effects is very unlikely to do that credibly.

I believe a better route forward, if there is knowledge of what to post-stratify on to get a stable over time and place population of interest would be MRP with an informative prior on effect variation. If not just a random effect model with an informative prior on effect variation. It is the avoidance of informative priors that drives the desperate holy grail quest to make sense of varying effects as fixed – see Dan’s new post here https://andrewgelman.com/2017/09/05/never-total-eclipse-prior/

The idea that a trial had a certain subset of the population and that the effect is fixed in that subset is just a hopeful untested superstition supported by not having access to that variation – surely it will vary by time and place (though perhaps not importantly). Perhaps inbred lab animals in the same lab with the same technician – but there we do see above-SE variation. Trials are attempts to do things as correctly as possible; there is always some slippage. The Rubin paper you referenced was very clear about methodological in addition to biological variations.

Was not aware that this fixed versus random effects was such a big deal in genetics. Though even with Mendelian randomization I do think you mostly “speak of no non-random error”.

Or maybe I missed something.

6. What an outstanding discussion.