Brian MacGillivray writes:

I’ve just published a paper that draws on your work on the garden of forking paths, as well as your concept of statistics as being the science of defaults.

The article is called, “Characterising bias in regulatory risk and decision analysis: An analysis of heuristics applied in health technology appraisal, chemicals regulation, and climate change governance.”

MacGillivray also has a question:

So I’ve noticed that you blog a fair amount about the importance of interactions, but are often quite critical of how subgroup analyses are done in practice (chasing noise, spurious findings, etc.). So I was wondering if you have any general principles on how to: 1) design and conduct a subgroup analysis, and 2) interpret the results of subgroup analyses done by others (e.g. in a systematic review, or meta-analysis)? I guess you would favour multi-level models for this, but I was wondering about situations where this isn’t an option, e.g. where your collaborators favour significance tests / interval estimation, or the literature you’re reviewing primarily uses classical statistics.

My reply: Yes, I *do* recommend multilevel modeling and partial pooling. Otherwise your estimates are biased; they’re systematically too far away from zero. If my collaborators don’t want to do this, or if the literature favors classical approaches, I’d say to do it both ways and then discuss why I prefer the multilevel estimates. Realistically, we’ll have to expect lots of our uncertainty intervals to include zero; that’s just the way it is—but decisions still need to be made in any case.

When it comes to meta-analysis: more and more, I’m saying, don’t even do it if you can’t get the raw data from the original studies.

More broadly, my take on doing subgroup analyses (when an investigator insists on it) is to adjust alpha accordingly to compensate for looking at the same data multiple times. I lean towards a conservative Bonferroni adjustment in these cases. I’ve made the point to investigators that the more questions they ask from their data, the fewer answers they will get, so choose the questions carefully. I try to dissuade them from doing this sort of thing.

This is in the context of having done an initial analysis of a full data set, followed by the investigator wanting to focus on particular subgroups of interest — generally when they didn’t get the answer they wanted initially.

Another way to think about this though, is that the more questions you ask, the better your answers are…

Suppose you do want to look at different subgroups. If you pick and choose 1 or 2 subgroups based on a first look at the data, and analyze just those sub-groups ignoring all the information about other groups you have, your result is heavily biased towards finding what you want. If on the other hand, you use a Bayesian model across *all* the subgroups that incorporates the subgroup similarity through partial pooling, the estimates for each sub-group are going to be closer to correct and the variation estimates from sub-group to sub-group will be closer to correct.

Of course, the correct answer may be “there isn’t that much reason to think the sub-groups are very different” but that is the *correct* answer. It just isn’t the answer that gets you publications, tenure, and grants.

If you do NHST type analyses on sub-groups using just a few sub-groups you get more and more biased and wrong answers… and if you do the same thing in a Bayesian analysis using all the groups you get more and more accurate but somewhat less precise answers.

Is there a reason you think this must be a Bayesian analysis? You would get the same effect that you are advocating by using a standard mixed effects model.

Typically a “standard mixed effects model” is a Bayesian model with a flat prior. With the flat prior, you in general *don’t* necessarily get the partial pooling needed to make this make sense without very large sets of data. In other words, the inference from one sub-group is decoupled from the information in other sub-groups and it breaks down to looking at the individual sub-groups by themselves, just doing the calculation all at once.

Depending on the model, and the quantity of data, that may or may not always be true, but without a group level prior to describe your understanding of how similar the groups are to each other, you in general don’t get the same answer as you would with that “prior”.

“In other words, the inference from one sub-group is decoupled from the information in other sub-groups and it breaks down to looking at the individual sub-groups by themselves, just doing the calculation all at once.”

I’m not following. If we fit 5 random effects with a vanilla LME model, then we assume each of the deviations from the fixed mean come from a normal distribution centered at 0 a shared sd. Thus, supposed we had 4 raw estimates clustered around 1.5, but one raw estimate close to 50 with only a little data supporting this. The LME estimate for the fifth estimate would be strongly recentered toward 1.5, which sounds like the effect you are talking about in your first post. The LME estimate of the 5th effect is *not* decoupled from the estimates of the first 4. If we fitted 5 fixed effects or just used OLS, it would be.

In this model the mean of the fixed effects would be inferred as

(1.5*4 + 50)/5 = 11.2

and the sd would be inferred as

sd(c(1.5,1.5,1.5,1.5,50)) = 21.7

on the other hand, if you placed a prior that the fixed effects were say normal(1.5,.5) then the estimate for the 50 would be strongly pulled down towards 1.5, whereas the 1.5s would stay where they were.

Of course it’d have to be the case that normal(1.5,.5) made sense for your problem before you collected the data, but assuming that the real process is O(1) you’d probably do something more like normal(2,2) and still pull the 50 way down.

But in the normal(a,b) with a and b having improper priors (which is what you get with a typical mixed effect model) there’s nothing keeping the grouping from becoming normal(11.2,21.7) and all the 1.5’s being pulled up towards 11.2 and the 50 being pulled down towards 11.2 and everything being wrong.

The mixed effects model as calculated on an actual computer *is* a Bayesian model with uniform(-MaxFloat,MaxFloat) prior on the a,b. So the question is just whether uniform(-MaxFloat,MaxFloat) is a meaningful prior in your problem.

I’m afraid you’ve misrepresented how mixed effects models work.

A:

I don’t think terms such as “mixed effects models” are so useful (see my 2005 paper on Anova for my problems with the terms “fixed” and “random”), but, yes, the prior distribution on the group-level variance parameters can make a difference when the number of groups is small or moderate. This is a point that Jennifer and I unfortunately did

notmake in our book, as we didn’t fully understand it at the time of writing. With any single dataset, the simple flat-prior multilevel model estimate, or classical “mixed effects” estimate, can seem reasonable—but if you start fitting the same model to lots of datasets, you’ll quickly learn that the estimated group-level variance parameters can be unstable, and prior information can really help.Andrew: right, with 5 groups, you don’t have much to infer the mean and sd of the group level effects from, with 37 or 1100 or whatever you’d have a lot more regularization on individual one-off outlier effects. The typical “mixed effects” model is very literally equivalent to the following stan pseudocode:

a ~ uniform(-MaxFloat,MaxFloat)

b ~ uniform(0,MaxFloat)

ymean ~ normal(a,b)

sd ~ uniform(0,MaxFloat)

for(i in 1:N){

y[i] ~ normal(ymean[group[i]],sd[group[i]])

}

the end result is that if the values are more or less normally distributed, then ymean is the same as the sample mean of each group, sd[group[i]] is the same as the sample sd of the group, the a value is the mean of the ymean values, and the b value is the sample sd of the ymean values

if there are outliers, then the assumption of a normal shape to the distribution of ymean may pull outliers in a little so that the posterior ymean values are more normally distributed, but this is really a distributional shape thing rather than a scale issue. The main effect of the true Bayesian priors which is keeping things from being too far apart doesn’t occur because there’s no prior information on the size of a or b, they’re symmetrically equally a-priori likely for the whole floating point range, and so there’s no “rubber band” to pull things in towards a pre-specified common value.

That analogy of a rubber band is VERY LITERALLY how the HMC calculations are done. A normal distribution corresponds to a mass on a spring, the sd corresponds to the stiffness of the spring, and the HMC dynamics corresponds to letting that mass vibrate back and forth. The strength of the spring is related to the sd of the distribution. with b ~ uniform(0,MaxFloat) you are in essence saying that the stiffness of the spring could be any value. When you ask for a point estimate from some fitting machinery it’ll just give you the sd of the sufficient statistics (the sample means for each group).

the WHOLE POINT of flat priors / likelihood only inference is to only use information in the data. hence, your posterior is just whatever the observed values of the sufficient statistics are.

Andrew:

Agreed. Variance components fit by MLE (or moreover, ReML) methods are unstable. Variance components fit with a very small number of units (such as 5 different groups, even if there are 1000 samples per group) are especially unstable!

I just ran this using lmer from lme4, imagine 4 groups of rnorm(5,1.5,0.5) and one group rnorm(5,50,.5), we try to infer the mean of each group:

library(lme4)

dat <- data.frame(m = c(rnorm(20, 1.5,.5),rnorm(5,50,.5)),g = as.factor(c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5))))

summary(lmer(m ~ 0+ g + (1|g),data=dat))

Linear mixed model fit by REML [‘lmerMod’]

Formula: m ~ 0 + g + (1 | g)

Data: dat

REML criterion at convergence: 28.6

Scaled residuals:

Min 1Q Median 3Q Max

-1.58683 -0.70859 -0.08228 0.59895 1.61954

Random effects:

Groups Name Variance Std.Dev.

g (Intercept) 491.3574 22.1666

Residual 0.1637 0.4046

Number of obs: 25, groups: g, 5

Fixed effects:

Estimate Std. Error t value

g1 1.574 22.167 0.071

g2 1.735 22.167 0.078

g3 1.548 22.167 0.070

g4 1.685 22.167 0.076

g5 50.206 22.167 2.265

Which is exactly what I predicted, the mean of each group is the individual mean, the sd of the group level stuff is 22.2 just as expected, there is no pooling. With 300 groups most of them at mean of 1.5 the outlier group would be pulled in because the sd of the group level effects wouldn’t be affected by the outlier… but with 5 groups it is.

Daniel:

One reason it’s exactly the same is that you have misspecified the model. Note that you have a fixed effect (g1, g2…) that is fit only by the data in group 1, group 2, etc. I think what you want is to just have a fixed intercept with a random intercept for variable g. Then you can get your predictions by simply doing

predict(fit, data = data.frame(g = 5))

Once you fix that, you may also notice that it actually behaves a little in the opposite as you have described above; if you have a small number of groups, one extreme outlier group will mean the estimate of the variance is so high (see: what Andrew was referring to) that there will be hardly any pull on the estimates; with means 1.5 and 50, group 5’s prediction will be very close (but not exactly the same) as the raw mean for group 5. On the other hand, as that outlier comes closer to the “normal” group means, the variance estimate will get smaller and there will be a stronger pull toward the group mean.

You can also play with changing the sample size of group 5.

I find lmer syntax not that helpful.

Somehow you mistake what it is i’m suggesting:

Originally you said: “Thus, supposed we had 4 raw estimates clustered around 1.5, but one raw estimate close to 50 with only a little data supporting this. The LME estimate for the fifth estimate would be strongly recentered toward 1.5, which sounds like the effect you are talking about in your first post. The LME estimate of the 5th effect is *not* decoupled from the estimates of the first 4. If we fitted 5 fixed effects or just used OLS, it would be.”

Then I argued this is not true because the flat prior on the lmer type model allows the model to simply expand the variance to stretch the rubber band as much as needed to put the estimates on the individual group values.

You respond that I specified a different model from what you suggest, which is fine, I find lmer notation somewhat opaque and prefer Stan so I never use lmer. Nevertheless, you suggested:

“Once you fix that, you may also notice that it actually behaves a little in the opposite as you have described above; if you have a small number of groups, one extreme outlier group will mean the estimate of the variance is so high (see: what Andrew was referring to) that there will be hardly any pull on the estimates”

But what you’re describing is EXACTLY what I suggested, and the opposite of what you originally suggested, so I think you misunderstood something. The pull on the estimates that I describe is what you get from the Bayesian model with strong prior, not from the mixed effects model. Originally you asked why I thought it had to be a Bayesian model, because you would get the same pull with a mixed effects model. I said a mixed effects model *is* a Bayesian model, it just has a flat prior on the whole space of IEEE floats, and therefore won’t give you the pull you need to keep outliers from misleading when the number of groups is small. The lmer model *is* the flat prior model, and in fact, does exactly what I suggested.

With the model specified as you suggest, you still get exactly what I predicted, except you have to call predict instead of just reading off the coefficient. the overall mean is 11.35 and predictions for group 5 are 50.2 and the sd between groups is 21.7

summary(lmer(m ~ 1+ (1|g),data=dat))

Linear mixed model fit by REML [‘lmerMod’]

Formula: m ~ 1 + (1 | g)

Data: dat

REML criterion at convergence: 66.2

Scaled residuals:

Min 1Q Median 3Q Max

-1.58017 -0.71027 -0.08392 0.59727 1.61789

Random effects:

Groups Name Variance Std.Dev.

g (Intercept) 471.7861 21.7206

Residual 0.1637 0.4046

Number of obs: 25, groups: g, 5

Fixed effects:

Estimate Std. Error t value

(Intercept) 11.350 9.714 1.168

> predict(lmer(m ~ 1+ (1|g),data=dat),newdata=data.frame(g=5))

1

50.20309

Okay, here’s the code for you.

> library(lme4)

> # 10 groups, all mean 0

> data = data.frame(y = rnorm(100),

+ grp = rep(1:10, 10))

> # 11th group, mean 1

> data2 = data.frame(y = rnorm(10) + 1,

+ grp = rep(11, 10))

> # Note the unadjusted sample mean for 11th group

> mean(data2$y)

[1] 0.8552622

>

> # Combining all the data and fitting

> alldata = rbind(data, data2)

> fit # Fitted mean for 11th group is NOT sample mean!

> predict(fit, data.frame(grp = 11))

1

0.1441146

Why isn’t that happening in your code? That’s because you’ve picked an outlier so extreme that the variance of the mean components is so high it barely has extremely close to zero pull on the estimate.

Ooph. R should drop the arrow notation, just so it can work with wordpress. There should be a

fit = lmer(y ~ 1 + (1 | grp), data = alldata )

in there.

Also, how does Bob post his beautiful code? I tried his tip from

https://andrewgelman.com/2017/09/12/seemed-destruction-done-not-choose-two/#comment-563019

but it seems to no longer work.

a reader:

Yes you’re absolutely right as the number of groups increases, the variance between groups remains equal to the sample variance of the observed sample group means which then converges to whatever it is, and outliers have somewhat less effect. So as I said when there are 30 or 1000 or whatever groups you wind up much more regularized, as you’d expect…

Or you could provide the regularization information in the form of a prior, and you get the same regularizing effect even at small numbers of groups.

I was trying to avoid too much of contrasting Bayesian and Frequentist methods, as this can quickly devolve into something more like a religious or political argument. I’ll go so far as to say that, wherever I’ve seen both approaches (correctly) applied to the same problem, the conclusions have been essentially the same. I’ve also seen some cherry-picked cases which give the impression otherwise, at least until you look closely at the details. Life is short, so I’ll leave it at that :)

Sorry, I agree. It’s an orthogonal question.

For a meta-analysis when you can’t get the raw data from the original studies (which seems to be the norm for the medical literature), how about a generative-model based analysis using something like pcnetmeta (https://www.jstatsoft.org/article/view/v080i05)? This package uses JAGS but might be translatable to Stan.

Technically if you can’t get the raw data and the summaries provided are not sufficient for the model you wish to use (and it often isn’t) or the summaries are not actually correct – there is very that can be done to directly redress that.

The technical gory details were set out here https://phaneron0.files.wordpress.com/2015/08/thesisreprint.pdf Now that ABC is feasible you can likely skip getting the likelihood –

but you are stuck with just what information is in the summaries, which is almost zero if the summaries are incorrect or have been selective chosen.

I’m curious as to what people here think of meta-analytic methods that attempt to detect (and possibly correct for) selection biases, e.g. p-curve, p-uniform, R-index,the Test of Insufficient Variance, and selection models. Schimmack and Simmons / Nelson / Simonsohn have recently been promoting their respective forensic meta-analysis tools as ways of acquiring positive evidence for the presence of biasing selection effects. The older selection models have been used to give “corrected” estimates of primary effects by modeling the selection procedure.

Are these worthwhile tools for someone who wants to do meta-analysis from a critical angle?

Ben:

There’s nothing wrong with these methods, but there’s a limit to what you can do without the raw data. I’m skeptical of corrected estimates based on published summaries; that said, if all we have are the published summaries, I guess we have to do something. Search for “Edlin factor” on this blog.

I recall John Copas working on selection modeling selection effects around 2000, I believe arguing that they should nevr be taken as more than sensitivity analyses.

When I suggested a couple years ago to Iain Chalmers (a founder of the Cochrane Collaboration) that they stop doing meta-analyses until they had access to raw data (may not be that far in the future in some ares) he replied that “We have to make do with what we have.”

I can’t disagree with that, but the statistics and discussion should properly reflect that (e.g. Edlin factor).