Yes you’re absolutely right as the number of groups increases, the variance between groups remains equal to the sample variance of the observed sample group means which then converges to whatever it is, and outliers have somewhat less effect. So as I said when there are 30 or 1000 or whatever groups you wind up much more regularized, as you’d expect…

Or you could provide the regularization information in the form of a prior, and you get the same regularizing effect even at small numbers of groups.

]]>fit = lmer(y ~ 1 + (1 | grp), data = alldata )

in there.

Also, how does Bob post his beautiful code? I tried his tip from

http://andrewgelman.com/2017/09/12/seemed-destruction-done-not-choose-two/#comment-563019

but it seems to no longer work.

]]>> library(lme4)

> # 10 groups, all mean 0

> data = data.frame(y = rnorm(100),

+ grp = rep(1:10, 10))

> # 11th group, mean 1

> data2 = data.frame(y = rnorm(10) + 1,

+ grp = rep(11, 10))

> # Note the unadjusted sample mean for 11th group

> mean(data2$y)

[1] 0.8552622

>

> # Combining all the data and fitting

> alldata = rbind(data, data2)

> fit # Fitted mean for 11th group is NOT sample mean!

> predict(fit, data.frame(grp = 11))

1

0.1441146

Why isn’t that happening in your code? That’s because you’ve picked an outlier so extreme that the variance of the mean components is so high it barely has extremely close to zero pull on the estimate.

]]>When I suggested a couple years ago to Iain Chalmers (a founder of the Cochrane Collaboration) that they stop doing meta-analyses until they had access to raw data (may not be that far in the future in some ares) he replied that “We have to make do with what we have.”

I can’t disagree with that, but the statistics and discussion should properly reflect that (e.g. Edlin factor).

]]>Somehow you mistake what it is i’m suggesting:

Originally you said: “Thus, supposed we had 4 raw estimates clustered around 1.5, but one raw estimate close to 50 with only a little data supporting this. The LME estimate for the fifth estimate would be strongly recentered toward 1.5, which sounds like the effect you are talking about in your first post. The LME estimate of the 5th effect is *not* decoupled from the estimates of the first 4. If we fitted 5 fixed effects or just used OLS, it would be.”

Then I argued this is not true because the flat prior on the lmer type model allows the model to simply expand the variance to stretch the rubber band as much as needed to put the estimates on the individual group values.

You respond that I specified a different model from what you suggest, which is fine, I find lmer notation somewhat opaque and prefer Stan so I never use lmer. Nevertheless, you suggested:

“Once you fix that, you may also notice that it actually behaves a little in the opposite as you have described above; if you have a small number of groups, one extreme outlier group will mean the estimate of the variance is so high (see: what Andrew was referring to) that there will be hardly any pull on the estimates”

But what you’re describing is EXACTLY what I suggested, and the opposite of what you originally suggested, so I think you misunderstood something. The pull on the estimates that I describe is what you get from the Bayesian model with strong prior, not from the mixed effects model. Originally you asked why I thought it had to be a Bayesian model, because you would get the same pull with a mixed effects model. I said a mixed effects model *is* a Bayesian model, it just has a flat prior on the whole space of IEEE floats, and therefore won’t give you the pull you need to keep outliers from misleading when the number of groups is small. The lmer model *is* the flat prior model, and in fact, does exactly what I suggested.

With the model specified as you suggest, you still get exactly what I predicted, except you have to call predict instead of just reading off the coefficient. the overall mean is 11.35 and predictions for group 5 are 50.2 and the sd between groups is 21.7

summary(lmer(m ~ 1+ (1|g),data=dat))

Linear mixed model fit by REML [‘lmerMod’]

Formula: m ~ 1 + (1 | g)

Data: dat

REML criterion at convergence: 66.2

Scaled residuals:

Min 1Q Median 3Q Max

-1.58017 -0.71027 -0.08392 0.59727 1.61789

Random effects:

Groups Name Variance Std.Dev.

g (Intercept) 471.7861 21.7206

Residual 0.1637 0.4046

Number of obs: 25, groups: g, 5

Fixed effects:

Estimate Std. Error t value

(Intercept) 11.350 9.714 1.168

> predict(lmer(m ~ 1+ (1|g),data=dat),newdata=data.frame(g=5))

1

50.20309

There’s nothing wrong with these methods, but there’s a limit to what you can do without the raw data. I’m skeptical of corrected estimates based on published summaries; that said, if all we have are the published summaries, I guess we have to do something. Search for “Edlin factor” on this blog.

]]>One reason it’s exactly the same is that you have misspecified the model. Note that you have a fixed effect (g1, g2…) that is fit only by the data in group 1, group 2, etc. I think what you want is to just have a fixed intercept with a random intercept for variable g. Then you can get your predictions by simply doing

predict(fit, data = data.frame(g = 5))

Once you fix that, you may also notice that it actually behaves a little in the opposite as you have described above; if you have a small number of groups, one extreme outlier group will mean the estimate of the variance is so high (see: what Andrew was referring to) that there will be hardly any pull on the estimates; with means 1.5 and 50, group 5’s prediction will be very close (but not exactly the same) as the raw mean for group 5. On the other hand, as that outlier comes closer to the “normal” group means, the variance estimate will get smaller and there will be a stronger pull toward the group mean.

You can also play with changing the sample size of group 5.

]]>library(lme4)

dat <- data.frame(m = c(rnorm(20, 1.5,.5),rnorm(5,50,.5)),g = as.factor(c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5))))

summary(lmer(m ~ 0+ g + (1|g),data=dat))

Linear mixed model fit by REML [‘lmerMod’]

Formula: m ~ 0 + g + (1 | g)

Data: dat

REML criterion at convergence: 28.6

Scaled residuals:

Min 1Q Median 3Q Max

-1.58683 -0.70859 -0.08228 0.59895 1.61954

Random effects:

Groups Name Variance Std.Dev.

g (Intercept) 491.3574 22.1666

Residual 0.1637 0.4046

Number of obs: 25, groups: g, 5

Fixed effects:

Estimate Std. Error t value

g1 1.574 22.167 0.071

g2 1.735 22.167 0.078

g3 1.548 22.167 0.070

g4 1.685 22.167 0.076

g5 50.206 22.167 2.265

Which is exactly what I predicted, the mean of each group is the individual mean, the sd of the group level stuff is 22.2 just as expected, there is no pooling. With 300 groups most of them at mean of 1.5 the outlier group would be pulled in because the sd of the group level effects wouldn’t be affected by the outlier… but with 5 groups it is.

]]>Agreed. Variance components fit by MLE (or moreover, ReML) methods are unstable. Variance components fit with a very small number of units (such as 5 different groups, even if there are 1000 samples per group) are especially unstable!

]]>a ~ uniform(-MaxFloat,MaxFloat)

b ~ uniform(0,MaxFloat)

ymean ~ normal(a,b)

sd ~ uniform(0,MaxFloat)

for(i in 1:N){

y[i] ~ normal(ymean[group[i]],sd[group[i]])

}

the end result is that if the values are more or less normally distributed, then ymean is the same as the sample mean of each group, sd[group[i]] is the same as the sample sd of the group, the a value is the mean of the ymean values, and the b value is the sample sd of the ymean values

if there are outliers, then the assumption of a normal shape to the distribution of ymean may pull outliers in a little so that the posterior ymean values are more normally distributed, but this is really a distributional shape thing rather than a scale issue. The main effect of the true Bayesian priors which is keeping things from being too far apart doesn’t occur because there’s no prior information on the size of a or b, they’re symmetrically equally a-priori likely for the whole floating point range, and so there’s no “rubber band” to pull things in towards a pre-specified common value.

That analogy of a rubber band is VERY LITERALLY how the HMC calculations are done. A normal distribution corresponds to a mass on a spring, the sd corresponds to the stiffness of the spring, and the HMC dynamics corresponds to letting that mass vibrate back and forth. The strength of the spring is related to the sd of the distribution. with b ~ uniform(0,MaxFloat) you are in essence saying that the stiffness of the spring could be any value. When you ask for a point estimate from some fitting machinery it’ll just give you the sd of the sufficient statistics (the sample means for each group).

the WHOLE POINT of flat priors / likelihood only inference is to only use information in the data. hence, your posterior is just whatever the observed values of the sufficient statistics are.

]]>I don’t think terms such as “mixed effects models” are so useful (see my 2005 paper on Anova for my problems with the terms “fixed” and “random”), but, yes, the prior distribution on the group-level variance parameters can make a difference when the number of groups is small or moderate. This is a point that Jennifer and I unfortunately did *not* make in our book, as we didn’t fully understand it at the time of writing. With any single dataset, the simple flat-prior multilevel model estimate, or classical “mixed effects” estimate, can seem reasonable—but if you start fitting the same model to lots of datasets, you’ll quickly learn that the estimated group-level variance parameters can be unstable, and prior information can really help.

Are these worthwhile tools for someone who wants to do meta-analysis from a critical angle?

]]>(1.5*4 + 50)/5 = 11.2

and the sd would be inferred as

sd(c(1.5,1.5,1.5,1.5,50)) = 21.7

on the other hand, if you placed a prior that the fixed effects were say normal(1.5,.5) then the estimate for the 50 would be strongly pulled down towards 1.5, whereas the 1.5s would stay where they were.

Of course it’d have to be the case that normal(1.5,.5) made sense for your problem before you collected the data, but assuming that the real process is O(1) you’d probably do something more like normal(2,2) and still pull the 50 way down.

But in the normal(a,b) with a and b having improper priors (which is what you get with a typical mixed effect model) there’s nothing keeping the grouping from becoming normal(11.2,21.7) and all the 1.5’s being pulled up towards 11.2 and the 50 being pulled down towards 11.2 and everything being wrong.

]]>I’m not following. If we fit 5 random effects with a vanilla LME model, then we assume each of the deviations from the fixed mean come from a normal distribution centered at 0 a shared sd. Thus, supposed we had 4 raw estimates clustered around 1.5, but one raw estimate close to 50 with only a little data supporting this. The LME estimate for the fifth estimate would be strongly recentered toward 1.5, which sounds like the effect you are talking about in your first post. The LME estimate of the 5th effect is *not* decoupled from the estimates of the first 4. If we fitted 5 fixed effects or just used OLS, it would be.

]]>The technical gory details were set out here https://phaneron0.files.wordpress.com/2015/08/thesisreprint.pdf Now that ABC is feasible you can likely skip getting the likelihood –

but you are stuck with just what information is in the summaries, which is almost zero if the summaries are incorrect or have been selective chosen.

Depending on the model, and the quantity of data, that may or may not always be true, but without a group level prior to describe your understanding of how similar the groups are to each other, you in general don’t get the same answer as you would with that “prior”.

]]>Suppose you do want to look at different subgroups. If you pick and choose 1 or 2 subgroups based on a first look at the data, and analyze just those sub-groups ignoring all the information about other groups you have, your result is heavily biased towards finding what you want. If on the other hand, you use a Bayesian model across *all* the subgroups that incorporates the subgroup similarity through partial pooling, the estimates for each sub-group are going to be closer to correct and the variation estimates from sub-group to sub-group will be closer to correct.

Of course, the correct answer may be “there isn’t that much reason to think the sub-groups are very different” but that is the *correct* answer. It just isn’t the answer that gets you publications, tenure, and grants.

If you do NHST type analyses on sub-groups using just a few sub-groups you get more and more biased and wrong answers… and if you do the same thing in a Bayesian analysis using all the groups you get more and more accurate but somewhat less precise answers.

]]>This is in the context of having done an initial analysis of a full data set, followed by the investigator wanting to focus on particular subgroups of interest — generally when they didn’t get the answer they wanted initially.

]]>