Skip to content

Hierarchical/multilevel modeling with “big data”

Dean Eckles writes:

I make extensive use of random effects models in my academic and industry research, as they are very often appropriate.

However, with very large data sets, I am not sure what to do. Say I have thousands of levels of a grouping factor, and the number of observations totals in the billions. Despite having lots of observations, I am often either dealing with (a) small effects or (b) trying to fit models with many predictors.

So I would really like to use a random effects model to borrow strength across the levels of the grouping factor, but I am not sure how to practically do this. Are you aware of any approaches to fitting random effects models (including approximations) that work for very large data sets? For example, applying a procedure to each group, and then using the results of this to shrink each fit in some appropriate way.

Just to clarify, here I am only worried about the non-crossed and in fact single-level case. I don’t see any easy route for crossed random effects, which is why we have been content to just get reasonable
estimates uncertainty estimates for means, etc., when there are crossed random effects (

(Some extra details: In one case, I am fitting a propensity score model where there are really more than 2e8 somewhat similar treatments. Perhaps ignoring the ones that aren’t common (say, only considering treatments where I have over 1e4 treated units), I want to predict the treatment from a number of features. One approach is to go totally unpooled (your secret weapon), but I think variance will be a problem here sense there are so many features. Another approach is to use some other kind of shrinkage, like the lasso or the grouped lasso.)

Your ideas (or those of your blog readers) are appreciated!

My reply:

I’ve been thinking about this problem for awhile. It seems likely to me that some Gibbs-like and EM-like solutions should be possible. (And if there’s an EM solution, there should be a variational Bayes solution too.) Here are the pieces, as I envision it:

- The two-stage analysis (separate model fir to each group, then a group-level regression of the group estimates) leads to Gibbs (or EM), in which the fitted group-level model (including the estimation uncertainty) is used as a prior and fed back into the individual-level analyses.

- Speeding things up by analyzing subsets of the data. The trick is to do the sampling where you have more data than necessary (e.g., “California”) but not in the sparser groups (“Rhode Island,” etc.).

- Ultimately getting all the data back in the model, once things have stabilized. This has a bit of the feel of particle filtering.

- My guess is that the way to go is to get this working for a particular problem of interest, then could think about how to implement it efficiently in Stan etc.


  1. C Ryan King says:

    I remember this one and one titled “Feasible estimation of generalized linear mixed models (GLMM) with weak dependency between groups”. I think there’s a job paper in econ (MIT?) from a year back that I remember (but can’t put my hands on) on the same topic, which did go down the path of blocked laplace approximation updates. The machine learning community seem to have also thought about this problem a great deal, since they often have insane quantities of data.

  2. K? O'Rourke says:

    They might want to think a bit about fitting a propensity score model where there can be some advantage in over fitting and perhaps even some arguments against pooling at all.

    • Dean Eckles says:

      There are enough possible predictors that for many of the treatments (the ones without many observations) complete separation would occur if they are all used. Currently, I am using a more limited set. Also, the treatments are all similar (exposure to a URL shared by friends), so many of the coefficients should be similar (at least within types of URLs).

      Now, part of me says that if you can’t fit a model without pooling, then you probably don’t have enough data to be trying to estimate that effect. But I actually want to estimate average treatment effects (on the treated) across the whole population of URLs. (And actually I have an experimental baseline since this is a constructed observational study. So one of the goals is to see if this will actually work.)

  3. AL says:

    An answer is BLUP (Best Linear Unbiased Prediction; (a kind of) Empirical Bayes), aka Mixed Model. Animal breeders have applying BLUP for decades with *very* large data sets (hundreds of millions, with up to 400 million unknowns to my knowledge).
    BLUP equations are of the form Ax=b. Because it is a linear predictor, you can use standard iterative methods of Numerical Linear Algebra (like Preconditioned Conjugate Gradients) to get your estimates without never explicitely set up the system of equations.

    And the key in iterative solvers like PCG is to be able to compute Ax for successive approximations of x without actually setting up A. This depends on the model, but is usually feasible.

    Literature is rather large, but a readable account is Ignacy Misztal notes’ at (page 120) and this paper: .

    Also, googling for Preconditioned Conjugate Gradients gives lots of pointers.

    • Dean Eckles says:

      Thanks. Who knew there was so much relevant stuff on mixed effects models hidden in animal science journals! I didn’t at least.

      At least the paper I looked at, the focus was more on have varying intercepts. Have people been doing this with many coefficients varying from group (animal?) to group?

      • AL says:

        Can’t understand what you mean by “coefficients varying from group to group”.
        The general model is:
        y = Xb + Zu +Vv + Ts + (as many as you want) ….e = Tt +e, where t’ = (b’ u’ v’ s’…)’
        i.e., b, u, v, s… are *vectors* of effects (random or not) that some to make a phenotype y. In our applications these are herds, individuals, and perhaps different covariates nested within animal (what you find in that paper). But could be interaction herd-animal, animal-year , and so on. The difference between cross-classified and covariates and will be in the structure of the incidence matrix T (dummy covariates of type 0/1 or continuous covariates).

        You will get solutions for all levels of vector t.

        where b are “fixed” effects. Mixed Model Equations are always
        (T’R-1 T + S-1 ) = T’R-1y

        where S is the matrix of covariances, with typically a simple inverse (e.g., by blocks). S-1 for the “fixed” effects has simply 0 elements.

        A very nice introduction to BLUP is Robinson:
        That BLUP is a Good Thing: The Estimation of Random Effects
        Author(s): G. K. Robinson
        Source: Statistical Science, Vol. 6, No. 1 (Feb., 1991), pp. 15-32
        Published by: Institute of Mathematical Statistics
        Stable URL:

        • Dean Eckles says:

          Thanks for the link — I found the treatment of the fixed vs. random distinction interesting.

          I was wondering about cases where we have units (or groups) with multiple observation per unit and we want not just the intercept to vary from unit to unit, but coefficients for many of the available features. I bring this up simply because I’ve often encountered computational difficulties when trying to fit models (say with MCMC or PQL) with many such varying coefficients.

  4. Paul says:

    Given the huge size of your study, you will want to use EM if at all possible, even if you have to shoe-horn it a bit. The more data you have, the fewer iterations it takes for EM to converge (of course, each iteration takes longer).

    You might also find it worth looking at singular value decomposition as it was applied to tackle the Netflix Prize. It’s not quite as big a problem, but it might give you some ideas.

  5. Asim Ansari says:

    Stochastic gradient descent with its ability to work on one observation a time could be useful. This is used in the collaborative filtering literature on recommender systems.

  6. [...] our recent thread on computing hierarchical models with big datasets, someone brought up Blup. I thought it might be worth explaining what Blup is and how it relates to hierarchical [...]