Skip to content

A question about varying-intercept, varying-slope multilevel models for cross-national analysis

Sean de Hoon writes:

In many cross-national comparative studies, mixed effects models are being used in which a number of slopes are fixed and the slopes of one or two variables of interested are allowed to vary across countries. The aim is often then to explain the varying slopes by referring to some country-level characteristic.

My question is whether it is possible that the estimation of these varying slopes (the interesting ones) is affected by the fact that the slopes of other (uninteresting) variables are fixed (even though they may actually vary over countries) and if so, how it may be affected? This question is inspired by many studies examining men’s wages which, for example, include a control for level of education that does not have a random slope, while I doubt whether education will have the same effect across countries.

Do you think the decision not to include many varying slopes is predominantly methodologically informed? And do you think Bayesian analyses can provide a better solution for the kind of situations where many slopes should be allowed to vary?

I have submitted a paper using Bayesian analyses, where all individual level variables (8 or so) had varying slopes and received the following critical comment: “[the inclusion of many varying slopes] suggests that a highly constrained variance-covariance structure was imposed in order to restrict the random effects to a small enough number of parameters.” Unfortunately I only took an introductory course in applied Bayesian modeling and do not know how to respond to such a comment. I do not think the reviewer is correct, but I would not know how to react to this.

My reply:

In response to your first question: essentially what you have is an omitted variable bias so, yes, if you have varying coefficients that you treat as being fixed, then this can cause problems. As with any omitted variable bias, it depends on the details of the example, whether it’s a big problem or a little problem.

Second, when many slopes vary, it can be necessary to do more regularization. The flat priors we use in my book with Jennifer might not be enough. I think we need more research on informative priors for hyperparameters such as, in this case, group-level covariance matrices.

Finally, regarding that reviewer comment: it indeed can be difficult to estimate an 8×8 covariance matrix from indirect data in this way, so some sort of regularization (I prefer a soft to a hard constraint) is probably a good idea. Of course, holding a coefficient to be fixed can be viewed as a very strong regularization that pulls the group-level variance all the way to zero, so in that sense there’s no easy answer. If you’re fitting a regression with many predictors and not a lot of data, you need to make some assumptions, explicit or implicit.


  1. Aki Vehtari says:

    Most of the models I use have varying slopes for all variables as it is the default choice when using Gaussian processes. Naturally these models include also “soft constraint”, that is, prior on smoothness controlling “the effective number of parameters”.

  2. Can you give an example with a minimum working example (with code, I mean)? I don’t understand what it means to have a prior on smoothness that controls the effective number of parameters.

    • I can’t give a minimum working example but I can maybe try to explain what Aki means, and then, he could correct me and we could all learn something :-)

      The main usage for Gaussian Processes is to describe distributions over functions. Suppose for example you have brightness of a signal across a row of pixels. So there’s a function B(x) where x is position along the row, and B is brightness.

      Now, suppose that you’re in a situation where the lens cap is on, and there’s just diode noise at each pixel (thermally driven randomness in the electrical signal). Then each pixel could be at a brightness that is essentially independent of the other pixels. Let’s say there are 2000 pixels across the row, you’d need 2000 random values to “explain” how you got B(x). Such a “white noise” signal is (perhaps) gaussian but is super-rough. There is no standard function which is “white noise”, that is, independent gaussian values at an infinite set of points.

      On the other hand, suppose that you have an image projected onto the CCD, and it’s an image of a sand dune, the brightness changes slowly and smoothly from one place to another. If you had say 5 of the pixel values, you could interpolate the other 1995 pixels pretty accurately because they just don’t change much from place to place. In the Gaussian process world, you’d describe this perhaps with a squared exponential covariance function with a large spatial scale. Cov(B(x),B(y)) = A exp(-((x-y)/L)^2). You put a prior on L that makes it about the same size as the pixel array… and this tells the Gaussian process machinery that you expect things to change slowly from one end of the array to the other.

      Such a long length scale smooth covariance function puts high probability on functions that look like low-order polynomials, say 2nd to 5th order polynomials over the 2000 data points. There is no hard cutoff that prevents something like a 6th or 8th order polynomial from coming out of samples of this gaussian process, but the majority of these samples will be well approximated by lower order polynomials.

      So the “effective number of parameters” can be thought of as something like “If I were to approximate samples from the gaussian process with chebyshev polynomials, and put a cutoff at the Nth order, I would do a good job approximating the vast majority of samples”. If you don’t like polynomials, you could also think of splines with N knots, or radial basis functions with N centers… whatever.

      • The part I don’t understand is how “varying slopes” enters into it. More specifically, gaussian processes are meaningful to me when there is a continuous manifold over which some independent variable can vary… like for example x and y positions of pixels, or latitude and longitude, or time. Most of these “varying slopes, varying intercepts” type models are over discrete measurements, so for example you might have the following types of information about a country: GDP, Median Age of Mother at Birth of First Child, Number of Members in Median Household, Mean Distance Between Hospitals.

        These 4 dimensions are then to be used to predict some average health-care outcome across countries, such as mean difference between child’s BMI at a given age, and a reference BMI from a healthy population. You might wish to use a gaussian process over GDP to describe the “effect of GDP” on this outcome. Such a gaussian process would put probability over all sorts of different “shapes” of Outcome Vs GDP, and so you might consider that this covers “varying slopes”. But the “varying” part usually means “across countries”. So it’d be easy to incorporate such a thing, but essentially, I think of “varying slopes” as a poor man’s model, where nonlinearity and interactions, and structural/causal models are not being considered, and instead you linearize the heck out of the model and want to relax some of that linearity by allowing effects to vary from one subset to another.

        • Aki Vehtari says:

          Danel, I think your explanation was just fine.

          My definitely not a minimal example with code can be found at

          • This is fantastic.

            But in the present case, if you have an 8×8 variance covariance matrix to estimate, you really have to make sure you have the data to estimate each variance component (and correlation). You’d have to have huge amounts of data for each component to get reasonable estimates. People say to me that it doesn’t matter, these are nuisance variables anyway. But it does matter. If you have wild estimates, the SEs of your fixed effects are going to be wider. Especially in low-power situations (pretty much the norm at least in psycholinguistics), you are going to miss effects at an alarming rate. Couple that with people publishing null results as if they told you something (psycholinguistics is littered with non-results dressed up as “findings”), that’s a pure waste of money and time.

            Some simulation with the parameter estimates obtained from a model fit, and comparison with the data at hand, can convince the researcher that their models are overparameterized and that the estimates don’t really reflect anything useful. (I think I got that from Andrew’s books.)

            I’m happy to be corrected.

            • Of course, having more data lets you see smaller effects with more clarity, but the importance of that will vary a lot from research project to research project. Sometimes there is something going on that can be seen even with small data sets!

              In the more general case, I’d advocate moderately strong priors on variance and covariance components which are actually justified by some theory (ie. the correlation of age and wealth should be pretty high because wealth is an average savings rate multiplied by age, or there should be very little correlation between sex and age because our population of interest is young enough that they are evenly split btw M & F… or whatever)

              Sometimes you don’t have much theory, and then you’re left with needing a good bit of data and realistic expectations. There’s no free lunch. :-)

              • Thanks for the detailed explanation earlier (in response to Aki’s comment).

                Your suggestion makes a lot of sense in general, but would not work in areas like psychology and psycholinguistics. If one really wants Bayesian methods to become widely used, it’s best to stick with uninformative or suitably constrained but still noncommittal priors, with some obvious exceptions. The most common situation where people use such models is where they don’t have much theory associated with the variance components. They only care about (and have theories about) the fixed part of the model. It’s the latter situation that I was talking about: “Sometimes you don’t have much theory, and then you’re left with needing a good bit of data and realistic expectations.” So I guess we’re in agreement.

              • Shravan: even when you don’t have a lot of specific theory, it’s often possible to do a lot more than the ultra-flat priors often advocated as “uninformative”. Since I am not much of a psychologist or psycholinguist, let’s go with an example from biology, where there’s often plenty of variation and not a lot of theory about what causes the variation.

                So, we clone a fluorescent construct into bacteria next to a regulatory sequence, and we expose these bacteria to 20 different hormones to see if the regulatory sequence causes increased or decreased expression of the fluorescent protein in the presence of the various hormones. Some kind of CCD based fluorescence quantifier is used to measure brightness for each sample. The brightness measurements range from 0 to 2^16 based on a 16 bit A/D converter.

                Now, how much variability would you expect to see from hormone to hormone? There’s not much to go on there, so we might be tempted to use a “flat” prior on the variance across hormones. But also let’s remember that the measurement device only works over a finite range, and is designed to rarely saturate. The maximum entropy distribution over measurements is uniform over [0,2^16], the standard deviation of this distribution is 12^16/sqrt(12). So at the least we can probably put an upper bound on the variance. That isn’t true with say a half-cauchy prior sometimes (previously) advocated by Andrew.

                Furthermore, we can also guess that not every hormone a body produces could possibly regulate this particular sequence, if every hormone regulated every sequence there wouldn’t be any way to control the cell. So, the distribution of brightnesses across hormones should have something of a bi-modal or at least right-tailed shape, potentially with a lot of samples having near 0 effect, and then various hormones having big effects. Rather than focusing on the variance or “scale” of this distribution, perhaps we should focus on making sure that the effects themselves have a long-tailed prior.

                I doubt very much that Psych and Psycholinguistics fail to have any of these kinds of insights. I suspect it’s just not the kind of thing that those fields have historically spent much time thinking about, especially if they historically don’t use Bayesian methods.

                Now, whether you can get published doing high quality Bayesian modeling incorporating true valid prior information in a field dominated by people doing NHST and p-value hacking, and publishing noise as findings, in the way you’ve described before… is a different story.

              • ack, typo: 2^16/sqrt(12) not 12^16/sqrt(12)

  3. Dimiter says:

    but if the original goal of the research is causal inference about the slopes that have been allowed to vary, why does the non-inclusion of other slopes lead by default to an omitted variable bias? It is only an ‘omitted’ variable if it is related both to the outcome variable and to the main explanatory variable (input) at the same time (or more complicated versions of the same idea). If not, including the other varying slopes can improve the efficiency of the estimation but does not necessarily solve an omitted variable bias as such. So basically the way the slope of variable Z varies across countries needs to be related to the outcome Y and to the main input X to induce a bias for causal inference about the effect of X on Y. Or is there something else in the question that I am missing?

  4. How much data per variance component are we talking about? That’s crucial to this discussion. I would do one of the following:

    1.First fit a model using lmer (library lme4) and find the best justified model (the simplest model) using the generalized likelihood ratio test, as Pinheiro and Bates 2000 describe. (also see Gelman and Hill 2007)

    Now you know what a reasonable model given the data might look like.

    2. Fit a full variance covariance matrix using Stan as we did in our tutorial; this is what we often do in our own research in my lab:

    3. Examine the posterior distribution of your variance components and the correlation parameters. If you don’t have much data they’d be pretty widely spread out and centered around zero if you use the approach in our tutorial (which is based on Chung et al, published recently and available from Gelman’s home page).

    At this point, you can make your call based on the based lmer fit and the posterior distributions. If the variance components and correlations are important to you, you have to (a) either get more data (this is the most common problem with these high-level parameters, there usually isn’t enough data to estimate them), or (b) live with your model with a full vcov matrix, knowing that some of the posteriors are being dominated by the priors, or (c) back off to a simpler model. As far as the “fixed effects” go (sorry, Andrew), the results should not change.

    Be aware, however, that your correlation parameters will depend on the parameterization of the fixed effect (see a paper by Reinhold Kliegl on this, I think 2011— will post a link later). So make your decisions on parameterization based on some sensible criterion that answers the question you want to address.

  5. Kedar says:

    Hello everyone,

    I am new to Bayesian world and would like to use this approach in my present study. I am still reading on Bayesian approach and learning a lot of new things. I am posting my questions here.

    I am doing a cross-section (pre-post) study on use of sensory cues to improve locomotion in people post-stroke. Minimally clinical important difference in gait speed is shown to be 0.16m/s. I would like to know how to go about constructing a prior probability model based on MCID.

    Thank you.

Leave a Reply