From August 1990. It was in the form of a note sent to all the people in the statistics group of Bell Labs, where I’d worked that summer.

To all:

Here’s the abstract of the work I’ve done this summer. It’s stored in the file,

/fs5/gelman/abstract.bell, and copies of the Figures 1-3 are on Trevor’s desk.

Any comments are of course appreciated; I’m at gelman@stat.berkeley.edu.On the Routine Use of Markov Chains for Simulation

Andrew Gelman and Donald Rubin, 6 August 1990

corrected version: 8 August 1990

1. Simulation

In probability and statistics we can often specify multivariate distributions

many of whose properties we do not fully understand–perhaps, as in the

Ising model of statistical physics, we can write the joint density function, up

to a multiplicative constant that cannot be expressed in closed form.

For an example in statistics, consider the Normal random

effects model in the analysis of variance, which can be

easily placed in a Bayesian framework with a conjugate prior distribution.

All the conditional densities of the resulting posterior distribution

are simple, but marginal densities can only be written in integral form and

can only be calculated approximately. (For details, see Kinderman and Snell

(1980) or Pickard (1987) for the Ising model, and Lindley and Smith (1972)

for the Bayesian random effects model.)In such cases, we may not even be able to compute marginal moments of the

difficult distribution, let alone more complicated and interesting summaries

that would help us understand a probability model or posterior inference.

When direct methods such as analytic or numerical integration of “nuisance”

parameters are not computationally feasible, we might try Monte Carlo simulation;

in the simplest form, we draw a finite set of independent random samples from our

distribution, and then calculate desired distributional summaries as functions of

the sampled points. The Monte Carlo method is quite general and powerful; it is

easy to calculate arbitrary quantities of interest

such as the expected long-distance correlation in the Ising model or a posterior

95% confidence region for the largest block effect in a random effects model.

Any aspect of the distribution can be approximated to any desired accuracy if

the number of independently sampled points is large enough.

Simulation also has the advantage of flexiblility: once a sample is drawn, it

can be used to learn about any number of different distributional summaries.2. Markov chain methods

Drawing independent random samples is a wonderful tool that is unfortunately not

available for every distribution; in particular, the Ising model and random

effects posterior distributions mentioned above do not permit direct

simulation. Fortunately, a form of indirect simulation method exists for

any multivariate distribution if we can calculate its joint density

(up to a multiplicative constant) or if we can sample from all its univariate

conditional densities. The first of these methods was introduced by

Metropolis et al. (1953) in the Journal of Chemical Physics. Our work focuses

on a similar and slightly simpler method called the Gibbs sampler by Geman and

Geman (1984) in an article for the IEEE.Let F(x) be our distribution; the Metropolis algorithm takes a starting (vector)

point x0 and constructs a series x1, x2, . . ., that is a sample from an

ergodic Markov chain whose stationary distribution is F(x). Computer

simulation of the series requires calculation of the density f(x) (up to a

constant). These samples xj are not independent; however, the stationary

distribution of the Markov chain is

correct, so if we take a long enough series, the set of values {x1, . . ., xn}

takes the place of the distribution just as an

independent random sample does (although of course an independent sample

carries more information than a Markov chain sample of the

same length).The Gibbs sampler is a similar algorithm, which produces a Markov chain that

converges to the desired distribution, this time requiring draws from all the

univariate conditional densities at each iteration.3. Have we converged yet?

Markov chain simulation methods are attractive for many problems because they

enable us to flexibly summarize intractable multivariate distributions by making

full use of the mathematical structure we do know, using a tool we think we

understand–Monte Carlo simulation. Unfortunately, using a sample of a Markov

chain to estimate a distribution raises an immediate question: how long a series

is needed? After one or two steps, we are almost certainly still too close to

the starting point to hope for unbiased summaries. Asymptotically, the chain is

stationary, and all is OK (with some loss of efficiency compared to independent

samples, as mentioned above).To obtain a feeling for the practical difficulties, we ran the Gibbs sampler for

2000 steps to simulate a case of the Ising model. To give the minimum of details:

x is a vector of binary variables defined on a 100 by 100 lattice; each step of

the Gibbs sampler took on the order of 10,000 computations; and we summarize

each iterate xj by the sample correlation r on the lattice–a function r(x) that

lies between -1 and 1. Theoretical calculations (Pickard, 1987) show that

under our model–the Ising model with beta = 0.5–the marginal distribution

of r is approximately Gaussian with mean around 0.85 or 0.9 and standard

deviation around 0.01. We’d like to know whether the set {r(x1), . . .,

r(x2000)} from the simulated Markov chain can serve as a substitute for the

marginal distribution of r.Figure 1 shows the values of r(xj), for j=1 to 2000. (r(x0) = 0, and the first

few values are cut off to improve resolution on the graph.) The Markov chain

seems to have “converged to stationarity” after the thousand or so steps required

to shake off its initial state. How do we know it has converged, though? Figure

2 zooms in of the first 500 steps of the series, whose apparent convergence we

know to be illusory. For comparison we ran the Gibbs

sampler again for 2000 steps, but this time starting at a point x0 for which

r(x0) = 1; Figure 3 displays the series r(xj), which again seems to have

converged nicely. To destroy all illusions about convergence, hold

Figures 1 and 3 up to the light. The two Markov chains have “converged” to

different distributions! We are, of course, still observing transient

behavior.Interestingly, the means of the series in Figures 1 and 3 differ, but

the variances are roughly equal. We’re not sure why, but it seems to be a

general feature in these Markov chain simulations that the variance converges

before the mean.All simulations and plots were done using the New S Language:

A Programming Environment for Data Analysis and Graphics.4. The answer: parallel Markov chains

To restate the general problem: we wish to summarize an intractable

distribution F(x) by running the Gibbs sampler (or a similar method such as the

Metropolis algorithm) until the distribution of the set of Markov chain

iterates is close to F. As shown in the previous section, convergence seems

impossible to monitor from a single finite realization of the Markov chain;

consequently, we follow the implicit suggestion of Figures 1 and 3 and track

several parallel sample paths.Consider m independent runs of the Gibbs sampler, each of length n, starting

from m different initial states x10, . . ., xm0:series 1: x11, . . ., x1n

. . .

series m: xm1, . . ., xmn.Again, we focus attention on a univariate summary, say r(x); we want to

use the observed simulations rij to determine whether the series of r’s are

close to convergence after n steps.

To understand our method, consider the set of series as m blocks, each with

n observations, in the one-way analysis of variance layout (that is, ignore the

time ordering in the series). We will work with the total sum of squares

(with (mn-1) degrees of freedom) and the “within” sum of squares (with

(m-1)(n-1) degrees of freedom).First assume for simplicity that the starting points of the simulated series

are themselves independent random samples from F(x). (Of course, if this

condition could be obtained in practice,

a Markov chain simulation method would not be

needed.) With independent starting points, all values of any series

are independent of all the values of any other series, and the unconditional

variance of any point rij is just the marginal variance var r under the

distribution F. We can then estimate var r, given the “data matrix” (rij),

by [total SS - (within SS)/m] / ((m-1)n). (Algebraic derivations appear

in the longer version of this article.) Given the assumption of initial

independence, this “between” estimate of variance

(not the same as the usual “between” estimate in ANOVA) is unbiased for finite

series of any length.In contrast, the estimated variance within the series, (within SS) / ((m-1)(n-1)),

has expectation var r only in the limit as n -> infinity.

For finite series, the expected within mean square increases with n, assuming,

as is likely, that the random variables r(x1), r(x2), . . ., from the Markov

chain are positively correlated. The discrepancy between the two estimates

of var r suggests a test: declare the Markov chain to have converged when

the within mean square is close to the variance estimate between series, with

confidence intervals derived from classical ANOVA theory. Because of the

dependence within blocks, the degrees of freedom of the between and within

estimates are less than (m-1)n and m(n-1), respectively. We can

approximately correct for this information loss (once again, details will be

provided in the longer article).Once we are close enough to convergence to be satisfied, the variance estimates

and degrees of freedom corrections alluded to above allow us to estimate the

marginal summaries E r, var r, and Normal-theory confidence intervals for our

Monte Carlo approximations. We can run the series longer if more precision

is desired, and can repeat the process to study the marginal distributions of

other parameters (without, of course, having to simulate any new series of x’s).In practice, the starting points of the parallel series can never be sampled

independently with distribution F(x); the simulated series are thus no longer

stationary for any finite n, formally invalidating the above analysis. We

currently have two strategies designed to make the independence assumption

approximately true. First, we try to pick starting values that are far apart and,

if anything, more dispersed than independent random samples. The m parallel series

should then start far apart and grow closer as they approach stationarity, as in

Figures 1 and 3; since the variance between series declines with n, the

comparison-of-variances test should be conservative. Second, we reduce the

effect of the starting values by crudely throwing away the the first half

of each simulated series until approximate convergence has been reached.

Once again, Figures 1 and 3 illustrate how a few early steps

of the Markov chain can greatly distort estimates of means and variances

within series. We hope that the conservative strategies of starting with

dispersed points and throwing away early simulations will yield confidence

regions that are wider than those obtained by the ideal method, but that

still have good coverage properties.The idea of comparing parallel simulations is not new; for

example, Fosdick (1959) applied the Metropolis algorithm to the Ising model

by simulating four series independently, from each of two different starting

points. Approximate convergence was declared when the two groups of series

became indistinguishable on the scale of a prechosen error bound.5. Some references

Ehrman, J. R., Fosdick, L. D., and Handscomb, D. C. (1960).

Computation of order parameters in an Ising lattice by the Monte Carlo method.

{\em Journal of Mathematical Physics} {\bf 1} 547–558.Fosdick, L. D. (1959). Calculation of order parameters in a binary

alloy by the Monte Carlo method. {\em Physical Review} {\bf 116}, 565–573.Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs

distributions, and the Bayesian restoration of images. {\em IEEE Transactions

on Pattern Analysis and Machine Intelligence} {\bf 6}, 721–741.Hammersley, J. M., and Handscomb, D. C. (1964), chapter 9. {\em Monte Carlo

Methods}. London: Chapman and Hall.Kinderman, R., and Snell, J. L. (1980). {\em Markov Random Fields and

their Applications}. Providence, R.I.: American Mathematical Society.Lindley, D. V., and Smith, A. F. M. (1972). Bayes estimates for the linear

model. {\em Journal of the Royal Statistical Society B} {\bf 34}, 1–41.Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and

Teller, E. (1953). Equation of state calculations by fast computing machines.

{\em Journal of Chemical Physics} {\bf 21}, 1087–1092.Pickard, D. K. (1987). Inference for discrete Markov fields: the

simplest nontrivial case. {\em Journal of the American Statistical Association}

{\bf 82} 90–96.Ripley, B. D. (1981). {\em Spatial Statistics}, p. 16ff. New York: Wiley.

Tanner, M. A., and Wong, W. H. (1987). The calculation of posterior

distributions by data augmentation. {\em Journal of the American Statistical

Association} {\bf 82}, 528–550.

I wrote the article but properly listed Rubin as coauthor, as the idea came about after many long phone conversations. I encountered the idea of between-within comparison in the 1959 paper by Fosdick (see above citations); I can’t remember how I found that paper but it must have been from a literature search, going backward from more recent sources. Anyway, when I brought up this idea, Rubin picked up on it right away, as it was close to methods he had developed for inference from multiple imputations. Once we had that connection, the idea was there. And I’d credit Rubin’s influence for my goal of estimating a potential scale reduction factor—that is, a numerical measure of lack of mixing—rather than formulating the problem as a hypothesis test.

The published article appeared over two years later in the journal Statistical Science, in a much expanded version.

In some ways, I prefer this short paper to the full version. I like the snappy style, and I like the clarity about what we believe and what we don’t know. I regret not submitting some version of the above article to a journal immediately, right then in Aug 1990. On the other hand, editors and reviewers for statistics journals can be very stuffy, and an article such as above with a concept but no theoretical derivations probably would’ve been shot down over and over and over. Maybe it just took two years to put in enough blah blah blah to make it publishable.

The above is more like a blog post than a journal article. It contains the key idea with no messing around.

P.S. You’ll notice above that I wrote, “Any comments are of course appreciated.” And you probably won’t be surprised to hear that I got no comments. It took me a long time to realize that most people don’t want to comment on things. When we were getting close to finishing the first edition of Bayesian Data Analysis back in 1994, I printed out copies and gave them to lots of prominent statisticians I knew, but very few gave any comments at all. It’s not about me; people just don’t like to read and make comments. We get some comments on the blog, but when you consider the number of comments and the number of readers, you’ll realize that most people don’t comment here either.

Sometimes it’s not about not wanting to comment. At least to me, the point is that I don’t think to have much to add that will not take a lot of time.

Sure, I’m not a prominent person in any sense. Still, I think many people feel the same.

Ah, I really like this version of the paper! Clear and short. Unfortunately, reviewers request that we speculate beyond what we do know and we end up expand our papers and including more debatable stuff.

Thanks for posting this.

I have been thinking that the avoidance of math in statistics is currently ruled out by the inability to tabulate distributions of functions of product spaces of independent random variables – even univariate random variables. In fact, with even a fair amount of advanced math (e.g. Probability and Measure, Billingsley) one can one get this only for linear functions (in general) by numerically inverting characteristic generating functions via fast fourier transform (FFT). Not sure anyone really does this more than rarely – hence the wonderfully mathematical field of higher order asymptotics.

Now, if one avoids advice by Rubin and others about repeated sampling properties of Bayes being relevant in some sense, one can fully condition on the sample in hand, get these as Andrew says, here “the conditional densities of the resulting posterior distribution [that] are simple” and the equivalent problem here, tabulating marginal distributions of functions of random parameters, can “usually” be dealt with by MCMC. Unfortunately, to lessen the chance of MCMC failure and or detect its occurrence, again requires a fair bit of mathematics :-(

With regard to blog comments, if you were to set aside comments that were of little help, are there still likely to be more from blogging. My sense that if most had comments that could be worked into something publishable, they would not be making them on the blog.