## The first version of my “inference from iterative simulation using parallel sequences” paper!

From August 1990. It was in the form of a note sent to all the people in the statistics group of Bell Labs, where I’d worked that summer.

To all:

Here’s the abstract of the work I’ve done this summer. It’s stored in the file
/fs5/gelman/abstract.bell, and copies of Figures 1-3 are on Trevor’s desk.
Any comments are of course appreciated; I’m at gelman@stat.berkeley.edu.

On the Routine Use of Markov Chains for Simulation

Andrew Gelman and Donald Rubin, 6 August 1990

corrected version: 8 August 1990

1. Simulation

In probability and statistics we can often specify multivariate distributions
many of whose properties we do not fully understand–perhaps, as in the
Ising model of statistical physics, we can write the joint density function, up
to a multiplicative constant that cannot be expressed in closed form.
For an example in statistics, consider the Normal random
effects model in the analysis of variance, which can be
easily placed in a Bayesian framework with a conjugate prior distribution.
All the conditional densities of the resulting posterior distribution
are simple, but marginal densities can only be written in integral form and
can only be calculated approximately. (For details, see Kinderman and Snell
(1980) or Pickard (1987) for the Ising model, and Lindley and Smith (1972)
for the Bayesian random effects model.)

In such cases, we may not even be able to compute marginal moments of the
difficult distribution, let alone more complicated and interesting summaries
that would help us understand a probability model or posterior inference.
When direct methods such as analytic or numerical integration of “nuisance”
parameters are not computationally feasible, we might try Monte Carlo simulation;
in the simplest form, we draw a finite set of independent random samples from our
distribution, and then calculate desired distributional summaries as functions of
the sampled points. The Monte Carlo method is quite general and powerful; it is
easy to calculate arbitrary quantities of interest
such as the expected long-distance correlation in the Ising model or a posterior
95% confidence region for the largest block effect in a random effects model.
Any aspect of the distribution can be approximated to any desired accuracy if
the number of independently sampled points is large enough.
Simulation also has the advantage of flexibility: once a sample is drawn, it
can be used to learn about any number of different distributional summaries.
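A minimal Python sketch of the simple Monte Carlo idea described above. It uses a bivariate Normal target that can be sampled directly, which is an assumption made purely for illustration (it is not one of the models in the note); the function name `draw_sample`, the correlation 0.8, and the sample size are likewise hypothetical choices.

```python
import random
import statistics

def draw_sample(n, rho=0.8, seed=0):
    """Draw n independent points (x, y) from a standard bivariate Normal
    with correlation rho (an illustrative target, not from the note)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # y given x is Normal(rho * x, 1 - rho^2)
        y = rng.gauss(rho * x, (1.0 - rho**2) ** 0.5)
        points.append((x, y))
    return points

sample = draw_sample(20000)

# The same sample is reused for several different summaries.
mean_x = statistics.fmean(x for x, _ in sample)
prob_both_positive = sum(1 for x, y in sample if x > 0 and y > 0) / len(sample)

# Central 95% interval for the larger of the two coordinates.
largest = sorted(max(x, y) for x, y in sample)
interval_95 = (largest[int(0.025 * len(largest))],
               largest[int(0.975 * len(largest))])
```

Once `sample` exists, each further summary costs only a pass over the stored points, which is the flexibility the paragraph above refers to.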

2. Markov chain methods

Drawing independent random samples is a wonderful tool that is unfortunately not
available for every distribution; in particular, the Ising model and random
effects posterior distributions mentioned above do not permit direct
simulation. Fortunately, a form of indirect simulation method exists for
any multivariate distribution if we can calculate its joint density
(up to a multiplicative constant) or if we can sample from all its univariate
conditional densities. The first of these methods was introduced by
Metropolis et al. (1953) in the Journal of Chemical Physics. Our work focuses
on a similar and slightly simpler method called the Gibbs sampler by Geman and
Geman (1984) in an article for the IEEE.

Let F(x) be our distribution; the Metropolis algorithm takes a starting (vector)
point x0 and constructs a series x1, x2, . . ., that is a sample from an
ergodic Markov chain whose stationary distribution is F(x). Computer
simulation of the series requires calculation of the density f(x) (up to a
constant). These samples xj are not independent; however, the stationary
distribution of the Markov chain is
correct, so if we take a long enough series, the set of values {x1, . . ., xn}
takes the place of the distribution just as an
independent random sample does (although of course an independent sample
carries more information than a Markov chain series of the same length).
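To make the construction concrete, here is a sketch of a random-walk Metropolis sampler in Python. The target (a standard Normal, written only through its unnormalized log density), the uniform jumping rule, the step size, and the number of early steps dropped are all assumptions chosen for illustration, not details from the note or from Metropolis et al.

```python
import math
import random

def log_density(x):
    """Unnormalized log density of the illustrative target (standard Normal).
    Only differences of this function are used, so the normalizing
    constant is never needed."""
    return -0.5 * x * x

def metropolis(x0, n_steps, step_size=2.0, seed=0):
    """Random-walk Metropolis; returns the series x1, ..., xn."""
    rng = random.Random(seed)
    x = x0
    series = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)  # symmetric jump
        log_ratio = log_density(proposal) - log_density(x)
        # Accept with probability min(1, f(proposal) / f(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        series.append(x)  # a rejected proposal repeats the current value
    return series

series = metropolis(x0=10.0, n_steps=50000)
kept = series[1000:]  # drop early steps that still reflect the start at 10
mean_est = sum(kept) / len(kept)
var_est = sum((v - mean_est) ** 2 for v in kept) / (len(kept) - 1)
```

Successive values in `series` are correlated, so `mean_est` and `var_est` are less precise than they would be for the same number of independent draws.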

The Gibbs sampler is a similar algorithm, which produces a Markov chain that
converges to the desired distribution, this time requiring draws from all the
univariate conditional densities at each iteration.
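A sketch of the Gibbs sampler for a case where both univariate conditionals are known: a standard bivariate Normal with correlation `rho`. This target, the starting point, and the function name are assumptions for illustration only.

```python
import random

def gibbs_bivariate_normal(x0, y0, n_steps, rho=0.8, seed=0):
    """Gibbs sampler for a standard bivariate Normal with correlation rho.
    Each iteration draws every coordinate from its conditional distribution
    given the current value of the other coordinate."""
    rng = random.Random(seed)
    sd = (1.0 - rho**2) ** 0.5
    x, y = x0, y0
    series = []
    for _ in range(n_steps):
        x = rng.gauss(rho * y, sd)  # draw x given y
        y = rng.gauss(rho * x, sd)  # draw y given the new x
        series.append((x, y))
    return series

series = gibbs_bivariate_normal(5.0, -5.0, n_steps=20000)
kept = series[1000:]  # drop early iterations influenced by the start
mean_x = sum(x for x, _ in kept) / len(kept)
# Both margins have mean 0 and variance 1, so E[xy] equals rho.
mean_xy = sum(x * y for x, y in kept) / len(kept)
```

Unlike the Metropolis sketch, no proposal is ever rejected here; the price is that every conditional distribution must be available for direct sampling.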

3. Have we converged yet?

Markov chain simulation methods are attractive for many problems because they
enable us to flexibly summarize intractable multivariate distributions by making
full use of the mathematical structure we do know, using a tool we think we
understand–Monte Carlo simulation. Unfortunately, using a sample of a Markov
chain to estimate a distribution raises an immediate question: how long a series
is needed? After one or two steps, we are almost certainly still too close to
the starting point to hope for unbiased summaries. Asymptotically, the chain is
stationary, and all is OK (with some loss of efficiency compared to independent
samples, as mentioned above).

To obtain a feeling for the practical difficulties, we ran the Gibbs sampler for
2000 steps to simulate a case of the Ising model. To give the minimum of details:
x is a vector of binary variables defined on a 100 by 100 lattice; each step of
the Gibbs sampler took on the order of 10,000 computations; and we summarize
each iterate xj by the sample correlation r on the lattice–a function r(x) that
lies between -1 and 1. Theoretical calculations (Pickard, 1987) show that
under our model–the Ising model with beta = 0.5–the marginal distribution
of r is approximately Gaussian with mean around 0.85 or 0.9 and standard
deviation around 0.01. We’d like to know whether the set {r(x1), . . .,
r(x2000)} from the simulated Markov chain can serve as a substitute for the
marginal distribution of r.
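A scaled-down sketch of this kind of experiment in Python. Several details are assumptions, since the note gives only "the minimum of details": spins are coded as -1 and +1, the density is taken proportional to exp(beta times the sum of neighbor products), the lattice wraps around at the edges, one "step" is a full sweep over all sites, and r(x) is taken to be the average product of neighboring spins. The 16 by 16 lattice and 200 steps are chosen only to keep the run short.

```python
import math
import random

def neighbor_sum(x, i, j):
    """Sum of the four neighboring spins, with wrap-around edges (assumed)."""
    size = len(x)
    return (x[(i - 1) % size][j] + x[(i + 1) % size][j]
            + x[i][(j - 1) % size] + x[i][(j + 1) % size])

def gibbs_sweep(x, beta, rng):
    """One Gibbs step: update every site once from its conditional
    distribution given its four neighbors."""
    size = len(x)
    for i in range(size):
        for j in range(size):
            s = neighbor_sum(x, i, j)
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * s))
            x[i][j] = 1 if rng.random() < p_up else -1

def r_stat(x):
    """Average product of each site with its right and lower neighbors
    (an assumed stand-in for the note's sample correlation r)."""
    size = len(x)
    total = 0
    for i in range(size):
        for j in range(size):
            total += x[i][j] * (x[(i + 1) % size][j] + x[i][(j + 1) % size])
    return total / (2 * size * size)

def run_chain(x0, n_steps, beta=0.5, seed=0):
    """Run the Gibbs sampler from x0 and record r after every step."""
    rng = random.Random(seed)
    x = [row[:] for row in x0]  # copy so the starting state is not modified
    series = []
    for _ in range(n_steps):
        gibbs_sweep(x, beta, rng)
        series.append(r_stat(x))
    return series

size = 16
start_rng = random.Random(1)
x0_random = [[start_rng.choice((-1, 1)) for _ in range(size)]
             for _ in range(size)]               # r(x0) near 0
x0_uniform = [[1] * size for _ in range(size)]   # r(x0) = 1

r_from_random = run_chain(x0_random, n_steps=200, seed=2)
r_from_uniform = run_chain(x0_uniform, n_steps=200, seed=3)
```

Plotting `r_from_random` and `r_from_uniform` against the step number gives small-scale analogues of the two kinds of starting point discussed in the note; on a lattice this small the transient behavior need not resemble the 100 by 100 case.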

Figure 1 shows the values of r(xj), for j=1 to 2000. (r(x0) = 0, and the first
few values are cut off to improve resolution on the graph.) The Markov chain
seems to have “converged to stationarity” after the thousand or so steps required
to shake off its initial state. How do we know it has converged, though? Figure
2 zooms in on the first 500 steps of the series, whose apparent convergence we
know to be illusory. For comparison we ran the Gibbs
sampler again for 2000 steps, but this time starting at a point x0 for which
r(x0) = 1; Figure 3 displays the series r(xj), which again seems to have
converged nicely. To destroy all illusions about convergence, hold
Figures 1 and 3 up to the light. The two Markov chains have “converged” to
different distributions! We are, of course, still observing transient
behavior.

Interestingly, the means of the series in Figures 1 and 3 differ, but
the variances are roughly equal. We’re not sure why, but it seems to be a
general feature in these Markov chain simulations that the variance converges
before the mean.

All simulations and plots were done using the New S Language:
A Programming Environment for Data Analysis and Graphics.

4. The answer: parallel Markov chains

To restate the general problem: we wish to summarize an intractable
distribution F(x) by running the Gibbs sampler (or a similar method such as the
Metropolis algorithm) until the distribution of the set of Markov chain
iterates is close to F. As shown in the previous section, convergence seems
impossible to monitor from a single finite realization of the Markov chain;
consequently, we follow the implicit suggestion of Figures 1 and 3 and track
several parallel sample paths.

Consider m independent runs of the Gibbs sampler, each of length n, starting
from m different initial states x10, . . ., xm0:

series 1: x11, . . ., x1n
. . .
series m: xm1, . . ., xmn.

Again, we focus attention on a univariate summary, say r(x); we want to
use the observed simulations rij to determine whether the series of r’s are
close to convergence after n steps.
To understand our method, consider the set of series as m blocks, each with
n observations, in the one-way analysis of variance layout (that is, ignore the
time ordering in the series). We will work with the total sum of squares
(with (mn-1) degrees of freedom) and the “within” sum of squares (with
m(n-1) degrees of freedom).

First assume for simplicity that the starting points of the simulated series
are themselves independent random samples from F(x). (Of course, if this
condition could be obtained in practice,
a Markov chain simulation method would not be
needed.) With independent starting points, all values of any series
are independent of all the values of any other series, and the unconditional
variance of any point rij is just the marginal variance var r under the
distribution F. We can then estimate var r, given the “data matrix” (rij),
by [total SS – (within SS)/m] / ((m-1)n). (Algebraic derivations appear
in the longer version of this article.) Given the assumption of initial
independence, this “between” estimate of variance
(not the same as the usual “between” estimate in ANOVA) is unbiased for finite
series of any length.
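The unbiasedness claim can be checked with a short calculation. This is a sketch under the stated assumption that the m series are independent and stationary, so that every r_ij has the same mean and the same variance; the symbols σ² (for var r), v (for the variance of a series mean), and the bar notation are introduced here for illustration and are not taken from the longer article.

```latex
\begin{align*}
\text{total SS} &= \text{within SS}
  + n \sum_{i=1}^{m} (\bar r_{i\cdot} - \bar r_{\cdot\cdot})^2, \\
E[\text{within SS}] &= m n (\sigma^2 - v),
  \qquad v = \operatorname{var}(\bar r_{i\cdot}), \\
E\Big[\, n \sum_{i=1}^{m} (\bar r_{i\cdot} - \bar r_{\cdot\cdot})^2 \Big]
  &= n (m-1) v, \\
E\big[\text{total SS} - \tfrac{1}{m}\,\text{within SS}\big]
  &= m n (\sigma^2 - v) + n (m-1) v - n (\sigma^2 - v)
   = (m-1)\, n\, \sigma^2 .
\end{align*}
```

Dividing by (m-1)n gives expectation σ² for any n. The within-series correlation enters only through v, which cancels.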

In contrast, the estimated variance within the series, (within SS) / (m(n-1)),
has expectation var r only in the limit as n -> infinity.
For finite series, the expected within mean square increases with n, assuming,
as is likely, that the random variables r(x1), r(x2), . . ., from the Markov
chain are positively correlated. The discrepancy between the two estimates
of var r suggests a test: declare the Markov chain to have converged when
the within mean square is close to the variance estimate between series, with
confidence intervals derived from classical ANOVA theory. Because of the
dependence within blocks, the degrees of freedom of the between and within
estimates are less than (m-1)n and m(n-1), respectively. We can
approximately correct for this information loss (once again, details will be
provided in the longer article).
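The two variance estimates can be computed directly from the matrix (rij). The Python sketch below is an illustration, not the authors' implementation: the function name is hypothetical, the within mean square divides by m(n-1) (the usual one-way ANOVA within degrees of freedom, an assumption of this sketch), and no correction for dependence within series is attempted.

```python
def variance_estimates(r):
    """Given m series of equal length n (r[i][j] is value j of series i),
    return (between_estimate, within_mean_square)."""
    m, n = len(r), len(r[0])
    grand_mean = sum(sum(series) for series in r) / (m * n)
    total_ss = sum((v - grand_mean) ** 2 for series in r for v in series)
    within_ss = 0.0
    for series in r:
        series_mean = sum(series) / n
        within_ss += sum((v - series_mean) ** 2 for v in series)
    # The note's estimate: [total SS - (within SS)/m] / ((m-1)n).
    between_estimate = (total_ss - within_ss / m) / ((m - 1) * n)
    within_mean_square = within_ss / (m * (n - 1))
    return between_estimate, within_mean_square

# Toy input: two series that sit in different places and never overlap.
toy = [[0.0, 1.0, 0.0, 1.0],
       [10.0, 11.0, 10.0, 11.0]]
between_estimate, within_mean_square = variance_estimates(toy)
```

For the toy input, the between estimate is 50.25 and the within mean square is about 0.33, so a comparison of the two would not declare convergence.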

Once we are close enough to convergence to be satisfied, the variance estimates
and degrees of freedom corrections alluded to above allow us to estimate the
marginal summaries E r, var r, and Normal-theory confidence intervals for our
Monte Carlo approximations. We can run the series longer if more precision
is desired, and can repeat the process to study the marginal distributions of
other parameters (without, of course, having to simulate any new series of x’s).

In practice, the starting points of the parallel series can never be sampled
independently with distribution F(x); the simulated series are thus no longer
stationary for any finite n, formally invalidating the above analysis. We
currently have two strategies designed to make the independence assumption
approximately true. First, we try to pick starting values that are far apart and,
if anything, more dispersed than independent random samples. The m parallel series
should then start far apart and grow closer as they approach stationarity, as in
Figures 1 and 3; since the variance between series declines with n, the
comparison-of-variances test should be conservative. Second, we reduce the
effect of the starting values by crudely throwing away the first half
of each simulated series until approximate convergence has been reached.
Once again, Figures 1 and 3 illustrate how a few early steps
of the Markov chain can greatly distort estimates of means and variances
within series. We hope that the conservative strategies of starting with
dispersed points and throwing away early simulations will yield confidence
regions that are wider than those obtained by the ideal method, but that
still have good coverage properties.
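The effect of early steps on within-series variance can be seen in a small Python sketch. The chain here is a first-order autoregression used as a stand-in for a slowly mixing sampler; that choice, the overdispersed starting values, and all function names are assumptions for illustration, not part of the note.

```python
import random

def ar1_series(start, n, phi=0.9, seed=0):
    """Stand-in for a slowly mixing Markov chain: x_t = phi * x_{t-1} + noise.
    Its stationary distribution is Normal(0, 1 / (1 - phi^2))."""
    rng = random.Random(seed)
    x = start
    series = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series

def within_mean_square(r):
    """Pooled within-series variance for a list of equal-length series."""
    m, n = len(r), len(r[0])
    within_ss = 0.0
    for series in r:
        series_mean = sum(series) / n
        within_ss += sum((v - series_mean) ** 2 for v in series)
    return within_ss / (m * (n - 1))

def discard_first_half(r):
    """Keep only the second half of each series."""
    return [series[len(series) // 2:] for series in r]

starts = [-100.0, -33.0, 33.0, 100.0]  # deliberately overdispersed
chains = [ar1_series(s, n=200, seed=k) for k, s in enumerate(starts)]

stationary_variance = 1.0 / (1.0 - 0.9**2)  # about 5.26
within_all = within_mean_square(chains)
within_second_half = within_mean_square(discard_first_half(chains))
```

With the early steps included, `within_all` is dominated by the descent from the starting values and is far larger than `stationary_variance`; after the first half is discarded, the within mean square is of the same order as the stationary variance.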

The idea of comparing parallel simulations is not new; for
example, Fosdick (1959) applied the Metropolis algorithm to the Ising model
by simulating four series independently, from each of two different starting
points. Approximate convergence was declared when the two groups of series
became indistinguishable on the scale of a prechosen error bound.

5. Some references

Ehrman, J. R., Fosdick, L. D., and Handscomb, D. C. (1960).
Computation of order parameters in an Ising lattice by the Monte Carlo method.
{\em Journal of Mathematical Physics} {\bf 1}, 547–558.

Fosdick, L. D. (1959). Calculation of order parameters in a binary
alloy by the Monte Carlo method. {\em Physical Review} {\bf 116}, 565–573.

Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs
distributions, and the Bayesian restoration of images. {\em IEEE Transactions
on Pattern Analysis and Machine Intelligence} {\bf 6}, 721–741.

Hammersley, J. M., and Handscomb, D. C. (1964), chapter 9. {\em Monte Carlo
Methods}. London: Chapman and Hall.

Kinderman, R., and Snell, J. L. (1980). {\em Markov Random Fields and
their Applications}. Providence, R.I.: American Mathematical Society.

Lindley, D. V., and Smith, A. F. M. (1972). Bayes estimates for the linear
model. {\em Journal of the Royal Statistical Society B} {\bf 34}, 1–41.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and
Teller, E. (1953). Equation of state calculations by fast computing machines.
{\em Journal of Chemical Physics} {\bf 21}, 1087–1092.

Pickard, D. K. (1987). Inference for discrete Markov fields: the
simplest nontrivial case. {\em Journal of the American Statistical Association}
{\bf 82}, 90–96.

Ripley, B. D. (1981). {\em Spatial Statistics}, p. 16ff. New York: Wiley.

Tanner, M. A., and Wong, W. H. (1987). The calculation of posterior
distributions by data augmentation. {\em Journal of the American Statistical
Association} {\bf 82}, 528–550.

I wrote the article but properly listed Rubin as coauthor, as the idea came about after many long phone conversations. I encountered the idea of between-within comparison in the 1959 paper by Fosdick (see above citations); I can’t remember how I found that paper but it must have been from a literature search, going backward from more recent sources. Anyway, when I brought up this idea, Rubin picked up on it right away, as it was close to methods he had developed for inference from multiple imputations. Once we had that connection, the idea was there. And I’d credit Rubin’s influence for my goal of estimating a potential scale reduction factor—that is, a numerical measure of lack of mixing—rather than formulating the problem as a hypothesis test.

The published article appeared over two years later in the journal Statistical Science, in a much expanded version.

In some ways, I prefer this short paper to the full version. I like the snappy style, and I like the clarity about what we believe and what we don’t know. I regret not submitting some version of the above article to a journal immediately, right then in Aug 1990. On the other hand, editors and reviewers for statistics journals can be very stuffy, and an article such as above with a concept but no theoretical derivations probably would’ve been shot down over and over and over. Maybe it just took two years to put in enough blah blah blah to make it publishable.

The above is more like a blog post than a journal article. It contains the key idea with no messing around.

P.S. You’ll notice above that I wrote, “Any comments are of course appreciated.” And you probably won’t be surprised to hear that I got no comments. It took me a long time to realize that most people don’t want to comment on things. When we were getting close to finishing the first edition of Bayesian Data Analysis back in 1994, I printed out copies and gave them to lots of prominent statisticians I knew, but very few gave any comments at all. It’s not about me; people just don’t like to read and make comments. We get some comments on the blog, but when you consider the number of comments and the number of readers, you’ll realize that most people don’t comment here either.

1. Manoel Galdino says:

Sometimes it’s not about not wanting to comment. At least for me, the point is that I don’t think I have much to add that would not take a lot of time.

Sure, I’m not a prominent person in any sense. Still, I think many people feel the same.

2. Manoel Galdino says:

Ah, I really like this version of the paper! Clear and short. Unfortunately, reviewers request that we speculate beyond what we do know, and we end up expanding our papers and including more debatable stuff.

3. K? O'Rourke says:

Thanks for posting this.

I have been thinking that the avoidance of math in statistics is currently ruled out by the inability to tabulate distributions of functions of product spaces of independent random variables – even univariate random variables. In fact, with even a fair amount of advanced math (e.g. Probability and Measure, Billingsley) one can get this only for linear functions (in general) by numerically inverting characteristic generating functions via fast Fourier transform (FFT). Not sure anyone really does this more than rarely – hence the wonderfully mathematical field of higher order asymptotics.

Now, if one avoids advice by Rubin and others about repeated sampling properties of Bayes being relevant in some sense, one can fully condition on the sample in hand and get these, as Andrew says here, “the conditional densities of the resulting posterior distribution [that] are simple”, and the equivalent problem here, tabulating marginal distributions of functions of random parameters, can “usually” be dealt with by MCMC. Unfortunately, lessening the chance of MCMC failure and/or detecting its occurrence again requires a fair bit of mathematics :-(

With regard to blog comments, if you were to set aside comments that were of little help, are there still likely to be more from blogging? My sense is that if most had comments that could be worked into something publishable, they would not be making them on the blog.