Iterative importance sampling

Aki points us to some papers:

Langevin Incremental Mixture Importance Sampling

Parallel Adaptive Importance Sampling

Iterative importance sampling algorithms for parameter estimation problems

The next one is not iterative, but it’s interesting in another way:

Black-box Importance Sampling

Importance sampling is what you call it when you’d like to have draws of theta from some target distribution p(theta) (or, in a Bayesian context, we’d say p(theta|y)), but all you have are draws from some proposal distribution g(theta) that approximates p. You take the draws from g, and give each of them a weight proportional to the importance ratio r=p/g. And then you compute weighted averages; for any function h(theta), you estimate E_p(h) as Sum_theta r(theta)h(theta) / Sum_theta r(theta), summing over draws theta from g. We typically can only compute p up to an unknown multiplicative constant, and often we can only compute g up to an unknown multiplicative constant, but those constants drop out when computing the ratio.
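In code, the self-normalized estimator looks like this. A minimal numpy sketch with a toy target and proposal of my own choosing, both written as if known only up to a constant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy target p (unnormalized): a normal with mean 1, constant dropped.
def log_p(theta):
    return -0.5 * (theta - 1.0) ** 2

# Proposal g: a wider normal that roughly covers the target.
g = stats.norm(0.0, 1.5)

theta = g.rvs(size=10_000, random_state=rng)
log_r = log_p(theta) - g.logpdf(theta)   # log importance ratios; constants cancel
w = np.exp(log_r - log_r.max())          # subtract the max for numerical stability
w /= w.sum()                             # self-normalized weights

# Estimate E_p[h(theta)] for h(theta) = theta; the true value is 1.
print(np.sum(w * theta))
```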

Importance sampling is an old idea, and statisticians used to think of it as “exact,” in some sense. And, back around 1990, when Gibbs and Metropolis sampling started to become popular in statistics, a lot of us had the idea of starting a computation with the iterative Gibbs and Metropolis algorithms and then cleaning things up at the end with some exact importance sampling. But this idea was wrong.

Yes, importance sampling is simulation-consistent for most purposes, but, in general, if the importance ratios are unbounded (which will happen if there are parts of the target distribution with longer tails than the proposal distribution), then for any finite number of simulation draws, importance sampling will give you something between the proposal and target distributions. So it doesn’t make sense to think of importance sampling as more “exact” than Gibbs or Metropolis.
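Here is that failure mode in miniature (toy densities of my own): a light-tailed normal proposal for a heavy-tailed Cauchy target, where the ratio p/g is unbounded and a handful of draws capture nearly all the weight:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Heavy-tailed target (Cauchy), light-tailed proposal (normal): r = p/g is unbounded.
theta = stats.norm.rvs(size=100_000, random_state=rng)
log_r = stats.cauchy.logpdf(theta) - stats.norm.logpdf(theta)
w = np.exp(log_r - log_r.max())
w /= w.sum()

print("largest normalized weight:", w.max())
print("effective sample size:", 1.0 / np.sum(w ** 2))  # out of 100,000 draws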

Indeed, importance sampling can be seen as an iterative approximation, starting with a proposal distribution and gradually approaching the target distribution (if certain conditions are satisfied) as the number of simulation draws increases. This is a point I emphasized in section 3 of my 1991 paper: that importance sampling, like Markov chain sampling, is an iterative simulation method. But, where Gibbs and Metropolis are adaptive—their proposal distributions depend on the most recently drawn theta—importance sampling is not.

Thanks to the above reasoning, importance sampling fell out of favor: for hard problems of high or even moderate dimensionality, importance sampling fails miserably; and for easy, low-dimensional problems, one can just as well use black-box MCMC (i.e., Stan).

Importance sampling is no longer the workhorse.

But importance sampling still has a role to play, in (at least) two ways. First, sometimes we want to work with perturbations of our distribution without having to re-fit, for example when doing leave-one-out cross-validation. Thus these two papers with Aki and Jonah (a small sketch of the leave-one-out reweighting follows the links):

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

Pareto smoothed importance sampling.
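The leave-one-out reweighting itself is simple in principle: dropping y_i divides the posterior by p(y_i|theta), so the raw importance ratios come straight from the pointwise log-likelihood matrix. A sketch with function names of my own; the papers above stabilize these raw weights via Pareto smoothing:

```python
import numpy as np
from scipy.special import logsumexp

def raw_loo_weights(log_lik):
    """log_lik: (S, n) array of log p(y_i | theta^(s)) over S posterior draws.
    Leaving out y_i divides the posterior by p(y_i | theta), so the raw
    log importance ratios are just -log_lik, one column per observation."""
    log_r = -log_lik
    return log_r - logsumexp(log_r, axis=0)   # normalized log weights

def elpd_loo_pointwise(log_lik):
    """Importance sampling estimate of log p(y_i | y_{-i}) for each point."""
    log_w = raw_loo_weights(log_lik)
    return logsumexp(log_w + log_lik, axis=0)
```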

The other place for importance sampling is following an approximate fit such as obtained using normal approximation, variational Bayes, or expectation propagation. This is not a gimme because in moderate or high dimensions, the approx is going to be far enough away that the importance ratios will be highly variable. Still, one would expect importance sampling, if done right, to give us something between the approx and the target distribution, so it should be a step forward.
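Here is a minimal sketch of that workflow under assumptions of my own: a toy target, scipy’s BFGS inverse Hessian standing in for the Laplace covariance, and the effective sample size as a crude diagnostic of how variable the ratios are:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
d = 5

# Toy unnormalized log target with lighter-than-Gaussian tails.
def log_p(theta):
    return -0.25 * np.sum(theta ** 4) - 0.5 * np.sum(theta ** 2)

# Normal (Laplace) approximation: mode plus BFGS inverse-Hessian covariance.
res = optimize.minimize(lambda t: -log_p(t), x0=np.ones(d))
approx = stats.multivariate_normal(mean=res.x, cov=res.hess_inv)

draws = approx.rvs(size=20_000, random_state=rng)
log_r = np.array([log_p(t) for t in draws]) - approx.logpdf(draws)
w = np.exp(log_r - log_r.max())
w /= w.sum()
print("effective sample size:", 1.0 / np.sum(w ** 2))
```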

Research still needs to catch up to practice in this area. In particular, I think the theoretical framework for importance sampling should more explicitly recognize that the goal is to get a good intermediate distribution, not to expect to get all the way there.

P.S. Lexi, pictured above, looks pretty important to me! She’s the sister of the reflective cat from this post and the picture comes from Maria Del Carmen Herrojo-Ruiz.

17 thoughts on “Iterative importance sampling”

  1. Thanks for bringing up this important subject. Just wanted to add that the subject of “Iterative Importance Sampling” is also known as “Adaptive Umbrella Sampling” in the fields of statistical mechanics and molecular dynamics, and has quite a bit of literature associated with it.

  2. A., if I may, I find it somewhat revealing that you make no mention whatsoever of particle filters and of the enormous and still ongoing literature on sequential Monte Carlo (like SMC^2). Beyond a certain dimension, MCMC cannot keep up, while sequential methods in state-space models can outgrow the dimension by learning sequentially. I thus find the dismissal of importance sampling as something of the past excessive, even while agreeing that some papers in your list are not exactly rocket science. (And thanks to Aki for these links!)
    X.

    • > Beyond a certain dimension, MCMC cannot keep up while sequential methods in state-space models can outgrow the dimension by learning sequentially.

      I don’t agree with either part of that statement!
      1) It’s not MCMC that doesn’t scale beyond certain dimensions, it’s bad proposal kernels! If we would just burn every book that suggests RWMH or Gibbs samplers are enough, then MCMC would suddenly scale *much* better!

      2) SMC is extremely hard to use when the state is high dimensional (particle collapse is real!). The recent infinite-dimensional SMC algorithms hit the same problem that the early infinite-dimensional MCMC methods hit: the proposal is adapted to the prior, so it’s woefully inefficient if the data is informative. PFs give good point estimates in ultra-high-dimensional problems (like data assimilation for weather prediction) but do not do a good job at quantifying the variability in the posterior (you typically can’t run many particles, and they collapse). In the *very* special case where your system undergoes temporal Markovian dynamics, PF/SMC methods scale, but only in the time variable. They are a specialised tool, and in situations that they are specifically adapted to, they will outperform MCMC.
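      To make the particle-collapse point concrete, here is a minimal bootstrap particle filter on a toy one-dimensional model of my own (not from any of the papers above): propagate with the prior, reweight by the likelihood, resample when the effective sample size degrades. With a one-dimensional state this works fine; with a high-dimensional state, the very first reweighting already concentrates nearly all the weight on a few particles.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: x_t = 0.9 x_{t-1} + N(0,1),  y_t = x_t + N(0, 0.5^2)
T, N = 100, 1_000
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + rng.normal()
y = x + 0.5 * rng.normal(size=T)

particles = rng.normal(scale=2.3, size=N)   # roughly the stationary sd
logw = np.zeros(N)
for t in range(T):
    if t > 0:
        particles = 0.9 * particles + rng.normal(size=N)  # propagate with the prior
    logw += -0.5 * ((y[t] - particles) / 0.5) ** 2        # reweight by the likelihood
    w = np.exp(logw - logw.max())
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < N / 2:                      # ESS below half: resample
        particles = rng.choice(particles, size=N, p=w)
        logw = np.zeros(N)
```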

      But yes, they are a modern realisation of the IS idea. I guess IS is SMC with an independence sampler kernel. But Andrew’s call for methods to build better problem-adapted IS proposals generalises to building more adapted SMC/PF kernels.

  3. I’ve been trying to figure out some sort of approximate HMC method that is robust to the stiffness phenomenon that occurs when some set of parameters is highly identified compared to others. Particularly for high dimensional problems.

    The insight I had was that if the HMC approximation is sloppy and samples too widely, then a parallel tempering Metropolis scheme, with the target distro and a tempered target distro at higher temperature, applied to the entire ensemble of samples from the approximate HMC, acts like a Maxwell’s demon: it sorts the tail samples into the higher-temp ensemble and lets the core samples jiggle into their local equilibrium. Ultimately HMC is great because it travels widely, but it’s a bit delicate due to stiffness and divergences.

    Iterative methods using parallel tempering can clean up a sloppy sample even when you don’t know the proposal distribution up to a constant. As long as the method is targeting the target distro, using the tempered target distro lets you rapidly boil off the hot samples.

    • The same method could be applied to the output of VB in Stan. VB gives us an efficient umbrella. Take the ensemble of draws, split it into two groups, then give one group the target density and the other group the VB approx density.

      Now, with probability say 80 percent, choose one point from each ensemble and try to swap them; with probability 20 percent, run each chosen point through a series of random-walk Metropolis steps.

      After doing this a while, throw out the hot ensemble from VB and dup the target ensemble in its place. Your VB ensemble will soak up all the points that are in the deep tails, and random walk will spread out the points that are too concentrated. Several rounds of this Maxwell’s demon process should make a VB-derived sample much more target-like.
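      If I understand the scheme, the swap move is an ordinary Metropolis step on the joint density of the two ensembles, so something like this sketch (function and variable names are mine) preserves it; the random-walk refreshes would run between swaps, each ensemble against its own density:

```python
import numpy as np

rng = np.random.default_rng(4)

def demon_swap(cold, hot, log_p, log_q):
    """One Maxwell's-demon swap between the target ensemble ('cold', stationary
    for p) and the VB/tempered ensemble ('hot', stationary for q), both (n, d)
    arrays. Detailed balance holds for the product density p(cold) * q(hot)."""
    i = rng.integers(len(cold))
    j = rng.integers(len(hot))
    log_alpha = (log_p(hot[j]) + log_q(cold[i])
                 - log_p(cold[i]) - log_q(hot[j]))
    if np.log(rng.uniform()) < log_alpha:
        cold[i], hot[j] = hot[j].copy(), cold[i].copy()
```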

  4. “Still, one would expect importance sampling, if done right, to give us something between the approx and the target distribution, so it should be a step forward.”

    While I would agree that it could be thought of as ‘between,’ in the cases I’ve dealt with when importance sampling breaks down, the sample becomes degenerate, with one observation getting near 100% of the weight. It’s not really a useful step forward when this happens.

    On a related note, MCMC-MLE algorithms use importance sampling with bounds to control for the degradation of the validity of the sample ( http://dx.doi.org/10.1080/10618600.2012.679224 ).

  5. I guess it is a very personal account of importance sampling…! The statistics literature of the last 25 years features many successful applications and interesting methodological development of importance sampling.

    Readers of this blog might be interested in annealed importance sampling, particle filters, sequential Monte Carlo samplers, particle MCMC… I wrote a reading list here not so long ago:
    https://statisfaction.wordpress.com/2017/06/30/particle-methods-in-statistics/
    Another list on SMC is on Arnaud Doucet’s website:
    http://www.stats.ox.ac.uk/~doucet/smc_resources.html

    In particular, both the methodological and theoretical literature have devoted a lot of attention to the idea of intermediate distributions, how they should relate to the target, how to come up with them adaptively, etc.

  6. > Indeed, importance sampling can be seen as an iterative approximation, starting with a proposal distribution and gradually approaching the target distribution (if certain conditions are satisfied) as the number of simulation draws increase. This is a point I emphasized in section 3 of my 1991 paper, that importance sampling, like Markov chain sampling, is an iterative simulation method.

    Do you really mean “importance sampling” there? In the paper linked you talk about the “importance resampling” method being iterative. But what you define in this post is just “importance sampling”, which you describe in section 1 of that paper as a non-iterative method (together with rejection sampling).

    • I had been thinking along these lines recently. HMC is of course great because, as time increases, it moves around over long distances compared to a random walk. On the other hand, HMC is the solution to a dynamics problem, and dynamics problems are typically much more challenging than statics problems. In many problems I’ve worked on with Stan, the step size drops down very small, the treedepth increases, and each iteration takes thousands of HMC steps. Divergences become common as dimension increases.

      In the end, what we want is some kind of static equilibrium: any RWMH (or similar MCMC step) maps an ensemble of points into an ensemble of points for which E[f(x)] is essentially unchanged (within Monte Carlo error) for “all” f of interest. For this to be true, we need the particles to be maximally far apart, so that we aren’t insensitive to functions whose support lies in the high-probability region of the distribution but outside the range over which the particles spread… yet this is subject to the constraint that perturbations of the form RWMH(x) leave the expectations invariant, so we can’t just blow the ensemble out to infinity to maximize spread.

      That tradeoff, maximizing the coverage of the space subject to the constraint of invariance to MCMC perturbation, seems like a statics problem we could solve. It seems like this technique you point to is along those lines.

    • I really like that paper! There were a couple of things that made me scratch my head about it though:

      1. The gradient has a term driving points to regions of high density and a term driving points away from each other. The term that drives points away from each other must be what drives points into the typical set if they start close to a maximum of the density, but that term is entirely kernel-based and contains no info about the density. How does that work?

      2. The KL-divergence notionally being minimized is actually infinite because the target density and the approximation (a point cloud) are mutually singular. How does that work?

      Point 2 is addressed to a certain degree in a follow-up: https://arxiv.org/abs/1704.07520.
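      For reference, the two-term update being discussed matches Stein variational gradient descent; here is a minimal numpy sketch of one step with an RBF kernel (fixed bandwidth; the paper chooses it adaptively). Near a mode the attractive term shrinks and the purely kernel-based term, the one asked about in point 1, spreads the particles:

```python
import numpy as np

def svgd_step(x, grad_log_p, eps=0.1, h=1.0):
    """One SVGD update on an (n, d) particle array. The first term pulls
    particles toward high density; the second, kernel-only term pushes
    them apart."""
    diffs = x[:, None, :] - x[None, :, :]                     # x_i - x_j, (n, n, d)
    k = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * h ** 2))   # RBF kernel matrix
    grads = grad_log_p(x)                                     # (n, d): grad log p at each x_j
    attract = k @ grads                                       # kernel-weighted gradients
    repulse = np.sum(k[:, :, None] * diffs, axis=1) / h ** 2  # sum_j grad_{x_j} k(x_j, x_i)
    return x + eps * (attract + repulse) / len(x)
```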

      • http://models.street-artists.org/2017/09/05/statics-methods-for-mc-ensembles/

        That’s more or less what I’ve been thinking about the last few days, and it seems like a related idea, I don’t know. In my proposal, in essence, RWMH is a way to find a vector that transports the ensemble into (at first) and then along (after equilibrium) the typical set. Here we really do mean the typical set, which is normally defined over sequences of draws from a distribution; here the ensemble can be seen as a sequence of points from the distribution, and so it’s that whole sequence that is being perturbed simultaneously. Then, even if your distribution is 1D, the ensemble of say 100 points is 100-dimensional and has typical-set properties. In some ways this uses the “curse” of dimensionality as a blessing: you can find lots of directions to go in that, as an ensemble, leave you at more or less constant entropy. RWMH becomes a lightweight way to compute directions for travel (no gradients required). Line search becomes a way to move points farther than they would with RW alone.

        The whole thing doesn’t collapse to a mode because you can’t synchronize 100 Random Walk Metropolis steps to all go towards the mode at the same time (unless you’re FAR out in the tail, in which case that’s exactly what you want in order to find the high prob region anyway), but near equilibrium, some points will always be going downhill while others go uphill.

        I have no idea if this idea works, but I’ve got enough infrastructure set up in Julia to try it out and see what happens.

          • Regarding this idea: my experiments suggest that it partially works. Starting from far out of equilibrium, within a reasonable computing time on a 1002-dimensional sample problem, using just 1D pure optimization at each step, about half of the points collapse down into the typical set, and the other half stay outside at much higher entropy (lower log probability).

          Once the majority of the points reaches the typical set (or even gets close to the mode, because I’m just using optimization), it becomes impossible to move the rest of the points into the typical set, because any 1D or even small set of lines (I tried 4D, for example) can’t move just the out-of-equilibrium points without taking the in-equilibrium points out of equilibrium. So the scheme doesn’t work.

          On the other hand, it appears it would work somewhat if you sent each particle independently, that is, if you had N control points for an N-point ensemble. However, in that case, you’d just collapse all particles to the mode.

          However, it occurred to me that in optimization it’s not a problem to move the points a long distance; you just change the control points by a large amount. So the next idea is to take the ensemble, perturb ALL of it according to some perturbation kernel (such as Normal(0,1) or whatever), and then use minimization to try to re-equilibrate.

          Right now I’m having it do optimization steps with a 10% chance of a Normal(0,3) perturbation to every coordinate (there are 1002 coordinates in my example problem and 100 points). What I see is that when the perturbation comes along, the ensemble entropy increases dramatically, like from 70,000 to 2,700,000, but on the very next step it will decrease dramatically as well, say down to 140,000, and then continue to decrease as you go along.
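          A sketch of that loop, with names and a plain gradient step of my own choosing standing in for the 1D line minimization actually used:

```python
import numpy as np

rng = np.random.default_rng(5)

def perturb_and_requilibrate(x, grad_neg_log_p, n_rounds=1000,
                             step=0.01, kick_prob=0.1, kick_scale=3.0):
    """x: (n_points, d) ensemble. Alternate descent on the summed negative
    log density with occasional Normal(0, kick_scale) kicks to every coordinate."""
    for _ in range(n_rounds):
        if rng.uniform() < kick_prob:
            x = x + rng.normal(scale=kick_scale, size=x.shape)  # blow the ensemble out
        else:
            x = x - step * grad_neg_log_p(x)  # one minimization step per point
    return x
```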

          Now, suppose that the typical set in the 1002 dimensional space is virtually completely contained in a ball of radius r (say 1-epsilon of the mass, where epsilon is much less than 1/N with N the number of points in your ensemble so that basically none of the points in your ensemble should be outside this ball). Then, if everything collapses to the mode, a single perturbation by a Normal kernel of sufficient scale completely blows all your points outside the ball. Your next task is to try to move towards the typical set: a few rounds of RWMH moves each point in your ensemble in the right direction. Extending that motion in a line until it minimizes the ensemble entropy brings you even closer, then repeating brings you closer and closer. You can imagine that with minimization, sufficient iterations would bring you entirely onto the mode.

          Now, after several rounds of this whole mixing scheme, instead of two modes in lp with one on the mode and one well outside the typical set, I have all the points near one lp value, but I think it’s somewhat too far into the tails. This is to be expected because the Normal(0,3) perturbation is in no way adapted. It needs to become a smaller perturbation as time goes along so that by the end, its effect is to counterbalance the minimization step… and I have no idea if that is possible or how to do it.

          However, the intuition helped me understand the paper a bit better I think.

          In their description the basic idea is that the objective function that each point is trying to maximize is distorted by the nearby presence of other particles. If you’re trying to maximize density, you head to the mode, but if you’re trying to maximize distorted density, and the local distortion is such that you’re in a flat region of space because the kernel averaging estimate of density is flat… then you don’t move. I think this is what is going on in the paper. The presence of the other particles is detected by an individual particle because each particle looks out over a region of space for other particles within its kernel region. If there are substantial numbers of other particles, then it stops moving.

          So, the kernel-only dependence in the repulsive part isn’t really true. As the particles themselves move into an ensemble that approximates the target density, the kernel sum for each particle detects the other particles in its region, which are arranged… in the target density.

          In essence, we’re doing a high-dimensional kernel density estimate and trying to have each particle move into a situation in which its local kernel density estimate is in equilibrium.

          On that basis, I don’t know whether the whole thing works for small ensembles, or really requires that the number of particles be large and/or scale with dimension.
