
Are Ivy League schools overrated?

I won’t actually answer the above question, as I am offering neither a rating of these schools nor a measure of how others rate them (which would be necessary to calibrate the “overrated” claim). What I am doing is responding to an email from Mark Palko, who wrote:

I [Palko] am in broad agreement with this New Republic article by William Deresiewicz [entitled "Don't Send Your Kid to the Ivy League: The nation's top colleges are turning our kids into zombies"] and I’ll try to blog on it if I can get caught up with more topical threads. I was particularly interested in the part about there being a “non-aggression pact” outside of the sciences.

This fits in with something I’ve noticed. I know this sounds harsh, but when I run across someone who is at the top of their profession and yet seems woefully underwhelming, they often have Ivy League B.A.s in non-demanding majors (for example, Jeff Zucker, Harvard, History; John Tierney, Yale, American Studies). My working hypothesis is that, while everyone who graduates from an elite school has an advantage in terms of reputation and networks, the actual difficulty of completing certain degrees isn’t that high relative to non-elite schools. Thus a history degree from Harvard isn’t worth that much more than a history degree from a Cal State school.

And David Brooks graduated from the University of Chicago with a degree in history . . .

In all seriousness, I don’t know if I agree with the claim in the headline of that article Palko links to.

I was very impressed by some of the Harvard undergrads I taught. Then again, they were statistics majors. In the old days, statistics might have been considered the soft option compared to math, but I don’t think that’s the case anymore. If anything, math majors are sometimes the sleepwalkers who happened to be good at math in school and never thought of stepping off the track. Anyway, it’s hard for me to make any general statements considering that I don’t teach many undergrads at all at Columbia.

Palko responded:

Yeah, I don’t want to put down Harvard grads, even the history majors. I’m sure that a disproportionate number of the brightest, most promising young historians are working on Harvard B.A.s. What’s more, I suspect most of them are developing valuable relationships with some of the most important names in their field.

What I’m wondering about is the popular notion that Ivy League schools are hard to get into and hard to get through. The first part is certainly true, and the second appears to be true for STEM (which also has an additional self-selection bias). I’m just not sure if it holds for all fields.

I don’t think there’s any question that selection bias, networking opportunities and halo effects play a large role here. What if they account for most of the benefit of attending an elite school for most students? This is worrisome from both sides: students are twisting themselves into knots to meet artificial and frankly somewhat odd selection criteria; and we’re giving the students who meet these odd criteria huge advantages in terms of wealth, career, and influence.

That can’t be good.

No, I didn’t say that!

Faye Flam wrote a solid article for the New York Times on Bayesian statistics, and as part of her research she spent some time on the phone with me awhile ago discussing the connections between Bayesian inference and the crisis in science criticism. My longer thoughts on this topic are in my recent article, “The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective,” but of course many more people will get the short version that appeared in the newspaper.

That’s fine, and Flam captured the general “affect” of our discussion—the idea that Bayes allows the use of prior information, and that p-values can’t be taken at face value. As I discuss below, I like Flam’s article, I’m glad it’s out there, and I’m grateful that she took the time to get my perspective.

Unfortunately, though, some of the details got garbled.

Some general principles of Bayesian data analysis, arising from a Stan analysis of Jon Lee Anderson’s height


God is in every leaf of every tree. The leaf in question today is the height of journalist and Twitter aficionado Jon Lee Anderson, a man who got some attention a couple years ago after disparaging some dude for having too high a tweets-to-followers ratio. Anderson called the other guy a “little twerp” which made me wonder if he (Anderson) suffered from “tall person syndrome,” that problem that some people of above-average height have, that they think they’re more important than other people because they literally look down on them.

After I raised this issue, a blog commenter named Gary posted an estimate of Anderson’s height using information available on the internet:

Based on this picture: he appears to be fairly tall. But the perspective makes it hard to judge.

Based on this picture: he appears to be about 9-10 inches taller than Catalina Garcia.

But how tall is Catalina Garcia? Not that tall – she’s shorter than the high-wire artist Philippe Petit. And he doesn’t appear to be that tall… about the same height as Claire Danes, who according to Google is 5′ 6″.

So if Jon Lee Anderson is 10″ taller than Catalina Garcia, who is 2″ shorter than Philippe Petit, who is the same height as Claire Danes, then he is 6′ 2″ tall.

I have no idea who Catalina Garcia is, but she makes a decent ruler.
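Gary’s chain of comparisons reduces to simple arithmetic. As a sanity check, here is the point-estimate version in Python, taking each comparison as exact (the variable names are mine):

```python
# Point-estimate version of Gary's chain of comparisons (all heights in inches).
claire = 66               # 5'6", per Google
philippe = claire + 0     # "about the same height" as Claire
catalina = philippe - 2   # 2 inches shorter than Philippe
jon = catalina + 10       # 9-10 inches taller than Catalina; take the high end
print(jon)                # 74 inches, i.e., 6'2"
```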

I happened to run across that comment the other day (when searching the blog for Tom Scocca) and it inspired me to put out a call for the above analysis to be implemented in Stan. A couple of other faithful commenters (Andrew Whalen and Daniel Lakeland) did this. But I wasn’t quite satisfied with either of those efforts (sorry, I’m picky, what can I say? You must’ve known this going in). So I just did it myself.


Before getting to my model, let me emphasize that nothing fancy is going on. I’m pretty much just translating Gary’s above comment into statistical notation.

Here’s my Stan program:

transformed data {
  real mu_men;
  real mu_women;
  real sigma_men;
  real sigma_women;
  mu_men <- 69.1;
  mu_women <- 63.7;
  sigma_men <- 2.9;
  sigma_women <- 2.7;
}
parameters {
  real Jon;
  real Catalina;
  real Phillipe;
  real Claire;
  real<lower=0,upper=1> Jon_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_2;
  real<lower=0,upper=1> Phillipe_shoe_1;
  real<lower=0,upper=1> Phillipe_shoe_2;
  real<lower=0,upper=4> Claire_shoe_1;
}
model {
  Jon ~ normal(mu_men,sigma_men);
  Catalina ~ normal(mu_women,sigma_women);
  Phillipe ~ normal(mu_men,sigma_men);
  Claire ~ normal(66,1);
  (Jon + Jon_shoe_1) - (Catalina + Catalina_shoe_1) ~ normal(9.5,1.5);
  (Catalina + Catalina_shoe_2) - (Phillipe + Phillipe_shoe_1) ~ normal(2,1);
  (Phillipe + Phillipe_shoe_2) - (Claire + Claire_shoe_1) ~ normal(0,1);
  Jon_shoe_1 ~ beta(2,2);
  Catalina_shoe_1 / 4 ~ beta(2,2);
  Catalina_shoe_2 / 4 ~ beta(2,2);
  Phillipe_shoe_1 ~ beta(2,2);
  Phillipe_shoe_2 ~ beta(2,2);
  Claire_shoe_1 / 4 ~ beta(2,2);
}

I’ll present the results in a moment, but first here’s a quick discussion of some of the choices that went into the model:

- I got the population distributions of heights of men and women from a 1992 article in the journal Risk Analysis, “Bivariate distributions for height and weight of men and women in the United States,” by J. Brainard and D. E. Burmaster, which is the reference that Deb Nolan and I used for the heights distribution in our book on Teaching Statistics.

- I assumed that men’s shoe heights were between 0 and 1 inches, and that women’s shoe heights were between 0 and 4 inches, in all cases using a beta(2,2) distribution to model the distribution. This is a hack in so many ways (for one thing, nobody in these pictures is barefoot so 0 isn’t the right lower bound; for another, some men do wear elevator shoes and boots with pretty high heels) but, as always, ya gotta start somewhere.

- I took the height comparisons as stated in Gary’s comment, giving a standard deviation of 1 inch for each, except that I gave a standard deviation of 1.5 inches for the “9 or 10 inches” comparison between Jon and Catalina, since that seemed like a tougher call.

- Based on the statement that Claire was 66 inches tall, I gave her a prior of 66 with a standard deviation of 1.
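The implied shoe-height priors are easy to simulate directly. Here’s a quick Python sketch (my reconstruction of the scaled beta(2,2) priors described above, not part of the Stan fit):

```python
import random

# Simulate the shoe-height priors: beta(2,2) scaled to [0,1] inch for men's
# shoes and to [0,4] inches for women's shoes.
random.seed(1)
draws = [random.betavariate(2, 2) for _ in range(100_000)]
mens = draws                      # men's shoe heights, 0 to 1 inch
womens = [4 * d for d in draws]   # women's shoe heights, 0 to 4 inches
print(round(sum(mens) / len(mens), 2))     # beta(2,2) has mean 0.5
print(round(sum(womens) / len(womens), 2)) # so the women's prior mean is 2 inches
```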


I saved the Stan program as “heights.stan” and ran it from R:

library(rstan)
heights <- stan("heights.stan", chains=4, iter=1000)
print(heights)
                 mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Jon              74.3     0.1 1.8  70.6  73.1  74.3  75.5  77.9   877    1
Catalina         65.1     0.1 1.5  62.3  64.1  65.1  66.2  67.9   754    1
Phillipe         66.1     0.0 1.3  63.7  65.2  66.1  67.0  68.6   813    1
Claire           65.5     0.0 1.0  63.8  64.9  65.5  66.1  67.5  1162    1
Jon_shoe_1        0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1658    1
Catalina_shoe_1   1.5     0.0 0.8   0.2   0.9   1.4   2.1   3.3  1708    1
Catalina_shoe_2   2.6     0.0 0.8   0.9   2.1   2.7   3.2   3.8  1707    1
Phillipe_shoe_1   0.5     0.0 0.2   0.1   0.3   0.5   0.6   0.9  1391    1
Phillipe_shoe_2   0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1562    1
Claire_shoe_1     1.6     0.0 0.8   0.2   1.0   1.5   2.2   3.3  1390    1
lp__            -21.6     0.1 2.5 -27.6 -23.1 -21.2 -19.8 -17.8   749    1

OK, everything seems to have converged, and it looks like Jon is somewhere between 6'1" and 6'4".

Tables are ugly. Let's make some graphs:

library(arm)  # for coefplot()
sims <- extract(heights, permuted=FALSE)
mon <- monitor(sims, warmup=0)
png("heights1.png", height=170, width=500)
subset <- 1:4
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]), varnames=rev(dimnames(mon)[[1]][subset]), main="Estimated heights in inches (+/- 1 and 2 s.e.)\n", cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()

png("heights2.png", height=180, width=500)
subset <- 5:10
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]), varnames=rev(c("Jon", "Catalina 1", "Catalina 2", "Phillipe 1", "Phillipe 2", "Claire")), main="Estimated shoe heights in inches (+/- 1 and 2 s.e.)\n", cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()

That is: [coefficient plots of the estimated heights and the estimated shoe heights, in inches]

Model criticism

OK, now let's do some model criticism. What's in this graph that we don't believe, that doesn't make sense?

- Most obviously, some of the intervals for shoe height go negative. But that's actually not our model, it's coming from our crude summary of inference as +/- 2 sd. If instead we used the simulated quantiles directly, this problem would not arise.

- Catalina's shoes are estimated to be taller in her second picture (the one with Phillipe) than in the first, with Jon. But that's not so unreasonable, given the pictures. If anything, perhaps the intervals overlap too much. But that is just telling us that we might have additional information from the photos that is not captured in our model.

- The inferences for everyone's heights seem pretty weak. Is it really possible that Philippe Petit could be 5'9" tall (as is implied by the upper bound of his 95% posterior interval)? Maybe not. Again, this implies that we have additional prior information that could be incorporated into the model to make better predictions.
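The first criticism above is easy to demonstrate: for a quantity bounded below at zero, a mean +/- 2 sd summary can dip negative even though every simulation draw is positive. A toy illustration in Python (an exponential stand-in, not the actual posterior draws):

```python
import random

# A positive, right-skewed quantity (a stand-in for a shoe height near its
# lower bound), summarized two ways.
random.seed(1)
draws = sorted(random.expovariate(2.0) for _ in range(10_000))
mean = sum(draws) / len(draws)
sd = (sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)) ** 0.5
lower_normal = mean - 2 * sd                     # +/- 2 sd summary goes negative
lower_quantile = draws[int(0.025 * len(draws))]  # 2.5% quantile stays positive
print(round(lower_normal, 3), round(lower_quantile, 3))
```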

Fitting a model, making inferences, evaluating these inferences to see if we have additional information we could include: That's what it's all about.

Software criticism

Finally, let's do the same thing with our code. What went wrong during the above process:

- First off, my Stan model wasn't compiling. It was producing an error at some weird place in the middle of the program. I couldn't figure out what was going on. Then, at some point in cutting and pasting, I realized what had happened: my text editor was using a font in which lower-case-l and the number 1 were indistinguishable. And I'd accidentally switched one for the other. I changed the font and fixed the problem.

- Again Stan gave an error, this time even more mysterious:

Error in compileCode(f, code, language = language, verbose = verbose) :
Compilation ERROR, function(s)/method(s) not created!

Agreeing to the Xcode/iOS license requires admin privileges, please re-run as root via sudo. In addition: Warning message:
running command ‘/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB file3d6829c6b35e.cpp 2> file3d6829c6b35e.cpp.err.txt' had status 1

I posted the problem on stan-users and Daniel Lee replied that Apple had automatically updated Xcode and I needed to do a few clicks on my computer to activate the permissions.

- Then it ran, indeed, it ran on the first try, believe it or not!

- There were some issues with the R code. The calls to coefplot are a bit ugly, I had to do a bit of fiddling to get everything to look OK. It would be better to be able to do this directly from rstan, or at least to be able to make these plots with a bit less effort.

- Umm, that's about it. Actually the programming wasn't too bad.


I like Bayesian (Jaynesian) data analysis. You lay out your model step by step, and when the inferences don't seem right (either because of being in the wrong place, or being too strong, or too weak), you can go back and figure out what went wrong, or what information is available that you could throw into the model.

P.S. to Andrew Whalen and Daniel Lakeland: Don't worry, you've still earned your Stan T-shirts. Just email me with your size, and your shirts will be in the mail.

On deck this week

Mon: Some general principles of Bayesian data analysis, arising from a Stan analysis of Jon Lee Anderson’s height

Tues: Are Ivy League schools overrated?

Wed: Can anyone guess what went wrong here?

Thurs: What went wrong

Fri: 65% of principals say that at least 30% of students . . . wha??

Sat: Carrie McLaren was way out in front of the anti-Gladwell bandwagon

Sun: Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests

People used to send me ugly graphs, now I get these things

Antonio Rinaldi points me to this journal article which reports:

We found a sinusoidal pattern in CMM [cutaneous malignant melanoma] risk by season of birth (P = 0.006). . . . Adjusted odds ratios for CMM by season of birth were 1.21 [95% confidence interval (CI), 1.05–1.39; P = 0.008] for spring, 1.07 (95% CI, 0.92–1.24; P = 0.40) for summer and 1.12 (95% CI, 0.96–1.29; P = 0.14) for winter, relative to fall. . . . In this large cohort study, persons born in spring had increased risk of CMM in childhood through young adulthood, suggesting that the first few months of life may be a critical period of UVR susceptibility.

Rinaldi expresses concern about multiple comparisons, along with skepticism about the implicit hypothesis that, in Sweden, babies of two to three months get some sunshine completely naked.
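For a rough sense of the multiple-comparisons concern: with three season-vs-fall comparisons and no true effect anywhere, the chance of at least one nominally significant result is already about 14%. A toy simulation (my sketch, not the paper’s actual analysis):

```python
import random

# Three independent season-vs-fall z-tests under the null, repeated many times;
# count how often at least one comes out "significant" at p < 0.05.
random.seed(1)
trials, hits = 100_000, 0
for _ in range(trials):
    if any(abs(random.gauss(0, 1)) > 1.96 for _ in range(3)):
        hits += 1
print(round(hits / trials, 3))  # close to 1 - 0.95**3 = 0.143
```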

P.S. Some of the comments below are fascinating, far more so than the original paper! Maybe we should call this the “stone soup” or “Bem” phenomenon, when a work that is fairly empty of inherent interest (and likely does not represent any real, persistent pattern) gets a lot of people thinking furiously about a topic.

“An exact fishy test”

Macartan Humphreys supplied this amusing demo. Just click on the link and try it—it’s fun!

Here’s an example: I came up with 10 random numbers:

> round(.5+runif(10)*100)
 [1] 56 23 70 83 29 74 23 91 25 89

and entered them into Macartan’s app, which promptly responded:


You chose the numbers 56 23 70 83 29 74 23 91 25 89

But these are clearly not random numbers. We can tell because random numbers do not contain patterns but the numbers you entered show a fairly obvious pattern.

Take another look at the sequence you put in. You will see that the number of prime numbers in this sequence is: 5. But the `expected number’ from a random process is just 2.5. How odd is this pattern? Quite odd in fact. The probability that a truly random process would turn up numbers like this is just p=0.074 (i.e. less than 8%).

Try again (with really random numbers this time)!

ps: you might think that if the p value calculated above is high (for example if it is greater than 15%) that this means that the numbers you chose are not all that odd; but in fact it means that the numbers are really particularly odd since the fishy test produces p values above 15% for less than 2% of all really random numbers. For more on how to fish see here.
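Out of curiosity, the prime-number part of the app’s calculation can be reconstructed. This is my guess at what it does, not its actual source; the exact binomial gives about 0.078 where the app reports 0.074, presumably from a slightly different setup or rounding:

```python
from math import comb

# 25 of the integers 1..100 are prime, so each draw is prime with
# probability 0.25 and the expected count in 10 draws is 2.5.
primes = [n for n in range(1, 101)
          if n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))]
p_prime = len(primes) / 100
expected = 10 * p_prime
# P(5 or more primes in 10 draws) under Binomial(10, 0.25):
p_value = sum(comb(10, k) * p_prime**k * (1 - p_prime)**(10 - k)
              for k in range(5, 11))
print(expected, round(p_value, 3))  # 2.5 and about 0.078
```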

MA206 Program Director’s Memorandum

United States Military Academy

A couple years ago I gave a talk at West Point. It was fun. The students are all undergraduates, and most of the instructors were just doing the job for two years or so between other assignments. The permanent faculty were focused on teaching and organizing the curriculum.

As part of my visit I sat in on an intro statistics class and did a demo for them (probably it was the candy weighing but I don’t remember). At that time I picked up an information sheet for the course: “Memorandum for Academic Year (AY) 13-02 MA206 Students, United States Military Academy.” Lots of details (as one would expect, given the military-bureaucratic way of doing things), and also this list of specific objectives of the course:

1. Understanding the notion of randomness and the role of variability and sampling in making inference.

2. Apply the axioms and basic properties of probability and conditional probability to quantify the likelihood of events.

3. Employ models using discrete or continuous random variables to answer basic probability questions.

4. Be able to draw appropriate conclusions from confidence intervals.

5. Construct hypothesis tests and draw appropriate conclusions from p-values.

6. Apply and assess linear regression models for point estimation and association between explanatory and dependent variables.

7. Critically evaluate statistical arguments in print media and scientific journals.

This is all ok except for items 4 and 5, I suppose.

Also, at the end, a list of rules, beginning with:

a. All cadets are expected to maintain proper military bearing and appearance during instruction in accordance with appropriate regulations.

b. Respect others in the classroom – No profanity, unprofessional jokes, or unprofessional computer items . . .

e. Jackets are not permitted in the classroom . . .

g. Drinks must be inside a closed container (plastic bottle with a top, for example) or in the Dean-approved mug . . .

and ending with this:

j. Rules common to blackboards, written work, and examinations:

1) Draw and label figures or graphs when appropriate.

2) Report numerical answers using the appropriate number of significant digits and units of measure.

Now those are some rules I can get behind. They should be part of every statistics honor code.

Free Stan T-shirt to the first “little twerp” who does a (good) Bayesian analysis of Jon Lee Anderson’s height

Cata w Jon Lee Anderson

I’d like to see a Stan implementation of the analysis presented in this comment by Gary from a year and a half ago.

“Derek Jeter was OK”


Tom Scocca files a bizarrely sane column summarizing the famous shortstop’s accomplishments:

Derek Jeter was an OK ballplayer. He was pretty good at playing baseball, overall, and he did it for a pretty long time. . . . You have to be good at baseball to last 20 seasons in the major leagues. . . . He was a successful batter in productive lineups for many years. . . . He was not Ted Williams or Rickey Henderson. Spectators did not come away from seeing Derek Jeter marveling at the stupendous, unimaginable feats of hitting they had seen. But he did lots and lots of damage. He got many big hits and contributed to many big rallies. Pitchers would have preferred not to have to pitch to him. . . . His considerable athletic abilities allowed him to sometimes make spectacular leaping and twisting plays on misjudged balls that better shortstops would have played routinely. People enjoyed watching him make those plays, and that enjoyment led to his winning five Gold Gloves. That misplaced acclaim, in turn, helped spur more advanced analysis of defensive play in baseball, a body of knowledge which will ensure that no one ever again will be able to play shortstop as badly as Jeter for as long as he did. And that gave fans something to argue about, which is an important part of sports.

Scocca keeps going in this vein:

Regardless, on balance, Jeter’s good hitting helped his team more than his bad fielding hurt it. The statistical ledger says so—by Wins Above Replacement, according to Baseball Reference, his glovework drops him from being the 20th most productive position player of all time to the 58th. Having the 58th most productive career among non-pitchers in major-league history is still a solid achievement.

And still more:

When [Alex] Rodriguez showed up in the Bronx, Jeter would not yield the job. It was a selfish decision and the situation hurt the team. But powerful egos, misplaced competitiveness, and unrealistic self-appraisals are common features in elite athletes. Whatever wrong Jeter may have done in the intrasquad rivalry, it was the Yankees’ fault for not managing him better.

And this:

Like most star athletes of his era, he kept his public persona intentionally blank and dull . . . Depending on their allegiances, baseball fans could imagine him to be classy or imagine him to be pissy, and the limited evidence could support either conclusion.

I love this Scocca post because its hilariousness (which is intentional, I believe) is entirely contingent on its context. Sportswriting is so full of hype (either of the “Jeter is a hero” variety or the “Jeter’s no big whoop” variety or the “Hey, look at my cool sabermetrics” variety or the “Hey, look at what a humanist I am” variety) that it just comes off (to me) as flat-out funny to see a column that just plays it completely straight, a series of declarative sentences that tell it like it is.

Of course, if all the sportswriters wrote like this, it would be boring. But as long as all the others feel they need some sort of angle, this pitch-it-down-the-middle style will work just fine. The confounding of expectations and all that.

P.S. Also this from a commenter to Scocca’s post:

He also inspired people to like baseball again after the lockout and didn’t juice.

Waic for time series

Helen Steingroever writes:

I’m currently working on a model comparison paper using WAIC, and would like to ask you the following question about the WAIC computation:

I have data of one participant that consist of 100 sequential choices (you can think of these data as being a time series). I want to compute the WAIC for these data. Now I’m wondering how I should compute the predictive density. I think there are two possibilities:

(1) I compute the predictive density of the whole sequence (i.e., I consider the whole sequence as one data point, which means that n=1 in Equations (11) – (12) of your 2013 Stat Comput paper.)
(2) I compute the predictive density for each choice (i.e., I consider each choice as one data point, which means that n=# choices in Equations (11) – (12) of your 2013 Stat Comput paper.)

My quick thought was that WAIC is an approximation to leave-one-out cross-validation, and this computation gets more complicated with correlated data.

But I passed the question on to Aki, the real expert on this stuff. Aki wrote:

This is an interesting question, and there is no simple answer.

First we should consider what is your predictive goal:
(1) predict whole sequence for another participant
(2) predict a single choice given all other choices
(3) predict the next choice given the choices in the sequence so far?

If your predictive goal is

(1) then you should note that WAIC is based on an asymptotic argument and it is not generally accurate with n=1. Watanabe has said (personal communication) that he thinks this is not a sensible scenario for WAIC, but if (1) is really your prediction goal, then I think this might be the best you can do. It seems that when n is small, WAIC will usually underestimate the effective complexity of the model, and thus would give over-optimistic performance estimates for more complex models.

(2) WAIC should work just fine here (unless your model says that there is no dependency between the choices, i.e., having 100 separate models, each with n=1). Correlated data here means just that it is easier to predict a choice if you know the previous choices and the following choices. This may make the difference between some models small compared to scenario (1).

(3) WAIC can’t handle this, and you would need to use a specific form of cross-validation (I think I should write a paper on this).
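For reference, the pointwise computation that scenario (2) calls for, following equations (11) and (12) of the 2013 Stat Comput paper, looks roughly like this. A Python sketch; the function name and data layout are mine:

```python
import math

def waic(log_lik):
    """log_lik: a list of S posterior draws, each a list of n pointwise
    log-likelihoods. Returns (WAIC on the deviance scale, p_waic)."""
    S, n = len(log_lik), len(log_lik[0])
    lppd = p_waic = 0.0
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        m = max(col)  # log-sum-exp trick for numerical stability
        lppd += m + math.log(sum(math.exp(c - m) for c in col) / S)
        mean = sum(col) / S
        p_waic += sum((c - mean) ** 2 for c in col) / (S - 1)  # posterior variance
    return -2.0 * (lppd - p_waic), p_waic
```

Under choice (2), n is the number of choices and each column holds one choice’s log-likelihood across posterior draws; under choice (1) the whole sequence collapses to n=1 and, as Aki notes, the asymptotics behind WAIC no longer help.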