## New ideas on DIC from Martyn Plummer and Sumio Watanabe

Martyn Plummer replied to my recent blog on DIC with information that was important enough that I thought it deserved its own blog entry. Martyn wrote:

DIC has been around for 10 years now and despite being immensely popular with applied statisticians it has generated very little theoretical interest. In fact, the silence has been deafening. I [Martyn] hope my paper added some clarity.

As you say, DIC is (an approximation to) a theoretical out-of-sample predictive error. When I finished the paper I was a little embarrassed to see that I had almost perfectly reconstructed the justification of AIC as an approximate cross-validation measure by Stone (1977), with a Bayesian spin of course.

But even this insight leaves a lot of choices open. You need to choose the right loss function and also which level of the model you want to replicate from. David Spiegelhalter and colleagues called this the “focus”. In practice the focus is limited to the lowest level of the model. You generally can’t calculate the log likelihood (the default penalty) for higher level parameters. But this narrow choice might not correspond to the interests of the analyst. For example, in disease mapping DIC answers the question “Which model yields the disease map that best captures the features of the observed incidence data during this period?” But people are often asking more fundamental questions about their models, like “Is there spatial aggregation in disease X?” There is quite a big gap between these questions.

Regarding the slow convergence of DIC, you might want to try an alternative definition of the effective number of parameters pD that I came up with in 2002, in the discussion of Spiegelhalter et al. It is non-negative and coordinate free. It can be calculated from 2 or more parallel chains and so its sample variance can be estimated using standard MCMC diagnostics. I finally justified it in my 2008 paper and implemented it in JAGS. The steps are (or should be):

– Compile a model with at least 2 parallel chains.
– Set a trace monitor for “pD”.
– Output with the coda command.

If you are only interested in the sample mean, not the variance, the dic.samples function from the rjags package will give you this in a nice R object wrapper.
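For readers who have not seen the quantity being replaced, here is a minimal sketch of the original pD and DIC from Spiegelhalter et al. (mean deviance minus deviance at the posterior mean), not Martyn's alternative definition and not what JAGS computes. Everything in it is an assumption made for illustration: the toy model (normal data with known standard deviation 1 and a flat prior on the mean), the function names, and the use of exact posterior draws in place of MCMC output.

```python
import math
import random


def normal_loglik(y, mu, sigma=1.0):
    """Log likelihood of the data y under N(mu, sigma^2)."""
    n = len(y)
    ss = sum((yi - mu) ** 2 for yi in y)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - ss / (2 * sigma ** 2)


def dic_from_draws(y, mu_draws, sigma=1.0):
    """Return (pD, DIC) using the original Spiegelhalter et al. definition."""
    # Deviance at each posterior draw, then its posterior mean.
    deviances = [-2.0 * normal_loglik(y, mu, sigma) for mu in mu_draws]
    mean_dev = sum(deviances) / len(deviances)
    # Deviance evaluated at the posterior mean of the parameter.
    mu_bar = sum(mu_draws) / len(mu_draws)
    dev_at_mean = -2.0 * normal_loglik(y, mu_bar, sigma)
    p_d = mean_dev - dev_at_mean
    return p_d, mean_dev + p_d


rng = random.Random(1)
y = [rng.gauss(0.5, 1.0) for _ in range(50)]  # hypothetical data
ybar = sum(y) / len(y)

# With a flat prior and known sigma = 1, the posterior for mu is N(ybar, 1/n),
# so we can draw from it directly instead of running a sampler.
mu_draws = [rng.gauss(ybar, 1.0 / math.sqrt(len(y))) for _ in range(20000)]

p_d, dic = dic_from_draws(y, mu_draws)
print(f"pD = {p_d:.3f}, DIC = {dic:.2f}")
```

In this toy model there is one free parameter, so pD should land near 1. The plug-in step (evaluating the deviance at the posterior mean) is the part that depends on the parameterization, which is one reason a coordinate-free definition is attractive.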

I suppose we can implement this in Stan too.

Aki Vehtari commented too, with a link to a recent article by Sumio Watanabe on something called the widely applicable information criterion. Watanabe’s article begins:

In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.
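To make the criterion concrete, here is a rough sketch of how WAIC is often computed from posterior draws: a log pointwise predictive density minus a penalty equal to the summed posterior variance of the pointwise log likelihood. Several things here are assumptions of the sketch rather than anything taken from Watanabe's article: the deviance scale (multiplying by -2 and summing over observations, where the article may normalize differently), the function names, and the toy normal-mean model with exact posterior draws standing in for MCMC output.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(loglik):
    """loglik[s][i] = log p(y_i | theta_s). Returns (WAIC, penalty)."""
    n_draws, n_obs = len(loglik), len(loglik[0])
    lppd = 0.0     # log pointwise predictive density
    penalty = 0.0  # summed posterior variance of the pointwise log likelihood
    for i in range(n_obs):
        col = [loglik[s][i] for s in range(n_draws)]
        lppd += log_mean_exp(col)
        mean = sum(col) / n_draws
        penalty += sum((v - mean) ** 2 for v in col) / (n_draws - 1)
    return -2.0 * (lppd - penalty), penalty


def point_loglik(yi, mu, sigma=1.0):
    """Log density of a single observation under N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (yi - mu) ** 2 / (2 * sigma ** 2)


rng = random.Random(2)
y = [rng.gauss(0.5, 1.0) for _ in range(50)]  # hypothetical data
ybar = sum(y) / len(y)

# Flat prior, known sigma = 1: the posterior for mu is N(ybar, 1/n).
mu_draws = [rng.gauss(ybar, 1.0 / math.sqrt(len(y))) for _ in range(4000)]

loglik = [[point_loglik(yi, mu) for yi in y] for mu in mu_draws]
waic_value, p_waic = waic(loglik)
print(f"WAIC = {waic_value:.2f}, penalty = {p_waic:.3f}")
```

Unlike DIC, nothing here plugs in a point estimate: both terms are averages over the posterior, which is part of why the criterion can still make sense when the posterior is far from normal.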

It’s great to see progress in this area. After all these years of BIC variants, which I just hate, I like that researchers are moving back to the predictive-error framework. I think that Spiegelhalter, Best, Carlin, and Van der Linde made an important contribution with their DIC paper ten years ago. Even if DIC is not perfect and can be improved, they pushed the field in the right direction.

1. Ben says:

Does reposting other people's (admittedly stat-related) comments count towards your 30 days of stat-blogging? Bring on 31!

2. C Ryan King says:

I emailed with Sumio Watanabe about WAIC as a practical tool for model comparison. He seemed concerned about the convergence of the thing in MCMC samples. The sampling variability is in general also a complex entity, so it's hard to calibrate it or even consistently choose the correct model. He's concerned with generality to fairly pathological cases, so I'm seeing if in reasonably well behaved cases the usual posterior asymptotics work well enough to decide what's a "big" difference in WAIC.

3. Sumio Watanabe says:

Dear Dr. Gelman, I agree with your opinion that, even if DIC needs improvement, DIC is an important proposal for Bayesian model selection. In our research, we study statistical models with hierarchical structures such as normal mixtures, in which DIC needs to be improved because the posterior distribution contains singularities.

Dear Dr. Ryan King, In the model selection problem for hierarchical models such as normal mixtures, we have to investigate whether or not the statistical model is almost redundant for the true distribution. In that case the shape of the posterior distribution is far from the normal distribution. Hence singular statistical theory is necessary, and it is neither a special nor a pathological case.