DIC has been around for 10 years now and despite being immensely popular with applied statisticians it has generated very little theoretical interest. In fact, the silence has been deafening. I [Martyn] hope my paper added some clarity.
As you say, DIC is (an approximation to) a theoretical out-of-sample predictive error. When I finished the paper I was a little embarrassed to see that I had almost perfectly reconstructed the justification of AIC as approximate cross-validation measure by Stone (1977), with a Bayesian spin of course.
But even this insight leaves a lot of choices open. You need to choose the right loss function and also which level of the model you want to replicate from. David Spiegelhalter and colleagues called this the “focus”. In practice the focus is limited to the lowest level of the model. You generally can’t calculate the log likelihood (the default penalty) for higher level parameters. But this narrow choice might not correspond to the interests of the analyst. For example, in disease mapping DIC answers the question “What model yield the disease map that best captures the features of the observed incidence data during this period?” But people are often asking more fundamental questions about their models, like “Is there spatial aggregation in disease X?” There is quite a big gap between these questions.
Regarding the slow convergence of DIC, you might want to try an alternative definition of the effective number of parameters pD that I came up with in 2002, in the discussion of Spiegelhalter et al. It is non-negative and coordinate free. It can be calculated from 2 or more parallel chains and so its sample variance can be estimated using standard MCMC diagnostics. I finally justified it in my 2008 paper and implemented in JAGS. The steps are (or should be):
- Compile a model with at least 2 parallel chains
- Load the dic module
- Set a trace monitor for “pD”.
- Output with the coda command
If you are only interested in the sample mean, not the variance, the dic.samples function from the rjags package will give you this in a nice R object wrapper.
I suppose we can implement this in stan too.
In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.
It’s great to see progress in this area. After all these years of BIC variants, which I just hate, I like that researchers are moving back to the predictive-error framework. I think that Spiegelhalter, Best, Carlin, and Van der Linde made an important contribution with their DIC paper ten years ago. Even if DIC is not perfect and can be improved, they pushed the field in the right direction.