Can we implement these in Stan?

Marginally specified priors for non-parametric Bayesian estimation (by David Kessler, Peter Hoff, and David Dunson):

Prior specification for non-parametric Bayesian inference involves the difficult task of quantifying prior knowledge about a parameter of high, often infinite, dimension. A statistician is unlikely to have informed opinions about all aspects of such a parameter but will have real information about functionals of the parameter, such as the population mean or variance. The paper proposes a new framework for non-parametric Bayes inference in which the prior distribution for a possibly infinite dimensional parameter is decomposed into two parts: an informative prior on a finite set of functionals, and a non-parametric conditional prior for the parameter given the functionals. . . .

Some version of this should be pretty easy in Stan, I’d think, as it’s no problem to supply probabilistic information on transformed parameters.

Adaptive Higher-order Spectral Estimators (by David Gerard and Peter Hoff):

Many applications involve estimation of a signal matrix from a noisy data matrix. In such cases, it has been observed that estimators that shrink or truncate the singular values of the data matrix perform well when the signal matrix has approximately low rank. In this article, we generalize this approach to the estimation of a tensor of parameters from noisy tensor data. We develop new classes of estimators that shrink or threshold the mode-specific singular values from the higher-order singular value decomposition. These classes of estimators are indexed by tuning parameters, which we adaptively choose from the data by minimizing Stein’s unbiased risk estimate. . . .

“Stein’s unbiased risk estimate” sounds kinda silly but it should be possible to just do this using full Bayes. In any case, the real point is to make use of this class of models rather than the messy ensembles of interactions that we are often working with now.

Hierarchical array priors for ANOVA decompositions of cross-classified data (by Alexander Volfovsky, Peter Hoff):

ANOVA decompositions are a standard method for describing and estimating heterogeneity among the means of a response variable across levels of multiple categorical factors. In such a decomposition, the complete set of main effects and interaction terms can be viewed as a collection of vectors, matrices and arrays that share various index sets defined by the factor levels. For many types of categorical factors, it is plausible that an ANOVA decomposition exhibits some consistency across orders of effects, in that the levels of a factor that have similar main-effect coefficients may also have similar coefficients in higher-order interaction terms. In such a case, estimation of the higher-order interactions should be improved by borrowing information from the main effects and lower-order interactions. To take advantage of such patterns, this article introduces a class of hierarchical prior distributions for collections of interaction arrays that can adapt to the presence of such interactions. . . .

I’d *really* like to work with this model, or something like it. I seem to recall talking with Volfovsky and there was some reason we can’t just fit it in Stan as is, but maybe there’s some way we could adapt the model, or the computation.

> Can we implement these in Stan?

I hope so, that would be great!

I’d be happy to help out if someone wants to try.

It’s not clear to me what “Stan” refers to in the post. Do you mean can we write a Stan 2.12 program for these models? Otherwise, the question is really whether Stan’s language or math library or existing algorithms would be helpful in writing C++ implementations of these models.

For the Kessler et al. paper, we have nothing at all that’d let us build Dirichlet process models. So I’d say that’s a “no” either way.

For the Gerard et al. paper, we can do singular value decomposition, but it’s not efficient as I don’t think we’ve plumbed analytic gradients through it. I don’t even know what “statistical risk under quadratic loss” means, so I’m not sure how they’re defining the estimator. It looks like something that’s not easy with simple optimization, and would thus require custom C++ code on the back end.

For the Volfovsky et al. paper, I don’t see why we couldn’t code up the models they’re talking about directly in Stan as it exists now, but I didn’t work through the details. They used a custom Gibbs sampler, but I’d think we could use NUTS as it looks like all the parameters are continuous. I’ll defer to you and Ben on whether their prior’s a good idea and how to evaluate it.

Bob:

For each model, I was wondering if it was possible to fit in Stan as is, or whether new code would be needed to let us fit the model, or whether new code would be needed for the model to fit efficiently.

Based on your response, it sounds like we could fit the Volfovsky et al. model as is. So it sounds like a good starting point would be to program that model in Stan.

Section 4 of the Kessler et al. paper considers a Dirichlet prior as opposed to a Dirichlet process model. My basic knowledge of Stan leads me to believe that this part can be handled in STAN, but I am not exactly sure how.

As a simple example application, imagine you have a magic six-sided die with unknown probabilities theta_1,…theta_6. A non-parametric Bayesian model might place a Dirichlet(1,1,1,1,1,1) prior on theta. Now, suppose you have some prior knowledge about the mean (i.e. mu) of the magic die that suggests mu is less than 3.5. Say, you wish to state that p(mu) is proportional to normal(2.5,0.5). This is an example of the Kessler paper and I wonder if this can this be done in Stan? The below model was my attempt at this toy example, but I am getting a parsing error.

The main idea in the Kessler et al paper goes beyond NP Bayes or Dirichlets.

The idea is that you have a “working prior” $p(\psi)$ that you

would use as a default for the high-dimensional $\psi$, but also

have side information $\pi(\theta)$ about $\theta=f(\psi)$, where theta is

low dimensional.

One example that comes to mind but that is not in the paper

is if you are using a multivariate normal prior for some

set of regression coefficients or population means, but also have side

information that is not easily encoded in such a prior.

It would be cool if a user could specify a working prior,

the function $\theta=f(\psi)$, and the prior over $\theta$,

and get back the marginally adjusted posterior. In the paper,

it is shown that this can be done by slightly modifying

the MCMC algorithm for the default prior.

Peter, thank you for clarifying the big picture of the Kessler et al paper (i.e. MSP). From an application perspective, the incorporation of “side” information ensures a practitioner’s voice and insight, typically low-dimensional, can be easily reflected in a decision model.

Also, for those who are interested, I have debugged my toy example from above. A good night’s sleep and the amazingly clear error messages from the Stan parser were both helpful. I would welcome any feedback on the toy example as I remain unsure whether the Stan code reflects both the problem statement and if it can can serve as a simplistic example of the MSP approach. Running the code seems to produce reasonable results and I think is suggestive of the MSP approach, but some external validation would be great.