p(C,D|q)p(q|f)

Now, if the posterior value for f can be inferred “mechanically” or “in closed form,” so to speak, and it depends only on sigma and n, then effectively

p(D|q)p(q|sigma,n)

is a “data dependent prior for q” which is in fact just a closed form approximation to the posterior for f

]]>http://andrewgelman.com/2017/07/09/updated-ponyshoe-paper-juho-piironen-aki-vehtari/#comment-523431

]]>There is effectively a “count of nonzero parameters, C” in our background information. Given a data set, and resulting values for n and sigma, there exists a probability distribution for some other parameters q that depends on some quantity, call it f, such that p(q|f)p(D|q,f) has the property “count of nonzero parameters ~ C” a property that the posterior has to have in order to accord with the knowledge set K.

suppose that in some more complex model where f is directly a parameter, the posterior for f is sharply peaked around a closed form expression in terms of n and sigma… then we can approximate the more full model by substituting the closed form expression for f.

??

]]>I don’t follow your analogy. By changing the number of suspects from 10 to 100 you’re not changing the amount of data, you’re changing the number of parameters. The procedure seems to me closer to the following:

You have several suspects and you would like the result of your analysis to identify the murderer beyond any reasonable doubt. If you don’t have enough data the posterior probability may be too dispersed, so you adapt your prior depending on the amount of data available to increase the chance that the “effective number of likely authors” is one.

]]>Consider this example: There has been a murder, and we know there is exactly 1 murderer. You have either n=10 suspects or n=100 suspects. What is your prior probability that a random suspect is the murderer in these two cases? How would you choose this prior probability so that it doesn’t depend on n?
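To make the arithmetic concrete, here is a minimal sketch, assuming every suspect is equally likely a priori (the helper name is made up for illustration). With exactly one murderer, the per-suspect prior has to be 1/n, so the expected number of murderers stays at 1 whatever n is:

```python
from fractions import Fraction

def per_suspect_prior(n_suspects, n_murderers=1):
    """Prior probability that any one suspect is guilty, assuming exchangeable suspects (illustrative helper)."""
    return Fraction(n_murderers, n_suspects)

for n in (10, 100):
    p = per_suspect_prior(n)
    # The expected number of murderers is n * p, which equals 1 for every n.
    print(n, p, n * p)
```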

]]>“…Fig. 6 illustrates the importance of scaling τ with the noise level σ… when the observations are scaled by multiplying them by 0.1 (bottom row), the value for τ that does not scale with σ yields clearly worse results than in the first case, while the results for the latter value remain practically unchanged.”

But by scaling *everything* by 0.1, they have created a situation wherein the β’s are *by design* proportional to σ as you move from the first to the second case! An honest test would have been to scale only the errors while keeping the β’s unchanged.
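A toy calculation of this point, with made-up numbers (the values of beta, sigma and the factor 0.1 below are illustrative, not taken from the paper): rescaling the observations rescales the coefficients and the noise together, so the ratio β/σ is unchanged, whereas shrinking only the noise changes it.

```python
beta, sigma, c = 2.0, 1.0, 0.1   # hypothetical coefficient, noise sd, scale factor

# Case A: multiply the observations y = x*beta + eps by c.
# Both the coefficient and the noise sd pick up the factor c.
beta_a, sigma_a = c * beta, c * sigma

# Case B: shrink only the noise and leave the coefficient alone.
beta_b, sigma_b = beta, c * sigma

ratio_original = beta / sigma
ratio_a = beta_a / sigma_a     # unchanged relative to the original ratio
ratio_b = beta_b / sigma_b     # larger than the original by a factor of 1/c
print(ratio_original, ratio_a, ratio_b)
```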

BTW, despite these criticisms, I’m very glad they wrote this paper–I’m facing exactly the issues they bring up (need a prior that both encodes sparsity and bounds effect size) and will be trying out the regularized horseshoe soon. I just won’t be using their prior on τ. :-)

]]>Thanks for the comment. The purpose of the method described by Piironen and Vehtari is prediction, not variable selection; they just use variable selection as an intermediate step to improve predictive performance.

]]>“From a Bayesian perspective, adjusting the complexity of the model based on the amount of training data makes no sense. A Bayesian defines a model, selects a prior, collects data, computes the posterior, and then makes predictions. There is no provision in the Bayesian framework for changing the model or the prior depending on how much data was collected. If the model and prior are correct for a thousand observations, they are correct for ten observations as well (though the impact of using an incorrect prior might be more serious with fewer observations).”

]]>“it is evident that to keep our prior information about m_eff consistent, τ must scale as σ/√n”

and then proceed to advocate setting τ = τ0 = Sσ/√n or τ ~ HalfCauchy(τ0) for a particular positive S. This data-dependent “prior” is nonsense — for τ to depend on n, one of the following would have to be true:

* the number of observations n somehow retrocausally influences the magnitudes of the β parameters, or

* the magnitudes of the β parameters somehow influence your choice of how much data to collect, or

* in some other way YOUR CHOICE of the amount of data to collect provides information about the β parameters.

They continue,

“Priors [on τ] that fail to [scale as σ/√n]… favor models of varying size depending on the noise level σ and the number of data points n.”

This is a problem with their concept of “effective number of parameters,” not with priors that are independent of n. OF COURSE, the effective number of parameters is going to depend on σ/√n; as this quantity gets smaller, you are capable of detecting ever-smaller regression coefficients β_j!
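To see what that dependence looks like numerically, here is a small sketch of the quoted rule τ0 = Sσ/√n. The values S = 1 and σ = 1 are arbitrary placeholders of mine, not the paper's:

```python
import math

def tau0(S, sigma, n):
    """Scale quoted above: tau0 = S * sigma / sqrt(n)."""
    return S * sigma / math.sqrt(n)

for n in (100, 400, 1600):
    # Each four-fold increase in n halves the scale.
    print(n, tau0(S=1.0, sigma=1.0, n=n))
```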

]]>Perhaps similarly to Andrew and sparsity, I didn’t initially understand/appreciate the underlying idea because I was thinking in terms of modelling, noise, truth etc.

It became much clearer to me when I started thinking in terms of regularisation, approximation, adequacy etc.

]]>http://andrewgelman.com/2017/01/17/laurie-davies-time-series-decomposition-birthday-data/

I offered an analysis of the birthday data. The same method can be applied to the microarray cancer classification data sets considered in Section 4.2 of Piironen and Vehtari. The only difference is that a robust version was used for the birthday data whereas the following uses the much simpler least squares method.

Prostate data: the first three covariates together with their p-values are (2619, 0.000), (203, 0.57), (1735, 0.50). The time required was 0.15 seconds. Using the cut-off p-value 0.01 gives 2619 as the only relevant covariate of the three. A simple linear regression using these three covariates gives the following p-values, (2619, 2e-16), (203, 1.97e-5), (1735, 1.11e-4), which are misleading. A classification based on 2619 alone results in 8 misclassifications. Including the covariates 203 and 1735 reduces this to 7. In fact you have an 88% chance of a better classification, that is, 6 or fewer misclassifications, if you retain 2619, replace all other covariates by white Gaussian noise and then calculate the number of misclassifications based on the first three covariates. These always include 2619. The average number of misclassifications is 5. This is based on 1000 simulations.

This does not imply that 2619 is the only relevant covariate, only that given 2619 the remaining covariates are no better than Gaussian noise. If 2619 is removed and replaced by Gaussian noise the relevant covariates are (5016, 0.00) and (5035, 2.63e-05). These two result in 12 misclassifications. Now replace 5015 and 5035 by white Gaussian noise. This results in (1839, 6.70e-13), (4898, 1.37e-6) and (2503, 5.32e-3). This process can be continued until there are no more relevant covariates (cut-off p-value 0.01). This results in 180 specified relevant genes in 80 clusters of 1-4 genes. The smallest number of misclassifications was 5, but all were based on 3-4 genes and so were no better than 2619 with Gaussian covariates. The simple conclusion is that you cannot do better than gene 2619 alone. More information may be available from an expert who may be able to detect a pattern in the 80 clusters of “relevant” genes.

The birthday data: a robustified version was used. The sample size is n=7305. The covariates are sin(pi*i*(1:7305)/7305) and cos(pi*i*(1:7305)/7305) for i=1,…,7305. This gives 14610 covariates but they were treated in pairs, sin(pi*i*(1:7305)/7305) and cos(pi*i*(1:7305)/7305) together, giving 7305 pairs. There were no dummy covariates. The choice of 0.01 for the cut-off value leads to 105 pairs. The time required was 27.5 minutes. Using the residuals one can clearly identify Valentine’s, Halloween, 1st April etc. and the drop in the number of births on the 13th of each month.
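For concreteness, a sketch of how the trigonometric covariate pairs described above could be generated. Only the formulas come from the description; the function name is hypothetical and the selection step itself is not shown:

```python
import math

N = 7305  # sample size stated above

def covariate_pair(i, n=N):
    """Return the (sin, cos) covariate pair for frequency index i, evaluated at t = 1..n."""
    s = [math.sin(math.pi * i * t / n) for t in range(1, n + 1)]
    c = [math.cos(math.pi * i * t / n) for t in range(1, n + 1)]
    return s, c

s1, c1 = covariate_pair(1)
print(len(s1), len(c1))   # each covariate has one value per day
# With i = 1..N there are N pairs, i.e. 2 * N individual covariates.
print(2 * N)
```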

The method is not based on a linear model with error term. As there is no error term there is no modelling of an error term. The analysis applies to the data at hand. It is not Bayesian, it is not frequentist. It does not require any possible prior specification of the number of relevant covariates. No simulations are required nor cross-validation. It is fast, simple and interpretable. There is no hypothesis testing in contrast to the claim by Andrew. What hypothesis is to be tested?

Various versions of this paper have an impressive list of rejections: JRSSB, AoS, AoAS, EJS, JASA. Piironen and Vehtari do not mention it in their bibliography. On the positive side ojm likes it and Corey has stated that it is “clear, simple and powerful” or words to this effect- thank you.

]]>that this process should be independent of the order in which you apply the data, then will I think imply the product rule. adding in a requirement for the total accordance to integrate to either 0 or 1 (so that, we can rule out a theory if it predicts that certain data is absolutely impossible, and yet we observe that data) will lead to a sum rule.

One can perhaps then argue for alternative accordances… that’s fine, but I suspect mine will be the unique one that is order independent and conserves accordance.

Then, rather than probability being a “plausibility that X=a is true” it will be something like “the degree to which X=a accords with all theory and data available”. In the end then, non-identifiability becomes a fact about how the model accords with many possibilities. That then becomes a fact we can use in observer logic to make decisions about the “usefulness” of a false but intended to be useful model.

]]>If you’re really brave enough to tackle this question via model theory etc (definitely braver than me!) then an idle thought occurs to me:

Bayes usually claims to be an epistemological theory (of what we know, states of information etc) but implicitly uses what seem to be strong, non-constructive assumptions about ‘truth’ via its reliance on Boolean algebra/the excluded middle (even in continuous contexts).

Thus it seems to me to be more connected to model-theoretic truth semantics. And thus it fits awkwardly with the more epistemological idea that ‘all models/our knowledge are wrong but some are useful tools for thinking about some aspects of the world’. And with a desire for ‘generative’ or constructive model falsification as opposed to NHST style reasoning.

On the other hand, a proof-theoretic approach to ‘truth’ is typically related to constructive reasoning and hence avoiding use of the excluded middle and e.g. replacing Boolean algebras by Heyting algebras and such things. The issue is that this is at odds with all proofs of e.g. Cox’s theorem that I know of, and explicitly introduces a gap between what is true and what can be known to be true (see also Godel, as you alluded to).

So, one challenge is to develop a justification of Bayes using only constructive or proof-theoretic reasoning, rather than relying on e.g. Boolean algebra. My feeling is that it can’t be done in a non-controversially _unique_ way i.e. without allowing for a variety of other approaches to uncertain inferences that exist.

Anyway, a very off-topic, wildly speculative comment…

]]>What’s the difference to you?

]]>;-)

]]>

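parameters{
  vector<lower=0>[100] q;
}
transformed parameters{
  real summary;
  // soft count of the entries of q that are near or above 6
  summary = sum(inv_logit((q-6)*5));
}
model{
  q ~ normal(0,1);
  // declare that the soft count should be about 5
  summary ~ normal(5,.2);
}

parameters{
  vector[100] q;
}
transformed parameters{
  real summary;
  // soft count of the entries of q that are near or above 6
  summary = sum(inv_logit((q-6)*5));
}
model{
  q ~ normal(0,1);
  // declare that the soft count should be about 5
  summary ~ normal(5,.2);
}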

The idea is that in each vector of 100 q values, about 5 of the values are somewhere close to or bigger than 6.

When I run the model, I get exactly that: if I do as.matrix(stansamples), exclude the columns other than q, and then do a density plot of any given row of the samples, I get a big bump near 0 and a little bump near 6.5 or so.

Of course, in the absence of data, the Rhat is atrocious, because there are around choose(100,5) possible partitions between the two modes. But in the presence of data that would pick out the 5… you’d expect to get what you were looking for pretty nicely.
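For scale, a quick check of both numbers in this comment, in Python rather than Stan. The soft count mirrors the summary expression in the model; the example vector (95 zeros and 5 values at 6.5) is made up for illustration:

```python
import math

def inv_logit(x):
    """Logistic function, the counterpart of Stan's inv_logit."""
    return 1.0 / (1.0 + math.exp(-x))

def soft_count(q):
    """Same expression as `summary` in the Stan model: sum(inv_logit((q - 6) * 5))."""
    return sum(inv_logit((qi - 6.0) * 5.0) for qi in q)

q = [0.0] * 95 + [6.5] * 5     # hypothetical draw with 5 "large" values
print(soft_count(q))            # each 6.5 contributes inv_logit(2.5), each 0 almost nothing
print(math.comb(100, 5))        # number of ways to choose which 5 entries are large
```

The soft count for this vector comes out somewhat below 5 because inv_logit(2.5) is a bit less than 1, which is consistent with the little bump sitting near 6.5 rather than exactly at 6.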

Now, I see this as taking an accordance for each individual q, that each one is most likely close to zero, and combining it with an accordance for the theoretical quantity equal to the sum of the nonlinear functions, to get a distribution that describes the accordance with the overall information, namely that a few of these dimensions are outliers.

]]>But Aki we’re innovating here! Well-posed sparsity for posteriors and shoes for reindeers! Let’s start with flats and then build up to heels when the reindeer are ready.

]]>p(F(Data,Params) | F, Params, OtherParams, Model)

as a statement about the accordance of the prediction about the combined quantity F(Data,Params) with the theoretical expectations for such a quantity, then it makes just as much sense as

p(Data | Params, Model)

since this is just a special case, where you are now just using the “identity function”

p(ID(Data) | Params, Model)

The thing that I suspect though, is that the structure of the Cox proof really doesn’t need much if any alteration to become a proof about this new interpretation (and I think, though I’m woefully ignorant, that in Model Theory this reads as my interpretation is a model of probability theory in the same way that Cox generalized boolean logic is a model of probability theory)

The fundamental difference seems to be to interpret the probability as a measure of agreement between theoretical and observed rather than a measure of credence that the quantity is the correct one.

]]>the next step is to smoothen that filter and you get something like a kernel ABC method…

All of this plays nicely into my stuff on declarative models (in which we declare that some function of prediction and data is in a soft region defined by a probability distribution). The fact that all this works in practice, gives hints as well. For example I have a model of certain economic conditions in which I have a predictor function F that takes some covariates, and an observed data D, and I’m assigning D/F ~ some_distribution() in Stan. It’s worked very well to give meaningful inferences on certain costs, and the inferences have a long right tail which is in accord with my expectation that some people just like to spend more on those things, more than is necessary for a family of their size.

So next, I’m working on formalizing this notion into something in such a way that this notion becomes an alternative model of probability theory in the way that frequentist, and Cox interpretations are also alternative models.

]]>I think so and there was a discussion of that on this blog earlier (2 or 3 years ago?) ]]>

The more fundamental concept for actual statistical practice is something like filtering subsets using a quantity that describes degree to which the subset has been included in the candidates.

That Cox’s theorem gives us uniqueness of a particular kind of filtering (filtering *one* truth out of many possibilities) and that there is a model (Kolmogorov probability) and therefore a proof of consistency, makes a powerful argument that Probability is a good way to filter. But I do think in Statistics the more fundamental concept is filtering out those things that are contradicted by either data (likelihood) or purely theoretical (prior) considerations.

In this sense, a my-little-pony-shoe prior is a statement about the regions of N dimensional space within which we expect accordance of a particular predictive model with data. It becomes more obvious too why you need to constantly alter your prior as you alter your likelihood, because the meaning of the parameter space is tied directly to the predictive model, and the likelihood, and so the region that “makes sense” a-priori is also tied inevitably to that likelihood.

There really are no priors without likelihoods. Or to put it better, the joint distribution is more fundamental than any factorization.

]]>I have put more vaguely as the prior representing (via a probability model) how parameter values came about or were set and then the data generating model representing (via a probability model) how observations came about and ended up in your current data set (given various points in the parameter space). So the getting multiplied by the “likelihood” simply reflects it being a joint model. There is some discussion of that here http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf (have you written anything up on this?)

Now if I am getting Michael’s comment here, a joint prior with dependencies is a better (less wrong) representation of “the process underlying unknowns”.

]]>My general point was that it seems to me that probability density-based modelling is a special case of more general regularisation concepts.

Two examples: 1. Two densities can be arbitrarily far apart yet define the same probability measure arbitrarily closely. 2. Probability is by definition normalised over the set of possibilities.

So probability density modelling can be useful, sure, but I’m not convinced it’s the more general or interesting way to understand the problem of learning from data etc.

]]>For me this is about prior information. 1) In these biomarker logistic regression examples, it’s very unlikely that any single biomarker could predict the correct class with high probability, which means that none of the weights should be really large. This has been well discussed, e.g., in http://projecteuclid.org/euclid.ba/1340371048 and http://projecteuclid.org/euclid.ba/1488855634 2) In these biomarker examples it’s likely that many of the biomarkers measured are not relevant at all and those which are relevant may have different effect sizes.

Sometimes these priors are also used as regularizers. For example, I would choose p_0n, we also know that the data has to lie on a lower dimensional space restricted by the number of observations, and we could include the prior information that the effects will have dependencies due to non-identifiability, leading to sparse factor models, and then we could again say it’s a prior and not a regularizer.

Comments about high-dimensionality are relevant and the same examples used to illustrate the concentration of measure in posterior sampling are useful. When the number of dimensions increases, a Gaussian prior will put most of the mass far away from the mode. The horseshoe is sharp enough near the mode that even when the number of dimensions increases it’s possible to have most of the mass near the mode. Alternatively, factor models will effectively set the prior on a lower dimensional manifold, allowing more mass near the mode.
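A rough simulation of the Gaussian half of this point, using only the standard library. The dimensions, number of draws and seed are arbitrary choices: the distance of a standard normal draw from the mode at the origin grows roughly like the square root of the dimension.

```python
import math
import random

def mean_distance_from_mode(dim, draws=200, seed=0):
    """Average Euclidean norm of standard normal draws in `dim` dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
    return total / draws

for d in (1, 10, 100, 1000):
    # Compare the simulated average distance with sqrt(d).
    print(d, round(mean_distance_from_mode(d), 2), round(math.sqrt(d), 2))
```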

Reindeer don’t (at least usually) have shoes like horses have.

]]>“Priors” can have lots of uses, and I think about them sometimes as data-independent information about the parameter process, or at least the process underlying unknowns.

Priors, which are ultimately just probability statements, or a part of a model that is a function of unknowns, can:

– Act as soft probabilistic constraints for model identification

– Act as regularization for estimates

– Guard against sampling-error induced extreme estimates

– Act functionally, to improve estimation efficiency or to improve inferences

– Include prior empirical information

– Describe parameter generation processes (e.g., random effects models)

And other things.

I think horseshoe, lasso, ponyshoe, etc, fall under my heading of ‘functional priors’, in the sense that I’m not sure anyone is arguing that the horseshoe methods describe a DGP, but they nevertheless permit better inferences. It’s still a probabilistic component to the joint density, but it’s one used functionally.

If it’s described as a continuous spike-and-slab prior, then it’s a bit like describing a parameter process in which a parameter is either essentially nil or it isn’t a-priori, and the known portions of the joint density permit marginal inference about which parameters are indeed nil or not. I don’t see that is particularly non-bayesian; it’s a joint density, it’s a probabilistic process on parameters, and one obtains marginal posteriors assuming that process and the likelihood.

TLDR; I think once one gets rid of the “prior” nomenclature, some misconceptions about Bayesian inference can go away; in the end, what matters is that a joint density represents a model of knowns and unknowns.

]]>Something like truncated SVD is a regularisation method, right? In general I find regularisation an easy enough idea to understand, but struggle to see how to interpret it as a prior in _formal_ Bayesian terms beyond taking exponentials.

Even something like the typical set sampling makes sense to me as a regularisation concept, but little sense to me as a Bayesian concept. For example, what definition of Bayes’ theorem do you use for continuous models? If in terms of densities it seems ill-posed until you add something like typicality.

]]>I am curious what these priors do in the case of perfectly correlated linear predictors. For L1 (Laplace), the penalized MLE is arbitrary for a fixed regularization scale. For L2 (normal), the coefficient is doled out evenly between the two predictors. In matrix factorization, they just get combined and you lose one rank.
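The L1 and L2 cases can be checked with a tiny hand-rolled example, assuming two identical predictor columns and made-up data. For ridge, the 2x2 penalized normal equations are solved directly; for the L1 penalty, every same-sign split of the total coefficient gives the same fit and the same penalty, so the split is arbitrary:

```python
x = [1.0, 2.0, 3.0, 4.0]          # one predictor, used twice (perfect correlation)
y = [2.1, 3.9, 6.2, 7.8]          # made-up responses
lam = 1.0                          # arbitrary penalty weight

s = sum(xi * xi for xi in x)                  # x'x
c = sum(xi * yi for xi, yi in zip(x, y))      # x'y

# Ridge: solve [[s + lam, s], [s, s + lam]] b = [c, c] by Cramer's rule.
det = (s + lam) ** 2 - s ** 2
b1 = (c * (s + lam) - s * c) / det
b2 = ((s + lam) * c - s * c) / det
print(b1, b2)                      # the two coefficients come out equal

# L1: the fit depends only on b1 + b2, and |b1| + |b2| is constant
# along same-sign splits of a fixed total.
total = 1.5
penalties = [abs(w * total) + abs((1 - w) * total) for w in (0.0, 0.25, 0.5, 1.0)]
print(penalties)
```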

]]>OK, fine. But I guess then I see regularisation, and e.g. the existence of emergent sparse approximations, as the deeper principles. And not necessarily tied to probability theory as such.

]]>Here’s one way to think about it: What makes a distribution a “prior distribution” rather than a mere “distribution” is that it is getting multiplied by a “likelihood.” To put it another way, the properties of a prior (for example, whether it can be reasonably considered “weakly informative”) depend on available information and context, of course, but also on the likelihood.

So what Dan is saying, I think, is that in the context of regression-type likelihoods, the regularized horseshoe prior can behave well. That is, have good “frequency properties.” (Sorry, Dan.)

]]>I don’t really understand this comment. Isn’t the idea of sparsity that one is interested in an approximate, ‘emergent’ lower dimensional model of high-dimensional data?

I suppose this isn’t a prior in the sense that one doesn’t ‘believe’ that the ‘true’ coefficients are zero (as would technically be the case with formal Bayes), but that they can be treated approximately ‘as if’ zero. So, OK, not really a prior in the sense that it’s not a probability assignment to a proposition about an underlying truth.

But how are these to be interpreted as ‘priors’ rather than say sparsity regularisers? A sparsity regularised problem is simply ‘find an adequate, low dimensional representation of the data’.

I guess I’m saying I see the rationale for sparsity regularisation, but struggle to see in what sense it can be interpreted as ‘prior’, other than a formal mathematical correspondence. Which comes first, the regularisation or the prior?

In what sense does ‘otherwise you’ll get nonsense results and overfitting’ define a ‘sensible prior’?

]]>One of the nice features of the Finnish horseshoe (although I would love “Reindeer shoe” to catch on) is that you can control the joint behavior directly so that the regularization is robust to the number of parameters.

]]>1. See here: “Whither the “bet on sparsity principle” in a nonsparse world?” The comment thread is pretty good too.

2. I don’t like “ponyshoe” either: that’s why I recommended that Juho and Aki call it “regularized horseshoe” instead.

]]>