For pure prediction (and communication of same) it’s reasonable to partition the set of predictions into relevant classes and do prediction conditional on the future state being in one of those classes.

]]>Yup that’s definitely at least one of my concerns.

And yeah I think it is perfectly justifiable to say you don’t want to average over qualitatively different model instances – eg a nonlinear ODE system with different dynamical regimes.

Or perhaps, for _some_ (but not necessarily all!) nonlinear models where ave(f(theta)) doesn’t equal f(ave(theta)) or even any f(theta’) for any theta’ in your parameter space.

On the other hand, there are some in machine learning who would presumably be fine averaging or weighting or whatever multiple quite different models as long as it improved a particular predictive performance measure.
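A minimal sketch of that machine-learning attitude, with the models, data, and weight search all invented for illustration: two quite different predictors are combined with whatever convex weight does best on held-out data, with no concern for whether the combination is itself any single model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and two quite different models of it: a linear trend and a flat mean.
x = np.linspace(0, 1, 200)
y = 0.5 * x + rng.normal(0, 0.1, size=x.size)
x_tr, y_tr, x_va, y_va = x[:150], y[:150], x[150:], y[150:]

slope, icpt = np.polyfit(x_tr, y_tr, 1)       # model A: linear fit
pred_a = slope * x_va + icpt
pred_b = np.full_like(x_va, y_tr.mean())      # model B: constant fit

err_a = np.mean((pred_a - y_va) ** 2)
err_b = np.mean((pred_b - y_va) ** 2)

# Grid-search the convex weight that minimizes held-out squared error.
ws = np.linspace(0, 1, 101)
errs = [np.mean((w * pred_a + (1 - w) * pred_b - y_va) ** 2) for w in ws]
w_best = ws[int(np.argmin(errs))]
err_mix = min(errs)
# err_mix can't be worse than either component (the grid contains w=0 and w=1),
# and the weighted predictor need not correspond to any single model instance.
```

From this point of view the only question asked of the combination is whether `err_mix` beats the components.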

Another thing I was getting at is that you might decide to use a predictive check to check the adequacy of your model family. But you could have each individual model instance be inadequate while the convex combination (predictive distribution) looks fine. That is, you have ‘model misspecification’ in the sense that no individual model instance is close to what you want but the predictive check actually helps _hide_ this.

Which again – if you only care about pure predictive performance then who cares. But maybe you care about more than that. What is it you are caring about in such cases? And if you only care about predictive performance, regardless of individual model misspecification, then maybe Bayes isn’t best? I think the literature is pretty mixed on this.

But yeah, plenty of things to discuss, probably some other time.

]]>http://models.street-artists.org/2016/08/23/on-incorporating-assertions-in-bayesian-models/

We specify some simple region of parameter space like

A ~ normal(0,100);

telling us that the coefficients of the basis expansion aren’t “too big” but then beyond that, we need to specify the prior in terms of its effect on the function, so for each sample we calculate some qualities we want our function to have approximately, and then provide a weighting function over those qualities, which down-weights those parameter values that produce functions that aren’t “right” according to our knowledge of what “right” means.

If you can’t make this work within the model from a computational perspective (perhaps it’s too costly in computing time to calculate the thing you need to calculate), it’s still legit to take the samples you generate and filter them based on the same knowledge that would have been in the model if it weren’t so damn expensive to calculate. Or the like.
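A sketch of that post-hoc filtering, with the basis, the draws, and the “quality” threshold all invented for illustration (in a real workflow the draws would come from the sampler):

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0, 1, 100)
basis = np.stack([np.sin((j + 1) * np.pi * x) for j in range(5)])  # (5, 100)

# Pretend these are posterior draws of the basis coefficients A.
draws = rng.normal(0, 100, size=(4000, 5))

funcs = draws @ basis  # each row is the function implied by one draw

# Quality we want "right" functions to have: bounded amplitude, say |f| < 150.
# Filtering the sample afterwards encodes the same knowledge that a
# down-weighting factor inside the model would have.
ok = np.abs(funcs).max(axis=1) < 150.0
kept = draws[ok]
```

The filter is exactly the indicator version of the weighting function described above; a smooth down-weighting would reweight rather than discard.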

]]>The bigger issue is when the behavior isn’t necessarily wrong, but it shouldn’t have such a large effect. Suppose 60% of your samples have some oscillatory behavior. You know the oscillatory behavior could occur, but it’s also known to be an unusual case… certainly not 60% of the posterior samples. How do you post-process your sample appropriately? How do you justify that? If your audience has been spending a decade in nuanced discussions of the philosophical meaning of Bayesian inference on Andrew’s blog, you’re all set. So, basically a thousand people in the whole world will understand ;-)
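One way to make that post-processing concrete (all numbers invented): if you can classify each draw’s behavior, you can reweight the sample so each behavior class carries the mass you believe it should, rather than throwing draws away:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in posterior draws plus a flag saying whether each draw's
# trajectory oscillates (in practice this comes from simulating the model).
n = 5000
oscillatory = rng.random(n) < 0.6   # ~60% of draws oscillate

# Knowledge: oscillation should carry ~10% of the mass, not ~60%.
target = 0.10
p_osc = oscillatory.mean()

# Reweight each draw so the two behavior classes get the mass
# we believe they should have, then renormalize.
w = np.where(oscillatory, target / p_osc, (1 - target) / (1 - p_osc))
w /= w.sum()

reweighted_osc_mass = w[oscillatory].sum()  # equals the target by construction
```

Justifying the target fraction is of course the hard part; the mechanics are just importance reweighting.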

]]>averaging over a posterior distribution that includes all these possibilities is saying essentially that you have no frikin idea what is going on… the behavior can be sensitive to aspects of the prior you aren’t able to really calibrate. For example, the size of the basin of attraction to different behaviors in the parameter space matters, but although you might know that the different behaviors exist, the prior probability of each behavior is a combination like

pbar(s) * size(s)

where pbar is the mean probability density of the prior over the set s and size(s) is the Lebesgue measure of the basin of attraction to the behavior described by the set s. The set s itself can be uncomputable, so it’s impossible to specify a meaningful prior over the behaviors without detailed knowledge of the complex attractor sets in the parameter space. Think of something like the Mandelbrot set, or the Newton’s Method fractal (https://en.wikipedia.org/wiki/Newton_fractal). How would you specify a prior that excludes the green colored region entirely?
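The pointwise version of this is easy: iterating Newton’s method on z^3 = 1 tells you which basin any given starting point lies in. What you can’t do is write the basin down in closed form and hand it to a prior. A sketch:

```python
import numpy as np

# The three cube roots of unity (the attractors of Newton's method on z^3 - 1).
ROOTS = np.array([1.0, np.exp(2j * np.pi / 3), np.exp(-2j * np.pi / 3)])

def newton_basin(z, iters=60):
    """Pointwise test: which root does Newton's method reach from z?"""
    for _ in range(iters):
        if z == 0:
            return None  # derivative vanishes; undefined step
        z = z - (z**3 - 1) / (3 * z**2)
    d = np.abs(ROOTS - z)
    k = int(np.argmin(d))
    return k if d[k] < 1e-6 else None

# We can test any single parameter value against the "excluded" behavior,
# but the basin boundary is a fractal, so no closed-form prior can give the
# whole excluded basin exactly zero weight.
```

So you can filter individual draws by behavior (as above), but you cannot pre-specify the excluded set analytically.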

There are some hacks you can do, but in the end, real world problems are much more complex than that Newton’s fractal, so good luck. Any actual calculation you do can’t really compute the basin of attraction to a behavior you’d like to exclude, and give it zero weight. So any actual calculation you do is a combination of things you know are wrong and things you know could be right. If you try to do decision analysis and you’re including some behaviors that you know are wrong… are you going to get a better result than if you say do a Bayesian analysis, then find a MAP estimate then fuzz that estimate a little into a finite sample, then calculate the behaviors in each of the sample cases, then throw out any samples that exhibit the wrong behavior, then do your decision analysis just based on this small fuzzy filtered MAP estimate?

In some ways, this is the kind of thing I mean when I say you can “do some hacks” above, but although there’s some justification for the hack, keeping the underlying logic in mind is maybe more important than say doing the exact Bayesian calculation based on mechanical application of Bayesian inference to simple computable models.

]]>P(y|theta_1) and P(y|theta_2) seem to me a much different proposition than integrating over P(y|theta_1)*P(theta_1)*d_theta_1.

When you say,

> To add to this, a predictive distribution is basically a convex combination/mixture distribution formed from the individual models p(y|theta). If this family is not closed under convex combinations then the result is something different to any individual instance.

which of these scenarios do you have in mind? I also understand we could have corner cases, like where one model is just a specific instance of another (say a hierarchical model for the components of theta or something).

]]>Depends on what you mean by logic but agree it is just about learning (getting less wrong as quickly as possible) about empirical aspects of the world.

]]>At the rate Andrew posts all this will be buried soon anyway…

]]>Email me (dp.simpson@gmail.com) if you want to talk offline.

]]>I suspect that the same thing is easily constructed for the GP case. Each GP sample path is a finite but nonstandard dimensional vector of values g[i]. The probability associated with any given vector is normal_pdf(g,Cov) where Cov is the covariance matrix, an NxN matrix for N nonstandard. The resulting probability over a given g is infinitesimal, but not zero.

One appealing reason to use this formulation is that it corresponds in a very clear way to what is actually done in applied problems. Sure, there are still subtleties, but generally I find them more tractable.

]]>> there’s nothing going on beyond what you mention

feels a bit like ‘gaslighting’ or whatever it’s called.

But my bad for getting frustrated. No hard feelings :-)

]]>Here’s a perhaps simplistic example.

Suppose no individual model can fit the data well but a convex combination of models can. Do you care? If yes, then you care about parameter inference; if no, then you care about prediction.
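A toy version of this, with everything invented: bimodal data from two well-separated clusters. No single normal “instance” fits, but a 50/50 convex combination of two normals does fine by average log predictive density:

```python
import numpy as np

rng = np.random.default_rng(3)

# Bimodal data: no single Gaussian fits, but a convex combination of two does.
y = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])

def normal_logpdf(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

# Best single normal (moment-matched): forced to have huge variance.
single = normal_logpdf(y, y.mean(), y.std()).mean()

# 50/50 mixture of the two component normals.
mix = np.logaddexp(normal_logpdf(y, -5, 1),
                   normal_logpdf(y, 5, 1)).mean() - np.log(2)
# mix > single: the mixture predicts well even though neither component alone does.
```

Whether `mix > single` settles anything depends entirely on whether you wanted the components to mean something.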

]]>The thing I was talking about was different to this. In the GP case (and therefore, possibly in general), the action space (before seeing specific data) has zero probability. I agree that the point estimate having zero probability isn’t really a problem (because the data does too). But this is more of a thing that has to be considered. And in particular, it’s why it’s important to distinguish between the action space and the parameter space (because they may not coincide).

And I’m definitely not saying “don’t use the output of decision analysis” (that would be a weird thing to say!). Just that the original point (way back when) was not correct as stated.

]]>My chosen formalism for dealing with some of these issues is nonstandard analysis using Internal Set Theory (IST) of Edward Nelson.

In this formalism you have realizations of a GP being vectors of nonstandard length representing the value of the function at each point a + i * dx for dx an infinitesimal for values going from a to b where a and b are potentially nonstandard themselves.

In general, I don’t believe in minimizing squared error, because it rarely corresponds to a real world outcome. But even if you do want to do it, you still are left with basically:

choose: standard part of argmin over g(x) of sum over x (f(x)-g(x))^2 dx

Now, that sum over x is the integral in IST, and argmin over g(x) is the choice of a really really big vector of values… and the standard part tells you to ignore the details that are only accessible when you have the infinitesimal microscope.

You can argue that the standard part isn’t one of the g(x) (ie. it’s in the RKHS instead of the space), but in this context, g(x) is just a device to get you that standard part. In the same way you can argue that “103.4 F” has probability zero under normal(100,3) posterior but the posterior is just a device to get you a 4 digit number to put on your website. The GP is just a device to get you a “standard” function that tells you what the air quality is all over the western part of the US… or whatever.

]]>there are even situations where you might say something like “choose which realization of my gaussian process to publish as the temperature vs time curve for tomorrow’s weather forecast” or “choose a function of space to plot the current air quality on a map”

Whenever we choose a parameter to publish from a continuous distribution, the point estimate has zero probability relative to the continuous probability measure. This doesn’t bother me. Sure the probability that it’ll be exactly 103.1F tomorrow is zero, but that doesn’t mean we shouldn’t choose “publish that temperature” as our action.

]]>The same thing happens if you’ve got a GP with a known covariance structure and your loss function is mean-squared error (or some weighted variant thereof). In that case, the optimal action is the posterior mean, which lies in the Reproducing Kernel Hilbert Space (RKHS) associated with the GP covariance function (or its Cameron-Martin space if you’re from that world).

The statement {f belongs to the RKHS} has zero probability under both the prior and posterior, so the space of admissible actions does not correspond with the parameter space.

Someone on the discourse used this as an example of the mean being far from the typical set, but that’s not really what’s happening here. The RKHS is dense in the support of the prior/posterior, so the optimal prediction is very close to the posterior mass. It’s just that you can “maths” it away by repeating a certain operation an infinite number of times.

So the statement that the action space is different from the parameter space is true, but possibly not very useful.

]]>In the GP case, there is a natural partition of the parameter space into parameters that control the correlation structure of the GP and the GP itself. If we call the former parameters theta and the GP f(x), then the prediction of f(x) that minimises the prediction error is usually computed by finding the conditional mean of f(x) | y, theta and then integrating over the posterior theta| y. So this does not correspond to one particular parameter.

If instead you want to choose an action, then yes I 100% agree with your statement. In the GP case, the space of actions that could minimise the Loss often has zero probability under the model (it’s one of those weird games we can play with infinities). But if you separate “actions” from “parameters” I think things are much much cleaner.
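A sketch of that two-step computation, with the data, the kernel, and the stand-in posterior draws of the lengthscale all made up (in practice the draws of theta would come from MCMC): compute the conditional mean of f(x*) | y, theta for each draw, then average over draws.

```python
import numpy as np

rng = np.random.default_rng(4)

def sqexp(xa, xb, ell):
    """Squared-exponential covariance between two sets of points."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell**2)

x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.array([0.1, 0.9, 0.8, 0.2])
x_star = np.array([0.75])
noise = 0.05

# Pretend these are draws from p(ell | y); in practice they come from MCMC.
ell_draws = rng.gamma(5.0, 0.1, size=200)

cond_means = []
for ell in ell_draws:
    K = sqexp(x, x, ell) + noise * np.eye(len(x))
    k_star = sqexp(x_star, x, ell)
    # Conditional mean E[f(x*) | y, ell] for this draw of the hyperparameter.
    cond_means.append((k_star @ np.linalg.solve(K, y))[0])

# The prediction integrates over theta | y: average the conditional means.
pred = float(np.mean(cond_means))
```

As the comment says, `pred` does not correspond to the GP conditioned on any one particular theta.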

]]>What he said.

ojm > Which type of statistics?

Statistics (of any stripe) is the study of getting answers from data. Applied statistics does it, computational statistics facilitates it, theoretical statistics seeks to understand its properties, philosophy of statistics is philosophy. The last isn’t empirical (although, it’s the philosophy of something empirical, so I guess it’s meta-empirical). The first three are empirical in the sense that they are statements about data (be it real or just limited by a specific set of assumptions).

ojm > [Various things that indicate you felt attacked]

As said above, that was not my intention. I’m going to step away from this part of the conversation. None of us really need a flame war.

]]>But this doesn’t preclude us from choosing a g(x) that minimizes the real world consequences of the error (although as I point out below, it also isn’t always required to choose a particular g(x))

The predictive loss functions that I’m concerned with are always real world ones: if I mis-estimate the number of fish that escape from fish farms each year, how much money does that cost the farms, how much money does it cost to environmentally remediate the problem? If I mis-estimate the effect of cool roofs on heat-islanding, and I impose a regulation regarding cool roofs, how much money is wasted in roofing that could have been spent on better ways to prevent heat islanding? etc etc

I agree we don’t necessarily always need to do this kind of thing, but when the rubber hits the road and we need to collapse our uncertainty down to a specific decision that really matters, I think that’s how to do it. In actuality, often the decision isn’t “choose a parameter value” but rather “choose an action” and so we don’t wind up choosing a particular realization of the GP for example, we just choose a much lower dimensional action whose choice is informed by marginalizing the consequences of that choice across a large sample of GP functions. Sometimes though, we do need a parameter value. Let’s say we’re going to adjust a Mars orbiter’s orbit by sending it a rocket burn command. The values to transmit to the orbiter are the parameters; we’ve gotta choose one vector of them… how? It should be by decision based on real world consequences: some errors might just require a second burn, whereas others cause a deorbit and crash… so we err on the side of not deorbiting and crashing.
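A toy version of the orbiter decision (all numbers invented): draws from the posterior over the “exactly right” burn, an asymmetric loss where overshooting (risking deorbit) costs far more than undershooting (a cheap second burn), and a choice that duly errs to the safe side.

```python
import numpy as np

rng = np.random.default_rng(5)

# Posterior draws of the burn magnitude that would be exactly right.
required = rng.normal(10.0, 1.0, size=10000)

def loss(chosen, required):
    err = chosen - required
    # Overshooting risks a deorbit and crash; undershooting just means a
    # second burn. (Loss shape is a made-up illustration.)
    return np.where(err > 0, 100.0 * err**2, 1.0 * err**2)

# Choose the burn minimizing expected loss over the posterior.
candidates = np.linspace(6.0, 14.0, 801)
expected = [loss(c, required).mean() for c in candidates]
best = candidates[int(np.argmin(expected))]
# best sits well below the posterior mean of 10: we err away from crashing.
```

The asymmetry of the loss, not the shape of the posterior, is what pulls the decision off the posterior mean.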

]]>> In fact the mixture/predictive distribution can be very different to any individual model p(y|theta). How do we interpret it?

From Dan Simpson:

> you answered this yourself.

From ojm:

> Again, cool snarky internet response

I don’t think Dan was trying to be snarky. I think he was just saying that there’s nothing going on beyond what you mention—it’s an average of predictions over the posterior. This has nice calibration properties, but in terms of interpretation, it just averages uncertainty.

]]>> To add to this, a predictive distribution is basically a convex combination/mixture distribution formed from the individual models p(y|theta). If this family is not closed under convex combinations then the result is something different to any individual instance.

It’s rarely closed. For example, if the likelihood is binomial and the prior is beta, then the posterior is a beta-binomial, not a simple binomial. This is good. Any single binomial would underestimate predictive uncertainty inherent in estimating the binomial chance-of-success parameter.
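A quick numeric check of that claim (prior and data invented): with a Beta(1,1) prior and 7 successes in 10 trials, the beta-binomial predictive for 10 new trials is strictly wider than the binomial that plugs in the posterior mean.

```python
# Posterior after 7 successes in 10 trials with a Beta(1,1) prior.
a, b = 1 + 7, 1 + 3          # Beta(8, 4)
n = 10                       # size of a new batch of trials
p_mean = a / (a + b)

# Plug-in binomial predictive variance at the posterior mean.
var_binom = n * p_mean * (1 - p_mean)

# Beta-binomial predictive variance: the extra factor reflects the
# remaining uncertainty about the chance-of-success parameter.
var_bb = n * a * b * (a + b + n) / ((a + b) ** 2 * (a + b + 1))
# var_bb > var_binom: any single binomial understates predictive uncertainty.
```

Here `var_binom` is about 2.22 while `var_bb` is about 3.76, so the mixture is genuinely outside the binomial family.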

]]>Which type of statistics? There are well known tensions here, if you care to think of them/read about them.

Eg Bayesian inference has a well-known logical interpretation. For a nice discussion of how this relates to empiricism (and for more on the observable issue) you can read eg van Fraassen’s ‘Laws and Symmetry’, which includes a good discussion of logical probability and its tensions with empiricism. He has a fairly good discussion of Jaynes’ point of view, unusual for a philosopher (van Fraassen is well known in philosophy for developing ‘constructive empiricism’).

But there are a million examples. That you can blithely dismiss something doesn’t mean it doesn’t exist. You’re missing out on interesting issues, but it’s your call.

> if you don’t buy into this then the Bayesian machinery is not for you

This makes a cool internet comment dismissal but is a pretty shallow response imo. As mentioned I do use Bayes (see eg link above) but I still have the ability to reflect on what this means and how it might be done better. Call it ‘working outside the model machinery’.

> you answered this yourself.

Again, cool snarky internet response but not really in the spirit of thinking things through.

Generally speaking, let’s just say that if you don’t understand any of the points I’m getting at then they’re not for you.

]]>Yeah. That’s not really true though, is it? Take GP regression. If the model is correctly specified, the predictive distribution constructed this way will usually be singular against the data generating measure.

What you’ve described is one possible option, but if I were you I’d consider the predictive loss function that it corresponds to.

]]>So I really don’t know what you’re talking about. A concrete example could help, but this might turn out like the upthread conversation, where it hung on the definition of observable.

But I don’t agree that inference is logical rather than empirical. Statistics is empirical. It is literally the science of data.

> To add to this, a predictive distribution is basically a convex combination/mixture distribution formed from the individual models p(y|theta). If this family is not closed under convex combinations then the result is something different to any individual instance.

Yes. If you don’t buy into this, then the Bayesian machinery is not for you.

> In fact the mixture/predictive distribution can be very different to any individual model p(y|theta). How do we interpret it? Is this the right thing to interpret? Etc.

You answered this yourself. The posterior p(theta | y) are model weights that define a predictive distribution and can be interpreted in terms of how well the model corresponding with theta can represent the data.

]]>Both informative priors, and Bayesian decision theory play important roles here. First off, when the priors are informative all the individual points in the prior high probability sets correspond to believable prediction mechanisms, and so the ensemble of them are individually believable… second off, when making a prediction, we should really choose *one* set of parameters, and the way we choose it is some kind of Bayesian Decision Theory. Often though, we don’t have a particular real-world decision to be made, and so we are satisfied with a range of predictions from a range of models: the posterior predictive distribution. This shouldn’t be seen as a thing, more like a collection of things whose purpose is to be filtered through a decision process that we haven’t specified yet.

But I agree with you that many don’t “recognize the tension” even Bayes can be cargo-cultified into “do this, and then do that, and then publish and then press the lever and get a cookie”

]]>http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005688

But to the point – what Daniel said.

To add to this, a predictive distribution is basically a convex combination/mixture distribution formed from the individual models p(y|theta). If this family is not closed under convex combinations then the result is something different to any individual instance.

In fact the mixture/predictive distribution can be very different to any individual model p(y|theta). How do we interpret it? Is this the right thing to interpret? Etc.

A general theme here is the tension between prediction and inference, empirical procedures and logical procedures. It is unclear to me which the authors prioritise when they conflict, or whether they recognise the tension at all.

]]>For example, suppose the speed of light is exactly a constant, but we don’t know what that constant is. We do some sloppy experiments and get c = 1.04 +- .1 in some units… but c isn’t varying, it’s just unknown. So averaging over all the posterior values tells us what? What is the scientific justification for caring about marginalized predictive?

In a case where the thing of interest varies from place to place or time to time, it’s more easy to see that the marginalized version gives you essentially a mixture of predictions that might actually occur. With something like the speed of light issue the mixture doesn’t occur, it represents uncertainty in what will happen, not variability in what will happen. Perhaps we should just pick a given value and use that. How do we pick such a value?
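One way to see the difference (numbers invented): simulate the marginal predictive for a new measurement of the fixed constant. Its variance is measurement noise plus posterior uncertainty about c, even though nothing in the world varies:

```python
import numpy as np

rng = np.random.default_rng(6)

sigma_c, sigma_meas = 0.1, 0.05

# Draws from the posterior for the (fixed but unknown) constant c...
c = rng.normal(1.04, sigma_c, size=200000)
# ...and simulated new measurements given each draw.
y = c + rng.normal(0.0, sigma_meas, size=c.size)

# The marginal predictive is wider than the measurement noise alone:
# Var(y) ~= sigma_meas^2 + sigma_c^2. The extra width is *uncertainty about
# one fixed number*, not variability in the speed of light.
var_pred = y.var()
```

The decomposition holds either way; the interpretive question in the comment is whether that extra width should be propagated or collapsed to a single chosen value.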

]]>The posterior predictive distribution is

p(y’ | y) = integral p(y’ | theta) p(theta | y) dtheta.

The prior can be thought of as prior to the observed data in the sense that if it is empty, we can still make predictions. This is done with the prior predictive distribution,

p(y) = integral p(y | theta) p(theta) dtheta.

Sorry if this is all obvious and I just missed the point.

]]>When it comes to choosing a prediction set of parameters, I think we need to treat it as a decision process, a-la Wald. Use a model of real-world consequences, and then choose a value of the parameters that minimizes expected real-world consequences (or maximizes if you’re using gain vs using loss).

Ultimately, if you want to choose a particular value for the parameters, you need a reason why to prefer one vs another. This procedure involving real-world consequences seems to me the only one really justified, and the justification is a moral one so people don’t like to talk about it. But in most problems there are real honest to goodness morality issues really involved: choosing drug treatments and dosage, choosing design parameters for building structural calculations, choosing policies for Norwegian salmon fisheries, whatever; lives are lost, trillions of dollars are wasted… etc. Even stuff like measuring the mass of the Higgs Boson. You might just say “hey we just really want the true value” but at some point you have to decide whether you’re going to commit an extra month on the LHC at $10^9/mo to reduce your uncertainty by 1% or whatever.

]]>How do you interpret the predictive distribution obtained by marginalising over the parameters? A single model? A ‘typical’ model? An ensemble of models? Etc

]]>The way I see this is we have a predictive theory, which requires the values of some free variables in order to predict, and which accepts certain “errors” or imprecisions of prediction compared to measurement.

What comes first is the predictive process. After we know the predictive process description we have to specify which sets of free variable values we’ll accept, and which set of errors we’ll accept based on our knowledge, not including the data we use for inference. Then we collect the data, and do the inference mechanically.

]]>What does it mean that “the prior distribution should come before the data model”? How can one propose a prior distribution for theta before knowing what is theta (i.e. the form of the full model)? If the model is only partially known, I think one would require at least some guarantees about the irrelevance of the undefined part when it comes to theta…

I assume the “data model” in the previous quote refers to the model before the data is known. (I assume as well that the references to the “likelihood function” in the paper are to the general form of the likelihood function before the data is known, a function of the parameters and the potential data, and not the actual likelihood function conditional on the observed data, but I’m not 100% sure about that.)

]]>Thinking in terms of formalisms, your model is an expression in a formal language, let’s say the lambda calculus. It has free variables. if we wrap this expression in a lambda where you can bind the free variables, then your expression becomes a function of those free variables which returns a function that predicts.

lambda a,b,c . lambda ObservedData . predictor(a,b,c,ObservedData)

Now, how can we find “good” values of a,b,c? One way we can get them is to understand what the insides of our predictor function does, and to determine based on our scientific understanding of the insides of this function, what values for a,b,c would produce “reasonable” results. For example, we don’t want muscular power to go to infinity, or even to 1000 horsepower for a horse, or for the surface temperature of a roof in a city to reach 1000F or whatever. These “reasonable” ideas about how our predictor should work go into constructing a set of a,b,c values we’re willing to consider, and then we construct a measure over this set to generalize the indicator function into not just “in or out of the set” but rather “how much weight to give to the set”

Next we’re stuck, unless we are willing to provide a way to filter the a,b,c set through real world experience. Since no predictor predicts precisely, we need to define more or less reasonable prediction comparisons. We create a function f(Data,Prediction(a,b,c)) which assigns large values when Data and Prediction are in some sense “close” and small values when they are “far”. Then we multiply this function into the weighting function over the set of a,b,c values, and re-normalize the measure over a,b,c

Although the likelihood is often written p(Data | a,b,c) I think more fundamental to all of it is f(C(Data,Prediction(a,b,c))) where C is a comparison function and f weights the different comparison outcomes.
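A sketch of that construction with a one-variable toy predictor (everything here is invented for illustration): a weight over the “reasonable” set of values, a comparison function C, a weighting f of its outcomes, multiplied together and re-normalized.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy predictor with one free variable: predict(a) = a * x.
x = np.linspace(0, 1, 50)
data = 2.0 * x + rng.normal(0, 0.1, size=x.size)

a_grid = np.linspace(-5, 5, 1001)

# Step 1: weight over the "reasonable" set of a values (a generalized
# indicator function; here a broad normal shape).
w_prior = np.exp(-0.5 * (a_grid / 3.0) ** 2)

# Step 2: comparison C(Data, Prediction) and a weighting f of its outcomes
# (large when close, small when far) -- playing the role of the likelihood.
def closeness(a):
    c = np.mean((data - a * x) ** 2)     # C: the comparison outcome
    return np.exp(-c / (2 * 0.1**2))     # f: how much weight that outcome gets

# Multiply f(C(...)) into the weighting over a and re-normalize the measure.
w_post = w_prior * np.array([closeness(a) for a in a_grid])
w_post /= w_post.sum()

a_hat = float(np.sum(a_grid * w_post))   # concentrates near the true a = 2
```

Written this way, nothing forces f(C(...)) to be a probability density of the data; it is just a graded comparison, which is the point of the comment.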

So, now you know some of what is in the paper I wrote a month back, and I keep meaning to send a draft to you.

]]>Is a ‘joint generative model’ to be compared to ‘joint’ data/parameters or are the parameters marginalised out first a la predictive distributions?

Is the distinction between data and parameter important or not? Is it just what happens to be directly observed or is it something else? Eg do you directly observe a regression coefficient or a population mean?

I still think ‘generative’ is perhaps hiding a bunch of these issues – what is a generative model of a parameter intended to represent? Or do you only care about the generative data model that results after marginalisation?

The paper itself seems to me to be shifting _more_ towards a frequentist and/or predictive view and further from a logical Bayes point of view. The idea that a prior requires a ‘likelihood’ to interpret it seems to be taking an empirical rather than logical POV to me.

Which is fine, and I prefer that tbh, but I still get the feeling of lots of inconsistency in the various viewpoints floating about (unfortunately it is still in the ‘feeling’ stage and less in the coherent view stage).

]]>It’s a practical point. When students see how missing data and measurement error solutions arise naturally from defining the joint model, before we know which variables we get to observe, serious problems get solved. If instead we use traditional language, they end up blocking solutions.

]]>It is the profitability (in a scientific sense) of representation that is important – not the metaphysics of factorization and the prior having to be before and independent of the likelihood. That is just bad metaphysics in my opinion.
