(It’s never a) Total Eclipse of the Prior

(This is not by Andrew)

This is a paper we (Gelman, Simpson, Betancourt) wrote by mistake.

The paper in question, recently arXiv’d, is called “The prior can generally only be understood in the context of the likelihood”.

How the sausage was made

Now, to be very clear (and because I’ve been told since I moved to North America that you are supposed to explicitly say these things rather than just work on the assumption that everyone understands that there’s no way we’d let something sub-standard be seen by the public) this paper turned out very well.  But it started with an email where Andrew said “I’ve been invited to submit a paper about priors to a special issue of a journal, are you both interested?”.

Why did we get this email?  Well Mike and I [Simpson], along with a few others, have been working with Andrew on a paper about weakly informative priors that has been stuck in the tall grass for a little while.  And when I say it’s been stuck in the tall grass, I mean that I [Simpson] got mesmerised by the complexity and ended up stuck. This paper has gotten us out of the grass. I’d use a saying of my people (“you’ve got to suffer through Henry Street to make it to People”) except this paper is not Henry Street.  This paper is good.  (This paper is also not People, so watch this space…)

So over a fairly long email thread, we worked out that we were interested and we carved out a narrative and committed to the idea that writing this paper shouldn’t be a trauma.  Afterwards, it turned out that individually we’d each understood the end of that conversation differently, leading in essence to three parallel universes that we were each playing in (eat your heart out Sliders).

Long story short, Andrew went on holidays and one day emailed us a draft of the short paper he had thought we were writing. I then took it and wrestled it into a draft of the short paper I thought we were writing.  Mike then took it and wrestled it into the draft of the short paper he thought we were writing. And so on and so forth.  At some point we converged on something that (mostly) unified our perspectives and all of a sudden, this “low stakes” paper turned into something that we all really wanted to say.

Connecting the prior and the likelihood

So what is this paper about? Well it’s 13 pages, you can read it.  But it covers a few big points:

1) If you believe that priors are not important to Bayesian analysis, we have a bridge we can sell you.  This is particularly true for complex models, where the structure of the posterior may lead to certain aspects of the prior never washing away with more data.

2) Just because you have a probability distribution doesn’t mean you have a prior.  A prior connects with a likelihood to make a *generative model* for new data, and when we understand it in that context, weakly informative priors become natural.

3) This idea is understood by a lot of the classical literature on prior specification, such as reference priors.  These methods typically use some sort of asymptotic argument to remove the effect of the specific realisation of the likelihood that is observed.  The resulting prior then leans heavily on the assumption that this asymptotic argument is valid for the data that is actually being observed, which often does not hold.  When the data are far from asymptopia, the resulting priors are too diffuse and can lead to nonsensical estimates.

Generative models are the key

4) The interpretation of the prior as a distribution that couples with the likelihood to build a generative model for new data is implicit in the definition of the marginal likelihood, which is just the density of this generative distribution evaluated at the observed data.  This makes it easy to understand why improper priors cannot be used for Bayes factors (ratios of marginal likelihoods): they do not produce generative models.  (In a paper that will come out later this week, we make a pretty good suggestion for a better use of the prior predictive.)
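To make that concrete, here is a minimal sketch (mine, not from the paper) of the marginal likelihood as the prior predictive density evaluated at the observed data, for a toy normal-mean model with made-up numbers; note that the prior must be proper for the Monte Carlo average to exist at all.

# Minimal sketch: the marginal likelihood p(y) is the prior predictive density
# evaluated at the observed data. Toy model: theta ~ Normal(0, 5), y_i ~ Normal(theta, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = np.array([1.2, 0.4, 2.1, 1.7])                 # made-up observed data

theta = rng.normal(0.0, 5.0, size=100_000)         # draws from the (proper) prior
log_lik = stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)

# log p(y) ~= log of the average likelihood over prior draws
log_marginal = np.logaddexp.reduce(log_lik) - np.log(theta.size)
print("Monte Carlo log marginal likelihood:", log_marginal)

# With an improper prior there is no distribution to draw theta from, so this
# average (and hence the Bayes factor) is not even defined.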

5) Understanding what a generative model means also makes it clear why any decision (model choice or model averaging) that involves the marginal likelihood leans very heavily on the prior that has been chosen.  If your data is y and the likelihood is p(y | theta), then the generative model makes new data as follows:

– Draw theta ~ p(theta) from the prior

– Draw y ~ p(y | theta).

So if, for example, p(theta) has very heavy tails (like a half-Cauchy prior on a standard deviation), then occasionally the data will be drawn with an extreme value of theta.

This means that the entire prior will be used for making these decisions, even if it corresponds to silly parts of the parameter space.  This is why we strongly advocate using posterior predictive distributions for model comparison (LOO scores) or model averaging (predictive stacking).
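As a small sketch of that two-step generative draw (a toy example, assuming a half-Cauchy(0, 1) prior on a standard deviation and a simple normal observation model, neither of which is from the paper):

# Sketch of the generative model: draw theta from the prior, then data given theta.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_obs = 10_000, 20

sigma = np.abs(rng.standard_cauchy(n_sims))                # sigma ~ half-Cauchy(0, 1) prior
y_rep = rng.normal(0.0, sigma[:, None], (n_sims, n_obs))   # y ~ Normal(0, sigma)

# The heavy prior tail occasionally produces absurdly extreme fake data sets,
# and those draws are part of what the marginal likelihood averages over.
print("median |y_rep|:", np.median(np.abs(y_rep)))
print("max |y_rep|:   ", np.abs(y_rep).max())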

So when can you safely use diffuse priors? (Hint: not often)

6) Enjoying, as I do, giving slightly ludicrous talks, I recently gave one called “you can only be uninformative if you’re also being unambitious”.  This is an under-appreciated point about Bayesian models for complex data: the directions that we are “vague” in are the directions where we are assuming the data is so strong that that aspect of the model will be unambiguously resolved. This is a huge assumption and one that should be criticised.

So turn around, bright eyes. You really need to think about your prior!

PS. If anyone is wondering where the first two sentences of the post went, they weren’t particularly important and I decided that they weren’t particularly well suited to this forum.

80 thoughts on “(It’s never a) Total Eclipse of the Prior”

  1. Last week I sent you (the authors) a comment about the paper by email, but the address I found for you was not valid anymore. I know Andrew got the message but I don’t know if he might have noticed that you didn’t receive the original message. In case you didn’t see it:

    > With the above prior and likelihood, the posterior for β is a product of independent Gaussians with unit variance and mean given by the least squares estimator of β. The problem is that standard concentration of measure inequalities show that this posterior is not uniformly distributed in a unit ball around the ordinary least squares estimator but rather is exponentially close in the number of coefficients to a sphere of radius 1 centered at the estimate.

    I think there are a couple of problems in that paragraph (or maybe I misunderstood what you mean). The d-dimensional spherical Gaussian with variance 1 in each dimension has its mass concentrated around the sphere of radius sqrt(d), not 1. A more substantial issue is that the concentration of measure is not really related to the fact that the posterior is not uniformly distributed in a unit ball. It is of course true that this distribution is not uniformly distributed in a unit ball. But even if it were distributed uniformly in a unit ball, that would also result in concentration of measure at the surface.
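    A quick numerical check of the sqrt(d) point (a minimal sketch, with arbitrary dimensions and sample sizes):

# The norm of a d-dimensional standard Gaussian concentrates around sqrt(d), not 1.
import numpy as np

rng = np.random.default_rng(42)
for d in (10, 100, 1000):
    norms = np.linalg.norm(rng.standard_normal((5000, d)), axis=1)
    print(f"d={d:4d}  mean norm={norms.mean():6.2f}  sqrt(d)={d**0.5:6.2f}  sd={norms.std():.3f}")
# The mean norm tracks sqrt(d) while its spread stays roughly constant in d.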

    • Thanks Carlos. Just FYI, my email address is at the bottom of the abstract link on arxiv (you have to click “show email”). (One of the joys of just having changed jobs is that a lot of email is getting lost at the moment…)

      The sqrt(d) mistake is embarrassing! It will be fixed in an arXiv revision soon.

      > A more substantial issue is that the concentration of measure is not really related to the fact that the posterior is not uniformly distributed in a unit ball. It is of course true that this distribution is not uniformly distributed in a unit ball. But even if it were distributed uniformly in a unit ball, that would also result in concentration of measure at the surface.

      I was coming at this from a different direction (albeit somewhat inelegantly). My argument is that the prior is strongly concentrated near the (not unit!) ball, and hence the posterior will be concentrated near there unless there is an extremely large amount of data. Not every prior in high dimensions has this concentration problem (spike and slab, for instance, or horseshoe). But I agree with your point that I wrote it badly. I will have another shot!

      • I noticed the “view email” link but it requires an arxiv account. I guess I have one but I don’t remember the details and anyway I wouldn’t expect it to work after seventeen years…

  2. Very interesting paper – fleshes out the idea that, while the joint model can mathematically be factorized into prior and likelihood, the idea that they are truly separate is just a superstition. The joint model needs to represent, not too badly, the reality that generated the data we are trying to understand.

    (Also you have just moved to Toronto and my favorite Canadian University – enjoy.)

    • I’m not so ready to dismiss the conceptual distinction between the parts of the model that bear directly on observable quantities and the parts of the model that don’t…

      • That’s fine. This is how you build the model (and this is a good way to build a model). But how well the model can represent data is indifferent to the factorisation.

        • Take a high-dimensional sparse regression and compare a multivariate normal with a horseshoe. The individual standard deviation parameters (the ones with the half normal) are unobservable. The model fit will be quite different.

        • You mean compare these two as priors right? And then model fit is via predictive distributions? Or what?

          How much is learned from the data fit about the parameters is given by change from prior to posterior. Unobservables you learn nothing, right?

          Or how do you define ‘unobservable’?

        • Well, I’m defining it as “something you don’t observe directly”, unless there’s a better definition out there.

          In the horseshoe case, those local standard deviations can’t be observed directly, but their prior is updated by the data.

        • Which issues? The point that how well a model can represent the data is indifferent to the factorisation continues to be true even when the distributions of some parameters are not updated by the data. The prior predictive is constructed by marginalising the joint (as is the posterior predictive, just a different joint).

          “Identifiability” is a statement about behaviour in a specific asymptotic regime, so you’d need to be more specific. Again, I don’t see how this is relevant.

          And again, I’m really not suggesting people abandon the idea of observables or anything else. (And really, if you want to work with weakly informative priors, as we strongly advocate in this paper, you often need a notion of scale so you do need to think about how things are observed or how they can interact with the rest of the model).

        • (Incidentally, I don’t want to dismiss your comments casually or otherwise. I just don’t understand them! This may or may not come across well in text.)

        • ‘How well a model can fit the data’ explicitly distinguishes the concept of data from the concept of parameter.

          Sure a likelihood x prior is equivalent to a joint. But that is already given a prior.

          If that’s the issue at stake then OK I agree. But it’s a trivial point to me. The more relevant point, I think, is that a parameter is a different kind of thing to data.

          (The identifiability issue is slightly orthogonal, but relates to the parameter/data distinction. Ie it comes in if you define parameters as functionals of observable data distributions)

        • (BTW I think my first comment about fit being independent of unobservables is sufficiently vague to be ignored. Was more meant as a response to what seemed to be casual dismissal of the parameter/data distinction. But then you said this is relevant for model building…so I’m not too sure if we necessarily disagree on this issue after all…)

        • Ok. I was wondering why you were fighting back against something so straightforward!

          Yes a parameter is different to data. That’s why we need to consider the implied distribution on the data given by the generative model.

        • Yeah my bad. Also I’m supposed to be reading fiction in the sun drinking cocktails right now not arguing on the internet so…better get on with that! :-)

        • IDK, I see what McElreath is talking about as an exploration of some potential consequences of the whole ‘joint generative model’ point of view. It’s a good POV, that I agree with, but I can see why ojm thought there might be a connection…

        • To clarify my point, as ojm asked, it was mainly as Dan put it “how well the model can represent data is indifferent to the factorisation”.

          It is the profitability (in a scientific sense) of representation that is important – not the meta-physics of factorization and the prior having to be before and independent of the likelihood. That is just bad meta-physics in my opinion.

        • The examples from my “language” talk were precisely chosen to be ones where some variables are observable in some contexts but not in others. Therefore whether we’d call a variable a parameter or data depends upon the context. Once we know the context, the distinction matters a lot.

          It’s a practical point. When students see how missing data and measurement error solutions arise naturally from defining the joint model, before we know which variables we get to observe, serious problems get solved. If instead we use traditional language, they end up blocking solutions.

        • OK but it does seem like there is some real ambiguity here. My own views have shifted enough that it doesn’t really matter (I don’t think I really belong to any of the present camps as such anymore) but still I’m interested.

          Is a ‘joint generative model’ to be compared to ‘joint’ data/parameters or are the parameters marginalised out first a la predictive distributions?

          Is the distinction between data and parameter important or not? Is it just what happens to be directly observed or is it something else? Eg do you directly observe a regression coefficient or a population mean?

          I still think ‘generative’ is perhaps hiding a bunch of these issues – what is a generative model of a parameter intended to represent? Or do you only care about the generative data model that results after marginalisation?

          The paper itself seems to me to be shifting _more_ towards a frequentist and/or predictive view and further from a logical Bayes point of view. The idea that a prior requires a ‘likelihood’ to interpret it seems to be taking an empirical rather than logical POV to me.

          Which is fine, and I prefer that tbh, but I still get the feeling of lots of inconsistency in the various viewpoints floating about (unfortunately it is still in the ‘feeling’ stage and less in the coherent view stage).

        • ojm: my own take on this is that the inference/logic aspect of Bayes is orthogonal (not slightly, actually) to the model-construction aspect. When choosing a model for what is going on, you define some equations and expressions that describe how to predict the outcomes of experiments… and in order to do it you have some unknown quantities that you need to know. This is outside Bayes, just scientific understanding. For example, how does muscular power output relate to the size of the animal? How does roof albedo affect heat storage in cities? etc etc. You can’t expect yourself to write down an expression for these things and plug in the correct numbers for each symbolic component, or even expect that these numbers *are* numbers, rather than say slightly different numbers for mammals vs reptiles, or for each city, or for whatever. But, if an oracle could give you the best numbers for those “parameters” you could plug them into a formal expression to predict outcomes of experiments.

          Thinking in terms of formalisms, your model is an expression in a formal language, let’s say the lambda calculus. It has free variables. if we wrap this expression in a lambda where you can bind the free variables, then your expression becomes a function of those free variables which returns a function that predicts.

          lambda a,b,c . lambda ObservedData . predictor(a,b,c,ObservedData)

          Now, how can we find “good” values of a,b,c? One way we can get them is to understand what the insides of our predictor function does, and to determine based on our scientific understanding of the insides of this function, what values for a,b,c would produce “reasonable” results. For example, we don’t want muscular power to go to infinity, or even to 1000 horsepower for a horse, or for the surface temperature of a roof in a city to reach 1000F or whatever. These “reasonable” ideas about how our predictor should work go into constructing a set of a,b,c values we’re willing to consider, and then we construct a measure over this set to generalize the indicator function into not just “in or out of the set” but rather “how much weight to give to the set”

          Next we’re stuck, unless we are willing to provide a way to filter the a,b,c set through real world experience. Since no predictor predicts precisely, we need to define more or less reasonable prediction comparisons. We create a function f(Data,Prediction(a,b,c)) which assigns large values when Data and Prediction are in some sense “close” and small values when they are “far”. Then we multiply this function into the weighting function over the set of a,b,c values, and re-normalize the measure over a,b,c

          Although the likelihood is often written p(Data | a,b,c) I think more fundamental to all of it is f(C(Data,Prediction(a,b,c))) where C is a comparison function and f weights the different comparison outcomes.

          So, now you know some of what is in the paper I wrote a month back, and I keep meaning to send a draft to you.

  3. Maybe it is my old age showing, but while everyone else is knee deep into the details of the paper, I am floored by the link to that version of “Total Eclipse of the Heart”. I mean the original video was bad enough. Written in bars, indeed!

  4. > In the present paper we address an apparent paradox: Logically, the prior distribution should come before the data model, but in practice, priors are often chosen with reference to a likelihood function.

    What does it mean that “the prior distribution should come before the data model”? How can one propose a prior distribution for theta before knowing what theta is (i.e. the form of the full model)? If the model is only partially known, I think one would require at least some guarantees about the irrelevance of the undefined part when it comes to theta…

    I assume the “data model” in the previous quote refers to the model before the data is known. (I assume as well that the references to the “likelihood function” in the paper are to the general form of the likelihood function before the data is known, a function of the parameters and the potential data, and not the actual likelihood function conditional on the observed data, but I’m not 100% sure about that.)

    • Yeah, I don’t agree with the idea “logically the prior distribution should come before the data model”

      the way I see this is we have a predictive theory, which requires the values of some free variables in order to predict, and which accepts certain “errors” or imprecisions of prediction compared to measurement.

      What comes first is the predictive process. After we know the predictive process description we have to specify which sets of free variable values we’ll accept, and which set of errors we’ll accept based on our knowledge, not including the data we use for inference. Then we collect the data, and do the inference mechanically.

      • But should the predictive model be generated by averaging over the parameters or, say (in principle), plugging in values for each of them and separately evaluating each model instance thus generated?

        How do you interpret the predictive distribution obtained by marginalising over the parameters? A single model? A ‘typical’ model? An ensemble of models? Etc

        • Marginalization is a good question. I think it becomes a mathematically acceptable thing to do as soon as you accept that a probability distribution over the parameters is a way to describe the parameters which are acceptable to your theory pre-data. But just because it’s mathematically acceptable doesn’t mean you should do it.

          When it comes to choosing a prediction set of parameters, I think we need to treat it as a decision process, a-la Wald. Use a model of real-world consequences, and then choose a value of the parameters that minimizes expected real-world consequences (or maximizes if you’re using gain vs using loss).

          Ultimately, if you want to choose a particular value for the parameters, you need a reason why to prefer one vs another. This procedure involving real-world consequences seems to me the only one really justified, and the justification is a moral one so people don’t like to talk about it. But in most problems there are real honest to goodness morality issues really involved: choosing drug treatments and dosage, choosing design parameters for building structural calculations, choosing policies for Norwegian salmon fisheries, whatever, lives are lost, trillions of dollars are wasted… etc Even stuff like measuring the mass of the Higgs Boson. You might just say “hey we just really want the true value” but at some point you have to decide whether you’re going to commit an extra month on the LHC at $10^9/mo to reduce your uncertainty by 1% or whatever.

        • Technically, if $latex \theta$ is the parameter vector, $latex y$ is the observed data, and $latex \tilde{y}$ is new data, then the posterior predictive distribution is defined exactly by marginalizing,

          $latex p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \ p(\theta \mid y) \ \mathrm{d}\theta$.

          The prior can be thought of as prior to the observed data $latex y$ in the sense that if it is empty, we can still make predictions. This is done with the prior predictive distribution,

          $latex p(\tilde{y}) = \int p(\tilde{y} \mid \theta) \ p(\theta) \ \mathrm{d}\theta$.

          Sorry if this is all obvious and I just missed the point.

        • I think the point was not how do you define the marginalized predictive distribution, but rather why is marginalization a scientifically meaningful thing, and what does it mean?

          For example, suppose the speed of light is exactly a constant, but we don’t know what that constant is. We do some sloppy experiments and get c = 1.04 +- .1 in some units… but c isn’t varying, it’s just unknown. So averaging over all the posterior values tells us what? What is the scientific justification for caring about marginalized predictive?

          In a case where the thing of interest varies from place to place or time to time, it’s more easy to see that the marginalized version gives you essentially a mixture of predictions that might actually occur. With something like the speed of light issue the mixture doesn’t occur, it represents uncertainty in what will happen, not variability in what will happen. Perhaps we should just pick a given value and use that. How do we pick such a value?

        • Yes I know (and use) the definitions of prior and posterior predictive distributions. See eg here for an example:

          http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005688

          But to the point – what Daniel said.

          To add to this, a predictive distribution is basically a convex combination/mixture distribution formed from the individual models p(y|theta). If this family is not closed under convex combinations then the result is something different to any individual instance.

          In fact the mixture/predictive distribution can be very different to any individual model p(y|theta). How do we interpret it? Is this the right thing to interpret? Etc.

          A general theme here is the tension between prediction and inference, empirical procedures and logical procedures. It is unclear to me which the authors prioritise when they conflict, or whether they recognise the tension at all.

        • “In fact the mixture/predictive distribution can be very different to any individual model p(y|theta).”

          Both informative priors, and Bayesian decision theory play important roles here. First off, when the priors are informative all the individual points in the prior high probability sets correspond to believable prediction mechanisms, and so the ensemble of them are individually believable… second off, when making a prediction, we should really choose *one* set of parameters, and the way we choose it is some kind of Bayesian Decision Theory. Often though, we don’t have a particular real-world decision to be made, and so we are satisfied with a range of predictions from a range of models: the posterior predictive distribution. This shouldn’t be seen as a thing, more like a collection of things whose purpose is to be filtered through a decision process that we haven’t specified yet.

          But I agree with you that many don’t “recognize the tension” even Bayes can be cargo-cultified into “do this, and then do that, and then publish and then press the lever and get a cookie”

        • > A general theme here is the tension between prediction and inference, empirical procedures and logical procedures. It is unclear to me which the authors prioritise when they conflict, or whether they recognise the tension at all.

          So I really don’t know what you’re talking about. A concrete example could help, but this might turn out like the upthread conversation, where it hung on the definition of observable.

          But I don’t agree that inference is logical rather than empirical. Statistics is empirical. It is literally the science of data.

          > To add to this, a predictive distribution is basically a convex combination/mixture distribution formed from the individual models p(y|theta). If this family is not closed under convex combinations then the result is something different to any individual instance.

          Yes. If you don’t buy into this, then the Bayesian machinery is not for you.

          > In fact the mixture/predictive distribution can be very different to any individual model p(y|theta). How do we interpret it? Is this the right thing to interpret? Etc.

          You answered this yourself. The posterior p(theta | y) are model weights that define a predictive distribution and can be interpreted in terms of how well the model corresponding with theta can represent the data.

        • > Statistics is empirical. It is literally the science of data.

          Which type of statistics? There are well known tensions here, if you care to think of them/read about them.

          Eg Bayesian inference has a well-known logical interpretation. For a nice discussion of how this relates to empiricism (and for more on the observable issue) you can read eg van Fraassen’s ‘Laws and Symmetry’, which includes a good discussion of logical probability and tensions with empiricism. He has a fairly good discussion of Jaynes’ point of view, unusual for a philosopher (van Fraassen is well known in philosophy for developing ‘constructive empiricism’).

          But there are a million examples. The ability of you to blithely dismiss something doesn’t mean it doesn’t exist. You’re missing out on interesting issues, but it’s your call.

          > if you don’t buy into this then the Bayesian machinery is not for you

          This makes a cool internet comment dismissal but is a pretty shallow response imo. As mentioned I do use Bayes (see eg link above) but I still have the ability to reflect on what this means and how it might be done better. Call it ‘working outside the model machinery’.

          > you answered this yourself.

          Again, cool snarky internet response but not really in the spirit of thinking things through.

          Generally speaking, let’s just say that if you don’t understand any of the points I’m getting at then they’re not for you.

        • From ojm:
          > In fact the mixture/predictive distribution can be very different to any individual model p(y|theta). How do we interpret it?

          From Dan Simpson:
          > you answered this yourself.

          From ojm:
          > Again, cool snarky internet response

          I don’t think Dan was trying to be snarky. I think he was just saying that there’s nothing going on beyond what you mention—it’s an average of predictions over the posterior. This has nice calibration properties, but in terms of interpretation, it just averages uncertainty.

        • Bob > I don’t think Dan was trying to be snarky. I think he was just saying that there’s nothing going on beyond what you mention

          What he said.

          ojm > Which type of statistics?

          Statistics (of any stripe) is the study of getting answers from data. Applied statistics does it, computational statistics facilitates it, theoretical statistics seeks to understand its properties, philosophy of statistics is philosophy. The last isn’t empirical (although, it’s the philosophy of something empirical, so I guess it’s meta-empirical). The first three are empirical in the sense that they are statements about data (be it real or just limited by a specific set of assumptions).

          ojm > [Various things that indicate you felt attacked]

          As said above, that was not my intention. I’m going to step away from this part of the conversation. None of us really need a flame war.

        • Fair enough. It’s not so much feeling attacked but that

          > there’s nothing going on beyond what you mention

          feels a bit like ‘gaslighting’ or whatever it’s called.

          But my bad for getting frustrated. No hard feelings :-)

        • ojm: that’s quite a serious statement and I really really do not want to ever gaslight anyone (intention is much less important than effect). What would you like me to do? For instance, I can remove my comments from this thread, or I can remove the post.

          Email me ([email protected]) if you want to talk offline.

        • If you’re not comfortable talking to me, both Bob and Andrew (and a few others, but definitely those two) have edit privileges and can do whatever you want.

        • > don’t agree that inference is logical rather than empirical
          Depends on what you mean by logic but agree it is just about learning (getting less wrong as quickly as possible) about empirical aspects of the world.

        • > second off, when making a prediction, we should really choose *one* set of parameters, and the way we choose it is some kind of Bayesian Decision Theory.

        • Yeah. That’s not really true though, is it? Take GP regression. If the model is correctly specified, the predictive distribution constructed this way will usually be singular against the data generating measure.

          What you’ve described is one possible option, but if I were you I’d consider the predictive loss function that it corresponds to.

        • Not sure I understand what you’re saying here. Are you saying that if f(x) is a function, and g(x) is a realization from a GP designed to learn about f(x) that g(x) never exactly equals f(x)?

          But this doesn’t preclude us from choosing a g(x) that minimizes the real world consequences of the error (although as I point out below, it also isn’t always required to choose a particular g(x))

          The predictive loss functions that I’m concerned with are always real world ones: if I mis-estimate the number of fish that escape from fish farms each year, how much money does that cost the farms, how much money does it cost to environmentally remediate the problem? If I mis-estimate the effect of cool roofs on heat-islanding, and I impose a regulation regarding cool roofs, how much money is wasted in roofing that could have been spent on better ways to prevent heat islanding? etc etc

          I agree we don’t necessarily always need to do this kind of thing, but when the rubber hits the road and we need to collapse our uncertainty down to a specific decision that really matters, I think that’s how to do it. In actuality, often the decision isn’t “choose a parameter value” but rather “choose an action” and so we don’t wind up choosing a particular realization of the GP for example, we just choose a much lower dimensional action whose choice is informed by marginalizing the consequences of that choice across a large sample of GP functions. Sometimes though, we do need a parameter value. Let’s say we’re going to adjust a Mars orbiter’s orbit by sending it a rocket burn command. The values to transmit to the orbiter are the parameters, we’ve gotta choose one vector of them… how? It should be by decision based on real world consequences: some errors might just require a second burn, whereas others cause a deorbit and crash… so we err on the side of not deorbiting and crashing.

        • Ah – I think our problem has to do with what we’re seeing as a parameter!

          In the GP case, there is a natural partition of the parameter space into parameters that control the correlation structure of the GP and the GP itself. If we call the former parameters theta and the GP f(x), then the prediction of f(x) that minimises the prediction error is usually computed by finding the conditional mean of f(x) | y, theta and then integrating over the posterior theta| y. So this does not correspond to one particular parameter.

          If instead you want to choose an action, then yes I 100% agree with your statement. In the GP case, the space of actions that could minimise the Loss often has zero probability under the model (it’s one of those weird games we can play with infinities). But if you separate “actions” from “parameters” I think things are much much cleaner.

        • Ugh – I just realised that was a stupidly overcomplicated example.

          The same thing happens if you’ve got a GP with a known covariance structure and your loss function is mean-squared error (or some weighted variant thereof). In that case, the optimal action is the posterior mean, which lies in the Reproducing Kernel Hilbert Space (RKHS) associated with the GP covariance function (or its Cameron-Martin space if you’re from that world).

          The statement {f belongs to the RKHS} has zero probability under both the prior and posterior, so the space of admissible actions does not correspond with the parameter space.

          Someone on the discourse used this as an example of the mean being far from the typical set, but that’s not really what’s happening here. The RKHS is dense in the support of the prior/posterior, so the optimal prediction is very close to the posterior mass. It’s just that you can “maths” it away by repeating a certain operation an infinite number of times.

          So the statement that the action space is different from the parameter space is true, but possibly not very useful.

        • Yes, sometimes actions correspond to “choose a parameter” (to send to the mars orbiter as 32 bit floating point numbers) other times actions correspond to something simpler “choose whether to pass the proposed law or not, choose whether to buy machine x vs machine y”

          there are even situations where you might say something like “choose which realization of my gaussian process to publish as the temperature vs time curve for tomorrow’s weather forecast” or “choose a function of space to plot the current air quality on a map”

          Whenever we choose a parameter to publish from a continuous distribution, the point estimate has zero probability relative to the continuous probability measure. This doesn’t bother me. Sure the probability that it’ll be exactly 103.1F tomorrow is zero, but that doesn’t mean we shouldn’t choose “publish that temperature” as our action.

        • I think some of these things are examples of the math that pure mathematicians like being not the applicable math for the problem at hand.

          My chosen formalism for dealing with some of these issues is nonstandard analysis using Internal Set Theory (IST) of Edward Nelson.

          In this formalism you have realizations of a GP being vectors of nonstandard length representing the value of the function at each point a + i * dx for dx an infinitesimal for values going from a to b where a and b are potentially nonstandard themselves.

          In general, I don’t believe in minimizing squared error, because it rarely corresponds to a real world outcome. But even if you do want to do it, you still are left with basically:

          choose: standard part of argmin over g(x) of sum over x (f(x)-g(x))^2 dx

          Now, that sum over x is the integral in IST, and argmin over g(x) is the choice of a really really big vector of values… and the standard part tells you to ignore the details that are only accessible when you have the infinitesimal microscope.

          You can argue that the standard part isn’t one of the g(x) (ie. it’s in the RKHS instead of the space), but in this context, g(x) is just a device to get you that standard part. In the same way you can argue that “103.4 F” has probability zero under normal(100,3) posterior but the posterior is just a device to get you a 4 digit number to put on your website. The GP is just a device to get you a “standard” function that tells you what the air quality is all over the western part of the US… or whatever.

        • > Whenever we choose a parameter to publish from a continuous distribution, the point estimate has zero probability relative to the continuous probability measure. This doesn’t bother me.

          The thing I was talking about was different to this. In the GP case (and in, therefore, a possibility in general), the action space (before seeing specific data) has zero probability. I agree that the point estimate having zero probability isn’t really a problem (because the data does too). But this is more of a thing that has to be considered. And in particular, it’s why it’s important to distinguish between the action space and the parameter space (because they may not coincide).

          And I’m definitely not saying “don’t use the output of decision analysis” (that would be a weird thing to say!). Just that the original point (way back when) was not correct as stated.

        • Interestingly, in the NSA / IST formulation it typically isn’t the case that the action space has zero probability. In standard math, there is no infinitesimal, so the best we can do is call the measure over a point zero. But in NSA there is an infinitesimal, and so typically for a simple “continuous” variable like x ~ normal(0,1) the probability space assigns infinitesimal probability to each point in a nonstandard grid.

          I suspect that the same thing is easily constructed for the GP case. Each GP sample path is a finite but nonstandard dimensional vector of values g[i]. The probability associated with any given vector is normal_pdf(g,Cov) where Cov is the covariance matrix, an NxN matrix for N nonstandard. The resulting probability over a given g is infinitesimal, but not zero.

          One appealing reason to use this formulation is that it corresponds in a very clear way to what is actually done in applied problems. Sure, there are still subtleties, but generally I find them more tractable.

        • From ojm:

          > To add to this, a predictive distribution is basically a convex combination/mixture distribution
          > formed from the individual models p(y|theta). If this family is not closed under convex combinations
          > then the result is something different to any individual instance.

          It’s rarely closed. For example, if the likelihood is binomial and the prior is beta, then the posterior predictive distribution is a beta-binomial, not a simple binomial. This is good. Any single binomial would underestimate the predictive uncertainty inherent in estimating the binomial chance-of-success parameter.
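          As a small simulation sketch of that beta/binomial point (made-up counts, Beta(1, 1) prior): the posterior predictive mixes binomials over the beta posterior and is wider than any single plug-in binomial.

# Posterior predictive (beta-binomial) vs. a binomial with theta fixed at a point estimate.
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 6                                  # made-up data: 6 successes in 20 trials
a_post, b_post = 1 + k, 1 + (n - k)           # Beta(1, 1) prior -> Beta posterior

theta = rng.beta(a_post, b_post, size=200_000)         # marginalise over the posterior
y_pred = rng.binomial(n, theta)                        # beta-binomial predictive draws

theta_hat = a_post / (a_post + b_post)                 # plug-in posterior mean
y_plugin = rng.binomial(n, theta_hat, size=200_000)    # single-binomial predictive

print("posterior predictive sd:", y_pred.std())        # larger
print("plug-in binomial sd:    ", y_plugin.std())      # smaller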

  5. I agree it’s rarely closed. (And it was Daniel L not me saying we should use a single parameter value for prediction.)

    Here’s a perhaps simplistic example.

    Suppose no individual model can fit the data well but a convex combination of models can. Do you care? If yes, then you care about parameter inference; if no, then you care about prediction.

    • Hopefully I clarified a bit. It’s really not that we should necessarily use just a single parameter, more that when we need to make a single prediction with consequences we should choose a prediction using those consequences to make the choice

    • ojm, if you’re willing to humor me, let’s make things more concrete (maybe just for my sake). I think there’s a big difference between marginalizing over parameter uncertainty, given a specific structure, and structural uncertainty in models. For instance, say we have P(y|theta), with theta a vector of parameters. Let’s make things super simple and say theta vector has one component, beta_0, and we are estimating mean of a normal basically (or intercept-only linear model). We use Bayesian machinery and get posterior predictive distribution, etc. Now say we change the model structure by introducing a slope. Our parameter vector has changed- theta now has two components. Averaging over
      P(y|theta_1) and P(y|theta_2) seems to me a much different proposition than integrating over P(y|theta_1)*P(theta_1)*d_theta_1.
      When you say,
      > To add to this, a predictive distribution is basically a convex combination/mixture distribution
      > formed from the individual models p(y|theta). If this family is not closed under convex combinations
      > then the result is something different to any individual instance.

      which of these scenarios do you have in mind? I also understand we could have corner cases, like where one model is just a specific instance of another (say a hierarchical model for the components of theta or something).

      • Chris, we’ll have to see what ojm has to say, but one thing I was thinking about when ojm said that was something like ODE models or dynamic models in general. The behavior of nonlinear ODE models can be wildly varying under different parameter values. For example one set of parameters might involve a stiff unstable aperiodic oscillation between several values… another might involve a smooth decay to zero, yet another might involve a slow drift toward zero with unstable shocks…

        averaging over a posterior distribution that includes all these possibilities is saying essentially that you have no frikin idea what is going on… the behavior can be sensitive to aspects of the prior you aren’t able to really calibrate. like for example, the size of the basin of attraction to different behaviors in the parameter space matters, but although you might know that the different behaviors exist, the prior probability of each behavior is a combination like

        pbar(s) size(s)

        where pbar is the mean probability density of the prior over the set s and size(s) is Lebesgue measure of the basin of attraction to the behavior described by the set s. The set s itself can be uncomputable, it’s impossible to specify a meaningful prior over the behaviors without detailed knowledge of the complex attractor sets in the parameter space. Think of something like the Mandelbrot set, or the Newton’s Method fractal: https://en.wikipedia.org/wiki/Newton_fractal how would you specify a prior that excludes the green colored region entirely?

        There are some hacks you can do, but in the end, real world problems are much more complex than that Newton’s fractal, so good luck. Any actual calculation you do can’t really compute the basin of attraction to a behavior you’d like to exclude, and give it zero weight. So any actual calculation you do is a combination of things you know are wrong and things you know could be right. If you try to do decision analysis and you’re including some behaviors that you know are wrong… are you going to get a better result than if you say do a Bayesian analysis, then find a MAP estimate then fuzz that estimate a little into a finite sample, then calculate the behaviors in each of the sample cases, then throw out any samples that exhibit the wrong behavior, then do your decision analysis just based on this small fuzzy filtered MAP estimate?

        In some ways, this is the kind of thing I mean when I say you can “do some hacks” above, but although there’s some justification for the hack, keeping the underlying logic in mind is maybe more important than say doing the exact Bayesian calculation based on mechanical application of Bayesian inference to simple computable models.

        • Daniel, interesting points. I’ve thought about this some for ecological models where things like chaotic attractors might exist. My current preoccupation is with generating data to fit fundamentally distinct soil carbon turnover models. These models can have drastically different consequences for climate feedbacks and I don’t like the idea of averaging over model space for prediction. Not at all. What you are saying also might be involved if certain regions of parameter space, even holding parameter vector constant, can lead to qualitatively distinct outcomes..

        • If you think of a prior as a device to make formal predictor machinery behave appropriately… then you can think of specifying some approximate prior… running a Bayesian machinery to get posterior samples… and then post-filtering the Bayes samples to remove things you know are “wrong” and should have been removed from consideration in a fully-accurate prior as a correction step to the approximate prior.

          The bigger issue is when the behavior isn’t necessarily wrong, but it shouldn’t have such a large effect. Suppose 60% of your samples have some oscillatory behavior. you know the oscillatory behavior could occur, but it’s also known to be an unusual case… certainly not 60% of the posterior samples. How do you post-process your sample appropriately? How do you justify that? If your audience has been spending a decade in nuanced discussions of the philosophical meaning of Bayesian inference on Andrew’s blog, you’re all set. So, basically a thousand people in the whole world will understand ;-)

        • I think this technique has a similar flavor:

          http://models.street-artists.org/2016/08/23/on-incorporating-assertions-in-bayesian-models/

          We specify some simple region of parameter space like

          A ~ normal(0,100);

          telling us that the coefficients of the basis expansion aren’t “too big” but then beyond that, we need to specify the prior in terms of its effect on the function, so for each sample we calculate some qualities we want our function to have approximately, and then provide a weighting function over those qualities, which down-weights those parameter values that produce functions that aren’t “right” according to our knowledge of what “right” means.

          If you can’t make this work within the model from a computational perspective (perhaps it’s too costly in computing time to calculate the thing you need to calculate), it’s still legit to take the samples you generate and filter them based on the same knowledge that would have been in the model if it weren’t so damn expensive to calculate. Or the like.
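          A rough sketch of that post-filtering step (my own illustration, with a stand-in behaviour check rather than a real forward model):

# Post-filter posterior draws: keep only draws whose implied behaviour passes a
# qualitative check that was too expensive (or awkward) to encode in the prior.
import numpy as np

rng = np.random.default_rng(3)
posterior_draws = rng.normal(0.0, 1.0, size=(4000, 3))   # stand-in for real posterior draws

def behaviour_ok(theta):
    # Stand-in check; in practice this might run the forward model (e.g. an ODE)
    # for this draw and test for an unwanted qualitative regime.
    return np.abs(theta).max() < 2.5

keep = np.array([behaviour_ok(th) for th in posterior_draws])
filtered = posterior_draws[keep]
print(f"kept {keep.mean():.1%} of draws after the qualitative filter")
# Downstream summaries or decision analysis then use `filtered` instead of the
# raw draws -- effectively a correction to the approximate prior.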

        • > My current preoccupation is with generating data to fit fundamentally distinct soil carbon turnover models. These models can have drastically different consequences for climate feedbacks and I don’t like the idea of averaging over model space for prediction. Not at all.

          Yup that’s definitely at least one of my concerns.

          And yeah I think it is perfectly justifiable to say you don’t want to average over qualitatively different model instances (in particular) eg a nonlinear ODE system with different dynamical regimes.

          Or perhaps, for _some_ (but not necessarily all!) nonlinear models where ave(f(theta)) doesn’t equal f(ave(theta)) or even any f(theta’) for any theta’ in your parameter space.

          On the other hand, there are some in machine learning who would presumably be fine averaging or weighting or whatever multiple quite different models as long as it improved a particular predictive performance measure.

          Another thing I was getting at is that you might decide to use a predictive check to check the adequacy of your model family. But you could have each individual model instance be inadequate while the convex combination (predictive distribution) looks fine. That is, you have ‘model misspecification’ in the sense that no individual model instance is close to what you want but the predictive check actually helps _hide_ this.

          Which again – if you only care about pure predictive performance then who cares. But maybe you care about more than that. What is it you are caring about in such cases? And if you only care about predictive performance, regardless of individual model misspecification, then maybe Bayes isn’t best? I think the literature is pretty mixed on this.

          But yeah, plenty of things to discuss, probably some other time.

        • “These models can have drastically different consequences for climate feedbacks and I don’t like the idea of averaging over model space for prediction. Not at all.”

          For pure prediction (and communication of same) it’s reasonable to partition the set of predictions into relevant classes and do prediction conditional on the future state being in one of those classes.
