Personally my workflow these days is more akin to classical inverse problems theory:

– does a decent solution exist?

– is it unique or are there many equally good solutions etc?

– are my results stable wrt small perturbations?

First part is pretty much just point estimation (would always do ‘predictive’ ie data space checked tho!).

Second step is identifiability analysis and lies outside of Bayes imo.

Final step is where Bayes enters to me, but I increasingly care about the first two steps and happy to just eg bootstrap this last one
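A minimal sketch of what "just bootstrap this last one" could look like: get a point estimate, then re-apply the same estimator to resampled data to see how stable it is under small perturbations. The data values, the choice of the mean as estimator, and the helper name `bootstrap_estimates` are all assumptions for illustration only.

```python
import random
import statistics


def bootstrap_estimates(data, estimator, n_boot=1000, seed=0):
    """Re-apply `estimator` to datasets resampled with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]


# Hypothetical observations, made up for this sketch.
data = [2.1, 1.9, 2.4, 2.2, 1.8, 2.0, 2.6, 2.3, 1.7, 2.2]

# Step 1: point estimate.
point_estimate = statistics.mean(data)

# Step 3: stability check via a rough 95% percentile interval.
boot = sorted(bootstrap_estimates(data, statistics.mean))
lower, upper = boot[25], boot[974]

print(point_estimate, lower, upper)
```

The identifiability step in the middle is deliberately not covered here; a narrow bootstrap interval says nothing about whether other, equally good solutions exist.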

]]>I’m trying to be more honest about this in current and future work and explore new approaches, but it’s a process (first step is to acknowledge there’s a problem I guess). Luckily there are many other applied examples from other folk if you’re willing to look!

]]>Fair enough if you don’t want to consider other approaches, but many do. One reason I do is that I notice that no one really believes their probability results or forecasts coz of things like M-open issues or lack of identifiability etc.

Instead of fighting it I take it for what it seems to indicate about how we reason under uncertainty.

]]>But, I don’t take those foundations as truly foundational for my purposes. I think the Cox lineage makes more sense. The big problem with Cox, that you’ve pointed out, is that it’s a calculation about an underlying “truth” and when there is no underlying truth… you need a new interpretation if you are to keep Bayes.

My point in working on my interpretation is that what Bayesian probability measures is basically compatibility with assumptions. I call it accordance in my paper. The idea is that as a scientist (not an engineer who just wants predictive ability) I want to collect data and find out what the data implies about my theory, specifically, which subsets of the parameter space make my predictions “work best” according to the theory formalized in my likelihood function, as well as the theory formalized in my base data-free theory (my prior). I see Bayes as a kind of microscope that lets me mathematically look inside my theory and find out which sub-theories are viable in the presence of observation.

From that perspective, the ruling out of certain regions of parameter space needs to move probability to the remaining space, and I think this is where the sum-to-one comes in. If you don’t have additivity, you don’t have sums and if you don’t have sums you don’t have sum-to-one aka conservation of probability.

It’s also possible to do so with multiple theories using mixture models, but again, I’m looking inside a *particular given overall mixture theory* and trying to find the viable sub-theories, things not ruled out by what I expect from my predictions. That ruling out some regions pushes probability onto the remaining regions is a feature, not a bug.

]]>So long as I’m willing to “go back to the drawing board” and add in additional models, I always want some measure of *which of these models / parameters best accords with my theory and data*. And if a certain region of parameter space accords well… I want the remaining regions to accord badly particularly due to the concept that accordance(RegionA) + accordance(RegionNotA) = 1 *within the restricted analysis*

I always think it’s important to consider if your analysis is too limited, but I think it needs to be a meta consideration until you can specify a sufficiently formalized model to be inserted into the analysis.

If you can show me a whole bunch of applications where non-additivity makes sense I would be willing to reconsider it… but I think you’ll find that hard, because in practice thinking about the “outside model space” in the absence of a specific computable model means you can’t really analyze. I can’t do: quantum mechanics… or “something else” I can only do quantum mechanics vs strangeTheoryA which computes different results.

]]>ftp://ftp.math.ethz.ch/sfs/Research-Reports/146.html

(Hampel and Huber were two of the key founders of robust stats, and both considered theories with non-additive components.)

]]>Note that additivity is not the same as normalisation. Also one can instead use eg a maxitive measure as in my most recent link. So additivity is not the same as a measure in general.
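To make the additive versus maxitive distinction concrete, here is a small sketch. The three model names and all the numbers are made up; the only thing assumed from possibility theory is that the measure of a set is the maximum over its members rather than the sum.

```python
# Hypothetical scores over three candidate models (illustrative values only).
possibility = {"M1": 1.0, "M2": 1.0, "M3": 0.3}     # maxitive: the max must be 1
probability = {"M1": 0.45, "M2": 0.45, "M3": 0.10}  # additive: the sum must be 1


def poss(event):
    """Possibility of a set of models: the max over its members."""
    return max((possibility[m] for m in event), default=0.0)


def prob(event):
    """Probability of a set of models: the sum over its members."""
    return sum(probability[m] for m in event)


A = {"M1"}
not_A = set(possibility) - A

# Under the maxitive measure, A and its complement can both be fully possible.
print(poss(A), poss(not_A))

# Under the additive measure, support for A is taken away from its complement.
print(prob(A), prob(not_A))
```

Both measures here are normalised (to a max of 1 and a sum of 1 respectively), which is the sense in which normalisation and additivity are separate assumptions.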

There are many ways to represent uncertainty. Additive measures just don’t seem right to me for some circumstances and I haven’t seen a good argument for why I should impose it in general. It is an assumption!

Hampel even suggests that the original conception of probability was non-additive – google ‘non-additive probability Hampel’ or similar. Huber speculated whether statistics should instead be founded on Choquet capacities.

The point is not that additivity is not sometimes correct, but that many have in fact questioned it, particularly in cases of ‘extreme’ or ‘large’ uncertainty.

]]>> temporarily why we would want to adopt additivity. Why not say that multiple models even of those only considered so far could be equally good?

Of course, multiple models can be equally good: if I have 3 models and a posterior weight on them of 1/3 , 1/3 , 1/3 they’re all equally good yes?

Since every finite measure is isomorphic with a probability measure, are you asking why choose to normalize our measures to 1? I think the reason is that our measure is a dimensionless ratio and there is no “absolute” scale that we can use to measure “how good” a model is which can be easily standardized among different people.

You could choose to “normalize” your measure so that the peak density is always 1, or always 100, or you could do other things. But as long as the measure is finite, you can always rescale it to be a probability measure (integrates to 1). So choice of integrating to 1 is an arbitrary choice I admit, but it has good properties.
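A quick sketch of the rescaling point, with made-up numbers: two people whose unnormalised measures differ only by a constant multiplier end up with identical results once both rescale to sum to one. The function names are hypothetical, and peak normalisation is included only as the alternative convention mentioned above.

```python
import math

# Unnormalised "goodness" of three hypothetical models (illustrative values).
unnormalised = [0.002, 0.006, 0.012]


def normalise_to_one(weights):
    """Rescale a finite measure so that it sums to 1."""
    total = math.fsum(weights)
    return [w / total for w in weights]


def normalise_peak(weights, peak=1.0):
    """Alternative convention: rescale so the largest value equals `peak`."""
    top = max(weights)
    return [peak * w / top for w in weights]


# Same calculation, but the second analyst's measure is 1000 times larger.
analyst_a = normalise_to_one(unnormalised)
analyst_b = normalise_to_one([1000.0 * w for w in unnormalised])

print(analyst_a)
print(analyst_b)
print(normalise_peak(unnormalised))
```

Either convention removes the arbitrary constant; the ratios between models are the same in both.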

If you’re asking “why use a measure at all?” then I think the answer is that you’re starting from the assumption that you’re going to *measure* the degree to which a certain subset of the possibilities accord with your theory. If you want to do that, you need a measure.

If you *don’t* want to do that, then I agree Bayes isn’t for you.

]]>A big issue I have with these results is that they require identifiability (and/or a non-singular Fisher information matrix). You basically build in uniqueness and then derive asymptotic normality.

Which is fine in a toy setting, but imo very far from an ‘open world’ in which there might be no solution, or many equally good ones.

In such cases the asymptotics is not such that ‘as more and more data arrive, the posterior distribution of the parameter vector approaches multivariate normal’.

To me ‘open world’ modelling should most definitely not assume identifiability (or e.g. a non-singular Fisher information matrix).
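A toy illustration of what a singular Fisher information matrix looks like, assuming Gaussian noise with known standard deviation. In the hypothetical model y = (a + b) * x + noise only the sum a + b is identifiable, so the information matrix for (a, b) has determinant zero. The ordinary line y = a + b * x is shown for contrast. The design points are made up.

```python
def fisher_sum_model(xs, sigma=1.0):
    """Fisher information for (a, b) in y = (a + b) * x + N(0, sigma^2) noise."""
    s = sum(x * x for x in xs) / sigma**2
    # Both partial derivatives of the mean equal x, so every entry is the same.
    return [[s, s], [s, s]]


def fisher_line_model(xs, sigma=1.0):
    """Fisher information for (a, b) in y = a + b * x + N(0, sigma^2) noise."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return [[n / sigma**2, sx / sigma**2], [sx / sigma**2, sxx / sigma**2]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


xs = [0.5, 1.0, 1.5, 2.0]  # hypothetical design points

print(det2(fisher_sum_model(xs)))   # zero: a and b cannot be separated
print(det2(fisher_line_model(xs)))  # positive: both parameters identifiable
```

In the first model no amount of extra data at any x makes the posterior over (a, b) concentrate on a single point, which is the situation the usual asymptotic normality results exclude by assumption.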

]]>It’s not clear to me, even when we ‘close’ the world temporarily why we would want to adopt additivity. Why not say that multiple models even of those only considered so far could be equally good? Why should the support of a model be a function of the support or not of the other models, rather than of more direct ‘positive’ evidence?

Similarly, see the link I posted above – psychologically I see no reason why you might want to give both a model and its contradictory equal support. While flat probabilistic priors have many issues for representing ignorance, in terms of how they transform under changes of variables, flat possibility distributions remain flat over their support. See e.g. https://arxiv.org/abs/1801.04369 which I wrote for fun.
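A small sketch of the change-of-variables point, assuming a flat distribution on (0, 1) and the transformation Y = X**2 (both chosen purely for illustration). The probability density picks up a Jacobian factor and stops being flat; the possibility distribution is carried over pointwise and stays flat.

```python
import math


def density_x(x):
    """Flat probability density on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def possibility_x(x):
    """Flat possibility distribution on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def density_y(y):
    """Density of Y = X**2: the change of variables brings in a Jacobian."""
    x = math.sqrt(y)
    return density_x(x) / (2.0 * x)


def possibility_y(y):
    """Possibility of Y = X**2: sup over preimages, so no Jacobian appears."""
    return possibility_x(math.sqrt(y))


ys = [0.01, 0.25, 0.81]
print([density_y(y) for y in ys])      # no longer flat
print([possibility_y(y) for y in ys])  # still flat
```

For these three points the density works out to roughly 5, 1 and 0.56, while the possibility is 1 throughout.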

]]>I would add that even for (informal) forecasting I find myself often defaulting to more minimax style reasoning than average case reasoning and I think it’s a bit of a shame that both don’t seem to be emphasised together as complementary as much anymore.

I actually accidentally took a post grad decision theory course back in the day – before it was fashionable again! – where we did decision making under various types of uncertainty including both probabilistic and ‘extreme’. I often found myself more sympathetic to the more non-probabilistic (extreme uncertainty) and/or more qualitative methods. I’ve never been much of a fan of expected utility ideas.

]]>Interesting discussions here.

Now in one of my comments http://andrewgelman.com/2018/06/05/comments-limitations-bayesian-leave-one-cross-validation-model-selection/#comment-759090 perhaps I should have pointed to these ideas to clarify where I believe Bayes _should be_ used in the ongoing process – just in the quantitative inference stage. The first and third stages are open worlds.

“speculative inference -> quantitative inference -> evaluative inference or

abduction -> deduction -> induction -> or

First -> Second -> Third

Over and over again, endlessly.”

If that model pops out of the analysis with a high posterior probability, it indicates that none of the “real” models fit the expected precision, and that only the proxy model can predict with the precision I expect from my real models. That indicates a failure of the real models, and I can then go looking for a better real model.

Understanding that idea in terms of underlying “truth” values etc… makes no sense, but understanding that idea in terms of accordance with predictive expectations does make sense, and so it’s one way I think my rethinking of Bayesian foundations helps me understand how modeling should work.

]]>ojm, thanks for sharing your thoughts. I think we actually agree to a large extent. I see the above as a key part of the *scientific process*, and agree that application of probability theory is deploying a tool that has some sharp limits. My goal *as a scientist* is to figure out where hypotheses/models/theories make maximally divergent predictions, and set up experimental tests to discern between them (of course, other criteria also play a role, but want to be brief here :)). But, as a forecaster/predictor-of-things, given data y on system z, and assuming that I don’t have a clean test between models M1, M2, M3, …, Mn, I think we’d want to leverage probability modeling. Or at least, stacking predictive densities with a set of weights summing to one to avoid silliness. My $0.02.
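A rough sketch of "stacking predictive densities with a set of weights summing to one". It assumes you already have, for each held-out data point, the predictive density under each model; the numbers in `dens` are made up. The weights are chosen to maximise the log score of the weighted mixture. The fixed-point update used here is just one simple way to do that optimisation (it is the EM update for mixture proportions with fixed components) and is not taken from any particular stacking implementation.

```python
import math


def log_score(dens, w):
    """Sum over data points of the log of the weighted predictive density."""
    return sum(
        math.log(sum(wj * pj for wj, pj in zip(w, row))) for row in dens
    )


def stacking_weights(dens, n_iter=500):
    """Weights on the simplex that (approximately) maximise the log score."""
    n, k = len(dens), len(dens[0])
    w = [1.0 / k] * k
    for _ in range(n_iter):
        new_w = [0.0] * k
        for row in dens:
            mix = sum(wj * pj for wj, pj in zip(w, row))
            for j in range(k):
                # Share of this point's mixture density owed to model j.
                new_w[j] += w[j] * row[j] / mix
        w = [v / n for v in new_w]  # the shares sum to n, so w sums to 1
    return w


# Hypothetical held-out predictive densities: rows are points, columns M1..M3.
dens = [
    [0.30, 0.10, 0.05],
    [0.25, 0.20, 0.05],
    [0.05, 0.40, 0.10],
    [0.10, 0.35, 0.05],
    [0.20, 0.15, 0.30],
]

weights = stacking_weights(dens)
print(weights, log_score(dens, weights))
```

Note the weights are scored on predictive performance, not read as posterior probabilities that any one model is true.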

]]>What has to be remembered though is that each analysis only holds as far as you accept the assumptions. There is nothing philosophically wrong with expanding or changing your model space subset and re-analyzing in that context. There is if you think of Bayes as modeling *your actual belief* but if you think of it as modeling hypothetical compatibility of a model set with data from the world… Re-analyzing with a different model set is just asking a different hypothetical. Comparing *across these two analyses* will not work, but this doesn’t bother me. If I want to compare across two model subsets, I need an analysis in which the union of both model sets is included.

]]>It seems like your preference for systems that work in the M-open problem is that there is some model “out there” which you don’t yet know, *and you want to discover it using some mathematical tool or technique*

The problem with that of course is that it’s a strongly non-unique, noncomputable problem. Suppose you come along with me and limit your model universe to strings in a formal computing language, modulo purely formal changes to the structure of the program (such as renaming variables, or lifting local functions to global scope or the like).

Now, there are still nonformal changes to the program which keep its scientific content intact, for example instead of pow(x,1/2) we could do exp(1/2*log(x)) these call different functions but have the same result in this special case… Similarly we could imagine say series representation of functions, or trigonometric identities, etc. They’re not formal, but they are provably identical.
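As a concrete check of the pow versus exp/log example, here are two textually different programs that agree for positive inputs (up to floating-point rounding). The function names are made up for the sketch.

```python
import math


def root_via_pow(x):
    """Square root written as a power."""
    return pow(x, 1 / 2)


def root_via_exp_log(x):
    """The same quantity via exp and log; only defined for x > 0."""
    return math.exp(1 / 2 * math.log(x))


inputs = (0.25, 2.0, 9.0, 1e6)
agree = all(
    math.isclose(root_via_pow(x), root_via_exp_log(x)) for x in inputs
)
print(agree)
```

The equivalence is mathematical rather than syntactic: no renaming of variables turns one function body into the other, and at x = 0 the second version raises an error while the first returns 0.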

Next there’s the strongly difficult problem of termination. If you allow pretty much any string in the formal language, some of them will loop infinitely. So when you do something outside probability theory, you’re left with limiting yourself to a “small” set of provably terminating models *anyway* or your calculation won’t terminate. Machine learning techniques tend to use what you might call “universal approximators” to certain subsets of functions, like neural networks and so forth, but they’re still very limited compared to say the full lambda calculus.

It’s possible to use the same set of universal approximators within Bayes as well, though computationally difficult, because instead of finding say *one point* in the space of possibilities that does a “pretty good” job, you have to find a random sample of points in the posterior distribution that all do a reasonably good job, thereby quantifying the uncertainty in model space.

In some sense every problem is m-open, precisely because we know that the “real” model is outside the scope of our necessarily reduced search space, even if we consider the search space to be say the finite strings of lambda calculus with length less than 10^300 symbols (I’d argue that you could simulate the universe exactly to infinitesimal accuracy if you had some “bigger” computer capable of executing such large strings and knew which string to choose ;-) there are thought to be something like 10^80 atoms in the universe)

So, what we *always* want is some projection of the real model space onto our restricted model space that does “a good job” as measured in some way. How do we measure this? For Bayes, it’s the likelihood which describes what we expect our model to be able to do. It won’t give us our measured data exactly, but it should give us “close to” our measured data, in the sense of making the likelihood be largeish…. but there is no sensible way to describe “largeish” on an absolute universally accepted scale. Needing to define for ourselves a scale that can allow us to compare between two different people carrying out the same calculation with a likelihood differing by purely a constant multiplier leads us to probability theory.

Now, we often take problems and make them *much more restricted* than what I’ll call the M-practically-pseudo-closed problem of say “all the neural networks less than a gajillion neurons + nonrecursive functions less than 10^9 symbols” In fact I often work with just some very limited functions like “radial basis functions with less than 100 centers” and “nonlinear regression functions with 8 parameters in a particular family” or whatever. We’re still basically doing a *computational shortcut* to what amounts to the M-pseudo-closed calculation we could be doing if we had a trillion times faster machines.

The fact is, we use prior knowledge to simply set a-priori probability on essentially all of those other models to zero. This is a computational shortcut, not a philosophical principle.

By the same route, random forests and deep learning and boosted foo-bars or whatever are really just computational shortcuts which limit the model search space, *and* don’t try to quantify the uncertainty in the posterior. A double-edged kind of computational shortcut.

Principled refusal to admit the prior doesn’t really have the principled flavor it seems to have when you let the model be a lambda-calculus string and realize that by excluding almost every conceivable model as you must to get anything done, there is effectively a prior distribution over models that is being imposed, it’s only priors over the remaining parameters in the nearly-infinitely-restricted remaining domain that are being “left out”.

]]>Fair question.

I’d say it depends on the goals. There are certainly cases where I’d just want to give an averaged prediction, in which case probability seems fine.

Other cases that might occur: worst case/minimax etc. This to me is closest to possibility reasoning rather than probability reasoning. We want to bound possible behaviour under greater levels of uncertainty. How should you behave given much less info than nature’s full probability distribution?
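A tiny made-up example of how the two styles of reasoning can disagree. The loss table, the model weights and the action names are all hypothetical; the point is only that the average-case choice needs the weights, while the worst-case choice needs nothing beyond the set of models considered possible.

```python
# Loss of each action under each candidate model (illustrative numbers).
losses = {
    "act_a": {"M1": 1.0, "M2": 2.0, "M3": 9.0},
    "act_b": {"M1": 4.0, "M2": 4.0, "M3": 5.0},
}

# Hypothetical probability weights on the models, summing to one.
weights = {"M1": 0.5, "M2": 0.4, "M3": 0.1}


def expected_loss(action):
    """Average-case criterion: weight each model's loss by its probability."""
    return sum(weights[m] * loss for m, loss in losses[action].items())


def worst_case_loss(action):
    """Minimax criterion: only the worst model for this action matters."""
    return max(losses[action].values())


average_case_action = min(losses, key=expected_loss)
minimax_action = min(losses, key=worst_case_loss)

print(average_case_action, minimax_action)
```

Here the average-case rule picks `act_a` (expected loss 2.2 against 4.1) while the minimax rule picks `act_b` (worst case 5 against 9).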

This is also related to work on ‘robust’ stats etc – it tends to be built on minimax ideas. I tend to think a lot of mathematical reasoning in general is built on inequality/minimax/possibility style reasoning, though I think Andrew once called it ‘ugly 60s style stuff’ or something. To me it’s a style of thinking that grew on me the more math I learned.

Similarly you might want to determine when the models give sufficiently qualitatively different behaviours such that they become distinguishable (eg you want to learn about which is closest to the ‘underlying mechanism’ rather than just predict well on average or in the worst case). Bifurcation theory and the like. I don’t think this is well captured by probability style thinking.

This sort of ‘bounding’ or ‘partitioning’ style reasoning is to me more qualitative and non-probabilistic. Probability really seems to me to be a more fine-grained ‘known unknowns’ sort of reasoning.

Which is to say – many different approaches are useful. You want an averaged prediction of the future given a handful of models, it seems OK. But there are other things you might want and I think quantitative probability theory is a much more limited reasoning tool in general than often advertised.

]]>In Appendix B we explicitly work in an M-open framework. Here’s what we write:

The key assumption for the results presented here is that data are independent and identically distributed: we label the data as y = (y_1, . . . , y_n), with probability density \prod_{i=1}^{n} f(y_i). We use the notation f(·) for the true distribution of the data, in contrast to p(·|θ), the distribution of our probability model. The data y may be discrete or continuous.

In Appendix B, the density “f” is not assumed to be in the set of p. That is, we’re working in an M-open framework.

]]>ojm, I’m curious what you would do in practice. You collect or have access to dataset y, and have 3 ‘competing’ models M1, M2, M3, whose likelihoods differ in some structural respect so that they cannot simply be considered special cases of each other. Now, these structural differences also happen to embody different hypotheses about how the world works in some fundamental respect. Although I grant that each model is of course, like all models, an idealization/approximate/whatever. In fact, it is even possible (if not likely) that the mechanisms embodied in each model are all plausibly defensible descriptions of the world, simply incomplete.

Your goal: forecast the future state of some system z, from which data y have been collected, and about which M1, M2, and M3 are all, a priori, potentially in play. Do you combine models in some way or simply select one and condition all forecasts on that selection? If you combine, don’t you want countable additivity and a sum to one measure?

Note, I am not trying to be cute or deliver some kind of “gotcha” here- I think this is fascinating philosophy of science stuff!

]]>But the basic idea had been noted over and over again when thinking about uncertainty, including by eg Fisher, Hampel, Huber etc in statistics.

]]>The asymptotics section – putting aside ‘not-too-onerous regularity conditions’ that I think are actually frequently violated in reality – basically says things can easily be arbitrarily bad, right?

A subtle point that I don’t think Bayesians ever really address is not just normalisation but additivity: once you accept all models are approximations then not only is there not one true model in the support but there is no reason I can see to say that the negation (or even the complement within the set of models considered) of a good approximate model is a bad approximate model etc.

So it’s not just normalisation that seems suspect to me (which is perhaps more minor) but also the basic algebraic structure of probability theory. Or at least I haven’t ever really seen a convincing reason why I should want to use an additive measure over a set of approximate models. This is an implicit and/or explicit motivation for many non-Bayesian approaches.

]]>The basic asymptotic theory of M-open Bayes is described in appendix B of BDA. See also section 3 of this paper with Shalizi from 2012, which in turn is based on various ideas of mine from 1993 or so. Sample sentence: “This is not a small technical problem to be handled by adding a special value of theta, say ∞ standing for ‘none of the above’; even if one could calculate p(y|∞), the likelihood of the data under this catch-all hypothesis, this in general would not lead to just a small correction to the posterior, but rather would have substantial effects.” See also section 4.3 of that article. Actually, read the whole damn thing!

]]>I’d say that non-terminating programs are simply non-scientific. It’s no use if your predictions take longer than the end of the universe to make. I realize that it’s impossible to compute whether an arbitrary piece of code terminates, but it is possible to reason about some code (ie. code that has finite loops, or what’s equivalent, code that recurses monotonically towards a base case, etc)

So for the most part, we need to deal with some kind of “finite terminating computable” models, and furthermore, I think most people are ok with saying that a thing is only science if an actual person or reasonable size group of people can actually write down the code in a lifetime. So realistically, we’re always in an M-closed setting: finite, not too big to actually write down sentences that are designed to terminate.

Dealing with situations where we have unobtainable models to consider is not pragmatic… for example say models involving newtonian mechanics with initial conditions on 10^80 particles where just writing down the initial conditions for the model would take longer than the human race will exist…

So, I think taking a concrete approach like this is useful. In this concrete approach, Bayes makes good sense. Whether it’s “optimal” is more or less not something I think is answerable (at least on its own without describing the objective function you’re optimizing), but I think it’s easy to point out what Bayes does right, where *every* other approach in wide use today that I’m aware of falls short in some way.

]]>This does mean limitations of different approaches should be acknowledged, including Bayes.

Really I was just trying to get some clarification on how Bayes is formalised in an M-open setting and how things I see as limitations of Bayes in this context might be addressed. Eg does it make sense to use formally additive (and normalised) measures of uncertainty in this context?

See my comment much earlier on about stacking and predictive distributions.

]]>No statistical method is uniquely better than everything else. I suppose there are some methods that are really terrible, but there are a lot of methods on the efficient frontier, considering all possible problems that might be studied.

]]>I thought that was the whole point?

> In the small world, the Bayesian mechanics work just fine, and there are a ton of philosophical and pragmatic arguments to use them for learning.

I’ve probably looked at more of these arguments than your average person and I find them all problematic in some way or another, and for both philosophical and pragmatic reasons.

I’m not gonna stop anyone from using Bayes, and I’m happy to use it myself sometimes, but I find the arguments for it, over other approaches, unconvincing.

Are the goalposts:

– Bayes is one OK approach or

– Bayes is best

– How to formalise M-open and M-closed tools or

– Everything is an approximation/everyone makes assumptions of some sort?

I’m ok to say

– Bayes is one OK approach, with flaws like any other.

– I find the formalisation of Bayes in what is referred to as the M-open setting unclear

But I don’t see any good arguments that Bayes is uniquely better than anything else. Some of its assumptions strike me as particularly awkward when trying to formalise things in the formal M-open setting.

]]>By not starting from an additive, sum to one measure of support?

]]>So in this sense, you can’t have your prior be f(x)=1 for x in the real numbers (not normalizable). But you can have your prior be 1/(2N) on the nonstandard interval [-N,N] which can be normalized in nonstandard analysis, but there’s no standardization (no standard probability distribution that is “infinitesimally close” to this one).

As long as your prior is nonstandard normalizable, operations with data will keep the posterior nonstandard normalizable… If on the other hand, there’s no *standardization* you could interpret that to mean that you don’t yet have any finite quantity of information. So, if you started with a nonstandard prior, collected some data, and still couldn’t form a standard distribution… basically the data wasn’t informative.

]]>So, for inference from fixed data, we can actually determine the “good” region of the theoretical quantities (parameters) even if this function isn’t normalizable with respect to the data.

But, if we want our model to have predictive power, then predicted data is itself a *theoretical quantity* and the distribution over all theoretical quantities must be normalizable or it will be impossible to come to a calculation that can be shared between people (it would always be possible to multiply your result by a constant and claim to have “Better” agreement than someone else who had the same calculation but a smaller constant)…

So, it’s a corner case, where if you only want to go from fixed observed data to a distribution over parameter space, you can technically relax the requirements of the likelihood, but as soon as you require that the model be able to predict unobserved “future” or “alternative conditions” data, your model has more stringent requirements and the likelihood has to be normalizable with respect to the data.

]]>I meant to restrict to proper priors – does your characterization restrict to proper priors?

]]>With those choices, we get probability measure

That set of assumptions leads

]]>Let’s start with the assumption that you don’t have the “true” model in front of you. What’s going to happen as you collect more data is that you’ll be able to fit better and better models. Any “best model” notion is going to depend on data size and quality. As an example, consider building a language model (something that predicts the next word in a conversation or text given the previous words). How much context (number of previous words and their structure) can be exploited depends on the size of the data. That’s why tasks like speech recognition or spelling correction benefit so much from big data: they have very long tails. The “true” model involves human cognition, attention span, context, world knowledge, etc., all of which is only crudely approximated in a simple language model.

]]>When I last gave a webinar on ABC I

1. Started with a simple rejection sampling method for observed discrete data (draw parameter from prior, draw data using that parameter value, only keep those draws that generated the exact same data and voilà, there’s the posterior).

2. Pointed out the kept percentages for each parameter had probability equal to the likelihood (probability of data given parameter value) and hence was a weighted average that we could calculate directly.

3. Reminded people what importance sampling was (draw from a convenient distribution and re-weight to your target distribution) and re-characterized Bayes as importance sampling from the prior to get the posterior. (Then I moved on to sequential importance sampling to show why MCMC was needed for real problems).
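The three steps above can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: a binomial model with 10 trials, 7 observed successes, and a uniform prior over a small grid of parameter values. Step 1 is the rejection sampler; steps 2 and 3 are the direct calculation, weighting each prior value by its likelihood.

```python
import math
import random
from collections import Counter

rng = random.Random(1)

thetas = [i / 10 for i in range(1, 10)]  # hypothetical prior support, uniform
n_trials, observed = 10, 7               # hypothetical discrete data


def simulate(theta):
    """Draw a success count from the model at parameter value theta."""
    return sum(rng.random() < theta for _ in range(n_trials))


# Step 1: keep only the prior draws that reproduce the observed data exactly.
kept = Counter()
for _ in range(50_000):
    theta = rng.choice(thetas)
    if simulate(theta) == observed:
        kept[theta] += 1
total_kept = sum(kept.values())
abc_posterior = {t: kept[t] / total_kept for t in thetas}


# Steps 2 and 3: the keep rate at each theta is its likelihood, so weight
# the prior by the likelihood directly instead of simulating.
def likelihood(theta):
    """Binomial probability of the observed count given theta."""
    return (
        math.comb(n_trials, observed)
        * theta**observed
        * (1 - theta) ** (n_trials - observed)
    )


raw = [likelihood(t) for t in thetas]  # the uniform prior cancels out
exact_posterior = {t: w / sum(raw) for t, w in zip(thetas, raw)}

for t in thetas:
    print(t, round(abc_posterior[t], 3), round(exact_posterior[t], 3))
```

The two columns should agree up to Monte Carlo error, which is the sense in which rejection sampling "is" Bayes Theorem for discrete data.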

You seem to be doing something similar using nonstandard probability – but it just _seems_ like yet another way to characterize implementing Bayes Theorem?

]]>Sounds right.

> how Bayes is actually supposed to be formalised in an M-open view.

How is anything to be formalized in an M-open view except by enclosing it in a bigger closed view?

Now within the “closed world” (a representation taken as representing itself) we know how to make re-representations without making the originating representation more wrong (truth preserving operations) – so why mess that up?

]]>But most of these other approaches already explicitly embrace the ‘M-open’ view and often justify their hackiness on this – better an approximate answer to the right (M-open) question than an exact answer to the wrong (M-closed) question etc.

It always seems to me like Bayes buys its ‘principles’ by paying in ‘closed worlds’. E.g. Jaynesian robots, rationality assumptions, settable bets etc etc.

Now we also have Bayesians saying that ‘of course we should deal with the M-open setting’ while keeping the ‘we should do it in a Bayesian way’.

But I’m honestly not really sure in what sense this works because Bayes foundations seem so closely tied to an M-closed view. What is the clearest statement of Bayes in an M-open setting, where things are built from the ground up (e.g. not just ‘let’s stack Bayesian predictive distributions’ but why I would want to start from Bayesian predictive distributions in the first place)?

I’m not just trolling on this – I struggle to see how Bayes is actually supposed to be formalised in an M-open view.

And again – I’m perfectly fine to be hacky, but now that I’ve embraced hackiness I struggle to see why I would want to be hacky in a Bayesian way.

]]>