
Representists versus Propertyists: RabbitDucks – being good for what?

It is not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability,etc.) for every sample, or some samples or the same distribution of outputs or the same expectations of outputs or just close enough expectations of outputs. Then, I would argue one has a variation on a DuckRabbit. In the DuckRabbit, the same sign represents different objects with different interpretations (what to make of it) whereas here we have differing signs (models) representing the same object (an inference of interest) with different interpretations (what to make of them). I will imaginatively call this a RabbitDuck ;-)

Does one always choose a Rabbit or a Duck, or sometimes one or the other, or always both? I would argue the higher road is both – that is, to use differing models to collect and consider the different interpretations. Multiple perspectives can always be more informative (if properly processed), increasing our hopes of finding out how things actually are by increasing the chances and rate of getting less wrong. Though this getting less wrong is in expectation only – it really is an uncertain world.

Of course, in statistics a good guess for the Rabbit interpretation would be Bayesian and for the Duck, Frequentest (Canadian spelling). Though, as one of Andrew’s colleagues once argued it is really modellers versus non modellers rather than Bayesians versus Frequentests and that makes a lot of sense to me. Representists are Rabbits “conjecturing, assessing, and adopting idealized representations of reality, predominantly using probability generating models for both parameters and data” while Propertyists are Ducks “primarily being about discerning procedures with good properties that are uniform over a wide range of possible underlying realities and restricting use, especially in science, to just those procedures” from here. Given that “idealized representations of reality” can only be indirectly checked (i.e. always remain possibly wrong) and “good properties” always beg the question “good for what?” (as well as only hold over a range of possible but largely unrepresented realities) – it should be a no-brainer that it would be more profitable than not to thoroughly think through both perspectives (and more, actually).

An alternative view might be Leo Breiman’s “two cultures” paper.

This issue of multiple perspectives also came up in Bob’s recent post where the possibility arose that some might think it taboo to mix Bayes and Frequentist perspectives.

Some case studies would be: 

Case study 1: The Bayesian inference completely solves the multiple comparisons problem post.

In this blog post, Andrew implements and contrasts both the Rabbit route and the Duck route to get uncertainty intervals (using simulation for ease of wide understanding). It turns out that the intervals will not be different under a flat prior, while increasingly different under increasingly informative priors. Now the Duck route guarantees a property that is considered to be important and good by many, and by some even mandatory (e.g. see here): “uniform confidence coverage”. The Rabbit route with a flat prior also happens to have this property (as it gives the same intervals). Perhaps to inform the “good for what?” question, Andrew additionally evaluates another property: making “claims with confidence” (type S and M error rates).

With respect to this property “claims with confidence”, the Duck route (and the Rabbit route with flat prior) does not do so well – horribly actually. Now, informed with these two perspectives, it seems almost obvious that if a prior centred at zero and not too wide (implying large and very large effects are unlikely) is a reasonable “idealized representation of reality” for the area one is working in, the Rabbit route will have good properties while the Duck route’s guaranteed “good property” ain’t so good for you. On the other hand, if effects of any size are all just as likely (which would be a strange universe to live in, perhaps not even possible) and you always keep in mind all the intervals you encounter, the Duck route will be fine.
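As a rough illustration of the contrast in case study 1, here is a minimal simulation sketch in Python. It is not Andrew's code: the prior scale tau = 0.5, the data standard deviation sigma = 1, the 1.96 multiplier and the function name compare_routes are assumptions picked for illustration. True effects are drawn from a normal centred at zero, one noisy estimate is observed for each, and both the flat-prior interval (which is also the Duck interval) and the informative-prior interval are scored on overall coverage and on what happens among intervals that exclude zero.

```python
import math
import random


def compare_routes(tau=0.5, sigma=1.0, n_sims=200_000, seed=1):
    """Simulate theta ~ N(0, tau), y ~ N(theta, sigma) and score two 95% intervals.

    tau, sigma, n_sims and seed are illustrative assumptions, not values
    taken from the original post.
    """
    rng = random.Random(seed)
    z = 1.96
    shrink = tau**2 / (tau**2 + sigma**2)  # posterior mean is shrink * y
    post_sd = sigma * math.sqrt(shrink)    # posterior sd under the N(0, tau) prior
    counts = {
        method: {"covered": 0, "claims": 0, "claims_covered": 0, "sign_errors": 0}
        for method in ("flat", "informative")
    }
    for _ in range(n_sims):
        theta = rng.gauss(0.0, tau)
        y = rng.gauss(theta, sigma)
        # (interval centre, interval sd) for each route
        routes = {"flat": (y, sigma), "informative": (shrink * y, post_sd)}
        for method, (center, sd) in routes.items():
            lo, hi = center - z * sd, center + z * sd
            covered = lo <= theta <= hi
            c = counts[method]
            c["covered"] += covered
            if lo > 0 or hi < 0:  # a "claim with confidence": interval excludes zero
                c["claims"] += 1
                c["claims_covered"] += covered
                c["sign_errors"] += (center > 0) != (theta > 0)  # type S error
    summary = {}
    for method, c in counts.items():
        claims = c["claims"]
        summary[method] = {
            "overall_coverage": c["covered"] / n_sims,
            "claim_rate": claims / n_sims,
            "claim_coverage": c["claims_covered"] / claims if claims else float("nan"),
            "type_s_rate": c["sign_errors"] / claims if claims else float("nan"),
        }
    return summary


if __name__ == "__main__":
    for method, row in compare_routes().items():
        print(method, {key: round(value, 3) for key, value in row.items()})
```

Under these assumed settings, both routes should cover about 95% of the time overall, but the flat-prior intervals that exclude zero cover the true effect noticeably less often, and the informative-prior route makes far fewer claims.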

Case study 2: The Bayesian Bootstrap

In the paper, Rubin outlines a Bayesian bootstrap that provides close enough expectations of outputs to the simple or vanilla bootstrap and argues that the implicit prior involved is _silly_ for some or many empirical research applications and hence shows the vanilla bootstrap is not an “analytic panacea that allows us to pull ourselves up by the bootstraps”. The bootstrap simply cannot avoid sensitivity to model assumptions. And in this post I am emphasising that _any_ model assumptions that give rise to a procedure with similar enough properties, whether considered, used or even believed relevant, should be thought through. Not sure where this “case study” sits today – at one point Brad Efron was advancing ideas based on the bootstrap “as an automatic device for constructing Welch and Peers’ (1963) ‘probability matching priors’”.
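To make the “close enough expectations of outputs” concrete, here is a minimal sketch comparing the two procedures for a sample mean. It assumes the usual description of the Bayesian bootstrap, in which each replicate draws observation weights from a flat Dirichlet(1, ..., 1) distribution, while the vanilla bootstrap resamples the observations with replacement. The toy data and the helper names are hypothetical.

```python
import random
import statistics


def vanilla_bootstrap_means(data, n_draws, rng):
    """Resample the data with replacement and record the mean of each resample."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_draws)]


def bayesian_bootstrap_means(data, n_draws, rng):
    """Draw Dirichlet(1, ..., 1) weights and record the weighted mean each time."""
    draws = []
    for _ in range(n_draws):
        # Normalised independent Gamma(1, 1) draws are Dirichlet(1, ..., 1) weights.
        gammas = [rng.gammavariate(1.0, 1.0) for _ in data]
        total = sum(gammas)
        draws.append(sum(g * x for g, x in zip(gammas, data)) / total)
    return draws


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(10.0, 2.0) for _ in range(30)]  # hypothetical toy sample
    vanilla = vanilla_bootstrap_means(data, 4000, rng)
    bayesian = bayesian_bootstrap_means(data, 4000, rng)
    print("sample mean      ", round(statistics.fmean(data), 3))
    print("vanilla  mean, sd", round(statistics.fmean(vanilla), 3),
          round(statistics.stdev(vanilla), 3))
    print("Bayesian mean, sd", round(statistics.fmean(bayesian), 3),
          round(statistics.stdev(bayesian), 3))
```

The two sets of draws should centre on the sample mean with nearly the same spread, which is the sense in which the outputs match; the sketch says nothing about whether the implicit prior is sensible for a given application.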

An aside I find interesting in this paper of Rubin’s is the italicized phrase “with replacement”. It might be common knowledge today that the vanilla bootstrap simply samples from all possible sample paths of length n with replacement, but certainly in 1981 few seemed to realise that. I know because when Peter McCullagh presented work that was later published as “Re-sampling and exchangeable arrays” (2000) at the University of Toronto, I pointed this out to him and his response indicated he was not aware of this.

Case study 3: Bayarri et al., Rejection Odds and Rejection Ratios.

This is a suggested Bayesian/Frequentest compromise for replacing the dreaded p-value/NHST. It is not being put forward as the best method for a replacement but rather one that can be easily adopted widely – Bayes with training wheels or a Frequentest approach with better balanced errors. Essentially it is a Bayesian inference that matches a frequentest expectation, with the argument that “Any curve that has the right frequentist expectation is a valid frequentist report.”
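For readers who have not seen the paper, here is a small sketch of the two quantities as I understand them; the formulas are my reading of Bayarri et al. and should be checked against the paper. The pre-experimental rejection odds are taken to be the prior odds of the alternative multiplied by the ratio of power to type I error, and the simple upper bound on the Bayes factor in favour of the alternative is taken to be 1/(-e p log p) for p < 1/e. The function names are hypothetical.

```python
import math


def pre_experimental_rejection_odds(prior_odds, power, alpha):
    """Odds of a correct to an incorrect rejection before seeing the data.

    Assumed form: prior odds of the alternative times the rejection
    ratio power / alpha.
    """
    return prior_odds * power / alpha


def bayes_factor_upper_bound(p_value):
    """Assumed upper bound 1 / (-e * p * log(p)) on the Bayes factor, for p < 1/e."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the bound is only defined for 0 < p < 1/e")
    return 1.0 / (-math.e * p_value * math.log(p_value))


if __name__ == "__main__":
    # Even prior odds, 80% power, alpha = 0.05
    print(pre_experimental_rejection_odds(1.0, 0.8, 0.05))
    # Largest Bayes factor compatible with p = 0.05 under the assumed bound
    print(round(bayes_factor_upper_bound(0.05), 2))
```

With even prior odds, 80% power and alpha = 0.05 the pre-experimental odds come to 16 to 1, while a single p = 0.05 supports a Bayes factor of at most about 2.46 under this bound, which gives some feel for why the two reports can read so differently.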

I am not expecting most readers to read even one of these case studies; rather, readers who do, or who have already read them, might share their views in the comments.

 

125 Comments

  1. Toby says:

    Keith,

    I appreciate the effort, but it’s very hard to follow. Propertyist, ducks, representists, frequentests, rabbits, it’s confusing to read.

    Also, I do understand that to you this,

    “Though, as one of Andrew’s colleagues once argued it is really modellers versus non modellers rather than Bayesians versus Frequentests and that makes a lot of sense to me. Representists are Rabbits “conjecturing, assessing, and adopting idealized representations of reality, predominantly using probability generating models for both parameters and data” while Propertyists are Ducks “primarily being about discerning procedures with good properties that are uniform over a wide range of possible underlying realities and restricting use, especially in science, to just those procedures” from here.”

    makes sense. To me it does not. In particular given that you link Breiman’s paper right after. There the Frequentists are the modellers it seems.

    You go a bit too quick for me. I’ve read some of the other posts. I’m somewhat familiar with what it is all about. But it would help me if you would expand on some of the assertions / points that you make.

    • Keith O'Rourke says:

      > In particular given that you link Breiman’s paper right after. There the Frequentists are the modellers it seems.
      Yes, as an alternative view, though perhaps more distracting than I realised.

      When statisticians justify their choice of techniques, it tends to be either in terms of the properties those techniques have been determined to have under a minimal/default/convenient model (e.g. symmetric distribution) or that the specific/purposeful/motivated model that they carefully chose for the application implies that technique (so it has to be used unless the model is revised).

      This http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf provides more discussion of this and a quote from Rob Tibshirani that might help.

      • Toby says:

        Thanks. I believe you also linked that paper in the post?

        And yes, I am well-aware of that fact. That’s not what confuses me. The modeller vs. non-modeller seems to be a distinction without much of a difference. That is, the distinction doesn’t seem to be modelling vs. non-modelling. At least not based on what I can read here. I have the feeling that you’re omitting several steps in your reasoning or take some things for granted. Then again perhaps I am not the target audience for this post.

        • Keith O'Rourke says:

          The source of “as one of Andrew’s colleagues once argued it is really modellers versus non modellers” might help; see p. 475 here:
          http://www.stat.columbia.edu/~gelman/research/published/badbayesresponsemain.pdf

          The difference is more one of emphasis and purpose – the non-modellers do need to have models to evaluate properties but they want them to be vague and non-specific so they hold widely whereas modellers want specific models that represent the reality they are trying to get at well (e.g. include background knowledge with a prior and accurately reflect the data generating process as it took place).

          > Then again perhaps I am not the target audience for this post.
          It might require some knowledge of case study 1 (which was a previous post) or other arguments about Bayes and Frequentist perspectives when the statistical output is the same from both. For instance, that one should somehow take some assurance from them being the same when that may not be the case.

  2. I’m sorry, but I’ve read the first paragraph multiple times and I have no idea what it’s saying. The rest doesn’t help. In case you want some feedback:

    “It is not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability,etc.) for every sample, or some samples or the same distribution of outputs” — i.e. you write that it’s not unusual…
    …to get the same statistical output for every sample
    …to get the same statistical output for some samples
    …to get the same statistical output for some distribution of outputs
    etc.
    I am totally confused by what this means, and I imagine there’s some interesting point you’re excited to convey.

    • Keith O'Rourke says:

      When you use two different statistical techniques, comparing the two results can be complicated.

      For instance, think of two ways to get confidence intervals – how do the intervals differ and does that difference really matter?

      If the intervals are always the same for each and every sample, it is easy (case study 1).

      Perhaps they differ by sample, but the distribution of intervals is the same – does that difference really matter?

      Perhaps they differ by sample, but the coverage probabilities are the same and the average width is the same, while the distribution of intervals is not the same – does that difference really matter? What if the variance of the width of intervals was much larger in one method?

      So the similarity of statistical outputs can be confusing, but to cover the case studies I needed the same statistical output for every sample (case 1), just close enough expectations of outputs (case 2) and the same expectations of outputs (case 3).

  3. Andrew says:

    Keith:

    Since you cite Leo Breiman, I have to point out that he was subject to the foxhole fallacy.

  4. ojm says:

    Hi Keith,
    Interesting post.

    Quick comment for now. I think the modeller vs non-modeller distinction is indeed important. See also the generative vs discriminative distinction.

    As someone from a modelling background (and even mechanistic rather than statistical) I used to think more modelling is always good. However, I’ve come to realise, for example, that it is often easier/better to filter out or ignore noise than to model it explicitly. (Also reminds me of some of Hofstadter’s GEB book I read a long time ago – something about the interplay background vs foreground ie noise vs signal.)

    The archetypes for each case are probably Tukey (non-modeller, filter noise) and Box (modeller). Huber has an interesting quote about an interaction between them at a robustness conference that I mentioned in my written response to Andrew and Christian’s paper (I have no idea if it was read/has been seen though).

    • ojm says:

      Another related distinction is between those who think in terms of equalities and those who think in terms of inequalities (i.e. bounds). Or between ‘direct’ and ‘indirect’ characterisations. I think Bayesians tend to want equalities – e.g. a specific probability distribution – whereas Frequentists think in terms of inequalities – e.g. a bound on coverage probability.

      One barrier to teaching Frequentist-style reasoning is probably that we do a poor job of teaching students how to think in terms of approximation and inequalities. Hence the near-universal pain when (or more likely if) students encounter analysis after calculus.

      • This is an interesting way of framing things, but I think it’s not correct to say that “Bayesians want equalities… whereas Frequentists think in terms of inequalities”.

        I think the better way to think of this is that Bayesians want to specify an equality of a specific individual quantity within a soft error bound whereas Frequentists want to specify the exact frequency with which you have exact equality under repetition of an experiment.

        y = a*x + epsilon(sigma)

        has two different interpretations. One is that (y_i – a*x_i) is plausibly anywhere in the high probability region of epsilon(sigma) with higher density areas more probable than lower density areas, for each and every i (Bayesian) and the other is that out of a large number of i values, (y_i – a*x_i) will fall within a region epsilon in [foo,bar] exactly integrate(p(epsilon),epsilon,foo,bar) of the time (Frequentist). The Bayesian is actually a joint probability (plausibility) distribution over the a_i, the Frequentist is actually a single frequency distribution for the ensemble of all the a values.

        Both are about approximation of equalities, but one is about pointwise measures and the other is about ensemble measures.

        Note that this pointwise declaration of knowledge about a quantity is what led me eventually to blog a bunch about “declarative models”

        http://models.street-artists.org/?s=declarative

        Although it’s possible to formulate any declarative model in a generative way, it is, at least to me, more natural to use a declarative formulation of some models. This specifically occurs when I have knowledge that allows me to assign a probability to a particular function of data and parameters, rather than directly to the individual data or parameters.

        For example, in Corey’s pseudo-periodic regression case below, you could create a function f(x) parameterized by some parameters a_i, with uniform probability on the a_i, and do something like write down an approximate expression for integrate((d^2f(x)/dx^2)^2,x,0,1) and assign a Bayesian probability to this quantity, thereby asserting that whatever the a_i values are, they result in a function that doesn’t wiggle around very much.

        It’s much easier to assign this “declarative” probability than to back-out what the implied joint probability over the a_i values is.

        In this sense, we’re approximating our knowledge of the properties of this function using a distribution.

    • My take on this is that a two stage approach is very helpful. Noise -> filtered data -> model

      in this scenario, you can incorporate the filter into the model if needed, for example to alter the distributions used in the model due to the effect of the filter.

      I’ve worked on projects like this, at least briefly. For example, trying to detect small earthquakes from arrays of seismometers. The first approach is to filter out both extremely small amplitude signals, and relatively large amplitude signals. Filter out anything where a signal doesn’t reach all of the sensors within a certain window… and then start to model “what does a real seismic event that makes it through the filter look like”. Post filtration, you don’t have to work as hard at making the model so general purpose.

      • ojm says:

        Yes exactly.

        But notice

        Generative: model to data
        Discriminative/Filtering: data to filtered data

        operate in opposite directions. Much of statistics, including Bayes, operates primarily in the generative direction, inverting after a full model is set up, while much of machine learning etc operates in the opposite direction, not necessarily requiring an explicit generative model. This I think is another aspect of Breiman’s two culture distinction.

      • Corey says:

        In my view filtering in this fashion implements part of the model — the part that says measured energy in those frequency bands or lacking that cross-correlation aren’t caused by the signal of interest. A preprocessing step is just prior information by another name.

        • I agree with this, I was going to say something to that effect, about how the filtering step can be seen as some kind of piece inserted into the generative description of the filtered-data.

        • ojm says:

          “A preprocessing step is just prior information by another name”.

          It depends on whether you mean prior information = probability distribution, or = general background information.

          You can of course try to force these things into a modelling framework, but there’s also the possibility that Tukey thought differently to Box and that both perspectives have merit.

          • ojm says:

            When you do the Bayesian update do you condition on the filtered data or the actual data? To do the latter you would seem to need a model: filtered data to data, which is much harder than the data to filtered data direction.

            • Keith O'Rourke says:

              It is rare to ever condition on all the actual data in an application (think of all the variables that are initially recorded in a study) and the actual data should not necessarily be taken as the recorded data to this many decimal places.

              For instance, this paper suggests “rather than condition on the data exactly, one conditions on a neighborhood of the empirical distribution” https://arxiv.org/abs/1506.06101

              I think the real argument is about whether or when to use a probability model to incorporate this information versus doing it informally. One position is “almost never”, as David Cox recently stated it: “prior information is often very important but insisting on quantifying its uncertainty is often (nearly always?) a bad idea.”

              • ojm says:

                “It is rare to ever condition on all the actual data in an application”

                Exactly- one reason being that writing down a probability model for the actual data is effectively impossible.

                So what are we doing in practice and how does that translate to foundations? A lazy answer is we just have an implicit probability model/prior etc. But is this really a justifiable answer? Perhaps we are simply doing something different? That’s the possibility I’m raising.

                RE David Cox on prior info. Again I’m questioning the assumption that prior info = probability distribution.

          • Keith O'Rourke says:

            > both perspectives have merit.
            One way to assess that would be to try to match the statistical outputs (in some sense) and then ask is this model sensible and are these properties really good.

            In case study 1, when the _good_ property was obtained the model was not sensible, and the _goodness_ of the property was called into question by assessing a different _good_ property. (The model not being sensible served as an alarm to think more carefully about the property being taken as _good_.)

          • Corey says:

            So you tell me what you think of this. I was analyzing an ensemble of quasi-periodic signals using Gaussian process regression. The sampling rate was ~25 Hz and the noise was white; the frequency content I was interested in was certain to be below 3 Hz. I lowpass filtered the data to remove noise power, which mathematically induces correlated noise. I didn’t model the noise as correlated — instead I downsampled the data such that the remaining section of the frequency spectrum was flat, i.e., the correlations were excised and the white noise condition restored. I then proceeded with the Gaussian process modelling. I think this is entirely legitimate from a fully Bayesian point of view, and here’s why:

            A posterior distribution is really an abstract mathematical object — we always summarize our inferences by computing posterior expectations and those numbers, being calculated by numerical methods or MC and rounded off, are approximate to a lesser or greater degree. My assertion is that to the extent that for typical data sets the two processes

            data –> –full model–> inferential output

            and

            data –> filtered data –simplified model–> inferential output

            implement (approximately) the same map from data to inferential output, they are merely two different ways of realizing inferences from the same posterior distribution.

            I don’t view this correspondence as vacuous — it helps us decide what to do in various practical situations we may face. If a fully Bayesian computation (that is to say, a high quality approximation valid everywhere in data space) is impractical, the correspondence suggests that we seek a simplified, perhaps facially non-Bayesian procedure that may not be a good approximation everywhere in data space but is adequate for data sets we’re likely to encounter. Given a non-Bayesian procedure that is performing poorly in the face of actual data, the correspondence suggests that we map to a Bayesian posterior, examine the way in which the model fails to be adequate, shift to a better model, and map back to a simplified procedure. Of course, moving to a better procedure is what people would try to do anyway; the correspondence gives us a roadmap for doing so.

            • This all makes good sense to me. I should add though that sometimes the useful fiction of signal + noise = signal + output_of_stable_random_number_generator leads us to something that looks like a Frequentist procedure for “removing” the noise (like say using a p value in a filtering procedure).

              It would be a mistake to think that you’re “doing frequentist statistics” just because you use the frequency properties of a random number generator. There is no inference really involved in filtering on p values, except that you’re using prior knowledge to infer that “what’s coming through the filter is signal of interest”

              • Another way to put this is that in the filtering case, the SCIENTIFIC LOGIC is Bayesian even when the mathematics is the mathematics of Frequentist probability.

              • ojm says:

                Frequentist probability is not the same as frequentist inference. (And frequentist inference is not the only alternative to Bayes).

                RE ‘implement the same map’. But how exactly? You state it without proof.

              • Re implement the same map

                It’s not that Corey is saying that the simplified model for the filtered data necessarily implements the same map, it’s more that this is the goal and the extent to which it’s true measures the success.

              • ojm says:

                But you don’t have the full map.

                Corey says
                data -> full model -> inferential output

                But this obscures the issue I tried to raise:

                Bayes works by _inverting_ from a generative model

                data <- full model

                via conditioning on the data. See the issue?

              • Corey says:

                The broad outlines of the proof that the two procedures implement approximately the same map are trivial to see given some DSP background; proving it in detail would be tedious. It helps to know that I designed the filter by inverse-FFT-ing the frequency response I wanted: perfectly flat in the passband, no phase shift, nice smooth transition band. That the filter removes noise power while leaving the signal completely unaffected follows from the linearity of the filter and the prior knowledge that the signal has no power in the stopband; that the downsampling leaves the information in the data about the signal untouched follows from the Nyquist theorem and the fact that it only chops off the portion of the frequency spectrum that I already squished with the filter. Figuring out the ratio of the effective noise power in the two analyses is simply a matter of calculating the second moment of the filter impulse response function; it didn’t occur to me to bother with that at the time.

                Sorry, my ASCII art isn’t great. “–full model–>” and “–simplified model–>” should maybe look like “–-(full model)-–>” “–-(simplified model)-–>”. These aren’t nodes; they’re edges. So “data <- full model" doesn't make sense in my metaphor.

              • ojm: you have in mind some model, Data —- (Full Model) —-> Posterior Distribution over non-nuisance parameters

                You realize through some transformation of the data, that the portion of the data that you model with the non-nuisance parameters is basically unchanged by the filtering process…. So, you do:

                Data —- (filter) —-> Filtered Data —— (Simpler Model) —-> Posterior Distribution over non-nuisance parameters

                To the extent that you have a reason to believe that the “signal of interest” is unchanged or changed in a well defined way, then the (simpler model) induces the same marginal distribution on the non-nuisance parameters. To the extent that this is true, then the logic is still the same Bayesian concepts regardless of the mathematics used in the filter step.

              • ojm says:

                RE: data <- full model 'doesn't make sense'.

                Just replace by

                data <- (full generative model) (full model)^-1 -> inference about parameter

              • ojm says:

                Damn something went wrong with WordPress

                data < – full generative model full model -> inference on parameter

              • ojm says:

                Another attempt

                Need to start from

                data <~ full model <~ parameter

                Then invert.

              • Corey says:

                The operational content of the estimate of a parameter (by definition, not directly observable — otherwise we’d call it data) is in the prediction of observable quantities it directs us to make. Non-modellers get from observed data to predictions without abstracting the information in the data as explicitly as modellers do, and Bayes can encompass this mode of inference too, by passing from a joint prior predictive distribution over current and future data to the posterior predictive distribution directly. (Bruno de Finetti was inspired to prove his exchangeability theorem because he wanted to explain what parameters are and how they can arise from the predictive point of view.)

                Whether we’re modellers or non-modellers our inferential output can be cashed out in prediction; the Bayesian “full model” element may reify parameters but it doesn’t have to. The filtering and “simplified model” together now constitute a “simplified setting” in which we impose a direct functional map that doesn’t mention parameters explicitly. So the upshot is: I think that at a high enough level of abstraction the distinction between non-modellers and modellers disappears, and that Bayes provides a path to go up and down from this level of abstraction and end up in either Modelling Land or Non-Modelling Land as we choose. The correspondence between the full model and the simplified setting exists in both.

              • Curious says:

                Corey’s comment is an exercise in logical absurdity.

              • Curious says:

                Corey,

                I apologize for the strength of my comment above, but I find this line of reasoning to rest on one key assumption that seems untenable in almost every data mining context I have observed, which granted may be quite limited. This notion that the cutting away of noise is done in a way that preserves signal and only loses noise is based on what evidence exactly? How can you possibly know this when you are using the data itself to determine whether or not there is a signal at all and in turn simply chopping away anything that obscures this, which could very easily be chopping away the cause of both x & y in the mined data?

              • Curious says:

                Corey,

                Never mind. I see your point.

                I read your comments above and I see what you are saying. If the signal is strong enough and operates consistently and you have some prior knowledge about this behavior, then you can reasonably accomplish it.

              • Corey says:

                Aw, and here I thought we were about to start a comment war. Alas…

              • Curious says:

                Corey:

                Yeah. I was kind of looking forward to it as well.

                I think my data mining experience has been in the realm of human behavior and find that the strongest signal is usually so obvious as to have wasted everyone’s time in undertaking the analysis in the first place. And that it is the fainter signals being obscured by the obvious that are most useful.

              • Curious. Imagine you’re trying to detect particular types of irregular heart beats over an EKG that has an intermittent connection. It’s not your job to detect a patient with no heartbeat. You assume each flat-line section of more than 3 seconds is where the connection has dropped. You remove them, and then run your model on the remaining signal, trying to detect afib or bigeminy or other known phenomena.

                In the full model you’d need a model for whether or not a section of signal is connected or disconnected… in the filtered model you just need to detect the particular kinds of abnormalities…

                I think this is the kind of scenario Corey and I are discussing.

              • ojm says:

                Corey, regardless of whether you work in terms of predictive distributions or not you are still thinking in terms of a generative process:

                full data <~ model <~ inputs

                which is updated by conditioning on the data side.

                (As a side point, doesn't de Finetti essentially require infinite exchangeability and even then tend to only produce existence theorems, not uniqueness? And it says nothing about what happens if you make a wrong assumption, i.e. nothing about robustness. Tukey said "better to be approximately right than exactly wrong")

              • In the context of human behavior you could have an example of some kind of economic behavior. Perhaps for very poor people or very rich people your model is not expected to be very useful. So you select from the American Community Survey households with income at least 75% of GDP/capita but less than 200%, and do your analysis on them.

                The “full” model would be something where you model how things work for the poorer and richer folks as well, but if those aren’t of interest to you, then it’s just a nuisance to have that additional model and you’re going to only publish the estimate of the parameters related to the middle group anyway… this is still logically a Bayesian analysis.

              • Curious says:

                I suppose the distinction between the two methods begins to blur when you begin coding those different levels and fitting different models within them.

              • ojm says:

                “this is still logically a Bayesian analysis.”

                Can you be more explicit? Also, given the issues Bayes has with non-identifiable problems can you provide assurance that I will still get a sensible answer when I massively increase my model space to take account of all the possibilities you neglect?

              • ojm says:

                The same issue arises in discussions of vagueness and formal logic – inevitably we must significantly clean up informal statements before submitting them to formal logical analysis. By this time much of the work has arguably already been done.

                The logical interpretation of Bayes doesn’t fix this in that it builds on Boolean propositional logic. At the base are sharp T/F statements (also no quantifiers but that’s another topic). This leads to ideas like needing to condition on a neighbourhood of the empirical distribution as Keith mentioned above – we no longer want to condition on a sharp proposition like y=y0 but instead on a vaguer neighbourhood. But I don’t think this is a sufficient fix (again, another story).

              • I think the correspondence with formal binary logic helps rather than hinders. I have no problem with the idea that formal logic doesn’t help you unless you have a well developed correspondence between reality and the model you’re using, but as soon as you do, then you can use formal logic to arrive at conclusions, and because Bayesian logic is compatible with binary logic in the asymptotic perfect information regime, Bayesian calculations will not go wrong. I think this is why I love giving examples on the blog: without that correspondence it’s easy to go off into vague land.

                In the EKG example, it’s pretty clear that when the EKG is disconnected, the shape of the waveform is uninformative, and when it’s connected, the shape of the waveform is completely informative. It’s also clear from our knowledge of heart rates that a living person’s heart beats more than once every 3 seconds. So, when you remove any long stretch of flat line, you don’t lose information about the waveform shape.

                Now, either you accept that statement about reality, in which case you can easily proceed to do the Bayesian analysis of the remaining waveform, or you don’t accept that statement, in which case you need to come up with an alternative if you want to proceed. It’s not sufficient to simply say “there doesn’t exist a guarantee that you haven’t lost some information”. There doesn’t exist a guarantee that the world isn’t a simulation taking place in the computer of a precocious child of an ultra-advanced race of beings either.

                The essential goodness of the Cox Bayesian interpretation is that it’s compatible with logic, so when you make a logical deduction it *is* a Bayesian deduction asymptotic for extremely precise information about the world.

                In Corey’s case, he knows enough DSP theory to show that applying his linear filter and downsampling leaves any signal with less than 3Hz essentially unchanged. The result is, if he assumes based on science that the generating process for the thing of interest has energy entirely contained in the band 0 to 3Hz that his post-processed signal informs his generating process as much as the full dataset sampled at a billion samples per second.

                The correspondence with formal logic is not a guarantee of scientific correctness, but it is a guarantee of (If scientific correctness of A and logical argument for A -> B then scientific correctness of B)
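
A rough Python sketch of the filter-and-downsample claim above. Corey's actual filter is not described in the thread, so the moving-average filter, the sampling rate, the filter width, and the downsampling factor here are all assumptions; a moving average is a crude stand-in for a properly designed low-pass filter.

```python
import math


def moving_average(x, width):
    """Simple linear (FIR) low-pass filter: mean of each window of `width` samples."""
    return [sum(x[i:i + width]) / width for i in range(len(x) - width + 1)]


def downsample(x, factor):
    """Keep every `factor`-th sample."""
    return x[::factor]


fs = 1000        # original sampling rate in Hz (assumed)
f_signal = 1.0   # a component well inside the 0 to 3 Hz band
width = 10       # filter width in samples (assumed)
factor = 10      # downsampling factor (assumed)

# Two seconds of a unit-amplitude sine at f_signal.
x = [math.sin(2 * math.pi * f_signal * n / fs) for n in range(2 * fs)]
y = downsample(moving_average(x, width), factor)

# Theoretical amplitude gain of a length-`width` moving average at f_signal.
gain = math.sin(math.pi * f_signal * width / fs) / (
    width * math.sin(math.pi * f_signal / fs)
)
# Largest amplitude that survives filtering and downsampling.
peak = max(abs(v) for v in y)
```

Under these assumptions `gain` and `peak` both stay within a fraction of a percent of 1, which is the sense in which the low-frequency content is "essentially unchanged" while the sample count drops tenfold.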

              • Corey says:

                I’m thinking in terms of encoding the information available to me in a probability distribution on observable quantities; I understand that not everyone does this. I am capable of thinking in other terms — Laurie Davies’s Stepwise Choice of Covariates in High Dimensional Regression is particularly clear, simple, powerful, and completely non-Bayesian — but in the end I find I gain clarity of thought by at least attempting to bring everything back to the Bayesian frame. To the extent that I have a position, I guess it’s that other modes of thinking are rarely impossible to understand from the Bayesian frame (vice versa, not so much), and that viewing things from this perspective is a clarifying exercise. For example, I usually think of methods in robust statistics as the “approximately right” answer to a full Bayesian approach* that we might not want to bother with if there’s a robust method that gives us pretty much the same inferences with less work.

                In response to, “The same issue arises in discussions of vagueness and formal logic – inevitably we must significantly clean up informal statements before submitting them to formal logical analysis. By this time much of the work has arguably already been done,” I would say that (1) this is true and (2) it doesn’t in the slightest cut against the claims Bayesians make for the desirability of casting problems into the Bayesian setting. Let’s say I want to design a controller. I might think that fuzzy logic will work well, so I’ll work to bring the information I have at my disposal into the fuzzy control system framework. Alternatively, I might think that stochastic process control is the way to go, in which case I’ll work to cast my prior information in terms of that setting. That I have to do this work before turning the crank on whatever mathematical system I’m using to accomplish my aim is part of the point.

                (There are representation theorems for finite exchangeability. The point of bringing it up is to show that stating a model in terms of parameters does not necessarily imply that one is reifying those parameters — it might just be that it’s easier to do the math that way.)

                * to Daniel Lakeland: on page 6 of this paper de Finetti gives that Khintchine’s theorem stuff we were messing around with here and calls it “obvious”. :-PPP

              • ojm says:

                Corey, Daniel,

                What is your position on dealing with non-identifiable models within the Bayes machinery? A non-issue? A failure of the researcher to provide a well-defined problem? Etc.

              • Corey says:

                My take is that ideally one reduces to the equivalence classes induced by the obvious equivalence relation on the original parameter space before assigning the prior.

              • ojm says:

                So again, quite a significant amount of preprocessing work before bringing in the formal machinery? Why not just assign a prior straight away?

              • ojm: non-identifiability can be a bug or a feature. If you’re not sure about your model and you find that you can’t identify something, it can be an indication that you weren’t thinking carefully and putting your model together in a good way. Perhaps a reformulation is in order.

                On the other hand, if you’re pretty sure about your model, and you can’t identify parameters, then it’s an indication that you need to design a data collection procedure to collect additional data that helps to resolve the identification issue.

                If in principle you could never collect the data that identifies the parameter, perhaps you should look at your theory and see if there is some other more fundamental quantity.

                That’s my initial take on it at least.

              • ojm says:

                I just find all of these responses to beg the most interesting questions. But I guess we can discuss some other time.

            • Curious says:

              Daniel:

              Yes. That makes sense.

    • Keith O'Rourke says:

      > archetypes for each case are probably Tukey (non-modeller, filter noise) and Box (modeller).
      From my recollection of history, both Laplace and Gauss moved from modellers to non-modellers in their later years.
      (I would have to reread Anders Hald to be sure.)

  5. MB says:

    Interesting, but maybe some typo in the first sentence? “It is not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability,etc.) for every sample, or some samples or the same distribution of outputs or the same expectations of outputs or just close enough expectations of outputs.” doesn’t really parse as anything to me even though I’m a working statistician.

    After reading through the piece, seems like you meant something like “Often, using different models or sampling techniques will lead to the same or similar estimates” which I do agree with and see quite often in my own work :)

    • Keith O'Rourke says:

      MB: Thanks, maybe
      “Often, using different Bayesian models or frequentest techniques will lead to the same or similar inference outputs – estimates, confidence intervals, tail probabilities (p_values, posterior probabilities of the null), etc. Now what is taken as similar, is often not straightforward but varied and complex.”

  6. Christian Hennig says:

    I’m probably too late to this party but just wanted to make a few comments:
    1) I think that when we’re analysing data using probabilities, we should know what these probabilities are supposed to mean. A basic distinction is whether they model data generating processes in the world, or rather a state of knowledge. I don’t think these two should be mixed in the same analysis, because then this means that eventually it’s not clear what the resulting probabilities mean. Of course one could analyse the same data in different ways using the different interpretations and compare the results and their interpretations, but ultimately as long as probabilities don’t mean the same thing, this should be respected and not mixed.
    2) Whereas the kind of analysis that is usually called “frequentist” only applies to the modelling of data generating processes, Bayesian analysis can be applied using both kinds of interpretations (although when it is used for modelling data generating processes, the interpretation of the prior may occasionally be difficult and feel “un-natural”). But still I’d ask the Bayesian to decide what the probabilities that she produces are supposed to mean! (This happens quite rarely; unconsciously mixing up meanings happens far more often.)
    3) Another key decision, apart from the “aleatory/epistemic interpretation” one mentioned above is whether in a given situation there is some information that makes the use of a specific prior distribution attractive and useful. If not, I don’t see why I should bother with a Bayesian analysis, at least not if modelling data generating processes seems fine and I don’t need for any reason epistemic probabilities which then would be based on rather shaky choices of priors.

    • Christian, I think you are off track a little. Although I agree that people should decide what probabilities mean… I think there is really only one kind when it comes to science, namely a state of knowledge.

      When it comes to “data generating processes” what we need is physics, biology, ecology, chemistry, biochemistry, genetics, etc (For ease of writing, I’ll just call this “science”)

      When it comes to applying probabilities to data generating processes, the only thing that makes sense is to use the science to predict outcomes, and then acknowledge the imprecision of our predictions. It is this imprecision that equals lack of information that equals state of knowledge that produces a Bayesian probability distribution.

      When it comes to frequency based probability, this is only ever a property of pure math, or computing. It is only through rigorous construction of algorithms that enforce randomness behavior that we get frequency guarantees which are appropriately usable. There’s a reason why the Diehard and Dieharder tests exist: https://www.phy.duke.edu/~rgb/General/dieharder.php

      In the absence of mathematical/computational construction, there is nothing which guarantees some kind of future frequency behavior will actually occur, but this is not true about our state of knowledge. Our state of knowledge about what will occur is invariant to unknown changes in the science.

      One kind of state of knowledge is “for all we know, each outcome will be in the high probability region of distribution D” and this leads to an IID sampling type model, but it is *still* a model of our *state of knowledge*. It doesn’t enforce any particular frequencies on future outcomes.

      So, if we simply adopt the position that every scientific study is analyzed from our current state of knowledge, and for the purpose of improving our state of knowledge and that there is no objective IID frequency in the world, we will be far less wrong.

      • Christian Hennig says:

        Daniel, I disagree.
        Frequentist probabilities are a model for data generating processes in reality, as Bayesian probabilities (if interpreted in this way) are a model for knowledge in the face of uncertainty.
        I think that a major problem in discussions on the foundations of statistics is that people have a too naive idea about models and how they could be “true” or not in reality.

        You are apparently opposed to using the frequentist interpretation of probability for modelling anything that doesn’t work very much like a random number generator. I think you have a too narrow idea of how models are used and can be used (actually, reading this again before sending, no, I don’t think that you generally have a too narrow idea, I rather think that you apply a narrower idea to frequentist models than you’d be happy to apply to your own favourite interpretation).

        Obviously, applying frequentist models to something other than random number generators requires more idealisation, but that’s just how it is with models. Such idealisations need to be critically discussed, fair enough, and at times one may be convinced that in this-or-that situation it’s really not a good idea, but ultimately if we use mathematical modelling in any way, we won’t get around such issues. It’s just the same with Bayesian epistemic probability modelling; almost all Bayesian models I have seen involve exchangeability assumptions (at least on some level), and I have no reason to believe that anything in reality is exchangeable. Tough luck! This doesn’t make me an anti-Bayesian but it makes me very wary of dogmatic statements of the kind “only such-and-such way of modelling makes sense”.

        Reference:
        C. Hennig: Mathematical Models and Reality – a Constructivist Perspective. Foundations of Science 15, 29-49 (2010).

        • The thing about a frequency approach is that for it to be meaningful scientifically, a stable observed frequency and a scientific concept of why that occurs must be observed and hypothesized first. As soon as you do this, you are working with a state of knowledge. Your knowledge is “I observed this histogram for plant growth rates in environment X, and I know that the way plants grow in environment Y is much like in environment X because the sunlight, water, atmosphere, and soil nutrients are similar and these are the dominant factors that control plant growth”

          When you say “the frequency of rate R is f(R)” you are sweeping this longer and more detailed justification under the rug, but it must be justified, and it is justified by … your state of knowledge.

          So even f(R) is a state of knowledge, it’s just a state of knowledge about the existence of a function f which under these conditions Y would model how often growth rate R will be seen, and also knowledge about what region of the function space f is in (what its approximate shape is). In fact, when you do a Bayesian analysis to fit f(R) you wind up with a posterior distribution over the shape parameters for f which more explicitly models your state of knowledge about f.

          In answer to Carlos’ question, this is how you model the unstable atom, you observe many decays and see a regularity that is independent of many many things, and so you decide to assign a frequency distribution with your knowledge that regardless of what conditions you look at it, the duration of time to decay will always be one of the numbers in the high probability region. It’s justified by acquiring a huge number of observations into your knowledge set so that you can produce a sequence of histograms each of which is very similar independent of your manipulations within some regime.

          The biggest issue I have in Frequentist statistics is that it is applied automatically *as if the existence of a frequency distribution were guaranteed*. Anyone who wants to do an explicit argument about why the frequencies of a particular scientific process are some function f and justify it with a wide variety of data gets a pass to do Frequentist calculations from me. But some doctors running an RCT on acupuncture with 35 patients or whatever… not so much.

          • Put another way, what is the justification for your choice of model:

            1) With a Bayesian analysis it is sufficient to say “for all I know R is somewhere in the high probability region of p(R)” and it’s justified by the fact that it is explicitly a statement about *for all I know*, that is a state of knowledge.

            2) With a Frequentist analysis we are explicitly making the assumption “after many observations the frequency with which R is between r and r+dr is f(r) * dr” or at least f(R) is a good enough smooth approximation. This is a statement about the future evolution of the physics of the world. What justifies it? You need something fairly strong! You need hundreds or thousands of data points.

            When you collect 6 months of satellite fly-by images of a forest and you select randomly 400 from each fly-by and show me that the frequency of finding a certain quantity of IR emission caused by evaporative transpiration is such and such… and it is stable in each fly-by. I will believe you. Then we can do frequency calculations together.

            • Christian Hennig says:

              Daniel: If I identify the world with my model, then yes, you’re right, I’m making this assumption. This can come from knowledge but often it’s just some kind of hypothesis, you’re right about that, too. Even worse, we’re not testing all of our hypotheses and we make some assumptions we know are wrong (e.g., assuming a continuous distribution for real and therefore discrete data).

              But this is how modelling *in general* works. With Bayesian modelling (of knowledge, rather than of data generating processes) it’s just the same. With exactly the same right I could, if you give me a Bayesian model for anything, say, how can you be sure that this is an *exact* representation of our knowledge. For all Bayesian models I have seen involving exchangeability (which is pretty much all of them) I *know* that they don’t completely capture my knowledge (or lack of) about the world, because having exchangeability is extremely restrictive and I have never seen any positive proof for it in any situation – so the knowledge model should put some probability elsewhere, shouldn’t it? But that would often be spectacularly complex and even if it wouldn’t, one can find other annoying knowledge (or lack of) about reality that could shred your knowledge model to tears if you took it too seriously.
              But still these can be models that are fine for the purpose and very useful.

              (Note by the way that I believe that all useful probability modelling needs to involve an element of repetition modelling, be it frequentist “i.i.d.”, Bayesian exchangeability, or whatever, because we need to make sure that with help of the model we can use the past to make statements about the future. This is a requirement of our reasoning; this is *not* because the world really is like that; we need to do some idealisation violence to the world in order to get this.)

              Also you need to understand that the point I’m making is *not* about “belief” that the world really is like that. We use models, frequentist and Bayesian, to achieve something; they are modes of thinking and communication. As such they can work (but of course won’t always) without the world being “really” like that.

              It still looks as if you demand some kind of perfection of frequentist modelling that Bayesian modelling can’t give you either, and actually no modelling can (because it runs counter to the nature of modelling).

              • Christian: of course everything is an idealization. But Frequentist statistics is an idealization of observable falsifiable facts, namely the future Frequency with which something will occur. It can’t be right except in rare idealized circumstances.

                Suppose with Frequentist hat on you say that F ~ normal(m,s) you think that your model has 2 unknown parameters. But in fact your model has 62 parameters or so and you are specifying 60 of them precisely (they describe the shape of the normal distribution). If I collect say 200 data points I can falsify your model because your model will predict that F should fall between say 5 and 10 more often than it really does…

                If you collect an infinite amount of data on the Bayesian model, you cannot falsify the model with observed frequencies, because the model doesn’t specify the frequencies; it specifies the knowledge you had at the beginning about unobserved things. The Bayesian model F ~ normal(m,s) says “if you tell me m and s I estimate that each individual F will be in the high probability region of normal(m,s)”; if you observe F 3000 times, it gives you a joint probability over the 3000 dimensional vector. You will only ever get ONE observation in a Bayesian model. All your data is ONE vector. There is no frequency.

                This I think is the essential difference. A Frequentist analysis is falsifiable by observation that the actual data deviates from the assumed sampling distribution. A Bayesian model is only “falsifiable” by Bayesian comparison with an alternative explanation.

                Of course, you can use p values for model checking in Bayesian stats, but it’s only relevant when your model is really a Bayesian model over frequency distributions. Then you can find out that indeed you don’t do a good job of matching the real-world frequencies.
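
The frequency check described in this comment (a model F ~ normal(m,s) predicting that F lands between 5 and 10 more often than it really does, with about 200 data points) can be sketched in Python. The two-cluster "real" data, the seed, and the plug-in estimates of m and s are assumptions chosen only to make the mismatch visible; the z-score ignores the uncertainty in the estimated m and s.

```python
import math
import random
import statistics


def normal_cdf(x, m, s):
    """CDF of normal(m, s) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))


def interval_check(data, m, s, lo, hi):
    """Compare the model's probability of [lo, hi] with the observed frequency."""
    predicted = normal_cdf(hi, m, s) - normal_cdf(lo, m, s)
    observed = sum(lo <= x <= hi for x in data) / len(data)
    # Standard error of the observed frequency if the model were exactly right.
    se = math.sqrt(predicted * (1.0 - predicted) / len(data))
    return predicted, observed, (observed - predicted) / se


rng = random.Random(1)
# Hypothetical "real" process: two clusters, so few values land in [5, 10].
data = [
    rng.gauss(4.0, 0.5) if rng.random() < 0.5 else rng.gauss(11.0, 0.5)
    for _ in range(200)
]
m, s = statistics.mean(data), statistics.stdev(data)
predicted, observed, z = interval_check(data, m, s, 5.0, 10.0)
```

Here the fitted normal assigns roughly half its probability to the interval while almost none of the data falls there, so `z` is strongly negative: the assumed sampling distribution is contradicted by the observed frequencies.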

              • Christian Hennig says:

                Daniel: “But Frequentist statistics is an idealization of observable falsifiable facts, namely the future Frequency with which something will occur.”
                One can see this as a feature rather than a bug, see Andrew’s and my paper on “Beyond objective and subjective” as linked by Keith further down.

              • Corey says:

                If as a Bayesian you’re postulating F ~ normal(m,s) with 3000 observations, there are prior predictive distributions (over the entire data vector) for quantities that are basically statistics for tests of normality. Do you want to make decisions or bets on the basis of this prior predictive distribution? I sure don’t.

              • It is a foregone conclusion before you collect any data that your frequency model is false. The extent to which this is a problem for you can be highly variable. Suppose for example that you are modeling stock market returns, and we are all actually living in a computer simulation. In the computer simulation daily returns are *actually* normal(0,1) * 0.9997 + cauchy(0,1) *0.9993 then there IS NO average and yet you find this out only after something like 10000 days and bankrupting an entire industry.

                To me the larger point is this:

                1) Frequentist logic is 2 valued logic with irreducibly uncertain outcomes that follow stable patterns through time.

                2) Bayesian logic is real-valued logic with reducibly uncertain outcomes that follow rules given by physics and chemistry and biology to within a range of prediction given by a real-valued weight function over conceivable outcomes.

                3) (2) completely contains all the logical parts of (1) as a special case

                4) Most of the problems with (1) come from bolting on things that fail to follow logic (such as p < 0.05 means round-off to 0.0) or failing to have a way to make assumptions other than “irreducible uncertainty with stable patterns”

              • Corey, sorry, it’s not clear who you’re asking your question to. Was it me or Christian?

                In case it was me, F ~ normal(m,s) with 3000 observations is intended to inform me about m,s which are uncertain quantities that have priors. I never make any predictions about the F vector, I only plug in the F values I collect and make predictions about the m,s.

              • Sorry, it should be normal(0,1) * 0.99997 + cauchy(0,1) * 0.00003
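
A small Python sketch of the corrected return process, for anyone who wants to watch the running average. It reads the expression literally as a weighted sum of a standard normal draw and a standard Cauchy draw (an assumption; a two-component mixture is another possible reading), and the function names and seed are illustrative.

```python
import math
import random


def simulated_returns(n, w_normal=0.99997, w_cauchy=0.00003, seed=0):
    """Draw n values of w_normal * normal(0,1) + w_cauchy * cauchy(0,1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        # Standard Cauchy draw via the inverse CDF of a uniform variate.
        c = math.tan(math.pi * (rng.random() - 0.5))
        out.append(w_normal * z + w_cauchy * c)
    return out


def running_means(xs):
    """Running average after each observation."""
    total, means = 0.0, []
    for i, x in enumerate(xs, start=1):
        total += x
        means.append(total / i)
    return means


returns = simulated_returns(10000)
means = running_means(returns)
```

Because the Cauchy term has no mean, the sum has none either, even though over a run of this length the draws will typically look indistinguishable from normal ones.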

              • Corey says:

                You are really that sure — that willing to bet — that there’s no skew in the actual data to be observed? That’s a bullet I won’t bite.

                Elsewhere I’ve made the point that among all of the isomorphic plausibility systems allowed by Cox’s theorem, only one — probability — appears in the Law of Large Numbers for exchangeable random variables, and it’s this strong connection between probability and expected frequency that helps resolve the underdetermination of the Cox theorem result. I can’t have my cake and eat it too — and in any event, I’ve always been more concerned than you and Joseph about the lack of predictive calibration inherent in the truth-in-the-high-density-region-is-good-enough stance.

              • Corey says:

                Daniel, my bad — I didn’t read all your comments, just clicked on your latest on the sidebar and read it as a direct response to my question (which was indeed directed to you).

                A situation can easily arise where you do care about predictions — your action space and loss function could be predictive after all — and then you’d best start caring about predictive calibration… There’s an example in BDA in which a log-normal model works very well for one estimand and disastrously for another due to failure to get the tail correct.

              • Christian Hennig says:

                Daniel:
                “It is a foregone conclusion before you collect any data that your frequency model is false.”
                It’s a model. If you take it too literally, it’s false, yes. Same with Bayesian models of knowledge.

                “Suppose for example that you are modeling stock market returns, and we are all actually living in a computer simulation. In the computer simulation daily returns are *actually* normal(0,1) * 0.9997 + cauchy(0,1) *0.9993 then there IS NO average and yet you find this out only after something like 10000 days and bankrupting an entire industry.”
                …which has nothing to do with whether your model is Bayesian or frequentist; using a plain normal model in a setup that is prone to outliers will get the Bayesian as easily into trouble as the frequentist.

                “Most of the problems with (1) come from bolting on things that fail to follow logic (such as p < 0.05 means round-off to 0.0) or failing to have a way to make assumptions other than “irreducible uncertainty with stable patterns””
                Nothing in the frequentist interpretation of probability enforces treating p<0.05 as 0.0, and frequentists can make all kinds of assumptions that the Bayesian can make, because every Bayesian model can be given a frequentist interpretation.

              • ojm says:

                Corey – off the top of my head, at some point Cox assumes every proposition is T/F so P or not P has probability one. If you drop this you can get likelihood (for example). In fact likelihood provides a semantics for possibility logic. Search for Dubois.

                Why isn’t it desirable to assume P or not P in general? Well, identifiability for one: multiple mutually incompatible models can be equally consistent with your observables. There is a uniqueness issue. I saw a generalisation of Cox’s approach to constructive logic in which case P or not P also drops out. Note also in Cox’s book that when he extends to vectors of propositions he also obtains a weaker logic than Boolean logic. I feel like Jaynes didn’t read that far or something.

              • Corey, yes, of course sometimes I do care about prediction accuracy, when I care about frequency prediction, then I’ll build the more flexible model I was talking about (the GMM one with say 60 shape parameters for example) and then when I do my Bayesian analysis I’ll see whether I can pick out one particular shape as being strongly favored, but if I can’t I don’t see how it will help me to pretend that I know those 60 parameters exactly, which is what is done in a typical Frequentist analysis.

                I have no problem with dealing with stable frequencies, I just want to concentrate the inevitable Bayesian uncertainty around a particular shape of the distribution before doing the frequency calculations.

              • ojm says:

                Here’s one quick read at a general high level:

                Is Probability the Only Coherent Approach to Uncertainty?

                http://www.colyvan.com/papers/ipocatu.pdf

              • Corey says:

                Daniel, would you say this falls under the full-setting-vs-simplified-setting correspondence we were talking about recently or do you justify it in some other way?

                ojm, if I have multiple mutually incompatible models that can be equally consistent with my observables, I can’t see how rejecting “P xor not-P” helps me; the issues seem orthogonal to me. What connection do you see?

                I don’t recall Cox investigating vectors of propositions per se. I do recall him working on a logic of questions which did involve systems of collections of propositions; IIRC the aim was to provide a quantitative relationship between the informativeness of answers to various questions in the same way that Cox’s theorem gets at a quantitative relationship between the plausibilities of uncertain propositions. There’s a guy named Kevin Knuth who has picked up and continued that program.

                I’ll have a look at Dubois’s stuff. I started the Colyvan paper and it’s making me grit my teeth — both Cox and Jaynes were very clear from the outset that they did not purport to quantify all uncertainty (like the definitional uncertainty Colyvan is going on about) but only uncertainty about propositions for which “P xor not-P” makes sense. But I’ll persevere… (A dude named Alain Drory pulled a similar trick about Jaynes’s analysis of Bertrand’s problem: he set up a straw man, misattributed it to Jaynes, and then knocked it down by arguing for the position Jaynes actually held.)

              • Corey says:

                Half the paper is blather about how definitional uncertainty is different from epistemic uncertainty. The article is in Risk Analysis but I’m going to guess that Colyvan is a philosopher… and indeed he is. In a triumph of hope over expectation I’m going to finish the paper anyway.

              • ojm says:

                I can’t say I especially recommend the Colyvan paper – I just quickly googled to see if there was a ref to point to. But the basic points are simple, made by many and to me pretty obvious.

                Quickly – the point is probability is an additive measure of uncertainty. This makes sense for observables. Not so much for unobservables for which our uncertainty is often non-additive. This comes up all the time in practice when dealing with non-identifiable models.

                Why did you say to use the ‘obvious’ equivalence class above? To restore additivity it seems to me.

                Why doesn’t Andrew like the idea of the probability of a model, preferring predictive checks? I’d argue it reflects an at least implicit recognition of non-additive uncertainty.

              • ojm says:

                RE: Cox. See II.9 p 53:

                “we may conclude that there is no analog, or at least complete analog, in the algebra of systems [of propositions], to contradiction in the algebra of propositions”

                In his notation there is no A or ~A for systems of propositions. The lazy answer to this is ‘we only need propositional logic’ but that program seems to me to have failed in all foundational mathematical and philosophical projects.

              • ojm says:

                (For example we need quantifiers, variables, sets etc to express more than trivial mathematical ideas)

              • Corey says:

                ojm, if I give both of my kids some Easter candy and later find a candy wrapper on the floor, the fact that the two models that predict this observable are non-identified does not seem to bear on the applicability of the law of the excluded middle to the various propositions involved and the resulting additivity of the probabilities I attach to them. In general if I have a collection of hypotheses that all have the same likelihood function, all that means is that while the data can change the probability of the disjunction of the hypotheses it cannot alter the ratio of any two hypotheses in the collection. That’s all, and it’s fine.

                Andrew’s been pretty clear on why he doesn’t like probabilities for models — it’s because they have a sensitive dependence on details of the prior that have negligible effect on inferences within the model. Also because the Occam factor argument that physicists tend to use to argue in favour of that dependence place a value of simplicity which is inappropriate in the social science context of the models he works with.

                Regarding predicate calculus and quantifiers and such, all I can say is that I’ve never felt the lack.
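                Corey's claim above, that hypotheses sharing a likelihood function keep their probability ratio while the probability of their disjunction can still move, can be checked with a small calculation. This is only a sketch: the hypothesis names, the third "neither" hypothesis, and all of the numbers are made up for illustration and do not come from the thread.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}

# Hypothetical numbers: kid_A and kid_B predict the wrapper equally well
# (same likelihood), while "neither" predicts it less well.
prior = {"kid_A": Fraction(1, 2), "kid_B": Fraction(1, 4), "neither": Fraction(1, 4)}
likelihood = {"kid_A": Fraction(1, 2), "kid_B": Fraction(1, 2), "neither": Fraction(1, 10)}

post = posterior(prior, likelihood)

# The ratio between the two non-identified hypotheses is untouched by the data...
prior_ratio = prior["kid_A"] / prior["kid_B"]
post_ratio = post["kid_A"] / post["kid_B"]

# ...but the probability of their disjunction is updated.
prior_disjunction = prior["kid_A"] + prior["kid_B"]
post_disjunction = post["kid_A"] + post["kid_B"]
```

                With these invented numbers the ratio stays at 2 before and after, while the disjunction moves from 3/4 to 15/16. Whether that behaviour is "fine" is exactly what the two commenters disagree about; the code only illustrates the arithmetic.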

              • Corey says:

                “place a value of simplicity” -> “places a value on simplicity”

              • ojm says:

                So you would be happy to start with prior prob 1/2 (say) for two equivalent models of the situation? And then 1/3 if you introduced a third? Etc? And you only care about probability ratios?

                Anyway, this feels pretty pointless – some people will never be convinced to consider alternatives once they’re committed to their axioms. I’ve tried many times to raise the simple possibility that the Jaynes-Cox argument is nowhere near as convincing of ‘the one true way’ as it seems to be made out. Maybe alternatives are also interesting?

                So again, I’m not saying probability is not a useful tool I’m saying there are many reasons why Jaynesian proselytizing falls on deaf ears.

                I’ll leave Andrew to confirm, deny or remain silent on your and my interpretation of his philosophy.

              • ojm says:

                If you set the prior probability of kid A leaving it as 0.5, is the probability of ‘not kid A’ 0.5? Does ‘not kid A’ include ‘it fell out of my own pocket without me noticing’. Is ‘not kid A’ a well-defined proposition?

              • ojm says:

                “So you would be happy to start with prior prob 1/2 (say) for two equivalent models of the situation? And then 1/3 if you introduced a third? Etc? And you only care about probability ratios”

                Point being 1/2, 1/2 and 1/2 instead of 1/3, 1/3 and 1/3 also satisfies everything except normalisation. But dropping this blocks key steps of Cox’s argument.

              • Corey says:

                “In his notation there is no A or ~A for systems of propositions. The lazy answer to this is ‘we only need propositional logic’ but that program seems to me to have failed in all foundational mathematical and philosophical projects… some people will never be convinced to consider alternatives once they’re committed to their axioms. I’ve tried many times to raise the simple possibility that the Jaynes-Cox argument is nowhere near as convincing of ‘the one true way’ as it seems to be made out. Maybe alternatives are also interesting?”

                Cox’s project here is basically to derive entropy as a quantification of the informativeness of answers to questions in a way similar to the way he derives probability as a quantification of the plausibility of propositions. No one is talking about “all foundational mathematical and philosophical projects” — the question at hand is “how do we learn from data”. If people want to update possibility and necessity measures in light of data, more power to them, but I personally will not resort to that approach or other similar ones unless I can’t get Bayes to work in the same problem — that is, unless the Cox desiderata prove unsuitable to the task at hand. I keep an eye on those other approaches in case I need a fallback, but I haven’t yet. I’d be happier with possibility theory in particular if it had a well-developed decision theory whose application is comparable in difficulty to that of applying the principle “minimize posterior expected loss”.

                “If you set the prior probability of kid A leaving it as 0.5, is the probability of ‘not kid A’ 0.5? Does ‘not kid A’ include ‘it fell out of my own pocket without me noticing’. Is ‘not kid A’ a well-defined proposition?”

                This is the catch-all hypothesis issue, not identifiability (i.e., that “multiple mutually incompatible models can be equally consistent with your observables”). I’m happy to concede that the catch-all hypothesis issue is thorny philosophically but not too difficult to handle in practice; Christian and Daniel’s discussion below about forking paths is basically about this.

                Here’s identifiability: consider a non-linear model that has a sigmoidal shape (e.g., a logistic function): as a function of the (1D real) predictor, the output is level for a stretch, then smoothly rises to a new level. There are four parameters; my preferred parameterization is: central point (that’s two parameters), slope at central point, and vertical range. (There will also be a noise model, but let’s assume it’s known.) For some data sets all parameters will be identified, but if the data don’t actually saturate on both ends and instead have a hockey stick shape or a straight line shape then the range parameter is lower-bounded but not upper-bounded and some directions of parameter space have flat likelihood functions. This isn’t even strict non-identifiability — it’s data-dependent — but all of the problems of non-identifiability can potentially arise. I’ve given a fair bit of thought to approaches for setting reasonable priors for this model.
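                The data-dependent flatness described above can be sketched numerically. The parameterization below (central point, slope at the central point, vertical range) follows the description in the comment, but the specific logistic form, the noise standard deviation of 0.1, and the noise-free straight-line data are assumptions chosen for illustration.

```python
import math

def sigmoid_curve(x, x0, y0, slope, vrange):
    """Logistic curve through (x0, y0) with the given slope there and total vertical range."""
    k = 4.0 * slope / vrange  # steepness that yields the requested central slope
    return y0 + vrange * (1.0 / (1.0 + math.exp(-k * (x - x0))) - 0.5)

def log_likelihood(xs, ys, x0, y0, slope, vrange, sigma=0.1):
    """Gaussian log-likelihood with known noise sd (additive constant dropped)."""
    return -0.5 * sum(
        ((y - sigmoid_curve(x, x0, y0, slope, vrange)) / sigma) ** 2
        for x, y in zip(xs, ys)
    )

# Straight-line data that never saturates at either end (assumed, noise-free).
xs = [i / 10.0 for i in range(-20, 21)]
ys = [x for x in xs]

# Hold the other three parameters at their true values and vary only the range.
ll_by_range = {r: log_likelihood(xs, ys, 0.0, 0.0, 1.0, r) for r in (2.0, 10.0, 1e3, 1e6)}
```

                In this sketch a small range is clearly penalised, so the range is lower-bounded, but the log-likelihood at a range of 1e3 and at 1e6 is practically identical: once the curve is effectively linear over the observed predictor values, the data say nothing further about where it levels off.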

              • ojm says:

                Decision theory isn’t that appealing to me but try Cattaneo’s recent work (was a PhD student of Hampel).

                My view is that identifiability, catchall etc are all entangled via implicit uniqueness assumptions. Having a prior that integrates to one amounts to assuming there is a true solution we just don’t (yet) know which. This is particularly problematic in non-identifiable problems for which there is inherently not a unique solution.

                I don’t want priors to ‘fix’ identifiability issues – I want to explore identifiability issues. Bayes works well when you are confident there are a set of mutually exclusive, identifiable possibilities of which only one is true but you are a little quantitatively uncertain which it is.

                RE: Cox. Yes, he looks at a slightly more complicated system involving multiple propositions and immediately has to drop one of the stronger conditions from the simpler system and come up with an alternative (here entropy).

              • ojm. I think you might have a point that is interesting to explore in the context of computing or language or mathematical logic / set theory.

                Bayes works by p(Data,Parameters | Knowledge) that is, the joint probability over observed and unobserved quantities that we assign given a large database of stylized facts.

                Now, I’m perfectly happy with the idea that the mapping from “Knowledge” to p(Data,Parameters | Knowledge), that is the assignment of formulas for probability distributions, is itself an imprecise problem which is NOT amenable to modeling through Bayes. This is kind of the Gödel’s incompleteness of Bayes: you have an uncertainty about what meanings to assign to the facts in your knowledge set which is not in general amenable to Bayesian calculation.

                Is this the essence of what you have been bringing up? or at least close?

              • ojm says:

                Something along those lines.

                A mapping from ‘background’ knowledge to probability models

                Knowledge -> P(Data,Parameters)

                can be considered as a family of probability models indexed by ‘knowledge’. If you are confident that only one is correct and you just don’t know which then a probability measure over these would likely make sense. Then you have

                P(Data,Parameters | Knowledge)

                and

                P(Knowledge)

                and knowledge is ‘within’ the overall model too. If not you still have an indexed family available (eg a likelihood function). You can also use Bayes to update within the model

                P(parameters | data, knowledge)

                etc where background knowledge always stays on the right hand side.

                But the same issue arises with non-identifiable parameters – they are problematic to update _within_ the probability model because you no longer satisfy normalisation/additivity etc – multiple inconsistent parameter sets can all have possibility one but not all have probability one. When you start dealing with large parameter spaces, enforcing the ‘true but unknown’ assumption required for probability becomes more questionable. You can still fall back on possibility, though – possibility being essentially probability without normalisation and being eg maxitive rather than additive.
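                The contrast ojm draws between additive and maxitive measures can be made concrete with a toy calculation. This is a sketch under stated assumptions: three hypothetical parameter sets with identical likelihood, a flat prior for the probability version, and likelihood rescaled by its maximum as a simple stand-in for a possibility measure. It is not taken from any particular possibility-theory text.

```python
def normalise_additive(likelihood):
    """Probability-style normalisation: values sum to one (flat prior assumed)."""
    total = sum(likelihood.values())
    return {h: v / total for h, v in likelihood.items()}

def normalise_maxitive(likelihood):
    """Possibility-style normalisation: the best-supported value gets one."""
    top = max(likelihood.values())
    return {h: v / top for h, v in likelihood.items()}

# Three non-identified parameter sets: the data cannot tell them apart.
likelihood = {"theta_1": 0.2, "theta_2": 0.2, "theta_3": 0.2}

prob = normalise_additive(likelihood)
poss = normalise_maxitive(likelihood)

# Disjunctions combine by addition for probability, by maximum for possibility.
prob_of_1_or_2 = prob["theta_1"] + prob["theta_2"]
poss_of_1_or_2 = max(poss["theta_1"], poss["theta_2"])
```

                Here each parameter set gets probability 1/3 but possibility 1, and adding a fourth equally supported set would lower every probability while leaving every possibility unchanged, which is the 1/2-versus-1/3 point raised earlier in the thread.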

              • Corey says:

                ojm, thanks for the pointer to Cattaneo’s thesis; it looks promising.

      • Carlos Ungil says:

        > When it comes to applying probabilities to data generating processes, the only thing that makes sense is to use the science to predict outcomes, and then acknowledge the imprecision of our predictions.

        How do you use science to predict when is an unstable atom going to decay?

    • Carlos Ungil says:

      “A basic distinction is whether they model data generating processes in the world, or rather a state of knowledge. I don’t think these two should be mixed in the same analysis”.

      Why not? Unless I misunderstand what you say, I think Bayesian analysis does precisely work by combining the probabilities of the data generating process (i.e. the probability of data conditional on the parameters) with the probabilities representing the state of knowledge (i.e. the prior distribution for the parameters) to update the state of knowledge (ie. the posterior distribution for the parameters conditional on the data).

      • Christian Hennig says:

        Carlos: In the epistemic Bayesian setup, the “probabilities of data conditional on the parameters” are also epistemic; they are models of knowledge and epistemic uncertainty considering data generating processes, they are *not* models of the data generating processes themselves. If you read about the foundations of Bayesian probability, e.g., de Finetti or Jaynes, you’ll find that they very explicitly state that the probabilities are not “located” in the world, implying among other things that data observed after modelling cannot present evidence that the model was “wrong”.
        And this has good reasons because probability calculus can be justified from epistemic axioms, it can be justified (or at least motivated) from properties of relative frequencies, but I’m not aware of any setup of probability calculus that mixes them. If the prior probabilities of parameters are epistemic, and those given the sampling/data model are aleatory, it is not clear what kind of animal the posterior probabilities (derived from both of them) are.

        • The animal is epistemic probabilities over the parameters that describe the f(R) frequency function of repeated samples of R.

          I’m fine with doing frequency calculations, it’s just that you need to tell me why you believe f(R) is a property in the world and under what special conditions it is stable.

          Typically you might start with “biochemistry tells me that all these particular situations I’m studying in the lab are biochemically similar so a stable frequency should be observed, let f(R) be a member of an extremely flexible family of distributions, such as a gaussian mixture model with 20 components… here are the epistemic priors I have over the 60 quantities that completely describe the 20 component GMM, here is a bunch of data… in the end the epistemic distribution over the 60 vector is narrowly concentrated around some particular value Q in the 60 dimensional vector space, it’s narrow enough that I can choose Q as a sufficient approximation and call fQ(R) “the frequency distribution””

          As soon as you do that and then start doing frequency calculations, you and I are on board the same spaceship.

          • Christian Hennig says:

            This needs exchangeability, see above. No biochemistry will tell you that anything is exchangeable. (Or if it does, it’s a bold idealisation that you shouldn’t believe and the biochemist won’t either. And if you do, you can well set up a frequentist model.)

            • Exchangeability in Bayesian analysis is a property of our knowledge about the situation. It tells us that if what we know about the situation is symmetric between all the individual situations, that we can use a single distribution for the whole ensemble. So a Biochemist won’t tell you “these are all exchangeable” but if you ask a biochemist “as far as you know was there anything different that you did in all of these experiments?” and they say no… you’re talking about an exchangeable model.

              The exchangeability is in the applicability of a single simplified state of knowledge to the assignment of the probabilities.

              • symbolically if we have x[i] for i = 1..N and in each case everything we know about x[i] is contained in K[i] and K[i] = K the same exact facts for all i then

                p(X | K) = product(p(x[i] | K), i = 1..N)

                this holds because K is a constant, rather than something that changes for each data point. As soon as you have a time-series structure, or a potential change-point model or whatever, then this symmetry property doesn’t hold and you don’t have exchangeability. For example you might do a gaussian process for a signal through time, where every data point is really just part of ONE function whose values you observe with error at various times or places. The fact is, you’re putting probability on *one* object, a vector of observations.

                Though in actual fact, most biochemistry examples I’ve done involve “these are all from one batch, and then this bunch are another batch, and Joe did some extra ones… and then we repeated our first batch with the reagents we re-ordered from a different supplier…”

                So that in the end, exchangeability is only within small groups of data points.

                Taken as an expression of symmetry of knowledge, the Bayesian IID exchangeability is un-complicated. Taken as an expression of an assertion about the world that *frequencies of outcomes* will be constant, it’s a highly objectionable and essentially always wrong on the face of it assumption.

              • Christian Hennig says:

                Assuming exchangeability is much stronger than saying that “we don’t know any reason why the situation shouldn’t be symmetric”, it means that “we know for sure that the situation is symmetric”. A model for the former would allow one to move away from symmetry if observations strongly suggest otherwise, but by assuming exchangeability you decide to forgo that option.

              • Christian: no it only means that if you think your distribution is a frequency distribution.

                The Bayesian can say “the frequency of outcomes is f(Data | Params)” and these are exchangeable among the observed data and all the future data of interest to me…. But this is a model assumption that the Bayesian doesn’t HAVE to make. The Bayesian can simply say “my probability over what would occur is p(Data | Params) for the particular set of data run in this particular subset of my current experiment” and the assumption means nothing about symmetry of the physics, it means only symmetry of the knowledge.

                On the other hand, I see no way around assuming symmetry of the physics in Frequentist logic.

              • Christian Hennig says:

                Daniel: It assumes that the knowledge says that it’s symmetric. If you are actually not sure that it’s symmetric and you think it may not and in the future you may learn more about this from observations, you can’t model this using exchangeability.

                Once more, if you start from an exchangeability model, no observation whatsoever can get you out. It means that you don’t only treat the situation as symmetric before observations, it means that you commit yourself to treating it symmetrically forever regardless of what the observations are.

                I don’t think that such knowledge ever exists. I rather think that the exchangeability model is an idealised *model* of the knowledge ignoring some of its complicating subtleties, as is the frequentist model an idealised model of the data generating process that commits us to ignore some subtleties for the sake of simplicity.

        • Carlos Ungil says:

          > Carlos: In the epistemic Bayesian setup, the “probabilities of data conditional on the parameters” are also epistemic; they are models of knowledge and epistemic uncertainty considering data generating processes, they are *not* models of the data generating processes themselves. If you read about the foundations of Bayesian probability, e.g., de Finetti or Jaynes, you’ll find that they very explicitly state that the probabilities are not “located” in the world, implying among other things that data observed after modelling cannot present evidence that the model was “wrong”.

          Jaynes spends many pages in his Probability Theory book talking about extracting balls from urns (you can’t get more world-located than that), and the last section in that chapter includes the following remarks (slightly edited):

          “Sampling distributions make predictions about potential observations. If the correct hypothesis is indeed known, then we expect the predictions to agree closely with the observations. If our hypothesis is not correct, they may be very different; then the nature of the discrepancy gives us a clue toward finding a better hypothesis. This is, very broadly stated, the basis for scientific inference.”

          “In virtually all real problems of scientific inference we are just in the opposite situation; the data D are known but the correct hypothesis H is not. Then the problem facing the scientist is of the inverse type: Given the data D, what is the probability that some specified hypothesis H is true? […] In the present work our attention will be directed almost exclusively to the methods for solving the inverse problem. This does not mean that we do not calculate sampling distributions; we need to do this constantly and it may be a major part of our computational job.”

          The likelihood is obviously related to the sampling distribution (although it’s not exactly the same thing, the likelihood is not a probability distribution because we switch the roles of the variables). It looks as a model of the world to me, at least as much as when the same sampling distribution appears in frequentist methods. Of course probability can mean other things beyond frequencies, but I think sampling distributions are one of the places where the connection appears automatically.

          Regarding the evidence for models being “wrong”, you can look at different models (for example a simple model embedded in a more complex model). And there are also ways to do “goodness of fit” tests (I’m not sure if that’s the frequentist alternative that you’re thinking of). Jaynes discusses the issue extensively, with the following conclusion:

          “Our discussion of significance tests is a good example of what, we suggest, is the general situation: if an orthodox method is usable in some problem, then the Bayesian approach to inference supplies the missing theoretical basis for it, and usually improvements on it.”

          • Christian Hennig says:

            Carlos: Thanks for reminding me of this. I read this long ago and it got kind of overwritten in my brain by some other things in his book, e.g., (regarding urns, p. 52) “The probability assignments (for drawing red or white balls) are not assertions of any physical property of the urn or its contents; they are a description of the state of knowledge of the robot prior to the drawing.”
            Like with Fisher, one can probably occasionally get apparently inconsistent messages from Jaynes. No doubt somebody can explain why they are not really inconsistent. My understanding is that Jaynes embeds the statements cited by you into a general epistemic Bayesian logic, i.e., one better sets up a model that includes the possibility that ball drawing doesn’t work in the way elementary sampling theory would imply and assign probabilities to that, too. Then obviously data can shift probability from one submodel away to another. Still then data cannot give evidence against the bigger model that was used to allow testing the more restrictive one within it.

            Ah! You say it yourself!

            Within a supermodel, all this inference can be done for submodels. But what is ultimately modelled is still to what extent one should believe (or not) – even within the submodels – rather than the data generating mechanism itself, which was my original point.

            • Carlos Ungil says:

              To be fair, Jaynes strongly denied the existence of “physical probabilities” even in the realm of quantum mechanics that I mentioned elsewhere. “Quantum physicists have only probability laws because for two generations we have been indoctrinated not to believe in causes – and we have stopped looking for them.”

              I think the likelihood represents a model of the world in the Bayesian setting as it does in the frequentist setting. I still don’t get your point. What kind of evidence regarding the model do alternative inference frameworks provide which is missing in the Bayesian framework?

              • Christian Hennig says:

                I prefer to speak of the epistemic interpretation of probability rather than the “Bayesian setting”, because Bayesian models can have aleatory interpretations, in which case they indeed model data generating processes. They only don’t if they don’t.

                But that’s the thing: Jaynes models the state of knowledge of his robot, not the data generating process. Which means that the full model (I’m not talking about shifting probabilities within the full model around between submodels) cannot be tested against data, because it is simply not about where the data come from. Daniel, who surely has a good grasp of Jaynes-type Bayesianism, explains the same thing above: “If you collect an infinite amount of data on the Bayesian model, you can not falsify the model with observed frequencies, because the model doesn’t specify the frequencies it specifies the knowledge you had at the beginning about unobserved things.”

                Also I think that you’d need to decide: Are your probabilities epistemic or aleatory? I don’t think it works to have an epistemic prior and an aleatory sampling model, because none of the approaches to the foundations of probability licenses you, in this case, to use both of these kinds of probabilities in the same calculus.

              • Carlos Ungil says:

                > the full model cannot be tested against data

                As I mentioned, Jaynes discusses “goodness of fit” tests. If this is not testing a model against data, what do you mean precisely?

                Gelman and Shalizi in their “Philosophy and the practice of Bayesian statistics” list Jaynes as one of the writers who “emphasized the value of model checking and frequency evaluation as guidelines for Bayesian inference” and later single out Jaynes in particular: “A more direct influence on our thinking about these matters is the work of Jaynes (2003), who illustrated how we may learn the most when we find that our model does not fit the data – that is, when it is falsified – because then we have found a problem with our model’s assumptions.”

                Are they talking about testing a model against data?

              • The Bayesian analyst SHOULD decide which objects in the model represent frequencies but there’s no fundamental reason they can’t use frequencies.

                Suppose a Bayesian is told that a computer is outputting a sequence of random numbers from some univariate distribution with a mean and standard deviation… Are you saying this Bayesian isn’t logically warranted in collecting a large set of data, and then define say a Gaussian Mixture Model (GMM) with a large number of parameters, and then fit those parameters in such a way that they specify p(Data | Parameters) = GMM(Data | Parameters) because this GMM doesn’t represent a state of knowledge it represents a physical frequency enforced by computer program?

                That seems extremely odd to me.

                What my complaint is is that the Frequentist is willing to go the other route, they simply before seeing any data specify that the GMM is actually definitely a one-component mixture with unknown mean and standard deviation and then proceed to “guarantee” that their interval estimates of the mean and standard deviation are guaranteed to be right 95% of the time that they collect data even though they haven’t established in any way that this initial assumption of gaussianity is relevant (and the same thing happens with lognormals, gammas, beta, whatever, they choose a particular family based on a hunch and maybe a failure to reject using a test and then predict frequencies from it!)

                What’s more is that they tend to apply this in settings like 30 samples from human randomized controlled drug trials etc where serious deviations from whatever your assumed shape is are essentially guaranteed at least in the even medium-short run (like if you do this in 5 or 10 different problems one of them will have a serious violation that you can’t detect because you only have 18 data points)

              • Christian Hennig says:

                Carlos: “As I mentioned, Jaynes discusses “goodness of fit” tests. If this is not testing a model against data, what do you mean precisely?”
                If you follow Jaynes formally, this means that you define model and some alternative in the framework of a supermodel, and goodness-of-fit testing shifts probabilities around between submodels of the supermodel, which in itself you can’t test. However, it may well be that Jaynes also made remarks to the effect that you can test the supermodel without setting up a Bayesian super-supermodel, which then means that you go out of the Bayesian framework. I have read the Probability Theory book but I can’t claim to be the very best expert on Jaynes and can’t claim for sure that he consistently sticks to the Bayesian setup overall.

                What I mean is still that epistemic probabilities do not model the data generating process in itself, only a person’s knowledge about it, and this holds for “sampling models” as well, and Jaynes is quite clear about this in his book.

                Daniel: You know the wrong frequentists. Frequentists do model checking. Empirically I’m fairly sure that they do more of it than the Bayesians (Andrew may have some good influence there).

                If you get the same people you are referring to to do a Bayesian analysis instead, they will make the same hash of it.

                “Suppose a Bayesian is told that a computer is outputting a sequence of random numbers from some univariate distribution with a mean and standard deviation… Are you saying this Bayesian isn’t logically warranted in collecting a large set of data, and then define say a Gaussian Mixture Model (GMM) with a large number of parameters, and then fit those parameters in such a way that they specify p(Data | Parameters) = GMM(Data | Parameters) because this GMM doesn’t represent a state of knowledge it represents a physical frequency enforced by computer program?”

                I don’t know whether this was for me, but I’m not the one who is overly prescriptive here. Surely the Bayesian can do that but still it’s either fully aleatory or fully epistemic, and I’d like the Bayesian to tell me what it is.

    • Keith O'Rourke says:

      Christian:

      Before I fully respond could you put your comments in the context of this excerpt from your recent paper
      “A key issue regarding transparency of falsificationist Bayes is how to interpret the parameter prior, which does not usually (if occasionally) refer to a real mechanism that produces frequencies. Major options are firstly to interpret the parameter prior in a frequentist way, as formalizing a more or less idealized data generating process generating parameter values.” http://www.stat.columbia.edu/~gelman/research/published/objectivityr5.pdf

      One interpretation of your comment 1 would be that a Bayesian analysis should never be done as it combines the prior and likelihood e.g. log(posterior) ~ log(prior) + log(likelihood). I don’t think you meant that but a recent reviewer made exactly that argument.

      Comments 2 and 3 I largely agree with and for 3 I would put this as a case where a literal interpretation of the posterior would be OK.

      • Christian Hennig says:

        Keith: *If* the parameter prior is interpreted in a frequentist way (which is possible but may often require quite strong idealisation), then Bayes using such a parameter prior uses the frequentist interpretation of probabilities all the way, and an inconsistency problem doesn’t arise.

        “One interpretation of your comment 1 would be that a Bayesian analysis should never be done as it combines the prior and likelihood e.g. log(posterior) ~ log(prior) + log(likelihood).”
        I’d agree but only *if* the prior is interpreted as epistemic and the likelihood is interpreted as aleatory (frequentist); the posterior is then a muddle. Many people seem to have such an interpretation in mind but one can do it in a fully frequentist manner (see above) or in a fully epistemic manner (de Finetti, Jaynes,…).

        • Keith O'Rourke says:

          A lot of comments to digest here, but thought I would add a few.

          Christian’s phrase “they are modes of thinking and communication” underlines why I prefer the term representation to model or information or knowledge – as there is much more in representations (i.e. abductions or good guesses).

          Now, I used to make a distinction between aleatory and epistemic which (by an argument by authority from Andrew ;-) ) Andrew seemed to ignore. I am not sure anymore.

          Let me put it this way, using two stage simulation as a conceptual way to represent _a_ _Bayesian_ approach.

          Rep1. You have a representation of an empirical phenomenon with unknown parameters, a random model for how the parameters’ values were set/determined and a random model for how the observables came about and were observed/recorded.
          Rep2. Given observations in hand, we restrict the representation to one having exactly/approximately those observation values.

          Rep1 gives a distribution of parameters’ values and Rep2 gives an arguably more appropriate distribution of parameters’ values.

          Bayes theorem is just a diagrammatical representation of the relationship of Rep1 to Rep2 which allows us to experiment and think about it. Out of this we get labels like Bayes, prior, likelihood (noticed as ~ Rep2/Rep1), posterior, etc. but there is no justification for taking prior and likelihood as separate entities. So why are we worried about re-combining them?
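
A minimal sketch of the two-stage simulation, assuming a toy binomial model with a uniform prior on the success probability. The function name, the prior, and the numbers (7 successes in 10 trials) are illustrative assumptions only. Rep1 is the set of all simulated parameter values; Rep2 keeps only those whose simulated observations match the ones in hand.

```python
import random


def two_stage_simulation(n_draws, n_trials, observed_successes, seed=0):
    """Simulate Rep1 and Rep2 for a toy binomial model (illustrative assumption)."""
    rng = random.Random(seed)
    rep1, rep2 = [], []
    for _ in range(n_draws):
        # Stage 1: a random model for how the parameter value was set
        # (assumed here to be uniform on [0, 1]).
        p = rng.random()
        # Stage 2: a random model for how the observations came about.
        successes = sum(rng.random() < p for _ in range(n_trials))
        rep1.append(p)
        # Rep2: restrict to simulations that reproduce the observed data exactly.
        if successes == observed_successes:
            rep2.append(p)
    return rep1, rep2


rep1, rep2 = two_stage_simulation(100_000, n_trials=10, observed_successes=7)
prior_mean = sum(rep1) / len(rep1)
posterior_mean = sum(rep2) / len(rep2)
print(f"Rep1 mean: {prior_mean:.3f}, Rep2 mean: {posterior_mean:.3f}")
```

Under these assumptions the Rep2 mean should land near 8/12, the mean of the Beta(8, 4) distribution that Bayes theorem gives analytically, without the prior and likelihood ever being handled as separate pieces.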

          • Christian Hennig says:

            This depends on how the result is interpreted. In a real situation I can imagine interpretations about which I’d not be worried and interpretations about which I’d be worried indeed (e.g., “probability that the null hypothesis is true” with the null hypothesis being interpreted in a frequentist manner).

  7. Christian: we’re out of room above, so starting over here. You say:

    “Once more, if you start from an exchangeability model, no observation whatsoever can get you out.”

    This is simply not true. Suppose on Friday I evacuate a tube and use it to drop balls with several timers in the absence of air resistance. I decide my apparatus is well behaved and all the observations are exchangeable.

    On Monday I come back and drop the balls in free air… there is absolutely nothing that prevents me from using this information to change the model I use for the errors in the Monday dataset. The Bayesian way is to condition on a set of knowledge, and the knowledge I have about the Friday dataset is different from the Monday dataset and I can immediately apply it to my model. This holds for any and all future experiments that I do as well.

    Exchangeability is a property of the knowledge set used for assigning the probability expressions.

    • Also: “It assumes that the knowledge says that it’s symmetric. If you are actually not sure that it’s symmetric and you think it may not and in the future you may learn more about this from observations, you can’t model this using exchangeability.”

      If you have some doubts about whether there should be a trend or a change point or some such thing that eliminates exchangeability, then there’s nothing keeping you from going back and putting those in the model when they seem to be warranted after seeing the data. We’ve been over this before about how any given Bayesian analysis is always some kind of truncated asymptotic expansion of the full set of models you’d be willing to entertain, and when that expansion becomes singular you’re free to go back and add in the “missing” components.

    • Christian Hennig says:

      Daniel: You can have a model that doesn’t state that observations on Friday are exchangeable with observations on Monday, true.
      I’d still claim that your Friday model is stronger than your knowledge. You don’t actually know that your observations on Friday are symmetric, you only don’t know any reason why they shouldn’t be. If you only take 10 observations on Friday, it may not matter. If you take 10,000, a good data analyst may well spot a difference that you have committed yourself to ignore by using exchangeability, and not because any proper knowledge forces you to do that.

  8. Christian Hennig says:

    “If you have some doubts about whether there should be a trend or a change point or some such thing that eliminates exchangeability, then there’s nothing keeping you from going back and putting those in the model when they seem to be warranted after seeing the data. We’ve been over this before about how any given Bayesian analysis is always some kind of truncated asymptotic expansion of the full set of models you’d be willing to entertain, and when that expansion becomes singular you’re free to go back and add in the “missing” components.”

    I’m not really convinced by the idea that you set up a model, claim that this only models knowledge and not the data generating process, then you see some data from that process and claim, “oh, the knowledge I had before the data was actually different from how I initially set up my model” and change your prior. You may be right that formally one can show that under so-and-so conditions this means that you can only get things so-and-so wrong, but I’d expect that in practice one wouldn’t normally worry so much about these so-and-so conditions and whether they are fulfilled, and it’s hard to do that anyway if you licence yourself to change the prior model in all kinds of ways after seeing the data. Certainly it’s not transparent unless you specify everything in advance, otherwise it’s a fine Bayesian garden of forking paths, isn’t it?

    • Christian. I don’t license myself to do anything, it’s just a fact that data can make the probability of your chosen model small enough that a model you chose to leave out would dominate.

      We always simplify because it’s too big of a task to put every possibility in at first, especially when some seem at first unlikely. But after data we can realize that even though the prior said it was unlikely, the posterior would have picked out that option if we had included it.
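
A minimal sketch of that arithmetic, assuming just two candidate models: the one actually run, with prior weight 1 - epsilon, and one left out, with prior weight epsilon. The function name and the log marginal likelihood values are hypothetical numbers picked for illustration.

```python
import math


def posterior_prob_alternative(epsilon, log_ml_main, log_ml_alt):
    """Posterior probability of the left-out model in a two-model mixture.

    epsilon      prior weight on the left-out model (assumed small)
    log_ml_main  log marginal likelihood of the data under the chosen model
    log_ml_alt   log marginal likelihood of the data under the left-out model
    """
    log_w_main = math.log1p(-epsilon) + log_ml_main
    log_w_alt = math.log(epsilon) + log_ml_alt
    # Subtract the max before exponentiating to avoid underflow.
    m = max(log_w_main, log_w_alt)
    w_main = math.exp(log_w_main - m)
    w_alt = math.exp(log_w_alt - m)
    return w_alt / (w_main + w_alt)


# Data that do not discriminate: the left-out model keeps its tiny prior weight.
no_evidence = posterior_prob_alternative(1e-6, -100.0, -100.0)
# Data that favour the left-out model by 30 log units: it now dominates.
strong_evidence = posterior_prob_alternative(1e-6, -130.0, -100.0)
print(no_evidence, strong_evidence)
```

Truncating epsilon to 0 is harmless in the first case and badly wrong in the second, which is the situation where going back to add the missing component matters.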

      • Christian Hennig says:

        The thing is that had you had the “unlikely” possibility in the model from the start with some nonzero probability, this would impact everything else even if eventually the simplest model still looks most likely.
        Perhaps you don’t license yourself to do anything, but you’d be willing to do certain things that are unspecified in the beginning and you only figure them out after looking at the data. That’s forking paths for me.

        Anyway, I don’t even disagree about whether this could at times be a reasonable thing to do. I’m just noting that when it comes to discussing frequentism, you insist that models should be perfectly true from the start (despite frequentists being happy to reject and adjust after looking at the data, which I am aware by the way comes with its own problems), whereas for the Bayesian you consider this apparently good practice.

        • Notes for clarification:

          I DON’T think it’s good practice to NARROW the model after data. This is the essence of forking paths… start with the obvious thing, if that doesn’t seem to get you the p value you want, look at something more specific and more specific until you find something (an interaction between interactions or whatever).

          What I think is OK is to broaden the set of possibilities you’re entertaining to include new possibilities that you had initially thought were probably irrelevant (I assumed my evacuated tube didn’t have any leaks, but in fact there was always some possibility that the air steadily leaked back in at some slow rate… etc).

          The “assumed no leaks” is equivalent to a delta function on the leak rate at r = 0. The broader model is maybe exponential(1.0/rsmall)

          Similarly, if there are discrete categories of things that were left out, you can put them back in without any problem: it always spreads the probability onto a larger set of possibilities, and it becomes the data’s job to re-concentrate things.

          What wouldn’t be ok would be something like saying “hormone cycle affects what color shirt you wear” and then later when there’s no support for that say “hormone cycle affects what shade of red you wear”, that is, to add in some restriction on the set of possibilities which then necessarily sucks probability weight out of the expanded set and plops it down on your favorite sub-hypothesis.

          • Christian Hennig says:

            The reason why I think this is an instance of forking paths or rather “researcher degrees of freedom” is that you give yourself more freedom after seeing the data than was in your initial model setup.
            Many “forking path” examples work exactly like this. You don’t get a significant p-value in the analysis you initially planned but you discover that by combining some subgroups or applying certain transformations you can. “Broadening the set of possibilities” is exactly the issue.

            • To me “forking paths” is a hack used to communicate why p values are problematic, it’s not a fundamental foundational issue.

              I hereby declare that I have a mixture model in which I place a uniform prior over all Stan models that compile and run weighted by epsilon in addition to (1-epsilon) for whatever particular Stan model I’m running today, but am approximating this model by truncating epsilon to 0. There, now are you happy ? :~)

              Seriously though, the forking paths issue in p values isn’t the same thing AT ALL. Here’s why.

              Suppose I run an experiment, I do an analysis in which I place 100% prior probability on my data coming from some default frequentist distribution, say normal(m,s) and I do a test to see if m ~ 0 and… damn I can’t reject it, no biscuit for me.

              So, I now go and look for some other stuff in my data, and then I place 100% prior probability on some interaction between hormones and shirt color and weather… and lo and behold I can detect nonzero mean… hooray, gimme that biscuit.

              But, this isn’t the same thing at all as setting up a Bayesian model in which there are broad ranges of possibilities such as maybe a positive interaction with weather, maybe nearly zero, maybe a negative interaction with weather, maybe some interaction with age, maybe some interaction with race or certain cultural backgrounds… etc etc.

              When I broaden the range of possibilities, it makes it harder for the data to strongly pick out one possibility. Whereas when I take a sequence of trials in which I place delta functions on the one possibility for an interaction and simultaneous delta functions on no interactions of any other sort… I get lots of combinations to check, each one inconsistent with any coherent view.

              Suppose you have a sequence of numbers which you have labeled A,B,C,D,E,F,G,H,I,J and for some reason I believe that when sorted in descending order, they decay rapidly. Then I tell you “sort these in descending order and tell me the first 2 of them” so you tell me A,B and I add them up and divide each one by the total… and I say there is 90% probability for A and 10% for B, but then you come back to me and say “well C is almost as big as B” then I’m going to know that my calculation was too imprecise and I have to go back and calculate more precisely. That’s all I’m suggesting, except that instead of actually sorting in descending order, in a Bayesian analysis we sort in what we SUSPECT is the descending order, so there’s even more chance we’d be wrong in our calculation. That’s just math, it’s not a special “license” or anything.
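
The truncate-and-renormalize calculation can be sketched as follows. The weights are made-up numbers, listed in the order we SUSPECT is descending, and the function name is hypothetical.

```python
def truncated_probabilities(weights, k):
    """Normalize only the first k (label, weight) pairs, ignoring the rest."""
    kept = weights[:k]
    total = sum(w for _, w in kept)
    return {label: w / total for label, w in kept}


# Unnormalized weights in suspected descending order (illustrative values).
weights = [("A", 9.0), ("B", 1.0), ("C", 0.9), ("D", 0.05)]

first_two = truncated_probabilities(weights, 2)
first_three = truncated_probabilities(weights, 3)
print(first_two)    # A gets 0.9 and B gets 0.1 when only two terms are kept
print(first_three)  # adding C pulls probability away from both A and B
```

With only two terms kept, A gets 90% and B gets 10%. Once it turns out that C is almost as big as B, keeping the third term lowers both of those numbers, which is the recalculation described above.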

              • Christian Hennig says:

                A Bayesian kind of p-hacker could also try out all kinds of possible models, transformations etc. that look remotely plausible and go for the one that gives them the best posterior probability for their favourite hypothesis. They could claim that all kinds of models were in the initial set but some others were ruled out by the data. Obviously not everything was tried out (neither would you try out everything even if you claim that “I place a uniform prior over all Stan models”) so they have degrees of freedom that they can use to their advantage, if they are as keen on doing junk science as the straw men that you put up as frequentists all the time.

                “I hereby declare that I have a mixture model in which I place a uniform prior over all Stan models that compile and run weighted by epsilon in addition to (1-epsilon) for whatever particular Stan model I’m running today, but am approximating this model by truncating epsilon to 0. There, now are you happy ? :~)”
                Well, you’ve got to make an attempt to convince me that this approximation is any good. You’re in a bit of a catch there if you are on one hand happy to say that the prior probability for this-or-that is small enough that it makes sense to approximate it by zero to start with, but then later claim that it was still big enough to be picked by the data that didn’t behave quite as the model for which your approximated probability was 1.

              • The difference is Frequentist p values *require* you to “hack” because they don’t admit a weighting function over different possibilities.

                A Bayesian *can* hack by trying 4 different models and then settle on a single one and pretend that the prior they had on the other 3 were small, and the likelihood didn’t strongly favor them, so they are just approximating things by truncating out near-zero values. But they don’t have to, they have the math in place to allow a variety of options simultaneously. A person CAN drive a screw into a board with a hammer, but a person who doesn’t have a screwdriver has no other choice, and a person who admits the existence of screwdrivers does.

                The math to allow a variety of options simultaneously just doesn’t exist in a frequentist testing scenario because the weighting functions over different options are not admissible probability distributions for frequentism.

              • Christian: I think that’s the essence of it for me. I want to have the math in place to simultaneously consider many options, with always the possibility to add in an additional option if I realize that it makes sense. This requires some sort of weighting function over options. Cox axioms describe the math of such weighting functions. And voila, it’s proven to be unique.

                As ojm points out, Bayesian math isn’t the only math in the world (in particular it’s got nothing to say about existential quantifiers and the hyperreal number line or 2 person iterated games or computability or ….), but it IS the UNIQUE math that describes a real-valued unit-normalized weighting function over logical options.

  9. Carlos Ungil says:

    Christian: Jaynes’s goodness-of-fit measure phi (pp. 293-305) doesn’t rely on an explicit alternative (it is relative to an implicit class of models, like the frequentist chi-squared). Maybe he was getting out of the Bayesian framework. Maybe his approach of assuming a strong model and then fixing it when it does not fit the data (Andrew dixit) is also inconsistent with the Bayesian setup. That’s a legitimate position to hold, but I don’t think Jaynes would agree. I suspect your understanding of Bayesianism is not the same as his.

    • Christian Hennig says:

      Well, the goodness-of-fit measure given there does not depend on a prior and isn’t a posterior, so despite all the Bayesian rhetoric around it, it doesn’t look particularly Bayesian to me. (Apart from this I have no issues with it, it’s a nice enough thing to read.)

      • Bayesian doesn’t mean “prior and posterior” it means *probabilities as measures of plausibility* and sum and product rule.

        This measure he’s got is basically “how much less information does your model have about what would actually come about than if someone told you the oracular outcome?” Or alternatively, “How sharply peaked would your model be around the real data if you had the right parameter values?” or even alternatively “how many bits per observation do you need to correct your model predictions to be exact?”

        To the extent that your model, given the proper parameter values, predicts nearly perfectly, it will have not much less information than an oracle. That’s what’s meant by goodness of fit in his construction.

        His construction makes more sense in a discrete case. It’s closely related to minimum message length versions of Bayes.
