
Statisticians and economists agree: We should learn from data by “generating and revising models, hypotheses, and data analyzed in response to surprising findings.” (That’s what Bayesian data analysis is all about.)

Kevin Lewis points us to this article by economist James Heckman and statistician Burton Singer, who write:

All analysts approach data with preconceptions. The data never speak for themselves. Sometimes preconceptions are encoded in precise models. Sometimes they are just intuitions that analysts seek to confirm and solidify. A central question is how to revise these preconceptions in the light of new evidence.

Empirical analyses in economics have diverse goals—all valuable. . . . Common across all approaches is lack of formal guidelines for taking the next step and learning from surprising findings. There is no established practice for dealing with surprise . . .

This paper advocates a strategy for reacting to surprise. Economists should abduct. Abduction is the process of generating and revising models, hypotheses, and data analyzed in response to surprising findings. . . .

Regular readers of this blog or of our articles and books will not be surprised that I am in complete agreement that we should react to surprise, generate and revise models and hypotheses, etc.

It’s just too bad that Heckman and Singer are unfamiliar with modern Bayesian statistics. For example, they write:

Do Bayesians Abduct?

Bayesian readers will likely respond that learning from data is an integral part of Bayesian reasoning. They are correct as long as they describe learning about events that are a priori thought to be possible as formalized in some prior, however arrived at.

More fundamentally, Bayesians have no way to cope with the totally unexpected (priors rule out “a surprising fact C is observed” if C is a complete surprise). Total surprise is the domain of abduction. . . .

I don’t think they really mean total surprise—all our reasoning is probabilistic. But, on the larger point, yes, learning from surprise is a core aspect of Bayesian data analysis. Indeed, it’s the third of the three steps listed on the very first page of our book, Bayesian Data Analysis. Here is how our book begins:

1.1 The three steps of Bayesian data analysis

This book is concerned with practical methods for making inferences from data using probability models for quantities we observe and for quantities about which we wish to learn. The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical data analysis.

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

1. Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.

2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.

3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.

What Heckman and Singer call abduction is included in this step 3, and we talk a lot more about it in chapter 6 of the book.
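
As a rough illustration of how the three steps fit together, here is a minimal Python sketch. It is not taken from the book: the binary data, the Beta(1, 1) prior, and the choice of "number of switches" as a test statistic are all assumptions made for the example. The model treats the outcomes as independent, so a clumpy sequence should show up as a surprise in step 3.

```python
import random

def count_switches(seq):
    """Test statistic T: how many times consecutive outcomes differ."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

# Illustrative binary data (an assumption, chosen to look clumpy).
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
n, successes = len(y), sum(y)

# Step 1: full probability model (assumed for this sketch).
#   y_i ~ Bernoulli(theta) independently, theta ~ Beta(1, 1).
# Step 2: condition on the data. By conjugacy,
#   theta | y ~ Beta(1 + successes, 1 + n - successes).
post_a, post_b = 1 + successes, 1 + n - successes

# Step 3: check the fit. Simulate replicated datasets from the posterior
# predictive distribution and compare T(y_rep) with T(y).
rng = random.Random(0)
t_obs = count_switches(y)
n_sims = 5000
n_as_extreme = 0
for _ in range(n_sims):
    theta = rng.betavariate(post_a, post_b)
    y_rep = [1 if rng.random() < theta else 0 for _ in range(n)]
    if count_switches(y_rep) <= t_obs:
        n_as_extreme += 1
p_value = n_as_extreme / n_sims

print(f"observed switches: {t_obs}, posterior predictive p-value: {p_value:.3f}")
```

A small p-value here says that the independence model rarely produces sequences as clumpy as the data. That is the surprise, and the response is to alter or expand the model (for example, by allowing dependence between consecutive outcomes) and repeat the three steps.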

Don’t get me wrong—I’m not saying this idea is original to me, or to me and my collaborators. I’m just disputing the claim that “Bayesians have no way to cope with the totally unexpected.” We do! We set up strong models, then when the unexpected happens, we realize we’ve learned something.

Here’s another relevant article:
Why ask why? Forward causal inference and reverse causal questions (with Guido Imbens)

And, for a non-quantitative take on the same idea:
When do stories work? Evidence and illustration in the social sciences.
That’s the paper where Thomas Basbøll and I argue that good stories are anomalous and immutable, which is another way of saying that we learn from surprises, from aspects of reality that don’t fit our existing models.

Also this one from 2003:
A Bayesian formulation of exploratory data analysis and goodness-of-fit testing.

Finally, here’s a paper where Cosma Shalizi and I connect statistical model checking and model improvement with Lakatosian ideas of testing and improvement of research programs.

Again, my citation of this work is not an attempt to claim priority, nor is it intended to diminish Heckman and Singer’s suggestions. I assume they’ll be happy to learn that an influential school of Bayesian statisticians and econometricians is in agreement with them on the value of generating and revising models, hypotheses, and data analyzed in response to surprising findings.

Indeed, I think Bayesian inference is particularly valuable in this area, both in allowing us to fit more complex, realistic models, and, when coupled with graphical visualization techniques, in providing methods for checking the fit of such models.

P.S. All the abduction in the world won’t save us from selection bias, and I still think that just about all published estimates of effect sizes are biased upward. Including the one discussed here.

84 Comments

  1. Shravan says:

    The only problem with this approach is that it takes a lot of time and effort. I have been spending days validating my Stan models. I could have written several papers by now. Now I am writing slower than I think, rather than the usual case of writing faster than I can think.

  2. “More fundamentally, Bayesians have no way to cope with the totally unexpected”

    Here’s what I want to say to that based on recent discussions (including some with OJM on this blog).

    The Bayesian formalism does not necessarily solve the problem of model choice formally. When there are a fixed number of models and one wants to choose among them, Bayesian mathematics does fine.

    But, when there’s a nebulous “I need a model for this stuff” it makes no sense to formalize this model search because formalizing it means encoding all the different models you’d be willing to entertain, which is potentially a lot of models, in fact it includes models you probably haven’t even thought of (for example suppose you have a colleague and the colleague hears about your problem and proposes a model, once you’ve heard the proposal you might well accept it as something you should consider, but until you hear the proposal, the model isn’t in your universe of models).

    So, we need something outside of formal Bayesian model fitting, and that’s always going to be true. I also think this thing needs to be informal, precisely because of the “otherwise you need to have a formal system that describes the universe of models”. Nevertheless, I think the structure of this informal search is well informed by the general idea of Bayesian inference. If you’re “surprised” by your model fit/misfit, this indicates you need to search in a different region of model space, in the same way that if the high probability region of the posterior is far outside the region of the prior, you need to move to a different region of parameter space within a Bayesian inference.

    So I think informally you can structure modeling as a process that looks like:

    1) Think up one or more models of the process that seem reasonable (informally “in the high probability region of model space”)

    2) Formalize the models in (1) into formal mathematical statements / Stan code etc.

    3) Fit the models in Stan and get posterior inferences.

    4) Informally check the fit from (3) to see if it meets the informal goodness of fit requirements you expect a good model to meet. If so, accept the model fit tentatively until such time as an alternative is suggested, otherwise backtrack to 1 and expand / alter the model space.

    (2) and (3) are mathematically precise formal things, but (1) and (4) are imprecise and driven by fit-for-purpose considerations, nevertheless, 1-4 can be seen as similar to Bayesian ABC method in which (1) is the generate-from-the-prior over models stage, and (4) is the compare goodness of fit stage.
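
For readers who have not seen ABC (approximate Bayesian computation), the generate-then-compare mechanics mentioned above can be sketched as a rejection sampler. This is only a toy: it draws parameter values rather than whole models, and the data (7 successes in 10 trials), the uniform prior, and the exact-match acceptance rule are assumptions made for the example.

```python
import random

rng = random.Random(1)

# Illustrative data (assumed): 7 successes in 10 Bernoulli trials.
n_trials, observed_successes = 10, 7

def simulate(theta):
    """Simulate the number of successes under a proposed theta."""
    return sum(1 for _ in range(n_trials) if rng.random() < theta)

accepted = []
for _ in range(20000):
    theta = rng.random()                          # generate from the Uniform(0, 1) prior
    if simulate(theta) == observed_successes:     # compare simulated data with observed data
        accepted.append(theta)                    # keep proposals that reproduce the data

abc_mean = sum(accepted) / len(accepted)
# With a uniform prior the exact posterior is Beta(8, 4), whose mean is 8/12.
exact_mean = (observed_successes + 1) / (n_trials + 2)

print(f"accepted {len(accepted)} draws; ABC mean {abc_mean:.3f}, exact mean {exact_mean:.3f}")
```

The informal workflow differs in that the "proposals" are models a person thinks up and the "comparison" is a fit-for-purpose judgment, but the accept-or-go-back structure is the same.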

  3. ojm says:

    This is all fair enough, but step 3 of the BDA approach is obviously not intrinsically Bayesian – it can be done in many ways. Here the ‘data analysis’ component is doing most of the work imo.

    Many alternatives to Bayesian inference can probably be seen as trying to jump straight to step 3. Or at least focus more on this component. I think there is a reasonable case to be made for this sort of thing. Once you allow step 3 there is no philosophical/logical argument to say steps 1 and 2 should be done Bayesianally.

    Which is fine when being a non-dogmatic/pragmatic Bayesian but it can be annoying to hear the argument that ‘one must use Bayes for steps 1 and 2 because of [philosophical reasons that are undermined by step 3] but of course make sure to do step 3’.

    • ojm says:

      I know Andrew likes to think of empirical Bayes type work as an approximation to a ‘fully Bayesian’ solution, but given the acknowledged limitations/inherent approximate nature of full Bayes I now wonder if ‘falsificationist Bayes’ isn’t really another name for empirical Bayes (broadly conceived). I know Efron still wonders if there is a better foundation for empirical Bayes than just poor man’s Bayes…

    • Andrew says:

      Ojm:

      Not to disagree with you but just to clarify: I’m not one of those people who says “one must use Bayes.” In particular, in my above post I was addressing Heckman and Singer’s ill-informed statement that “Bayesians have no way to cope with the totally unexpected.” That’s just annoying. They aren’t familiar with a subfield of statistics, so they just make statements out of ignorance. How hard would it’ve been for them to have said, “We are not aware of any Bayesian methods for coping with the totally unexpected.” It’s no shame on them for not being knowledgeable about Bayesian statistics—the world’s a big place and none of us can be experts on everything—but then why can’t they be open about their ignorance?

      • ojm says:

        Yes, absolutely. I imagine they’re pushing back against some ‘you must use Bayes’ incursion, while you’re pushing back against a ‘you can’t use Bayes’ incursion. I agree that you can but needn’t use (some form of) Bayes.

      • jadagul says:

        I wind up in a fair number of online arguments with philosophical Bayesians, who really do believe that it’s basically illegitimate to do anything other than Bayes Rule updates on your prior. If you read the excerpt as responding to those people then the comment as written makes perfect sense; purist philosophical Bayesianism has no place or way to add weight to possibilities that didn’t have non-zero probability in your original prior.

        Now obviously no one who actually uses Bayesian methods to solve actual problems has this problem in this way; there’s lots of discussion about how to choose priors and adjust your model in response to data. But that’s stepping outside the purist framework because the purist framework doesn’t have an answer to that problem.

        • Godel’s theorem tells us any mathematical system that contains integer arithmetic is incomplete: there are true facts that can’t be proven within the system. Bayes is about quantifying the trueness of facts, and certainly contains integer arithmetic. So, anyone who is simply purist enough will realize that Bayes can’t do it all. It’s an impure kind of purity that says “within these bounds are all the good stuff”

          • That being said, I think it’s still legit to say that it’s illegitimate to do something other than Bayes when the thing you’re trying to achieve is compatible with the Cox axioms, namely doing inference *within a model* in such a way that you don’t get results that contradict boolean logic.

            • ojm says:

              As emphasised in criticisms of NHST, boolean logic is probably not a good model for scientific theories. At least if you don’t like to think of models/theories as strictly ‘true’ or ‘false’. The basic premise of NHST is, after all, based on ‘not (not H) implies H’.

              • p( Parameter | Data, Model) explicitly conditions on the assumptions of the Model. Within the model, which of the Parameter values have the highest plausibility?

                the fact that there doesn’t exist a Parameter which is *literally true* in the outside sense of “unconditionally” or “in reality” doesn’t keep us from wanting a plausibility measure which keeps both A and Not A from increasing without bound. If the sum of their plausibilities is bounded, we might as well divide by that bound, and make it 1, at which point we get Bayesian Probability.

                Non-identifiability then tells us that *our model doesn’t know or doesn’t even have an actually true value*. Fundamental Non-identifiability (ie. in the presence of whatever experimental information you would like) is signaling to the outside world “This model doesn’t make sense”

                I think you mistake non-identifiability of p(Parameter | Data, Model) for a fault, when it is in fact a feature, and I think you want “plausibility” to be unconditional, ie. what is “actually true” whereas it is instead conditional “what is true within the restricted world of what the model allows”.

              • ojm says:

                Daniel – there’s only so many times we should repeat the same arguments on Andrew’s blog.

                I think you are determined to convince me of things that I am not likely to be convinced by, and vice-versa. You think I make mistakes and I think you make mistakes. Etc.

              • Well, I do feel like I still make progress in understanding, really I’m determined to convince myself as to which issues I should or should not worry about. If understanding how to do science were so easy, we wouldn’t have all these discussions.

                Anyway, I guess it’s fine to let it be for now.

              • ojm says:

                Well it’s not like I’m good at sticking to my intention to let it go anyway…

            • jadagul says:

              I’m going to pass over the Godel point for right now.

              Pure Bayes is reasonable as a response to doing inference within a model. (Although there are alternative suggestions like Dempster-Shafer theory, which offers a plausible alternative to following Cox).

              But if you believe that literally all reasoning has to/ought to follow the progression of a Bayesian update to priors–and I reiterate that there are people who believe this–then you have serious issues with how to handle what Heckman and Singer call a “complete surprise”: an outcome to which your original prior assigned probability zero. If that happens you have to go back and alter your prior and model.

              And obviously this is a thing “Bayesians do” if you mean that people using Bayesian tools to do statistical analysis do it. Because it’s a necessary step. But it’s very specifically outside the purist-Bayesian philosophical regime.
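
The zero-prior point can be seen in a few lines of Python. This is a minimal sketch with made-up hypothesis names and numbers: Bayes' rule multiplies prior by likelihood, so a hypothesis given prior probability zero stays at zero however well it predicts the data.

```python
def bayes_update(prior, likelihood):
    """Posterior over a discrete set of hypotheses via Bayes' rule."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("data have zero probability under every hypothesis with prior mass")
    return {h: w / total for h, w in unnormalized.items()}

# Hypothetical prior: the third hypothesis was never considered, so it has weight zero.
prior = {"model_A": 0.6, "model_B": 0.4, "unconsidered": 0.0}
# Hypothetical likelihoods: the data are far better explained by the unconsidered hypothesis.
likelihood = {"model_A": 0.01, "model_B": 0.02, "unconsidered": 0.99}

posterior = bayes_update(prior, likelihood)
for _ in range(10):            # more data of the same kind changes nothing for "unconsidered"
    posterior = bayes_update(posterior, likelihood)

print(posterior)
```

Getting weight onto "unconsidered" requires going back and changing the prior, which is a step taken outside the update rule itself.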

              • Well, I think the answer here is that if you assigned zero to a probability then either you were wrong, or the data is wrong. No matter how much you point to any dataset you will never convince me that a toothbrush has negative length… the data is wrong essentially by definition of “length”. On the other hand, if I say there is zero probability to live to 110 years or older, and you exhibit someone who is 114 I should acknowledge that I made a mistake. The existence of mistakes is undeniable, anyone who claims otherwise is ignorable I think.

                As for passing over the Godel point. Well it’s a fair thing to do for the moment, but it’s a valid objection to any claim of the formalization of truth.

              • jadagul says:

                Sure, and that’s fine if you’re trying to predict a scalar or something.

                But suppose you’re trying to fit a function to data, and you have a prior over polynomial functions, and it turns out that the “true” function is a sine wave. Then since that possibility wasn’t in your prior your prior gave it weight zero, and you can’t handle this “within” Bayes.

                You have to step out and do step 3, which is “look at your model, see that it’s bad, think about what models might work better.”
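
A small Python sketch of what that check might look like. It makes several simplifying assumptions: the "polynomial" is just a straight line, the fit is ordinary least squares rather than a Bayesian posterior, and the data are simulated from a noisy sine wave. The diagnostic is the lag-1 autocorrelation of the residuals, which should be near zero if the model captures the structure.

```python
import math
import random

rng = random.Random(2)

# Simulated data (assumed): a sine wave over roughly two periods plus small noise.
xs = [0.1 * i for i in range(126)]
ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]

# Fit a straight line by ordinary least squares (closed form).
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# Model check: residuals from a well-specified model should look like noise.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
lag1 = sum(a * b for a, b in zip(residuals, residuals[1:])) / sum(r * r for r in residuals)

print(f"slope {slope:.3f}, intercept {intercept:.3f}, lag-1 residual autocorrelation {lag1:.2f}")
```

An autocorrelation close to 1 means neighbouring residuals share the same sign and size, a systematic pattern that no setting of the line's two parameters can remove. Nothing inside the fitted model proposes a sine wave; the check only says to go and look for a different model.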

              • ojm says:

                > that’s fine if you’re trying to predict a scalar or something.

                +1 That’s a nice point to make (though there are other ways of doing inferences for scalars, as you point out).

                What holds for ‘structureless’ objects like propositions doesn’t necessarily hold for ‘structured’ objects like mathematical models or scientific theories.

                I really dislike the idea of treating a model or theory as an elementary proposition. What is the ‘negation’ of a differential equation?

                Sure there are some things that it makes sense to treat in terms of propositions and probability statements – some of which can even be statements about parameter values! – but I’m pretty dubious on the general validity of this, and believe there can be strong unintended consequences.

              • Jadagul. Yes exactly, just like the Godel sentence is obviously true but only in the meta mathematics of the system, it’s unprovable in the system.

              • ojm says:

                According to the Godel metaphor then, one should be non-Bayesian _within_ the model and Bayesian _outside_ the model, since only _externally_ (e.g. ‘God’s eye view’) do we have access to all true/false statements.

              • ojm: Hmm, here’s what I was thinking of:

                Suppose I specify a model for some science (toothbrushing or whatever) and I collect some data. I am committed to the idea that I want my analysis of this data and model together to have a certain consistency. I take Cox’s axioms as being sufficient for the consistency properties that I want. So, I choose priors over parameters that express my scientific knowledge, and conditional probabilities over data given my priors, and I code it in Stan and I get some results. Within this logic, everything I’ve done is consistent. But there are scientifically true facts which are nevertheless not provable within the Bayesian model, namely, for example “my model sucks and it doesn’t explain what actually happens very well”. Within the model, there is no way to examine this question, so I need to examine it outside the Bayesian model.

                One method is model expansion. Andrew advocates this often. Now, our model is more flexible and can explain more things, so there might be a region of parameter space where in fact the fit is such that, outside the model, I will say “my model doesn’t suck, and does explain things pretty well”.

                In the absence of that, I may need to simply try alternative models. How do I decide what to try?

                In his book “Mathematical Logic” (see, look what you’ve done, now I’ve bought another great Dover Kindle book) Stephen Kleene introduces the subject by asking:

                “Now we are proposing to study logic, and indeed by mathematical methods. Here we are confronted by a bit of a paradox. For, how can we treat logic mathematically (or in any systematic way) without using logic in the treatment?

                The solution of this paradox is simple, though it will take some time before we can appreciate fully how it works. We simply put the logic that we are studying into one compartment, and the logic that we are using to study it in another. Instead of “compartments”, we can speak of “languages”. When we are studying logic, the logic we are studying will pertain to one language, which we call the object language…Our study of this language and its logic… we regard as taking place in another language, which we call the observer’s language. Or we may speak of the object logic and the observer’s logic.

                The scientific “goodness” of a Bayesian model must be evaluated at an observer level, but once the model has been formalized out of the scientific knowledge, the relative goodness of the various unknown parameter values conditional on the data is precisely formalized through the sum and product rules of Bayesian probability theory in such a way as to give you an answer that has certain strong logical consistencies. For example, you will never get a result “the expected brushing time of kindergartners is -3 minutes” because this is excluded in the prior.

              • ojm says:

                FYI – another good Dover book on logic that gets at some of the subtle issues at play is Topoi: The Categorial Analysis of Logic By Robert Goldblatt.

              • ojm: if in pseudocode Stan I do:

                ypred = predict_ode_results(paramvec);

                y ~ normal(ypred,1);

                What are the “propositions” that I’m considering? They are basically of the form

                particular value of paramvec makes the vector of observations y be in a ball of a given size around ypred

                There’s no “negation of the ode” it’s all propositions about particular parameter vectors and the resulting nearness to the ball of given size

                I think this makes things more explicit about the object and observer logic. Why the particular size ball? Because in observer logic we believe this is the right measurement error size. Why the particular parameter values handed to us by Stan? Because in the object logic (the bayesian model) it best satisfies the declared requirements.

                The Bayesian model doesn’t give us truth in the observer sense (the scientific sense) it gives us truth in the restricted object-model sense (best satisfies the formal requirements of the model).

                If the model is bad, GIGO
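
To make the pseudocode concrete, here is a runnable Python sketch under stated assumptions. The ODE (exponential decay, dy/dt = -k*y), the Euler solver, the simulated data, and the flat prior over a grid of k values are all invented for illustration; only the `normal(ypred, 1)` measurement model comes from the pseudocode above. `predict_ode_results` here is a stand-in, not a real Stan function.

```python
import math
import random

def predict_ode_results(k, times, y0=100.0, dt=0.01):
    """Euler solution of dy/dt = -k * y, reported at the requested times."""
    preds, y, t = [], y0, 0.0
    for target in times:
        for _ in range(round((target - t) / dt)):
            y += dt * (-k * y)
        t = target
        preds.append(y)
    return preds

times = [float(t) for t in range(1, 11)]

# Simulated observations (assumed): true k = 0.5, measurement noise with sd 1.
rng = random.Random(3)
k_true = 0.5
y_obs = [mu + rng.gauss(0.0, 1.0) for mu in predict_ode_results(k_true, times)]

def log_likelihood(k):
    """log p(y | k) up to a constant, for y ~ normal(ypred, 1)."""
    ypred = predict_ode_results(k, times)
    return sum(-0.5 * (y - mu) ** 2 for y, mu in zip(y_obs, ypred))

# Each grid point is one "proposition": this value of k puts ypred near y.
k_grid = [0.01 * i for i in range(1, 201)]
log_post = [log_likelihood(k) for k in k_grid]      # flat prior over the grid
max_lp = max(log_post)
weights = [math.exp(lp - max_lp) for lp in log_post]
total = sum(weights)
posterior = [w / total for w in weights]
k_map = k_grid[posterior.index(max(posterior))]

print(f"posterior mode for k: {k_map:.2f}")
```

The posterior ranks values of k within the decay model. Whether exponential decay was a sensible model in the first place is a question the code never asks, which is the object/observer distinction drawn above.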

              • ojm says:

                When restricted to parameters in eg an ode your propositions are eg parameter = 1, parameter = 1.1 etc.

                In a context where multiple parameters could be ‘correct’ (or all are incorrect) then logically you are saying it has to be the case that either parameter = 1 or parameter != 1. But then you are saying that it can’t be the case that both parameter = 1 and parameter = 1.01 etc. Or even parameter = 10. This can be thought of as an identifiability constraint.

                An alternative is to say ‘I will consider the parameters in some range. For each individual value I will evaluate its consistency with the data (eg via its implied predictions). I make no assumption that only one value can be the right answer’.

              • ojm says:

                This latter reasoning involves no probability. And no, restricting your initial search to a range is not the same as imposing a prior _probability_ distribution.

              • ojm:

                Cox Bayes only says essentially that “when your knowledge gives you complete certainty, the mathematics is the same as Boolean logic” it doesn’t say that “there is a really and truly, actually true value in every question and we just don’t know what it is”

                See Kevin Van Horn’s R2 http://ksvanhorn.com/bayes/Papers/rcox.pdf

                it requires only compatibility with Boolean logic, not equivalence to it in the limit of infinite data for example.

                I think outside the Bayesian machinery you can claim “I will only accept a model as scientifically meaningful if there really and truly is one parameter value that is True ™” but if you do this, this is externally imposed by you, not by a sufficient set of Cox axioms.

                Suppose in the limit of infinite data you come up with p(a = 1) = 0.8 and p(a = 1/100) = 0.2 does this mean your application of Bayes was illegitimate, it failed to satisfy some of the Cox axioms? No. No axiom is violated.

              • ojm says:

                I’ve read Van Horn – I think it begs all the same questions.

                BTW probability is not a multi-valued logic, it assigns real numbers to propositions built on single-valued logic. So there is an underlying single-valued logic. It is the ‘plausibility that this is true’ not plausibility replacing T/F.

              • ojm says:

                You might be interested in Halpern’s ‘Reasoning about uncertainty’ https://mitpress.mit.edu/books/reasoning-about-uncertainty

              • “BTW probability is not a multi-valued logic, it assigns real numbers to propositions built on single-valued logic. So there is an underlying single-valued logic.”

                Formally this is nowhere in the axioms. Probability assigns real numbers to propositions. Period. Restricted to the case where it assigns p(A) = 1 it also assigns p(~A) = 0 and the like, yes. But there is no underlying 2 valued logic required by the axioms.

              • In particular, I think your assumption of an underlying 2 valued logic gets cast into Cox/Bayes as an additional axiom that goes something like:

                “There exists a unique state of knowledge K which contains all possible correct scientific facts about the real world, and in each model, an *atomic* proposition A about the parameter vector of the model, such that p(A | K) = 1.”

                And this is a *strong* axiom about *Science* that is *definitely outside* the Cox axioms. Also I think it’s obviously wrong (in particular, there is basically just one correct model of the whole universe, involving some kind of Quantum particle reality, and none of our scientific models are equal to this model, so given K every actual A we might come up with has p(A|K) = 0 ).

              • ojm says:

                > assigns real numbers to [Boolean] propositions.

                It’s right there.

              • ojm says:

                Anyway, my general argument is

                a) Cox-Jaynes is not the unique unobjectionable logical system for reasoning about uncertainty.

                b) One probably shouldn’t want one anyway

              • ojm says:

                (unless you have a different definition of proposition and associated Cox derivation?)

              • ojm says:

                (I’m assuming the usual definition of a proposition as a declarative sentence that is either true or false)

              • A proposition is just a statement, such as “a = 1” but it could also be a statement such as “a is in the interval [0,1]” if the posterior of your Bayesian model + background knowledge + large quantities of data assigns uniform(0,1) then it seems you consider this to be a fault, that there should be an underlying atomic proposition of the form a=X for some single X which gets p(a=X | Background, LargeQuantitiesOfData) = 1

                But there is nothing in the axioms which requires a proposition of this form a=X to get p=1. In fact, often this doesn’t occur. It seems as if you find this an illegitimate thing because deep inside it all “there can be only one” in other words, there is an equality proposition which we would all agree to assign truth to, if we just had enough knowledge K. I find that odd, particularly considering that we actually know ahead of time “all models are wrong, some are useful”.

                anyway I think we’ve gone far enough for now. Thanks for playing again :-)

              • ojm says:

                Yeah fair enough to stop for now. For the record I find your characterisations of what I ‘want’ very strange and quite far off. I’m just working with the definitions, not on what I ‘feel’ like Bayes should be or what the derivation ‘should’ say.

                (BTW statements and propositions are usually distinguished in formal logic. And yes of course propositions can be convoluted or about complex objects, but they are required to ultimately be declarative T/F statements. There are whole chapters in books on philosophical logic dealing with the question ‘what is a proposition’.)

              • ojm says:

                …one last comment.

                A proposition like a is in [0,1] is fine. And you can even assign it a probability if you want, like 0.5. This is not the same as assigning a uniform probability distribution over the interval [0,1], as I assume you know?

              • ojm: If you could formalize your argument somewhat it would make it so I didn’t have to guess at what the concern is so much.

                The fact is the Cox axioms give you uniqueness of a certain algebra of plausibility. Certain plausibilities are assigned (from outside Bayes, priors and likelihoods) and then once you assign those and you accept that you want your calculations to obey the Cox axioms, the unique rules for manipulating these plausibilities are the probability rules, and they guarantee that you wind up with a unique mapping that has certain consistencies, input to output

                and that this mapping accords with p(A) = 1 implies P(~A)=0 and so forth.

                I don’t see how this implies some “underlying” anything. There’s just formal rules for manipulating things called “states of information” which are formally probability distributions.

                In particular your statement: “In a context where multiple parameters could be ‘correct’ (or all are incorrect) then logically you are saying it has to be the case that either parameter = 1 or parameter != 1. But then you are saying that it can’t be the case that both parameter = 1 and parameter = 1.01 etc. Or even parameter = 10. This can be thought of as an identifiability constraint.”

                but I disagree that I have to accept a=1 or a!=1 as acceptable propositions. The meaning of the word “proposition” is undefined in the Cox axioms. So I can simply say that a=1 isn’t a valid proposition about which I have to be able to assign probability.

                So, if that kind of thing is your concern, it seems like your issue is something related to measure theory and the meaning of “proposition”. You are imbuing the purely formal “proposition” with particular meaning, in particular including things like “a=1” in your propositions.

                In fact, measure theoretic probability works using Borel algebras, so it doesn’t allow probabilities to be assigned to a=1.

                which is to say, in Cox/Bayes the meaning of the word “proposition” is undefined, just like in geometry the meaning of the word “line” is undefined. and so we can have euclidean or non-euclidean geometry based on what we take “line” to mean, and whether or not we add a parallel postulate.

                So, I admit, I don’t get it. But someday I hope you will formalize some concern in a way I could actually understand it… and I also admit to making progress just by thinking about the issue… sigh… but we never seem to be able to stop posting.

              • ojm says:

                Suppose you treat propositions as undefined terms. They are at least required, in Cox’s treatment, to satisfy a Boolean algebra right? That’s what I remember from last time I looked at Cox’s book.

                So in the geometry analogy, I see you as asserting the parallel line postulate and claiming all geometry is Euclidean.

                I’m saying if you instead use eg a Heyting algebra (which could be motivated intuitively for scientific applications eg based on a preference for constructive falsification rather than proof by contradiction), then you get a generalisation to something more akin to a non-Euclidean geometry. Without an underlying Boolean algebra, you don’t have the Cox argument, coz he explicitly uses it.

                So Bayes = Euclid, non-Bayes = non-Euclid. Appropriately, I sometimes find Bayes too rigid.

                (I also generally find ‘quantitative logics’ to be pretty unappealing these days but that’s another issue)

              • My impression is that Cox’s original treatment was imperfect and that Van Horn covers a more modern minimalist version. I don’t see a requirement in Van Horn’s exposition that there be an underlying Boolean algebra, only that if there is one the plausibility algebra respects it.

                If you can exhibit a uniqueness proof for an uncertainty measure based on generalizing Heyting algebra and a model for it… Hey I’m willing to give it a very deep look. Even better if it comes with a powerful computational inference engine….

              • Interesting question I now have: can we justify probability as a measure of accordance, not of plausibility. That is, remember we know essentially all scientific theories are completely false. Newtonian physics for example, doesn’t respect relativity. But, Newtonian physics predictions accord extremely well with reality provided the relative velocity between two objects is much less than the speed of light, and the mass of the objects is much more than a Hydrogen atom.

                Now, if there is an underlying quantity that only exists in some kind of imprecise aggregate sense, like say “the location of a baseball” then it can’t be said that the failure of the underlying quantity to actually take on the predicted value at given time t is a refutation of the theory. Only some kind of “large deviation” from the predicted quantity refutes the theory. Furthermore, refutation should be continuous. It’s not like things are just fine for some range, and then when the baseball moves an additional atomic diameter… it all goes to hell. From this perspective the likelihood could be viewed as a measure of accordance… the smaller the likelihood is, the less we believe that the prediction accords with the theory.

                we combine this with prior information on parameters to get a measure of “accordance with theory and data”.

                just ignore the plausibility/boolean algebra/formal language stuff for the moment.

                In an algebra of accordance, the finiteness (hence, normalizability) of the accordance measure seems obviously desirable, the functional inverse relationship between accordance and not accordance seems desirable. The factorizing conditionally seems desirable: accordance of the data with the prediction obviously depends on the prediction… and hence on the inputs to the prediction: parameters.

                Then we don’t need a “true” underlying theory in any way. Bayes becomes simply a measure of which portions of theory space (parameters) accord most well with data and the theoretical precision (the likelihood).

              • Carlos Ungil says:

                Daniel, a few quotes appearing in the first three pages of the paper by Van Horn that you cited:

                “We stress that we are concerned with degrees of plausibility, as opposed to degrees of truth. Fuzzy logic (with the exception of possibility theory) and various other multivalued logics deal with the latter, and hence have aims distinct from ours.”

                “Our logic shall be restricted to statements such as P, which are either true or false, although we may not know which.”

                “A proposition is an unambiguous statement that is either true or false.”

              • Carlos I see where he says that but it’s much less clear to me that there is anywhere he relies on this interpretation for the proof of any of the mathematical theorems. I would have to read it very carefully to be sure.

                In fact I remember from the 90s, when fuzzy logic was a big deal, that there was a proof that it was equivalent to probability theory under some restrictions, mainly related to a sum-to-one type of restriction.

                I should look for that and see if I can find a citation.

                In the end I think the interpretation of meaning has to be different from the formal proof of certain properties of the formal system. If there are multiple interpretations of the meaning which all satisfy the same formal rules, then in some sense it’s up to the user to decide what they mean.

              • To me, the real advantage of using probability in a Bayesian context is that we have a uniqueness proof in Cox’s theorem (provided you agree with the assumed properties), and Kolmogorov exhibited a constructive model, so via model theory we have a proof of consistency. A consistent, unique algebra of quantities representing degrees of *something*, which has as limiting cases both “definite truth/boolean logic” and “definite membership in a set” (a restricted subset of fuzzy logic), and possibly an “accordance with a theory” interpretation, gives you assurance that you’re not going to calculate contradictory information after you plug in whatever your input information is: each set of input objects results in a well-defined set of output objects. How you interpret *the meaning* of what the output of your Stan code tells you is in some sense up to you, and depends on how you interpreted the meaning of the input values you constructed. In the same way, if you calculate the number 4 as the radius of a crushed-up ball of paper, you have to interpret that somehow, because it’s not a perfect sphere, so what does 4 refer to? At least, if you calculated the number 4 using the algebra of the real numbers, you are pretty sure the rules of addition, multiplication, division, etc. are consistent, and someone else who accepted the logic of your input quantities would calculate the same number 4 as you did.

              • Carlos Ungil says:

                How does the enunciation of Theorem 14 (let alone the proof) make any sense without the previous definitions of Boolean propositions and operators?

              • On theorem 14:

                Sure, he’s looking for an algebra of plausible knowledge about facts, so he states Theorem 14 in that language… But alter “known to be false given the information in X” and “known to be true given the information in X” to “known to have no membership in the set Q” or “known to have full membership in the set Q” and you could get a restricted fuzzy logic out of 14 it seems. Or am I being obtuse?

                At this point anything I say should be considered as speculative, because I haven’t looked deep into the possibility that there are yet other alternative interpretations of the meaning of probability algebra. We know one meaning is “frequency of occurrence within countably infinite sequences of numbers”, and another meaning is “plausibility that a statement in a formal language turns out to be true, under a restricted set of information”, it’s not clear to me that we can’t get the same exact algebra from “degree of membership in a set, where the total membership across all sets is constrained to be equal to 1”, or “degree of accordance of a theoretical prediction with the measured value given a theoretical accordance function (think measurement error)”

                Just Frequentist vs Orthodox Cox/Jaynes Bayesianism shows that there isn’t *one true interpretation of the formal structure*. How else / widely can it be interpreted? I don’t know.

              • The thing that concerns me about using “truth 0/1” as the basis of interpretation of probability theory, is the “all models are wrong, some are useful” adage. I know for sure that my model of a bearing ball being dropped through a thick syrup is false, because my model involves the Navier Stokes equations of a continuum, whereas the syrup is a bunch of molecules.

                So long as I say

                (Pred(t) – Measured(t)) ~ normal(0,s)

                Expresses the probability that the true modeling and measurement error takes on some value… we can get away with just talking about say true values of this error, and we could wave our hands about the problem of how do you measure the location of a bearing ball… does it even have a location at the fundamental level?…

                but as soon as we want to compare say two models of the drag coefficient as a function of reynolds number… and we do it

                p(Data | parameters, Model1) p(Model1) + p(Data|parameters,Model2) p(Model2)

                now what is our interpretation of p(Model1), p(Model2) given that we know “Model1 is a literally true expression of the actual drag coefficient function” is exactly zero, since at the fundamental level, there *is no* drag coefficient, only a lot of molecules colliding with each other…

                So, it may be possible to restrict ourselves to propositions about the goodness of approximations or something, but it’s a technicality I don’t think people usually look too deeply into. There be dragons, and we’re busy, we’ve got to deliver some statistical report to some client, or publish some paper, or get some grant. Maybe next year, on sabbatical we can discuss what the heck does it mean to evaluate the boolean truth of a model we know is false to begin with.
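
                For concreteness, here is a minimal Python sketch of the single-model error statement (Pred(t) – Measured(t)) ~ normal(0,s). The predicted and measured values and both choices of s are made up for illustration; they are assumptions, not numbers from any real experiment.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def error_loglik(pred, measured, s):
    """Sum of log densities of (pred - measured) under normal(0, s)."""
    return sum(normal_logpdf(p - m, 0.0, s) for p, m in zip(pred, measured))

# Hypothetical predicted and measured ball positions (arbitrary units).
pred = [0.0, 1.0, 2.0, 3.0]
measured = [0.1, 0.9, 2.2, 2.9]

loglik_tight = error_loglik(pred, measured, s=0.15)  # scale close to the residuals
loglik_loose = error_loglik(pred, measured, s=5.0)   # scale far too wide
```

                With these made-up numbers the tighter scale gives the higher log-likelihood, because every residual is within about 0.2 of zero.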

              • ojm says:

                > The thing that concerns me about using “truth 0/1” as the basis of interpretation of probability theory, is the “all models are wrong, some are useful” adage

                This is sort of the point – probability theory _is_ fundamentally tied to Boolean propositions, and so is an awkward fit with ‘all models are wrong’. This is the exact argument many people have made against using probability as a ‘logic of science’. One of many, as Christian has noted, is Laurie who points out that replacing ‘true’ with ‘adequate’ breaks the whole _formal_ logic of Bayes.

                You want to have your cake – probability for uncertainty forced via eg the Cox derivation – and eat it too – all models are wrong. You can use probability and claim all models are wrong but you can’t say that using probability/Bayes is formally forced via eg Cox.

              • I acknowledge that probability isn’t formally forced on you. It’s only formally forced on you once you accept Cox’s axioms. But I am not going to buy into the requirement that Cox requires an underlying boolean truth value incompatible with “all models are wrong some are useful” without a lot of careful examination of that assertion. Since probability *is* compatible with *frequency in countably infinite sequences* it isn’t the case that there is only one interpretation of the meaning of probability theory, and that it’s degree of plausibility of propositions with definite boolean truth values. Yes, Cox’s program was to show that it was *compatible* with that interpretation, but his program didn’t show that it’s incompatible with other interpretations.

                Still, I’m very happy to have found a more precise formulation of at least one of your objections ojm, so now I have a thing to do some research on.

              • Carlos Ungil says:

                > But alter “known to be false given the information in X” and “known to be true given the information in X” to “known to have no membership in the set Q” or “known to have full membership in the set Q” and you could get a restricted fuzzy logic out of 14 it seems.

                Starting from different assumptions and using a different derivation you may be able to prove a different theorem, we agree.

                > I am not going to buy into the requirement that Cox requires an underlying boolean truth value

                What do you mean by “Cox”? The 1946 paper? The 1961 book? The rigorous variants of his theorem? All of those are built on Boolean propositions. Of course there could potentially exist an alternative Cox reaching alternative conclusions. But why should it be the case?

                It’s not like people have not tried to find more general results. For example, Kosko (1990) concludes that fuzzy theory is an extension of probability theory. http://sipi.usc.edu/~kosko/Fuzziness_Vs_Probability.pdf

              • Kosko shows in this paper http://sipi.usc.edu/~kosko/Fuzziness_Vs_Probability.pdf that probability theory is a special case of a fuzzy subset notion. He gets into a bit of a diatribe about Lindley and Cox and so forth, but in the context of our discussion here, the important point is that the functional relations used in Cox’s derivation also apply to an alternative interpretation. In other words, the degree to which some set of points S is a subset of “the points of type A” is more or less the fraction of points in S that are type A.

                In this context, I’ve mentioned how we can interpret single-term likelihoods as boolean questions about the “truth” of some error, which can be model or measurement error, so I think the important question is what is meant by something like a likelihood p(Data | q1, Model1) p(Model1) + p(Data | q2, Model2) p(Model2) when both Model1 and Model2 are known to be false at some outer-Godelian-onion layer (i.e. I know continuum mechanics is wrong because I know molecules exist).

                In the context of probability theory, where p(Model[i]) is a simplex vector which has a prior, this is a forced choice among alternatives, it is in essence: “assuming one of these is right, which one?”

                This can be seen as a restricted “shootout” between model+measurement errors. If one model definitely dominates another in terms of giving higher probability to the actual results observed in data, the observed result of calculations is that this model will dominate and in the limit of large data have posterior probability 1 within this overall model.

                I think the mistake is in interpreting something like P(Model1) as a probability “that model 1 is the true science”. Why does Model 1 wind up with all the posterior probability? Because, among the various models considered, Model1 assigns high probability to the actually observed errors. Model 1 makes the *best* assessment of its own limitations.

                Even though, in each model, there is a “true” measurement + model error, if one of them consistently assigns true errors to have higher probability, it will dominate the posterior. There’s a kind of asymptotic selection for most accuracy. I think the role of this selection for most accuracy isn’t based on “scientific truth” of the model but rather something else. It being 4am here, I’ll have to leave this for now to come back to it later, but my intuition is that we may be able to construct boolean statements, one of which is true, about which the p(Model1) are everyday Cox probabilities. Doing so would help us figure out what fact is being inferred, and / or what are the “external meanings” of non-identifiability.
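
                The “shootout” can be sketched numerically. In the Python sketch below, everything is a hypothetical stand-in: `truth` plays the role of reality, `model1` and `model2` are both wrong by construction, the parameters are held fixed rather than given priors, and the error scale s = 0.2 and the small alternating perturbation are arbitrary choices. It only illustrates how the less-wrong model accumulates posterior weight as the data grows; it says nothing about how p(Model1) should be interpreted.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def truth(x):
    return x + 0.3 * x * x   # stand-in for "reality"; neither model matches it

def model1(x):
    return 1.3 * x           # wrong, but close on [0, 1]

def model2(x):
    return x                 # wrong, and further off on [0, 1]

def posterior_model1(n, s=0.2, prior1=0.5):
    """Posterior probability of model1 in a forced choice between the two."""
    xs = [i / (n - 1) for i in range(n)]
    # Deterministic alternating perturbation stands in for measurement noise.
    data = [truth(x) + 0.05 * (-1) ** i for i, x in enumerate(xs)]
    ll1 = sum(normal_logpdf(y - model1(x), 0.0, s) for x, y in zip(xs, data))
    ll2 = sum(normal_logpdf(y - model2(x), 0.0, s) for x, y in zip(xs, data))
    a = math.log(prior1) + ll1
    b = math.log(1.0 - prior1) + ll2
    m = max(a, b)            # subtract the max before exponentiating
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))

p_small = posterior_model1(5)    # a handful of observations
p_large = posterior_model1(200)  # many observations
```

                Under these assumptions `model1` is favored only moderately with five observations and takes essentially all the posterior weight with two hundred, even though it is false by construction.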

              • Carlos, I see we crossed paths and both wound up at kosko.

                What I’m thinking of is an alternate model of the Cox proof, not an alternate proof. Simply anonymize the meaning of the elements of, say, Van Horn’s treatment: instead of “propositions” say “glarbs”, instead of “known to be true” say “known to be frob”, etc. Now formally the structure is the same. Can we plug meaning back in such a way that in observer logic we still buy the statements, but we also buy the new meaning of the statement p(model1), even though in observer logic we all agree that model1 is something like continuum mechanics, which is known not to be true on some fundamental level?

              • ojm says:

                Quick comment.

                Kosko seems to say that the law of non-contradiction and the law of the excluded middle are equivalent and fuzzy logic requires violation of both.

                But constructive logic satisfies the law of non-contradiction while denying the general validity of the excluded middle.

              • ojm says:

                Here is an attempt to develop a constructive/intuitionistic Bayes/probability theory:

                http://brian.weatherson.org/conprob.pdf

                What I am not sure of is the upshot – it seems as if we can only then apply probability to observable statements, which would severely restrict Bayes as a universal logic of uncertainty/inference.

              • ojm says:

                It’s probably possible to see ‘falsificationist bayes’ as ‘constructive bayes’ where ‘constructive procedures’ are perhaps akin to generative models.

                A key issue, as Andrew acknowledges, is clarifying exactly which things can and can’t be assigned probability statements/distributions. It’s clear that not everything that classical Bayes assigns standard probability statements to would have such statements in constructive Bayes, but it’s not clear where the line is drawn and how that affects practice.

                I don’t think it’s sufficient to say ‘within the model, Bayes, without non-Bayes’ though, especially when dealing with hierarchical structures. So what would be sufficient then? Again, there seems to me to be some sort of connection to the concept of identifiability. In this case it would be nice to have some (supplementary) method of identifiability analysis.

              • ojm says:

                It also seems entirely possible that constructive Bayes reduces to some form of likelihoodism (in the general sense, not the Bayes minus a prior sense).

              • ojm says:

                (Or to empirical Bayes)

              • ojm: I’m not up on constructivist intuitionist mathematics, other than to know that it exists, and it denies the law of excluded middle. So here I have to pause with all of that, and potentially come back to it when I have sufficient background.

                However, what I will say, and now I really want a book on model theory or something… I’m not sure what it is that is the relevant field… is that I can imagine a logic in which we assume statements like “the drag coefficient is 3.31 at time t=0” are decidable only externally to Bayes (at a scientific level, not a formal level), and the meaning of a statement like that is “under whatever restrictions on our view of the world that lead us to accept the Navier Stokes equations as realistic, including any granularification / round-off / scaling / homogenization or other mathematical modeling assumptions, we can’t tell the difference in the context of that model between whatever the best most exact description of the appropriate drag coefficient value is and the value 3.31 “

                This plays nicely into IST, in which we assume there is some “infinitesimal” scale beyond which we accept that dragcoeff ~= a means standard_part(dragcoeff) = standard_part(a). Of course in the mathematics, we idealize this and talk about 1/N being infinitesimal only if N is nonstandard… but it’s an idealization of the basic concept that in science, beyond some level of precision, differences just don’t matter.

                Then, in a model that has something like

                p(Data | model1) p(model1) + p(Data | model2)p(model2)

                what we really mean is “probability that Data is “essentially equal” to its observed value given model1 is “essentially equal” to the correct model * probability that model1 is essentially equal to the correct model”

                When we do this with a finite set of models, we then demand that the observer for the moment is willing to accept that they can resolve the question of “is essentially the right model” and “is essentially the value Data” at an external level.

                Model checking then can be seen as verification of the notion of “essential equality” implicit in the Bayesian construct. If none of the samples of the parameters in the high probability region produce a model which we are willing to call “essentially correct” then we can reject the particular finite selection of models. In essence the logic of Bayes becomes:

                if (Models are scientifically sufficient) then (parameters are probably the ones in the high probability region of the posterior).

                It becomes then our external observer logic requirement to evaluate whether (Models are scientifically sufficient) is a true fact in observer logic.

                Bayes then becomes a restricted component of observer logic which operates on assumptions about the decidability of “TRUE” at the external level.

                at least that’s the sketch of what seems intuitive to me.
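
                One way to make the “essentially correct” check concrete is a posterior predictive check of the kind described in chapter 6 of BDA. The Python sketch below is only an illustration under stated assumptions: the data are invented, the model is y ~ normal(mu, 1) with a flat prior on mu (so the posterior for mu is normal(mean(y), 1/sqrt(n))), and the test statistic max(y) is an arbitrary choice.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Invented data: eight ordinary values and one the model cannot explain.
y = [-0.3, 0.1, 0.4, -0.2, 0.0, 0.3, -0.1, 0.2, 8.0]
n = len(y)
ybar = sum(y) / n

def replicate():
    """Draw mu from its posterior, then a replicated data set of size n."""
    mu = rng.gauss(ybar, 1.0 / math.sqrt(n))
    return [rng.gauss(mu, 1.0) for _ in range(n)]

t_obs = max(y)  # observed test statistic
n_sims = 2000
t_rep = [max(replicate()) for _ in range(n_sims)]

# Posterior predictive p-value: share of replications at least as extreme.
p_value = sum(t >= t_obs for t in t_rep) / n_sims
```

                A p-value near zero here means that essentially no replicated data set reproduces the observed maximum, which is the signal to go back and revise the set of models rather than trust the posterior within it.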

          • Consider the question “what is the effect of giving out free electric toothbrushes on dental health of elementary school children in Los Angeles County?”

            The “truth” of this is answered by a Lagrangian and initial conditions for all the atoms in the universe.

            The point of Bayes is that *after you’ve picked an approximation to the truth* (ie. an abstract model of the phenomenon at an appropriate level of abstraction) you can do inference about the quantities defined within that abstract view of the universe in a way that obeys Cox’s axioms using a unique mathematical construct, and it will never give you answers that disagree with things like “the length of a toothbrush in cm is a positive number” or “children never brush a negative amount of time” or “we already know that any kind of brushing at all will generally on average make measures x,y,z of dental health be somewhat better than they otherwise would have been if no brushing at all occurred”

            But outside of this model, you might ask yourself questions like “which measures of dental health matter, and what does it mean to take those measurements, and do I need to consider the fact that some of the measurers were blinded to treatment and others weren’t?” and many of those other questions necessarily are outside of a formal Bayesian scheme. And all this is even more true when we realize that our model is known to be false to begin with, the “right” model is that enormous Lagrangian.

            You can ask yourself “why care about the Cox axioms?” but once you’ve accepted them for the inference within a model, you have 1 choice to proceed.

            http://ksvanhorn.com/bayes/Papers/rcox.pdf

            Axioms are (roughly)

            R1: plausibility is a real number
            R2: plausibility agrees with boolean logic
            R3: plausibility of a fact and its negation are related in an obvious way
            R4: It makes sense to potentially apply plausibility very broadly (not restricted to certain simple discrete cases).
            R5: plausibility of combined statements combine using a formula related to conditionality

            The alternative is to not ask questions about plausibility, and just ask questions about “how often would X happen if I assumed Y was a random number generator with a given distribution” and that’s a particularly illegitimate question when “if I assumed Y was a random number generator with a known distribution” simply doesn’t even approximately apply to your model of dental health.

        • Andrew says:

          Jadagul:

          I agree with you that there are people who call themselves Bayesians who have the beliefs that you state.

          But, just to return to my original post, I was reacting to Heckman and Singer’s statement, “Bayesians have no way to cope with the totally unexpected.” This is a false statement. “Bayesians,” of which I am one, do have a way to cope with the totally unexpected. This way is described in chapter 6 of a 22-year-old book that happens to be the best-selling textbook on Bayesian statistics. It’s not exactly an obscure idea. Again, I can’t fault Heckman and Singer for their ignorance—we’re all ignorant in our own ways. But I can fault them for their overconfidence.

  4. Tom Dietterich says:

    A goal of AI researchers is to create an autonomous agent that can learn from its interactions with the world. This requires us to develop a fully formal method, rather than one that relies on human insight and informal model checks. So an interesting challenge is to automate the “Gelmanian” process of model formulation, conditioning, and posterior predictive checks. When the model fails, how do we formalize the next moves in model formulation space? If there is a Bayesian account, that would be great. I don’t think we have any account yet.

    • Keith O'Rourke says:

      > we formalize the next moves in model formulation space?
      CS Peirce did claim (~ 1905) that abduction should be a logic (normative providing oughts) but only of the vaguest sense.

      Not aware if anyone has gotten any further?

  5. Bill Jefferys says:

    This is an interesting discussion.

    The following draft paper came to my attention today; it seems to be treading on similar although not identical territory and I wonder what people think of it.

    http://philsci-archive.pitt.edu/13167/1/NRF%20Draft.pdf

    • ojm says:

      It does seem to be somewhat similar territory.

      For what it’s worth, I generally dislike arguments of the form

      “There is no good argument for the existence of God, so I am an atheist”

      “You can’t prove that my client is guilty, so I believe that she is innocent”

      etc

      which they claim can (sometimes) be given a Bayesian justification. And I probably dislike the arguments for similar reasons to my general qualms about (philosophical/dogmatic as opposed to e.g. BDA-style) Bayes.

  6. Andrew, great article and valid points you bring up regarding the main issue. When I first got into modeling I was told to remember that “all models are wrong but some are useful.” As you mentioned in the article for step 3 of Bayesian data analysis, it is important to review the model to make sure it fits the data. If it does not, learn as to why it doesn’t work and make the necessary adjustments to reevaluate the model.

    • Glen M. Sizemore says:

      “…it is important to review the model to make sure it fits the data. If it does not, learn as to why it doesn’t work and make the necessary adjustments to reevaluate the model.”

      GS: Such practice can lead to reasonable prediction, to be sure. Ptolemy was good at that, too, if you remember. Not all prediction is science even though prediction is a goal of science along with control and interpretation of complex cases. Simply continuously modifying “models” post-hoc is not science though it is frequently done in the name of science. Or am I missing something?

  7. Shravan says:

    One implication of doing Bayesian analyses seems to me to be that there is no such thing as a confirmatory analysis. It’s all exploratory. Even after you fit a model that was pre-planned, you usually have to go and modify it after you do your posterior predictive checks and fake data simulations. You have to fit many models.

  8. Leon says:

    The word “Bayesian” is being used here in two ways. In the first, narrow sense,

    (1) Bayes = inference = adjusting a distribution over a hidden quantity by conditioning on what’s observed.

    Opposed to this is

    (1′) Non-Bayes = testing/checking = considering *separately* different possible values of a hidden quantity, & how well each fits what’s observed (or predicts a different hidden quantity). Model checking, p-values, and confidence intervals are all instances of this.

    On the other hand, in the broader sense, “(modern) Bayesian” is used to mean

    (2) The applied statistical philosophy of the BDA authors, i.e.: use (Bayesian) inference (1) liberally for parameter values, and (non-Bayesian) testing/checking (1′) for models.

    Opposed to this might be:

    (2′) Use Bayes for everything — all hidden quantities are either assumed fixed or given a distribution. Never consider multiple models without adjusting a distribution over them.

    or

    (2”) Never model/”treat as random” a quantity you don’t observe.
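
    The contrast between (1) and (1′) can be sketched on a toy problem. The Python below assumes invented data (7 successes in 10 trials), a uniform prior on a grid of theta values, and an exact likelihood-ordered tail probability as the per-value check; none of these choices come from the discussion above.

```python
import math

def binom_pmf(x, n, theta):
    """Binomial probability of x successes in n trials."""
    return math.comb(n, x) * theta**x * (1.0 - theta) ** (n - x)

k, n = 7, 10                              # invented data
thetas = [i / 20 for i in range(1, 20)]   # grid 0.05, 0.10, ..., 0.95

# (1) Inference: condition a single distribution over theta on the data.
weights = [binom_pmf(k, n, th) for th in thetas]  # uniform prior cancels
total = sum(weights)
posterior = [w / total for w in weights]

# (1') Checking: take each theta separately and ask how well it fits.
def check_pvalue(theta):
    """Total probability of outcomes no more likely than the observed k."""
    pk = binom_pmf(k, n, theta)
    pmfs = [binom_pmf(x, n, theta) for x in range(n + 1)]
    return sum(p for p in pmfs if p <= pk * (1.0 + 1e-12))

pvalues = [check_pvalue(th) for th in thetas]
not_rejected = [th for th, p in zip(thetas, pvalues) if p >= 0.05]
```

    The list `posterior` is one distribution that sums to 1 across the grid, while `pvalues` is a separate statement about each theta with no requirement to sum to anything, which is the difference in interpretation at issue.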

    The real question seems to be: should we use the word “Bayesian” primarily to describe a way to treat hidden quantities, or primarily to describe a (fairly ecumenical/moderate) applied statistical philosophy? I think some strong arguments in favor of the first option are

    – Consumers of statistics need to understand the drastically different interpretations of (1) vs (1′). So preserving that distinction is important. And IMO both have their place in practice. One should be both Bayesian and non-Bayesian.

    – The distinction between model (collections) and parameter (values) is extremely superficial and at times pedagogically confusing.

    – Calling an applied philosophy “Bayesian” that is actually pretty ecumenical smacks to me of gerrymandering. It’s like the opposite of “no true Scotsman”: “of course *modern* Bayesians are willing to use model checking”. OK, so why not just acknowledge that a BDA-ish philosophy is straightforwardly broader than the orthodoxies of the past?

    • Andrew says:

      Leon:

      Heckman and Singer wrote, “Bayesians have no way to cope with the totally unexpected.” The term “Bayesian” refers to people who use Bayesian statistics, no? BDA is a 22-year-old book, and Heckman and Singer’s article came out just this year. If they’d said something like, “25 years ago, most Bayesians had no way to cope with the totally unexpected,” then, sure, fine. But that’s not what they said. They did not criticize the “orthodoxies of the past.” They used the present tense.

      As a more technical matter, I disagree with your claim that Bayesian model checking as discussed in chapter 6 of BDA is “non-Bayesian.” What we do is completely Bayesian: it’s a working out of implications of the posterior distribution. You can call it modern Popperian (in that we are using the implications of the model to decide to (probabilistically) reject it), you can call it Lakatosian (in that we’re using problems in our model to motivate improvements), you can call it Cantorian (in that we recognize the impossibility of laying out all possible model choices ahead of time), but in any case it’s Bayesian in its use of the posterior distribution. You can read my 2003 paper for elaboration of this point.

      Of course, that’s just me talking. You have as much of a right to call me non-Bayesian as I have to call myself Bayesian. And, book sales aren’t everything—but to the extent that words are defined in part by their use, I think it’s fair when talking about what “Bayesians have no way to cope with” something, to recognize that the tens of thousands of people who’ve read our book do have a way!

    • ojm says:

      I agree with the have-your-cake-and-eat-it-too nature of a lot of this. Both in the positive sense and the negative sense, appropriately…

  9. Keith O'Rourke says:

    Perhaps a general comment on abduction: when I met C. R. Rao in Toronto in the early 1980s, I mentioned that he did not address abduction in his talk or book. He said others had also raised that point, and in the next edition of his book there was a chapter on it.

    Abduction is something that has been overlooked by many in science, philosophy and statistics until fairly recently.
