
The fallacy of the excluded middle — statistical philosophy edition

I happened to come across this post from 2012 and noticed a point I’d like to share again. I was discussing an article by David Cox and Deborah Mayo, in which Cox wrote:

[Bayesians’] conceptual theories are trying to do two entirely different things. One is trying to extract information from the data, while the other, personalistic theory, is trying to indicate what you should believe, with regard to information from the data and other, prior, information treated equally seriously. These are two very different things.

I replied:

Yes, but Cox is missing something important! He defines two goals: (a) Extracting information from the data. (b) A “personalistic theory” of “what you should believe.” I’m talking about something in between, which is inference for the population. I think Laplace would understand what I’m talking about here. The sample is (typically) of no interest in itself, it’s just a means to learning about the population. But my inferences about the population aren’t “personalistic”—at least, no more than the dudes at CERN are personalistic when they’re trying to learn about particle theory from cyclotron experiments, and no more than the Census and the Bureau of Labor Statistics are personalistic when they’re trying to learn about the U.S. economy from sample data.

I feel like this fallacy-of-the-excluded-middle happens a lot, where people dismiss certain statistical approaches by too restrictively defining one’s goals. There’s a wide, wide world out there between the very narrow “extract information from the data” and the very vague “indicate what you should believe.” Within that gap falls most of the statistical work I’ve ever done or plan to do:

62 Comments

  1. Chris Wilson says:

    Great stuff!

  2. Keith O'Rourke says:

    Some of this is discussed here http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

    For instance, “It seems to be already widely supported for probability generating models for data “[providing an] explicit description in idealized form of the physical, biological, . . . data generating process,” that is essentially “to hypothesize a data generating mechanism that produces observations as if from some physical probabilistic mechanisms” (Reid and Cox, 2015.)”

    Now I pitched Andrew and my perspective to Reid and Cox “That is, an explicit prior probability distribution to represent available but rough scientific judgements of what values the unknown parameters might have been set to and a data generating probability distribution to represent how the recorded data likely came about if the unknown parameters’ values were set to specific possible values.” Cox replied along the lines of prior information is often very important but insisting on quantifying its uncertainty is often (nearly always?) a bad idea.

    Anyway, I don’t think _my take_ on his position has changed (yet) https://andrewgelman.com/2012/02/01/philosophy-of-bayesian-statistics-my-reactions-to-cox-and-mayo/#comment-73186

  3. It would be great if Deborah Mayo and others contribute to this thread. I enjoy reading the debates back and forth. I’d like to see Sander Greenland, Stephen Senn, Daniel Lakens, and many others here.

    • Chris Wilson says:

      Meh, I am pretty over the “Bayes is subjective/personalistic/belief-based” arguments (closely related to “but where do your priors come from?!”). There are plenty of productive philosophical discussions to have about practice, but I am skeptical of where this particular cue would lead at this point.

      • Kyle C says:

        +1. And I for one find Mayo’s comments at this blog simply inscrutable.

        • Anonymous says:

          It would be clarifying if Mayo applied her philosophical ideas to a real, nontrivial data set + question (without any help from anyone else) and showed us her analysis and conclusions.

          • Kyle C says:

            +1. And without trying to coin aphorisms or bon mots, and without the academic philosophical apparatus of slicing and dicing other people’s words. Just say plainly what you mean and how it works better in practice than the alternatives.

            • Anonymous says:

              The great irony of Mayo, of course, is that the centerpiece of her thinking is the idea of subjecting hypotheses to severe tests, yet she’s never even come close to subjecting any of her ideas to severe tests.

              I used to think this was merely “priest refusing to look through Galileo’s telescope” style frequentist fanaticism, but it’s pretty clear she doesn’t have the technical expertise to do so.

        • Deborah can correct me if I’m wrong. But if Kyle is referring to Deborah’s comments in Andrew’s link, then I’d say that her insights are closer to some commentary on objectivists discussion forums I came across in the late 80’s and early 90’s.

  4. Michael says:

    Nobody is really putting “prior information” into their analysis; I have never seen a prior distribution that is not centered at zero. It’s just regularization.

    • For Bayesian inference in general and Stan in particular, we generally recommend at least weakly informative priors—priors that inform the scale of the parameter.

      When we can, we use informative priors. There are lots of examples on the Stan forums. We use them a lot in pharmacological modeling. Here’s a paper on toxicology with informative priors by Frederic Bois, Andrew, and others.

      • Thanatos Savehn says:

        A very interesting paper. I was involved in the PERC wars back then but don’t remember this. Maybe it was because I was always looking for the magic phrase “statistically significant” (which is absent from this paper).

        A couple of questions. Consider: “These distributions incorporate a mix of uncertainty (measurement errors) and variability (results from several animals were pooled)” and three sentences later this: “Such a conclusion suffers from confusion between uncertainty and variability …” I thought that uncertainty in e.g. acceptance sampling in the ball bearing factory arose from both the machinery and the technician’s calipers. It seems to me that if you know what the measurement error is then you’d be a bit less uncertain about the variability, and vice versa. What am I missing?

        I was similarly intrigued by the comments about large variations in metabolism among individuals and thought this comment particularly interesting: “While uncertainty could be reduced by additional analyses, population variability … could increase when more subjects are included.” There are of course several critiques of RCTs out there making this very point (and demonstrating it with examples of RCTs failing badly when used to predict variation in treatment effect in much larger populations). So I get that more sampling might uncover wider variations but how does measuring more ball bearings reduce a caliper’s propensity to err?

        No doubt my confusion about uncertainty, variability and measurement error is just another instance of a layman trying to understand what’s going on without having first taken Statistics as foreign language credit.

        • It’s my thought that Bayes does a great job of dealing with all sources of uncertainty because it gives a single measure of how much information you know about something. Whether you have uncertainty because you have errors in your measurements, or your model isn’t taking everything into account, or you are using a value of a parameter that isn’t quite the correct one, or there’s some fundamental unknowability (maybe you’re using a quantum measurement device or some such thing), all of it represents failure to have perfect information, and that failure can be described in terms of a measure of how much weight to give to the various alternatives.

          Once you get there, all this stuff about “epistemic vs aleatory” and “confusion between uncertainty and variability” and so forth just falls away. Fundamentally, if something varies from instance to instance that’s one reason why *you* fail to have perfect information about what you should predict… among many others.

          • a reader says:

            “It’s my thought that Bayes does a great job of dealing with all sources of uncertainty because it gives a single measure of how much information you know about something.”

            Yes, but no.

            In particular, it’s important to recall that (a) whatever likelihood function you use doesn’t fully capture the data generating process and (b) whatever prior you use doesn’t fully reflect your prior knowledge of the subject. So now let’s suppose you run a Bayesian analysis on some data and you get a counter intuitive result. You could just blindly accept it, or you could go back and double check your prior or your likelihood.

            So either your Bayesian analysis does confirm exactly what you expected coming in, or you do need to break things up. And really, even if it does confirm exactly what you were expecting, you should probably double check anyways, otherwise you’re hugely biasing everything toward what you expect.

          • Nat says:

            It’s my thought that Bayes does a great job of dealing with all sources of uncertainty because it gives a single measure of how much information you know about something. […] once you get there, all this stuff about “epistemic vs aleatory” and “confusion between uncertainty and variability” and soforth just falls away

            I am not sure what you mean by “single measure” and “falls away”, but the separation between epistemic and aleatory uncertainty can be very important from a practical standpoint. For example, if you are designing an airplane it is often useful to treat lack of knowledge (epistemic uncertainty) separately from random variability (aleatory uncertainty) in order to estimate a distribution of probabilities of failure, rather than a single measure that includes both. I will admit that the distinction between epistemic and aleatory uncertainty can become muddied, however, I think it is still a useful concept. I am not sure that a Bayesian philosophy allows you to avoid the tricky problem of uncertainty sources.
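The contrast between a "distribution of probabilities of failure" and a single pooled number can be sketched with a small nested calculation. This is only an illustration under made-up assumptions: the capacity of 10, the load standard deviation of 1, and the Normal(6, 1) uncertainty about the mean load are hypothetical values, not taken from any real design problem.

```python
import math
import random
import statistics

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def failure_probabilities(n_epistemic=2000, seed=1):
    """One failure probability per draw of the uncertain mean load."""
    rng = random.Random(seed)
    strength = 10.0   # hypothetical capacity, treated as known
    load_sd = 1.0     # aleatory: flight-to-flight variation in load
    probs = []
    for _ in range(n_epistemic):
        # Epistemic: we do not know the true mean load (hypothetical prior).
        mean_load = rng.gauss(6.0, 1.0)
        # Given this mean load, failure means the load exceeds the capacity.
        probs.append(1.0 - normal_cdf(strength, mean_load, load_sd))
    return probs

probs = failure_probabilities()
pooled = statistics.fmean(probs)         # single number mixing both sources
qs = statistics.quantiles(probs, n=20)   # cut points at 5%, 10%, ..., 95%
print(f"pooled failure probability: {pooled:.2e}")
print(f"5% to 95% range of failure probability: {qs[0]:.2e} to {qs[-1]:.2e}")
```

The pooled value is what a single predictive distribution would report; the spread across epistemic draws is the extra information that gets averaged away when the two sources are merged.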

        • Martha (Smith) says:

          Yes, the words uncertainty and variability are used in different ways by different people in different circumstances. I’ve got a little discussion (including in the footnotes) of various uses of terminology for these concepts at http://www.ma.utexas.edu/users/mks/statmistakes/terminologyrevariability.html .

      • Shravan says:

        A point relating to “personalized” analyses:

        Spiegelhalter and others have used the phrase community of priors, and I really like the idea of displaying the posterior under different beliefs. I don’t know why this isn’t standard practice, why it isn’t part of the Bayesian workflow. IMHO Bayesians should own the use of different degrees of informativity in priors, what people derisively call “personalized” or “subjective”.

        For example, in phonetics we study voice onset time or VOT, which is (roughly speaking) the amount of time in milliseconds that it takes for the vocal folds to start vibrating from the moment that the articulation of a sound like g is produced. Men and women have different VOTs. The gender effect is about 3-20 ms in Mandarin (women have longer VOTs). English has about 0-18 ms. The effect is going to be definitely well below 50 ms. It’s a subtle effect that isn’t easy to detect.

        In this situation, a Normal(0,100) prior is not only uninformative, it is *wildly* uninformative. A much more reasonable prior could be Normal(0,20), and increasingly uninformative priors would be Normal(0,50), Normal(0,70). In this situation, I would display the posteriors for all of them. It will turn out that the posterior mean and 95% credible interval is generally unaffected by this sensitivity analysis, but it’s good to know that even if one doesn’t ultimately display that in the published paper.
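A community-of-priors display of this kind can be sketched with a conjugate normal update. This is a minimal illustration, not the analysis from the paper: the estimate of 10 ms and standard error of 4 ms are hypothetical stand-ins for the data summary, and the function name is made up for this example.

```python
import math

def normal_posterior(prior_sd, estimate, se, prior_mean=0.0):
    """Conjugate normal-normal update, treating the standard error as known."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / se**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, math.sqrt(post_var)

# Hypothetical data summary (NOT from the paper): effect estimate and SE in ms.
estimate, se = 10.0, 4.0

for prior_sd in (20, 50, 70, 100):
    mean, sd = normal_posterior(prior_sd, estimate, se)
    lower, upper = mean - 1.96 * sd, mean + 1.96 * sd
    print(f"Normal(0,{prior_sd}): posterior mean {mean:.2f}, "
          f"95% interval [{lower:.2f}, {upper:.2f}]")
```

With these made-up inputs the posterior mean and interval move only slightly as the prior widens, which is the pattern described above.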

        Because many people equate Bayes with Bayes factor, I feel it’s even more important to make it standard practice to display the result of such a sensitivity analysis. Look at how the Bayes factor (in favor of the alternative that there is an effect of gender, vs a point null hypothesis that the effect is 0) changes depending on a very reasonable prior like Normal(0,20) vs a very wildly uninformative prior like Normal(0,70).

        Normal(0, 20) 6.45
        Normal(0, 50) 3.14
        Normal(0, 70) 2.44
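The direction of this pattern can be reproduced in a simple normal model, where the Bayes factor against a point null is a ratio of two normal densities. This is a hedged sketch and will not reproduce the numbers in the table: the estimate of 10 ms and standard error of 4 ms are hypothetical, and the paper's model is not a single normal likelihood.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_factor_10(prior_sd, estimate, se):
    """Bayes factor for H1: effect ~ Normal(0, prior_sd) vs H0: effect = 0.

    Assumes estimate ~ Normal(effect, se) with se known, so the marginal
    under H1 is Normal(0, sqrt(se^2 + prior_sd^2)).
    """
    marginal_h1 = normal_pdf(estimate, 0.0, math.sqrt(se**2 + prior_sd**2))
    marginal_h0 = normal_pdf(estimate, 0.0, se)
    return marginal_h1 / marginal_h0

# Hypothetical data summary (NOT from the paper).
estimate, se = 10.0, 4.0
for prior_sd in (20, 50, 70):
    bf = bayes_factor_10(prior_sd, estimate, se)
    print(f"Normal(0, {prior_sd}): BF10 = {bf:.2f}")
```

Widening the prior spreads its mass over implausibly large effects, so the evidence for the alternative shrinks even though the data are unchanged.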

        For me this is a nice example of why we should make it part of the workflow to study the effects of the prior on inference (I know this is not an original point; I am repeating Spiegelhalter’s point, I encountered it in one of his books, 2004 I think). I know that Andrew dislikes Bayes factors, but my more general point here is that one should make it a part of the analysis to display the effect of the prior on inference and hypothesis testing (if one wants to go down that road).

        All the details of this example, along with reproducible code, are in an in-press paper, available from OSF. (I’m not an expert in this area, but I worked with one on this.)

        • Martha (Smith) says:

          Good point and examples.

        • Rahul says:

          As an aside, why is something like VOT useful to measure? Sounds pretty arcane.

          • Lai Ka Yau says:

            It is one of the main cues (if not usually the main one) for distinguishing between aspirated sounds like the [t] in ‘temple’ and the [t] in ‘stop’.

            • Lai Ka Yau says:

              Sorry, I was careless in my phrasing! I meant to say, ‘for distinguishing between aspirated sounds like the [t] in “temple” and unaspirated sounds like the [t] in “stop”’.

          • Shravan says:

            It’s important for basic research in phonetics and related areas, as Lai Ka Yau explained. It’s a deep area, but I suspect you are asking for real life applications. Maybe accent recognition (here) makes it less arcane sounding to the non-linguist.

            A lot of things linguists do have no practical relevance other than helping us understand deep properties of human language. I’m pretty comfortable with that immediate-uselessness of the research. A lot of abstract problems end up having surprising practical implications we cannot anticipate, but that’s not the motivation driving the study of abstract questions.

        • Keith O'Rourke says:

          Agree, the challenge is many Bayesian statisticians working as consultants/collaborators just want to pull the Bayesian crank and point to _the_ posterior as the definitive answer. Default priors take the least time and, if forced, sensitivity analyses that really don’t change much will be quickly produced. Just economics.

          • Chris Wilson says:

            Yup. Substituting *the* posterior for *the* p-value, or whatever. What Shravan recommends is a harder conceptual leap, because it involves embracing that all modeling is conditional on assumptions/specifications…no more “uncertainty laundering” in other words :)

            • Shravan says:

              I should add that this is my attempt to make concrete the more general point that Andrew keeps making, to embrace uncertainty. What does that mean in practice? I am turning to the classic proposals (community of priors) to realize that point in a way that makes sense within the scientific context of linguistics. Maybe Andrew has a different view. I’m looking forward to talking to Michael Betancourt about this too in a few weeks when he comes to Potsdam to teach Bayesian statistics.

              • Chris Wilson says:

                I look forward to reading any further reflections you have after those discussions :) It’s an important area for practice!

              • Anoneuoid says:

                I should add that this is my attempt to make concrete the more general point that Andrew keeps making, to embrace uncertainty. What does that mean in practice?

                Just look to areas where they test the predictions of their model (as opposed to trying to reject a null model to somehow get support for their model). The theorists are collectively biased towards accounting for as much uncertainty in the estimates as possible since this allows more theories to survive the test.

                Eg, look how they seek out sources of uncertainty to allow the pioneer anomaly to be consistent with GR: https://arxiv.org/abs/1204.2507

                The lack of “embracing uncertainty” is just one more effect of reversing the logic of science by trying to reject a strawman null model rather than your own model.[1] So many bad incentives stem from that error. It is really so simple to fix all of this, but it means calling into question too many “facts” (ie, sacred cows) generated from the incorrect procedure, so people just refuse to accept it.

                [1] https://meehl.dl.umn.edu/sites/g/files/pua1696/f/074theorytestingparadox.pdf

              • Shravan says:

                Anoneuoid, thanks for the pointers. I am in the process of going down this road (question everything), not making many friends I suspect (but that’s what tenure buys you, the freedom to say it like you see it).

              • Shravan says:

                Chris, I will blog about what I learn from Betancourt. He will spend a lot of time educating us on foundational issues.

          • Chris Wilson says:

            To push the metaphor a little more- this is difficult because as you say most scientists want going to statistical collaborators/consultants to be like going to the laundromat- where they get the blessing of clean certain statements in place of messy uncertainty. This is what journal reviewers often want, so the ripple effect goes into publications, colleague esteem, tenure and promotion reviews with deans, etc.

      • Michael says:

        Would you recommend someone who wants to estimate a treatment effect to use a prior on the treatment effect that is not centered at zero?

        • Phil says:

          I am replying to Michael, “Would you recommend someone who wants to estimate a treatment effect to use a prior on the treatment effect that is not centered at zero?”, I hope this comment shows up in the right place. I assume this question is addressed to either Andrew or to earlier commenters on this thread but I’ll give my own answer:
          (1) it depends on what information you have about the treatment effect; if you have plenty of evidence that the treatment effect is substantially greater than zero then there is no reason to center your distribution at zero (and good reason not to).
          But (2) if the details of your prior distribution make a big difference then you probably aren’t learning all that much from your experiment…which is fine, it’s OK to add a modicum of information to an existing body of work.

    • Carlos Ungil says:

      If saying that “zero or close to zero is more likely than far from zero” for some parameter in the model is not prior information, then what is it?

      I also don’t understand Andrew’s point. How are “inferences about the population” different from “what you should believe about the population with regard to information from the data and other, prior, information”?

      • Martha (Smith) says:

        Replying to Carlos’s last sentences:

        I can’t read Andrew’s mind, so this may not be his view, but the way I see it is that there is a difference between a statistical inference and “what you should believe about the population” — namely, that a statistical inference always involves some degree of uncertainty, while “what you should believe” sounds like expressing certainty.

      • Michael says:

        It is prior information, but at least you’re using the same zero as someone who does a null-hypothesis significance test. It’s not like anyone ever uses a Normal(5,1) prior for a treatment effect and then says “look, the posterior distribution suggests the treatment effect is positive”.

        If people would do that I could understand the concern about Bayesian inference being ‘personalistic’, but nobody actually does that.

        • Carlos Ungil says:

          Estimation of likely-to-be-zero treatment effects is not the only thing that Bayesian inference is used for.

          Do you think that a prior centered on zero is the best option to study the effect of education on income? To estimate the speed of light? Or the size of fish populations?

          Of course zero is just a number and any prior can be centered at zero by a change of variables…

          • Andrew says:

            Carlos:

            I would not use a zero-centered prior for the speed of light. But I would use a zero-centered prior for the change in the speed of light.

            I would not use a zero-centered prior for the effect of a particular intervention, compared to doing nothing. But I would use a zero-centered prior for the effect of a particular intervention, compared to the status quo.

            Etc.

            • Carlos Ungil says:

              i.e. change of variables: the baseline is the prior

              • Carlos Ungil says:

                (Ok, it’s not always an equivalent change of variables because the baseline may be estimated at the same time and estimating the difference is better than calculating the difference in the estimates. But not everything is about comparing interventions.)
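The change-of-variables point can be written out for the simple case of a fixed, known baseline. The symbols here ($\theta$ for the quantity of interest, $\theta_0$ for the baseline, $\delta$ for the difference, $\tau$ for the prior scale) are notation introduced for this illustration only.

```latex
\delta = \theta - \theta_0, \qquad
\delta \sim \mathrm{Normal}(0, \tau^2)
\;\Longleftrightarrow\;
\theta \sim \mathrm{Normal}(\theta_0, \tau^2)
\quad \text{(for a fixed, known baseline } \theta_0\text{)}.
```

When $\theta_0$ is itself estimated from data, the two statements are no longer interchangeable, which is the caveat raised in the parenthetical above.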

    • David Marcus says:

      Lots of nonzero means at https://www.ratingscentral.com. See https://www.ratingscentral.com/UnratedPlayers.php. Actually, probably too many, but that is a different problem.

      Back when I was working on the Trident submarine, we would occasionally have a nonzero mean, if the hardware hadn’t been updated. But, usually the prior provided scale information.

  5. Isn’t the choice of likelihood in a classical analysis subjective/personalistic/statistician-derived rather than being given by the data alone or by nature?

    How is the information extracted from data not something you should believe?

    • Martha (Smith) says:

      My response to your last sentence is essentially the same as my response to Carlos above.

    • Michael J Lew says:

      Seems to me that Andrew’s three challenges go well together with Richard Royall’s three questions:
      What do the data say?
      What should I believe now that I have these data?
      What should I do or decide now that I have these data?
      Those questions are not interchangeable with or equivalent to Andrew’s challenges, but complementary. And they help clarify the issues that Bob Carpenter raises.

      The choice of likelihood is largely a red-herring because in many cases the choice is really a choice of statistical model (the likelihood comes with the model). All statistical analyses need a model, and in many cases the model is equivalent for Bayesian and frequentist methods. Nonetheless, I agree that there is a degree of subjectivity involved.

      The likelihood function tells what the data say, according to the model. When you want to make an inference you should include what the data say and all other relevant information, be it prior evidence or information that helps extrapolation from a sample to a population that was not directly sampled. If you want to decide or recommend an action then the costs and benefits of action should be weighed along with what the data say and what you believe.

      • Martha (Smith) says:

        Your last two sentences get to the heart of the matter.

      • Keith O'Rourke says:

        Michael:

        Never liked this wording “The likelihood function tells what the data say, according to the model”

        The data never say anything (as Sander Greenland once put it – if you think the data are saying something you need a psychiatrist).

        It is by choosing to consider the data as if it was generated by a probability model (an idealization) by which a likelihood is defined (that’s a big assumption). Some further argue that that requires or makes at least more sense if how the unknown parameter value was set/determined can also be considered as being generated by a probability model (another idealization – bigger or not as big?).

        Bypassing these steps with the phrase what the data say (even with caveats of subjectivity being stated) suggests more direct access to reality than we have.

  6. Anoneuoid says:

    Generalizing from sample to population and from past to future

    I find it useful to think of these as completely different problems, grouping them together has led to a lot of confused applications of stats imo. Eg, see Deming’s distinction between “enumerative” and “analytic” studies:

    Dr. Deming’s aim in this paper is to contribute something to the improvement of statistical practice. He distinguishes between enumerative studies and analytic studies. An enumerative study has for its aim an estimate of the number of units of a frame that belong to a specified class. An analytic study has for its aim a basis for action on the cause-system or the process, in order to improve product of the future. Techniques and methods of inference that are applicable to enumerative studies lead to faulty design and faulty inference for analytic problems. It is possible, in an enumerative problem, to reduce errors of sampling to any specified level. In contrast, in an analytic problem, it is impossible to compute the risk of making a wrong decision. A number of examples clarify the issues.

    https://deming.org/deming/deming-articles

    You can find the ref there but here it is:
    W. Edwards Deming. On Probability As a Basis For Action (1975). The American Statistician, Vol.29, No. 4, 1975, pp. 146-152. https://s3.amazonaws.com/wedi/www/Articles/b21d561f-9aea-4b8d-8757-d70964ae13b5.pdf

  7. Pedro says:

    I might be missing something here, but I don’t see the difference between the two approaches mentioned. When you extract information from data, you are, inter alia, answering the question ‘what should I believe given this data?’. It’s not as if this ‘extraction’ were some sort of manual labour where you mindlessly extract information in a way that bears no relation to epistemology. You are given incomplete information about a hypothesis and you want to know what you are justified in believing given this partial picture. Conversely, in order to know what you should believe, you should learn how to extract information from the data you have. Two different ways of saying the same thing.

  8. Jag Bhalla says:

    I worry that “the data” risks becoming a heterogeneity-hiding abstraction
    see
    In “The Concept of Scientific History” Isaiah Berlin distinguishes two kinds of data…
    https://bigthink.com/errors-we-live-by/the-two-kinds-of-data-how-facts-in-economics-resemble-those-in-history
    and
    (rote-used) stats methods risk focusing on logic-losing numbers
    https://bigthink.com/errors-we-live-by/judea-pearls-the-book-of-why-brings-news-of-a-new-science-of-causes

  9. Mikhail says:

    What paper is this screenshot from? I can’t find it.

    • Dan Mirman says:

      Same question. I like this inference-as-prediction formulation and it is relevant to some of my current projects. I might use it when I talk about those projects and I want to make sure I attribute it properly.

    • Andrew says:

      Mikhail, Dan:

      The screenshot is of the very first paragraphs of the forthcoming book, Regression and Other Stories, by Andrew Gelman, Jennifer Hill, and Aki Vehtari, published by Cambridge University Press, to appear in 2018 or 2019. OK, let’s be realistic. 2019.

  10. Shravan says:

    Related: there is a somewhat bizarre piece about cargo-cult statistics, and cargo-cult Bayesian statistics in Significance:
    here.

    It says there:

    “There is also Bayesian cargo-cult statistics. While a great deal of thought has been given to methods for eliciting priors,40, 41 in practice, priors are often chosen for convenience or out of habit; perhaps worse, some practitioners choose the prior after looking at the data, trying several priors, and looking at the results – in which case Bayes’ rule no longer applies! Such practices make Bayesian data analysis a rote, conventional calculation rather than a circumspect application of probability theory and Bayesian philosophy.”

    Does anyone know examples of people gaming the prior in this way? I have never heard of or seen such a misuse. The authors make this claim but provide no references.

    Aside: I wanted to post this question on Significance’s blog, but then I read their terms and conditions: they promise to sell my data to advertisers. No thanks. I am a member of the Royal Statistical Society, and am disturbed that the RSS wants to sell my data if I participate in commenting on their articles.

    • Carlos Ungil says:

      I’m not sure if this is what you mean by “gaming the prior”:

      https://andrewgelman.com/2015/08/25/can-you-change-your-bayesian-prior/

    • Andrew says:

      Shravan:

      I found things to agree with and disagree with in the linked article.

      The authors write, “The problem is one of cargo-cult statistics – the ritualistic miming of statistics rather than conscientious practice.” I agree about cargo-cult statistics being a problem, but I think otherwise this quote misses the point. It’s my impression that ESP researcher Daryl Bem, for example, used “conscientious practice” when doing his statistics; the problem is that his noise outweighed any signal that might have been there. All the conscientiousness in the world won’t help you, if (a) your noise overwhelms your signal, and (b) you’ve decided from the beginning that you’re looking for a positive finding.

      I have similar problems with the authors’ statement that “science has become a career, rather than a calling.” Again, I think that for Bem, the ESP search is a calling, not a career, and I expect that his motives are pure. But that doesn’t help. Indeed, pure motives could hurt, in the sense that he could think that his purity protects him from bad things like “p-hacking.”

      On the plus side, I fully agree with this quote from that article: “no good statistician, Bayesian or frequentist, would ignore how the data were generated in assessing statistical evidence.” I would just change “generated” to “generated and collected.”

      Regarding the statement, “priors are often chosen for convenience or out of habit; perhaps worse, some practitioners choose the prior after looking at the data, trying several priors, and looking at the results – in which case Bayes’ rule no longer applies”: The exact same concerns apply to any statistical approach! Just change “prior” to “likelihood” or “data model” or “hypothesis” or “estimator” or “machine-learning method” or whatever.

    • Mikhail says:

      > Does anyone know examples of people gaming the prior in this way? I have never heard of or seen such a misuse. The authors make this claim but provide no references.

      I did this.
      I started with the model and a prior. I estimated the posterior, and it turned out it was completely ridiculous. The prior was not regularizing enough. So I made it more informative.

      Oh yes, and I changed the model like 100 times, because the problem was harder than I originally expected so I had to change the model.
