The Anti-Bayesian Moment and Its Passing

This bit of reconstructed intellectual history is from a few years ago, but I thought it was worth repeating. It comes from the rejoinder that X and I wrote to our article, “‘Not only defended but also applied’: The perceived absurdity of Bayesian inference.” The rejoinder is called “The anti-Bayesian moment and its passing,” and it begins:

Over the years we have often felt frustration, both at smug Bayesians—in particular, those who object to checking of the fit of model to data, because all Bayesian models are held to be subjective and thus unquestioned (an odd combination indeed, but that is the subject of another article)—and angry anti-Bayesians who, as we wrote in our article, strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

The present article arose from our memory of a particularly intemperate anti-Bayesian statement that appeared in Feller’s beautiful and classic book on probability theory. We felt that it was worth exploring the very extremeness of Feller’s words, along with similar anti-Bayesian remarks by others, to better understand the background underlying controversies that still exist regarding the foundations of statistics. . . .

Here’s the key bit:

The second reason we suspect for Feller’s rabidly anti-Bayesian stance is the postwar success of classical Neyman-Pearson ideas. Many leading mathematicians and statisticians had worked on military problems during World War II, using available statistical tools to solve real problems in real time. Serious applied work motivates the development of new methods and also builds a sense of confidence in the existing methods that have led to such success. After some formalization and mathematical development of the immediate postwar period, it was natural to feel that, with a bit more research, the hypothesis testing framework could be adapted to solve any statistical problem. In contrast, Thornton Fry could express his skepticism about Bayesian methods but could not so easily dismiss the entire enterprise, given that there was no comprehensive existing alternative.

If 1950 was the anti-Bayesian moment, it was due to the successes of the Neyman–Pearson–Wald approach, which was still young and growing, with its limitations not yet understood. In the context of this comparison, there was no need for researchers in the mainstream of statistics to become aware of the scattered successes of Bayesian inference. . . .

As Deborah Mayo notes, the anti-Bayesian moment, if it ever existed, has long passed. Influential non-Bayesian statisticians such as Cox and Efron are hardly anti-Bayesian, instead demonstrating both by their words and their practice a willingness to use full probability models as well as frequency evaluations in their methods, and purely Bayesian approaches have achieved footholds in fields as diverse as linguistics, marketing, political science, and toxicology.

If there ever was a “pure Bayesian moment,” that too has passed with the advent of “big data” that for computational reasons can only be analyzed using approximate methods. We have gone from an era in which prior distributions cannot be trusted to an era in which full probability models serve in many cases as motivations for the development of data-analysis algorithms.

65 thoughts on “The Anti-Bayesian Moment and Its Passing”

  1. > strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

    Many non-Bayesians are sceptical about both priors and likelihoods. For those people, I guess, two wrongs don’t make a right.

    • Ojm:

      Neither the prior nor the likelihood is a “wrong,” and Bayesian inference is not “a right.” Assumptions are assumptions; they’re not bad things, we just should be aware of them. In my work, I strain at both the gnat and the camel, and we spend a lot of time thinking about model checking.

      • Where does “assumption-free” inference fit into this story? That is, interpreting models as nothing more than operations on the data, and evaluating the properties of these operations which hold under (essentially) no assumptions about the true DGP (data-generating process)?

        • Ram:

          The three key problems of statistics are: (1) generalizing from sample to population, (2) generalizing from treatment group to control group, and (3) generalizing from measured data to underlying constructs of interest.

          Except in some very special cases, assumption-free inference won’t do any of these things. Assumptions are absolutely required.

        • To put it another way: Operations on the data are what we do, but the reason we do these operations is to learn about non-data entities: people who are not in the sample, latent variables, future data, etc. Except in rare settings like perfectly clean random sampling, assumptions are needed to bridge between data and non-data.

        • Fair enough, but OLS (for example) in large samples recovers the mean partial derivatives of the conditional expectation function, or at any rate its best smooth approximation, under *very* weak assumptions about the sampling procedure, and no assumptions about error distributions, variance homogeneity, linearity, or anything else. That seems like a remarkable success at problem (1), assuming those mean partial derivatives are what we’re after.

        • Ram:

          I don’t really want “mean partial derivatives.” I want to estimate the effects of a drug, or the rate of Republican voting among white women in Michigan, or the radon level in a house, or the level of arsenic in water from a certain well, or the days that a plant will be blooming, or the probability of a death sentence being reversed . . . things like that.

        • One type of question about the effect of a drug on, say, blood pressure is: how should our prediction of blood pressure change when we observe a small increase in the dose of the drug taken, and no change in any of the confounders of the drug-blood pressure relationship? That question can (sometimes) be usefully cast as a question about the mean partial derivative of expected blood pressure (conditional on the confounders) with respect to dose.

        • I agree almost everywhere. Still, I’ll strain for a moment on this gnat:
          “Perfectly clean random sampling” is an assumption, too – one often made and rarely satisfied (if ever outside of simulations, and even those rely on deterministic approximations).
          There is no such thing as assumption-free inference, at least not of the deductive sort (it’s as deductively impossible as trisecting an angle with only a straightedge and compass).

        • Nah, camels are extremely well adapted to their environment: if you ask “god” for an animal to help you travel across the desert you get a camel; if you ask a committee you get a Humvee powered by a horse in an enormous hamster wheel.

  2. “The second reason we suspect for Feller’s rabidly anti-Bayesian stance is the postwar success of classical Neyman-Pearson ideas. Many leading mathematicians and statisticians had worked on military problems during World War II, using available statistical tools to solve real problems in real time.”

    The irony is that the power of Bayesian methods remained a closely-kept secret for decades after WW2 precisely because it was so successful – specifically in breaking Axis ciphers, and later in the decryptions involved in Project Venona. I can’t help feeling Jack Good must have been frustrated having to listen to frequentists bang on about the inadequacies of Bayes, while being unable to say anything about what he and Turing had achieved. Turing’s manual on cryptographic uses of Bayes was only declassified in 2012.

  3. Ram wrote

    One type of question about the effect of a drug on, say, blood pressure is: how should our prediction of blood pressure change when we observe a small increase in the dose of the drug taken, and no change in any of the confounders of the drug-blood pressure relationship? That question can (sometimes) be usefully cast as a question about the mean partial derivative of expected blood pressure (conditional on the confounders) with respect to dose.

    Can you give a concrete example here (actually calculate something based on fake data and draw a conclusion about blood pressure)? I suspect you are making some quite strong (and dubious) assumptions such as “all important confounds, and no spurious ones, are included in the model”.

    • OLS essentially always estimates the mean partial derivatives of the conditional expectation function, given the regressors. Absent randomization (of either the artificial or natural variety), estimating a causal parameter requires conditioning on all confounders. You’re right, therefore, that giving OLS estimates a causal interpretation requires the very strong assumption that the regressors include all confounders. My point was simply that the parameter which OLS consistently estimates is *sometimes* the parameter of interest, and OLS consistently estimates that parameter under remarkably weak assumptions.
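
      A minimal fake-data sketch of what I mean (all numbers here are hypothetical, chosen for illustration rather than taken from any real dose-response study): simulate blood pressure as a nonlinear function of dose and one confounder (age), fit plain OLS, and compare the dose coefficient to the average partial derivative of the true conditional expectation function.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 100_000

        # Hypothetical data-generating process, for illustration only:
        # age confounds the dose -> blood-pressure relationship, and the
        # true conditional expectation is mildly nonlinear in dose.
        age = rng.uniform(40, 80, n)
        dose = 0.5 * age + rng.normal(0, 10, n)
        bp = 120 + 0.3 * age - 0.08 * dose - 0.0004 * dose**2 + rng.normal(0, 5, n)

        # Average partial derivative of E[bp | dose, age] with respect to dose,
        # averaged over the observed doses.
        mean_pd = np.mean(-0.08 - 2 * 0.0004 * dose)

        # OLS of bp on dose and age (linear in both, hence misspecified).
        X = np.column_stack([np.ones(n), dose, age])
        beta = np.linalg.lstsq(X, bp, rcond=None)[0]

        print("average dE[bp]/d(dose):", round(mean_pd, 4))
        print("OLS dose coefficient:  ", round(beta[1], 4))

      The two numbers come out close (around -0.10 with these made-up coefficients). That is the sense in which the OLS coefficient targets an average derivative; whether that derivative deserves a causal reading still hangs on age being the only confounder.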

      • Thanks. I’d just add that randomization doesn’t guarantee anything about future results, or those under different conditions. Ie, in the blood pressure case you are dealing with an “analytic” problem while (from your comment) you seem to be thinking of an “enumerative” problem: https://en.wikipedia.org/wiki/Analytic_and_enumerative_statistical_studies

        For example, randomize people to receive either some opioid or placebo in an in-patient (hospital) setting and assess addiction. Then, do the same study when the patient sits at home self-administering as they watch tv all day…

        • Right. I’m assuming the parameter we’re trying to estimate is a functional of the same population distribution which generated the data. This is a very strong assumption. “Assumption-free” is perhaps too simplistic a descriptor for what I’m talking about. The larger point is that, in some cases, we can go a long way towards answering our question using procedures that work under far fewer assumptions than those of a fully specified probability model. But we’re always assuming that the thing being estimated is the thing we care about, and that’s often the strongest assumption of all (and one we can’t do away with). The real question is stronger v. weaker assumptions, not some assumptions v. no assumptions.

        • > The real question is stronger v. weaker assumptions, not some assumptions v. no assumptions

          +1

          I’d add that assumptions can also be ‘qualitatively’ different and not just ‘quantitatively’ different. Eg assuming and then inverting a generative model (or ‘likelihood’, according to the usual abuse of terminology) vs eg defining a functional on a large class of possible models (as you allude to).

      • Isn’t the existence of a parameter also an assumption? We want to know how effective a drug is, and so we assume that there exists a mean amount by which blood pressure changes, given a change in dose.

        I’m not saying this is unreasonable, but it feels like an assumption to me.

        • I think we can say that OLS estimates the mean partial derivatives of the best approximation of the conditional expectation function for which mean partial derivatives exist, where “best” is in the quadratic-loss sense. Whether that best approximation is a good approximation is a fair question. But this assumption is surely far weaker than any specific distributional or functional form assumptions about the DGP.

        • Why is the assumption that “‘best’ is in the quadratic-loss sense” surely far weaker than any specific distributional or functional form assumptions about the DGP?

          It seems to me that “best approximation” might be relative to a distributional or other assumption. For example, if a question is really about a median, then using an L-1 loss function might be more realistic — in particular, to take into account that the median may not be unique.

        • Ram:

          I think this whole DGP thing is missing the point. Or, to say it another way, framing the problem in terms of a “DGP” assumes some sort of stationarity, and it assumes that the sample is representative of the population, and it assumes that your measurements directly address your substantive questions. In real life, we spend a lot of time on modeling these connections. You’ve avoided assumptions by defining away some of the most important parts of the statistical problem.

        • P.S. Don’t get me wrong—I use DGP’s (data generating processes) all the time in my statistical modeling, I think they’re super useful! I just wouldn’t call them assumption-free or even consider the idea of a data generating process as representing minimal assumptions.

          As far as I’m concerned, once you open the DGP jar, you’re all-in.

        • Ojm:

          What I’m saying is that those theoretical properties that are defined based on the existence of a “DGP” are implicitly making huge assumptions about stationarity of the process, representativeness of the sample, and relevance of the measurements. The assumptions here aren’t the functional form of the DGP, they’re assumptions about the connections between the data and the underlying goals of the inference.

        • Sure, but you could in principle make your functional a…function of…contextual variables etc without assuming a map context -> data.

          Again, one big difference is whether you do the two-step of parameter to data then data to parameter, or aim for data (and/or context) to parameter (or whatever) directly.

        • Ojm:

          I don’t think it’s a bad idea to think about the “mean partial derivatives of the conditional expectation function” of the data (see for example this paper from 2007), but I don’t think it makes much sense to think of this as model-free or model-minimal, given the huge assumptions required to connect these mean partial derivatives to any applied questions of interest.

        • To quote Laurie Davies, how about ‘the amount of copper in a sample of water’?

          Do we need a model of how each bit of copper got there? To what detail?

        • Ojm:

          I assume that if you want to measure the amount of copper in a sample of water, there are low-cost measuring devices that will do the job for you, with little or no statistical modeling required.

          Unfortunately (or fortunately), I don’t work on such problems. Or, I should say, when I have problems involving such direct measurements, I do the calculations and don’t think twice. I spend my effort on problems such as estimating public opinion from surveys which are nothing close to random samples, or estimating the metabolism of a drug based on indirect data, or all sorts of other problems involving far-from-trivial steps of generalizing from sample to population, from treatment to control group, or from measurements to underlying constructs of interest. If I spent my time measuring copper concentrations, my view of statistics would surely be much different. But notice the title of this blog!

        • Like I said, you can make eg your functional or whatever depend on contextual factors. You might call this a ‘model’ but one implicit assumption I’m questioning is whether you need an explicit model of the form parameter to data in order to go from data to parameter.

          An analogy: Thermodynamics is a widely applicable and important subject. Its validity is essentially independent of any specific statistical mechanical model, despite it also being useful and interesting to use (typically toy) statistical mechanical models to look at some parts in more detail.

        • Here’s a very relevant quote from one of the best books on thermodynamics (and a bit on statistical mechanics):

          > models, endemic to statistical mechanics, should be eschewed whenever the general methods of macroscopic thermodynamics are sufficient

          From Callen’s classic ‘Thermodynamics and an introduction to thermostatistics’.

        • Andrew:

          I think another example of what ojm is referring to is that of deep learning.

          On the one hand, you could argue that neural networks have a likelihood function, in some cases with a very strange error distribution induced by the various loss functions used.

          But on the other hand, I think most users don’t see it as that but rather just as a method for approximating an arbitrary function. And I think this kind of thinking really guides the research, although sometimes in somewhat strange directions; for example, there’s lots of discussion about whether you’ll get a better answer if you use SGD vs ADAM optimization algorithms, which is really far from the theories of getting estimates via MLE or integration.

          But I think that’s certainly some evidence that you can do a lot without being focused specifically on the data generating process.

        • Corey:

          I think that blog post illustrates my point; Ferenc writes that Bayesians originally dismissed deep learning as over-parameterized MLE, meaning it was doomed to fail, and that when the test of time proved otherwise, a Bayesian interpretation of deep learning was found.

          While there may be many ways to interpret deep learning, including a Bayesian one, I think ojm’s point is that plenty of progress can be made without intense focus on the DGP, even if there does exist a DGP interpretation.

        • a reader, the question of methods that get you to a deep but narrow peak vs. methods that will only get you to maxima where the surrounding peak is flat and wide is *exactly* MLE vs. integration.
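
          A one-dimensional caricature of that point (toy numbers, nothing realistic): a posterior that is a tall, very narrow spike next to a low, wide bump. An optimizer lands on the spike; integration puts nearly all the mass, and hence the posterior mean, under the wide bump.

            import numpy as np

            x = np.linspace(-2, 8, 200001)
            dx = x[1] - x[0]

            def normal_pdf(x, mu, sd):
                z = (x - mu) / sd
                return np.exp(-0.5 * z * z) / (sd * np.sqrt(2 * np.pi))

            # Toy posterior: 20% of the mass in a very narrow spike at 0,
            # 80% in a wide bump centered at 5.
            dens = 0.2 * normal_pdf(x, 0.0, 0.01) + 0.8 * normal_pdf(x, 5.0, 1.0)

            mode = x[np.argmax(dens)]        # what an optimizer reports
            mean = np.sum(x * dens) * dx     # what integration reports
            print("mode:", round(mode, 2))   # 0.0: the narrow spike wins the argmax
            print("mean:", round(mean, 2))   # ~4.0: the wide bump holds the mass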

        • Corey:

          I would argue that those two points are related, but not equivalent.

          What I’m referring to is papers that say “SGD tends to arrive at wider points than ADAM.” Note that these are both optimization methods, and moreover, the papers offer very little *why*, nor do they address the issue that SGD tends to give a better solution *for a very specific problem*, and who knows if that trend will generalize to other problems.

        • I keep wanting to displace the model-assumptions vocabulary with making a representation of something _real_, so that we have something (the representation) we can manipulate (diagrammatically or symbolically) to hopefully learn about the reality to which we have no direct access.

          There, how well the representation captures critical features of that something we thought was real, and how well we can work with it, determine how likely we are to be frustrated by reality when we act on it. Period.

          Thought of that way, to make “no assumptions” would mean to avoid representing, which could only mean acting without thinking … maybe assuming thinking itself is a complete waste of time?

          Much of the debate here centers on focusing on certain aspects of the representation that can be made more general (less specific DGMs) without appropriately considering the importance of that in the overall scientific profitability of the representation as a whole, including the next revised representation it inevitably will lead to.

          (Translation – I am agreeing with Andrew.)

        • Again, see the quote on thermodynamics vs statistical mechanics above.

          A key point that I think the blasé ‘everyone makes assumptions’ or ‘everything is a model’ comments miss is that these can be _qualitatively_ different types of assumptions.

          Think Kuhnian paradigms or something – the ‘no assumption’ folk are making assumptions, sure, but they are often trying to make quite different style assumptions. Not everything is a DGP!

        • Here is a common physical assumption:

          Energy is conserved.

          This isn’t a DGP. But it can get you surprisingly far. In fact sometimes it’s completely impractical to solve a problem by following a detailed model but easy to solve when you ignore the process and remember that the final result satisfies a constraint.

          Feynman I think has a discussion of this when he was learning variational mechanics – he said he initially hated energy methods and wanted to follow all the forces in the problem – think DGP – but then realised this was simply not the way to approach or think about these problems.

        • From http://physicstoday.scitation.org/doi/abs/10.1063/1.2711636?journalCode=pto

          > One amazing (in retrospect) quirk displayed by Dick [Feynman] in Stratton’s course was his maddening refusal to concede that Joseph-Louis Lagrange might have something useful to say about physics. The rest of us were appropriately impressed with the compactness, elegance and utility of Lagrange’s formulation of mechanics, but Dick stubbornly insisted that real physics lay in identifying all the forces and properly resolving them into components. Fortunately that madness appears to have lasted only a few years!

          I would argue that an obsession with DGPs is somewhat similar to the above ‘mad’ insistence on resolving all mechanics problems into forces and components.

        • ojm,

          Thermodynamics makes fewer assumptions than Statistical Mechanics, but the price paid is that it only provides answers to a handful of questions. Specifically, it can only answer questions insensitive to all the physics that’s being left out with those weak assumptions. For example, almost any question you ask about an ideal gas in a non-equilibrium situation is unanswerable knowing just the total energy.

          So a couple of points. First, if the question you need answered is one of the vast majority which can’t be answered by the weak assumptions of thermodynamics, then kinetic theory + statistical mechanics, et al., is suddenly handy again.

          Second, you think what’s being done in thermodynamics represents something qualitatively different than traditional statistical modeling, but it isn’t!

          Consider when someone says “assume each possible sequence of 100 coin flips is equally likely” and concludes the fraction of heads is nearly 50%. This masquerades as a strong “stat mech style” assumption about the “data generation process,” but it’s actually a “thermo style” instance of only asking questions whose answer is highly insensitive to all the unknown details being left out.

          Or to put it another way: we expect to see the ideal gas laws in practice because almost anything that a diffuse gas could be doing leads to those laws. Similarly, we expect to see nearly 50% heads in coin flips because almost anything the coin could be doing leads to that outcome. Far from being two different styles of reasoning or inference, they’re identical.

          They’re both examples where we can make nearly “assumption free” inferences because we happen to be asking a question whose answer isn’t sensitive to the physics details we’re ignoring.
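
          To put a number on the coin-flip version: under nothing more than “all 2^100 sequences are equally likely,” the overwhelming majority of sequences already have close to 50% heads, so the conclusion doesn’t depend on any of the physical details being left out.

            from math import comb

            n = 100
            # Fraction of the 2**n equally likely sequences whose proportion
            # of heads lands within 0.1 of one half.
            near_half = sum(comb(n, k) for k in range(40, 61)) / 2**n
            print(round(near_half, 3))   # roughly 0.965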

        • I think what you’re getting at in part is that many times people use toy models to represent a process, knowing that the answer they’re interested in is actually process independent. Point being you can just use a convenient process if the process doesn’t matter! This style of reasoning is used in thermo all the time. Also in dynamical systems eg hyperbolic vs non-hyperbolic equilibria.

          And sure, some Bayesian approaches are similar to this – eg Jaynes and MaxEnt. While I have other quibbles with MaxEnt stuff, I’m not so much pushing back against these folk here.

          I’m pushing back at the ‘literalist modelling’ types who think everything, and every assumption, is tied to a DGP. I don’t believe these folk are necessarily making the points you are making, which I largely agree with. I think we have other disagreements but not so much these.

          BTW – I think you are one who has claimed that ‘frequentists’ have some kind of literal belief in DGPs. This may be true for some, but generally speaking I see frequentists as the ones trying to make model/DGP independent inferences.

        • Ojm:

          I don’t think that “everything, and every assumption, is tied to a DGP.” I think that in some very simple problems, or problems with a lot of symmetry, you can do inferences without the data generating process—but in that case you have a lot of other assumptions going on. In the case of a gas at room temperature, these other assumptions might be just fine, but this doesn’t really describe the sorts of problems that I work on. When my colleagues and I were reconstructing climate from tree rings, we needed models! When my colleagues and I do political science using surveys with 10% response rates, we need models! When my colleagues and I want to make inferences for pharmacology, we need models! We need models in all sorts of situations where our data do not happen to be random samples of our population of interest, or where our measurements are indirect, and where we don’t have the convenient symmetries that allow one to apply limit theorems. Such symmetries work for coin flips and ideal gases and slot machines in Las Vegas, not so much in the problems I work on.

          P.S. I love these long threads!

        • I’d say I work on problems in areas that are just as, if not more, ‘messy’ and complicated. You can find people writing down arbitrarily complicated models, arguing that ever more detailed modelling is necessary etc etc.

          Still, I stand by my points.

          Which is to say – my views are certainly not shaped by me working in ‘tidy’ areas, as seems to be the implication. I look at mainly biological and energy engineering applications where large and complex models are very common. In fact, working with these complicated examples has had a large impact on my views and skepticism towards ‘literalist’ modelling.

        • > skepticism towards ‘literalist’ modelling.
          Some of us (including Nancy Reid and David Cox) do not take DGMs literally but rather as idealizations http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

          The data came about somehow, and I believe that somehow needs to be reflected in the representation you are working in/with (e.g. consistent with symmetry).

          If being vague about how data came about offers an advantage, that needs to be argued for not just assumed – right?

          Now advantages based on measures of performance beg the question of that measure being good for what?

        • > If being vague about how data came about offers an advantage, that needs to be argued for not just assumed – right?

          I think the usefulness of the first and second laws of thermodynamics speaks for itself, no?

          See also Einstein’s distinction between constructive and principle theories:

          > Einstein’s most original contribution to twentieth-century philosophy of science lies elsewhere, in his distinction between what he termed “principle theories” and “constructive theories.”

          > This idea first found its way into print in a brief 1919 article in the Times of London (Einstein 1919). A constructive theory, as the name implies, provides a constructive model for the phenomena of interest. An example would be kinetic theory. A principle theory consists of a set of individually well-confirmed, high-level empirical generalizations. Examples include the first and second laws of thermodynamics. Ultimate understanding requires a constructive theory, but often, says Einstein, progress in theory is impeded by premature attempts at developing constructive theories in the absence of sufficient constraints by means of which to narrow the range of possibilities. It is the function of principle theories to provide such constraint, and progress is often best achieved by focusing first on the establishment of such principles. According to Einstein, that is how he achieved his breakthrough with the theory of relativity, which, he says, is a principle theory, its two principles being the relativity principle and the light principle.

          > While the principle theories-constructive theories distinction first made its way into print in 1919, there is considerable evidence that it played an explicit role in Einstein’s thinking much earlier. Nor was it only the relativity and light principles that served Einstein as constraints in his theorizing. Thus, he explicitly mentions also the Boltzmann principle…

          From https://plato.stanford.edu/entries/einstein-philscience/#PriTheTheDis

        • More concretely, if needed – the point is about the necessity of assuming a DGP or not. The so-called ‘assumption free’ methods (I agree these are poorly named) are more analogous to principle or thermodynamics-style theories than to constructive or DGP-style theories.

          I’m not saying generative models are not useful. But there exist complementary or even conflicting approaches that ask how far they can go without making strong assumptions about DGPs. I think these are interesting and reflect a (perhaps subtly) different approach to statistics. I pointed to analogous examples in the physical sciences because these are interesting and concrete to me.

        • A statistics paper describing a similar philosophy is Wasserman’s ‘Low assumptions, high dimensions’.

          Now whether you agree with this approach or not, I think it represents a counter philosophy that is not so compatible with Bayesian inference and explains why some are not Bayesian. Furthermore it connects back to my original point – such work is equally concerned about swallowing camels and gnats.

          So the whole ‘an anti Bayesian swallows the likelihood but chokes on the prior’ does not apply to such approaches and I find it at least an interesting intellectual exercise to figure out what they’re doing and why. But sure, you could just say ‘it’s not what I do’, I guess.

        • Ojm:

          When I wrote, “I don’t see the laws of thermodynamics helping me estimate public opinion from surveys—but, hey, who knows??”, this was not a joke at all! I was serious. You talk of counter philosophies and so forth, and I don’t see how such philosophies will help me in my applied research in public opinion, pharmacology, etc.

          I’m not saying that Bayesian methods are the only way to go—I’ve seen various non-Bayesian solutions to problems in public opinion, pharmacology, etc., and I have no doubt that non-Bayesian methods could be devised to do just about everything that I do—but all the methods I’ve seen that could work for such problems use assumptions. And the biggest assumptions are typically those that relate current data to future data, or predictions, or underlying parameters: these are the assumptions that go into the data model. Or, in a model-free approach, these are the assumptions that go into the predictive procedure.

          Thermodynamics uses assumptions as well. The first and second laws by themselves won’t tell you what the pressure of a substance is in equilibrium. Just like the sum and product rules of probability theory won’t tell you the probability of winning a presidential election. You have to supplement them with assumptions about probability distributions.

          In thermodynamics too, you need to assume a functional form for the entropy S(U,V…) (to use Callen’s notation). This functional form can be gotten in several ways: it can simply be guessed at, derived from a deeper model, or determined empirically.

          Not really so different from statistics.

        • ojm,

          A question for your thermo students:

            “Suppose a human brain has a given total energy, volume, … (or any other macro parameter used in class so far, such as total angular momentum or total magnetic moment). Then what language does the person speak?”

          A followup question is,

            “If the Ergodic Hypothesis is true, then given enough time, will the person speak every language consistent with that energy, volume, …?”

  4. Andrew, what are you referring to when you talk about “swallowing the camel that is likelihood?” I know what likelihood is and I hear it can’t handle some distributions too well, but is that what you’re talking about here?

    • Before doing a likelihood based inference, every statistician is required to have eaten an entire camel (or a tofu-based camel substitute). [Sorry]

      He’s saying that the assumptions that are used to choose the form of a likelihood have orders of magnitude more influence over the inference than the assumptions encoded in the prior. This statement eventually becomes less true as the model gets stranger and stranger, but for “classical” models it’s correct.
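
      A toy illustration of that ordering (fake numbers, and a brute-force grid just to keep it self-contained): estimate a location parameter from a small sample containing one gross outlier, swapping the likelihood (normal vs. heavy-tailed t) and swapping the prior scale, and see which swap moves the posterior mean more.

        import numpy as np
        from scipy import stats

        # Fake data: a small sample with one gross outlier.
        y = np.array([1.1, 0.8, 1.4, 0.9, 1.2, 1.0, 0.7, 1.3, 9.0])
        mu = np.linspace(-5, 10, 3001)   # grid for the location parameter

        def post_mean(loglik, prior_sd):
            # Normal(0, prior_sd) prior on the location, evaluated on the grid.
            logpost = loglik + stats.norm.logpdf(mu, 0, prior_sd)
            w = np.exp(logpost - logpost.max())
            return np.sum(mu * w) / np.sum(w)

        normal_ll = np.array([stats.norm.logpdf(y, m, 1).sum() for m in mu])
        t_ll = np.array([stats.t.logpdf(y - m, df=3).sum() for m in mu])

        for name, ll in [("normal likelihood", normal_ll), ("t(3) likelihood", t_ll)]:
            for prior_sd in (1.0, 10.0):
                print(name, "| prior sd", prior_sd, "->", round(post_mean(ll, prior_sd), 2))

      With these made-up numbers the likelihood swap shifts the posterior mean by most of a unit (the normal likelihood chases the outlier, the t likelihood largely ignores it), while widening the prior by a factor of ten shifts it by a couple of tenths at most (so “orders of magnitude” overstates it for this particular toy, but the ordering is clear).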

      • Ah, right, because you have to choose a distribution to come up with the likelihood in the first place (unless I’m misunderstanding). Yeah, that makes sense.

        • Think of the “likelihood” as just your mathematical model for what is going on. If you flip a coin n times with probability p of landing heads each flip, you expect to see p*n heads on average. So if you have data on the number of heads from a bunch of coin-flip sequences (assuming equal length here), you could use the binomial distribution as your likelihood. You can estimate the two parameters of your model (p and n) after putting priors on their values.

          You can derive a different likelihood by saying each flip is not necessarily independent of the last, etc. Eg, compare:

          https://en.wikipedia.org/wiki/Binomial_distribution
          https://en.wikipedia.org/wiki/Poisson_binomial_distribution
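
          Here is a rough grid-approximation sketch of that coin model (made-up counts and flat priors, just to keep it short; a real analysis would want a more thoughtful prior on n):

            import numpy as np
            from scipy import stats

            # Made-up data: heads counted in 8 sequences, all of the same
            # unknown length n, with unknown per-flip probability p of heads.
            heads = np.array([7, 9, 6, 8, 10, 7, 9, 8])

            n_grid = np.arange(heads.max(), 101)   # n is at least the largest count
            p_grid = np.linspace(0.01, 0.99, 99)

            # Binomial log-likelihood over the (n, p) grid; with flat priors the
            # unnormalized posterior is just the likelihood.
            logpost = np.array([[stats.binom.logpmf(heads, n, p).sum() for p in p_grid]
                                for n in n_grid])
            post = np.exp(logpost - logpost.max())
            post /= post.sum()

            i, j = np.unravel_index(post.argmax(), post.shape)
            print(f"joint posterior mode: n = {n_grid[i]}, p = {p_grid[j]:.2f}")
            # Note: n and p are only weakly identified separately; the posterior
            # spreads along a ridge where n*p stays near the average head count.

          The Poisson-binomial link above would correspond to a different likelihood for the same kind of data, with each flip allowed its own probability of heads.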
