Progress! (on the understanding of the role of randomization in Bayesian inference)

Leading theoretical statistician Larry Wasserman in 2008:

Some of the greatest contributions of statistics to science involve adding additional randomness and leveraging that randomness. Examples are randomized experiments, permutation tests, cross-validation and data-splitting. These are unabashedly frequentist ideas and, while one can strain to fit them into a Bayesian framework, they don’t really have a place in Bayesian inference. The fact that Bayesian methods do not naturally accommodate such a powerful set of statistical ideas seems like a serious deficiency.

To which I responded in the second-to-last paragraph of page 8 here.

Larry Wasserman in 2013:

Some people say that there is no role for randomization in Bayesian inference. In other words, the randomization mechanism plays no role in Bayes’ theorem. But this is not really true. Without randomization, we can indeed derive a posterior for theta but it is highly sensitive to the prior. This is just a restatement of the non-identifiability of theta. With randomization, the posterior is much less sensitive to the prior. And I think most practical Bayesians would consider it valuable to increase the robustness of the posterior.

Exactly! I completely agree with 2013 Larry (and it’s what we say in our Bayesian book, following the ideas of Rubin and others).

I’m happy to see this development. Much of my recent work has involved Bayesian analysis of sample surveys. And, indeed, our models typically assume simple random sampling within poststratification cells. Such models are never correct (even if the survey is conducted by a probability sampling design, nonresponse will not be random) but it’s a useful starting point that we try to approximate in many of our designs. In other settings, we simply don’t have random sampling or random assignment, and then, indeed, our inferences can be more sensitive to our assumptions. The only place I’d disagree with Larry is when he writes “sensitive to the prior,” I’d say, “sensitive to the model,” because the data model comes into play too, not just the prior distribution (that is, the model for the parameters).

P.S. Beyond appreciating Larry’s recognition of this particular issue, I find his larger point interesting, that we add noise in different ways to achieve robustness or computational efficiency.

56 thoughts on “Progress! (on the understanding of the role of randomization in Bayesian inference)”

  1. > “sensitive to the prior,” I’d say, “sensitive to the model,”

    But here, say when you only find out about the outcomes for the groups being compared, wouldn't a multidimensional informative prior that adequately reflects the imbalances between the groups, but has no way to be updated, be all there is (i.e., no data model at all for the confounding)?

    I remember being very dismayed reading papers (from the 1980s and 90s) by well-known Bayesians arguing that randomization did not play any part.

  2. I’m very curious about your comments on “sensitivity to the prior” vs “sensitivity to the model.” I take it that you see the prior as the “model for the parameters”, and the model as the prior plus the “data model.” Can you please elaborate on the difference between the model for the parameter and the data model? And in turn, between the model and the prior? I want a better understanding of the concepts at play here. Thanks!

    • This plays into my post on “Where do Likelihoods Come From” from several years back: http://models.street-artists.org/2011/12/13/mommy-where-do-likelihoods-come-from/

      I still have basically the same issue, and in fact these days I'm fitting a data timeseries to an ODE and I have to decide what the likelihood of the observed error is. Independent random errors within a single timeseries are clearly wrong: when things are close together in time they tend to have similar errors. Gaussian process errors could be more right, but there's no reason to think the errors are really Gaussian; in fact some whole timeseries don't fit well because something occurred which isn't modeled by my ODE. Robust errors, like a t distribution, can help by emphasizing the fit of the timeseries that fit well and "ignoring" the cases where "something else happened." But independent t errors aren't right, and a t-process, if such a thing exists, is out of the question from a practical perspective. I've settled on what I'm calling "tempered independent t errors." I have maybe 80 observations, but more observations close together don't really improve the information about the fit, since the errors are necessarily similar when the time values are similar. So I'm taking the sum of the log t densities, dividing by the number of observations, and multiplying by a fixed number per cycle times the number of cycles. Essentially this says that there are effectively 3 or 4 independent errors per cycle regardless of how many time points I have in a cycle. It has no probabilistic justification really, but it leads to answers that make sense, in a similar way that independent random errors or AR(1) errors can often lead to good results for timeseries even though we know they are not really independent or AR(1).

      • All that is a long-winded way of saying that sometimes, in sort of "standard" simple analyses, the likelihood function is pretty obvious, but it's extremely easy to think up situations where the likelihood function that will give you good answers in reasonable time and computational complexity is not at all obvious.
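
        For concreteness, here is a rough sketch of the "tempered independent t errors" scheme described above, as I read it (this is an illustration, not the commenter's actual code; the function name, the scale, the degrees of freedom, and the 3 effective errors per cycle are placeholders):

        ```r
        # Average the per-observation log t densities, then rescale by an assumed
        # effective number of independent errors (eff_per_cycle * n_cycles).
        tempered_t_loglik <- function(resid, scale, df = 4, n_cycles, eff_per_cycle = 3) {
          avg_ll <- mean(dt(resid / scale, df = df, log = TRUE) - log(scale))
          (eff_per_cycle * n_cycles) * avg_ll
        }

        # Usage: resid = observed series minus the ODE solution at the observation times,
        # e.g. tempered_t_loglik(y_obs - y_ode, scale = 0.1, n_cycles = 5)
        ```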

      • Recently I’ve been getting interested in fitting ODE models via bayesian inference – can you point me to some seminal reviews/references on the topic?

      • Regarding “t-processes”, you can always pass a Gaussian process through the Gaussian quantile function to obtain something I’d be naturally inclined to call a Gaussian copula process. From there, you can make the marginals whatever you want.

        …A quick google for “Gaussian copula process” turns up this.

        • This is a quite interesting idea. If you're looking to sample such a thing, that would do it. However, I am not clear on how I could calculate the likelihood of some observed errors under this kind of model. I guess maybe I could take the observed values, pass them through the t CDF to get uniform values, pass these through the Gaussian quantile function to get Gaussian values, and then calculate the multivariate Gaussian density at those values and multiply by some kind of complicated Jacobian and … ugh… or I could sample a lot from this Gaussian copula process and get an empirical likelihood by using small bins around the measured points, but that's hugely computationally intensive.

          Anyway, next time I'm thinking about modeling an unobserved parameter as a Gaussian process I'll consider the Gaussian copula process; it seems quite useful.

        • Let me see if I understand this idea: I have some errors E_i which represent the difference between my ODE and my observed data, and I say that these errors are distributed according to a transformation of a Gaussian process X_i. This Gaussian process has standard normal marginals and covariance from some covariance function. The particular transformation I use is to take each sample value and create variables that have uniform marginals, U_i = pnorm(X_i), and then say E_i = qt(U_i), which says that the marginal distribution of the errors is t; this helps deal with the cases where the fit fails due to unmodeled effects.

          To get the likelihood for observed E_i I need to interpret things in the other direction. The cumulative distribution of my E_i values is pmvnorm({qnorm(pt(E_i))}), and I take the derivative of this with respect to all the E_i values, d/(dE_1 dE_2 .. dE_n), which after all the chain rule stuff will turn out to look like dmvnorm({qnorm(pt(E_i))}) times prod_i dt(E_i) / dnorm(qnorm(pt(E_i)))?? and then tell R to fool around with everything to give me the logarithms of all this mess in a stable way, and voila??

          I will look into this. But for the moment, am I being obtuse and misreading what I’m supposed to be doing?

        • No, you’ve got it. But start with just two E_i, not an arbitrary number. Gaussian processes are handy in that if you understand two points, you basically understand the whole thing. (OK, that’s an oversimplification, but getting the bivariate distribution down for two arbitrary points is a good first step.)
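
          For anyone who wants to try this, here is a minimal sketch of that log likelihood in R (the function name is made up; it assumes the mvtnorm package, a correlation matrix Sigma built from some covariance function at the observation times, and t marginals with df degrees of freedom):

          ```r
          library(mvtnorm)

          # Log density of residuals E under a Gaussian copula with t marginals:
          # with z_i = qnorm(pt(E_i)), density = dmvnorm(z; Sigma) * prod_i dt(E_i) / dnorm(z_i)
          copula_t_loglik <- function(E, Sigma, df = 4) {
            z <- qnorm(pt(E, df = df))                # map residuals to the Gaussian scale
            dmvnorm(z, sigma = Sigma, log = TRUE) +   # joint Gaussian copula term
              sum(dt(E, df = df, log = TRUE)) -       # t marginals
              sum(dnorm(z, log = TRUE))               # divide out the standard normal marginals
          }
          ```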

  3. Thank you for quoting the passage from Wasserman (2008). But I don't see that you really answer Wasserman's charge that: “Some of the greatest contributions of statistics to science …are unabashedly frequentist ideas and, while one can strain to fit them into a Bayesian framework, they don’t really have a place in Bayesian inference. The fact that Bayesian methods do not naturally accommodate such a powerful set of statistical ideas seems like a serious deficiency.” The “powerful set of statistical ideas” gets its rationale from affording error probability computations. You say randomization helps to warrant the model assumptions more generally. But it isn't clear how something that makes the posterior less sensitive to the prior helps to make the prior more correct as a representation of prior belief or prior strength of evidence, or the like, in the null hypothesis (e.g., of no causal effect).

    • In my opinion, the randomization makes the likelihood more correct; by making the likelihood more correct, the data become more informative, and hence the results don't depend so strongly on the prior, because the data speak more strongly than they would have if the likelihood were basically bogus.

      The big problem with bayesian models isn't the prior, for the most part. In many problems, it's usually easy to come up with some kind of moderately informative prior for parameters that people won't choke on if they aren't vehemently anti-bayesian to begin with. What you want is a way to make your data tell you more than you knew before, i.e. to make your likelihood informative. But the choice of likelihood is not always so clear cut (see some comments above). Not only are there different probabilistic assumptions that could be made, but there are also different structural assumptions, such that several models of the process could be reasonable in many cases.

      By adding in randomness we can make the likelihood easier to specify, since at least the portion of the likelihood caused by the random number generator is known exactly. This can make our likelihood more informative, and hence result in better posterior distributions.
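
      (A toy simulation sketch of this point; the confounder, effect size, and selection mechanism below are made up for illustration. With randomized assignment the simple comparison of group means recovers the treatment effect, while with self-selection the same comparison leans entirely on what you assume about the confounding.)

      ```r
      set.seed(1)
      n <- 1e4
      ability <- rnorm(n)    # unobserved confounder
      effect  <- 0.5         # true treatment effect

      # Randomized assignment: treatment is independent of ability by construction.
      z_rand <- rbinom(n, 1, 0.5)
      y_rand <- effect * z_rand + ability + rnorm(n)

      # Self-selection: higher-ability units are more likely to take the treatment.
      z_obs <- rbinom(n, 1, plogis(2 * ability))
      y_obs <- effect * z_obs + ability + rnorm(n)

      mean(y_rand[z_rand == 1]) - mean(y_rand[z_rand == 0])  # close to 0.5
      mean(y_obs[z_obs == 1]) - mean(y_obs[z_obs == 0])      # badly biased upward
      ```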

  4. I am not sure I understand this notion of randomization giving robustness to the posterior. If we want to get rid of the prior influence, what’s the point in doing a Bayesian analysis in the first place?!

    • But I don't think you want to put an informative prior on the imbalances between the groups compared – especially since there is unlikely to be much more than a flat likelihood to update it, when this can be avoided if there is randomisation.

      Early papers and comments by Don Rubin were clear about this – e.g., Rubin, D. B. (1978), Bayesian inference for causal effects: The role of randomization. Annals of Statistics.

    • What’s the point? Being Bayesian allows us to express what’s known about the underlying parameters via (posterior) probability distributions. This isn’t available unless one is at least approximately Bayesian.

      Surely Bayesian analyses don’t *have* to have prior sensitivity in order to be useful – as you seem to suggest?

    • The point of doing a Bayesian analysis is to learn something from the data. If the posterior is the same as the prior, it means we have learned nothing from the data.

  5. Christian: really? You think the billions of dollars we spend doing randomized trials
    is wasted?

    Why even collect data at all? Just state your prior and declare victory?

    Surely the fact that randomization converts an unidentified parameter into an easily estimated identified parameter is useful even for Bayesians?

    Larry

    • Anon/Larry: I think perhaps what is missed by those who scoff at randomized trials is the point you raise about the value of rendering a parameter identifiable or estimable or the like. To some of us knowledge-seekers (it's too tiring to find an acceptable statistical label), the fundamental role of statistical methods, as with all measurement tools in science, is to enable us to discern, by indirect and clever means, something that we could not detect directly. By indirect means, we may find out what it would be like, statistically, if the treatment had no effect. This provides a standard for learning from observable data. Designing the method of data collection and modeling is central. As Fisher approximately said someplace, to bring the statistician in after the data is at hand is merely to ask him to conduct a post mortem, to say what the experiment died of. But a lot of background knowledge enters into the design of RCTs, he emphasizes, to control known effects and figure out when it may not be worth further control. And then again in linking the statistical inference to substantive claims and theories. We jump in, work the standard tool, and jump back out again.

      • Who in the world is scoffing at randomized trials? Randomization addresses a causal problem – what is being estimated – whereas the choice of bayesianism vs. frequentism addresses how it’s estimated and what constitutes a useful measure of the uncertainty.

        I also think most bayesians would agree that modeling the data collection is essential. The complexities that arise in the data collection process are often a motivation for using bayesian approaches (e.g. for missing data imputation).

  6. Larry has the same quality that Jaynes, and anyone else worthy of respect in statistics, possessed: trying as much as possible to reduce philosophical differences down to mathematical ones. That way there's at least a fighting chance that people with very strong ideas can resolve some of those differences.

    The dogmatic/philosophical/refuse-to-examine-the-math approach to probability is such an obvious dead end it’s a wonder it was ever tried, let alone that it’s still being attempted.

    • Agree it is the first step, but understanding representations in and of themselves is not enough to assess what is most purposeful for learning about our (empirical) world.

      What we want are methods for enquiring communities that facilitate them quickly and importantly getting less wrong about the world – nothing more, nothing less.

      • > try as much as possible to reduce philosophical differences down to mathematical ones

        I don't get this, at least for statistics (and maybe other philosophical differences too). Surely we need to have some agreement on what the actual questions being asked are? Someone presents some data, a parameterized family of models M_theta, and a demonstration that – given the data – M_0 is rejected at such-and-such confidence level. I'm probably not going to be at all confused about, or in disagreement with, the mathematics. I may not like the model family. Or worse, I may well be completely perplexed as to what purpose the analysis was done for or what "rejected" actually means (to which an assertion: "here's what I define 'rejected' to mean in this context and here is why it is mathematically true given my definition" is deeply non-responsive).

        There seem a lot of important questions that are controversial, and mutually misunderstood, even if each side concedes perfect mathematical omniscience to the other. This seems to be true in statistical philosophy a lot.
        What mathematical question is at issue (or whose resolution would be at all helpful) in Lindley’s paradox, for instance?

        Likely I’ve misunderstood you but… I’ve seen plenty of questions where the approach “write down a clear, formal if needed, definition of what you are actually claiming and then let’s try to prove it or find counterexamples” has been a magic bullet for settling disagreements. But I just don’t see it, at all, for these questions in statistics.

        • The point is that the Bayesian-frequentist debate is characterised by an unresolvable (at least thus far) disagreement on what the questions are. This does not prevent us from agreeing to disagree on that and making further progress by investigating whether particular methodologies are mathematically correct given the stated aims in one or the other framework.

          As for Lindley’s paradox, there are no (unresolved) mathematical questions at issue. From the wikipedia entry: “Although referred to as a paradox, the differing results from the Bayesian and Frequentist approaches can be explained as using them to answer fundamentally different questions, rather than actual disagreement between the two methods.” Once we accept that different people prefer to ask different questions, no problems remain.

        • > This does not prevent us from agreeing to disagree on that and making further progress by investigating whether particular methodologies are mathematically correct given the stated aims in one or the other framework

          Are there _any_ well-known and interesting questions in the Bayesian-frequentist debate that have this nature? I.e. (if I am interpreting you correctly) where someone grants another's stated aims (and presumably definitions, etc.?) and it boils down to whether they are doing the math correctly relative to their assumptions? I sincerely do not know of any, and would love an example.

          Wrt Lindley's paradox, the most serious problems DO remain. Suppose someone (not a statistician) comes along to a statistician and says "Help me please; I'd like to know whether theta is zero or not" (H_0 is true or not?). That's almost certainly not what they really want to know, but, even if it is, _that_ question is NOT the question either the Bayesian or frequentist approach to the LP tries to answer. We just can't say they want the Bayesian question or the frequentist one or something else without further probing. And presumably it would be bad to just pull one of these two approaches off the shelf because that's what the statistician being approached is familiar with. So the problem that remains is: what do we ask the client in order to work out what _his_ question is? This is not a matter of pure mathematics; quite the opposite.

          (I do wonder whether the client’s problem will ever really map, remotely plausibly, onto a hypothesis-test question, but maybe this happens – still, it’s far from a given.)

        • Bxg:

          Here's an example, sort of. Anti-Bayesian know-nothings used to go around saying that they didn't like hierarchical models because they didn't "believe" in exchangeability. For example, in the 8 schools (see chapter 5 of BDA), what if some of the schools in the data are much better than others, or identifiably different in some important way? What if, say, 7 of the schools are public and 1 is private? One useful way to shift the discussion is to move to some technical issues. If you have identifying information on some schools, this info can be included as regression predictors. If you think that one of the schools might be much different from the others, this suggests that it's best not to model the 8 effects as coming from a common normal distribution; a long-tailed distribution might make more sense. The point is to get people away from arguing about exchangeability and focusing on (potentially) measurable aspects of the data. As Rubin would say, he'd like for the scientists to be focusing on the science, not on the statistical properties of estimators. Bayesian analysis when done right can transform statistical or even philosophical questions into scientific and technological questions.

          P.S. I hope someone’s still reading this thread!
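
          (For readers who want something concrete, here is a minimal sketch of the kind of model described above: the 8-schools data with a made-up public/private indicator as a regression predictor and a long-tailed t population distribution. This is an illustration, not code from BDA; the school_type values, the priors, and the degrees of freedom are assumptions.)

          ```r
          library(rstan)

          model_code <- "
          data {
            int<lower=0> J;              // number of schools
            vector[J] y;                 // estimated treatment effects
            vector<lower=0>[J] sigma;    // standard errors of the estimates
            vector[J] school_type;       // hypothetical indicator, e.g. 1 = private, 0 = public
          }
          parameters {
            real mu;                     // overall mean effect
            real beta;                   // shift for school type
            real<lower=0> tau;           // population scale
            vector[J] eta;               // standardized school effects
          }
          transformed parameters {
            vector[J] theta = mu + beta * school_type + tau * eta;
          }
          model {
            eta ~ student_t(4, 0, 1);    // long-tailed instead of normal population model
            tau ~ normal(0, 10);         // weakly informative; an assumption for this sketch
            y ~ normal(theta, sigma);
          }
          "

          schools <- list(J = 8,
                          y = c(28, 8, -3, 7, -1, 1, 18, 12),
                          sigma = c(15, 10, 16, 11, 9, 11, 10, 18),
                          school_type = c(0, 0, 0, 0, 0, 0, 0, 1))  # made up for illustration
          fit <- stan(model_code = model_code, data = schools)
          ```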

        • > Bayesian analysis when done right
          For "right" I would read "purposefully," and though we have moved forward (are now less wrong) we still need agreement on _right_, _purposefully_, etc., which are not mathematical concepts.

          Somewhat like Rubin's statement, David Andrews would suggest trying to discern whether the client wants help with the problem or the technique (and if it's just the technique, it's likely a waste of time).

        • quoting Bxg: “has been a magic bullet for settling disagreements”

          I don’t know that anyone claims it’s a magic bullet. I merely claimed when it’s possible, it gives a fighting chance of reducing some of the differences.

          Frequentists and bayesians agree on more specific points than in the past, and that progress has come from those few cases where philosophical points could be reduced to mathematical ones. Many big differences remain though, and future progress will likely come in the way described by Max Planck: "Science advances one funeral at a time."

        • I think optional stopping is at or near the core of the frequentist-Bayesian divide, and it’s eminently amenable to being reduced to math. (I’ve been poking at an optional stopping toy problem for a while now.)

        • Corey:

          We discuss stopping rules a bit in chapters 6 and 7 of BDA (chapters 6 and 8 of the forthcoming third edition). The paradox is often presented that the stopping rule evidently makes a difference, but in Bayesian inference it doesn’t seem to matter. We resolve this in two ways: First, we point out that the stopping rule does matter in model checking because it affects the predictive distribution and thus affects the hypothetical replications to which the data will be compared. Second, we point out that the stopping rule does make a difference in Bayesian inference if the model is changing over time.

        • Corey,

          I had noticed that and had been curious where you had taken it. I've long since concluded the real divide is simply the definition of probability. One group thinks:

          Def 1: P(x) is the limiting frequency of x in repeated trials.
          Def 2: P(x) defines a region (the high probability manifold) where the true x lies.

          Def 1 people imagine they're mini-physicists and are modeling physical laws of the universe. Def 2 people are essentially taking "majority votes" over the high probability region in order to make best guesses about functions of x. Def 2 is more useful and more realistic even in problems that involve frequencies in repeated trials, and vastly more general to boot. All distributions, sampling, priors, and posteriors are on exactly the same footing and are to be judged in the same ways. That's the unbridgeable divide between the Bayesians and the Frequentists; and really, going forward, the problem isn't the Frequentists, who are permanently lost in the sauce, it's that most Bayesians retain too much Def 1 intuition from their early encounters with statistics.

          I don't see the stopping rule stuff as fundamental. To cut a long story short, a likelihood L(f) comes from the functional form f = F(x), where x is some deeper space (which ultimately represents in some way the unknown state of the universe). If F is known with certainty then observed values of f have the same implications for x regardless of how they were observed. If there is uncertainty about F, then things like stopping rules affect what conclusions are drawn about x from the observed f. A straightforward Bayesian analysis of the case when there is uncertainty about the form of F takes care of the situation. Or in some cases it's simply that the wrong F is being used. I had the impression this had been worked out well enough for most practical purposes in Gelman's book.

        • Corey:

          Jim Berger did tell me in 2008 that the stopping rule stuff was what convinced him to become a Bayesian. There was not the opportunity to fully discuss why, so I can't say (but you are not the only one).

          Entsophy: I believe a large amount of the divide you point to comes from confusing what's being represented with the representation (e.g., thinking randomization does not matter in Bayes because it does not explicitly appear in Bayes' theorem). I do like your majority-vote description.

        • Andrew and Entsophy: I'm abstracting away most of the practical issues, economics-style, to maintain a tight focus on the irreconcilable differences. My toy problem is: collect one sample from a Gaussian with unknown mean and variance 1; if it's within some interval that's symmetric about 0, collect another data point with the same mean and variance sigma^2.

          This model is an abstraction of a general optional stopping situation with one decision point, arbitrary initial sample size and arbitrary optional follow-up sample size. It also doesn’t take us too far away from the fixed sample-size case, where the correspondence/agreement between Bayesian and frequentist inference is well-understood.

          I’m considering a comparison of confidence intervals vs. credible intervals. I got bogged down looking for a confidence interval procedure that was optimal according to some criterion (instead of one that was hacked together out of inequalities and that ignores some of the data, per the standard literature on the subject).
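
          (For anyone who wants to play with it, a quick simulation sketch of that data-generating process; the function name, the cutoff, and sigma here are arbitrary placeholders.)

          ```r
          simulate_once <- function(mu, cutoff = 1, sigma = 2) {
            y1 <- rnorm(1, mu, 1)                    # first observation, variance 1
            if (abs(y1) < cutoff) {
              c(y1 = y1, y2 = rnorm(1, mu, sigma))   # optional follow-up, sd sigma (variance sigma^2)
            } else {
              c(y1 = y1, y2 = NA)                    # stop after the first observation
            }
          }

          # e.g. replicate(10, simulate_once(mu = 0.3))
          ```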

        • bxg, regarding Lindley’s paradox: But that question _is_ directly answerable in the Bayesian framework – by calculating the probability of H_0 being true. (Unless the client just wanted a yes/no answer with no uncertainty involved – in that case, refer them to a theologian.)

          Of course, like any problem in statistics or applied mathematics, one cannot get an answer without first supplying a sufficiently complete model, and the answer cannot be expected to correspond to reality except to the extent that the model does. So there is still some more work to do before the Bayesian mechanism can come into play.

          And of course if the client has not yet decided what the question is, it is not yet time to select a methodology for answering it. (Granted, most of our effort in science goes into formulating questions rather than answering them, and availability of methodology does affect the questions we choose.)

        • konrad, if I were to agree with you (at the meta-level I somewhat do), what space does that leave for the frequentist answer to LP? Part of this discussion is about the legitimacy of “different questions” so if Bayes gets to take the “is theta 0?” question, what is the other question (the frequentist one) good for and who cares? You either have a diplomatic answer to that or re-raise the philosophical dispute that people are trying to bottle up.

          In the specifics, I actually disagree with you that Bayes is going to be useful in dealing with a client whose true question really is to discover whether theta is 0 (truly, absolutely, no noise, equal to zero).

          If someone asks me whether theta is truly 0 or not, then I’m going to search for a domain-specific argument why it is almost certainly not so. And for most such questions (certainly any in the social sciences, but much more broadly than that) I’ll quickly find such an argument, and furthermore the argument will be so strong that no reasonable amount of data will shake it. So I didn’t need any statistics to satisfy that client. And if you had data I would (rightly!) ignore it.
          But maybe we have another problem where theta truly = 0 is a live possibility. Then I'll first approach it as a mathematician or logician, trying to deduce a proof of such or find a counterexample. No data, no statistics either.
          So, yes, there are some cases (I think rare) where theta truly = 0 is a live possibility, we can't deduce our way to truth or falsity, and sampling/statistics actually might help. I'll go out on a limb and speculate that these are going to be very special situations, i.e. where we think sampled data can actually assist in deciding the truth of such a precise claim, so that the right way of using this data is going to be very idiosyncratic. Bayes? Something entirely new? Who can guess in the abstract?

        • @bxg: I agree with all of that. I was assuming we are talking about an example where the client really does have a nonzero prior for H_0; I might point out that some nuance is possible, e.g. the client might be thinking of =0 as meaning “within a distance epsilon from 0”, where they have a principled argument for choosing epsilon. In that case a continuous model will do the job without point priors.

          Regarding the frequentist question, a diplomatic answer would be that many people do find it compelling. Since I personally do not, I’ll not try to defend it myself (without denying that a valid defense could be offered in principle).

  7. Anonymous: I was merely alluding to the standpoints behind (a) Gelman’s citation of Wasserman:
    “Some of the greatest contributions of statistics to science involve adding additional randomness and leveraging that randomness. Examples are randomized experiments, permutation tests, cross-validation and data-splitting. These are unabashedly frequentist ideas and, while one can strain to fit them into a Bayesian framework, they don’t really have a place in Bayesian inference. The fact that Bayesian methods do not naturally accommodate such a powerful set of statistical ideas seems like a serious deficiency”.

    and (b) the previous remark:

    “Christian: really? You think the billions of dollars we spend doing randomized trials
    is wasted?”

    I don’t really understand the “what” vs “how” distinction here:
    “Randomization addresses a causal problem – what is being estimated – whereas the choice of bayesianism vs. frequentism addresses how it’s estimated.”

    If there is no dispute on the “what” then is the issue whether Wasserman is correct about the “how”?

  8. Pingback: Gaussian Copula Process errors for ODE models | Models Of Reality

      • I would love to, but at each iteration I need to run an ODE solver for each timeseries to get the predicted values for that timeseries. I don't think Stan can do this, can it?

        • It's not clear to me how Stan could do this and still use NUTS; wouldn't you have to be able to calculate the derivative of the ODE solver's highly multivariate output with respect to all the parameters? In any case, it would be quite wonderful to be able to fit ODEs to timeseries data using Stan, so I am excited to hear that this is something you're considering.

  9. The Bayesian defence seems pretty unprincipled to me. If you're not confident about your prior, then randomisation reduces prior sensitivity, and if you're not confident about your ability to model data collection, then randomisation can give you an ignorable model. But that kicks off by assuming that you're not a perfect statistician and are compensating for weakness. The perfect Bayesian would be sure about the prior and able to model data collection – and wouldn't need randomisation.

    It just seems the Bayesian defence is from weakness ("we're not perfect but this compensates") while the Fisherian defence is from strength ("the best way of analysis is through randomisation"). Very different perspectives.

    • Alex:

      I can't speak for others, but the discussion of randomization in BDA follows Bayesian principles. I think your remark, "If you're not confident about your prior," is a red herring. First, a model is a set of assumptions. We go with what assumptions we have, but we typically know they're wrong. Second, the "prior" is only part of the model, and I strenuously object to those statisticians who unquestioningly accept whatever data model comes to them but then balk at a probability model for the parameters.

    • To add to Andrew’s comments and paraphrase what Daniel Lakeland said above: the purpose of randomization is to increase the amount of (useful) information in the data set being collected. That’s an argument from strength, not weakness.

      Also, the absence of a good data collection model and the need to circumvent this is an intrinsic part of the problem, not of the proposed solution (whether frequentist or Bayesian).
