Setting up a prior distribution in an experimental analysis

Baruch Eitam writes:

My colleague and I have gotten into a slight dispute about prior selection. Below are our three different opinions: the first is the uniform (will get to that in a sec) and the other two are the priors in dispute. The parameter we are trying to estimate is people’s reporting ability under conditions in which a stimulus they have just seen is “task irrelevant” — since they have 5 options to pick from, chance level is .2.

My preferred prior is the higher one, as it reflects our more conservative estimate of the effect (a higher modal theta reflects fewer errors). My colleague, on the other hand, opted for averaging all our previous experiments, which ended up giving a larger proportion of errors (i.e., a lower prior modal theta).

Initially, I thought that I should go with a more conservative estimate of the effect size, as the purpose of the current study was to test a candidate mechanism underlying this phenomenon (which we named “irrelevance based blindness”) by seeing whether it exists under conditions that, theoretically, preclude the operation of (some) mechanisms, thereby constraining the number of possible ones. I hope this makes some sense. Anyway, with this goal in mind, I thought that by also including what might be extreme results we may be losing sensitivity, because our conclusion will be based on examining the posterior interval and seeing whether we have evidence for “blindness.” On a practical level this ends up not making much difference because, surprisingly, participants showed much stronger “blindness” than we have ever seen under similar conditions, but I still think my colleague’s and my debate touches on interesting issues for experimentalists trying to use Bayesian estimation. For example, should such “sensitivity” considerations factor into choosing a prior, or is this merely a relic of hypothesis-testing practice?

I don’t have much to say about this particular example, but in general one way to understand these choices is, for each model, to simulate replicated datasets from the prior predictive distribution, and see if these make sense. This can be viewed as a way of engaging with the prior information that is already in your head. Also remember that this prior predictive distribution represents different datasets that might be observed in different experiments studying different treatments, and you can also feel free to play with the sample sizes of these hypothetical replications.
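
To make that suggestion concrete, here is a minimal sketch in Python (NumPy) of what such a prior predictive simulation could look like for this kind of 5-alternative task. The beta priors and the 40-trials-per-condition sample size are invented for illustration, not taken from Eitam’s experiments:

    # Prior predictive simulation for a 5-alternative report task (chance = 0.2).
    # The beta priors and the number of trials are hypothetical placeholders.
    import numpy as np

    rng = np.random.default_rng(1)
    n_trials = 40      # hypothetical number of trials per condition
    n_sims = 1000      # number of replicated datasets

    def prior_predictive(alpha, beta):
        """Draw theta from Beta(alpha, beta), then a count from Binomial(n_trials, theta)."""
        theta = rng.beta(alpha, beta, size=n_sims)
        return rng.binomial(n_trials, theta) / n_trials

    for label, (a, b) in [("uniform Beta(1,1)", (1, 1)),
                          ("higher-theta prior Beta(8,4)", (8, 4)),
                          ("pooled-experiments prior Beta(4,8)", (4, 8))]:
        acc = prior_predictive(a, b)
        lo, med, hi = np.percentile(acc, [5, 50, 95])
        print(f"{label:35s} simulated accuracy 5%/50%/95%: {lo:.2f} {med:.2f} {hi:.2f}")

If most of the simulated accuracies under a candidate prior land in places you simply don’t believe (say, far below the 0.2 chance level), that prior isn’t encoding what you actually know.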

24 thoughts on “Setting up a prior distribution in an experimental analysis”

  1. So, just to clarify, the “subjective prior” (is this what we’re talking about?) needs to be rationalized? What is the criterion for selection? Is there any choice that would be considered irrational and thus out of play? This would seem to be the only possible basis for rejection, since the priors aren’t empirically testable. Can this dispute be resolved? How?

  2. >> for each model, to simulate replicated datasets from the prior predictive distribution, and see if these make sense

    I am sure the reviewer will love it when you write this. Sounds like the Bayesian version of “comes almost certainly close to significance”.

    “see if these make sense”: Sounds like the garden of forking paths…

    • Dieter:

      When doing exploratory work we can’t choose our models ahead of time. A bad model is a bad model. Suppose someone fits a logistic regression (standard textbook choice) with uniform priors on the coefficients (again, standard textbook choice). No forking paths at all. But it can be a bad model.

      The point of writing about forking paths is not to say that we should choose our models ahead of time; the point is for us to be aware of the choices we are making in data processing and analysis.

      • The conclusions of this type of exploratory work haven’t been tested empirically. And their assumptions about probability can’t be tested by rational argument. So at that stage they can be of no relevance to scientific findings.

    • Choice of a prior mechanically induces a posterior predictive distribution. It’s rare in any kind of complicated model to be able to know what that posterior predictive looks like without simulating from it.

      After drawing from the posterior predictive distribution, do the simulated data points fall where you expected them to? If so, good; if not, then your choice of priors didn’t represent your expectations and hence didn’t make sense, in the same way that if adding a certain colorant to your paint doesn’t bring it closer in color to the item you are trying to match, then adding that colorant doesn’t make sense for achieving your goal.
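
      As a purely illustrative version of this idea, here is a small conjugate beta-binomial sketch in Python; the observed count, number of trials, and prior are all made up:

          # Posterior predictive check for a beta-binomial model; all numbers invented.
          import numpy as np

          rng = np.random.default_rng(2)
          y, n = 9, 40            # hypothetical observed successes out of n trials
          a0, b0 = 2, 2           # whichever candidate prior is under discussion

          theta = rng.beta(a0 + y, b0 + n - y, size=4000)   # posterior draws (conjugacy)
          y_rep = rng.binomial(n, theta)                    # posterior predictive draws

          print("observed:", y)
          print("replicated 5%/50%/95%:", np.percentile(y_rep, [5, 50, 95]))

      If the replicated counts don’t fall where you expected, the model (prior and/or likelihood) isn’t capturing your expectations.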

      • >>After drawing from the posterior predictive distribution, do the simulated data points fall where you expected them to?

        Would you be willing to accept a paper in review that tells you: We tried 5 different priors, and only the third “made sense.” Which, luckily, is the one close to our hypothesis. Reviewer: How do you define “made sense”? How far off-sense were the other 4?

        We can be lucky if she argues like that. The usual argument is “The paper should be rejected, because the statistics uses handwaving methods,” or probably the same thing wrapped in slightly more scientific-sounding clauses.

        • Dieter:

          It’s funny: you must see different reviewers than I do. I’ve been using these methods for 30 years and have published hundreds of peer-reviewed papers, and I’ve never once had a reviewer write that my paper should be rejected because the statistics uses handwaving methods.

        • We like to think about the priors and likelihoods more holistically. Both involve subjective choice on the part of the modeler. This puts the choice among priors and choice among likelihoods on an equal footing. Both have ramifications for inferences.

          I would strongly prefer a coherent discussion of these modeling choices in a paper to “I ran off-the-shelf regression package X on data Y and got p-value alpha.”

          Choices of prior and likelihood both need to be justified in terms of consistency with what we know before we started and sensibility of posterior inferences (this is literally chapter 1, verse 1 in BDA). In particular, we like to use posterior predictive inferences to evaluate model fits for consistency with prior knowledge and consistency with held-out data. This is a no-no in classical hypothesis testing, but the go-to method in engineering for building things that actually work. The bottom line is that I want to be pragmatic and build models that make well-calibrated predictions (in the posterior coverage sense) that are as sharp as possible (narrower posterior intervals are better given calibration).
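
          For concreteness, a toy Python sketch of “calibrated in the posterior coverage sense” for a simple beta-binomial model could look like this; the prior, sample size, and interval level are arbitrary choices for illustration:

              # Toy calibration check: simulate truths from the prior, refit the same
              # model, and see how often the 90% posterior interval covers the truth.
              import numpy as np

              rng = np.random.default_rng(3)
              n, a, b = 40, 2, 2
              n_reps, hits = 2000, 0
              for _ in range(n_reps):
                  theta_true = rng.beta(a, b)                 # draw a "true" theta from the prior
                  y = rng.binomial(n, theta_true)             # simulate one dataset
                  draws = rng.beta(a + y, b + n - y, 4000)    # posterior under the same model
                  lo, hi = np.percentile(draws, [5, 95])
                  hits += (lo <= theta_true <= hi)
              print("coverage of 90% intervals:", hits / n_reps)   # should be near 0.90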

        • When I said “mechanically induces a posterior predictive distribution,” what I meant was: assuming you had already made a choice of likelihood. So I agree with you, Bob, that all portions of the model require some kind of justification.

        • To add a little to Bob’s comment about engineering approaches that actually work:
          In engineering applications using quality concepts, experiments are usually done in succession, even when using frequentist statistics. The first experiment may only give a rough idea of the variability — but then this (now “prior information”!) can be used as informed input for a “power analysis” to choose a sample size for a second experiment. And the first experiment may also point out other information that suggests a different design (“point out” includes things like listening to what the shop foreman has to say).
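
          As a hedged illustration of that hand-off from pilot to power analysis, here is the standard normal-approximation sample-size formula in Python; the pilot standard deviation of 12 and the 8-unit effect of interest are invented numbers:

              # Rough two-group sample-size calculation using a pilot estimate of sd.
              # All numbers are placeholders, not from any real experiment.
              from scipy import stats

              sd_pilot = 12.0            # variability estimated from the first experiment
              delta = 8.0                # smallest effect worth detecting next time
              alpha, power = 0.05, 0.80

              z_a = stats.norm.ppf(1 - alpha / 2)
              z_b = stats.norm.ppf(power)
              n_per_group = 2 * ((z_a + z_b) * sd_pilot / delta) ** 2
              print("approximate n per group:", round(n_per_group))   # about 35 with these inputs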

        • To me this suggests that you aren’t sure how to describe “makes sense.” See Bob Carpenter’s post below: using a vague prior on whatever, the prior predictive distribution suggests that 10% of batters should have a 0.900 batting average or better, which is far from what really happens. The prior we chose has batting averages concentrated around 0.1 to 0.5, which covers the range of realistic values.

          I doubt you’d have any problem with something like that, gussied up into the appropriate academese.

      • How can fitting specific sets of data points to a model distribution help extend our understanding of nature? If you guessed right, it might be interesting if you had an underlying argument and the distribution wasn’t generic and otherwise predictable. If you guessed right and had no argument, then it wouldn’t be of any use. If you guessed wrong, again, if you had a solid underlying argument this might be interesting. If not, it’s just a bunch of meaningless points fit to a meaningless distribution. If you’re just taking shots in the dark or using generic rules of thumb, then your results are exceptionally vulnerable to falsification with the next dataset that comes along; and a replication will require exact replication of all conditions, since without an argument confounds aren’t made explicit. So any failure to replicate can in this case always be attributed, in the words of Dr. Primestein, to “prevailing neutrino winds…”

        • This sounds like a very general objection to (and possible misunderstanding of) parametric statistics in general. The idea is that you fit models (which encode various theories and assumptions about the world) to data; you then check your model fits against data to either (a) expand them or (b) compare and select among competing models.

  3. This point gets driven home to me every time I want to do model calibration tests using simulated data. When you simulate parameters from an overly vague prior compared to what you actually know about the problem, the resulting simulated data sets are often so dispersed as to be utterly inconsistent with what you know.

    I should add this to my repeated binary trials case study for Stan! If I were to simulate from the uniform prior on batting ability I used for the unpooled case, the outcomes would be ridiculous (a league of 0.900 or 0.050 batting averages). The normal I used for log odds isn’t much better (if it were logistic(0, 1), it’d be exactly the same). The question then is just how tight to make the prior. Let’s say I’m like Baruch Eitam (whom Andrew quoted above) and want to stick to a beta. What I did in the case study was to use a hierarchical model. That has the advantage of estimating the prior along with everything else, and the result is quite reasonable considering the amount of data (45 binary trials for each of 25 items). The hierarchical prior fit by the model is much more like the posterior in Eitam’s graphs. (A rough simulation of the uniform-versus-tight contrast is sketched at the end of this comment.)

    With the hierarchical model, I again took very vague priors on the parameters for the beta (or for the normal in the log odds case). Were I to have simulated from those hyperpriors, then generated a full data set, it’d look even more extreme than if I’d done it in the no pooling case because the effects get amplified (like in Neal’s funnel example).

    Really vague priors then cause computational problems when trying to apply the Cook-Gelman-Rubin calibration tests—the draws wind up in extreme locations that can cause numerical issues with floating point and with Stan’s adaptation and sampling.
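
    A rough simulation of that contrast (uniform versus tighter priors on batting ability over 45 at-bats) might look like the following in Python; the Beta(20, 50) is just an illustrative stand-in for a fitted hierarchical prior, not the one from the case study:

        # Prior predictive batting averages over 45 at-bats under two priors on ability.
        # Beta(20, 50) is a made-up "tight" prior for illustration only.
        import numpy as np

        rng = np.random.default_rng(4)
        n_at_bats, n_players = 45, 25

        for label, (a, b) in [("uniform Beta(1,1)", (1, 1)),
                              ("tighter Beta(20,50)", (20, 50))]:
            ability = rng.beta(a, b, size=n_players)
            avgs = rng.binomial(n_at_bats, ability) / n_at_bats
            print(f"{label:20s} min/median/max simulated average: "
                  f"{avgs.min():.3f} {np.median(avgs):.3f} {avgs.max():.3f}")

    Under the uniform prior you routinely see leagues full of .900 and .050 hitters; under the tighter prior the simulated averages stay in a plausible range.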

  4. I never got around to looking into it. Is using a beta(1, 1) prior the same as a uniform prior? They look the same. Of course, the beta distribution is in the exponential family, so, mathematically, you can get some closed-form solutions. But, by the “look” of it, they are the same.
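
    For what it’s worth, the Beta(1, 1) density is constant at 1 on [0, 1], so it is exactly the Uniform(0, 1) distribution. A quick numerical check in Python:

        # Beta(1, 1) and Uniform(0, 1) have identical densities on (0, 1).
        import numpy as np
        from scipy import stats

        x = np.linspace(0.01, 0.99, 9)
        print(np.allclose(stats.beta.pdf(x, 1, 1), stats.uniform.pdf(x)))   # True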

  5. What has happened down here is that psychological science (and not only) has been taken over by mathematicians and statisticians, who know how to measure and manipulate numbers and describe distributions but not how to generate new arguments, new concepts, and new facts.

    Think about the question that started this discussion: Eitam wants to describe data on “people’s reporting ability under conditions in which a stimulus which they have just seen is ‘task irrelevant.’” Is the question designed to enlighten any point of theoretical interest? Will the decision as to how to describe the given dataset help lead to a better understanding of mechanism or principle? Has the question been framed with enough conceptual clarity and precision as to make any decision replicable in principle? No to the latter, for sure. When we’re talking about perception, we can’t just refer to a “stimulus,” we can’t just refer to a “task,” we can’t just refer to “context,” just as in physics we can’t just refer to a “thing,” a “situation,” or “surrounding things.” The specific referents and methodological choices are crucial for any outcome to have theoretical interest. As long as Eitam and other psychologists limit their descriptions to crude placeholder terms like this (and I mention it because it is common), anyone who understands the subject will be able to falsify any predictions about data simply by a principled choice of conditions.

    The flood of quantitative technique gives the appearance of technical advancement, and it has swept away those who are scared off by artificially complex, convoluted arithmetical acrobatics, or who are attracted not to them but to ideas. Science has gotten lost in this flood.

    Merry Christmas…;).

  6. A point not discussed is how relevant these “previous experiments” are to the current one. We have a spectrum going from “these were completely different” to “they were a pilot run of the current experiment.”

    In the first case, a prior reflecting the (supposed) ignorance about theta should be chosen. It should be clear by now that there is no such thing; using beta(x,1,1) (the uniform) encodes the *knowledge* that both “success” and “error” have been observed at least once. There are technical arguments for using either the Jeffreys prior (beta(x,1/2,1/2)) or even (Dirac(x)+Dirac(x-1))/2, the latter encoding the (subjective) *belief* that both results are possible, without any corroborating data. Since the author adopts (without discussion) a binomial likelihood, his posterior will be beta(x,6,33), beta(x,11/2,65/2) or beta(x,5,32) in the respective subcases. I doubt that posterior predictive checking will be able to differentiate between those subcases.

    In the second case, let n_s and n_f be the numbers of “successes” and “failures” observed in the pilot run. Starting from an “ignorance” prior (before the pilot run), the posterior knowledge about theta (after the pilot run) is summarized by a beta(x, n_s+1, n_f+1) distribution (replace +1 by +1/2 or 0 if you happen to prefer Jeffreys or Dirac priors to reflect ignorance). And *that* is the prior to be used in the analysis of the final run, which will end up in a beta(x, 5+n_s+1, 32+n_f+1) posterior distribution for theta (again, adjust by -1/2 or -1 according to your “ignorance representation” tastes).

    Note that posterior predictive checking *may* be able to detect a better fit using one of the priors rather than the other.

    Now, where are we between those two extreme positions?

    * You may try to (subjectively) guesstimate the “degree of relevance” r of the pilot runs to the final run, varying from 0 (totally irrelevant) to 1 (totally relevant), and weight your prior by this factor; you’ll end up with a beta(x, 5+(r*n_s)+1, 32+(r*n_f)+1) distribution for theta (again with suitable adjustments if necessary…). A small numeric example of this weighting appears at the end of this comment.

    * You may even try to fit a *mixture* of these posteriors and estimate it from the data. This would amount to getting *from the data* an estimate of how much the pilot runs were relevant to the final run.

    What do you think?
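
    As a minimal numeric sketch of the relevance-weighting idea above: the pilot counts and r below are invented, while the 5 successes and 32 failures for the final run follow the counts implied by the beta(x,6,33) posterior mentioned earlier.

        # Power-prior style weighting of pilot data by a relevance factor r.
        # Pilot counts and r are invented; final-run counts (5 successes, 32 failures)
        # match the beta(x,6,33) posterior quoted above.
        n_s, n_f = 12, 48        # hypothetical pilot successes / failures
        y_s, y_f = 5, 32         # final-run successes / failures
        r = 0.5                  # guesstimated relevance of the pilot to the final run

        a_post = y_s + r * n_s + 1       # beta(x, 5 + r*n_s + 1, 32 + r*n_f + 1)
        b_post = y_f + r * n_f + 1
        print("posterior mean of theta:", round(a_post / (a_post + b_post), 3))

    Varying r from 0 to 1 moves the answer between the “ignore the pilot” and “pool everything” extremes.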
