Hidden dangers of noninformative priors

Posted on November 21, 2013 9:11 AM by Andrew

Following up on Christian’s post [link fixed] on the topic, I’d like to offer a few thoughts of my own.

In BDA, we express the idea that a noninformative prior is a placeholder: you can use the noninformative prior to get the analysis started, then if your posterior distribution is less informative than you would like, or if it does not make sense, you can go back and add prior information.

Same thing for the data model (the “likelihood”), for that matter: it often makes sense to start with something simple and conventional and then go from there.

So, in that sense, noninformative priors are no big deal, they’re just a way to get started. Just don’t take them too seriously.

Traditionally in statistics we’ve worked with the paradigm of a single highly informative dataset with only weak external information. But if the data are sparse and prior information is strong, we have to think differently. And, when you increase the dimensionality of a problem, both these things happen: data per parameter become more sparse, and prior distributions that are innocuous in low dimensions become strong and highly informative (sometimes in a bad way) in high dimensions.
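To make the sparse-data case concrete, here is a minimal sketch (not from the original post) of a single survey cell with very few respondents. The Beta prior, its mean of 0.5, its weight of 10 pseudo-observations, and the helper name posterior_mean_share are all illustrative assumptions.

```python
def posterior_mean_share(successes, n, prior_mean, prior_weight):
    """Posterior mean of a proportion under a Beta prior.

    The prior is Beta(a, b) with a = prior_mean * prior_weight and
    b = (1 - prior_mean) * prior_weight, so prior_weight acts like a
    number of pseudo-observations (an assumption made for illustration).
    """
    a = prior_mean * prior_weight
    b = (1 - prior_mean) * prior_weight
    return (a + successes) / (a + b + n)


# Sparse cell: 2 respondents, both say yes. The raw estimate is 100%.
small_cell = posterior_mean_share(2, 2, prior_mean=0.5, prior_weight=10)

# Well-populated cell: 1400 of 2000 say yes. The raw estimate is 70%.
large_cell = posterior_mean_share(1400, 2000, prior_mean=0.5, prior_weight=10)

print(f"n = 2:    raw 1.00, with prior {small_cell:.2f}")
print(f"n = 2000: raw 0.70, with prior {large_cell:.2f}")
```

With two observations even this mild prior dominates the estimate, while with 2000 observations it barely matters. That is the sense in which the same prior can be weak in one setting and strong in another.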

Here are four examples of the dangers of noninformative priors:

1. From section 3 of my 1996 paper, Bayesian Model-Building by Pure Thought: estimating a convex, increasing function with a flat prior on the function values (subject to the constraints). It’s a disaster. As discussed in the article, the innocuous-seeming prior contains a huge amount of information as you increase the number of points at which the curve is estimated.

2. The classic 8-schools example: that is, any hierarchical model. A noninformative uniform prior on the coefficients is equivalent to a hierarchical N(0,tau^2) model with tau set to a very large value. This is a very strong prior distribution pulling the estimates apart, and the resulting estimates of individual coefficients are implausible.

3. Any setting where the prior information really is strong, so that if you assume a flat prior, you can get silly estimates simply from noise variation. For example, the claim that beautiful parents are more likely to have girls, which is based on data that are much much weaker than the prior information on this topic.

4. Finally, the simplest example yet, and my new favorite: we assign a flat noninformative prior to a continuous parameter theta. We now observe data, y ~ N(theta,1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 84% that theta>0. I don’t believe that 84%. I think (in general) that it is too high.
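The 84% in example 4 can be reproduced directly. With a flat prior and y ~ N(theta,1), the posterior is theta | y ~ N(y,1), so the probability that theta>0 is the standard normal CDF evaluated at y. The sketch below uses only Python's standard library; the function name is made up for this illustration.

```python
from statistics import NormalDist


def prob_positive_flat_prior(y, sigma=1.0):
    """P(theta > 0 | y) for y ~ N(theta, sigma^2) with a flat prior on theta."""
    # Under a flat prior the posterior is theta | y ~ N(y, sigma^2).
    posterior = NormalDist(mu=y, sigma=sigma)
    return 1.0 - posterior.cdf(0.0)


p = prob_positive_flat_prior(y=1.0)
print(f"P(theta > 0 | y = 1) = {p:.3f}")             # about 0.84
print(f"posterior odds       = {p / (1 - p):.1f}:1")  # roughly 5 to 1
```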

None of these examples are meant to shoot down Bayes. Indeed, if posterior inferences don’t make sense, that’s another way of saying that we have external (prior) information that was not included in the model. (“Doesn’t make sense” implies some source of knowledge about which claims make sense and which don’t.) When things don’t make sense, it’s time to improve the model. Bayes is cool with that.
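As one hedged illustration of what improving the model could look like in example 4, suppose the flat prior is replaced by theta ~ N(0, tau^2). The zero-centered normal form and the particular values of tau are assumptions chosen for illustration, not a recommendation from the post. The posterior is then normal with mean y·tau^2/(tau^2+1) and variance tau^2/(tau^2+1), and the flat prior is recovered as tau grows.

```python
from math import sqrt
from statistics import NormalDist


def prob_positive_normal_prior(y, tau, sigma=1.0):
    """P(theta > 0 | y) for y ~ N(theta, sigma^2) and prior theta ~ N(0, tau^2)."""
    shrink = tau**2 / (tau**2 + sigma**2)  # weight given to the data
    post_mean = shrink * y
    post_sd = sigma * sqrt(shrink)
    return 1.0 - NormalDist(post_mean, post_sd).cdf(0.0)


for tau in (0.5, 1, 10, 1000):
    p = prob_positive_normal_prior(y=1.0, tau=tau)
    print(f"tau = {tau:>6}: P(theta > 0 | y = 1) = {p:.3f}")
```

Under these assumptions the probability rises from about 0.67 at tau = 0.5 toward the flat-prior value of about 0.84 as tau becomes large, so the 84% corresponds to an assumption that effects far from zero are quite plausible.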

P.S. Much discussion in comments. The following bit might be helpful, regarding example 4 above:

Of course it depends on the context. Depending on the scaling of the problem, an effect of 100 could make sense. I try to scale things so that effects are of order of magnitude 1. For example, in logistic regression you’re not going to see an effect of 100, similarly in econ you’re not going to see an elasticity of 100 if you’re working on the log-log scale.

I wouldn’t frame this as “second-guessing someone’s prior.” A better way to put it would be that people use conventional models that include much less information than is actually known. Such conventional models include linear regressions etc. as well as uniform prior distributions. If data are strong, you can often do just fine with conventional models. But if data are sparse, it can often make sense to go back and add some real information to your model, in order to better answer your scientific questions.

To put it another way, an analysis based on a conventional model can (sometimes) tell you what’s in the data. But scientific reports typically don’t just report information in data, they also make general claims about the world, and for that it can be a terrible mistake to ignore strong information that is already known.

69 thoughts on “Hidden dangers of noninformative priors”

John on November 21, 2013 9:49 AM at 9:49 am said:

Excuse me for being dense, but are those examples of problems with noninformative priors really just examples of how “noninformative” sometimes isn’t what you think it is? Once the lesson is learned can one not then go out and form better noninformative priors?
- Michael Betancourt on November 21, 2013 10:51 AM at 10:51 am said:
  
  How would you improve the prior? By decreasing the range of the parameters? By emphasizing smaller values? All of those approaches would be _adding_ information to the system, in which case is the prior still non-informative?
  
  Really there are foundational issues in what non-informative even means, and the problem of when non-informative inevitably leads to distributions that can’t be normalized. There’s always some information available (as Andrew noted, if you’re not comfortable with your results it’s because they’re clashing with some implicit information to which you are comparing it) so you might as well use it!
Entsophy on November 21, 2013 9:53 AM at 9:53 am said:

That example #4 is bogus. If your prior says theta is in [-100,100] and the data is saying it’s in [-1,3] then the posterior puts these together and says:

Bayes Answer: “if we look at all the values of theta which are reasonably compatible with the prior and data then a sizable majority of them are greater than zero”

I don’t see how anyone would complain about that answer since it’s obviously true.

If you have in mind a very different problem, namely:

Frequentist Question: “we simulate theta from some distribution and then simulate y and get 1; what percentage of the time is theta greater than zero?”

well that’s a completely different problem and the answer isn’t likely to be 84%.

If you go around interpreting the Bayesian Answer as providing the solution to the Frequentist Question then you can generate endless problems/paradoxes. I can have a million of them for you by tomorrow if you like.
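The simulation question posed in this comment can be sketched in a few lines. The two generating distributions for theta below are assumptions picked for illustration: a wide uniform on [-10,10] standing in for a diffuse prior (narrower than [-100,100] only so that enough draws land near y=1), and a tight N(0, 0.5^2) standing in for a setting where true effects are usually small. The function name is made up for this illustration.

```python
import random


def share_positive_given_y_near_1(draw_theta, n_draws=200_000, half_width=0.1, seed=1):
    """Simulate theta, then y ~ N(theta, 1); among draws with y within
    half_width of 1, return (share with theta > 0, number of draws kept)."""
    rng = random.Random(seed)
    kept = 0
    positive = 0
    for _ in range(n_draws):
        theta = draw_theta(rng)
        y = rng.gauss(theta, 1.0)
        if abs(y - 1.0) < half_width:
            kept += 1
            positive += theta > 0
    return positive / kept, kept


wide_share, wide_n = share_positive_given_y_near_1(lambda rng: rng.uniform(-10, 10))
tight_share, tight_n = share_positive_given_y_near_1(lambda rng: rng.gauss(0, 0.5))

print(f"theta ~ Uniform(-10, 10): share positive = {wide_share:.2f} ({wide_n} draws kept)")
print(f"theta ~ N(0, 0.5^2):      share positive = {tight_share:.2f} ({tight_n} draws kept)")
```

When theta really is generated from the wide distribution, the simulated share should come out close to 84%. When it is generated from the tight one, the share should be noticeably lower, roughly two thirds under these assumptions. The long-run frequency matches the flat-prior posterior only when the prior matches how theta actually arises.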
- Entsophy on November 21, 2013 10:12 AM at 10:12 am said:
  
  Oh and if theta is a physically real fixed parameter, like say the speed of light for example, then the Bayesian question is the physically relevant one: “what range of values for theta are consistent with our evidence?”
  
  The high probability region of the Bayesian posterior is answering this question perfectly. The Frequentist question on the other hand is just some made up nonsense completely irrelevant to the real world.
- bxg on November 21, 2013 1:48 PM at 1:48 pm said:
  
  I don’t understand why your “Frequentist Question” is different at all. Beyond stipulating that theta is actually drawn from a distribution (so making a distribution over theta something a non-Bayesian won’t object to), why isn’t it exactly asking for the posterior that theta > 0 conditioned on seeing “1”?
  I get the general difference, I just don’t see it for the question you pose.
  - Entsophy on November 21, 2013 2:59 PM at 2:59 pm said:
    
    There’s a slew of differences:
    
    (1) The Bayesian question/answer make perfect sense even if theta is a fixed physical quantity, the Frequentist question/answer do not.
    
    (2) For the Frequentist answer to be valid the distribution you choose for theta has to match the one you’re simulating from. The validity of the Bayesian answer depends on a completely different criterion, namely that the true fixed theta is in the high probability region of the prior ([-100,100] in this case).
    
    (3) When you go to use the Bayesian answer in the correct Bayesian manner, by say averaging a loss function or averaging over P(alpha|theta) for some other alpha of interest, you’ll get decent results (i.e. decent interval estimates for alpha). Although, if you have a tighter prior for theta, you could use it to get better results.
    
    The diffuse prior will give results consistent with the better answer, but just more diffuse.
    
    Bottom line: there’s absolutely nothing wrong with #4 other than the fact that if you mix and match Bayes/Freq you can generate endless amounts of nonsense.
    - bxg on November 21, 2013 7:07 PM at 7:07 pm said:
      
      Frequentists usually ask questions with this pattern: “I don’t believe there is _any_ sense in which there is a distribution over theta; nevertheless, what can I say?” (E.g. confidence intervals, hypothesis tests, etc). Answers to such questions mix notoriously badly with Bayesian approaches.
      
      But _your_ frequentist question has a case where there is a real (and thus acceptable-to-him) distribution over theta, and he asks (by your account) the Bayes-rule-driven question: what is the probability that theta is > 0 conditioned on his data. He’s probably feeling lucky, because that’s not his everyday case (and in his everyday case, he is NOT going to make up a distribution.) But given that he has this distribution, where do he and the Bayesian collide? His distribution might not match your [-100,100] prior, but then two Bayesians might disagree too. I just don’t see what nonsense one can generate vis-a-vis _this_ Frequentist Question.
      
      Sorry for missing your point.
    - James Annan on November 26, 2013 1:04 AM at 1:04 am said:
      
      I think the “high probability region” of that prior is the real line *excluding* the interval [-5,5].
      
      Doesn’t look such a good method (or answer) now, does it?
    - Entsophy on November 26, 2013 1:41 AM at 1:41 am said:
      
      It’s irrelevant what you think. If the true value is in [-100,100] the method and answer look great.
    - James Annan on November 26, 2013 3:29 AM at 3:29 am said:
      
      Why is your interpretation of the prior better than mine?
    - Entsophy on November 26, 2013 8:02 AM at 8:02 am said:
      
      Let’s suppose the true value is where the data says it. Take theta_true = 1. Then consider the following statements:
      
      A: “theta_true is in [-100,100]”
      
      or
      
      B: “theta_true is less than -5 or greater than 5”
      
      Statement A is true while B is false. That’s what makes my prior better than yours.
    - Entsophy on November 26, 2013 10:49 AM at 10:49 am said:
      
      Annan, in case I was misinterpreting your comment, the high probability manifold is:
      
      W_beta = {x | P(x) > beta}
      
      choose beta so W_beta contains almost all the mass. Say 99% of it.
- Daniel Lakeland on November 21, 2013 5:39 PM at 5:39 pm said:
  
  I’m going to take a different approach to interpreting Andrew’s example #4. I think in the type of problem Andrew works with, first of all the Likelihood D ~ N(theta,1) is itself a very approximate idea of what we know about real data. I mean, where does that fixed 1 come from in the variance? Do we really know the variance is exactly 1 but have almost NO idea what theta is? Come on. In situations like that, you almost always will have some knowledge about theta. And furthermore, in the type of models Andrew usually works with, theta is generally an “effect” type parameter, measuring how much something generally changes when some other data value changes. Whether it’s a causal parameter or not, zero is an obvious place to put a lot of these effect type parameters, since it’s easy to imagine that there are lots of relationships that have zero real effect. For example, the effect size of brand of coffee you drank this morning before coming to perform a psychology experiment about ESP… it’s just often the case that we have a strong bias towards having 0 for effects.
  
  So, Andrew’s more or less saying that in that context, we should often be using prior information to constrain things to be not so subject to noisy random small data sets.
  - Andrew on November 21, 2013 5:47 PM at 5:47 pm said:
    
    Dan:
    
    Yes. To put it another way, the very fact that zero is being used as a comparison point (with statements such as, “the estimate is only 1 standard error from 0”) typically implies a prior distribution in which zero plays a prominent role.
    - Daniel Lakeland on November 21, 2013 6:20 PM at 6:20 pm said:
      
      Right. Whereas in examples typical of Joseph’s interests, there’s not necessarily such a relevant “break point” for example like the inner diameter of a certain pipe coming out of a machine. It’s a positive number, it should be around 0.500 inches because that’s the nominal specification, but it can vary a fair amount, maybe 0.01 inches depending on the temperature of the machinery and the wear that it has undergone, in those kinds of situations if you use N(theta,0.01) and the data is 0.510 you aren’t going to say “geez there’s no way the mean diameter has an 85% chance that it’s bigger than 0.5” or something like that, because there isn’t hidden implicit prior data.
      
      In other words, your complaint is relevant in a certain context where everyone already knows implicitly that there’s a special value 0 and your priors really should be taking this into account.
    - Daniel Lakeland on November 21, 2013 6:27 PM at 6:27 pm said:
      
      Also typically the parameter estimated with a reference value of 0 has a qualitative difference when it’s below or above the breakpoint. It’s typically some kind of multiplier, so positive values imply one direction and negative values imply another. This sort of type S error can lead to wrong thinking about unknown but hypothesized mechanisms, which then lead to wrong thinking about the structure of the next more complicated model you fit. There’s a kind of risk aversion to interpreting the signs of multiplicative parameters too strongly in the absence of real data, because it can lead you down the wrong path with further models.
    - bxg on November 22, 2013 10:10 PM at 10:10 pm said:
      
      > 4. Finally, the simplest example yet, and my new favorite: we assign a flat noninformative prior to a continuous parameter theta. We now observe data, y ~ N(theta,1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 84% that theta>0. I don’t believe that 84%. I think (in general) that it is too high.
      
      Suppose I, in your presence, choose an independent uniform value t from [-100, 100] – we agreed on this (it’s effectively fixed) and – then – observed y = t+1. Would you then feel that 84% posterior that theta > t is “too high”?
      
      Because (and especially if you say no) it sounds as though you want to second guess someone’s prior on the basis of what subsequent questions they ask about the posterior. In practice, fair enough. In theory and philosophy, what a rathole.
    - Andrew on November 23, 2013 5:25 AM at 5:25 am said:
      
      Bxg:
      
      Of course it depends on the context. Depending on the scaling of the problem, an effect of 100 could make sense. I try to scale things so that effects are of order of magnitude 1. For example, in logistic regression you’re not going to see an effect of 100, similarly in econ you’re not going to see an elasticity of 100 if you’re working on the log-log scale.
      
      With regard to your last point, I wouldn’t frame this as “second-guessing someone’s prior.” A better way to put it would be that people use conventional models that include much less information than is actually known. Such conventional models include linear regressions etc. as well as uniform prior distributions. If data are strong, you can often do just fine with conventional models. But if data are sparse, it can often make sense to go back and add some real information to your model, in order to better answer your scientific questions.
      
      To put it another way, an analysis based on a conventional model can (sometimes) tell you what’s in the data. But scientific reports typically don’t just report information in data, they also make general claims about the world, and for that it can be a terrible mistake to ignore strong information that is already known.
John on November 21, 2013 10:03 AM at 10:03 am said:

The first link doesn’t seem to work. Should it be this?

http://xianblog.wordpress.com/2013/11/18/approximation-of-improper-by-vague-priors/
Andreas Baumann on November 21, 2013 10:46 AM at 10:46 am said:

Is the 8-schools data set available somewhere? I keep hearing of it and I’ve never actually seen it!
- Michael Betancourt on November 21, 2013 10:53 AM at 10:53 am said:
  
  You can find the actual numbers in the Stan repo, https://github.com/stan-dev/stan/tree/develop/src/models/misc/eight_schools, although there’s no accompanying documentation with further information about those values.
  - Andreas Baumann on November 21, 2013 1:16 PM at 1:16 pm said:
    
    Thank you.
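For readers without the repository at hand, the sketch below uses the eight estimates and standard errors as they are commonly reproduced from BDA; they are worth checking against the files linked above. It illustrates the point made in example 2 of the post: with theta_j ~ N(mu, tau^2), a very large tau leaves the raw estimates untouched, while a small tau pulls them together. Treating mu and tau as known is a simplification made here for illustration (the full analysis in BDA averages over them), and the function name is made up.

```python
# Estimated treatment effects and standard errors for the eight schools,
# as commonly reproduced from BDA (verify against the linked Stan files).
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]


def partial_pooling(y, sigma, mu, tau):
    """Posterior means of theta_j when y_j ~ N(theta_j, sigma_j^2) and
    theta_j ~ N(mu, tau^2), with mu and tau treated as known."""
    estimates = []
    for y_j, sigma_j in zip(y, sigma):
        w_data = 1 / sigma_j**2   # precision of the school's own estimate
        w_prior = 1 / tau**2      # precision of the population distribution
        estimates.append((w_data * y_j + w_prior * mu) / (w_data + w_prior))
    return estimates


# Precision-weighted average of the eight estimates (complete pooling).
weights = [1 / s**2 for s in sigma]
mu_hat = sum(w * y_j for w, y_j in zip(weights, y)) / sum(weights)
print(f"pooled mean: {mu_hat:.2f}")

for tau in (1, 10, 1000):
    est = partial_pooling(y, sigma, mu_hat, tau)
    print(f"tau = {tau:>4}: " + ", ".join(f"{e:5.1f}" for e in est))
```

At tau = 1000 the output is essentially the raw data, including the estimate of 28 for the first school, which is the implausible no-pooling answer that a flat prior on the coefficients implies.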
Rahul on November 21, 2013 1:06 PM at 1:06 pm said:

What are some good examples of studies where “data are sparse and prior information is strong”?
- Andrew on November 21, 2013 2:02 PM at 2:02 pm said:
  
  Rahul:
  
  You ask, “What are some good examples of studies where data are sparse and prior information is strong?”
  
  There are lots and lots of examples. You get sparse data whenever you want to estimate something in small subsets of the population. Consider our estimates using MRP of voting by ethnicity, income, and state, or our similar grids of maps for attitudes on health care and school vouchers. In all these cases the sample size in most individual cells is small, hence the need for models. Another sort of example is where a parameter can be pretty well bounded ahead of time, so that data with moderate sample size has much less information than prior information. An example is item 3 in the above post.
  
  It’s actually a little scary to me that after all my posts on the topic, I still haven’t got this point across clearly. But I think it also has to do with conventional statistical education (including my own textbooks!) which focus on examples with strong data and weak priors.
  - Rahul on November 21, 2013 2:13 PM at 2:13 pm said:
    
    Andrew:
    
    I can sure see how the data is sparse in say your study of attitudes on health care and school vouchers.
    
    The part I’m not sure about is the prior information. Is it strong here? If so, what is the strong prior?
    - Andrew on November 21, 2013 2:25 PM at 2:25 pm said:
      
      If you have a cell with n=2, just about any prior info is strong by comparison.
    - Rahul on November 21, 2013 2:46 PM at 2:46 pm said:
      
      I just find the strong prior condition quite rare in practical situations I deal with. If the prior is really strong a study is rarely commissioned. Ok, to validate existing knowledge or an unconventional hypothesis that challenges established wisdom does occur but still not as frequently as the situation where there really is no settled prior.
      
      It is these situations where I’ve the most trouble applying Bayesian reasoning. Since it’s simply so hard to agree upon a consensus on what informative prior to use a priori.
    - Andrew on November 21, 2013 3:10 PM at 3:10 pm said:
      
      It’s always hard to get a consensus on what method to use. Different analyses will yield different results. This happens Bayesian or otherwise. I don’t want to use flat priors because they give me bad answers that make no sense. But there is a large class of models that make a little bit of sense, and ultimately we have to just do our best and check the models that we do fit. One cool thing is that, in many cases, the really bad models can be tossed out because they don’t fit important aspects of the data.
- Ben Bolker on November 21, 2013 3:23 PM at 3:23 pm said:
  
  There are some nice examples in M. McCarthy’s book on “Bayesian Methods for Ecology” ( http://www.amazon.com/Bayesian-Methods-Ecology-Michael-McCarthy/dp/0521615593/ref=sr_1_1?ie=UTF8&qid=1385065349&sr=8-1 ) ; the one I remember is an example where we have observed the lifetimes of 3 (!) eagle owls, and combine that with prior information on the relationship between body mass and longevity in other birds of prey.
  - Rahul on November 21, 2013 11:29 PM at 11:29 pm said:
    
    That makes sense. I just feel those are pretty niche situations.
    
    e.g. If it were human babies a n=3 study of this nature is probably silly.
- R McElreath on November 21, 2013 4:09 PM at 4:09 pm said:
  
  Examples are routine in my field, evolutionary anthropology. A common problem is radio carbon (and other types of) dating. Typically we get a posterior density for radio carbon date. But we also know a lot of other things, like for example that all the dates from the same stratigraphic layer must fall within the same layer. We use strong joint priors on the dates to update the posterior density of each date, with really nice inferential results. See for example Figure 1 in http://www.pnas.org/content/108/21/8611.full
  
  Basically, we use strong priors to combine different types of data about the same thing. And we never have the data we wish to have, just the data that we happen to have. So even a little bit of information in a prior can help a lot. The thing to note about these examples is that the prior is “strong” only in particular regions. It mainly serves to jointly truncate the radio carbon estimates.
  - Rahul on November 21, 2013 11:39 PM at 11:39 pm said:
    
    I love all these examples.
    
    What I think is these are great cases where rich data exists to construct a good data-based prior. The applications of Bayesian reasoning that make me uncomfortable are ones in which researchers pull a fairly subjective prior out of a hat and multiple researchers do not even show much consensus as to what prior is the right prior. I feel that is a can of worms.
    
    I suspect Andrew’s examples of voting, school vouchers, ethnicity etc. are in that category. I may be wrong. Perhaps there are obvious, uncontroversial priors there?
    - Andrew on November 22, 2013 5:25 AM at 5:25 am said:
      
      Rahul:
      
      I recommend that you (and others who think my models are “a can of worms”) read my recent AJPS paper with Yair and my forthcoming JRSS paper with Kenny (for details on two particular cases) and BDA (for more general principles).
    - Rahul on November 22, 2013 7:00 AM at 7:00 am said:
      
      Andrew:
      
      Thanks! I will read those.
      
      PS. To clarify, I don’t think your models specifically are problematic; my concern was about using informative priors in subject areas where the priors are not strongly data-linked and hence where large flexibility & disagreement exists in the particular choice of priors.
- Robin Morris on November 21, 2013 6:26 PM at 6:26 pm said:
  
  I had a case where I had only a few direct observations of what I wanted to measure, but a lot of data of something that could be considered a proxy for what I wanted to measure. So I constructed an informative prior from the proxy data, and combined that with a likelihood from the sparse observational data.
  
  This was an e-commerce application; unfortunately I can’t go into the details.
  - Rahul on November 21, 2013 11:31 PM at 11:31 pm said:
    
    I’m totally in agreement here. I think the crucial difference is the availability of abundant proxy data, enough to construct a credible prior.
konrad on November 21, 2013 3:37 PM at 3:37 pm said:

I’m also confused by example 4. You described an observation that is much more likely when theta>0 than when theta<0. What do you mean by "This is of course completely consistent with being pure noise"?
- Andrew on November 21, 2013 3:45 PM at 3:45 pm said:
  
  If theta=0 (i.e., pure noise), there’s no surprise at all if the estimate is one standard error away from 0. Such a result is completely consistent with noise.
  - Entsophy on November 21, 2013 5:05 PM at 5:05 pm said:
    
    The 95% credibility interval estimate for theta you get from that posterior (with highly diffuse prior) is going to be something like [-1,3] which includes zero.
    
    So you think the data’s consistent with theta=0 and the posterior thinks the evidence is consistent with theta=0. However, the evidence is also consistent with other values of theta, which the posterior also takes into consideration. It’s a mystery why you think that’s a bad thing.
    - Andrew on November 21, 2013 5:27 PM at 5:27 pm said:
      
      Joseph:
      
      I just don’t believe that P(theta>0|y)=0.84. To put it another way, I don’t think that, if the study were repeated with a huge sample size, there’s an 84% chance the result would go in the same direction. 5:1 odds seem too strong to me, for a pattern that could easily have occurred by chance.
    - Entsophy on November 21, 2013 11:21 PM at 11:21 pm said:
      
      Well it’s perfectly fine if you interpret and use it correctly. It’s saying something like 84% of possible values compatible with the evidence are greater than zero. If someone misinterprets this and does something stupid with that info then that’s on them, not the prior.
    - K? O'Rourke on November 22, 2013 7:55 AM at 7:55 am said:
      
      > If someone misinterprets
      I think that is the issue here, how does one _interpret_ posterior probabilities?
      
      Obviously in the context of the appraised credibility of _the_ prior(s) and data model(s) used.
      
      But Andrew seems to be pointing to the frailty of noisy data, even for thought experiment true models, perhaps in a Rubinesque repeated use relevant way? (1984)
      
      (Perhaps something to work on over the weekend.)
    - Entsophy on November 22, 2013 10:36 AM at 10:36 am said:
      
      Everyone look, if the diffuse prior leads to a posterior which says:
      
      A: “theta’s in [-1,3]”
      
      while a more informative prior says:
      
      B: “theta’s in [-.1,.1]”
      
      then if theta=0 both statements are correct. The latter is simply more informative. Since B implies A it’s not possible to say the former’s wrong while the latter is right. Why is this so hard for people to understand? I really don’t get it.
    - Anonymous on November 23, 2013 11:11 PM at 11:11 pm said:
      
      @entsophy I suspect andrew might be assuming that calibration is a desirable property of a bayesian model. You might not agree with that goal, but I think that’s where the notion of a correct/incorrect prior is coming from.
    - Entsophy on November 24, 2013 5:18 AM at 5:18 am said:
      
      Then the title of the post should be “The Hidden Dangers of Calibration”.
    - Martyn on November 22, 2013 9:56 AM at 9:56 am said:
      
      Number 4 is a nice example, but quite subtle. If you look at the non-informative prior as the limit of a sequence of increasingly diffuse normal priors centred on zero, then it puts too much prior weight on theta being very far away from zero, to the extent that the slightest evidence of positivity is over-interpreted (likewise negativity).
      
      To be devil’s advocate, you might say that this prior is too informative because it assumes the variance is known, which is never true in practice. Any prior on the variance would alleviate this problem. If one used a Jeffreys prior on the mean and variance then the posterior would still be improper after one observation.
      
      Or to be more even-handed, you could say that strong assumptions in one part of the model can bleed into so-called non-informative priors for other parameters, rendering them highly informative.
    - Andrew on November 22, 2013 2:51 PM at 2:51 pm said:
      
      Yup.
  - konrad on November 21, 2013 9:54 PM at 9:54 pm said:
    
    Andrew: I’m trying to follow your reasoning here so your appeal to personal incredulity is rather unhelpful. It seems to me that 5:1 odds is a fine description for a pattern that easily occurs by chance – any poker player who is willing to draw towards a straight will back me up on this.
    
    I’m unclear on why you are labeling theta=0 as “pure noise”, but it suggests that you have some concrete examples in your head. Is it perhaps the case that they are of the type where you have strong reason to expect that theta is close to zero (e.g. theta represents some effect that you expect may well be negligible)? Would you make the same claim if theta were, say, a temperature reading?
    - Andrew on November 23, 2013 5:20 AM at 5:20 am said:
      
      Konrad:
      
      I’m referring to theta=0 as “pure noise” in the sense that, in this simple example, we can write the model as y = theta + epsilon, where epsilon is an independent error term. Here, theta is the signal and epsilon is the noise. If theta=0, that’s pure noise. I have no deeper meaning than that.
  - Phil on November 22, 2013 1:48 PM at 1:48 pm said:
    
    Like a few other people, I’m confused by your example. The math seems clear enough, and we can check it by simulation (I did this in R):
    thetasim = runif(n=100000,min=-100,max=100) # instead of an infinite distribution, I’ll use uniform [-100,100]
    ysim = rnorm(n=100000,mean=thetasim,sd=1)
    
    Now look at all of the theta for which ysim was near 1; what fraction of these are from theta > 0?
    yes1 = round(ysim) == 1
    sum(thetasim[yes1] > 0)/sum(yes1)
    
    For a particular set of random draws (the first and only one I’ve done), I got 0.82. If the theta parameter could be anywhere from -100 to 100, and you draw from y ~ N(theta,1), it really is a very good bet that theta > 0.
    
    You obviously know this, so…I guess I don’t get the point of that example, which you say is your new favorite! Perhaps you’re saying that in most real-world circumstances that people use infinite uninformative priors, if they actually see a number that is near zero — anything with an absolute value below 10, maybe below 100 or 1000 — then they should reconsider their prior, because if the parameter value really could be “anything at all” then why is it so small, there’s probably a reason that we could figure out if we tried. Or something like that?
    - Phil on November 22, 2013 7:26 PM at 7:26 pm said:
      
      Oops, I meant “If the theta parameter could be anywhere from -100 to 100, and you draw from y ~ N(theta,1) and get a 1, it really is a very good bet that theta > 0.”
  - James Annan on November 22, 2013 10:45 PM at 10:45 pm said:
    
    The “problem” is of course that your “ignorant” prior assigns huge probability to theta being miles away from zero. Now if you’re talking about the obs being consistent with noise or not, you presumably thought there was a nontrivial probability of a zero (or at least v small) theta after all.
    
    I’m not disagreeing with your example, of course. The issue (as I see it) is the assumption that a uniform (or indeed any other) prior can represent “ignorance”.
xi'an on November 21, 2013 4:32 PM at 4:32 pm said:

Sorry, my post was originally planned for next Monday: it is now online, but with a different link:

http://xianblog.wordpress.com/2013/11/21/hidden-dangers-of-noninformative-priors/
James Annan on November 21, 2013 8:49 PM at 8:49 pm said:

The biggest danger, albeit it might not qualify as hidden, is that people (certainly including scientists, and perhaps even some statisticians) think that “noninformative” in this sense might actually have the same (or at least similar) meaning to that which it takes in common English usage.
Christos Argyropoulos on November 22, 2013 6:26 AM at 6:26 am said:

Cross posting my personal lesson from an experience (described in Christian’s blog: http://xianblog.wordpress.com/2013/11/21/hidden-dangers-of-noninformative-priors/)

Bottom line: non-informativeness is in the eyes of the beholder. If there is a formulation of your problem that you are comfortable reasoning about, choose priors that best correspond to your state of knowledge (or ignorance) in that formulation/parameterization. But don’t expect these non-informative priors of yours to map to non-informative “folklore” priors in a different parameterization.
- Andrew on November 23, 2013 5:18 AM at 5:18 am said:
  
  Christos:
  
  Yes, that’s my point. A conventional or purportedly noninformative model can be a useful starting point but we have to be ready to move on if it gives implausible inferences.
stringph on November 24, 2013 4:46 PM at 4:46 pm said:

Actually, I don’t see why anything is wrong with the 84% number.

Suppose you wanted to measure the temperature at some time and place and had no prior information except that the place was in Canada. Your observation is 1 deg above zero with 1 deg standard uncertainty; how certain are you that it’s really above zero? I would be fairly certain; the 5-to-1-ish odds ratio seems reasonable to me.
- Anonymous on November 24, 2013 6:26 PM at 6:26 pm said:
  
  The information to assess whether 5-to-1 is correct or not isn’t given, but given the context in which non-informative priors are used, it’s probably going to be wrong.
  
  The problem comes back to this issue of mixing definitions of probability. The non-informative prior was chosen to reflect a state of knowledge, so we can’t suddenly change the interpretation of the posterior probability as a frequency. For the posterior interpretation to work as a frequency interpretation, the prior has to be calibrated to reflect the base rates of theta values.
  - stringph on November 24, 2013 7:17 PM at 7:17 pm said:
    
    Oh, given some contexts for typical use of non-informative priors which … I don’t have?
    
    Would you say that my example of measuring the temperature at a mystery location in Canada is one of these typical cases?
    
    If not, what would be a typical case where the inference is wrong?
    
    And isn’t it deeply ironic that the example can, apparently, only be understood if the reader already possesses a lot of contextual information which is not given in the post?
- Andrew on November 25, 2013 12:34 AM at 12:34 am said:
  
  Stringph:
  
  I agree that there are settings where an (approximately) uniform prior distribution makes sense. These settings are those in which the data are much stronger than the prior. Your example of a precise temperature measurement and very weak prior distribution (merely the statement that a measurement was performed in a particular country) is one such example. Most of the published studies I’ve seen that have featured statistically significant p-values do not look like this. To put it another way, if a result such as p<0.05 is considered newsworthy, this already implies (in some sense) a strong prior centered around zero, so that it is considered something of a surprise for the measurement to be far from zero. But, yes, in regard to your comment, there are definitely settings where inferences from the noninformative prior are reasonable. In my post, I was focusing (implicitly) on the more controversial settings.
  - stringph on November 27, 2013 4:57 PM at 4:57 pm said:
    
    Hmm… this seems to require me to perform a complicated piece of induction on the blog post.
    
    Since you are talking about the statistical interpretation of performing a given measurement, and one could place that interpretation in a number of different (and unstated) contexts – and the interpretation would be correct and uncontroversial in *some* contexts – then, because you wouldn’t ever write about something that *was* correct and uncontroversial, I have to imagine another context under which an interpretation using an uninformative prior would be dangerous and/or controversial.
    
    The sentence “Most of the published studies I’ve seen that have featured statistically significant p-values do not look like this” gives the game away: despite appearances, this isn’t a post about measurement errors, it’s another post in the ongoing series about p-values, or at least about mistakes made by people who habitually use p-values, or who would do so if they could get away with it any more. And the context is measuring alleged effects for which the default prior — one might almost say null hypothesis — is that their value is in fact very close to zero?
    
    OK, so example 3 was of this sort; but example 4 doesn’t look like mutant-frequentism to me, on the face of it it’s a type S problem.
    - Andrew on November 28, 2013 5:21 AM at 5:21 am said:
      
      No, the above is not a post about p-values, nor is it a post about measurement errors. It is a post about Bayesian inference. I like Bayesian inference a lot—I wrote two books about it!—but certain natural-seeming models can yield posterior probabilities that don’t make sense (in some settings). That’s the subject of this post.
Pingback: Entsophy
Pingback: There is Always Prior Information | Elements of Evolutionary Anthropology
a reader on August 22, 2017 12:55 PM at 12:55 pm said:

I’m really surprised about your example (4).

The idea that you don’t believe that theta is greater than 0 because “it is consistent with noise” really seems to be the “failing to reject the null implies the null” fallacy.

Either you really doubt the prior (okay, that’s the point of this post…) BUT in a way that puts a spike at 0 since I assume you would say the same thing if we observed y = -1 (which I don’t think you like to do given your other posts) OR you really doubt the N(theta, 1) distribution of the data (valid doubt, but not about the prior!).

I get that it’s evidence that maybe the flat prior is saying something…except I would guess an informative prior in this problem would be something like N(0, 10) which would result in a nearly identical answer!
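
The claim that a N(0, 10) prior gives nearly the same answer as the flat prior can be checked with the standard conjugate-normal update. The sketch below is only an illustration: the function names are hypothetical, and since the comment does not say whether the 10 is a standard deviation or a variance, both readings are computed. The model assumed is y ~ N(theta, 1) with y = 1 observed and theta ~ N(0, prior_sd^2).

```python
from math import erf, inf, sqrt


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_theta_positive(y, prior_sd, sigma=1.0):
    """P(theta > 0 | y) for y ~ N(theta, sigma^2), theta ~ N(0, prior_sd^2).

    prior_sd = inf is treated as the flat-prior limit, where the
    posterior is simply N(y, sigma^2).
    """
    if prior_sd == inf:
        post_mean, post_sd = y, sigma
    else:
        # Conjugate update: the posterior mean is shrunk toward 0 and the
        # posterior variance is reduced by the same shrinkage factor.
        shrink = prior_sd**2 / (prior_sd**2 + sigma**2)
        post_mean = shrink * y
        post_sd = sqrt(shrink) * sigma
    return normal_cdf(post_mean / post_sd)


if __name__ == "__main__":
    for label, sd in [("flat", inf), ("sd 10", 10.0),
                      ("variance 10", sqrt(10.0)), ("sd 1", 1.0)]:
        p = prob_theta_positive(1.0, sd)
        print(f"{label:>12}: P(theta > 0 | y = 1) = {p:.3f}, "
              f"odds = {p / (1 - p):.1f}")
```

Under either reading of N(0, 10) the posterior probability stays within about a percentage point of the flat-prior 84% (odds of roughly 5 to 1). It only moves noticeably when the prior scale is comparable to the measurement error, for example about 76% with a prior standard deviation of 1.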
Pingback: Bayesian inference completely solves the multiple comparisons problem - Statistical Modeling, Causal Inference, and Social Science