When does Bayes do the job?

E. J. writes:

I’m writing a paper where I discuss one of the advantages of Bayesian inference, namely that it scales up to complex problems where maximum likelihood would simply be infeasible or unattractive. I have an example where 2000 parameters are estimated in a nonlinear hierarchical model; MLE would not fare well in this case.

I recall that you have also stressed this issue, and I’d like to acknowledge that. Do you have pointers to a few of your papers where you explicitly mention this? Ideally I would just take a quotation.

I responded:

Bayes will do this but only with informative priors. With noninformative priors, the Bayes answer can sometimes be worse than maximum likelihood; see section 3 of this 1996 paper which I absolutely love.

Then there’s this paper about why, with hierarchical Bayes, we don’t need to worry about multiple comparisons.

Here’s a quote from that paper:

Researchers from nearly every social and physical science discipline have found themselves in the position of simultaneously evaluating many questions, testing many hypotheses, or comparing many point estimates. . . . we believe that the problem is not multiple testing but rather insufficient modeling of the relationship between the corresponding parameters of the model. Once we work within a Bayesian multilevel modeling framework and model these phenomena appropriately, we are actually able to get more reliable point estimates. A multilevel model shifts point estimates and their corresponding intervals toward each other (by a process often referred to as “shrinkage” or “partial pooling”), whereas classical procedures typically keep the point estimates stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p values corresponding to intervals of fixed width). In this way, multilevel estimates make comparisons appropriately more conservative, in the sense that intervals for comparisons are more likely to include zero. As a result we can say with confidence that those comparisons made with multilevel estimates are more likely to be valid. At the same time this “adjustment” does not sap our power to detect true differences as many traditional methods do.

That’s a bit long but maybe you can take what you need!

You also might enjoy this paper with Aleks on whether Bayes is radical, liberal, or conservative.

48 thoughts on “When does Bayes do the job?”

  1. I basically always use informative priors. I typically do this in two steps. The first is that I build my model in a dimensionless form in which the parameters represent the ratios of two numbers. Typically I choose a denominator that provides an appropriate “inherent scale”. So for example, if you are talking about household income, I’d estimate it in terms of a fraction of GDP/capita/yr; if you are talking about the height of a human, I’d measure it as a fraction of a “tall person”, which I might take to be just a fixed 2m, or I might look up some average height in a big dataset from a human health study or something. If you’re talking about the mass of a dinosaur bone, I might measure it as a fraction of the mass of the same bone in an elephant, or in a very common dinosaur where hundreds of those bones are available… etc.

    The first thing this does is make most measurements come out on the order of magnitude of 1, by appropriate choice of the reference. It’s pretty rare that you can’t do this. Even for weird, uncommon measurements, like say the diffusivity of a protein through cartilage, you can probably bound it by, say, the diffusivity of ink in water. Even something like the increment in probability of having girl children for beautiful parents you can compare to the increment in probability of girls during famines, which has been studied…

    The next thing I’d do is try to figure out something like an “upper bound” on how big something should be or how much it could vary. Can a dinosaur bone be 10 times as massive as the same bone in an elephant? Probably. Can it be a million times as massive? Definitely not. Can household income exceed a million times GDP/capita? Pretty much no; even for the Bill Gates types, a million times GDP/capita is something like 50 billion dollars a year.

    If we’re talking about households reached in a telephone survey for example, then even a factor of 10 vs GDP/capita is quite unusual, and a factor of 50 would be very rare. So, I think a prior on income as a fraction of GDP/capita like:

    income ~ exponential(1/10)

    is somewhat informative while still being conservative: it allows for some people to have 30 to 50 times GDP/capita, while implying that the most likely value is actually down near 0.

    I think this is the kind of thinking people should be using to assign priors. THINK about your SCIENCE and *use* real information, from whatever source it comes.

    There is a big difference in informativeness between priors on income/gdp given as:

    income ~ exponential(1/10)

    and the truly “uninformative”

    income ~ exponential(1/1e6)

    But the first one is STILL fairly conservative.
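
    To put numbers on that difference, here’s a minimal Python sketch (using scipy; the 50x-GDP/capita cutoff is just the figure from above) comparing how much mass each prior puts on absurdly high incomes:

    from scipy import stats

    # Income measured as a multiple of GDP/capita.
    # Stan's exponential(rate) has mean 1/rate, so exponential(1/10) has mean 10.
    weakly_informative = stats.expon(scale=10)   # income ~ exponential(1/10)
    uninformative = stats.expon(scale=1e6)       # income ~ exponential(1/1e6)

    # Probability of an income above 50x GDP/capita:
    print(weakly_informative.sf(50))  # ~0.0067: rare, but allowed
    print(uninformative.sf(50))       # ~0.99995: nearly all the mass is in absurd territory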

    • I really like the “common reference”, never thought about that, cool!

      I find it easy to think about upper and lower bounds, and also about the “most likely case”. However, my gut reaction when thinking about upper and lower bounds is to put bounded uniform priors on everything, or perhaps triangular distributions if I also include a “most likely case”. A more reasonable prior is likely something smoother and hump-shaped, but I have no feeling for (how fast the probability should go to zero | how the tails of the distribution should look). I can of course set upper and lower bounds and then just use a Normal where most of the mass is in this interval. Or perhaps a Cauchy? The tails of a Normal and a Cauchy are pretty different…

      • Turns out a Gaussian is just a uniform distribution where you don’t know exactly how wide it should be:

        http://models.street-artists.org/2013/08/15/what-is-a-gaussian/

        Of course this is true of other symmetric distributions as well, where the prior on the width is something other than a chi distribution. There’s something to be said for using a t-distribution, which sort of interpolates between a Cauchy (dof = 1) and a Gaussian (dof = infinity), for parameters where you aren’t really sure how fat the tails should be.

    • I’ve seen Daniel’s unitless/dimensionless analysis suggestions in the past and I like the motivation. It’s very intuitive and reminds me of all the equation balancing from chem 101.

      However, I’ve been confused, say, in a regression context. Is the prescription as simple as standardizing an input covariate with regard to an interpretable reference? You then assign the prior to this standardized covariate’s parameter?

      • I’d say that’s the first step. The simplest thing is to pick a fixed reference scale and then divide your equation by that constant scale. This just makes the numbers “nice” and by itself can be very useful.

        The next step beyond that, though, is to look for scaling self-similarity within the structure of the model. In particular, if your model is expressed in terms of N parameters, and there are K freely choosable units for the dimensions in your model, then your model can be re-written in terms of only N-K unknown parameters (this is essentially the Buckingham Pi theorem).

        To get an idea of how this works take a look at these posts for example:

        http://models.street-artists.org/?s=swimming

        I’ll write a blog post on using dimensionless analysis in regression and link here in a few days.
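
        In the meantime, here’s a minimal sketch in Python of just that first step, dividing through by a fixed reference scale before fitting a regression (the data and the reference numbers are made up for illustration):

        import numpy as np

        # Made-up example: household income vs. years of schooling.
        gdp_per_capita = 65_000.0  # fixed reference scale, dollars/yr (illustrative)
        income = np.array([40e3, 55e3, 72e3, 90e3, 130e3])  # dollars/yr
        school = np.array([10.0, 12.0, 14.0, 16.0, 18.0])   # years

        # Divide each quantity by a natural scale so everything is O(1):
        y = income / gdp_per_capita  # income as a fraction of GDP/capita
        x = school / 12.0            # schooling relative to a high-school diploma

        slope, intercept = np.polyfit(x, y, 1)
        print(slope, intercept)  # dimensionless coefficients, O(1) by construction

        Now a prior like slope ~ normal(0, 1) is meaningful on its face: it says that a diploma’s worth of extra schooling shifts income by around one GDP/capita or less.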

        • It reminds me of a lot of engineering practice, e.g., most empirical correlations for fluid flow, drag, pumping power, etc. are given in terms of dimensionless groups.

          Even in shop floor practice it is very common for control loops to display in terms of percent of max parameters.

    • > The first is I build my model in a dimensionless form in which the parameters represent the ratios of two numbers. Typically I choose a denominator so that it provides an appropriate “inherent scale”.

      +1. I do the same thing. I never want to have to worry about making a mistake because I got units crossed up. Scaling at the outset pretty much eliminates the risk of doing so.

  2. When I’ve been asked why one would prefer Bayes to other methods in practical terms, my answer has been that no other approach handles nuisance parameters as well as Bayes does.

    (I also love that 1996 paper, especially the analysis of the logical consistency of ARMA models as discretizations of an underlying continuous process. Elegant!)

    • Others disagree:

      “Ensuring error statistical calculations free of a nuisance parameter is essential for attaining objectivity: the resulting inferences are not threatened by unknowns. This important desideratum is typically overlooked in foundational discussions and yet the error statistical way of satisfying it goes a long way toward answering the common charge that:

      (#9) Specifying statistical tests is too arbitrary.

      In a wide class of problems, the error statistician attains freedom from a nuisance parameter by conditioning on a sufficient statistic for it; see [Cox and Hinkley, 1974], leading to a uniquely appropriate test. This ingenious way of dealing with nuisance parameters stands in contrast with Bayesian accounts that require prior probability distributions for each unknown quantity.”

      Pg 181 http://www.phil.vt.edu/dmayo/personal_website/Error_Statistics_2011.pdf

      • Yeah, I’ve gone ’round with Mayo on the nuisance parameter thing a time or two; she’s never responded to my queries with anything as clear as that passage. As to substance, well, the thing about priors is they can always be made to exist; sufficient statistics, not so much.

        • What do you mean that sufficient statistics can’t be made to exist? Is it that distributions over data don’t usually correspond to clean distributions with sufficient statistics, or something else?

        • I mean that there are models/data distributions that arise commonly in practice for which no sufficient statistic for the nuisance parameter(s) exists.

        • I believe the author of “The prize for presto chango & voodoo statistics goes to Bayesians (also 1 for salesmanship). Nothing known? Assume a prior anyway!” believes priors are absolutely meaningless except when they’re frequency distributions. The author has explicitly denied in the strongest terms that distributions can represent uncertainty about a fixed parameter, so Daniel’s comment above must read like the writings of a madman to them.

          It’s interesting, though, that if Bayes’ theorem automatically detects and uses sufficient statistics when they exist, it’s “voodoo statistics”, while if R. A. Fisher ad hoc guesses the same sufficient statistic and uses it in effectively the same mathematical way, it’s “ingenious”.

        • I don’t Tweet. (Nor do I Facebook.) That stated, the existence of that Twitter thread cracks me up. The details of the exchange? I couldn’t care less. That it exists is really funny.

          As for Daniel’s “Turns out a Gaussian is just a uniform distribution where you don’t know exactly how wide it should be…”, I’m still reviewing to confirm that I get it but I think I like it a lot. My gut reaction is that it’s a manifestation of the Central Limit Theorem. Yes? No?

        • No, it’s more like building a function by adding up a lot of rectangles. A Riemann integral builds a function by piling up “vertical” rectangles and a Lebesgue integral builds a function by piling up “horizontal” rectangles. In a similar (but less general) fashion, a symmetric unimodal distribution can be built by averaging rectangles centered at the mode with varying widths, each normalized to have area one.

        • Corey is correct. The idea is basically that the probability of being x far out from 0 in a unit normal is the same as the probability of being x far out from zero in a uniform distribution on [-a, a], where a is a random number greater than x with a certain distribution. What distribution over a is required to make the probability equal to the Gaussian probability? Corey’s comment on that blog post gives the general case for all symmetric unimodal distributions.

        • Though I should say that the Gaussian version is probably more efficient in Stan, and if your data are more than one sample, you’ll need a separate a[i] for each sample, since they’re independent. The point is basically that choosing a Gaussian is “morally equivalent” to a bounded interval where you don’t quite know what the bounds should be, but they’re around 2 or 3, which is the high-probability region for the chi(3).
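
          For anyone who wants to check that numerically, here’s a minimal Monte Carlo sketch in Python: draw a half-width a from a chi distribution with 3 degrees of freedom, then draw uniformly on [-a, a]; the result should be indistinguishable from a standard normal.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)
          n = 1_000_000

          # The half-width of the uniform is itself uncertain, with a chi(3) distribution...
          a = stats.chi(df=3).rvs(n, random_state=rng)
          # ...and conditional on the half-width, the draw is uniform on [-a, a].
          x = rng.uniform(-a, a)

          # If the claim holds, x is indistinguishable from N(0,1):
          print(stats.kstest(x, "norm"))  # KS statistic near 0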

        • @Daniel 1:04PM

          It seems like this simple model could be used as an illustration to counter criticism of Bayesian inference as utilizing too-strong assumptions through prior specification.

        • In that Twitter thread, Mayo: “So began the massive confusion: the search for uninformative priors (now finally given up). No way to represent it.”

          I probably just haven’t taken enough time to fully understand the argument, but I don’t find “Pierre Laplace”‘s response to this compelling. My perspective is not frequentist, but that of a Bayesian that accepts that probabilities are unavoidably “personal”.

          So to me, it seems a really strange exercise to try to find a so-called “uninformative prior”. I wouldn’t have any idea how to interpret a posterior based on such a prior.

          But more to the point, I don’t see how the blog post pointed to by “Pierre Laplace” refutes Mayo’s point.

        • ‘So to me, it seems a really strange exercise to try to find a so-called “uninformative prior”. I wouldn’t have any idea how to interpret a posterior based on such a prior.’

          The idea is that probability distributions are not “personal” but rather are conditional on (or encode) states of information. (In this approach it’s axiomatic that two people who have the same information ought to assign the same probabilities.) The point of the search for an uninformative prior is to provide the probability assignment for the null state of information; this serves as the empty product for Bayesian updates.

        • The uniform probability on all of the real line is not a standard probability distribution, and the delta function at 0 isn’t a standard probability distribution either, but the uniform probability on [-N, N] where N is a nonstandard integer is a perfectly reasonable nonstandard probability distribution, and the uniform distribution on [-1/N, 1/N] is a perfectly good nonstandard distribution as well.

          It’s the fact that standard function spaces are so poor that you have to invent “distribution” theory which basically makes me prefer nonstandard analysis.

          I would rarely use a truly “uninformative” prior, but I think the problem people have with them is that standard functions don’t exist which are uniform “everywhere”. The thing that needs to be remembered is that in statistics, there’s always a standard number N such that uniform on [-N, N] covers all the possibilities.

          For lengths, for example, the uniform distribution on a trillion times the diameter of the universe covers EVERY length that any human will ever need to consider. A Gaussian whose standard deviation is a trillion times the diameter of the universe is as uninformative as anyone will ever need to be.

        • > “In this approach it’s axiomatic that two people who have the same information ought to assign the same probabilities.”

          Yeah, that’s what I would struggle with.

        • JD,

          There has been a definition of the “amount of information in a distribution” around since at least Claude Shannon in the 1940s. The use of it is called “Information Theory”. Hundreds of thousands of papers and applications have been based on it.

          Information theory uses a functional F[P] to assign a number to each probability distribution P. P1 is less informative than P2 if F[P1] is greater than F[P2]. Intuitively, the more spread out P(x) is, the less information it contains about the true x. You can then consider supremums which are least informative over a set of distributions {P} (with respect to a measure on x). The supremums may not be in {P}, but they are at least limits of elements of {P}.

          So regardless of anyone’s philosophy, the fact is uninformative, or “least informative under given circumstances”, distributions exist.

          If you want to think of P(x) as “personal” sometimes, that’s fine as long as we agree there’s an implicit K there such that P(x) is really P(x|K). Presumably K’s in your head or whatever. In that case F[P] (which can be computed without specifying K explicitly) measures the amount of information in K about the true x.

          So given an expert’s “subjective” P(x), we can measure how much information they’re implicitly claiming to have about the true x.
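
          To make that concrete with Daniel’s income priors from further up: the differential entropy of an exponential distribution with mean m is 1 + ln(m), so a quick scipy check shows how much more information the tighter prior implicitly claims:

          from scipy import stats

          # Differential entropy (in nats); larger = more spread out = less informative.
          print(stats.expon(scale=10).entropy())   # ~3.3  for income ~ exponential(1/10)
          print(stats.expon(scale=1e6).entropy())  # ~14.8 for income ~ exponential(1/1e6)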

        • This is a little glib. For continuous distributions you need to decide on the appropriate non-informative prior before you can define and compute F[P]. Section 12.3 of PTLOS has the details. You and I have gone over this before without resolution, so let me cite chapter and verse for you:

          Except for a constant factor, the measure m(x) is also the prior distribution describing ‘complete ignorance’ of x. The ambiguity is, therefore, just the ancient one which has always plagued Bayesian statistics: how do we find the prior representing ‘complete ignorance’? Once this problem is solved, the maximum entropy principle will lead to a definite, parameter-independent method of setting up prior distributions based on any testable prior information.

          — E.T. Jaynes, Probability Theory: The Logic of Science, page 377

        • Long-winded blog comments aren’t glib. Plus I didn’t say anything to disagree with you (look back at it carefully).

          As I recall you didn’t understand what was going on with that m(x), so I’ll try to explain it a different way. The expression S = -∫ P(x) ln[P(x)/M(x)] dx is telling you roughly the log of the size of the high probability region of P(x) that overlaps, or is contained inside, the high probability region of M(x).

          If you want to consider all of the high probability region of P(x) over the entire domain of x, then just set M(x)=1.

        • P.S. m(x) isn’t an “uninformative” prior. It could be highly informative. Jaynes didn’t call it “uninformative” and the mathematics doesn’t suggest that. It’s something like a “prior” mathematically, which isn’t surprising, since it plays the same role as a “prior” in Bayes’ theorem.

          In Bayes’ theorem, P(theta) kind of defines the starting universe of possibilities for theta (i.e., the high probability region or uncertainty region), and then P(theta|Data) typically narrows us down to a sub-portion of that.

          Same thing with the entropy. M(x) defines the starting universe of possibilities for x, in a sense, and then the entropy tells you how much P(x)’s uncertainty region overlaps with it.

          That’s why, incidentally, entropy isn’t symmetric with respect to interchanging M(x) and P(x), and hence isn’t a true distance measure on the space of probabilities. Since the entropy is something like the log of the percentage of M(x)’s high-probability mass that is in P(x)’s, you definitely won’t get the same thing in general if you switch M(x) and P(x). Think of a Venn diagram to see what I mean.
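
          Here’s a minimal numerical illustration of that asymmetry in Python (the two normals are arbitrary, a narrow p inside a wide m; the integral is the KL divergence, i.e. the negative of the S above, computed by quadrature):

          import numpy as np
          from scipy import stats
          from scipy.integrate import quad

          p = stats.norm(0, 1).pdf  # narrow distribution
          m = stats.norm(0, 5).pdf  # wide distribution

          def kl(f, g):
              # KL(f || g) = integral of f(x) ln[f(x)/g(x)] dx
              return quad(lambda x: f(x) * np.log(f(x) / g(x)), -30, 30)[0]

          print(kl(p, m))  # ~1.13: p's mass sits comfortably inside m's
          print(kl(m, p))  # ~10.4: m claims lots of territory that p rules out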

        • I didn’t think you were disagreeing with me; I thought that responding to a query about non-informative priors by describing a functional that can’t even be computed without a non-informative prior was skipping over a key point.

          I don’t think you understood what I was asking with regards to M(x). If you make a post about it at your blog we can continue there.

          As to Jaynes not calling M(x) “uninformative”, well, he did call it “the prior distribution describing ‘complete ignorance’ of x” — in the passage I quoted, no less. Perhaps you see a distinction there, but if so, it quite escapes me.

        • Corey: wait, his blog is back?

          Anonymous: you can’t just set m(x) = 1, in part because p(x) depends on an arbitrary scale with dimensions 1/[x]. Even if you’ve made your model dimensionless, it is only possible to compare to other models where the nondimensionalization (choice of scale) is the same. In general, you can’t calculate an “absolute entropy”, only something relative to another distribution. If you want to use a nonstandard distribution like a uniform on a nonstandard space, then you’ll wind up with a nonstandard constant; again, you can calculate changes in entropy, because the nonstandard constants cancel out, but no standard absolute entropy.

          Thinking a little about your intuition about p(x) being “inside” m(x):

          Suppose that m(x) is a proper high-entropy distribution, higher entropy than your posterior p(x), and the high probability region of p(x) is inside that of m(x). This is the most interpretable case, because then p(x)/m(x) is “always” > 1 in the “high probability region of p(x)”, so you’re calculating the average of the negative log of a bunch of numbers greater than 1, so you’re going to wind up “more negative” and hence “lower entropy”… the case is a lot less clear when things kind of intersect and p(x)/m(x) < 1 but p(x) isn’t near zero.

        • Corey,

          “M(x)” is not typically uninformative. It might be highly informative. If you view information H and K as constraints, and let P(x|K) be the maxent distribution resulting from constraint K, then:

          maximizing S = -∫ P(x|H,K) ln[P(x|H,K)/P(x|K)] dx with respect to constraint H

          is the same as maximizing S = -∫ P(x|H,K) ln P(x|H,K) dx with respect to both H and K.

          P(x|K) could thus be highly informative if K is a strong constraint.

        • Corey,

          Perhaps a better way of phrasing Jaynes’s remark would have been “complete ignorance of x beyond a given state of information represented by M(x)”, or something to that effect.

        • I don’t know if any of that is getting the point across. The bottom line is: just as Bayes’ theorem has a consistency property whereby you can process information in any order at any time, entropy has the same property.

          If you really understand that, then you understand “priors” whether they occur in Bayes’ theorem or maximum entropy. If you understand “priors”, then everything I and Jaynes said about uninformative distributions makes perfect sense.

        • Daniel,

          Whether you’re talking about entropy or Bayes’ theorem, the “nice” case occurs when the new info you’re trying to process is consistent with the old info. I was really sort of describing the nice case. If there’s an incongruence between the old and new info, both entropy and Bayes’ theorem will try their best to reconcile it, but a lot of messy things can happen.

        • Jaynesians,

          I hadn’t read Jaynes in a while, so I went back and reread Chapter 12, though I admit, not very carefully.

          Correct me if I am wrong: it seems that in Jaynes’s prescription, when you are constructing your noninformative prior, you are only anticipating a small number of physically meaningful transformations to which the prior will be invariant, but you are not ensuring invariance to all one-to-one transformations. Is that a correct reading of Jaynes?

          I also went back and skimmed some old articles by Teddy Seidenfeld (1979, 1987) and by Kass and Wasserman (1996). Have the concerns about MAXENT priors raised therein been resolved since they wrote these? In particular, the “partitioning paradox” seems troubling. Also, the long example at the end of Seidenfeld (1979) seems to indicate that the MAXENT measure need not be decreasing when new information is added. But I could be misinterpreting the import of this example.

          JD

        • JD,

          Moses’ 11th commandment wasn’t “uninformative priors have to be invariant”. That was just some crapola Fisher invented out of thin air. His intuition got all of Bayes wrong, so it’s not surprising he got this wrong too.

          Almost all of statistics is based on the fact that we can be ignorant in one space but knowledgeable in another. If we flip a coin 1000 times, we may be ignorant about exactly which of those 2^1000 sequences we’ll see, but we’re nowhere near as ignorant about the frequency of heads. We’re talking about the same physical event in either case, but about one function of the event we’re highly ignorant, while about another function we’re highly knowledgeable.

          Neither maxent nor Bayes’ theorem is required to result in more certainty when new knowledge is added. That just reflects the commonly observed phenomenon that new information can make us more uncertain, not less.

          For example, we may think Apple’s stock price will increase by 50% after a new killer Apple product comes out. This 50% figure may have small uncertainty to it initially. If we then learn there are three other competitors with potentially viable alternatives to the killer product, then our uncertainty in Apple’s stock price can increase tremendously.

          More information can result in more uncertainty, which implies less information.

        • @Anon,
          > “Neither Maxent nor Bayes theorem are required to result in more certainty when new knowledge is added.”

          I wonder then what connection there really is between these entropy and relative entropy measures and a reasonable concept of information. I would think this would be problematic for Jaynes’ prescription, which as I understand it, is asking us to *maximize* entropy so that we pass along the least amount of “information” we can in the prior.

          This is an unsurprising property of Bayes’ Theorem, but since Bayes’ Theorem isn’t being prescribed as a means to try to construct noninformative priors, I am not sure what point you are trying to make by drawing a parallel.

          Also, your example of coin flipping is not something that my reading of Jaynes would suggest he would consider representative of a situation where a noninformative prior was appropriate, but instead one calling for an informative prior where you may wish to constrain the first moment.

          I think the many kinds of examples that can be constructed where someone *thinks* they may be able to construct a prior that reflects the principle of insufficient reason, but then simple transformations of it do not, may be a difficulty in trying to label such priors as noninformative. And invariance wasn’t a constraint that Fisher imposed on objective Bayesians; it is the hole they dug for themselves. Both Jeffreys and Jaynes spend a considerable amount of ink on the subject.

          JD

        • Oy the nesting limit really is kicking in here.

          JD, I can’t see that it could ever be possible to ensure invariance to all 1-1 transformations; that seems likely to be provably impossible.

          To me the invariance principles are information. If anyone can pick up a stick and make a mark on it and call it a unit of length… that doesn’t mean you should get different physical results based on different choices of stick mark lengths… so the equations of physical reality should be invariant to scale. If there’s nothing different about different directions in space, the equations should be invariant to the arbitrary choice of a direction to be “the x axis”, etc.

          Those are kinds of information that you can use to build models, whether they’re statistical priors or equations of motion, or whatever.

        • @Daniel
          > “I can’t see that it could ever be possible to ensure invariance to all 1-1 transformations, that seems likely to be provably impossible.”

          That’s what I suspect, too.

          And that’s how I read Mayo’s “No way to represent it.” comment that I referred to up above… and which I thought didn’t seem to be adequately responded to by the link to a blog post by “Pierre Laplace” which provided Jaynes’ prescription.

          It seems to me that using the term “noninformative” to describe these constructions by MAXENT is misleading, whether they possess certain invariance properties or not.

          JD

        • In this context, I think “noninformative” has to mean “for the parameter of interest” not “for all ways of looking at the entire universe”.

          If you put a prior on the length of some object on the earth that is normal(0, “radius of the universe”), this is non-informative about this object because you ALREADY KNEW the length was less than the radius of the earth. So you’re using less information than you really have.

          To a pure frequentist, the distribution isn’t supposed to represent degree of credibility of a given length, it’s supposed to represent how often you’d get such a length in repeated trials of some kind of measurement. In that context, it’s clearly “wrong” because you will NEVER get a radius of 2 astronomical units much less two galaxy radii, when you measure something much smaller than the radius of the earth.

          If you remember that for the Bayesian, “how high the PDF goes up the y axis at point x” means “how much credibility you should give to values near x”, then obviously a normal(0, radius of the universe) prior is a lot less informative about values near “size of my house”, because the pdf only goes up about 1/(radius of the universe) instead of 1/(radius of my house), a factor of maybe 10^25.

        • @Daniel
          > “If you put a prior on a length of some object on the earth that is normal(0,”radius of the universe”) this is non-informative about this object because you ALREADY KNEW it was less than the radius of the earth.”

          You’re going to propose a prior density function for a *length* that has support on the whole real line. That seems very strange to me.

          Also, while I don’t think I agree in either case, don’t you also mean to say that this prior is “non-informative about the length” and not “non-informative about the object”?

        • “You’re going to propose a prior density function for a *length* that has support on the whole real line. That seems very strange to me.”

          Yes, once you understand what the Bayesian density is doing, it isn’t that strange. It’s just another way in which one can choose to use less information than one really has.

          What’s a prior for the length of my house in feet?

          1) uniform(30,150)… reasonably well informed, excludes a bunch of impossible values.
          2) uniform(0,1000)… less well informed, still excludes a bunch of impossible values.
          3) normal(70,20)… moderately informed, includes as *possibilities* some impossible values (negative, astronomically large, etc.), but the probability associated with those is tiny.
          4) normal(70,20) truncated to [0,inf)… somewhat more informed, as it excludes logically impossible values.
          5) normal(0,10000)… relatively noninformative, as we are pretty sure that uniform(30,150) covers the possibilities and that has density O(1/100) whereas this has density O(1/10000), a hundred times smaller. This also includes logically impossible values and extremely improbable values (like 9000 feet).

          All of these are, however, perfectly acceptable priors, because the real value is somewhere in the support and somewhere in the highish probability region of the support. That’s all we need in Bayesian inference: that the actual value is somewhere in the relatively highish probability region.
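
          To see how much density each of these spends near a plausible value, here’s a minimal scipy sketch (pretending the true length is 75 feet; that number is made up):

          import numpy as np
          from scipy import stats

          true_length = 75.0  # pretend this is the actual length, in feet

          priors = {
              "1) uniform(30,150)": stats.uniform(30, 120),  # args are (loc, width)
              "2) uniform(0,1000)": stats.uniform(0, 1000),
              "3) normal(70,20)": stats.norm(70, 20),
              "4) truncated normal(70,20)": stats.truncnorm(-3.5, np.inf, loc=70, scale=20),
              "5) normal(0,10000)": stats.norm(0, 10000),
          }
          for name, d in priors.items():
              print(name, d.pdf(true_length))

          The densities range from about 0.02 (priors 3 and 4) down to about 4e-5 (prior 5): every one of them covers the true value, but the diffuse prior spends roughly 500 times less density there, which is exactly the sense in which it is less informative.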

        • @Daniel
          “That’s all that we need in Bayesian inference, that the actual value is somewhere in the relatively highish probability region.”

          I don’t understand how this can be true… how this can lead to coherent inferences. I am coming at this from a (albeit not fully “orthodox subjective Bayesian” in the sense of de Finetti) subjective Bayesian perspective. And to me it appears that you are consciously putting 50% of your distribution on negative values, which not only seems very informative, but wrong. I’ll get a different posterior if I use a prior that doesn’t allocate 50% of the distribution on negative values, no?

          “Objective Bayesians” seem concerned about the use of “personal priors” introducing “bias” into the inference, but it seems to me like using noninformative priors will inevitably introduce bias in a way that doesn’t make sense to anyone.

          I think your response just points back to the other common criticism of noninformative priors of, “Do we ever really have no information?” I mean, I don’t know much about your house, but I have a prior that doesn’t look like any one of those you listed. However, I’m also inclined to ask you what your prior is, and if you don’t try to pass off one of those up above as yours, I’ll probably take yours as that of an expert and adopt it or something “close” to it as my prior.

          I realize that my concerns about “noninformative” priors are just (perhaps with less sophistication) a rehash of many things that have been said before in this apparent impasse between “subjective” and “objective” Bayesians, but I was really curious about whether (perhaps even hoping that) something new has come up in a reading of Jaynes that resolves this impasse.

          JD

        • JD. Take any of those priors, then pace out the length of my house with your feet ONCE, then use any reasonable likelihood for the mean length of your pace, then apply Bayes and you will get an answer that makes sense.

          Yes, (3) or (5) are terrible priors; they use much less information than we really have. For example, we KNOW that a length can’t be negative. I personally always use moderately informative priors, but (5) is a perfectly good place to start if needed, especially if you know that your measurement device has reasonable precision relative to the true value of the thing you’re measuring. If you’re going to measure the length of my house by eyeballing it through a pair of binoculars from the top of a nearby mountain, you might want to incorporate more information in your prior to help you get the right answer.

          Anonymous’s discussion about how “entropy is sort of the size of the high probability region” has to do with taking the mean value of the log of the density. When the distribution is spread out a lot, its density can’t go very high (because the area is normalized to 1); when the prior isn’t very spread out, its density is large, and the average value is large.

          Since entropy is the NEGATIVE mean logarithm of the density, the higher the density goes the lower the entropy, and vice versa.

          I think I’m going to actually run the “length of my house” example on my blog. But for another example where I’m discussing the likelihood see: http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/
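
          Here’s what that single-pacing experiment looks like as a minimal grid-approximation sketch in Python (the observed value, the measurement noise, and the choice of prior (5) are all made up for illustration):

          import numpy as np
          from scipy import stats

          L = np.linspace(0.01, 500, 50_000)      # grid of candidate house lengths, feet

          prior = stats.norm(0, 10000).pdf(L)     # prior (5): very diffuse

          y_obs, sigma = 78.0, 8.0                # one paced-out estimate and its rough sd
          like = stats.norm(L, sigma).pdf(y_obs)  # likelihood of that single observation

          post = prior * like
          post /= post.sum()                      # normalize on the grid
          print((L * post).sum())                 # posterior mean ~78: one decent
                                                  # measurement swamps the diffuse prior

          Swapping in any of the other priors from the list above moves the posterior mean only slightly, which is the point: the prior just needs to put the actual value somewhere in its highish-probability region.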

    • > “I also love that 1996 paper, especially the analysis of the logical consistency of ARMA models as discretizations of an underlying continuous process. Elegant!”

      This is something I’ve been wondering about for a while: does anyone know of a good monograph or other reference that tries to “unify” the common statistical (discrete) time series models (ARMA, GARCH, etc.) and methods with their continuous counterparts (based on diffusions, etc.)? I suspect that an attempt to provide a “unifying” treatment of time series analysis would be very enlightening.

      And I think that it would be useful to both “camps”: understanding the assumptions that yield the discrete time approximation to a continuous reality or a continuous time approximation to a discrete reality.
