Xiao-Li Meng sends along this paper (coauthored with Matthew Reimherr and Dan Nicolae), which begins:

Dramatically expanded routine adoption of the Bayesian approach has substantially increased the need to assess both the confirmatory and contradictory information in our prior distribution with regard to the information provided by our likelihood function. We propose a diagnostic approach that starts with the familiar posterior matching method. For a given likelihood model, we identify the difference in information needed to form two likelihood functions that, when combined respectively with a given prior and a baseline prior, will lead to the same posterior uncertainty. In cases with independent, identically distributed samples, sample size is the natural measure of information, and this difference can be viewed as the prior data size M(k), with regard to a likelihood function based on k observations. When there is no detectable prior-likelihood conflict relative to the baseline, M(k) is roughly constant over k, a constant that captures the confirmatory information. Otherwise M(k) tends to decrease with k because the contradictory prior detracts information from the likelihood function. In the case of extreme contradiction, M(k)/k will approach its lower bound −1, representing a complete cancelation of prior and likelihood information due to conflict. We also report an intriguing super-informative phenomenon where the prior effectively gains an extra (1+r)^(−1) percent of prior data size relative to its nominal size when the prior mean coincides with the truth, where r is the percentage of the nominal prior data size relative to the total data size underlying the posterior. We demonstrate our method via several examples, including an application exploring the effect of immunoglobulin levels on lupus nephritis. We also provide a theoretical foundation of our method for virtually all likelihood-prior pairs that possess asymptotic conjugacy.
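To make the "prior data size" idea concrete, here is a minimal sketch of posterior matching in the simplest conjugate case: a normal mean with known data sd. It is not the paper's general definition of M(k); the function name and the numbers are made up for illustration. In this toy case the posterior variance does not depend on the observed data, so M(k) comes out constant and cannot show the decrease under prior-likelihood conflict that the abstract describes.

```python
def matched_prior_data_size(sigma, prior_sd, k):
    """Extra observations M that a flat baseline prior would need to match
    the posterior variance from a N(., prior_sd^2) prior plus k observations.

    Assumes y_i ~ N(theta, sigma^2) with sigma known (illustrative setup).
    """
    # Posterior variance under the informative prior with k observations.
    post_var = 1.0 / (1.0 / prior_sd**2 + k / sigma**2)
    # Under a flat prior, n observations give posterior variance sigma^2 / n.
    n_matched = sigma**2 / post_var
    return n_matched - k

# Hypothetical numbers: data sd 10, prior sd 5, so the prior is worth about 4 observations.
for k in (1, 10, 100):
    print(k, round(matched_prior_data_size(sigma=10.0, prior_sd=5.0, k=k), 6))
```

Here M equals sigma^2 / prior_sd^2 whatever k is, which is the familiar "prior as pseudo-observations" reading of a conjugate normal prior.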

This sounds like it’s potentially very important. As many of you know, I’ve been struggling for a few years with how to generally think about weakly informative priors, and this paper represents a way of looking at the problem from a different direction.

This area also is an example of the complementary nature of applied, methodological, computational, and theoretical research. Our methodological work on weakly informative priors (that is, our papers from 2006, 2008, 2013, and 2014) was motivated by persistent problems that were arising in our applied work on hierarchical models. Then, once we have these methods out there, it is possible for deep thinkers such as XL to make sense of them all. Then people can apply the larger framework to new applications, and so on.

P.S. I spoke on weakly informative priors at Harvard a few years ago (see here for a more recent version). When the talk was over, XL stood up and said, “Thank you for a weakly informative talk.” So I’m hoping the paper above gets published and I can write a discussion beginning, “Thank you for a weakly informative paper.”

I’ve never been sure what Andrew’s meant by “weakly informative prior.” I’ve taken a working definition to be something like concentrating the prior probability mass in feasible regions for the parameter based on prior knowledge. Now that prior knowledge can be

1. domain knowledge for a problem (e.g., mean height of humans is not less than .25m or greater than 2m), or

2. domain knowledge for a class of problems (e.g., with standardized predictors and a small data set of 10K items with no separation, regression coefficients tend not to be greater than 5 in absolute value, but there can be outliers).

What we’ve been doing a lot in Stan is shaping priors so that they have very little impact on the posterior means, but enough impact on the posterior curvature that HMC will sample efficiently. Sort of an inverse application of the folk theorem (i.e., if sampling’s hard, there may be something wrong with the model). I believe this controls posterior tails more than anything else, which brings up the issue of how we evaluate the mass we assign to the tail regions to begin with.

Andrew’s pet example, Rubin’s 8 schools problem, is a classic case where there’s very little data and therefore the priors have a strong effect on inference.

Bob:

Roughly speaking, a weakly informative prior has some information but not as much information as you really have. Your height example is a good one. The advantages of weakly informative priors are that they can be easy to set up (because you don’t have to worry too much about the details) but they can do a lot to stabilize inferences, especially in weak-data settings. A practical example is in this paper.

If your prior is only weakly informative don’t you lose the key strength / uniqueness of a Bayesian modelling approach?

Rahul:

See my comment to Bob above. The short answer is that better modeling is (typically) better, but some modeling can be a lot better than none.

If you are not sure of what’s a good prior, doesn’t this become equivalent to a frequentist model?

Is the limit of a very very weakly informative prior a non-Bayesian model?

Rahul:

There’s not really such a thing as a frequentist model. “Frequentist” refers to methods of evaluating statistical procedures. Any procedure, Bayesian or otherwise, can be evaluated using these principles. In any case, the point of weakly informative priors is that, in some settings, a little bit of information can do a lot of regularizing. As I like to say, try N(0,10^2) rather than N(0,1000^2) or N(0,infinity).
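To see what "a little bit of information can do a lot of regularizing" looks like, here is a minimal sketch using the conjugate normal-normal update for a single noisy estimate. The estimate (30 with standard error 20) and the function name are hypothetical, chosen only to illustrate the N(0,10^2) versus N(0,1000^2) versus flat comparison.

```python
import math

def normal_posterior(y, se, prior_mean, prior_sd):
    """Posterior mean and sd of theta for y ~ N(theta, se^2), theta ~ N(prior_mean, prior_sd^2).

    prior_sd = math.inf is treated as the flat-prior limit.
    """
    prior_prec = 0.0 if math.isinf(prior_sd) else 1.0 / prior_sd**2
    data_prec = 1.0 / se**2
    post_prec = prior_prec + data_prec
    # Precision-weighted average of the prior mean and the observation.
    post_mean = (prior_prec * prior_mean + data_prec * y) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

# Hypothetical noisy estimate: 30 with standard error 20.
for prior_sd in (10.0, 1000.0, math.inf):
    mean, sd = normal_posterior(y=30.0, se=20.0, prior_mean=0.0, prior_sd=prior_sd)
    print(f"prior sd {prior_sd:>6}: posterior mean {mean:6.2f}, sd {sd:5.2f}")
```

With these made-up numbers the N(0,10^2) prior pulls the estimate from 30 down to 6, while the N(0,1000^2) prior is practically indistinguishable from the flat one.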

I like this view (and the similar things you’ve said) a lot. I think you’ve probably single-handedly convinced a lot of us borderline bayesian skeptics, who never bought into the pro-bayes philosophical arguments or lazy criticisms of frequentist stats, that maybe there are some useful things there.

I still don’t like a lot of the default ‘ontological baggage’ or whatever that you find in the sell parts of papers and talks etc but I’ve found enough translations – eg regularization was a really big turning point for me – to embrace (or at least try) the bits that I think are useful. Plus subjectively (!) bayes folk just seem to be working more on the problems I want to tackle.

The eight schools example is a good one. You can put in just about any prior that seems reasonable and there’s just enough data for the data to drive the inferences. But you do have to put in _something_ for the prior.
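For readers who want to poke at this, here is a rough sketch using the standard eight schools estimates and standard errors (Rubin 1981, as tabulated in Bayesian Data Analysis). It is not a full Bayesian fit: it fixes the between-school sd tau at a few hypothetical values, plugs in the precision-weighted estimate of mu, and shows how the school estimates shrink. The function name and the tau grid are illustrative assumptions.

```python
# Estimated treatment effects and standard errors for the eight schools.
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def partial_pooling(y, sigma, tau):
    """Conditional posterior means of the school effects for a fixed tau > 0.

    Model: y_j ~ N(theta_j, sigma_j^2), theta_j ~ N(mu, tau^2).
    mu is replaced by its precision-weighted estimate (a plug-in shortcut).
    """
    w = [1.0 / (s**2 + tau**2) for s in sigma]
    mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    theta = [
        (yi / s**2 + mu_hat / tau**2) / (1.0 / s**2 + 1.0 / tau**2)
        for yi, s in zip(y, sigma)
    ]
    return mu_hat, theta

for tau in (1.0, 5.0, 10.0, 25.0, 100.0):
    mu_hat, theta = partial_pooling(y, sigma, tau)
    print(f"tau={tau:5.0f}  mu_hat={mu_hat:5.2f}  theta=" + " ".join(f"{t:6.1f}" for t in theta))
```

Small tau pulls every school toward the pooled mean, large tau leaves the raw estimates nearly untouched, and a prior on tau is what decides how much weight each regime gets.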

Thanks!

Have a link? “Rubin’s 8 schools problem” on google didn’t work.

Intuitively I see WIPs as a compromise between a default implicit prior in non-Bayesian analyses and a detailed “all I know” prior in an idealized Bayesian analysis. I guess the medical analogy would be a non-invasive procedure.

What I like is the subtle point about conservatism. Ignoring the prior and letting the data speak is risky, not conservative. Risky bc you are ignoring information.

If you are a kid it is not conservative to ignore your mother’s admonition not to talk to strangers, including lonesome data points in a sparse stratum. ;-)

Fernando,

I think your point (about Andrew’s point) about “conservatism” is interesting. But I wonder if there aren’t two types of conservatism we could differentiate to make the point more clear: I’ll try splitting it by within-study and between-study conservatism.

I think of “within study” conservatism as kind of “letting the data speak.” This is conservative in the sense of putting minimal structure on your data analysis and thus being “conservative” with regards to your beliefs about how the data is generated in the real world (it’s like saying “I don’t know the underlying processes that generated this data, but I want to make these comparisons”).

I think of “between study” conservatism as the place where Bayesian-type thinking is not just useful but absolutely 100% necessary – regardless of how much structure I put into my own analysis, I should only use the results to update my beliefs about the world in relation to other evidence, and not to just accept my results as face-value true about the whole world. That is – it is conservative to judiciously and carefully update our beliefs about the world in relation to any particular finding. This is where we agree on “risky bc you are ignoring information.”

To me the question I keep coming back to here is: to what extent do we want to build “between study” conservatism into “within study” methodology? That is – how much do we want to include prior knowledge in the model itself, and how much do we want to just use our prior knowledge when updating our beliefs about the world in light of some new finding? It’s still not at all clear to me that Bayesian approaches really help us with “between study” problems, and the cost is often a lack of clarity and parsimony in the analyses that are conducted (I concede that my lack of experience with these models may be part of the problem).

In your final paragraph: “It’s still not at all clear to me that Bayesian approaches really help us with ‘between study’ problems…”

Did you mean “within study”? I ask since you mention Bayesian approaches as a method for between study inference in previous paragraphs.

I’m not sure your dichotomy is that strict or helpful. You say that “within study” conservatism is a kind of “letting the data speak” and putting “minimal structure on your data analysis”.

I very rarely see this approach as desirable. To me the way to study the world is to create models that are fairly carefully structured, and then see what the data helps me discover about the model.

If I am not confident that my model is the “one true” model, then I would be interested in comparing it to other models. But each model I’d take very seriously and try to put as much structure as I can justify into it.

BTW: I hope we’re going to get together down here in SoCal soon. And maybe we can tackle one of your problems in parallel, using a Bayesian approach and some more standard econometrics approaches, and you can see what you think of the experience.

I should probably clarify though, I agree with Andrew that using moderately informative priors, ones which encode your prior information approximately and non-strictly is desirable. That’s a kind of conservatism that I can get behind, using much of your prior information, but not so much that it becomes strongly controversial.

Can you illustrate what “encode your prior information approximately and non-strictly” means versus the alternative? Any simple toy example to explain this? What’s a strict prior versus a non-strict prior?

Sure. From your name I assume you are male. From background information on male humans I could construct a prior for your height. Using the CDC data graphed here:

http://en.wikipedia.org/wiki/Human_height#mediaviewer/File:Male_Stature_vs_Age.svg

I could create a prior that was normal(175,8) in centimeters to approximately match the range shown. This would be using pretty much all the information that I have. But given that I might be studying height in a context where there might be a bias in selection relative to the whole population or other reason to believe that my data set would be different from the overall population, I am more comfortable with normal(175,20) which allows a much wider range than is actually observed in the background data.

This contrasts to doing something like looking at the wikipedia page on human height

http://en.wikipedia.org/wiki/Human_height

seeing that they show the historical range of adult human heights as 60 to 260 cm and constructing a prior as say uniform(0,300) in cm, which is much less informed, or just noting that the average male height is about 175 and constructing a prior as exponential(1/175) which is also much less informed. And we could even go so far as to do something like normal(200,1000) truncated to the interval [0,inf) to essentially say we know nothing about human height. I wouldn’t advocate those other methods. I’d advocate normal(175,20) or normal(175,40) because that’s a good mix between using our knowledge and allowing extra wiggle room for our specific study.
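One way to compare these candidate priors is to look at the central 95% interval each one implies for height. The sketch below does that with closed-form quantiles from the Python standard library. It assumes normal(m, s) means mean m and standard deviation s, and that exponential(1/175) is parameterized by its rate (so the mean is 175); the helper names are made up for illustration.

```python
import math
from statistics import NormalDist

def normal_q(mu, sd):
    """Quantile function of normal(mu, sd), with sd a standard deviation."""
    return NormalDist(mu, sd).inv_cdf

def uniform_q(lo, hi):
    """Quantile function of uniform(lo, hi)."""
    return lambda p: lo + p * (hi - lo)

def exponential_q(rate):
    """Quantile function of an exponential with the given rate (mean 1/rate)."""
    return lambda p: -math.log(1.0 - p) / rate

def truncated_normal_q(mu, sd, lower=0.0):
    """Quantile function of normal(mu, sd) truncated to [lower, inf)."""
    d = NormalDist(mu, sd)
    below = d.cdf(lower)  # mass removed by the truncation
    return lambda p: d.inv_cdf(below + p * (1.0 - below))

priors = {
    "normal(175, 8)": normal_q(175, 8),
    "normal(175, 20)": normal_q(175, 20),
    "normal(175, 40)": normal_q(175, 40),
    "uniform(0, 300)": uniform_q(0, 300),
    "exponential(1/175)": exponential_q(1 / 175),
    "normal(200, 1000) on [0, inf)": truncated_normal_q(200, 1000),
}

for name, q in priors.items():
    print(f"{name:30s} central 95%: {q(0.025):7.1f} to {q(0.975):7.1f} cm")
```

The normal(175, 20) interval covers plausible adult heights with room to spare, while the exponential and truncated-normal intervals put substantial mass on heights no adult has ever had.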

@Daniel Lakeland:

I see what you mean. But I’m not sure I agree entirely.

Essentially, we are talking about how much to weigh “our knowledge” versus the “specific study”. To me this can *only* be done on a case to case basis, employing context & domain expertise.

Rahul: from my perspective, all research can only be done on a case by case basis employing context and domain expertise. We need to design experimental protocols, or data collection protocols, or choose which topics to study, which instrument variables to use, what the questions to ask are, how to put our study in the context of other studies, whether to do followups, how to trade off power vs cost, how to trade off benefits to society vs cost… etc etc.

“the prior” is in my opinion among the least objectionable parts of research choice. It is at least evident.

Daniel,

I see how I was unclear – what I meant was that I don’t see how doing a Bayesian methodology on a particular study (the “within study” part) helps us with doing the (likely informal) Bayesian prior-updating we do by comparing numerous studies (the “between study” part).* And the cost of doing Bayesian analysis is often times that it reduces parsimony and clarity.

As for “I very rarely see this approach as desirable” I wonder how much of that is us generally answering different kinds of problems. In the applied microeconomics world, we are often working with large datasets collected regularly on large samples of people. When we want to identify some causal parameter of interest, we often look for some quasi-experimental type variation in the world that affects some covariate X for some people and not for others, in a way we think is orthogonal to their other unobservable characteristics that determine the outcome Y. In these cases, we often explicitly do not want to model, say, wages themselves, or remittances, or voting patterns. We only want to use the econometrics to make a comparison between two groups in the world. The more structure we add, the further we get from the simple comparison we are after. I mean – regression discontinuity, IV, differences-in-differences – these are all explicitly BAD models of people themselves, but good ways of cutting up the world to make comparisons.

In contrast, when my Dad models electrical field propagation over long distances, he does employ a fairly detailed model, because, you know, electricity follows laws of motion. And as much as Entosophy wants to jump in here and argue otherwise – I don’t think human beings follow laws of motion.

Now – on to the important stuff – I can’t wait to sit down in LA (we’re considering Eagle Rock, thoughts? Atwater, Echo Park, Highland Park, and maybe Silverlake are also on the list) and see some Bayesian modelling in action and draw stuff and point at stuff and ask questions. See you in August!

I am not sure what you are pointing out here, within study likelihoods are very peaked while between study likelihoods are very flat – with the very peaked only really very peaked given randomization?

I wasn’t trying to be technical, I was just trying to be philosophical. We have two problems: statistical inference within any given study (getting the p-values right or the uncertainty or whatever); and forming general beliefs about the effects of some covariate X on some outcome Y. I don’t think that these are the same thing, nor do I think they should be the same thing (meaning, sometimes you just get unlikely data, and that’s fine).

I’m saying that Bayesian modelling as it’s discussed here mostly seems to me to be about statistical inference in the first sense (within a study), but that the place where we really need Bayesian thinking is in the second sense (inference as to how the world is actually working). And I’m not sure that the first thing helps the second thing.

All that said – much of the reason I put these comments out is so that smart people like you and Daniel can show me why I’m wrong or give me ideas that can help me be less wrong in the future. For instance – there is a reasonable chance that my understanding of “statistical inference” in the Bayesian sense is so bad that the sentence I wrote above about “unlikely data” is actually, on its face, a meaningless or contradictory statement.

@jrc your philosophy doesn’t make any sense.

If you don’t trust the procedure in the second case, why would you trust the procedure in the first case.

combining information across studies is precisely what needs to be done more rigorously. It can be done, but you can’t get away with doing it by rote.

@anonymous

“If you don’t trust the procedure in the second case, why would you trust the procedure in the first case.”

I wasn’t implying that I “trust the procedure” in the first case. In fact, I don’t “trust” any procedure to produce anything that purports to tell us a whole lot about a population parameter from a single sample/experiment (in part because I don’t really believe in population parameters applied to things like “returns to education” or “elasticity of labor supply w/r/t income tax rates” – the kinds of things we estimate in applied micro – and in part because I don’t believe our assumptions over the data generating process, and in part because I believe that published results are cherry-picked).

“combining information across studies is precisely what needs to be done more rigorously. It can be done, but you can’t get away with doing it by rote.”

I’m not sure if we agree or disagree. Here’s my thinking:

Because I don’t believe that a p-value of .01 in one study is evidence of some meaningful association in the real world (whatever method generated it), I look for consistent evidence across repeated studies looking at similar things. This is where I think that Bayesian *thinking* is really important (note: I don’t necessarily mean Bayesian statistics here, I mean updating priors with new information in the more general sense).

My concern with making that updating process more formal (in the second sense of using multiple studies for a meta-analysis or something) has to do with the kinds of judgments we make every day as empiricists but that are very difficult to encode mathematically in probability models – things like: where is the identifying variation coming from, how do the coefficient estimates change if we use different estimators latching on to different types of variation (in the same data), is the measurement of the outcome good, is the manipulation/treatment really similar to the one in some other study, etc.

I see that this is an awkward line of reasoning, because usually we think of inference as being from the sample to the population. But I tend to think of inference (in the sense of hypothesis testing or confidence intervals or whatever type of precision estimate) as really about the sample itself (and its variation) and not really relating to some metaphysical super-population somewhere (or, to all other actual people for that matter). So I tend to look at whatever studies I can find, think hard about how they were conducted and what they were measuring, look for similarities in results in the papers I think were done best, and draw a (subjective) conclusion about what is going on in the world.

jrc, the kinds of issues you’ve raised are of concrete and practical concern in the field of clinical trial design, particularly clinical trials of medical devices. Often a medical device is an incremental change (hopefully, improvement) on a previously studied design. In the interest of minimizing losses of human life and/or quality of life due to either falsely passing a detrimental change or falsely failing an improvement, it is highly desirable to use a hierarchical model that can adaptively tune the posterior for the proper balance of your “within-study” and “between-study” conservatisms. Brad Carlin has done a lot of work making this formal rather than informal; the key phrase is “commensurate prior”.

Although I doubt there is such a thing as a “law of motion” of individual humans, I think there are probably plenty of ways in which we can model systematic trends for aggregate statistics, and by incorporating these models we will get much better results.

Let’s go back to something like that study on Chinese coal pollution.

Let’s agree that the authors of this study were interested in how realistic levels of coal pollution within typical achievable ranges affect health outcomes. They find a location where there is presumably a variation in exposure to this pollution which maybe is not accompanied by a significant variation in other factors.

Your point is that they’re interested in this difference independent of other things that affect health, such as diet, exercise, and the like. If we had a strong model of say cardiopulmonary health, we wouldn’t be able to use it, because we don’t have strong information about inputs for all the individuals involved…. ok so far?

Now suppose I say that we’re not really interested in the effect of the coal burning policy at this location on these specific people, we’re really interested in how health is affected by pollution (because we want to generalize to other locations where policies etc. might be different, and if we know how health is affected by pollution in general, we can apply that to this specific case). The policy difference is just a way to get a “natural experiment” about pollution.

So suppose we had information about pollution measured at say ground monitoring stations throughout the area? Suppose we had information about the prevalence of different types of furnaces (say, clean vs dirty burning) in the area, and we had information about the SES and demographics of the various neighborhoods.

Now, we don’t just have a single spatial axis (above and below the river) but instead, a rich spatial timeseries dataset. Wouldn’t you agree that we should use this rich spatial timeseries dataset?

Now, if we created a model in which we use demographics of age, and cumulative exposure through time of the study to approximate “health trajectories” for each neighborhood, and then used those health trajectories to estimate the prevalence of heart attack or whatever it was, wouldn’t that be a better model than just plugging distance from the river into a discontinuous cubic??

Suppose our model found that one of the reasons we didn’t see as much variation in outcomes as we thought we would is that the spatial variation in ages of the population worked “against” the pollution effect (that is, the people exposed to more pollution tended to be younger, and so were in a “healthier” period of their lives). Our model, building that knowledge in, will give us useful information about the parameters that control “detrimental effects,” because that pattern will be an expected consequence of such a model. But the simple regression on “distance from the river” won’t, because it doesn’t know about the real way in which health varies.

That’s the kind of thing that I mean by bringing in real structure to the model.

As for locations, I think all the places you mentioned are worth considering. You might also look at Mt Washington, the eastern portions of Altadena (where I am), Pasadena, South Pasadena (a separate town), Glendale, and Montrose.

If I remember correctly you’ll be commuting east, so Pasadena, South Pasadena, and Altadena will put you where you don’t have to go through the snarls of the 5/10/710/101/110 freeways, etc.

Email me if you like.

@jrc

My understanding is most Bayesians want to make inference about population models. But even if you only want to make an inference about the effect of a new curriculum in 8 specific schools, you should not ignore your prior knowledge. Mainly to regularize outcomes, minimize bad luck.

That said, I like to start with a randomization test, then modelling (which I’d like to be Bayesian). The reason is that modeling includes many assumptions beyond the prior. If these inferences conflict, you know the assumptions are playing a big role.

Rahul: The most important difference between frequentist and Bayesian approaches is not the prior, but the idea of averaging rather than optimizing. Bayesian methods tend to use the posterior mean; frequentist methods (when translated into the Bayesian framework) tend to use (an approximation of) the posterior mode.
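A small illustration of averaging versus optimizing, as a sketch rather than anything from the discussion itself: for binomial data with a uniform Beta(1,1) prior, the posterior is a Beta distribution whose mode equals the maximum likelihood estimate y/n, while its mean is (y+1)/(n+2). The data (0 successes in 5 trials) and the function name are hypothetical.

```python
def beta_posterior_summaries(successes, trials, a=1.0, b=1.0):
    """Posterior mean and mode of a binomial proportion under a Beta(a, b) prior.

    The mode formula assumes both posterior shape parameters are >= 1
    and not both equal to 1 (true here whenever trials > 0 and a = b = 1).
    """
    a_post = a + successes
    b_post = b + trials - successes
    mean = a_post / (a_post + b_post)
    mode = (a_post - 1.0) / (a_post + b_post - 2.0)
    return mean, mode

# Hypothetical data: 0 successes in 5 trials, uniform prior.
mean, mode = beta_posterior_summaries(successes=0, trials=5)
print(f"posterior mean {mean:.3f}, posterior mode {mode:.3f}")
```

The mode sits on the boundary at 0, matching the maximum likelihood estimate, while the mean of 1/7 averages over the remaining uncertainty.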

Thanks!

Though the distinction is well worth pointing out, the averaging is completely motivated _given a prior_ while the optimizing often works well when it works (finds true optimums) – but it can fail badly (e.g. Neyman-Scott problems).