Everything is Obvious (once you know the answer)

Duncan Watts gave his new book the above title, reflecting his irritation with those annoying people who, upon hearing of the latest social science research, reply with: Duh-I-knew-that. (I don’t know how to say Duh in Australian; maybe someone can translate that for me?) I, like Duncan, am easily irritated, and I looked forward to reading the book. I enjoyed it a lot, even though it has only one graph, and that graph has a problem with its y-axis. (OK, the book also has two diagrams and a graph of fake data, but that doesn’t count.)

Before going on, let me say that I agree wholeheartedly with Duncan’s central point: social science research findings are often surprising, but the best results cause us to rethink our world in such a way that they seem completely obvious, in retrospect. (Don Rubin used to tell us that there’s no such thing as a “paradox”: once you fully understand a phenomenon, it should not seem paradoxical any more. When learning science, we sometimes speak of training our intuitions.) I’ve jumped to enough wrong conclusions in my applied research to realize that lots of things can seem obvious but be completely wrong. In his book, Duncan does a great job at describing several areas of research with which he’s been involved, explaining why this research is important for the world (not just a set of intellectual amusements) and why it’s not as obvious as one might think at first. Continue reading

The old, old story: Effective graphics for conveying information vs. effective graphics for grabbing your attention

One thing that I remember from reading Bill James every year in the mid-1980s was that certain topics came up over and over, issues that would never really be resolved but appeared in all sorts of different situations. (For Bill James, these topics included the so-called Pesky/Stuart comparison of players who had different areas of strength, the eternal question (associated with Whitey Herzog) of the value of foot speed on offense and defense, and the mystery of exactly what it is that good managers do.)

Similarly, on this blog–or, more generally, in my experiences as a statistician–certain unresolvable issues come up now and again. I’m not thinking here of things that I know and enjoy explaining to others (the secret weapon, Mister P, graphs instead of tables, and the like) or even points of persistent confusion that I keep feeling the need to clean up (No, Bayesian model checking does not “use the data twice”; No, Bayesian data analysis is not particularly “subjective”; Yes, statistical graphics can be particularly effective when done in the context of a fitted model; etc.). Rather, I’m thinking about certain tradeoffs that may well be inevitable and inherent in the statistical enterprise.

Which brings me to this week’s example. Continue reading

Confusion about Bayesian model checking

As regular readers of this space should be aware, Bayesian model checking is very important to me:

1. Bayesian inference can make strong claims, and, without the safety valve of model checking, many of these claims will be ridiculous. To put it another way, particular Bayesian inferences are often clearly wrong, and I want a mechanism for identifying and dealing with these problems. I certainly don’t want to return to the circa-1990 status quo in Bayesian statistics, in which it was considered virtually illegal to check your model’s fit to data.

2. Looking at it from the other direction, model checking can become much more effective in the context of complex Bayesian models (see here and here, two papers that I just love, even though, at least as measured by citations, they haven’t influenced many others).

On occasion, direct Bayesian model checking has been criticized from a misguided “don’t use the data twice” perspective (which I won’t discuss here beyond referring to this blog entry and this article of mine arguing the point).

Here I want to talk about something different: a particular attempted refutation of Bayesian model checking that I’ve come across now and then, most recently in a blog comment by Ajg:

The example [of the proportion of heads in a number of “fair tosses”] is the most deeply damning example for any straightforward proposal that probability assertions are falsifiable.

The probabilistic claim “T” that “p(heads) = 1/2, tosses are independent” is very special in that it, in itself, gives no grounds for preferring any one sequence of N predictions over another: HHHHHH…, HTHTHT…, etc: all have identical probability .5^N and indeed this equality-of-all-possibilities is the very content of “T”. There is simply nothing inherent in theory “T” that could justify saying that HHHHHH… ‘falsifies’ T in some way that some other observed sequence HTHTHT… doesn’t, because T gives no (and in fact, explicitly denies that it could give any) basis for differentiating them.
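As a quick aside (this sketch is mine, not part of the original post, and the longest-run statistic is just one convenient choice): under T every individual sequence of N tosses does have probability 0.5^N, but a test statistic such as the longest run of heads has a sharply peaked reference distribution under T, and the all-heads sequence sits in its far tail while the alternating sequence does not. That is exactly the kind of comparison a predictive check makes.

    # Minimal sketch (mine, not from the post): under T every sequence of N
    # tosses has probability 0.5**N, but a test statistic such as the longest
    # run of heads still has a sharply peaked reference distribution, so the
    # all-heads sequence can be flagged as surprising while HTHT... is not.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 20

    def longest_run(seq):
        """Length of the longest run of heads (1s) in a 0/1 sequence."""
        best = run = 0
        for x in seq:
            run = run + 1 if x == 1 else 0
            best = max(best, run)
        return best

    # Reference distribution of the statistic under T (fair, independent tosses)
    reps = np.array([longest_run(rng.integers(0, 2, N)) for _ in range(10_000)])

    for label, observed in [("all heads  ", np.ones(N, dtype=int)),
                            ("alternating", np.tile([1, 0], N // 2))]:
        t_obs = longest_run(observed)
        p_value = np.mean(reps >= t_obs)
        print(f"{label}: longest run = {t_obs:2d}, p-value = {p_value:.4f}")

    # Both sequences have identical probability under T, yet they land at
    # opposite extremes of the statistic's reference distribution, which is
    # precisely the comparison a predictive check makes.

A different statistic (say, the number of switches between heads and tails) would instead flag the alternating sequence as too regular; the point is that T, combined with a statistic encoding which aspect of the data we care about, does give a basis for differentiating sequences.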

Continue reading

Those people who go around telling you not to do posterior predictive checks

I started to post this item on posterior predictive checks and then realized that I had already posted it several months ago! Memories (including my own) are short, though, so here it is again:

A researcher writes,

I have made use of the material in Ch. 6 of your Bayesian Data Analysis book to help select among candidate models for inference in risk analysis. In doing so, I have received some criticism from an anonymous reviewer that I don’t quite understand, and was wondering if you have perhaps run into this criticism. Here’s the setting. I have observable events occurring in time, and I need to choose between a homogeneous Poisson process and a nonhomogeneous Poisson process in which the rate is a function of time (e.g., a loglinear model for the rate, which I’ll call lambda).

I could use DIC to select between a model with constant lambda and one where the log of lambda is a linear function of time. However, I decided to try to come up with an approach that would appeal to my frequentist friends, who are more familiar with a chi-square test against the null hypothesis of constant lambda. So, following your approach in Ch. 6, I had WinBUGS compute two posterior distributions. The first, which I call the observed chi-square, subtracts the posterior mean (mu[i] = lambda[i]*t[i]) from each observed value, squares this, and divides by the mean. I then add all of these values up, getting a distribution for the total. I then do the same thing, but with draws from the posterior predictive distribution of X. I call this the replicated chi-square statistic.

If my putative model has good predictive validity, it seems that the observed and replicated distributions should have substantial overlap. I called this overlap (calculated with the step function in WinBUGS) a “Bayesian p-value.” The model with the larger p-value is a better fit, just like my frequentist friends are used to.

Now to the criticism. An anonymous reviewer suggests this approach is weakened by “using the observed data twice.” Well, yes, I do use the observed data to estimate the posterior distribution of mu, and then I use it again to calculate a statistic. However, I don’t see how this is a problem, in the sense that empirical Bayes is problematic to some because it uses the data first to estimate a prior distribution, then again to update that prior. I am also not interested in “degrees of freedom” in the usual sense associated with MLEs either.

I am tempted to just write this off as a confused reviewer, but I am not an expert in this area, so I thought I would see if I am missing something. I appreciate any light you can shed on this problem.
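For readers who want to see the mechanics, here is a minimal sketch of the check the correspondent describes, written in Python rather than WinBUGS and using invented data; a conjugate Gamma posterior for the constant-rate model stands in for the MCMC fit, and mu[i] = lambda*t[i] is the expected count computed from each posterior draw.

    # Minimal sketch of the observed-vs-replicated chi-square check described
    # above, with made-up data.  In the original setup the draws of lambda
    # would come from WinBUGS; here a conjugate Gamma posterior for the
    # constant-rate model is sampled directly.
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data: event counts x[i] over exposure times t[i], generated
    # with a rate that drifts upward, so the constant-rate model is wrong.
    t = np.full(20, 2.0)                    # exposure time in each interval
    true_rate = np.linspace(0.5, 5.0, 20)   # events per unit time
    x = rng.poisson(true_rate * t)

    # Posterior for a constant rate lambda under a Gamma(a0, b0) prior:
    # lambda | x ~ Gamma(a0 + sum(x), rate = b0 + sum(t)).
    a0, b0 = 0.001, 0.001
    n_draws = 4000
    lam = rng.gamma(a0 + x.sum(), 1.0 / (b0 + t.sum()), size=n_draws)

    # Chi-square discrepancy for the observed data, one value per posterior draw
    mu = lam[:, None] * t[None, :]          # expected counts, draws x intervals
    chi2_obs = ((x[None, :] - mu) ** 2 / mu).sum(axis=1)

    # Same discrepancy for replicated data from the posterior predictive
    x_rep = rng.poisson(mu)
    chi2_rep = ((x_rep - mu) ** 2 / mu).sum(axis=1)

    # Posterior predictive ("Bayesian") p-value: the proportion of draws in
    # which the replicated discrepancy exceeds the observed one.
    p_value = np.mean(chi2_rep >= chi2_obs)
    print(f"posterior predictive p-value = {p_value:.3f}")

With this drifting-rate data the p-value should typically come out small, signaling that the constant-rate model does not reproduce the observed variation; for a model that fits, the observed and replicated discrepancies overlap and the p-value lands well away from 0 and 1.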

My thoughts: Continue reading

Bayesian model selection

A researcher writes,

I have made use of the material in Ch. 6 of your Bayesian Data Analysis book to help select among candidate models for inference in risk analysis. In doing so, I have received some criticism from an anonymous reviewer that I don’t quite understand, and was wondering if you have perhaps run into this criticism. Here’s the setting. I have observable events occurring in time, and I need to choose between a homogeneous Poisson process and a nonhomogeneous Poisson process in which the rate is a function of time (e.g., a loglinear model for the rate, which I’ll call lambda).

I could use DIC to select between a model with constant lambda and one where the log of lambda is a linear function of time. However, I decided to try to come up with an approach that would appeal to my frequentist friends, who are more familiar with a chi-square test against the null hypothesis of constant lambda. So, following your approach in Ch. 6, I had WinBUGS compute two posterior distributions. The first, which I call the observed chi-square, subtracts the posterior mean (mu[i] = lambda[i]*t[i]) from each observed value, squares this, and divides by the mean. I then add all of these values up, getting a distribution for the total. I then do the same thing, but with draws from the posterior predictive distribution of X. I call this the replicated chi-square statistic.

If my putative model has good predictive validity, it seems that the observed and replicated distributions should have substantial overlap. I called this overlap (calculated with the step function in WinBUGS) a “Bayesian p-value.” The model with the larger p-value is a better fit, just like my frequentist friends are used to.

Now to the criticism. An anonymous reviewer suggests this approach is weakened by “using the observed data twice.” Well, yes, I do use the observed data to estimate the posterior distribution of mu, and then I use it again to calculate a statistic. However, I don’t see how this is a problem, in the sense that empirical Bayes is problematic to some because it uses the data first to estimate a prior distribution, then again to update that prior. I am also not interested in “degrees of freedom” in the usual sense associated with MLEs either.

I am tempted to just write this off as a confused reviewer, but I am not an expert in this area, so I thought I would see if I am missing something. I appreciate any light you can shed on this problem.

My thoughts: Continue reading