The “Washington read” and the algebra of conditional distributions

I was trying to explain in class how a (Bayesian) statistician reads the formula for a probability distribution. In old-fashioned statistics textbooks you’re told that if you want to compute a conditional distribution from a joint distribution you need to do some heavy math: p(a|b) = p(a,b) / \int p(a',b) da'.

When doing Bayesian statistics, though, you usually don’t have to do the integration or the division. If you have parameters theta and data y, you first write p(y,theta). Then to get p(theta|y), you don’t need to integrate or divide. All you have to do is look at p(y,theta) in a certain way: Treat y as a constant and theta as a variable. Similarly, if you’re doing the Gibbs sampler and want a conditional distribution, just consider the parameter you’re updating as the variable and everything else as a constant. No need to integrate or divide, you just take the joint distribution and look at it from the right perspective.
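To make this concrete, here’s a minimal sketch in code (the Poisson likelihood with a gamma prior is an arbitrary choice for illustration, nothing special about it): the same joint-density function, evaluated with y held fixed, already traces out the shape of p(theta|y), with no integration or division anywhere.

```python
import numpy as np
from scipy import stats

def log_joint(y, theta, a=2.0, b=1.0):
    """log p(y, theta) for a Poisson likelihood and a Gamma(a, b) prior on the rate theta."""
    return (stats.poisson.logpmf(y, theta).sum()
            + stats.gamma.logpdf(theta, a, scale=1.0 / b))

y = np.array([3, 5, 4])  # the data: from here on, just constants

# The "read": sweep theta over a grid while holding y fixed.  Up to an
# additive constant this is log p(theta | y); its shape is all we need
# for plotting, optimization, or a Metropolis/Gibbs step.
theta_grid = np.linspace(0.1, 10.0, 200)
log_post_unnorm = np.array([log_joint(y, th) for th in theta_grid])
```

The normalizing constant (the dreaded integral) would only enter if we wanted the posterior density on an absolute scale.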

A while ago Yair told me there’s something called the “Washington read,” where you pick up a book, go straight to the index, and see if, where, and how often you’re mentioned.

It struck me, when explaining Bayesian algebra, that what we’re really doing when we get a conditional distribution is taking a Washington read of the joint distribution, from the perspective of the parameter or parameters of interest.

More generally, I’ve found that an important step in being able to do mathematics for statistics is learning how to focus on different symbols in a formula. In math, all symbols are in some sense equal, whereas in statistics, x and y and pi and theta and sigma and lambda all play different roles. If you don’t get that—if you read formulas in a flat two dimensions without seeing the context or implications or personality of each symbol—you can easily get stuck, sort of like how it would be essentially impossible to read this passage if you had to look up every one of its words in the dictionary.

The “Washington read” for conditional distributions is an example of statistical reading of mathematics. (Another example is that, with rare exceptions, I read “38.24%” or “38.2%” as 38%, or even 40%.)

15 thoughts on “The ‘Washington read’ and the algebra of conditional distributions”

  1. True, but you could also say the same thing for physics or any other application of mathematics where you’re supremely concerned about which variables are known and which ones you’d like to know.

    • Yes, I agree. A good physicist should have no problem with this idea. It’s particularly relevant to statistics, though, because statistics does not generally require the level of mathematical sophistication used in physics, so this particular principle looms large as an obstacle when people are learning how to do applied probability calculations. It can be confusing to students, I think, that the “focus parameter” switches when you move from probability to inference. (Consider, for example, the Poisson distribution, which looks much different as a function of y than as a function of theta.)
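    A quick numerical illustration of that parenthetical (theta = 3 and y = 3 are arbitrary values chosen here for illustration): the same Poisson formula read as a pmf over y versus as a likelihood over theta.

    ```python
    import numpy as np
    from scipy import stats

    theta = 3.0
    y_values = np.arange(0, 16)
    pmf_over_y = stats.poisson.pmf(y_values, theta)   # function of y, theta fixed
    # A genuine probability distribution: nonnegative and (essentially) sums to 1.
    print(pmf_over_y.sum())

    y_obs = 3
    theta_grid = np.linspace(0.01, 15.0, 300)
    lik_over_theta = stats.poisson.pmf(y_obs, theta_grid)  # function of theta, y fixed
    # Same formula, very different object: a smooth, gamma-shaped curve in theta
    # that does not integrate to 1; a likelihood, not a distribution.
    ```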

  2. I like the insight this offers into the basic mechanics of the conditional distribution. As a beginner in Bayesian methods, I have recently found Kruschke’s “Doing Bayesian Data Analysis” very useful. He first approaches the conditional through 2-way tables, restricting attention to one row to develop intuition, then extends it to continuous variables, almost glossing over the integration, and then to higher dimensions. I found this a great way to develop intuition. As a miserable medic I had previously come across Bayes in its “pre-test probability” context, so this teaching method immediately gelled with me. “Restricted attention” is something useful to learn at any stage.
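    For what it’s worth, here is a toy version of that 2-way-table step (the numbers are made up for illustration, not taken from Kruschke’s book): conditioning is just restricting attention to one row and renormalising it.

    ```python
    import numpy as np

    # Joint p(a, b) on a 2x3 table; rows index b, columns index a.
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.05, 0.40, 0.15]])   # entries sum to 1

    b = 1                          # the row we condition on
    row = joint[b]                 # restrict attention to that row
    p_a_given_b = row / row.sum()  # renormalise; row.sum() is the marginal p(b)
    print(p_a_given_b)             # [0.0833..., 0.6666..., 0.25]
    ```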

  3. One of my office mates at CMU always read the index of books to see if he was mentioned. So in my first book, I included an index entry with his name that pointed to the index page it was on.

    Personally, I used to find the notation in BDA and most applied statistics incredibly confusing. When you write Bayes’ rule as p(x|y) = p(y|x) p(x) / p(y), you need to keep in mind that there are four different densities involved, all of which are determined by the joint distribution p(x,y), which your nefarious author may just write as p(y,x) if he or she’s in the mood. The root problem is that the names of the variables are being used to implicitly disambiguate which density is involved. Add to that the fact that p(x,y) is equivalent to p(x|y) up to a multiplicative constant if y is fixed, and students are in for a world of confusion.

    The other problem I had in following BDA is that it conflates random variables and plain old bound variables. On the other hand, following a theory book and defining random variables X and Y and using separate bound variables x and y, then writing p_{X,Y}(x,y), p_{X|Y}(x|y), p_{Y|X}(y|x), p_X(x) and p_Y(y) seems both redundant (though it’s not) and verbose (which it is). It gets really pathological when authors try to play games with expectation notation, which implicitly shifts the interpretation of the random variables.
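    To make the “four different densities” point concrete, here is a small sketch using the explicit subscripted naming (a discrete toy joint, invented for illustration): each density is a different function built from the same table, and the variable names are no longer doing the disambiguating.

    ```python
    import numpy as np

    joint = np.array([[0.1, 0.3],   # p_{X,Y}(x, y): rows index x, columns index y
                      [0.2, 0.4]])

    def p_x(x):            return joint[x, :].sum()       # marginal of X
    def p_y(y):            return joint[:, y].sum()       # marginal of Y
    def p_x_given_y(x, y): return joint[x, y] / p_y(y)    # conditional of X given Y
    def p_y_given_x(y, x): return joint[x, y] / p_x(x)    # conditional of Y given X

    # Bayes' rule is then an identity between four explicitly named functions,
    # and p_{X,Y}(x, y) as a function of x with y fixed is proportional to
    # p_{X|Y}(x | y), with 1 / p_Y(y) as the constant of proportionality.
    x, y = 0, 1
    assert np.isclose(p_x_given_y(x, y), p_y_given_x(y, x) * p_x(x) / p_y(y))
    ```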

  4. Pingback: Shared Articles for October 16th

  5. Feller says not to do what you’re suggesting (I don’t have my copy handy for a reference–in fact, I suspect some Bayesian stole it and burned it as subversive material), but intuitively, there’s a whole lot that can go wrong when the limits of integration are not rectangular.

  6. Andrew: Nicely pointed to.

    Agree it is all too easy to use background concepts students don’t really have, and I have even complained about others apparently being unwilling to leave continuity aside, at least initially, but in an attempt to continue to seem disagreeable ;-)

    Charles Geyer has a nice law of preservation of mathematical difficulty, which would suggest you are likely just shifting what is difficult to grasp somewhere else, though of course changing the particular barriers students need to overcome to appreciate and resolve the difficulty.

    On a personal note, my first clue that I did not understand statistics was when Don Fraser kept emphasising marginal, conditional and joint distributions in class and I could not see what all the fuss was about.

    But also, if there is an interest in focusing marginally on some subset of the parameters, you will need some kind of marginalisation, and with a full joint, integration would win hands down? (Conditioning on R^n usually only simplifies things when n is 1 or 2, at least for me.)

    And then there is Jim Berger’s notation (e.g. in LP) for observed random variables, unobserved random variables split into interest and nuisance variables, unobserved fixed parameters split into interest and nuisance parameters, and unobserved random parameters split into interest and nuisance random parameters (usually considered drawn from distributions whose parameters are some subset of the unobserved fixed parameters split into interest and nuisance parameters).
