
Expectation propagation as a way of life

Aki Vehtari, Pasi Jylänki, Christian Robert, Nicolas Chopin, John Cunningham, and I write:

We revisit expectation propagation (EP) as a prototype for scalable algorithms that partition big datasets into many parts and analyze each part in parallel to perform inference of shared parameters. The algorithm should be particularly efficient for hierarchical models, for which the EP algorithm works on the shared parameters (hyperparameters) of the model.

The central idea of EP is to work at each step with a “tilted distribution” that combines the likelihood for a part of the data with the “cavity distribution,” which is the approximate model for the prior and all other parts of the data. EP iteratively approximates the moments of the tilted distributions and incorporates those approximations into a global posterior approximation. As such, EP can be used to divide the computation for large models into manageable sizes. The computation for each partition can be made parallel with occasional exchanging of information between processes through the global posterior approximation. Moments of multivariate tilted distributions can be approximated in various ways, including MCMC, Laplace approximations, and importance sampling.

I love love love love love this. The idea is to forget about the usual derivation of EP (the Kullback-Leibler discrepancy, etc.) and to instead start at the other end, with Bayesian data-splitting algorithms, with the idea of taking a big problem and dividing it into K little pieces, performing inference on each of the K pieces, and then putting them together to get an approximate posterior inference.

The difficulty with such algorithms, as usually constructed, is that each of the K pieces has only partial information; as a result, for any of these pieces, you’re wasting a lot of computation in places that are contradicted by the other K-1 pieces.

This sketch (with K=5) shows the story:

[Figure: sketch of the K=5 data-splitting story]

We’d like to do our computation in the region of overlap.

And that’s how the EP-like algorithm works! When performing the inference for each piece, we use, as a prior, the cavity distribution based on the approximation to the other K-1 pieces.
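In the all-Gaussian case the moment matching is exact, which makes for a compact sketch of the cavity/tilted update. The model, partition sizes, and prior below are all made up for illustration, and this is not the paper's implementation; in real applications the tilted moments would come from MCMC, Laplace approximations, or importance sampling, as the abstract describes.

```python
import numpy as np

# Toy all-Gaussian EP: y_i ~ N(theta, 1), prior theta ~ N(0, 10^2).
# Everything here is Gaussian, so the moment matching is exact.
rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=500)
K = 5
parts = np.array_split(y, K)

# Work in natural parameters: (precision, precision * mean).
prior_prec, prior_pm = 1 / 100.0, 0.0
site_prec = np.zeros(K)   # site approximations g_k, initialized flat
site_pm = np.zeros(K)

for sweep in range(5):
    for k in range(K):
        # Global approximation = prior * all sites
        glob_prec = prior_prec + site_prec.sum()
        glob_pm = prior_pm + site_pm.sum()
        # Cavity distribution: the global approximation with site k removed
        cav_prec = glob_prec - site_prec[k]
        cav_pm = glob_pm - site_pm[k]
        # Tilted distribution: cavity * exact likelihood of part k
        # (each observation contributes precision 1 and precision-mean y_i)
        tilt_prec = cav_prec + len(parts[k])
        tilt_pm = cav_pm + parts[k].sum()
        # Update site k so that cavity * site matches the tilted moments
        site_prec[k] = tilt_prec - cav_prec
        site_pm[k] = tilt_pm - cav_pm

post_prec = prior_prec + site_prec.sum()
post_mean = (prior_pm + site_pm.sum()) / post_prec
```

Here the answer reproduces the exact conjugate posterior; the interesting case is when each part's tilted moments require their own (possibly parallel) computation, with the cavity distribution keeping each of those computations focused where the other K-1 pieces say the answer must be.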

Here’s a quick picture of how the cavity distribution works. This picture shows how the EP-like approximation is not the same as simply approximating each likelihood separately. The cavity distribution serves to focus the approximation on the relevant zone of parameter space:

[Figure: the cavity distribution focusing the approximation]

But the real killer app of this approach is hierarchical models, because then we’re partitioning the parameters at the same time as we’re partitioning the data, so we get real savings in complexity and computation time:

[Figure: EP for hierarchical models, partitioning parameters along with data]

EP. It’s a way of life. And a new way of thinking about data-partitioning algorithms.

Damn, I was off by a factor of 2!

I hate when that happens. Demography is tricky.

Oh well, as they say in astronomy, who cares, it was less than an order of magnitude!

“Now the company appears to have screwed up badly, and they’ve done it in pretty much exactly the way you would expect a company to screw up when it doesn’t drill down into the data.”

Palko tells a good story:

One of the accepted truths of the Netflix narrative is that CEO Reed Hastings is obsessed with data and everything the company does is data driven . . .

Of course, all 21st century corporations are relatively data-driven. The fact that Netflix has large data sets on customer behavior does not set it apart, nor does the fact that it has occasionally made use of that data. Furthermore, we have extensive evidence that the company often makes less use of certain data than do most other competitors. . . .

I can’t vouch for the details here but the general point, about what it means to be “data-driven,” is important.

On deck this week

Mon: “Now the company appears to have screwed up badly, and they’ve done it in pretty much exactly the way you would expect a company to screw up when it doesn’t drill down into the data.”

Tues: Expectation propagation as a way of life

Wed: I’d like to see a preregistered replication on this one

Thurs: A key part of statistical thinking is to use additive rather than Boolean models

Fri: Defense by escalation

Sat: Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

Sun: It’s Too Hard to Publish Criticisms and Obtain Data for Replication

The latest episode in my continuing effort to use non-sports analogies

In a unit about the law of large numbers, sample size, and margins of error, I used the notorious beauty, sex, and power example:

A researcher, working with a sample of size 3000, found that the children of beautiful parents were more likely to be girls, compared to the children of less-attractive parents.

Can such a claim really be supported by the data at hand?

One way to get a sense of this is to consider possible effect sizes. It’s hard to envision a large effect; based on everything I’ve seen about sex ratios, I’d say .005 (i.e., one-half of one percentage point) is an upper bound on any possible difference in Pr(girl) comparing attractive and unattractive parents.

How big a sample size do you need to measure a proportion with that sort of accuracy? Since we’re doing a comparison, you’d need to measure the proportion of girls within each group (attractive or unattractive parents) to within about a quarter of a percentage point, or .0025. The standard deviation of a proportion is .5/sqrt(n), so we need to have roughly .5/sqrt(n)=.0025, or n=(.5/.0025)^2=40,000 in each group.
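The back-of-the-envelope calculation above can be checked in a couple of lines (the .0025 target and the .5/sqrt(n) bound are as stated in the text):

```python
# Target: measure each group's Pr(girl) to within about .0025,
# half of the hypothesized .005 upper bound on the difference.
target_se = 0.0025

# The sd of a sample proportion is at most .5/sqrt(n);
# setting .5/sqrt(n) = target_se and solving for n:
n_per_group = (0.5 / target_se) ** 2
print(n_per_group)  # 40000.0 per group, so 80000 total for the comparison
```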

So, to have any chance of discovering this hypothetical difference in sex ratios, we’d need at least 40,000 attractive parents and 40,000 unattractive—a sample of 80,000 at an absolute minimum.

What the researcher actually had was a sample of 3000. Hopeless.

You might as well try to weld steel with a cigarette lighter.

Or do embroidery with a knitting needle.

OK, they’re not the best analogies ever. But I avoided sports!

I like the clever way they tell the story. It’s a straightforward series of graphs but the reader has to figure out where to click and what to do, which makes the experience feel more like a voyage of discovery.

Jonathan Falk asks what I think of this animated slideshow by Matthew Klein on “How Americans Die”:

[Figure: screenshot of the “How Americans Die” slideshow]

Please click on the above to see the actual slideshow, as this static image does not do it justice.

What do I think? Here was my reaction:

It is good, but I was thrown off by the very first page because it says that it looks like progress stopped in the mid-1990s, but on the actual graphs, the mortality rate continued to drop after the mid-1990s. Also, the x-axis labeling was confusing to me: it took a while for me to figure out that the numbers for the years are not written at the corresponding places on the axes, and I wasn’t clear on what the units are on the y-axis.

I guess what I’m saying is: I like the clever way they tell the story. It’s a straightforward series of graphs but the reader has to figure out where to click and what to do, which makes the experience feel more like a voyage of discovery. The only thing I didn’t like was some of the execution, in that it’s not always clear what the graphs are exactly saying. It’s a good idea and I could see it as a template for future graphical presentations.

It’s also an interesting example because it’s not just displaying data, it’s also giving a little statistics lesson.

Don’t, don’t, don’t, don’t . . . We’re brothers of the same mind, unblind


Hype can be irritating but sometimes it’s necessary to get people’s attention (as in the example pictured above). So I think it’s important to keep these two things separate: (a) reactions (positive or negative) to the hype, and (b) attitudes about the subject of the hype.

Overall, I like the idea of “data science” and I think it represents a useful change of focus. I’m on record as saying that statistics is the least important part of data science, and I’m happy if the phrase “data science” can open people up to new ideas and new approaches.

Data science, like just about any new idea you’ve heard of, gets hyped. Indeed, if it weren’t for the hype, you might not have heard of it!

So let me emphasize that, in my criticism of some recent hype, I’m not dissing data science; I’m just trying to help people out a bit by pointing out which of their directions might be more fruitful than others.

Yes, it’s hype, but I don’t mind

Phillip Middleton writes:

I don’t want to rehash the Data Science / Stats debate yet again. However, I find the following post quite interesting from Vincent Granville, a blogger and heavy promoter of Data Science.

I’m not quite sure if what he’s saying makes Data Science a ‘new paradigm’ or not. Perhaps it is reflective of something new apart from classical statistics, but then I would also say so of Bayesian analysis as paradigmatic (or at least a still budding movement) itself. But what he alleges – i.e., that ‘Big Data’ by its very existence necessarily implies that the cause of a response/event/observation can be ascertained, and seemingly w/o any measure of uncertainty – seems rather ‘over-promising’ and hypish.

I am a bit concerned with what I’m thinking he implies regarding ‘black box’ methods – that is, the blind reliance upon them by those who are technically non-proficient. I feel the notion that one should always trust ‘the black box’ is not in alignment with reality.

He does appear to discuss dispensing with p-values. In a few cases, like NHST, I’m not totally inclined to disagree (for reasons you speak about frequently), but I don’t think we can be quite so universal about it. That would pretty much throw out most every frequentist test wrt comparison, goodness-of-fit, what have you.

Overall I get the feeling that he’s implying the ‘new’ era as one of solving problems w/ certainty, which seems more the ideal than the reality.

What do you think?

OK, so I took a look at Granville’s post, where he characterizes data science as a new paradigm “very different, if not the opposite of old techniques that were designed to be implemented on abacus, rather than computers.”

I think he’s joking about the abacus but I agree with this general point. Let me rephrase it from a statistical perspective.

It’s been said that the most important thing in statistics is not what you do with the data, but, rather, what data you use. What makes new statistical methods great is that they open the door to the use of more data. Just for example:

- Lasso and other regularization approaches allow you to routinely throw in hundreds or thousands of predictors, whereas classical regression models blow up at that. Now, just to push this point a bit, back before there was lasso etc., statisticians could still handle large numbers of predictors, they’d just use other tools such as factor analysis for dimension reduction. But lasso, support vector machines, etc., were good because they allowed people to more easily and more automatically include lots of predictors.

- Multiple imputation allows you to routinely work with datasets with missingness, which in turn allows you to work with more variables at once. Before multiple imputation existed, statisticians could still handle missing data but they’d need to develop a customized approach for each problem, which is enough of a pain that it would often be easier to simply work with smaller, cleaner datasets.

- Multilevel modeling allows us to use more data without having that agonizing decision of whether to combine two datasets or keep them separate. Partial pooling allows this to be done smoothly and (relatively) automatically. This can be done in other ways but the point is that we want to be able to use more data without being tied up in the strong assumptions required to believe in a complete-pooling estimate.

And so on.
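As a quick illustration of the multilevel-modeling point, here is the standard precision-weighted partial-pooling estimate, which interpolates smoothly between the complete-pooling and no-pooling extremes. All the numbers, including the hyperparameters mu and tau, are invented for the example:

```python
import numpy as np

# Hypothetical per-group estimates y_j with standard errors sigma_j
# (all numbers made up for the illustration).
y = np.array([2.8, 0.8, -0.3, 1.5])
sigma = np.array([0.8, 1.0, 1.1, 0.6])
mu, tau = 1.0, 0.5   # assumed population mean and between-group sd

# Partial pooling: a precision-weighted compromise between each group's
# own estimate and the population mean. tau -> 0 recovers complete
# pooling (every group at mu); tau -> infinity recovers no pooling.
w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
theta = w * y + (1 - w) * mu
print(theta.round(2))   # each estimate shrunk toward mu
```

The point of the full hierarchical treatment is that mu and tau are themselves estimated from the data rather than fixed, so the degree of pooling is chosen automatically.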

Similarly, the point of data science (as I see it) is to be able to grab the damn data. All the fancy statistics in the world won’t tell you where the data are. To move forward, you have to find the data, you need to know how to scrape and grab and move data from one format into another.

On the other hand, he’s wrong in all the details

But I have to admit that I’m disturbed by how much Granville gets wrong. His buzzwords include “Model-free confidence intervals” (huh?), “non-periodic high-quality random number generators” (??), “identify causes rather than correlations” (yeah, right), and “perform 20,000 A/B tests without having tons of false positives.” OK, sure, whatever you say, as I gradually back away from the door. At this point we’ve moved beyond hype into marketing.

Can we put aside the cynicism, please?

Granville writes:

Why some people don’t see the unfolding data revolution?
They might see it coming but are afraid: it means automating data analyses at a fraction of the current cost, replacing employees by robots, yet producing better insights based on approximate solutions. It is a threat to would-be data scientists.

Ugh. I hate that sort of thing, the idea that people who disagree with you do so for corrupt reasons. So tacky. Wake up, man! People who disagree with you aren’t “afraid of the truth,” they just have different experiences than yours, they have different perspectives. Your perspective may be closer to the truth—as noted above, I agree with much of what Granville writes—but you’re a fool if you so naively dismiss the perspectives of others.

Saying things that are out of place


Basbøll points us to a column by Michael Shermer, a journalist and self-described skeptic who’s written a lot about skepticism, atheism, etc. Recently, though, Shermer wrote of an event that “shook [his] skepticism to its core”—it was a story about an old radio that didn’t work, then briefly started to work again, then stopped working.

From the outside it doesn’t sound like much (and indeed Shermer’s blog commenters aren’t particularly impressed) but, hey, they all laughed at Arthur Conan Doyle when he said he saw pictures of fairies (see image above), but who’s laughing now???

OK, sure, it’s easy to mock Doyle or Shermer (indeed, I couldn’t resist a bit of mockery myself in my comment to Basbøll’s above-linked post) or to move gently from mocking to being patronizing, saying that Doyle’s spiritualism is understandable given the sad events of his life, or that Shermer is being charmingly romantic in telling a story about his bride. Maybe we need a bit more romanticism and sentimentality when it comes to later-in-life weddings.

But I don’t want to take either of these paths here. Instead I’d like to talk about the somewhat unstable way in which we use different sources of discourse in different aspects of life, and the awkwardness that can arise when we use the wrong words at the wrong time.

For an obvious example, I’m all lovey-lovey when I talk with my family (except when we’re screaming at each other, of course) but that wouldn’t be appropriate at work. Sure, sometimes, I’ll get overcome with emotion and say “I love you guys” to my class, but it’s pretty rare, it certainly wouldn’t be appropriate to do that every day.

We also cordon off different aspects of inquiry. I have no problem mocking studies of fat arms and voting or whatever, but you’re not gonna see me making fun of the Bible here. Why? It’s not that the Bible is sacred to me, it just seems like it’s on a different dimension. It’s not claiming to be science. It’s a bunch of stories. If people want to believe that Moses crossed the Red Sea, or for that matter that there was an actual Moses or an actual King Arthur or whatever, fine. It doesn’t seem to interact in any direct way with statistical modeling, causal inference, or social science so it’s not particularly relevant to what we’re doing here.

So Shermer did this goofy thing where he was using romantic love discourse in a space where people were expecting science discourse or journalism discourse. It didn’t really work.

Sometimes the practice of unexpected forms of discourse can produce interesting results. For example, say what you want about Scott Adams, but sometimes his offbeat cartoonist’s perspective on public affairs can be interesting. So my message is not that everyone needs to stay in his place, or that Shermer shouldn’t display his romantic love in a column about skepticism. What I’m saying is that our norm is to evaluate statements in their context. When a comedian says something on a sitcom, we evaluate it based on how funny it is (and maybe on how offensive it might be), when a self-declared skeptic writes a column, we evaluate things in that way, etc. A story doesn’t exist on its own, it exists in context.

We tend to assume that, since Shermer labels himself a skeptic, he is not superstitious and should “know better” than to believe in ghosts. But maybe that’s a misguided view of Shermer. Perhaps he has strong superstitious feelings but has been convinced that those feelings are unscientific, hence his career as a skeptical journalist; the superstition is still there and bubbles up from time to time.

That’s just a story, of course, but my point is that it’s natural to interpret the Shermer story in terms of some frame or another. The frame “prominent skeptic is stunned by a coincidence” is just one way to interpret this story.

Next Generation Political Campaign Platform?


[This post is by David K. Park]

I’ve been imagining the next generation political campaign platform. If I were to build it, the platform would have five components:

  1. Data Collection, Sanitization, Storage, Streaming and Ingestion: This area will focus on the identification and development of the tools necessary to acquire the correct data sets for a given campaign, sanitizing the data and readying it for ingestion into the analytical components of the framework. This includes geotagged social media data, such as Twitter, Facebook, Instagram, Pinterest, Vine, etc., and traditional local news, etc., focused not only on the candidate but on the challenger as well.
    • [side note (and potentially useless) idea: Embed rfid or inexpensive sensors to campaign lawn signs so we can measure how many people/cars pass by the sign.]
  2. Referential Data Sets: This area will focus on the identification and development of data sets that are referential in nature. Such sets might be databases and/or services to assist with geolocation, classification, etc. This includes demographic and marketing data, campaign-specified sources such as donors, and surveys, as well as data from Catalist and the Atlas Project.
  3. Analytics Engine: This area will focus on the identification and development of the tools necessary to provide the core analytical work for the specific project. Here, I’m thinking of statistical (use Stan, of course), machine learning and NLP packages, both open source and commercially available. This includes language sentiment analysis, polling trends, and so on.
  4. Model Forms: These would be the models, and the underlying software to drive the models, used within the analysis. We can readily exploit the packages that already exist in this area, whether Python, R, Umbra, etc., and build custom models where necessary. This includes direct marketing impact analysis (i.e., A/B testing campaigns), overall metrics for campaign health, election prediction, and more.
  5. Interpretation of Results, Data Visualization, and Visual Steering: Identifying and developing the data visualization toolkits necessary to provide insights by adequately displaying the visual representations of quantitative and statistical information. Further, solving the problem of getting resultant data sets to the visualization system in a reliable fashion and making this connection tightly coupled and full duplex, allowing for a visual steering model to emerge. This includes geographic and demographic segmentation, overview of historical political context, results of various marketing messages, etc.

Just a (rough) thought at this point…

The Fallacy of Placing Confidence in Confidence Intervals

Richard Morey writes:

On the tail of our previous paper about confidence intervals, showing that researchers tend to misunderstand the inferences one can draw from CIs, we [Morey, Rink Hoekstra, Jeffrey Rouder, Michael Lee, and EJ Wagenmakers] have another paper that we have just submitted which talks about the theory underlying inference by CIs. Our main goal is to elucidate for researchers why many of the things commonly believed about CIs are false, and to show that the theory of CIs does not offer a very compelling theory for inference.

One thing that I [Morey] have noted going back to the classic literature is how clear Neyman seemed about all this. Neyman was under no illusions about what the theory could or could not support. It was later authors who tacked on all kinds of extra interpretations to CIs. I think he would be appalled at how CIs are used.

From their abstract:

The width of confidence intervals is thought to index the precision of an estimate; the parameter values contained within a CI are thought to be more plausible than those outside the interval; and the confidence coefficient of the interval (typically 95%) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that CIs do not necessarily have any of these properties, and generally lead to incoherent inferences. For this reason, we recommend against the use of the method of CIs for inference.
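A standard example of the precision problem (along the lines of examples in the confidence-interval-criticism literature; the details here are my own reconstruction) is a perfectly valid 50% confidence procedure whose width is anti-correlated with its chance of covering the truth: take two observations uniform on an interval of width 1 around the parameter, and report the interval from the smaller to the larger observation.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0                     # true parameter
n_sim = 100_000
x = rng.uniform(theta - 0.5, theta + 0.5, size=(n_sim, 2))
lo, hi = x.min(axis=1), x.max(axis=1)

covered = (lo <= theta) & (theta <= hi)
width = hi - lo

print(covered.mean())           # ~0.50: a valid 50% confidence procedure

# But width does not index precision here; the wide intervals are the
# sure ones (any interval wider than 0.5 must straddle theta), and the
# narrow intervals almost never cover.
wide = width > 0.9
print(covered[wide].mean())     # 1.0: wide intervals always cover
narrow = width < 0.1
print(covered[narrow].mean())   # ~0.05: narrow intervals rarely cover
```

So reporting "my interval is narrow, hence my estimate is precise" gets the situation exactly backwards for this procedure, even though its 50% frequentist coverage guarantee is correct.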

I agree, and I too have been pushing against the idea that confidence intervals resolve the well-known problems with null hypothesis significance testing. I also had some specific thoughts:

For another take on the precision fallacy (the idea that the width of a confidence interval is a measure of the precision of an estimate), see my post, “Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests.” See in particular the graph which illustrates the problem very clearly, I think:

Regarding the general issue that confidence intervals are no inferential panacea, see my recent article, “P values and statistical practice,” in which I discuss the problem of taking a confidence interval from a flat prior and using it to make inferences and decisions.

My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% conf interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic. And taking a confidence interval and treating it as a posterior interval gives problems too. Hence the generic advice to look at confidence intervals rather than p-values does not solve the problem.
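This hypothetical example is easy to reproduce with a grid approximation on the log-odds-ratio scale. The standard error below is reverse-engineered from the [1.1, 8.2] interval, and the normal(0, 0.5) prior on the log odds ratio is one arbitrary choice of weakly informative prior, not a recommendation from any particular source:

```python
import numpy as np

# Center and se on the log scale, chosen to match a point estimate of
# 3.0 with a 95% interval of [1.1, 8.2].
est = np.log(3.0)
se = (np.log(8.2) - np.log(1.1)) / (2 * 1.96)

grid = np.linspace(-3, 4, 20001)            # grid over the log odds ratio
lik = np.exp(-0.5 * ((grid - est) / se) ** 2)

def interval(post):
    """Central 95% interval of a gridded density, back on the OR scale."""
    post = post / post.sum()
    cdf = np.cumsum(post)
    return (np.exp(grid[np.searchsorted(cdf, 0.025)]),
            np.exp(grid[np.searchsorted(cdf, 0.975)]))

# Flat prior: the posterior interval just reproduces the confidence interval.
print(interval(lik))                        # ~ (1.1, 8.2)

# Weakly informative prior, log OR ~ normal(0, 0.5): real-world odds
# ratios are rarely huge, so the interval gets pulled sharply toward 1.
prior = np.exp(-0.5 * (grid / 0.5) ** 2)
print(interval(lik * prior))                # well below (1.1, 8.2)
```

The data haven't changed between the two calls; only the prior has. The flat prior is what quietly endorses those implausibly large odds ratios near 8.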

I think the Morey et al. paper is important in putting all these various ideas together and making it clear what are the unstated assumptions of interval estimation.