It’s been a dramatic month. A month ago, a coalition of some of the leading teams qualified for the $1 million grand prize by improving the accuracy of the movie-recommending model by more than 10%. But the rules kept the competition open for another 30 days, in case someone else was able to improve upon the result. That happened less than a day before the deadline, courtesy of The Ensemble, an enormous team composed of 23 previously separate teams and individuals. Of course, most of the progress towards the victory came from models that made use of significant new patterns in the data, such as the effect of time.

The development of an ensemble from many separate teams was another accomplishment, and the GPT’s inclusion rules provide some insight into the process: “shares” of the winnings were distributed based on how much each contribution improved the result, measured in percentage points. Simon Owens describes what it was like to participate in The Ensemble.

Bayesian statistics always works with ensembles: the posterior is a weighted average of all models, the weight being based on the fit of each model times the prior quality of the model. There are some additional Bayesian elements that could be a part of future competitions, such as Bayesian scoring functions.
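The weighted average described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three models, their log-likelihoods, priors, and predicted ratings are made-up numbers, not anything from the competition, and the function names are hypothetical.

```python
import math

def posterior_model_weights(log_likelihoods, priors):
    """Weight of each model: proportional to prior * likelihood (fit)."""
    log_unnorm = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_unnorm)
    unnorm = [math.exp(v - top) for v in log_unnorm]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def averaged_prediction(predictions, weights):
    """Posterior-weighted average of the models' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical inputs: fit (log-likelihood) and prior for three models,
# plus each model's predicted rating for one user/movie pair.
log_likelihoods = [-120.0, -118.0, -125.0]
priors = [0.5, 0.3, 0.2]
predictions = [3.6, 4.1, 2.9]

weights = posterior_model_weights(log_likelihoods, priors)
bma_rating = averaged_prediction(predictions, weights)
print(weights, bma_rating)
```

The second model gets the largest weight here because its better fit outweighs its smaller prior; the averaged rating always lies between the lowest and highest individual predictions.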

In the past I was asked to contrast Occam’s razor with the Epicurean principle. Occam’s razor is the Bayesian prior, or the yang principle: simpler models have greater a priori weight (because we tend to economize on that which is useful). Occam’s razor goes back to Aristotle, who wrote *“For the more limited, if adequate, is always preferable,”* and *“For if the consequences are the same, it is always better to assume the more limited antecedent”* in his Physics. We mathematically express it as the prior.

The Epicurean principle is the yin, mathematically expressed as the integral over the model space. Ensembles go back to Epicurus’ letter to Herodotus: *“When, therefore, we investigate the causes of [...] phenomena, [...] we must take into account the variety of ways in which analogous occurrences happen within our experience.”* Thus, Bayesian statistics combines the yin and the yang, balancing the pursuit of simplicity with the limitations of uncertainty.

[7/31/09: Added a link to Simon Owens' interview with The Ensemble.]

I find it a little unsettling that the winner was an ensemble of methods. Even though each method might make sense, when you mash them all together, it feels too much like a black box to me. I'd love to see an interpretable model do even better.

Is it really that much of a black box? Get ten friends who know you pretty well, ask them what movie you should watch this weekend, then watch whatever the plurality recommends. Nothing terribly mysterious there. The ensemble approach is the same thing, using models instead of friends.
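The friends analogy amounts to a plurality vote, which can be sketched as follows. The friend names and movie titles are placeholders, and real ensembles for rating prediction usually blend numeric predictions rather than count votes, so treat this as an illustration of the analogy only.

```python
from collections import Counter

def plurality_pick(recommendations):
    """Return the option recommended most often.

    Ties go to the option that was recommended first.
    """
    counts = Counter(recommendations.values())
    winner, _ = counts.most_common(1)[0]
    return winner

# Hypothetical recommendations from five friends.
recommendations = {
    "friend_a": "Movie X",
    "friend_b": "Movie Y",
    "friend_c": "Movie X",
    "friend_d": "Movie Z",
    "friend_e": "Movie X",
}

choice = plurality_pick(recommendations)
print(choice)
```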

OK, so maybe not a black box. But with your analogy, I would think with all the data available for me, there would exist a friend who was more expert than all others combined.

… model averaging does up-weight models that do better – so, given sufficient training data, it does "listen" to the expert friend more carefully than others.
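A rough sketch of why more training data makes the average "listen" to the better model: with two models, the log posterior odds equal the log prior odds plus the accumulated log-likelihood gap. The per-observation gap of 0.05 and the equal priors below are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math

def weight_on_better_model(n_obs, per_obs_log_lik_gap=0.05, prior_better=0.5):
    """Posterior weight of the better of two models after n_obs observations.

    Assumes the better model gains a constant log-likelihood advantage
    (per_obs_log_lik_gap) on every observation.
    """
    log_prior_odds = math.log(prior_better / (1.0 - prior_better))
    log_posterior_odds = log_prior_odds + n_obs * per_obs_log_lik_gap
    # Convert log odds back to a probability (logistic function).
    return 1.0 / (1.0 + math.exp(-log_posterior_odds))

for n in (0, 10, 100, 1000):
    print(n, weight_on_better_model(n))
```

With no data the two models split the weight evenly; as observations accumulate, the weight on the better-fitting model climbs towards 1.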

Actually, I find ensemble methods less black-boxy.

First of all, there are the component models. These can often be understood in isolation.

Second, there's the combination rule. In the overly simple case, we have inverse variance weighting, which is pretty intuitive.
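Inverse variance weighting can be sketched like this, assuming the component models' errors are independent and their variances are known; the two estimates and variances are made-up numbers, and the function name is hypothetical.

```python
def inverse_variance_combine(estimates, variances):
    """Combine estimates, weighting each by 1 / variance.

    Returns the combined estimate and its variance, which is
    1 / sum(1 / variance_i) when the errors are independent.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    combined_variance = 1.0 / total
    return combined, combined_variance

# Hypothetical predictions of one rating from two models:
# the first is four times less noisy than the second.
estimates = [3.0, 4.0]
variances = [1.0, 4.0]

combined, combined_variance = inverse_variance_combine(estimates, variances)
print(combined, combined_variance)
```

The combined estimate sits closer to the less noisy model, and its variance is smaller than that of either component.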

Interestingly, divide $1,000,000 among 23 groups and individuals and that's not much after taxes. It's about the game, not the reward.

Anon – interpretability is always a plus. Often there will be clusters in the ensemble, simplifying the interpretation.

While Bayesian stats works with ensemble methods, it's not really very good. It's like using horse and buggy in the time of jet travel: you can get there, but it will be slow and a pain in the rear on a long road. Bayesian methods, like their Frequentist counterparts, are 20th century methods. Modern stats on this scale is algorithmic in nature. I just happened on your blog, but I hope you don't get too involved in Bayesian methods. I wouldn't bother picking up Bayesian methods, and instead look to methods by Friedman, Breiman, and others. Bayesian Model Averaging, for instance, is cute, but not really worth the time. Notice that the 2nd place finisher included a BMA expert, and notice how little progress they made in 30 days, even after adding in a far better team and their methods (Pragmatic Theory).

RAF: Regarding the effectiveness of Bayesian methods, see here.