Epicurus and Progress on Netflix Prize

Many of you buy and rate books and movies on the web; you click on links, bookmark them, blog about them. By doing this, you leave traces behind. These traces are of great help to those who later find themselves in the same situation as you. Personalization technology tries to help you navigate the choices using the actions of the people who were there before you, together with the implicit (clicks or purchases you’ve made) or explicit (preferences you’ve expressed) knowledge about yourself.

Greg Linden’s blog is an excellent source of insightful posts on personalization technology. A while ago he posted a link to a collection of material from KDD about the Netflix Prize: a challenge to predict how much a customer will like a particular movie, based on that customer’s past movie ratings and on other customers’ ratings of the movies they have seen.
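
Concretely, the task is to fill in a very sparse user-by-movie matrix of one-to-five-star ratings, and submissions are scored by root mean squared error (RMSE). Here is a minimal sketch of that setup, with a handful of made-up ratings and a trivial movie-mean predictor standing in for a real algorithm:

```python
import math
from collections import defaultdict

# Toy ratings: (user, movie) -> stars on a 1-5 scale.  The real data set has
# roughly 100 million such triples; these few are made up for illustration.
train = {
    ("alice", "Constantine"): 4, ("alice", "The Matrix"): 5,
    ("bob",   "Constantine"): 2, ("bob",   "Amelie"):     5,
    ("carol", "The Matrix"):  4, ("carol", "Amelie"):     4,
}
test = {("alice", "Amelie"): 5, ("bob", "The Matrix"): 3}

# A trivial predictor: each movie's mean training rating.
by_movie = defaultdict(list)
for (user, movie), r in train.items():
    by_movie[movie].append(r)
movie_mean = {m: sum(rs) / len(rs) for m, rs in by_movie.items()}

def predict(user, movie):
    return movie_mean.get(movie, 3.0)   # fall back to the scale midpoint

# Root mean squared error, the metric the contest scores on.
rmse = math.sqrt(sum((predict(u, m) - r) ** 2 for (u, m), r in test.items()) / len(test))
print(f"RMSE of the movie-mean baseline on the toy test set: {rmse:.3f}")
```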

What’s notable is that some of the current competition leaders have written extensive papers about their approaches. BellKor’s approach is quite simple, combining nearest-neighbor ideas with a more global factor model. Gravity, on the other hand, employs a diverse collection of tools, including matrix factorization, neural networks, nearest-neighbor models, and clustering. The Gravity team provides an interesting picture of their factor model for the movie Constantine:

[Image: constantine.png – the Gravity team’s inferred factor values for Constantine]

They assume adjacent factors to be correlated, they infer this matrix purely from the ratings data, and they named some of the factors manually at the end. Compare their annotations with the hand-designed list of genres (Action / Drama / Fantasy / Horror / Thriller) or keywords (Androgyny / Twin / Black Cat / Christian Horror / Electrocution / …) assigned to the movie by human editors. Many of these keywords are probably only rarely relevant to whether the movie is worth seeing.
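
To make the factor-model idea concrete: a latent-factor model gives every user and every movie a short vector of learned numbers and predicts a rating from their inner product, so dimensions that end up meaning something like “horror” or “art-house” are discovered from the ratings alone. The sketch below is a generic stochastic-gradient-descent factorization on the same toy ratings as before (redefined so the sketch stands alone); it is an illustration of the idea, not BellKor’s or Gravity’s actual algorithm, and the hyperparameters are made up:

```python
import random

random.seed(0)

# The same toy ratings: (user, movie) -> stars, invented for illustration.
ratings = {
    ("alice", "Constantine"): 4, ("alice", "The Matrix"): 5,
    ("bob",   "Constantine"): 2, ("bob",   "Amelie"):     5,
    ("carol", "The Matrix"):  4, ("carol", "Amelie"):     4,
}

K, LR, REG, EPOCHS = 3, 0.05, 0.02, 1000   # illustrative hyperparameters
mu = sum(ratings.values()) / len(ratings)  # global mean rating

users  = {u for u, _ in ratings}
movies = {m for _, m in ratings}
P = {u: [random.gauss(0, 0.1) for _ in range(K)] for u in users}   # user factors
Q = {m: [random.gauss(0, 0.1) for _ in range(K)] for m in movies}  # movie factors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict(u, m):
    return mu + dot(P[u], Q[m])

for _ in range(EPOCHS):
    for (u, m), r in ratings.items():
        err = r - predict(u, m)            # residual on this known rating
        for k in range(K):                 # gradient step on both factor vectors
            pu, qm = P[u][k], Q[m][k]
            P[u][k] += LR * (err * qm - REG * pu)
            Q[m][k] += LR * (err * pu - REG * qm)

print("prediction for a rated pair (bob, Constantine):   ", round(predict("bob", "Constantine"), 2))
print("prediction for an unseen pair (carol, Constantine):", round(predict("carol", "Constantine"), 2))
```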

In recent weeks, the two currently leading groups (judging from the leaderboard) have consolidated their efforts: the AT&T Labs group BellKor merged with another researcher to form KorBell, and the Hungarian BME group Gravity joined with the Princeton team to form Gravity-Dinosaurs. It is becoming clear that several models are better than just one. I am not sure whether either group used background knowledge that could be obtained by crawling the web or from other databases. At the moment, both groups are at around an 8.4% improvement over the baseline; they need a 10% improvement to win the $1 million grand prize, and a year ago the best improvement was 4.5%.
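
For reference, “improvement” is the relative reduction in RMSE against Netflix’s own Cinematch baseline. The arithmetic is simple; the baseline RMSE of about 0.95 below is quoted from memory, so treat the exact figures as illustrative:

```python
baseline_rmse = 0.9514   # Cinematch's quiz-set RMSE, quoted from memory (illustrative)

def improvement(rmse):
    """Relative RMSE reduction over the baseline, as reported on the leaderboard."""
    return 1.0 - rmse / baseline_rmse

print("RMSE needed for the 10% grand prize:", round(0.90 * baseline_rmse, 4))
print("RMSE at the current 8.4% mark:      ", round((1 - 0.084) * baseline_rmse, 4))
print("sanity check on the formula:        ", round(improvement(0.90 * baseline_rmse), 3))
```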

This is just more evidence that Epicurus’ principle of multiple explanations leads to superior practical results. (The principle, “Keep all hypotheses that are consistent with the facts!”, is often used by proponents of algorithmic probability as a summary of a longer statement from his letter to Herodotus: “If then we think that an event could happen in one or other particular way out of several, we shall be as tranquil when we recognize that it actually comes about in more ways than one as if we knew that it happens in this particular way.”) It corresponds to the Bayesian practice of integrating over all possible parameter settings, assigning them weights in proportion to their predictive power (the likelihood) and in proportion to our a priori trust in them (the prior), instead of picking just the best one. To the famous statement “All models are wrong, but some are useful” we should add: “but use them all!”
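
In its plainest form, “use them all” is just a weighted average of the component models’ predictions, with weights reflecting how much each model is to be trusted, estimated from held-out error or, for the Bayesian, from posterior probability. Here is a minimal sketch of that blending step, with invented component predictions rather than any team’s actual models:

```python
# Two component models' predictions for the same (user, movie) pairs,
# plus the held-out truth.  All numbers are invented for illustration.
neighbor_pred = {("alice", "Amelie"): 4.6, ("bob", "The Matrix"): 3.8}
factor_pred   = {("alice", "Amelie"): 4.1, ("bob", "The Matrix"): 3.1}
truth         = {("alice", "Amelie"): 5.0, ("bob", "The Matrix"): 3.0}

def rmse(pred):
    return (sum((pred[k] - truth[k]) ** 2 for k in truth) / len(truth)) ** 0.5

# Weight each model by (crudely) its inverse squared held-out error; a Bayesian
# would weight by posterior probability instead, but the spirit is the same.
w_nbr = 1.0 / rmse(neighbor_pred) ** 2
w_fac = 1.0 / rmse(factor_pred) ** 2
total = w_nbr + w_fac

blend = {k: (w_nbr * neighbor_pred[k] + w_fac * factor_pred[k]) / total for k in truth}

for name, pred in [("neighbor", neighbor_pred), ("factor", factor_pred), ("blend", blend)]:
    print(f"{name:8s} RMSE = {rmse(pred):.3f}")
# On this toy data the blend beats both of its components.
```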

[Correction 10/12:] Chris Volinsky kindly corrected my statement that BellKor combined their predictions with those of other groups (although other groups offered to do so). They did not, and they argue against this approach. They did, however, combine researchers Bell and Koren with researcher Volinsky to form KorBell.

11 Comments

  1. Greg Linden says:

    Thanks, Aleks, I appreciate the kind words about my blog. This is a very nice summary of the techniques being used by the current leaders of the Netflix contest. Thanks for writing it. I too have noticed that the leaders seem to be combining whatever appears to work into an increasingly large ball of algorithmic goo. As much as I celebrate their tenacity, that strategy worries me. In particular, I am worried about overfitting to the strict letter of this contest. I am worried that Netflix may find that the winning algorithm is actually quite poor at the task at hand — recommending movies to Netflix customers — because it is overoptimized to this particular contest data and the particular success metric of this contest. Even so, Netflix clearly will benefit from seeding a huge amount of new research into large scale recommender systems, whether or not the winner itself actually is useful to them.

  2. Nathan says:

    Kind of reminds me of the book I'm reading, Wisdom of Crowds.

  3. Phil says:

    I always enjoy Aleks's posts, but I often have no idea what the hell they are about. This one is a great example. I don't know what "personalization technology" refers to, or KDD, so the first paragraph doesn't tell me much. I do remember something about a Netflix prize — for figuring out what movies to recommend, I think — but I don't really understand anything about the second paragraph, including especially: what is the problem that people are trying to solve?

    I guess I could follow some of the links and probably figure this out, but that seems like a lot of work. Could someone provide a one- or two-paragraph summary of: what is this post about? Thanks!

  4. Aleks says:

    @Phil: Thanks for the feedback! I've added some more text that will hopefully explain the background better. The post is about the benefits of combining models, based on the example of Netflix. There's a sprinkling of details about the currently leading approaches, and about the philosophy of model combination.

    @Greg: I agree – combining models is the obvious last resort. But BellKor's two models are both quite simple, and I continue to be surprised that nobody is trying to use (or admits to using) background knowledge (not to mention temporal information, "market segmentation", and so on).

    @Nathan: Yes, the point here is that none of the models is right, so a predictively minded Bayesian would instead model the uncertainty about which one is right (or best). On the other hand, an interpretation-minded Bayesian would try to simplify the model so as to make sense of it. Excellent blog of yours, BTW!

  5. ZBicyclist says:

    Sometime when it's a slow blog day you might want to contrast

    Epicurus' principle of multiple explanations ("Keep all hypotheses that are consistent with the facts!"),

    with

    William of Occam's Razor ("Use the simplest explanation possible")

    I've blogged a few thoughts on this at http://mikekr.blogspot.com/

  6. Chris says:

    Aleks,

    For clarification, we (the AT&T team which comprises BellKor and KorBell) resisted combining our results with other teams, as we felt it was against the spirit of the competition.

    Also, the reason that "background knowledge" has not been used is because 1) the rules prohibit use of data that is obtained illegally – and crawling sites like Netflix, IMDB, etc. for a commercial purpose (which this is) is against their end user agreements; and 2) it is unclear that content-based models would add much benefit over and above collaborative filtering (CF). The CF models inherently incorporate metadata – those who like Bond movies like other Bond movies, etc. The power of the CF models combined with statistical regression models, without any movie metadata, has been surprising.

    Finally, I will add that this competition has been a great experience for those of us involved. There have been many great discussions on the forum where competitors have generously shared their approaches, even to the point of posting their code. Also, at the KDD Workshop, organized by Netflix, all of the leading teams presented their work openly, despite the possibility that competitors might benefit. The openness of Netflix (in providing the data) and of our competitors has helped to advance the science of recommender systems.

  7. Aleks says:

    @Chris: Thanks for the correction! Also thanks for clarifications regarding the data access rights: I have not investigated this issue as thoroughly as you have.

    @ZBicyclist: Thanks for the idea, this would indeed deserve a blog entry of its own. But let me give it a quick shot.
    There are two objectives to modeling: 1) predict, and 2) understand. For 1) you should use all the tools you have. For 2) you should seek to explain the complexity of the world in the simplest terms possible, so that students can learn with less effort.
    Sometimes the goals agree: overfitting is both complex and predicts badly. But sometimes they disagree: model combination or Bayesian predictive distributions are complex yet predict well.

  8. tom s. says:

    Can I promote my own essay on the Netflix prize? Not so much about techniques as about the likely limitations of the prize.

    Thanks.

  9. Greg Linden says:

    Hi, Chris. You said:

    It is unclear that content-based models would add much benefit over and above collaborative filtering (CF). The CF models inherently incorporate metadata – those who like Bond movies like other Bond movies, etc.

    Normally I would agree with this, but not here. I think there is some evidence that there simply is not enough data on a percentage of the movies in the training set to make accurate predictions. The problem is not that we do not have the right kind of data, only that we do not have enough data. The only choice when that happens is to get more data. It does not have to be content data. It could be additional ratings data if a source of that could be found. But, I suspect it is true that more data will be necessary to win. I have to admit, however, that I would have thought contest participants already would have needed more data to achieve the scores they have. So, perhaps I do not know what I am talking about.

  10. Yehuda says:

    Hi Greg,

    You wrote:

    I think there is some evidence that there simply is not enough data on a percentage of the movies in the training set to make accurate predictions. The problem is not that we do not have the right kind of data, only that we do not have enough data.

    Well, we will never know till we try, but let me explain the counter argument.

    In terms of data coverage, there is a significant discrepancy between users and movies for two reasons.
    First, there are far fewer movies than users, so individual movies are covered much better than individual users.
    Second, the test set is sampled uniformly across users (but not across movies or ratings). This way, users who don't rate much are over-represented in the test set, whereas for the movies the test set reflects their true distribution, meaning that movies that are rated a lot are proportionally over-represented in the test set.

    Thus, not only do we know a lot about movies, but in addition, movies with a lot of ratings dominate the test set. Overall, we feel that the movie side is pretty much covered, and additional information might not help as much as people hope.

    (We ran a small related experiment, squeezing data from movie titles and release years, which were supplied by Netflix. We could really find a strong signal there; e.g., one can relate "24: season 1" to "24: season 2" just based on the titles' edit distance, or one can find users who prefer old classics and those who prefer the newer fare. However, when we put this information into the total blend, almost all of the captured signal faded. It appears that the rating data already (implicitly) covers these aspects.)

    As for the users, it is the complete opposite story. Now, find me extra information about these anonymous users and I bet that the day after we can split the grand prize :)

    Regards,
    Yehuda Koren
    BellKor team

  11. Silicon Monkey says:

    Blending of results in the Netflix contest is really just a meta-algorithm. If two algorithms are combined to get a better result, then it means either that (1) what is really going on is that the contestants are training against the quiz data set, or (2) each is somehow picking up some signal that the other is missing. Assuming that (2) is what is happening, then ultimately, by comparing the functioning of algorithms that combine well, it may be possible to construct one integrated algorithm that effectively uses all of the information.