The NY Times has a good article on the state of recommender systems: “If You Liked This, You're Sure to Love That”. This is a description of one of the problems:

But his progress had slowed to a crawl. [...] Bertoni says it’s partly because of “Napoleon Dynamite,” an indie comedy from 2004 that achieved cult status and went on to become extremely popular on Netflix. It is, Bertoni and others have discovered, maddeningly hard to determine how much people will like it. When Bertoni runs his algorithms on regular hits like “Lethal Weapon” or “Miss Congeniality” and tries to predict how any given Netflix user will rate them, he’s usually within eight-tenths of a star. But with films like “Napoleon Dynamite,” he’s off by an average of 1.2 stars.

The reason, Bertoni says, is that “Napoleon Dynamite” is very weird and very polarizing. [...] It’s the type of quirky entertainment that tends to be either loved or despised.

And here is the stunning conclusion by fortunately anonymous computer scientists:

Some computer scientists think the “Napoleon Dynamite” problem exposes a serious weakness of computers. They cannot anticipate the eccentric ways that real people actually decide to take a chance on a movie.

Actually, computers do quite a good job modeling probability distributions for the more eccentric and unpredictable among us. Yes, the humble probability distribution, that centuries-old staple of statisticians, is enough to model eccentricity! The problem is that Netflix makes it hard to use sophisticated models: the scoring function is the antiquated, not just pre-Bayesian but actually pre-probabilistic *root mean squared error*, or RMSE. For all practical purposes, the square root in RMSE is a monotonic transformation that won't affect the ranking of recommender models, so we can drop it outright.
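To make the monotonicity point concrete, here is a minimal sketch with made-up per-model MSE values (the model names and numbers are hypothetical): taking the square root never changes which model ranks ahead of which.

```python
import math

# Hypothetical mean squared errors for three candidate recommenders.
model_mse = {"model_a": 0.90, "model_b": 0.74, "model_c": 0.81}

# Rank models by MSE and by RMSE.
by_mse = sorted(model_mse, key=lambda m: model_mse[m])
by_rmse = sorted(model_mse, key=lambda m: math.sqrt(model_mse[m]))

# The square root is monotonic, so the leaderboard is identical either way.
assert by_mse == by_rmse
print(by_mse)  # ['model_b', 'model_c', 'model_a']
```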

So, if one looked at the distribution of ratings for Napoleon Dynamite on Amazon, it has high variance:

On the other hand, Lethal Weapon 4 ratings have lower variance:

If we use the average number of stars as the context-ignorant, unpersonalized predictor (which I've discussed before), ND will give you a mean squared pain of 3.8, and LW4 a mean squared pain of 2.7. Now, your model might choose not to make recommendations with controversial movies – but this won't help you in the Netflix Prize – you're forced to make errors even when you know you're making them. **(R)MSE is pre-probabilistic: it gives no advantage to a probabilistic model that's aware of its own uncertainty.**
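The arithmetic behind this is simple: when you predict the average for everyone, your MSE is exactly the variance of the rating distribution. A sketch with hypothetical star histograms (these are invented counts, not the actual Amazon data above) shows the polarizing, U-shaped movie hurting far more than the consensus movie:

```python
# Hypothetical rating histograms (counts of 1..5 stars), not Amazon's data:
# a polarizing, love-it-or-hate-it movie vs. a consensus movie.
polarizing = {1: 40, 2: 5, 3: 5, 4: 10, 5: 40}   # Napoleon Dynamite-like
consensus = {1: 2, 2: 5, 3: 15, 4: 50, 5: 28}    # Lethal Weapon-like

def mse_of_mean(hist):
    """MSE of predicting the average star rating for everyone.

    This equals the variance of the rating distribution.
    """
    n = sum(hist.values())
    mean = sum(star * count for star, count in hist.items()) / n
    return sum(count * (star - mean) ** 2 for star, count in hist.items()) / n

print(mse_of_mean(polarizing))  # high variance -> high "squared pain"
print(mse_of_mean(consensus))   # low variance -> low "squared pain"
```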

This is an absolutely great point.

Nevertheless, at some point you have to present people with a recommendation, which may or may not be based on predicted ratings. I guess Netflix assumes that you must predict ratings at all costs. Clearly this is false.

Thanks for the insight.

LOL

It's Netflix's problem if their prize criterion doesn't make sense for the purpose they are supposedly after.

As a riddle the prize still stands.

Yeah, great point, but let me say, as a poor PhD student in statistics and a strong Bayesian: I can avoid subscribing to cable but not to Netflix… even if the Netflix people are frequentists eheheheh

Apart from the joke, I was wondering if the Netflix analytics group could complement the RMSE with some sort of mixed effects to tackle the customers' heterogeneity.

Good article, and great point, Aleks. Maybe they should ask for p% predictive intervals for the rating of each movie, and give a prize for the most accurate coverage.

I'm also intrigued by the fact that people rate the same movie differently at different times. It almost sounds like measurement error to me. If a person's "mood" is basically unpredictable, and it affects their rating of a given movie at a given time by +/- 0.5 points, for example, then this adds a layer of randomness to the ratings that may prevent contestants from ever getting 10% improvement over Cinematch.

A hierarchical model fit to enough data, on the other hand, could estimate the within-person variance over repeated ratings of the same movie, and be very useful, without being able to improve RMSE.

I don't see how this is non-Bayesian. The Netflix competition looks to me like it fits in the standard decision-theoretic framework. The action is to score a previously unseen movie. The loss is a convex function of the difference between the true score and the predicted score. You choose the action by minimizing the expected loss, given the customer's scoring history. You can also compare the performance of competing probability models by summing the expected losses over all customers and, presumably, pick the best performing model. The existence of the validation set allows you to confirm that your probability calculations are well calibrated.
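This decision-theoretic framing can be made concrete: under squared loss, the expected-loss-minimizing action is the mean of the predictive distribution, even when that distribution is bimodal and the mean is a rating almost nobody actually gives. A minimal sketch with a hypothetical love-it-or-hate-it predictive distribution:

```python
# Hypothetical predictive probabilities over 1..5 stars for a polarizing movie.
pred = {1: 0.45, 2: 0.0, 3: 0.0, 4: 0.10, 5: 0.45}

def expected_loss(a):
    """Expected squared loss of predicting the single value a."""
    return sum(p * (star - a) ** 2 for star, p in pred.items())

mean = sum(star * p for star, p in pred.items())

# Grid-search the best single prediction on [1, 5] in steps of 0.01.
best = min((a / 100 for a in range(100, 501)), key=expected_loss)

print(mean, best)  # the minimizer coincides with the predictive mean
```

Note that the optimal point prediction sits in the valley between the two modes: squared loss forces you to hedge toward the middle, which is exactly why it cannot reward a model for knowing the distribution is bimodal.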

Mean squared error looks like a perfectly good loss function to me, and I cannot see what is wrong with it being "pre-probabilistic". The choice of loss function is a question of utility (to Netflix) not probability.

What am I missing?

I'm still waiting to hear how a basis-point improvement in RMSE translates to bottom-line profits for Netflix. Is this the case of a drug reducing your cholesterol level without increasing longevity?

Daniel, heterogeneity is in the predictive model, although Netflix could assist it by providing more data.

Anonymous, interval coverage just by itself could reward very wide intervals. I'm fond of the proper loss functions. Probability covers all sources of uncertainty, drift included (usually I wouldn't enjoy watching a movie the second time as much as I did the first time).
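Why coverage alone fails is easy to demonstrate: an absurdly wide interval always covers. A proper interval score (the Winkler/interval score, which adds the interval's width to a miss penalty) separates the honest interval from the lazy one. A sketch with hypothetical ratings drawn from a normal distribution:

```python
import random

random.seed(1)

alpha = 0.2  # nominal 80% intervals
# Hypothetical "true" ratings, roughly N(3.5, 1).
xs = [random.gauss(3.5, 1.0) for _ in range(10_000)]

def interval_score(l, u, x, alpha):
    """Winkler interval score: width plus a penalty for misses. Lower is better."""
    s = u - l
    if x < l:
        s += (2 / alpha) * (l - x)
    if x > u:
        s += (2 / alpha) * (x - u)
    return s

honest = (3.5 - 1.28, 3.5 + 1.28)  # a genuine ~80% interval for N(3.5, 1)
lazy = (0.0, 7.0)                  # a very wide interval that always covers

def coverage(interval):
    l, u = interval
    return sum(l <= x <= u for x in xs) / len(xs)

def mean_score(interval):
    l, u = interval
    return sum(interval_score(l, u, x, alpha) for x in xs) / len(xs)

print(coverage(honest), coverage(lazy))      # lazy "wins" on raw coverage
print(mean_score(honest), mean_score(lazy))  # but loses on the proper score
```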

Martyn, the thing is that Netflix *does* have the option of showing you a recommendation or not; they also have the option of ordering their recommendations. To do this, they would benefit from uncertainty, but they're not rewarding it. So yes, it fits the standard decision-theoretic framework, but couldn't it be improved?

Junkcharts, it's a long path from predictions to recommendations, to clicks, to happiness, to churn/word-of-mouth, to costs and profits…

The fraction of a star gained by the 10% improvement from .95 to .86 RMSE would be hard to see on Netflix's current user interface: the scale is only 4 points (5-1=4), after all, so even .86 is a relatively large error (.86^2 = .74 on the squared-error scale). They often do a more-like-this style recommendation anyway, and you have to browse by genre and sort by stars to get direct personal movie rankings.

At the very least, you could add your posterior uncertainty into the rankings to get a probability estimate you'd like one movie more than another. Of course, some pairs would have high uncertainty about rankings.
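One way to sketch this (the numbers and the normality assumption are mine, purely illustrative): if each movie's predicted rating comes with a posterior mean and sd, and the two posteriors are roughly independent normals, the probability that a user prefers movie A to movie B has a closed form.

```python
from math import erf, sqrt

def prob_prefers(mu_a, sd_a, mu_b, sd_b):
    """P(rating_A > rating_B) for independent normal posteriors."""
    z = (mu_a - mu_b) / sqrt(sd_a**2 + sd_b**2)
    return 0.5 * (1 + erf(z / sqrt(2)))

# Same difference in means, but B's rating is far more uncertain
# in the second case (a polarizing movie), so the ranking is shakier.
print(round(prob_prefers(4.0, 0.3, 3.5, 0.3), 2))  # ~0.88: fairly sure
print(round(prob_prefers(4.0, 0.3, 3.5, 1.5), 2))  # ~0.63: much less sure
```

The point is that identical predicted means can carry very different confidence about the ranking, which an RMSE leaderboard never sees.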

KDD 2007 ran a contest that tried to predict which movies someone would rent next using the Netflix data. That seems like a better approximation of the real problem, both for recommendations and logistics.

What I'd like to see is something that takes the diversity of results into account, like maximum marginal relevance does for search. I don't need to be continually recommended the same popular movie or huge groups of similar movies (like ten seasons of the Simpsons). Finding something with more uncertainty that I haven't heard of would be more useful for me. I also tend to like those movies that split the crowds.

Furthermore, I was astonished to re-read this quote:

(R)MSE is pre-probabilistic: it gives no advantage to a probabilistic model that's aware of its own uncertainty.

This is nonsense. How would you compare, for instance, generalized least squares or iteratively reweighted least squares? In both cases, estimates of uncertainty are indirectly used to minimize the sum of squared errors. These are obscenely old 20th-century methods, probably older than anyone writing this blog.

So, I am at a loss to determine what you find so bad about RMSE. It is clearly feasible to do better if you can estimate your model's uncertainty, and RMSE is, in any case, an arbitrary loss function that anybody offering $1M should be free to select.

RAF