Mark Palko writes:

I’ve got a stat problem I’d like to run past you. It’s one of those annoying problems that feels like it should be obvious but the solution has evaded me and the colleagues I’ve discussed it with. I’m working on a project where the metric of interest is defined in relation to pairs of data points. It has nothing to do with sports or betting but the following analogy (which I also post on the blog) covers the basic situation:

“You want to build a model predicting the spread for games in a new football league. Because the line-up of teams is still in flux, you decide to use only stats from individual teams as inputs (for example, an indicator variable for when the Ambushers play the Ravagers would not be allowed).”

Is there a standard approach for modeling this kind of data?

My reply: I don’t quite understand your question, but are you familiar with the Bradley-Terry and Thurstone-Mosteller models for paired comparisons? These are old (from the 1920s and 1950s, I believe), but they might do what you need. Interesting work has been done on these models recently by Hal Stern, Mark Glickman, and others, to allow the underlying parameters to vary over time.
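For readers who have not met the Bradley-Terry model: it gives each team i a positive strength p_i and sets the probability that i beats j to p_i / (p_i + p_j). Below is a minimal Python sketch of estimating those strengths with the usual iterative (minorization-maximization) update. The function names and the win counts are made up for illustration, and the sketch assumes every team has at least one win and one loss.

```python
def fit_bradley_terry(wins, n_iter=500):
    """Estimate Bradley-Terry strengths; wins[i][j] = times i beat j."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Games between i and j, weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        p = [v / scale for v in new_p]  # normalize so strengths sum to 1
    return p


def win_probability(p, i, j):
    """Model probability that team i beats team j."""
    return p[i] / (p[i] + p[j])


# Hypothetical win counts for three teams (illustration only).
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = fit_bradley_terry(wins)
```

Note that this uses only game outcomes, not team-level predictors; adding predictors would mean modeling the log of each strength as a function of team attributes.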

If I understand the problem correctly, it has nothing to do with "paired comparisons" as the term is used in statistics.

I imagine that for each team there is information such as how fast the quarterback can run 50 yards and the total weight of the offensive line. There is data on the difference in scores for games played along with this information on the two teams. We want a regression model that will take this information for both teams and predict the score difference.

This seems to me to be a mostly standard regression problem, except that the predictors come in two corresponding groups. In particular, swapping the two groups of predictors should just result in negating the predicted score difference. To ensure this, perhaps one can just duplicate all the observations, once with predictors of (team A, team B), the other with (team B, team A), with the score difference for one being the negation of the score difference for the other.
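A rough sketch of the duplication trick in Python with NumPy, on simulated data. The function names, the two team statistics and the simulated scores are all assumptions for illustration. The fit has no intercept, since an intercept would not flip sign when the teams are swapped.

```python
import numpy as np


def fit_swap_symmetric(team_a, team_b, score_diff):
    """Least squares on the data stacked with its team-swapped copy."""
    team_a = np.asarray(team_a, dtype=float)
    team_b = np.asarray(team_b, dtype=float)
    score_diff = np.asarray(score_diff, dtype=float)
    # Original rows (A, B) followed by swapped rows (B, A).
    X = np.vstack([np.hstack([team_a, team_b]),
                   np.hstack([team_b, team_a])])
    # The swapped copy gets the negated score difference.
    y = np.concatenate([score_diff, -score_diff])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_diff(coef, stats_a, stats_b):
    """Predicted score difference (A minus B) for one matchup."""
    return float(np.concatenate([stats_a, stats_b]) @ coef)


# Simulated games: two made-up statistics per team.
rng = np.random.default_rng(0)
n_games, n_stats = 40, 2
team_a = rng.normal(size=(n_games, n_stats))
team_b = rng.normal(size=(n_games, n_stats))
score_diff = 3.0 * (team_a[:, 0] - team_b[:, 0]) + rng.normal(size=n_games)

coef = fit_swap_symmetric(team_a, team_b, score_diff)
```

With this setup the fitted coefficients on team B's statistics come out as the negatives of those on team A's, so the fit is equivalent to regressing the score difference on the differences in team statistics.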

Andrew, I think you misread or misunderstood the question: there is no information (or at least, you are not allowed to use it) about the performance when one team plays another. As I understand it, no team has ever played any other team.

But then, Palko: it's not clear what data you _do_ have. If we take the sports analogy literally (and, by the way, I really think you would be better off discussing your problem, not a different problem that is somewhat analogous to your problem), then perhaps you have data like the salaries of the individual players, their speed in running 40 yards, the weight they can bench-press, the number of times they were All Americans in college, and so on, and you're trying to predict how each team will compete against each other team? What you need, obviously, is _some_ data that allow you to relate your explanatory variables to relative team abilities. For instance, if you have data from a different league, you could look at those data: in ExistingLeague, a team has a 0.3-point advantage, on average, for each All Star player it has above the number that the other team has. If you add the 40-yard times of their starting receivers, the faster team has a 0.1-point advantage for each 0.3 seconds of time advantage. And so on. Basically, you could create a predictive model based on team attributes by looking at how those attributes predict victory odds for a case where you _do_ have paired comparisons, and then you would apply that model to the new season or new league or whatever.
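To spell out the arithmetic with those illustrative figures (they are hypothetical numbers from the paragraph above, not estimates from any real league), here is a small Python sketch. The function and constant names are invented for the example.

```python
# Illustrative coefficients taken from the hypothetical ExistingLeague example.
POINTS_PER_ALL_STAR = 0.3          # points per extra All Star player
POINTS_PER_SECOND = 0.1 / 0.3      # 0.1 points per 0.3 seconds of speed edge


def predicted_spread(all_star_edge, speed_edge_seconds):
    """Predicted point advantage for a team, given its edge over the opponent.

    Both arguments are differences (this team minus the other team); a
    negative value means the other team holds that advantage.
    """
    return (POINTS_PER_ALL_STAR * all_star_edge
            + POINTS_PER_SECOND * speed_edge_seconds)


# A team with 2 more All Stars and a 0.6-second advantage in receiver times.
spread = predicted_spread(2, 0.6)
```

Here `spread` works out to about 0.8 points: 0.6 from the All Stars plus 0.2 from the speed advantage. Swapping the two teams flips the sign of both inputs and therefore of the prediction.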

If you don't have _any_ paired comparisons data from anywhere — it's not just a new league, it's a newly invented sport, and you have no data on how much of an advantage it is to be tall or strong or fast or well-paid or whatever — then I honestly don't see how you can approach this as a statistical problem.

I don't quite understand the problem setup (like Phil, I don't understand what data you do have), but let's take this analogy.

Let's suppose these are teams of CHESS players. They have never played each other, but all have Elo ratings

http://en.wikipedia.org/wiki/Elo_rating_system

which are based on games they have won or lost against other players (none of whom have to be in common, although there's an assumption that there are some distant linkages so you are not dealing with two completely isolated populations).

The Elo system is fundamentally a paired comparison system, is quite flexible, and there's sufficient work on it so you can determine probabilities of winning. Chess has no equivalent of point spreads, but Elo ratings have spread to other contexts that do have point spreads (e.g. college football). For more info, see the Wiki reference above. I'm no expert either with the Elo system or chess (1650 rating).
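For anyone who wants to see the mechanics, here is a short Python sketch of the standard Elo expected-score formula and rating update. The function names are mine, and the K-factor of 20 is an assumption (different federations and rating pools use different values).

```python
def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=20.0):
    """New rating for A after one game.

    score_a is 1 for a win, 0.5 for a draw, 0 for a loss.
    k is the K-factor; 20 is an assumed value for illustration.
    """
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))


# Expected score of a 1650-rated player against a 2050-rated opponent.
underdog_expectation = elo_expected(1650, 2050)
```

Under this formula a 400-point rating gap corresponds to an expected score of about 0.09 for the lower-rated player.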

I'm afraid my analogy may have clouded rather than clarified.

Here's the actual problem:

I have a text mining tool that measures, for lack of a better word, the similarity between two pieces of text. I've run through samples from a number of authors and gotten the results you'd expect. For example, Dickens is more similar to Trollope than to Twain and more similar to Twain than to Veblen (I put Thorstein in as kind of an intentional outlier).

There is an excellent body of literature on the components of the tool, but they all approach the problem as Bayesian classification — what is the likelihood that this passage came from group A compared to the likelihood it came from group B. When I extend that to look at the relative similarities of n different authors the classification approach doesn't really apply.

I'd like to know more about the drivers of similarity. Normally I'd just build a model at this point, but in all of the models I'm familiar with the attribute of interest is associated with each individual observation. Here I want to use attributes associated with individual authors to predict an attribute associated with a pair of authors. That seems like it should be a simple problem but for some reason I can't see an obvious solution.
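One way to make this concrete, offered only as a sketch under my own assumptions: treat each pair of authors as one observation, build its predictors as symmetric functions of the two authors' attributes (so the order of the pair does not matter), and fit an ordinary regression on those pair-level rows. The attributes, the similarity scores and every function name below are hypothetical placeholders, and the data are simulated.

```python
import itertools

import numpy as np


def pair_features(x_i, x_j):
    """Symmetric pair-level predictors: absolute differences and sums."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return np.concatenate([np.abs(x_i - x_j), x_i + x_j])


def build_pair_table(attrs, similarity):
    """One row per unordered pair of authors, with its similarity score."""
    similarity = np.asarray(similarity, dtype=float)
    rows, targets = [], []
    for i, j in itertools.combinations(range(len(attrs)), 2):
        rows.append(pair_features(attrs[i], attrs[j]))
        targets.append(similarity[i, j])
    return np.array(rows), np.array(targets)


# Simulated placeholder data: 6 authors with 2 made-up attributes each.
rng = np.random.default_rng(1)
n_authors = 6
attrs = rng.normal(size=(n_authors, 2))
similarity = np.zeros((n_authors, n_authors))
for i, j in itertools.combinations(range(n_authors), 2):
    # Similarity driven only by the gap in the first attribute.
    similarity[i, j] = similarity[j, i] = -abs(attrs[i, 0] - attrs[j, 0])

X, y = build_pair_table(attrs, similarity)
design = np.column_stack([np.ones(len(y)), X])  # add an intercept column
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

One caveat on this setup: pairs that share an author are not independent observations, so the usual regression standard errors would be too optimistic.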

I hadn't been familiar with Elo ratings until Andrew pointed out the research on the subject by Hal Stern and Mark Glickman.

Perhaps what you want is something along the lines of what is called "multi-dimensional scaling".
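As a rough illustration of the idea, here is classical (metric) multi-dimensional scaling in Python with NumPy: it takes a matrix of pairwise dissimilarities and returns coordinates whose distances approximate them. The function name is mine, and the dissimilarity numbers are invented for the example; they are not output from any text mining tool.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Classical MDS: embed points so distances approximate `dist`."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to get a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip small negatives to zero.
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Made-up dissimilarities between four authors (illustration only).
authors = ["Dickens", "Trollope", "Twain", "Veblen"]
dist = np.array([
    [0.0, 1.0, 2.0, 4.0],
    [1.0, 0.0, 2.5, 4.0],
    [2.0, 2.5, 0.0, 3.5],
    [4.0, 4.0, 3.5, 0.0],
])
coords = classical_mds(dist, n_dims=2)  # one (x, y) row per author
```

Each row of `coords` can then be plotted, and the axes examined to see which author attributes line up with them.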

I wasn't familiar with multi-dimensional scaling but based on a quick read it certainly looks relevant.

Thanks.

Mark – you may wish to look at the correspondence analysis variation on scaling.

I assume many of your features will be categorical, and with this variation (and recent books by Michael Greenacre) it's not too hard to "see it" as regression analysis where constructed variables (dimensions) are used to predict discrepancies from independence.

And you can plot the authors in the prediction space.

K?

p.s. if using SAS be careful to check manual updates as there was a nasty typo until a few months ago

Mark,

OK, this is a much more interesting problem than the one I thought you were interested in! But I'm still not crystal clear on what you are trying to do. You say "Here I want to use attributes associated with individual authors to predict an attribute associated with a pair of authors." I can't figure out what that sentence means. Can you rephrase, or give an example?