Trey Causey writes:
If you’ll permit a bit of a diversion, I was wondering if you’d mind sharing your thoughts on how sabermetrics approaches the measurement of luck vs. skill. Phil Birnbaum and Tom Tango use the following method (which I’ve quoted below).
It seems to embody the innovative but often non-intuitive way that sabermetrics approaches problems, but something about it feels off to me.
Causey quotes Phil Birnbaum:
To go from a record of performance to an estimate of a team’s talent, you have to regress its winning percentage towards the mean. How do you figure out how much to regress?
Tango has often given these instructions:
1. First, figure out the standard deviation of team performance. For MLB, for all teams playing at least 160 games up until 2009, that figure is 0.070 (about 11.34 wins per 162 games).
Second, figure out the theoretical standard deviation of luck over a season, using the binomial approximation to normal. That’s estimated by the formula
Square root of (p(1-p)/g)
For baseball, p = .500 (since the average team must be .500), and g = 162. So the SD of luck works out to about 0.039 (6.36 games per season).
So SD(performance) = 0.070, and SD(luck) = 0.039. Square those numbers to get var(performance) and var(luck). Then, if luck is independent of talent, we get
var(performance) = var(talent) + var(luck)
That means var(talent) equals 0.058 squared, so SD(talent) = 0.058.
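The arithmetic in this step can be checked in a few lines of Python. This is only a sketch of the calculation described above; the variable names are illustrative and not from the original post.

```python
import math

# Figures quoted in the text above.
sd_performance = 0.070   # observed SD of team winning percentage
p, g = 0.5, 162          # average team, games per season

# Binomial SD of winning percentage from luck alone.
sd_luck = math.sqrt(p * (1 - p) / g)

# var(performance) = var(talent) + var(luck), assuming luck is independent of talent.
var_talent = sd_performance**2 - sd_luck**2
sd_talent = math.sqrt(var_talent)

print(round(sd_luck, 3), round(sd_talent, 3))  # prints: 0.039 0.058
```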
2. Now, find the number of games for which the SD(luck) equals SD(talent), or 0.058. It turns out that’s about 74 games, because the square root of (p(1-p)/74) is approximately equal to 0.058.
3. That number, 74, is your “answer”. So, now, any time you want to regress a team’s record to the mean, take 74 games of .500 ball (37-37), and add them to the actual performance. The result is your best estimate of the team’s talent.
For instance, suppose your team goes 100-62. What’s its expected talent? Adjust the record to 137-99. That gives an estimated talent of .581, or 94-68.
Or, suppose your team starts 2-6. Adjust it to 39-43. That’s an estimated talent of .476, or 77-85.
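Steps 2 and 3 and the two worked examples can be sketched as follows. The function names `ballast_games` and `regress_record` are hypothetical, chosen only for this illustration.

```python
def ballast_games(sd_talent, p=0.5):
    """Number of games at which the binomial SD of luck equals sd_talent."""
    return p * (1 - p) / sd_talent**2

def regress_record(wins, losses, ballast=74, p=0.5):
    """Add `ballast` games of p-level play; return the adjusted winning percentage."""
    return (wins + p * ballast) / (wins + losses + ballast)

print(round(ballast_games(0.058)))         # prints: 74
print(round(regress_record(100, 62), 3))   # prints: 0.581  (137-99)
print(round(regress_record(2, 6), 3))      # prints: 0.476  (39-43)
```

Multiplying the adjusted percentages by 162 gives roughly 94 and 77 wins, matching the 94-68 and 77-85 records quoted above.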
Those estimates seemed reasonable to me, but I often wondered: does this really work? Is it really true that you can add 74 games to a 162 game season, and it’ll work, but you can also add 74 games to an 8 game season, and that’ll work too? Surely you want to add fewer .500 games when your original sample is smaller, no?
And why always add the exact number of games that makes the talent SD equal to the luck SD? Is it a rule of thumb? Is it a guess? Again, that can’t be the mathematically best way, can it?
It can, actually. I spent a couple of hours doing some algebra, and it turns out that Tango’s method is exactly right. I was very surprised. Also, I don’t know how Tango figured it out … maybe he used an easier, more intuitive way to figure out that it works than going through a bunch of algebra.
But I can’t find one, so let me take you through the algebra, if you care. Tango, is there an obvious explanation for why this works, more obvious than what I’ve done?
As I wrote a few paragraphs ago,
var(overall) = var(talent) + var(luck). [Call this “equation 1” for later.]
Let v^2 = var(overall), and let t^2 = var(talent). Also, let “g” be the number of games.
From the binomial approximation to normal, we know var(luck) = (.25/g). So
v = SD(overall)
t = SD(talent)
sqrt(.25/g) = SD(luck)
Suppose you run a regression on overall outcome vs. talent. The variance of talent is t^2. The variance of overall outcome is v^2. Therefore, we know that talent will explain t^2/v^2 of the variance of outcome, so the r-squared we get out of the regression will be t^2/v^2. That means the correlation coefficient, “r”, will be equal to the square root of that, or t/v.
There’s a property of regression in general that implies this: If we want to predict talent from outcome, then, if the outcome X is y standard deviations from the mean, talent will be y(t/v) standard deviations from the mean. That’s one of the things that’s true for any regression of two variables.
Expected talent = average + (number of SDs outcome is away from the mean) * (t/v) * (SD of talent)
Expected talent = average + [(outcome – mean)/(SD of outcome)] * (t/v) * (SD of talent)
Expected talent = average + [(X – mean)/v] * (t/v) * t
Expected talent = average + (t^2/v^2) * (X – mean)
That last equation means that when we look at how far the observation is from average, we “keep” t^2/v^2 of the difference, and regress to the mean by the rest. In other words, we regress to the mean by (1 – t^2/v^2), or “(100 * (1 – t^2/v^2)) percent”.
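The claim that the regression slope of talent on outcome is t^2/v^2 can be checked with a small simulation. This is only a sketch under stated assumptions: talent is drawn from a normal distribution with SD 0.058, each game is an independent coin flip at the team's talent level, and the number of simulated teams (5000) and the seed are arbitrary choices.

```python
import random

random.seed(0)
t, g, n_teams = 0.058, 162, 5000   # SD(talent), games per season, simulated teams

talents, observed = [], []
for _ in range(n_teams):
    talent = random.gauss(0.5, t)                            # assumed normal talent
    wins = sum(random.random() < talent for _ in range(g))   # one simulated season
    talents.append(talent)
    observed.append(wins / g)

# Least-squares slope of talent on observed winning percentage.
mean_x = sum(observed) / n_teams
mean_y = sum(talents) / n_teams
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, talents)) / n_teams
var_x = sum((x - mean_x) ** 2 for x in observed) / n_teams
slope = cov / var_x

predicted = t**2 / (t**2 + 0.25 / g)   # t^2/v^2 from the text
print(round(slope, 3), round(predicted, 3))
```

The simulated slope should land close to the predicted value of about 0.69, up to sampling noise and the small difference between the exact binomial variance and the .25/g approximation.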
Now, if we regress to the mean by (1 – t^2/v^2), that’s exactly the same as averaging
– (1 – t^2/v^2) parts average performance, and
– (t^2/v^2) parts observed performance.
For instance, if you’re regressing one-third of the way to the mean, you can do it two ways. You can (a) move from the average to the observation, and then move back by 1/3 of the difference, or (b) just take an average of two parts observation and one part mean.
But how does that translate, in practical terms, into how many games of average performance we need to add?
From above, we know that:
For every t^2/v^2 games of observed performance, we want (1 – t^2/v^2) games of average performance.
And now a little algebra:
For every 1 game of observed performance, we want (1 – t^2/v^2)/(t^2/v^2) games of average performance.
For every game of observed performance, we want (v^2-t^2)/t^2 games of average performance.
Multiply by g:
For every “g” games of observed performance, we want g(v^2-t^2)/t^2 games of average performance.
But, from equation 1, we know that (v^2-t^2) is just the squared SD of luck, which is .25/g. So,
For every “g” games of observed performance, we want g(.25/g)/t^2 games of average performance.
The “g”s cancel, and we get,
For every “g” games of observed performance, we want .25/t^2 games of average performance.
And that doesn’t depend on g! So no matter whether you’re regressing a team over 1 game, or 10 games, or 20 games, or 162 games, you can always add *the same number of average games* and get the right answer! I wouldn’t have guessed that.
But how many games? Well, it’s (.25/t^2) games.
For baseball, we calculated earlier that t = 0.058. So .25/t^2 equals … 74 games. Exactly as Tango said, the number of games we’re adding is exactly the number of games for which SD(luck) equals SD(talent)!
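To see the cancellation numerically, here is a short sketch that repeats the algebra for several sample sizes. It uses t = 0.058 from the text; the particular values of g and the variable names are illustrative.

```python
t = 0.058
var_talent = t**2

added_by_g = {}
for g in (1, 10, 20, 74, 162):
    var_luck = 0.25 / g
    var_overall = var_talent + var_luck    # equation 1
    keep = var_talent / var_overall        # fraction of the difference we keep, t^2/v^2
    added = g * (1 - keep) / keep          # games of average play per g observed games
    added_by_g[g] = added
    print(g, round(keep, 3), round(added, 1))
```

The fraction kept changes a great deal with g, but the last column comes out to about 74.3 on every row.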
Is that a coincidence? No, it’s not. It’s the way it has to be. Why? Here’s a semi-intuitive explanation.
As we saw above, the number of games we have to add does NOT depend on the number of games we started with in the observed W-L record. So, we can pick any number of games. Suppose we just happened to start with 74 games — maybe a team that was 40-34, or something.
Now, for that team, the SD of its talent is 0.058. And, the SD of its luck is also 0.058. Therefore, if we were to do a regression of talent vs. observed, we would necessarily come up with an r-squared of 0.5 — since the variances of talent and luck are exactly equal, talent explains half of the total variance.
That means the correlation coefficient, r, is the square root of 0.5, or 1 divided by the square root of 2. For every SD change in performance, we predict 1/sqrt(2) SD change in talent. But the SD of talent is exactly 1/sqrt(2) times the SD of performance. Multiply those two 1/sqrt(2)’s together and you get 1/2, which means for every win change in performance, we predict 1/2 win change in talent.
That’s another way of saying that we want to regress exactly halfway back to the mean. That, in turn, is the equivalent of averaging one part observation, and one part mean. Since we have 74 games of observation, we need to add 74 games of mean.
So, in the case of “starting with 74 games of observation,” the answer is, “we need to add 74 games of .500 to properly regress to the mean.”
However, we showed above that we want to add the *same* number of .500 games regardless of how many observed games we started with. Since this case works out to 74 games, *all* situations must work out to 74 games.
I responded that “Tom Tango” is a great name, in the “Vance Maverick” or “Larry Lancaster” category. But Causey burst my bubble by informing me that he thinks it’s a pseudonym.
In answer to the original question, they’re doing hierarchical Bayes with a point estimate for the group-level variance. We do something similar in chapter 2 of BDA (the cancer-rate example), just with a Poisson rather than a binomial model. It’s good to see people re-deriving this sort of thing from scratch, and justifying it based on it making sense rather than just because it’s Bayesian. Of course if the model is correct the posterior mean will have lowest mean squared error, so all sorts of different derivations will get you the right answer.
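One way to see the connection, as a sketch only: suppose (this is an assumption made for illustration, not something in Birnbaum's post, which uses the normal approximation) that team talent follows a Beta(37, 37) prior. That prior has an SD of about 0.058, and with a binomial likelihood the posterior mean is exactly the "add 37-37" rule.

```python
import math

a = b = 37   # hypothetical Beta prior: 37 prior "wins" and 37 prior "losses"

# SD of a Beta(a, b) distribution.
prior_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def posterior_mean(wins, losses):
    """Posterior mean of talent under the Beta(a, b) prior and a binomial likelihood."""
    return (a + wins) / (a + b + wins + losses)

print(round(prior_sd, 3))                  # prints: 0.058
print(round(posterior_mean(100, 62), 3))   # prints: 0.581
print(round(posterior_mean(2, 6), 3))      # prints: 0.476
```

Here the prior's spread plays the role of the group-level variance, which is plugged in as a fixed point estimate rather than given its own distribution.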