## Bill James does model checking

Regular readers will know that Bill James was one of my inspirations for becoming a statistician.

I happened to be browsing through the Bill James Historical Baseball Abstract the other day and came across this passage on Glenn Hubbard, whom he ranks as the 88th best second baseman of all time:

Total Baseball has Glenn Hubbard rated as a better player than Pete Rose, Brooks Robinson, Dale Murphy, Ken Boyer, or Sandy Koufax, a conclusion which is every bit as preposterous as it seems to be at first blush.

To a large extent, this rating is caused by the failure to adjust Hubbard’s fielding statistics for the ground-ball tendency of his pitching staff. Hubbard played second base for teams which had very high numbers of ground balls, as is reflected in their team assists totals. The Braves led the National League in team assists in 1985, 1986, and 1987, and were near the league lead in the other years that Hubbard was a regular. Total Baseball makes no adjustment for this, and thus concludes that Hubbard is reaching scores of baseballs every year that an average second baseman would not reach, hence that he has enormous value.

Posterior checking! This would fit in perfectly in chapter 6 of BDA.

This idea is so fundamental to statistics—to science—and yet so many theories of statistics and theories of science have no place for it.

The alternative to the Jamesian, model-checking approach—so close to the Jaynesian approach!—is exemplified by Pete Palmer’s Total Baseball book, mentioned in the above quote. Pete Palmer did a lot of great stuff, and Bill James is a fan of Palmer, but Palmer follows the all-too-common approach of just taking the results from his model and then . . . well, then, what can you do? You use what you’ve got.

What makes Bill James special is that he’s interested in getting it right, and he’s interested in seeing where things went wrong.

A chicken is an egg’s way of making another egg.

To make the analogy explicit: the “egg” is the model and data, and the chicken is the inferences from the model. The chicken is implicit in the egg, but it needs some growing. The inferences are implicit in the model and the data, but it takes some computing.

All the effort that went into Total Baseball was useful for sabermetrics, in part for the direct relevance of the results (a delicious “chicken”) and in part because Total Baseball included so much data and made so many inferences that people such as James could come in and see which of these statements made no sense—and what this revealed about the problems with Palmer’s model.

It’s like Lakatos said in Proofs and Refutations: once you have the external counterexample (an implication that doesn’t make sense), you go find the internal flaw: the assumption in the model that went wrong, often an assumption that was so implicit in the construction of your procedure that you didn’t even realize it was an assumption at all. (Remember the Speed Racer principle?) Or, conversely, if you first find an internal assumption that concerns you, you should follow the thread outward and figure out its external consequences: what does it imply that does not make sense?

P.S. James is doing a posterior check, not a prior check, because his criticism of the Total Baseball model comes from the absurdity of one of its inferences, conditional on the data.
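
To make the Hubbard example concrete, here is a minimal sketch in Python of the kind of check James describes. It is toy arithmetic, not a full Bayesian posterior predictive check, and it is not James’s or Palmer’s actual formula. Every number, the `Season` fields, the proportional scaling by team assists, and the `plausible_max` threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Season:
    """Hypothetical season totals; none of these are real statistics."""
    player_assists: float           # assists credited to the second baseman
    team_assists: float             # total assists by his team
    league_avg_team_assists: float  # league-average team assists
    league_avg_2b_assists: float    # league-average assists for a second baseman


def unadjusted_plays_above_average(s: Season) -> float:
    # The naive model: compare the player directly to the league-average baseline.
    return s.player_assists - s.league_avg_2b_assists


def adjusted_plays_above_average(s: Season) -> float:
    # Assumed adjustment: scale the baseline in proportion to how many
    # ground-ball chances this team's pitching staff generated.
    expected = s.league_avg_2b_assists * s.team_assists / s.league_avg_team_assists
    return s.player_assists - expected


def passes_check(value: float, plausible_max: float) -> bool:
    # The model check: does the inference fall within what outside
    # knowledge of the game says is believable?
    return abs(value) <= plausible_max


# Made-up numbers for a second baseman on a ground-ball-heavy team.
season = Season(
    player_assists=500.0,
    team_assists=2000.0,
    league_avg_team_assists=1800.0,
    league_avg_2b_assists=450.0,
)
plausible_max = 30.0  # hypothetical ceiling on a believable yearly edge

naive = unadjusted_plays_above_average(season)
adjusted = adjusted_plays_above_average(season)

print("unadjusted:", naive, "plausible?", passes_check(naive, plausible_max))
print("adjusted:  ", adjusted, "plausible?", passes_check(adjusted, plausible_max))
```

With these made-up inputs the unadjusted rating fails the plausibility check and the adjusted one passes. The implausible inference is the external counterexample; the missing ground-ball adjustment is the internal flaw it points to.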

1. Tom says:

But why is it preposterous to say Glenn Hubbard was better than Pete Rose and others? That judgment is based on one’s internal, unstated model, which varies from individual to individual. That has always been my objection to Bill James’s work: he makes up calculations, then discards them if they disagree with his preconceived notions. If the result of the unspecified, internal model is the right one, then who needs the Bill James model?

• Steve Sailer says:

One of the things Bill James did is check contemporary opinions of knowledgeable observers who saw Hubbard play every day. As Yogi Berra said, you can observe a lot just by watching. James’ statistical techniques don’t usually have the intention of discovering somebody completely overlooked by even season ticketholders, but instead just overlooked by sportswriters in other cities who voted for MVPs based on weaker statistics like RBIs.

At the time, Hubbard was considered by those who saw him a lot to be a quality contributor to a team that was pretty good. He made one All Star team in his 12 seasons in the league. Once you adjust for the large number of ground balls his pitchers generated, his defensive stats look less than unworldly.

2. Jonathan (another one) says:

Yes… BUT. (a) If picking a pitching staff is conditioned on the slick-fielding prowess of your second baseman, then, to the extent it is, Hubbard’s contribution is undervalued; (b) in those days, the characterization of “ground ball” was influenced by range, so that a wide ranging shortstop or second baseman makes the pitching staff look more like ground-ball generators; (c) Hubbard’s best skill, a very quick turn on the double play, raises the percentage of ground balls all by himself by creating two outs on one ball.

I’m not saying any of this invalidates the point James is making, but back in the day I loved me some Glenn Hubbard. The identically named economist, not so much.

The problem with the posterior test is that it implies a second, stronger, use of the prior, no? If I decide that any method of evaluation I use can’t have as an outcome Hubbard>Murphy (including, one hopes, a standard error on that comparison which you and James elide here) then shouldn’t that have been part of the prior? (Note: I’m trolling here… the problem with this statement is left as an exercise to the reader.)

3. Keith O'Rourke says:

An interpretation is a representation’s way of making another (not necessarily better) representation.

4. Shravan says:

Whenever I see a statistics book with a chapter on baseball or some other American sport’s statistics, I just skip that chapter. Same if it’s cricket (although nobody uses cricket). The presumption that people would take the trouble to learn how the game works to read the chapter is a bit excessive to me.

• Jonathan (another one) says:

I think you have it backwards. The assumption underlying the example is that people have naïve statistical models relating to the game in question in their heads. Thus, the chapter (if done well) allows one to confront the data and the model in a realistic way. Of course, to the extent the game itself is not background knowledge, the example fails except as a word problem probably more complex than it needs to be.

I find that students are often too passive when confronting artificial examples. Indeed, the problem I find is that they are a bit too active when they are too caught up in their priors.

• Shravan says:

It seems presumptuous to assume that everyone outside the US would have a naive statistical model of baseball in their heads. It’s a bit like just assuming everyone speaks English.

• Rahul says:

I never get that argument. Say, I’m writing a book about Chemical Reactors. I’m perfectly fine if the message doesn’t reach non-English speakers.

There’s no reason to always be accessible to the lowest common denominator. Even if I included a crop-yields example in a book on Stats, one could criticize it: “It’s assuming everyone knows Agriculture.”

• Shravan says:

The next time I need a sports example in my statistics class, I will start by building a model of Kabaddi. There must be millions of players in South Asia. This is OK under your reasoning. Baseball is as obscure as Kabaddi to me. Crop yields I can understand well enough without having to put in much effort.

• Shravan says:

I agree that there is a difference between the truly obscure and popular and yet not universal. However, there is no shortage of easily understood examples to make a statistical point. Granted that the US is a huge audience, but the author doesn’t or shouldn’t have an eye on their financial bottom line when they write a book on statistics. The goal is generally to try to communicate to scientists everywhere; US-centric examples don’t serve that goal.

5. Dave C. says:

An interesting sports example was a model that I helped build a few years ago to illustrate GLM. Not being a football fan, I initially set up a model that produced roughly the same likelihood for a team to score 5 points as to score 6 points. The residual plots looked very strange, and that let us refine the model to much more closely resemble actual scores.

Domain knowledge is key.

6. Steve Sailer says:

Pete Palmer’s “Total Baseball” was published in, I think, 1984. It’s a prodigious book, but it came a decade or two too early in the revolution of baseball statistics to reliably accomplish its goal of ranking every player who ever lived. In contrast, Bill James did a lot of piecemeal studies of smaller questions that helped analysts build up toward more reliable overall rankings by the early 21st century.

I’m reminded of this history when I read about attempts to rank all the teachers in a school district in terms of “Value Added” so that the bottom X percent can be fired and the top Y percent can be given bonuses. I see various attempts to power directly to a Pete Palmer-type Total solution, but I worry that we’re missing the crucial stage of Bill James-style building blocks.

• hgfalling says:

Pretty sure if we had accurate Value over Replacement Teacher metrics we could do something pretty good with them.

• Steve Sailer says:

Sure, but it took 20-30 years of enthusiastic collaboration to get there in sabermetrics. Trying to get to the ultimate solution in one step, like Palmer did, misfired.

7. Jonathan says:

I want to put in a plug for Jim Albert’s Teaching Statistics Using Baseball. I think it’s a great way to explain basic statistics to boys (and men) who are drawn to sports and may read about WAR and UZR but have no clue how to begin thinking about a sport using fairly simple analytical methods.