## Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference

In a post from 22 May 2017 entitled, “Who is Going to Win the Batting Crown?”, Jim Albert writes:

At this point in the season, folks are interested in extreme stats and want to predict final season measures. On the morning of Saturday May 20, here are the leading batting averages:

Justin Turner .379
Ryan Zimmerman .375
Buster Posey .369

At the end of this season, who among these three will have the highest average? . . . these batting averages are based on a small number of at-bats (between 120 and 144) and one expects all of these extreme averages to move towards the mean as the season progresses. One might think that Turner will win the batting crown, but certainly not with a batting average of .379. . . .

I’m scheduling this post to appear in October, at which point we’ll know the answer!

Albert makes his season predictions not just using current batting average, but also using strikeout rates, home run rates, and batting average for balls in play. I think he’s only using data from the current year, which doesn’t seem quite right, but I guess it’s fine given that this is a demonstration of statistical methods and is not intended to represent a full-information prediction. In any case, here’s what he concludes in May:

I [Albert] predict Posey to finish with a .311 average followed by Zimmerman at .305 and Turner at .297.

These are reasonable predictions. But . . . I’m guessing that the league-leading batting average will be higher than .311!

Why do I say this? Check out recent history. The top batting averages in the past ten seasons (listed most recent year first) have been .348, .333, .319, .331, .336, .337, .336, .342, .364, .340. Actually, it looks like the top batting average in MLB has never been as low as .311. So I doubt that will happen this year. In 2016 there appear to have been 12 players who batted over .311 during the season.

What happened? Nothing wrong with Albert’s predictions. He’s just giving the posterior mean for each player, which cannot be directly examined to give an inference for the maximum over all players. Assuming he’s fitting his models in Stan (there’s no good reason to do otherwise), he’s also getting posterior simulation draws. He could then simulate, say, 1000 possibilities for the end-of-season records, and there he’d find that in just about any particular simulation the top batting average will exceed .311. Lots of players have a chance to make it, not just those three listed above.
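Here is a minimal sketch of that kind of simulation, in Python with only the standard library. It does not use Albert's model or real posterior draws. The number of hitters, the distribution of point predictions (normal, mean .260, sd .025), the 550 at-bats per hitter, and the normal approximation to the binomial are all illustrative assumptions. It also treats each point prediction as the hitter's true hit probability, so it leaves out posterior uncertainty, which would only widen the gap.

```python
import math
import random
import statistics

random.seed(2017)

N_PLAYERS = 150   # assumed number of qualifying hitters
AT_BATS = 550     # assumed at-bats per hitter over a full season
N_SIMS = 1000     # number of simulated seasons

# Hypothetical point predictions, one per hitter (not Albert's estimates).
predictions = [random.gauss(0.260, 0.025) for _ in range(N_PLAYERS)]

def simulate_average(p):
    """One simulated season average (normal approximation to the binomial)."""
    sd = math.sqrt(p * (1 - p) / AT_BATS)
    return random.gauss(p, sd)

def simulate_league_max():
    """Highest simulated average across all hitters in one season."""
    return max(simulate_average(p) for p in predictions)

league_maxes = [simulate_league_max() for _ in range(N_SIMS)]

best_point_prediction = max(predictions)
mean_league_max = statistics.mean(league_maxes)
share_above = sum(m > best_point_prediction for m in league_maxes) / N_SIMS

print(f"highest point prediction:      {best_point_prediction:.3f}")
print(f"mean simulated league maximum: {mean_league_max:.3f}")
print(f"share of seasons whose leader beats the top prediction: {share_above:.2f}")
```

Under these assumptions the simulated league leader typically finishes well above the highest individual point prediction, which is the pattern described above.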

This is not to diss Albert’s post; I’m just extending it by pointing out the perils of estimating extreme values from point predictions. It’s an issue that Phil and I discussed in our article, All maps of parameter estimates are misleading.

P.S. This post is appearing, as scheduled, on 19 Oct, during the playoffs. The season’s over so we can check what happened:

Buster Posey hit .320
Ryan Zimmerman hit .303
Justin Turner hit .322.

The league-leading batting averages were Charlie Blackmon at .331 and Jose Altuve at .346. So Albert’s predictions were not far off (these three batters did a bit better than the point predictions but I assume they’re well within the margin of error) and, indeed, it was two other hitters that won the batting titles.

From a math point of view, it’s an interesting example of how the mean of the maximum of a set of random variables is higher than the max of the individual means.

1. Hernan Bruno says:

Is the last sentence true in general? Or does it require some extra condition? Perhaps I am not understanding the statement correctly.

• Anoneuoid says:

It’s not true if the variables all have the same value. The ratio mean(max(x))/max(mean(x)) increases from 1 as the values of x become more spread out:

> sim = function(x){
+ rbind(replicate(5, mean(max(x))),
+ replicate(5, max(mean(x))))
+ }
>
> sim(rnorm(1000, 100, 100))
[,1] [,2] [,3] [,4] [,5]
[1,] 392.5742 392.5742 392.5742 392.5742 392.5742
[2,] 101.2832 101.2832 101.2832 101.2832 101.2832
> sim(rnorm(1000, 100, 1))
[,1] [,2] [,3] [,4] [,5]
[1,] 103.6604 103.6604 103.6604 103.6604 103.6604
[2,] 99.9336 99.9336 99.9336 99.9336 99.9336
> sim(rnorm(1000, 100, 0))
[,1] [,2] [,3] [,4] [,5]
[1,] 100 100 100 100 100
[2,] 100 100 100 100 100

This sounds like Hölder’s inequality; not sure if it’s related:
https://en.wikipedia.org/wiki/H%C3%B6lder%27s_inequality

• Anoneuoid says:

Oops. I had the code right but then thought I could change it to simplify. That made it produce the same x every replication… Here we go:

> sim = function(f){
+ rbind(replicate(5, mean(max(f()))),
+ replicate(5, max(mean(f()))))
+ }
>
> sim(function() rnorm(1000, 100, 100))
[,1] [,2] [,3] [,4] [,5]
[1,] 427.68917 425.5127 367.08679 428.42285 454.5031
[2,] 94.55133 102.4583 99.10667 97.56432 102.7885
> sim(function() rnorm(1000, 100, 1))
[,1] [,2] [,3] [,4] [,5]
[1,] 103.1330 103.02975 104.20603 104.54716 103.36931
[2,] 100.0318 99.97283 99.96608 99.99843 99.96715
> sim(function() rnorm(1000, 100, 0))
[,1] [,2] [,3] [,4] [,5]
[1,] 100 100 100 100 100
[2,] 100 100 100 100 100

• Sean O'Rourke says:

The last statement is merely a restatement of Jensen’s inequality, no? Recall:
Given a convex function f, a random variable X in the domain of f, and the expectation operator E, then
f(E[X]) <= E[f(X)].

The maximum of a set of points is definitely convex, so max E(X) <= E[max X], as Dr. Gelman said.
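The same inequality can be checked directly, without appealing to convexity. A short derivation, assuming only that each $X_i$ has a finite expectation:

```latex
\begin{align*}
X_i &\le \max_j X_j && \text{for every } i, \\
\mathrm{E}[X_i] &\le \mathrm{E}\bigl[\max_j X_j\bigr] && \text{by monotonicity of expectation}, \\
\max_i \mathrm{E}[X_i] &\le \mathrm{E}\bigl[\max_j X_j\bigr] && \text{taking the maximum over } i.
\end{align*}
```

Independence is not needed for this argument. Equality holds exactly when some single $X_i$ equals the maximum with probability 1, for example when all the variables are constants; otherwise the inequality is strict.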

2. John Hall says:

They could gain some familiarity with extreme value theory. He could have obtained the distribution of the maximum through simulation. Simulate the batting average of all players and find the maximum. After simulating N times, you get the distribution of the maximum batting average. Then, take the mean of that.

3. Guy says:

An additional problem is this statement: “one expects all of these extreme averages to move towards the mean.” In fact, we expect each hitter to move toward his *own* mean, not “the mean,” over the remainder of the season. Zimmerman was about an average hitter over the prior 4 seasons, and his projection was spot on. But Posey and Turner are both far above average hitters, and as a result they over-performed Albert’s projections.

Maybe I missed it in the paper, but I think the prediction could be slightly improved by recognizing the process as a random walk. That is to say, we can use multi-level modeling to estimate player_posterior_mean…but our year end estimate should be

(1-a) * player_posterior_mean + a * player_current_BA

where “a” is defined as the current proportion of at-bats in the season (technically not fixed itself, but its variance should be fairly low). It is mentioned that this is early in the season, so “a” should be small and these two predictions shouldn’t be terribly different. The results for three players are easily explained by coincidence, but it does look like this method would have slightly improved the estimates.
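As a worked example of the formula above, here is a small Python sketch. Only the .369 current average comes from the post; the posterior mean of .300, the 130 at-bats so far, and the 550-at-bat season are hypothetical values chosen for illustration, and `year_end_estimate` is a name made up here.

```python
def year_end_estimate(posterior_mean, current_ba, a):
    """Blend (1 - a) * posterior_mean + a * current_ba, with a in [0, 1]."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be a proportion between 0 and 1")
    return (1 - a) * posterior_mean + a * current_ba

posterior_mean = 0.300   # hypothetical posterior mean for the hitter
current_ba = 0.369       # Posey's average on May 20, from the post
a = 130 / 550            # assumed at-bats so far / assumed season total

estimate = year_end_estimate(posterior_mean, current_ba, a)
print(f"a = {a:.3f}, year-end estimate = {estimate:.3f}")
```

With a of roughly a quarter, the estimate moves about a quarter of the way from the posterior mean toward the current average, since the at-bats already recorded count toward the final figure.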

• Corey Yanofsky says:

This seems to be the only thing that might qualify as “over-pooling of point predictions in Bayesian inference” in the analysis. The problem with Albert’s prediction of which of the three identified players will have the highest final average isn’t over-pooling but rather using the wrong expectation to make the prediction (as AG notes in the post).

Corey:

I will fully admit to only skimming the paper. And I understand AG’s point that E[max(X1, X2)] is greater than max(E[X1], E[X2]) (under the condition that both X1 and X2 are non-deterministic, have overlapping support and are independent).

What I’m not quite clear on is whether player_posterior_mean is the posterior mean *across all seasons* (which I think it is), or player_posterior_mean for this season alone. You would be correct that this would be “double dipping” if it’s the posterior mean for this season…but my (definitely potentially flawed) understanding is that this was the posterior mean across all seasons.

5. Steve Sailer says:

In general, batting averages are usually considered an obsolete statistic by sabermetricians because they don’t measure power or other ways to get on base such as walks.

But, while batting averages aren’t that good of a way to compare players, the statistic remains useful for comparing seasons of the same player. For a typical player’s career, a peak batting average correlates with a peak season overall. For example, Ted Williams hit .406 in 1941 and .388 in 1957, both phenomenal years by just about every modern statistic as well. Mickey Mantle hit .353 in 1956 and .365 in 1957, superb peaks.

This correlation is hardly 1.00, but it happens enough that it jumps out at you. A slump year correlates with a low batting average (e.g., Giancarlo Stanton hit .240 last year with 27 homers) and a peak year with a relatively high batting average (Stanton hit .281 this year with 59 homers).

So a relatively low batting average within a player’s career is a pretty good symptom of something not being right, while an above average batting average suggests he is hitting on all eight cylinders.

6. mike says:

Yastrzemski won the AL batting title in 1968 with a .301 average. MLB lowered pitching mounds the next year to start giving batters a chance again.