Just show me the data, baseball edition

Andrew’s always enjoining people to include their raw data. Jim Albert, of course, does it right. Here’s a recent post from his always fascinating baseball blog, Exploring Baseball Data with R,

The post “just” plots the raw data and does a bit of exploratory data analysis, concluding that the apparent trends are puzzling. Albert’s blog has it all. The very next post fits a simple Bayesian predictive model to answer the question every baseball fan in NY is asking,

P.S. If you like Albert’s blog, check out his fantastic intro to baseball stats, which only assumes a bit of algebra, yet introduces most of statistics through simulation. It’s always the first book I recommend to anyone who wants a taste of modern statistical thinking and isn’t put off by the subject matter,

  • Jim Albert and Jay Bennett. 2001. Curve Ball. Copernicus.


 

The Tampa Bay Rays baseball team is looking to hire a Stan user

Andrew and I have blogged before about job opportunities in baseball for Stan users (e.g., here and here) and here’s a new one. This time it’s the Tampa Bay Rays who are hiring. The job title is “Analyst, Baseball Research & Development” and here are the responsibilities and qualifications:

Responsibilities:
* Build customized statistical modeling tools for accurate prediction and inference for various baseball applications.
* Provide statistical modeling expertise to other R&D Analysts.
* Optimize code to ensure quick and reliable model sampling/optimization.
* Author both technical and non-technical internal reports on your work.

Qualifications:
* Experience with Stan or other probabilistic programming language
* Experience with R or Python
* Deep understanding of the fundamentals of Bayesian Inference, MCMC, and Autocorrelation/Time Series Modeling.
* Start date is flexible. For example, candidates with an extensive amount of remaining time left in an academic program are encouraged to apply immediately.
* Candidates with non-traditional schooling backgrounds, as well as candidates with Advanced degree (Masters or PhD) in Statistics, Data Science, Machine Learning, or a related field are encouraged to apply

That’s just part of the job ad, so I recommend checking out the full posting, which includes important details like the fact that remote work is a possibility.

Here are a few other details I can share that aren’t included in the job ad:

  • The Rays have already been using Stan for years now so you won’t be the only Stan user there.
  • A few years ago a few of us (Stan developers) did some consulting/training work for the Rays and had a great experience. Some of their R&D team members have changed since then but I still know some of the ones there and I highly recommend working with them if you’re interested in baseball.
  • The Rays always have one of the lowest payrolls for their roster and yet they are somehow consistently competitive (they even made the World Series last year!). I’m sure there are multiple reasons for this, but I strongly suspect that the strength of the R&D team you’d be joining is one of them.

 

Will Stanton hit 61 home runs this season?

[edit: Juho Kokkala corrected my homework. Thanks! I updated the post. Also see some further elaboration in my reply to Andrew’s comment. As Andrew likes to say …]

So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple back-of-the-envelope Bayesian model and see what the posterior event probability estimate is.

Sampling notation

We start with a simple model that assumes a fixed home run rate per at bat, $latex \theta$, with a uniform (conjugate) prior:

$latex \theta \sim \mbox{Beta}(1, 1)$

The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.

$latex 56 \sim \mbox{Binomial}(555, \theta)$
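
Because the beta prior is conjugate to the binomial likelihood, the posterior for $latex \theta$ is available in closed form as a beta distribution. Here is a minimal R sketch of that update; the variable names are illustrative and not from the original calculation.

```r
# Conjugate update: Beta(1, 1) prior plus 56 home runs in 555 at bats.
hr <- 56
ab <- 555
alpha_post <- 1 + hr        # prior alpha plus successes
beta_post <- 1 + ab - hr    # prior beta plus failures

# Posterior mean and central 95% interval for the home run rate.
post_mean <- alpha_post / (alpha_post + beta_post)
post_interval <- qbeta(c(0.025, 0.975), alpha_post, beta_post)

print(c(alpha = alpha_post, beta = beta_post))
print(post_mean)
print(post_interval)
```

The posterior mean is 57 / 557, or about 0.102 home runs per at bat.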

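Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the number of at bats in the rest of the season is Poisson distributed, with a rate equal to the number of games left times his at bats per game so far.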

$latex \mathit{ab} \sim \mbox{Poisson}(10 \times 555 / 149)$
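
To make the Poisson rate concrete, here is a small R sketch of the arithmetic, along with the range of remaining at bats that the Poisson assumption implies. The variable names are illustrative.

```r
# Poisson rate for remaining at bats: games left times at bats per game so far.
games_played <- 149
games_left <- 10
ab_so_far <- 555
lambda <- games_left * ab_so_far / games_played   # about 37.2 at bats

# Central 95% range of remaining at bats under the Poisson assumption.
ab_range <- qpois(c(0.025, 0.975), lambda)

print(lambda)
print(ab_range)
```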

We then take the number of home runs to be binomial given the number of at bats and the home run rate.

$latex h \sim \mbox{Binomial}(\mathit{ab}, \theta)$

Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.

$latex \mbox{gte61} = \mbox{I}[h \geq (61 - 56)]$

Event probability

The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable,

$latex \displaystyle \mbox{Pr}[h \geq (61 - 56)] \\[6pt] \hspace*{3em} \displaystyle { } = \ \int_{\theta} \ \sum_{\mathit{ab}} \ \sum_{h = 0}^{\mathit{ab}} \ \mathrm{I}[h \geq (61 - 56)] \\[6pt] \hspace*{8em} \ \times \ \mbox{Binomial}(h \mid \mathit{ab}, \theta) \\[6pt] \hspace*{8em} \ \times \ \mbox{Poisson}(\mathit{ab} \mid 10 \ \times \ 555 / 149) \\[6pt] \hspace*{8em} \ \times \ \mbox{Beta}(\theta \mid 1 + 56, 1 + 555 - 56) \ \mathrm{d}\theta.$

Computation in R

The posterior for $latex \theta$ is analytic because the prior is conjugate, so we can simulate the home run rate directly from a beta distribution given the observed successes (56) and number of trials (555). The number of remaining at bats is independent of the rate and easy to simulate with a Poisson random number generator. We then simulate the number of remaining home runs in the outermost call as a random binomial. Finally, we compare that count to the number he still needs (61 - 56) and report the fraction of simulations in which the simulated home runs put Stanton at 61 or more:

> sum(rbinom(1e5,
             rpois(1e5, 10 * 555 / 149),
             rbeta(1e5, 1 + 56, 1 + 555 - 56))
       >= (61 - 56)) / 1e5
[1] 0.34

That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.
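
As a sanity check on the simulation, the sum over at bats can be done analytically. If the number of at bats is Poisson with rate $latex \lambda$ and each at bat is independently a home run with probability $latex \theta$, then the number of home runs is itself Poisson with rate $latex \lambda \theta$ (Poisson thinning). That leaves a one-dimensional integral over the beta posterior, which R's integrate() can evaluate numerically. This is a sketch under the same model assumptions, not part of the original calculation.

```r
lambda <- 10 * 555 / 149   # expected remaining at bats
needed <- 61 - 56          # home runs still required

# Pr[h >= needed | theta] with h | theta ~ Poisson(lambda * theta),
# weighted by the Beta(1 + 56, 1 + 555 - 56) posterior density.
integrand <- function(theta) {
  ppois(needed - 1, lambda * theta, lower.tail = FALSE) *
    dbeta(theta, 1 + 56, 1 + 555 - 56)
}

p_exact <- integrate(integrand, lower = 0, upper = 1)$value
print(p_exact)
```

The result should come out at roughly one in three, in line with the simulated value up to Monte Carlo error.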

Disclaimer

The above is intended for recreational use only and is not intended for serious bookmaking.

Exercise

You guessed it: code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistic. It’d be even better hierarchically with more players (that’ll pull the estimate for $latex \theta$ down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.

The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.
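
Before porting to Stan, it can help to wrap the R calculation in a reusable function that works for any batter and any number of games left. The function name and its arguments below are hypothetical, chosen for illustration, and the model is the same simple one as above (uniform prior, Poisson at bats).

```r
# Hypothetical helper: estimate Pr[season total >= target] by simulation.
prob_reach <- function(target, hr, ab, games_played, games_left,
                       n_sims = 1e5) {
  theta <- rbeta(n_sims, 1 + hr, 1 + ab - hr)                  # posterior draws of HR rate
  future_ab <- rpois(n_sims, games_left * ab / games_played)   # remaining at bats
  future_hr <- rbinom(n_sims, future_ab, theta)                # remaining home runs
  p <- mean(future_hr >= (target - hr))
  # Return the estimate with its Monte Carlo standard error.
  list(prob = p, mcse = sqrt(p * (1 - p) / n_sims))
}

set.seed(1234)  # arbitrary seed, only for reproducibility
stanton <- prob_reach(target = 61, hr = 56, ab = 555,
                      games_played = 149, games_left = 10)
print(stanton)
```

Reporting the Monte Carlo standard error alongside the estimate shows how many digits of the simulated probability are worth quoting.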

Jim Albert’s Baseball Blog

Jim Albert has a baseball blog:

I sent a link internally to people I knew were into baseball, to which Andrew replied, “I agree that it’s cool that he doesn’t just talk, he has code.” (No kidding—the latest post as of writing this was on an R package to compute value above replacement players (VAR).)

You may know me from…

You may know Jim Albert from the “Albert and Chib” approach to Gibbs sampling for probit regression. I first learned about him through his fantastic book, Curve Ball, which I recommend at every opportunity (the physical book’s inexpensive and I’m stunned Springer’s selling an inexpensive PDF with no DRM, so there’s no reason not to get it). It’s not only very insightful about baseball, it’s a wonderful introduction to statistics via simulation. It starts out analyzing All-Star Baseball, a game based on spinners. This book went a long way in helping me understand statistics, but at a level I could share with friends and family, not just math geeks. It then took Gelman and Hill’s regression book and working through the BUGS examples before I could make sense of BDA.

In the same vein, Albert has a solo book aimed at undergraduates or their professors: Teaching Statistics Using Baseball. And I just saw on his home page that he also has a book on Analyzing Baseball Data with R.

Little Professor Baseball

I first wrote to Jim Albert way back before I was working with Andrew on Stan. I’d just read Curve Ball and had just created my very simple baseball simulation, Little Professor Baseball. I was very pleased with how I’d made it simple like All-Star Baseball, but included pitching and batting, like Strat-o-Matic Baseball (a more “serious” baseball simulation game). My only contribution was figuring out how to allow both players (offense/defense) to roll dice, with the result being read from the card of the highest roller. I had to solve a quadratic equation to adjust for the bias of taking the highest roller, and then adjust further for the Strat-o-Matic-style correction for only reading the results off a player’s card half the time (here are the derivations with a statistical discussion on getting the expectations right). I analyze the 1970 Major League Baseball season (same one used by Efron and Morris, by the way). I even name-drop Andrew’s hero, Earl Weaver, in the writeup.