Someone who wants to remain anonymous writes:

I am working to create a more accurate in-game win probability model for basketball games. My idea is for each timestep in a game (a second, 5 seconds, etc), use the Vegas line, the current score differential, who has the ball, and the number of possessions played already (to account for differences in pace) to create a point estimate probability of the home team winning.

This problem would seem to fit a multi-level model structure well. It seems silly to estimate 2,000 regressions (one for each timestep), but the coefficients should vary at each timestep. Do you have suggestions for what type of model this could/would be? Additionally, I believe this needs to be some form of logit/probit given the binary dependent variable (win or loss).

Finally, do you have suggestions for what package could accomplish this in Stata or R?

To answer the questions in reverse order:

3. I’d hope this could be done in Stan (which can be run from R).

2. Yes, a model with varying coefficients would make sense. I’d play around with the data, graph some estimates based on different timesteps, and then from there fit a parametric model that fits the data and makes sense.

1. Don’t model the probability of win, model the expected score differential. Yeah, I know, I know, what you really want to know is who wins. But the most efficient way to get there is to model the score differential and then map that back to win probabilities. The exact same issue comes up in election modeling: it makes sense to predict vote differential and then map that to Pr(win), rather than predicting Pr(win) directly. This is most obvious in very close games (or elections) or blowouts; in either of these settings the win/loss outcome provides essentially zero information. But it’s true more generally that there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.
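To make the mapping from score differential to Pr(win) concrete, here is a minimal sketch. It assumes the final home-minus-away differential is roughly normal around the model's prediction; that distributional choice, the function names, and the 12-point standard deviation are illustrative assumptions, not anything estimated from data.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def win_probability(expected_diff, sd_diff):
    """Pr(final home-minus-away differential > 0) under a normal approximation.

    expected_diff: predicted final differential (home minus away), in points.
    sd_diff: predictive standard deviation of that differential (assumed known).
    """
    return normal_cdf(expected_diff / sd_diff)


if __name__ == "__main__":
    # Hypothetical numbers: home team predicted to win by 6, sd of 12 points.
    print(round(win_probability(6.0, 12.0), 3))
```

The point of the exercise is that the regression is fit to the differential, which varies from game to game even among blowouts, and the win probability is only computed at the end.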

This is related to this paper, right?

http://www.stat.columbia.edu/~gelman/research/published/thirds5.pdf

(Not that I can speak intelligently on score differentials in sports…)

Can someone explain in more detail what sort of model this is? I’m still confused about the nitty-gritty. Are we using the time-series aspect of the game progression?

It sounds like some kind of model where the score is a time-series Poisson process with a variable rate of scoring, where the rate is predicted from the current score, score differential, possession, and proxies for tiredness and other aspects of the development of the game.

I think you’d have problems fitting that kind of thing in Stan because of the Poisson nature of the score. Perhaps basketball games score high enough that you could divide the score by 100 and treat it as a Gaussian process. That certainly wouldn’t make sense for something like baseball or football, where there are relatively few scoring events.

Also, your score can’t go down, but the score differential can change in both directions, so modeling score differential would work better if you’re going to divide by 100 and use a Gaussian-type process.

“I think you’d have problems fitting that kind of thing in Stan because of the poisson nature of the score.”

Why is this a problem? I’m doing something similar with football (soccer) using Stan without any problems so far.

Stan can’t sample discrete parameters. I guess so long as you’re only interested in historic games that’s OK, since they’re observed. But if you’re trying to observe, say, up to half-time and then see distributions over future score differentials (where the future score differentials are now parameters), then you won’t be able to do it until Stan can sample the Poisson paths.

Daniel:

It depends. Stan can’t do inference on discrete parameters, but Stan can simulate discrete generated quantities. So, if you have a model with continuous parameters (which would be appropriate for modeling basketball teams) with data up to halftime, and then you want to simulate from the posterior distribution of final score differentials, yes, you can do that in Stan with no problem.

Suppose your model is a Markov chain, so that the posterior distribution of score differential at time i depends on the score and other parameters at time i-1. Now you split the game into 100 intervals and you have data up to time 50. I could see how maybe Stan could sample from the posterior for time 51, but could it sample from the posterior for time 100? Now various statistics at time 51 become parameters that affect time 52, and that in turn affects 53, etc.

If Stan can do that, I would love to know how.

Daniel:

Yes, Stan could simulate time 51, 52, etc. sequentially as generated quantities—at least, if I’m understanding your model correctly. All of this seems to be simply forward simulation, no MCMC needed conditional on the inferences for the parameters in the model.
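As a rough illustration of what "simply forward simulation" means here, the sketch below uses plain Python rather than Stan. It assumes a hypothetical model in which each interval's change in score differential is normal with a drift and standard deviation, and that posterior draws of those two parameters are already in hand; the names `simulate_final_diffs`, `win_fraction`, and `param_draws` are made up for the example.

```python
import random


def simulate_final_diffs(current_diff, steps_left, param_draws, seed=0):
    """Simulate one final score differential per posterior draw.

    current_diff: observed differential at the last observed interval.
    steps_left: number of intervals still to be played.
    param_draws: iterable of (drift, sd) pairs, one per posterior draw
                 (assumed to come from a model fit elsewhere).
    """
    rng = random.Random(seed)
    finals = []
    for drift, sd in param_draws:
        diff = current_diff
        # Step through times 51, 52, ... in order; each step depends only
        # on the simulated state at the previous step (Markov structure).
        for _ in range(steps_left):
            diff += rng.gauss(drift, sd)
        finals.append(diff)
    return finals


def win_fraction(finals):
    """Share of simulated games in which the final differential is positive."""
    return sum(d > 0 for d in finals) / len(finals)


if __name__ == "__main__":
    # Hypothetical posterior draws for (drift, sd) per interval.
    draws = [(0.02, 1.4), (0.00, 1.5), (0.05, 1.3)] * 1000
    finals = simulate_final_diffs(current_diff=4.0, steps_left=50, param_draws=draws)
    print(win_fraction(finals))
```

Because every simulated path starts from its own parameter draw, the spread of the simulated finals reflects both parameter uncertainty and game-to-game noise. The harder case raised in the next comment, where later observations constrain earlier unknowns, is not covered by this kind of forward pass.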

Hmm. I guess I’m actually thinking of a different kind of difficulty. I had a friend who was working on something similar, but in her case it was a kind of missing-data interpolation issue. Since the missing data values had to add up to certain observed totals, the outcomes for intermediate time points were conditional on later data and couldn’t simply be generated quantities.

I guess the equivalent sports analogy would be having the score only at certain time points (0, 10, 22, 40, 50, for example) and trying to infer what the score might have been at intermediate times, conditional on the time series passing through the known values at the known time points.

Write something on the prevent defense in the NFL, please. In other words, say your team is up by 3 touchdowns in the 4th quarter. Does it make sense to give away scores (say by runs or short passes) that take a long time, rather than take chances on allowing long passes? And then on the offense side, what about running out the clock (passing up some opportunities for long passes of its own, and allowing the opposing defense to concentrate defenders in the box)?

And then how this interacts with Vegas lines, prediction models, etc. And if it is a winning strategy.

An excellent subject! The anonymous questioner should probably review Mike Beuoy’s work, if he hasn’t, at http://stats.inpredictable.com/nba/wpNBA.php

Some other older work by Ryan J Parker (using Brownian Motion model) and Ed Kupfer (both work for NBA teams now) is located at http://web.archive.org/web/20080820164306/http://www.whichteamwins.com/blog/2008/04/29/nba-win-probability-graphs/

and http://web.archive.org/web/20081004132640/http://sonicscentral.com/apbrmetrics/viewtopic.php?t=586

There are others who have dealt with the subject as well over at the APBRmetrics forum.

Thanks Daniel.

Andrew – the approach I took was a modification of the LOESS technique, which is basically just a rolling, weighted linear regression. A nice feature of LOESS is that it is very responsive to what the data is actually telling you, without trying to force it into a pre-determined equation with a set of parameters to optimize (not too familiar with multilevel modeling, it may have the same advantage). I used R’s locfit package which extends the LOESS framework to incorporate logistic regression. I see your point regarding modeling point differential, but I think that may break down as you try to model late game situations.

Even with the vast amount of data points you’re going to get from NBA play by play data, things are still going to be noisy at the 5 second bucket increments, so you’ll still probably need to “borrow” nearby data points from other time buckets to get rational win probabilities that don’t jerk up and down (this is why I went with the LOESS approach).
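To show what "borrowing" from nearby time buckets can look like, here is a small sketch. It is not the locfit local logistic regression described above; it swaps in a simpler kernel-weighted average of win rates (a local-constant smoother) with a tricube weight, and the function names, bucket layout, and counts are all hypothetical.

```python
def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def smoothed_win_rate(buckets, wins, games, target, bandwidth):
    """Kernel-weighted win rate at `target`, pooling nearby time buckets.

    buckets: bucket start times in seconds (e.g. 0, 5, 10, ...).
    wins, games: home wins and games observed in each bucket for one
                 game state (say, home team up by 2 with the ball).
    bandwidth: window half-width in seconds; buckets at or beyond this
               distance from `target` get zero weight.
    """
    num = den = 0.0
    for b, w, g in zip(buckets, wins, games):
        k = tricube((b - target) / bandwidth)
        num += k * w
        den += k * g
    return num / den if den > 0 else float("nan")


if __name__ == "__main__":
    # Hypothetical noisy counts in three adjacent 5-second buckets.
    buckets, wins, games = [0, 5, 10], [6, 5, 9], [10, 10, 10]
    print(smoothed_win_rate(buckets, wins, games, target=5, bandwidth=5))   # own bucket only
    print(smoothed_win_rate(buckets, wins, games, target=5, bandwidth=10))  # pools neighbors
```

A wider bandwidth gives smoother but less responsive estimates, which is the same tradeoff the LOESS span controls.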

All that being said, I think my regression approach starts to break down as you get down to the last 30 seconds or so of game time. I think Brian Burke has called out similar difficulties when modeling NFL win probability (and his task was several orders of magnitude harder: more discrete states to model, and sparser data).

I like this post. It’s a useful reminder to work with underlying substantive quantities (point differential), instead of default summarized quantities (win/loss)

This has been done. Brian Burke (advancednflstats.com) does a model like this for the NFL (he is also the NY Times’ 4th Down Bot on Twitter). During the NCAA tournament a couple of years ago he created the same kind of model using time of possession and score, based on data from something like 20 years of games.

You can see those here: http://www.advancednflstats.com/2011/03/live-ncaa-basketball-win-probability.html

Ken Pomeroy has also done similar: http://kenpom.com/blog/index.php/weblog/entry/in-game_win_probabilities

“But it’s true more generally that there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.”

Wouldn’t it be even more general to try to model the scores for both teams, not just the differential? For example, some teams might be good at defending and generally have very slow-scoring games. So using just the score differential, you would throw some information away.

But, I guess, it would be quite a bit more difficult as well. What would be the right choice for the joint distribution of scores, for instance?

Interestingly, the NCAA forbids the use of score differentials in team rankings (at least those rankings that have their imprimatur); they do this to avoid giving teams incentives to run up the score. (One can explain that any sensible model would truncate, or at least shrink, large differentials in particular games, but the NCAA isn’t composed of data geeks.) I once did a calculation comparing college hockey predictability using a Bradley-Terry model (wins and losses only, with adjustments for a few covariates like home team) versus a simple Poisson regression model of team scoring with individual offensive and defensive ratings for each team (i.e., 2N parameters, where N is the number of teams), again with a few covariates. The improvement in predictability is striking. The takeaway is that wins don’t predict wins nearly as well as offensive and defensive prowess… and it’s not close. Many are still troubled, though, by the fact that wins aren’t the best estimators of wins.
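For readers who want to see how offensive and defensive ratings turn into a win probability, here is a minimal prediction-only sketch. It is not the commenter's fitted model: it assumes the ratings are already estimated, uses a log link, treats the two teams' goal counts as independent Poisson variables, and all names and numbers are hypothetical.

```python
from math import exp, factorial


def scoring_rate(intercept, offense, opp_defense, home_adv=0.0):
    """Expected goals under a log-linear model (all inputs on the log scale)."""
    return exp(intercept + offense - opp_defense + home_adv)


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected count is lam."""
    return exp(-lam) * lam ** k / factorial(k)


def outcome_probs(lam_home, lam_away, max_goals=25):
    """Return (Pr home win, Pr tie, Pr away win) for independent Poisson scores.

    The sum is truncated at max_goals per team, which is harmless for
    hockey-sized scoring rates.
    """
    home = tie = away = 0.0
    for h in range(max_goals + 1):
        ph = poisson_pmf(h, lam_home)
        for a in range(max_goals + 1):
            p = ph * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                tie += p
            else:
                away += p
    return home, tie, away


if __name__ == "__main__":
    # Hypothetical ratings: log-scale intercept near log(3 goals per game).
    lam_h = scoring_rate(1.1, offense=0.15, opp_defense=0.05, home_adv=0.05)
    lam_a = scoring_rate(1.1, offense=0.00, opp_defense=0.10)
    print(outcome_probs(lam_h, lam_a))
```

The contrast with a wins-only model is that each game contributes two counts to the ratings rather than a single binary outcome.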

Jonathan:

This is a good excuse to recall that statistical inference is different than ranking. There are many settings where data are available that will improve prediction or estimation but are not used in rating or ranking because of issues of fairness or incentives. For example, suppose students in a class are given a pre-test at the beginning of the semester and a post-test at the end. To form the final grade, it will (probably) be more efficient to include pre-test in the grading formula, but it does not seem fair to base a student’s final grade on a pre-test.

Why is that more efficient? Unless you are rating the teacher? To the extent you believe grades reflect & signal the quality of your student product to external “purchasers” isn’t the pre-test score absolutely irrelevant?

To engage in a bit of self-promotion, we submitted a paper to the Sloan Sports Analysis Conference that does something like this, but on the possession level: http://www.sloansportsconference.com/wp-content/uploads/2014/02/2014_SSAC_Pointwise-Predicting-Points-and-Valuing-Decisions-in-Real-Time.pdf. We took the stochastic process approach here, but spent a fair amount of time at the beginning thinking about whether we could hack together a solution based on marginal regressions like the questioner originally suggested.

To answer the question of whether to fit 2000 regressions, the answer is almost certainly no. As other respondents above have implied, it’s better to think about the game as a temporal stochastic process. Doing 2000 regressions, one for each “k seconds left to go” timeslice opens you up to incoherent inferences between timeslices and throws away a lot of useful information that you could share between timeslices that are close to each other. The stochastic process view allows you to do that information sharing, and ensures that your inference will be coherent between timeslices. It also gives you a way to do an honest accounting of your uncertainty, because in the 2000 regression approach, you’d be using the same outcome 2000 times, once for each regression, but your standard errors would be computed within each model without an easy way to assess how your errors are correlated across timeslices. The stochastic process view solves this too, since you can think of each increment as being a replication, and you can use one model to treat the whole game, which gives you coherent standard errors.

Problem with modeling point difference is that it’s super duper bimodal because games don’t end in ties.

Ties are an issue but a minor issue. Overtime games are rare. One could just count the score before overtime (thus including ties) or else just use final score and not worry about it. For most purposes it won’t really matter.

Do you think it matters much that winning teams in blowouts often play conservatively, not scoring as much as they can? This is an issue in basketball and football more than baseball (where there is no clock).

Contrary to the NCAA dictum that ranking shouldn’t depend on scores, since betting does depend on scores it seems reasonable to use scores as Andrew suggests. One way to do this is described here: Roger Koenker and Gilbert W. Bassett Jr., “March madness, quantile regression bracketology, and the Hayek hypothesis,” JBES, Vol. 28 (2010), No. 1, pp. 26-35.
