Milad Kharratzadeh shares this analysis of the English Premier League during last year’s famous season. He fit a Bayesian model using Stan, and the R markdown file is here.

The analysis has three interesting features:

1. Team ability is allowed to continuously vary throughout the season; thus, once the season is over, you can see an estimate of which teams were improving or declining.

2. But that’s *not* what is shown in the plot above. Rather, the plot above shows estimated team abilities after the model was fit to prior information plus week 1 data alone; prior information plus data from weeks 1 and 2; prior information plus data from weeks 1, 2, and 3; etc. For example, look at the plot for surprise victor Leicester City: after a few games, the team is already estimated to be in the middle of the pack; then throughout the season the team’s estimated ability gradually increases. This does not necessarily mean that we think the team improved during the season; it’s a graph showing the accumulation of information.

3. The graphs showing parameter estimates and raw data on the same scale, which I find helpful for understanding the information that goes into the fit.

From a modeling standpoint, item 1 above is the most important: with Bayesian inference and Stan, we can estimate continuously varying parameters even though we only get one game per week from each team.

From the standpoint of interpreting the results, I like item 2 in that it addresses the question of when those notorious 5000-1 odds should’ve been lowered. Based on the above graph, it looks like the answer is: Right away, after the very first game. It’s not that the first game of the season necessarily offers any useful clue to the future, but it does inform a bit about the possibilities. The point is not that Leicester City’s early performance signaled that they were a top team; it’s that the data did not rule out that possibility at 5000:1 odds.

Also, Figure 2 is kinda counterintuitive, in that you might think that if you already have a time-varying parameter, you could just plot its estimate as a function of time and you’d be done. But, no, if you want to see how inferences develop during the season, you need to re-fit the model after each new week of data (or use some sort of particle-filtering, importance-sampling approach, but in this case it’s easier to use the hammer and just re-fit the entire model each time in Stan.

**P.S.** The R code is in the markdown file. Here’s the Stan program:

data { intnteams; // number of teams (20) int ngames; // number of games int nweeks; // number of weeks int home_week[ngames]; // week number for the home team int away_week[ngames]; // week number for the away team int home_team[ngames]; // home team ID (1, ..., 20) int away_team[ngames]; // away team ID (1, ..., 20) vector[ngames] score_diff; // home_goals - away_goals row_vector[nteams] prev_perf; // a score between -1 and +1 } parameters { real b_home; // the effect of hosting the game in mean of score_diff dist. real b_prev; // regression coefficient of prev_perf real sigma_a0; // teams ability variation real tau_a; // hyper-param for game-to-game variation real nu; // t-dist degree of freedom real sigma_y; // score_diff variation row_vector [nteams] sigma_a_raw; // game-to-game variation matrix[nweeks,nteams] eta_a; // random component } transformed parameters { matrix[nweeks, nteams] a; // team abilities row_vector [nteams] sigma_a; // game-to-game variation a[1] = b_prev * prev_perf + sigma_a0 * eta_a[1]; // initial abilities (at week 1) sigma_a = tau_a * sigma_a_raw; for (w in 2:nweeks) { a[w] = a[w-1] + sigma_a .* eta_a[w]; // evolution of abilities } } model { vector[ngames] a_diff; // Priors nu ~ gamma(2,0.1); b_prev ~ normal(0,1); sigma_a0 ~ normal(0,1); sigma_y ~ normal(0,5); b_home ~ normal(0,1); sigma_a_raw ~ normal(0,1); tau_a ~ cauchy(0,1); to_vector(eta_a) ~ normal(0,1); // Likelihood for (g in 1:ngames) { a_diff[g] = a[home_week[g],home_team[g]] - a[away_week[g],away_team[g]]; } score_diff ~ student_t(nu, a_diff + b_home, sigma_y); } generated quantities { vector[ngames] score_diff_rep; for (g in 1:ngames) score_diff_rep[g] = student_t_rng(nu, a[home_week[g],home_team[g]] - a[away_week[g],away_team[g]]+b_home, sigma_y); }

As explained in the markdown file, the generated quantities are there for posterior predictive checks.

I really think Stan is a key part of reproducible science in that the models are so clear and portable.

Since I love soccer and thus as a sort of fun note: the top EPL player this year switched from Leicester to Chelsea AND he only spent the one year at Leicester. Defensive midfielder with no offensive skill but a huge disrupter and he’s maybe, maybe 5’6″. So Leceister wins and fades away this year, then Chelsea wins after a terrible season. Note this for modeling purposes. As in, if you have N’Golo Kanté, then you will immediately improve substantially but that’s hard to see in statistics, even in broken-up passes because they start to avoid his area. Makes me think of the Shane Battier analysis and the development of NBA adanced individual analytics (which are to me suspect because there’s so much confounding). Soccer is hard to unravel and modeling using results compresses a lot of complex behaviors. Some stuff is easy: get Messi, you’re a much better team, get LeBron and you should win 50 games with a cast of whomevers. Get Shane Battier? Maybe that’s because his skill set is complementary when there are other good players, etc. The deeper levels of modeling are resistant, which leads to a comment about so much of the work you discuss here: as in the Kangaroo analogy, much of the stuff isn’t replicable because it’s glimpses into confounded and complex behavior. It’s nice to work with results but sometimes they really aren’t results, but are more like games played in a long season.

Confounding seems like much more of an issue when evaluating player value (an inherently causal quantity) than estimating team ability. In fact, I’m not sure I see how it’s an issue at all when estimating team ability.

On a similar note, I’m a bit surprised that this model is only updating on the difference in home and away goals per game. Considering that scoring a goal in soccer is a low probability event, wouldn’t it make sense to incorporate measures of team ability for which we have more data and a wider range of outcomes? There’s shots, shots on goal, corners, possession, passing completion, shots saved, crossed delivered, crosses blocked… if I were gambling on an outcome, I’d feel better having these in my model than just goal difference – despite the obvious fact that goal difference determines the payout.

For all I know you’d get nearly the same results. Might be a fun summer project!

The other time-varying component is who plays who when. I don’t know about English soccer but in other sports I know of the highest ranked teams (preseason) play the lowest ranked teams first and play the highest ranked teams towards the end of the season – it makes the end of the season more exciting.

Looking at the plots for Norwich/Bournemouth versus Man City/Arsenal, if I am understanding it correctly, gives the impression that the priors are quite strong.

You don’t have this type of schedule in the European football competitions. What is maybe of interest concerning who plays when is that the higher ranking teams, based on last season’s performance, sometimes are prone to losing points after they played a game for one of the European competitions (Champions League, Europa League) for which they qualified based on their ranking.