A different Bayesian World Cup model using Stan (opportunity for model checking and improvement)

Maurits Evers writes:

Inspired by your posts on using Stan for analysing football World Cup data here and here, as well as the follow-up here, I had some fun using your model in Stan to predict outcomes for this year’s football WC in Qatar. Here’s the summary on Netlify. Links to the code repo on Bitbucket are given on the website.

Your readers might be interested in comparing model/data/assumptions/results with those from Leonardo Egidi’s recent posts here and here.

Enjoy, soccerheads!

P.S. See comments below. Evers’s model makes some highly implausible predictions and on its face seems like it should not be taken seriously. From the statistical perspective, the challenge is to follow the trail of breadcrumbs and figure out where the problems in the model came from. Are they from bad data? A bug in the code? Or perhaps a flaw in the model so that the data were not used in the way that was intended? One of the great things about generative models is that they can be used to make lots and lots of predictions, and this can help us learn where we have gone wrong. I’ve added a parenthetical to the title of this post to emphasize this point. Also good to be reminded that just cos a method uses Bayesian inference, that doesn’t mean that its predictions make any sense! The output is only as good as its input and how that input is processed.

Update 2 – World Cup Qatar 2022 Predictions with footBayes/Stan

Time to update our World Cup 2022 model!

The DIBP (diagonal-inflated bivariate Poisson) model performed very well in the first match-day of the group stage in terms of predictive accuracy – consider that the ‘pseudo R-squared’, namely the geometric mean of the probabilities assigned by the model to the ‘true’ final match results, is about 0.4, whereas, on average, the main bookmakers got about 0.36.
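
For readers who want to reproduce this kind of summary, here is a minimal sketch of the ‘pseudo R-squared’ computation, i.e. the geometric mean of the probabilities the model assigned to the observed results; the probabilities below are made up purely for illustration.

```r
# Pseudo R-squared: geometric mean of the probabilities assigned by the model
# to the observed final results (made-up probabilities, for illustration only).
p_observed <- c(0.45, 0.28, 0.61, 0.33, 0.52)
pseudo_r2 <- exp(mean(log(p_observed)))
pseudo_r2   # about 0.42 for these made-up values
```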

It’s now time to re-fit the model after the first 16 group-stage games with the footBayes R package and obtain the probabilistic predictions for the second match-day. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage played from November 25th to November 28th, along with some ppd ‘chessboard plots’ for the exact outcomes in gray-scale color – ‘mlo’ in the table denotes the ‘most likely result’, whereas darker regions in the plots correspond to more likely results.

Plot/table updates (see Andrew’s suggestions from the previous post; we’re still developing these plots to improve their appearance, and there are some more notes below). In the plots below, the first team listed in each sub-title is the ‘favorite’ (x-axis), whereas the second team is the ‘underdog’ (y-axis). The 2-way grid displays the 16 held-out matches in such a way that closer matches appear at the top-left of the grid, whereas more unbalanced matches (‘blowouts’) appear at the bottom-right. The matches are thus ordered from top-left to bottom-right by increasing winning probability for the favorite team. The table instead lists the matches in chronological order.

The most unbalanced game seems to be Brazil-Switzerland, where Brazil is the favorite team with an estimated winning probability of about 71%. The closest game seems to be Iran-Wales – Iran just won by a two-goal margin, with both goals scored in the last ten minutes! – whereas France is given only a 44% probability of winning against Denmark. Argentina seems to be ahead against Mexico, whereas Spain seems to have a non-negligible advantage in the match against Germany.

Another predictive note: regarding the ‘most likely outcomes’ (‘mlo’ above), the model ‘guessed’ 4 of the 16 ‘mlo’ in the previous match-day.

You can find the complete results, R code, and analysis here.

Some more technical notes/suggestions about the table and the plots above:

  • We replaced ‘home’ and ‘away’ by ‘favorite’ and ‘underdog’.
  • I find it difficult to handle ‘xlab’ and ‘ylab’ in faceted plots with ggplot2! (A better solution could in fact be to put the team names directly on the axes of each sub-plot; see the sketch after this list.)
  • The value ‘4’ actually stands for ‘4+’, meaning that it captures the probability of scoring 4 or more goals (I did not like the ‘4+’ tick label in the plot, so we just used ‘4’; we could improve this).
  • We could consider adding global x- and y-axes showing the probability margin between favorite and underdog. Thus, for Brazil-Switzerland, we would have a tick on the x-axis at approximately 62%, whereas for Iran-Wales at 5%.
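
As promised above, here is a rough ggplot2 sketch of the kind of faceted ‘chessboard plot’ we have in mind, with made-up probabilities for two matches; the axis labels, grayscale fill, and the ‘4 = 4 or more goals’ convention follow the notes above, while the data and styling details are just placeholders.

```r
# Sketch of a faceted 'chessboard plot': made-up posterior predictive
# probabilities over exact scores for two matches, darker = more likely.
library(ggplot2)

grid <- expand.grid(match          = c("Brazil - Switzerland", "Iran - Wales"),
                    favorite_goals = 0:4,    # '4' stands for '4 or more'
                    underdog_goals = 0:4)
set.seed(1)
grid$prob <- ave(runif(nrow(grid)), grid$match, FUN = function(x) x / sum(x))

ggplot(grid, aes(x = favorite_goals, y = underdog_goals, fill = prob)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") +
  facet_wrap(~ match) +
  labs(x = "Favorite goals", y = "Underdog goals", fill = "Probability") +
  theme_minimal()
```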

For other technical notes and model limitations check the previous post.

Next steps: we are going to update the predictions for the third match-day and even compute some World Cup winning probabilities through a forward simulation of the whole tournament.

Stay tuned!

4th down update and my own cognitive illusion

Following up on our recent discussion regarding going for it on 4th down, Paul Campos writes:

The specific suggestion here is that tactics that might make sense in much lower scoring eras cease to make sense when scoring becomes higher, but neither coaches nor fans adjust to the new reality, or adjust very slowly.

This explanation doesn’t really work for the NFL, since scoring in that league has been remarkably stable for the entire post-WWII era. When we look at NFL scoring averages, it’s obvious that the game’s rules makers are constantly tweaking the rules to maintain a balance between offense and defense that results in a scoring average of about 20-23 points per game per team, with significant changes being made whenever — such as in the late 1970s when pass blocking rules were liberalized — scoring begins to fall outside this very narrow range.

I had no idea! I remember when I was a kid there was a Super Bowl that was 16-6. Before that the Dolphins beat the Redskins 14-7, and then there was that Jets-Colts Super Bowl which was a few years before my time. Nowadays it seems like the games all end up with scores like 42-37. So it had been my general impression that average points per game had approximately doubled during the past few decades.

Actually, though, yeah, at least in the regular season the scoring has been very stable, going from an average of 20.5 points per team per game in 1980 to an average of 23.0 in 2021. OK, actually 23.0 is a bit higher than 20.5 (and I’m not cheating here by picking atypical years; you can follow the above link to see the numbers).

Also, I was a football fan in the mid-70s, which was a relatively low-scoring period, with about 19 points per team per game on average.

My cognitive illusion

So yes, there has been an increase in scoring during the past several decades, but not by nearly as much as I’d thought. I feel like there’s an illusion here, which has two steps:

1. A 12% increase (from 20.5 points per game to 23.0) might seem small, especially when spread out over decades, but it was actually noticeable to a casual observer.

2. I did notice the increase, but in noticing it I way overestimated it.

I wonder if my error is similar to the error that economists Gertler et al. made when overestimating the effect of early childhood intervention. As you might recall, they reported a statistically significant effect of 42% on earnings. But to be “statistically significant,” the estimate had to be at least about 40%. If you follow the general procedure of reporting statistically significant results, your estimates will be biased upward in magnitude (“type M error”).
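
To see the mechanism, here’s a tiny simulation in the spirit of our type M error discussions; the true effect (10%) and standard error (20%) are made up, chosen only so that statistical significance requires an estimate of roughly 40%, as in the example above.

```r
# Type M error sketch: if only 'statistically significant' estimates get
# reported, the reported magnitudes are biased upward.  Made-up numbers.
set.seed(123)
true_effect <- 0.10
se          <- 0.20
estimates   <- rnorm(1e6, mean = true_effect, sd = se)
significant <- abs(estimates) > 1.96 * se        # two-sided 5% threshold
mean(significant)                                # power: well under 10%
mean(abs(estimates[significant]))                # reported magnitude: roughly 0.48
```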

Now consider my impressions of trends in football scoring. Whatever impression I had of these trends came from various individual games that I’d heard about: not a random sample but a small sample in any case. Given that average scores have increased in the past few decades, it makes sense that my recollections would also be of an increase—but my recollections represent a very noisy estimate. Had I remembered not much change, I wouldn’t think much about it. But the games that happened to come to mind were low-scoring games in the past and high-scoring recent games. Also, it could be that trends in Super Bowl scores are different than trends in regular-season averages. In any case, the point is that I’m more likely to notice big changes; thus, conditional on my noticing something, it makes sense that my estimate was an overestimate.

One thing that never seems to come up in these discussions is that the fans (or at least, some large subset of “the fans”) want less punting and more chances. As I wrote in my original post, as a kid, I always loved when teams would go for it on 4th down or try an onside kick or run trick plays like fake punts, double reverses, etc.

A different issue that some people brought up in comments was that the relative benefits of different offensive strategies will in general depend on what the defenses are doing. Still, I’m guessing it will pretty much always be a good idea to go for it with 4th-and-2 on the 50-yard line early in the game, and for many years this was more of an automatic punt situation.

Going for it on 4th down: What’s striking is not so much that we were wrong, but that we had so little imagination that we didn’t even consider the possibility that we might be wrong.

In retrospect, it’s kind of amazing how narrow our sports thinking used to be. As a kid, I always loved when teams would go for it on 4th down or try an onside kick or run trick plays like fake punts, double reverses, etc., but I just assumed that the standard by-the-book approach was the best. The idea that going for it on 4th down was not just fun but also a smart move . . . I had no idea, and I don’t recall any sportswriters or TV commentators suggesting it.

That said, I know next to nothing about football analytics, and it’s possible that these unconventional plays had less of an expected-value payoff back in the 70s when field position was more important and points were harder to come by.

I guess part of the problem is, to use some psychology and statistics jargon, a cognitive bias induced by ecological correlation. There always were some teams that tried unconventional plays, but they tended to be less successful teams that tried these tactics as a last resort. The Oklahomas, the Michigans, the Vikings and Steelers didn’t need this sort of thing. The only thing at all out of the ordinary that I can remember being used routinely is Dallas’s two-minute offense with Roger Staubach in the shotgun, and that was a rare exception, as I recall it.

Consider a sequence over the decades:

1. Tactics are developed during the play-in-the-mud, Army-beats-Navy-3-to-0 era.

2. Conservative coaches stick with these tactics for decades.

3. Spectators are so used to things being done that way that they don’t even question it.

4. Analytics revolution.

5. Even now, coaches shade toward the conservative choices, even when stakes are high.

We’re now in step 5. In his above-linked post, Campos expresses frustration about it. And I get his frustration, as this is similar to my frustrations about misconceptions in science, or clueless political reporting, or whatever. But what really intrigues me is step 3, the subject of this post, which is how we were so deep inside this particular framework of assumptions that we couldn’t even see out. Or, it’s not that we couldn’t see out, but that we didn’t even know we were inside all this time.

I wonder what Gerd Gigerenzer, Daniel Kahneman, Josh “hot hand” Miller, and other experts on cognitive illusions think about this one.

P.S. We discussed some of this back in 2006, but there we were focused on the question of why teams were almost always punting on 4th down. Now that it’s become routine to go for it on 4th down, the question shifts to why it took so long and why the new approach hasn’t completely taken over.

Football World Cup 2022 Predictions with footBayes/Stan

It’s time for football (aka soccer) World Cup Qatar 2022 and statistical predictions!

This year my collaborator Vasilis Palaskas and I implemented a diagonal-inflated bivariate Poisson model for the scores through our `footBayes` R CRAN package (which depends on the `rstan` package), using as a training set more than 3000 international matches played during 2018-2022. The model incorporates dynamic autoregressive priors for the team-specific attack and defense abilities and uses the Coca-Cola/FIFA ranking difference as the only predictor. The model, first proposed by Karlis & Ntzoufras in 2003, extends the usual bivariate Poisson model by inflating the probability of draws. Weakly informative prior distributions are assumed for the remaining parameters, whereas sum-to-zero constraints on the attack/defense abilities are imposed to achieve model identifiability. Previous World Cup and Euro Cup models posted on this blog can be found here, here and here.

Here is the new model for the joint scores (X, Y) of a soccer match. In brief:
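
For readers who want the formulas spelled out, here is a sketch of the diagonal-inflated bivariate Poisson likelihood along the lines of Karlis and Ntzoufras (2003); the exact parameterization used in footBayes may differ in its details.

```latex
% Diagonal-inflated bivariate Poisson (sketch; details may differ from footBayes)
\[
\Pr(X = x, Y = y) =
\begin{cases}
(1 - p)\,\mathrm{BP}(x, y \mid \lambda_1, \lambda_2, \lambda_3), & x \neq y, \\[4pt]
(1 - p)\,\mathrm{BP}(x, y \mid \lambda_1, \lambda_2, \lambda_3) + p\,D(x \mid \theta), & x = y,
\end{cases}
\]
\[
\log \lambda_1 = \mu + \mathrm{att}_{h,t} + \mathrm{def}_{a,t} + \beta\,(\mathrm{rank}_h - \mathrm{rank}_a), \qquad
\log \lambda_2 = \mu + \mathrm{att}_{a,t} + \mathrm{def}_{h,t} - \beta\,(\mathrm{rank}_h - \mathrm{rank}_a),
\]
% Here BP is the bivariate Poisson distribution, D(. | theta) is a discrete
% distribution that inflates the draws, and the attack/defense abilities follow
% dynamic autoregressive priors with sum-to-zero constraints.
```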

We fitted the model using HMC sampling, with 4 Markov chains and 2000 HMC iterations each, checking their convergence and effective sample sizes. Here are the posterior predictive match probabilities for the held-out matches of the Qatar 2022 group stage, played from November 20th to November 24th, along with some ppd ‘chessboard plots’ for the exact outcomes in gray-scale color (‘mlo’ in the table denotes the ‘most likely result’, whereas darker regions in the plots correspond to more likely results):
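
For concreteness, here is a rough sketch of how such posterior predictive match probabilities and the ‘mlo’ can be read off simulated scores; `x_rep` and `y_rep` below are placeholder names for the matrices of posterior predictive goal draws, faked here with independent Poissons just so the code runs.

```r
# From posterior predictive score draws to match probabilities and 'mlo'.
# x_rep, y_rep: S x N matrices of simulated goals (S draws, N matches);
# faked with independent Poissons here purely as placeholders.
S <- 4000; N <- 2
x_rep <- matrix(rpois(S * N, lambda = 1.6), S, N)
y_rep <- matrix(rpois(S * N, lambda = 1.1), S, N)

match_summary <- function(x, y) {
  x4 <- pmin(x, 4); y4 <- pmin(y, 4)                      # '4' collapses 4+ goals
  tab <- table(factor(x4, levels = 0:4), factor(y4, levels = 0:4)) / length(x)
  mlo <- unname(which(tab == max(tab), arr.ind = TRUE)[1, ]) - 1
  c(p_fav = mean(x > y), p_draw = mean(x == y), p_und = mean(x < y),
    mlo_fav = mlo[1], mlo_und = mlo[2])
}
t(sapply(seq_len(N), function(j) match_summary(x_rep[, j], y_rep[, j])))
```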

Better teams are given higher chances by the model in these first group-stage matches:

  • In Portugal-Ghana, Portugal has an estimated winning probability of about 81%, whereas in Argentina-Saudi Arabia Argentina has an estimated winning probability of about 72%. The match between England and Iran seems instead more balanced, and a similar trend is observed for Germany-Japan. The USA is estimated to be ahead in the match against Wales, with a winning probability of about 47%.

Some technical notes and model limitations:

  • Keep in mind that ‘home’ and ‘away’ do not mean anything in particular here – the only home team is Qatar! – they just refer to the first and second team listed for each match. ‘mlo’ denotes the most likely exact outcome.
  • The posterior predictive probabilities are reported to three decimal places, which could sound a bit ‘bogus’… However, we transparently report the ppd probabilities as those returned by our package computations.
  • One could use these probabilities for betting purposes, for instance by betting on that particular result – among home win, draw, or away win – for which the model probability exceeds the bookmaker-implied probability. However, we are not responsible for your losses!
  • Why a diagonal-inflated bivariate Poisson model, and not another model? We developed some sensitivity checks in terms of leave-one-out CV on the training set to choose the best model (see the sketch after this list). Furthermore, we also checked our model in terms of calibration measures and posterior predictive checks.
  • The model incorporates the (rescaled) FIFA ranking as the only predictor, so we do not have many relevant covariates here.
  • We did not distinguish between friendly matches, World Cup qualifiers, Euro qualifiers, etc. in the training data; rather, we treat all the data as coming from the same ‘population’ of matches. This assumption could hurt the predictive performance.
  • We do not incorporate any individual player-level information in the model, and this too could be a major limitation.
  • We’ll compute some prediction scores – Brier score, pseudo R-squared – to check the predictive power of the model.
  • We’ll re-fit this model after each stage, adding the previous matches to the training set and predicting the next matches.
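
As mentioned in the notes above, here is a rough sketch of two of these checks: a multiclass Brier score on made-up probabilities, and the leave-one-out comparison via the loo package; `fit_bp` and `fit_dibp` are placeholder names for Stan fits of the plain and diagonal-inflated bivariate Poisson models, assumed to store a `log_lik` matrix in the generated quantities.

```r
library(loo)

# (1) Multiclass Brier score: probs is an N x 3 matrix of (win, draw, loss)
# probabilities, outcome codes the observed result as 1, 2, or 3 (made-up values).
probs   <- rbind(c(0.55, 0.25, 0.20),
                 c(0.30, 0.35, 0.35),
                 c(0.70, 0.20, 0.10))
outcome <- c(1, 3, 2)
mean(rowSums((probs - diag(3)[outcome, ])^2))   # lower is better

# (2) Approximate leave-one-out CV comparison (placeholder fit objects):
# ll_bp    <- extract_log_lik(fit_bp,   merge_chains = FALSE)
# ll_dibp  <- extract_log_lik(fit_dibp, merge_chains = FALSE)
# loo_bp   <- loo(ll_bp,   r_eff = relative_eff(exp(ll_bp)))
# loo_dibp <- loo(ll_dibp, r_eff = relative_eff(exp(ll_dibp)))
# loo_compare(loo_bp, loo_dibp)   # the model in the first output row is preferred
```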

This model is just an approximation for a very complex football tournament. Anyway, we strongly support scientific replication, and for that reason the reports, data, and R/RMarkdown code can all be found here, on my personal web page. Feel free to play with the data and fit your own model!

And stay tuned for the next predictions in the blog. We’ll add some plots, tables and further considerations. Hopefully, we’ll improve predictive performance as the tournament proceeds.

Another book about poker

I just finished Last Call, a science fiction novel by Tim Powers, which I’m mentioning here to add to our list of literary descriptions of poker. Last Call is pretty good: it’s full of action and it reads like a cross between Stephen King, Roger Zelazny, and George Pelecanos. I thought the ending was weak, but, hey, nobody’s perfect.

The poker scenes in Last Call were carried out well. The only problem I had was in some of the exposition near the beginning, where it seemed that the author was regurgitating a bunch of Frank Wallace’s classic, “Poker: A guaranteed income for life by using the advanced concepts of poker,” even to the extent of repeating the anecdote about the sandwich. Wallace’s book remains very readable, and I have no problem using it as background, but it’s gotta be processed first so it doesn’t look like raw research.

Tigers need your help.

Jim Logue, Director of Baseball R&D at the Detroit Tigers, writes:

We are now hiring a Principal Quantitative Analyst. With this position we’re looking for someone with extensive Bayesian experience, with a secondary emphasis on baseball knowledge.

The Tigers went 66-96 last year so the good news is that if you join them now you can take some credit for whatever improvement they show next year!

I assume that knowledge of Stan will be a plus.

Cheating in sports vs. cheating in journalism vs. cheating in science

Sports cheating has been in the news lately. Nothing about the Astros, but the chess-cheating scandal that people keep talking about—or, at least, people keep sending me emails asking me to blog about it—and the cheating scandals in poker and fishing. All of this, though, is nothing compared to the juiced elephant in the room: the drug-assisted home run totals of 1998-2001, which kept coming up during the past few months as Aaron Judge approached and then eventually reached the record-breaking total of 62 home runs during a season.

On this blog we haven’t talked much about cheating in sports (there was this post, though, and also something a few years back about one of those runners who wasn’t really finishing the races), but we’ve occasionally talked about cheating in journalism (for example here, here, here, and here—hey, those last two are about cheating in journalism about chess!), and we’ve talked lots and lots about cheating in science.

So this got me thinking: What are the differences between cheating in sports, journalism, and science?

1. The biggest difference that I see is that in sports, when you cheat, you’re actually doing what you claim to do, you’re just doing it using an unauthorized method. With cheating in journalism and science, typically you’re not doing what you claimed.

Let me put it this way: Barry Bonds may have juiced, but he really did hit 7 zillion home runs. Lance Armstrong doped, but he really did pedal around France faster than anyone else. Jose Altuve really did hit the ball out of the park. Stockfish-aided or not, that dude really did checkmate the other dude’s king. Etc. The only cases I can think of, where the cheaters didn’t actually do what they claimed to do, are the Minecraft guy, Rosie Ruiz, and those guys who did a “Mark Twain” on their fish. Usually, what sports cheaters do is use unapproved methods to achieve real ends.

But when journalism cheaters cheat, the usual way they do it is by making stuff up. That is, they put things in the newspaper that didn’t really happen. The problem with Stephen Glass or Janet Cooke or Jonah Lehrer is not that they achieved drug-enhanced scoops or even that they broke some laws in order to break some stories. No, the problem was that they reported things that weren’t true. I’m not saying that journalism cheats are worse than sports cheats, just that it’s a different thing. Sometimes cheating writers cheat by copying others’ work without attribution, and that alone doesn’t necessarily lead to falsehoods getting published, but it often does, which makes sense: once you start copying without attribution, it becomes harder for readers to track down your sources and find your errors, which in turn makes it easier to be sloppy and reduces the incentives for accuracy.

When scientists cheat, sometimes it’s by just making things up, or presenting claims with no empirical support—for example, there’s no evidence that the Irrationality guy ever had that custom-made shredder, or that the Pizzagate guy ever really ran a “masterpiece” of an experiment with a bottomless soup bowl or had people lift an 80-pound rock, or that Mary Rosh ever did that survey. Other times they just say things that aren’t true, for example describing a 3-day study as “long-term”. In that latter case you might say that the scientist in question is just an idiot, not a cheater—but, ultimately, I do think it’s a form of cheating to publish a scientific paper with a title that doesn’t describe its contents.

But I think the main way scientists cheat is by being loose enough with their reasoning that they can make strong claims that aren’t supported by the data. Is this “cheating,” exactly? I’m not sure. Take something like that ESP guy or the beauty-and-sex-ratio guy who managed to find statistical methods that give them the answers they want. At some level, the boundary between incompetence and cheating doesn’t really matter; recall Clarke’s Law.

The real point here, though, is that, whatever you want to call it, the problem with bad science is that it comes up with false or unsupported claims. To put it another way: it’s not that Mark Hauser or whoever is taking some drugs that allow him to make a discovery that nobody else could make; the problem is that he’s claiming something’s a discovery but it isn’t. To put it yet another way: there is no perpetual motion machine.

The scientific analogy to sports cheating would be something like . . . Scientist B breaks into Scientist A’s lab, steals his compounds, and uses them to make a big discovery. Or, Scientist X cuts corners by using some forbidden technique, for example violating some rule regarding safe disposal of chemical waste, and this allows him to work faster and make some discovery. But I don’t get a sense that this happens much, or at least I don’t really hear about it. There was the Robert Gallo story, but even there the outcome was not a new discovery, it was just a matter of credit.

And the journalistic analogy to sports cheating would be something like that hacked phone scandal in Britain a few years back . . . OK, I guess that does happen sometimes. But my guess is that the kinds of journalists who’d hack phones are also the kind of journalists who’d make stuff up or suppress key parts of a story or otherwise manipulate evidence in a way to mislead. In which case, again, they can end up publishing something that didn’t happen, or polluting the scientific and popular literature.

2. Another difference is that sports have a more clearly-defined goal than journalism or science. An extreme example is bicycle racing: if the top cyclists are doping and you want to compete on their level, then you have to dope also; there’s simply no other option. But in journalism, no matter how successful Mike Barnicle was, other journalists didn’t have to fabricate to keep up with him. There are enough true stories to report, that honest journalists can compete. Yes, restricting yourself to the truth can put you at a disadvantage, but it doesn’t crowd you out entirely. Similarly, if you’re a social scientist who’s not willing to fabricate surveys or report hyped-up conclusions based on forking paths, yes, your job is a bit harder, but you can still survive in the publication jungle. There are enough paths to success that cheating is not a necessity, even if it’s a viable option.

3. The main similarity I see among misbehavior in sports, journalism, and science is that the boundary between cheating and legitimate behavior is blurry. When “everybody does it,” is it cheating? With science there’s also the unclear distinction between cheating and simple incompetence—with the twist that incompetence at scientific reasoning could represent a sort of super-competence at scientific self-promotion. Only a fool would say that the replication rate in psychology is “statistically indistinguishable from 100%”—but being that sort of fool can be a step toward success in our Ted/Freakonomics/NPR media environment. You’d think that professional athletes would be more aware of what drugs they put in their bodies than scientists would be aware of what numbers they put into their t-tests, but sports figures have sometimes claimed that they took banned drugs without their knowledge. The point is that a lot is happening at once, and there are people who will do what it takes to win.

4. Finally, it can be entertaining to talk about cheating in science, but as I’ve said before, I think the much bigger problem is scientists who are not trying to cheat but are just using bad methods with noisy data. Indeed, the focus on cheaters can let incompetent but sincere scientists off the hook. Recall our discussion from a few years ago: The flashy crooks get the headlines, but the bigger problem is everyday routine bad science done by non-crooks. Similarly, with journalism, I’d say the bigger problem is not the fabricators so much as the everyday corruption of slanted journalism, and public relations presented in journalistic form. To me, the biggest concern with journalistic cheating is not so much the cases of fraud as much as when the establishment closes ranks to defend the fraudster, just as in academia there’s no real mechanism to do anything about bad science.

Cheating in sports feels different, maybe in part because a sport is defined by its rules in a way that we would not say about journalism or science.

P.S. After posting the above, I got to thinking about cheating in business, politics, and war, which seem to me to have a different flavor than cheating in sports, journalism, or science. I have personal experience in sports, journalism, and science, but little to no experience in business, politics, and war. So I’m just speculating, but here goes:
To me, what’s characteristic about cheating in business, politics, and war is that some flexible line is pushed to the breaking point. For example, losing candidates will often try to sow doubt about the legitimacy of an election, but they rarely take it to the next level and get on the phone with election officials and demand they add votes to their total. Similarly with business cheating such as creative accounting, dumping of waste, etc.: it’s standard practice to work at the edge of what’s acceptable, but cheaters such as the Theranos gang go beyond hype to flat-out lying. Same thing for war crimes: there’s no sharp line, and cheating or violation arises when armies go far beyond what is currently considered standard behavior. This all seems different than cheating in sports, journalism, or science, all of which are more clearly defined relative to objective truth.

I think there’s more to be said on all this.

How did Bill James get this one wrong on regression to the mean? Here are 6 reasons:

I’m a big fan of Bill James, but I think he might be picking up the wrong end of the stick here.

The great baseball analyst writes about what he calls the Law of Competitive Balance. His starting point is that teams that are behind are motivated to work harder to have a chance of winning, which moves them to switch to high-variance strategies such as long passes in football (more likely to score a touchdown, also more likely to get intercepted), etc. Here’s Bill James:

Why was there an increased chance of a touchdown being thrown?

Because the team was behind.

Because the team was behind, they had an increased NEED to score.

Because they had an increased need to score points, they scored more points.

That is one of three key drivers of The Law of Competitive Balance: that success increases when there is an increased need for success. This applies not merely in sports, but in every area of life. But in the sports arena, it implies that the sports universe is asymmetrical. . . .

Because this is true, the team which has the larger need is more willing to take chances, thus more likely to score points. The team which is ahead gets conservative, predictable, limited. This moves the odds. The team which, based on their position, would have a 90% chance to win doesn’t actually have a 90% chance to win. They may have an 80% chance to win; they may have an 88% chance to win, they may have an 89.9% chance to win, but not 90.

I think he’s mixing a correct point here with an incorrect point.

James’s true statement is that, as he puts it, “there is an imbalance in the motivation of the two teams, an imbalance in their willingness to take risks.” The team that’s behind is motivated to switch to strategies that increase the variance of the score differential, even at the expense of lowering its expected score differential. Meanwhile, the team that’s ahead is motivated to switch to strategies that decrease the variance of the score differential, even at the expense of lowering its expected score differential. In basketball, it can be as simple as the team that’s behind pushing up the pace and the team that’s ahead slowing things down. The team that’s trailing is trying to have some chance of catching up—their goal is to win, not to lose by a smaller margin; conversely, the team that’s in the lead is trying to minimize the chance of the score differential going to zero, not to run up the score. As James says, these patterns are averages and won’t occur from play to play. Even if you’re behind by 10 in a basketball game with 3 minutes to play, you’ll still take the open layup rather than force the 3-pointer, and even if you’re ahead by 10, you’ll still take the open shot with 20 seconds left on the shot clock rather than purely trying to eat up time. But on average the logic of the game leads to different strategies for the leading and trailing teams, and that will have consequences on the scoreboard.
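
To make the variance point concrete, here’s a toy calculation under a normal approximation for the final score differential (ties ignored); all the numbers are invented.

```r
# A trailing team can raise its win probability by raising the variance of the
# remaining score differential, even while lowering its expected differential.
p_win <- function(mu, sigma) 1 - pnorm(0, mean = mu, sd = sigma)

p_win(mu = -5, sigma = 8)    # 'standard' strategy:     about 0.27
p_win(mu = -6, sigma = 13)   # high-variance strategy:  about 0.32
```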

James’s mistake is to think, when this comes to probability of winning, that this dynamic on balance favors the team that’s behind. When strategies are flexible, the team that’s behind does not necessarily increase its probability of winning relative to what that probability would be if team strategies were constant. Yes, the team that’s behind will use strategies to increase the probability of winning, but the team that’s ahead will alter its strategy too. Speeding up the pace of play should, on average, increase the probability of winning for the trailing team (for example, increasing the probability from, I dunno, 10% to 15%), but meanwhile the team that’s ahead is slowing down the pace of play, which should send that probability back down. On net, will this favor the leading team or the trailing team when it comes to win probability? It will depend on the game situation. In some settings (for example, a football game where the team that’s ahead has the ball on first down with a minute left), it will favor the team that’s ahead. In other settings it will go the other way.

James continues:

That is one of the three key drivers of the Law of Competitive Balance. The others, of course, are adjustments and effort. When you’re losing, it is easier to see what you are doing wrong. Of course a good coach can recognize flaws in their plan of attack even when they are winning, but when you’re losing, they beat you over the head.

I don’t know about that. As a coach myself, I could just as well argue the opposite point, as follows. When you’re winning, you can see what works while having the freedom to experiment and adapt to fix what doesn’t work. But when you’re losing, it can be hard to know where to start or have a sense of what to do to improve.

Later on in his post, James mentions that, when you’re winning, part of that will be due to situational factors that won’t necessarily repeat. The quick way to say that is that, when you’re winning, part of your success is likely to be from “luck,” a formulation that I’m OK with as long as we take this term generally enough to refer to factors that don’t necessarily repeat, such as pitcher/batter matchups, to take an example from James’s post.

But James doesn’t integrate this insight into his understanding of the law of competitive balance. Instead, he writes:

If a baseball team is 20 games over .500 one year, they tend to be 10 games over the next. If a team is 20 games under .500 one year, they tend to be 10 games under the next year. If a team improves by 20 games in one year (even from 61-101 to 81-81) they tend to fall back by 10 games the next season. If they DECLINE by 10 games in a year, they tend to improve by 5 games the next season.

I began to notice similar patterns all over the map. If a batter hits .250 one year and .300 the next, he tends to hit about .275 the third year. Although I have not demonstrated that similar things happen in other sports, I have no doubt that they do. I began to wonder if this was actually the same thing happening, but in a different guise. You get behind, you make adjustments. You lose 100 games, you make adjustments. You get busy. You work harder. You take more chances. You win 100 games, you relax. You stand pat.

James’s description of the data is fine; his mistake is to attribute these changes to teams “making adjustments” or “standing pat.” That could be, but it could also be that teams that win 100 games “push harder” and that teams that lose 100 games “give up.” The real point is statistical, which is that this sort of “regression to the mean” will happen without any such adjustment effects, just from “luck” or “random variation” or varying situational factors.
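
Here’s a quick simulation of that purely statistical point: team ‘talent’ is held fixed, two seasons differ only through binomial luck, and teams that finish 20 games over .500 still fall back to roughly 10 over the next year; the talent distribution is invented.

```r
# Regression to the mean with no behavioral adjustment at all.
set.seed(2023)
n_teams <- 1e5
talent  <- rnorm(n_teams, mean = 0.500, sd = 0.04)   # fixed true win propensities
wins1   <- rbinom(n_teams, 162, talent)
wins2   <- rbinom(n_teams, 162, talent)              # same talent, new luck

games_over1 <- wins1 - (162 - wins1)                  # wins minus losses
games_over2 <- wins2 - (162 - wins2)

mean(games_over2[games_over1 == 20])   # about +10: half the year-1 margin,
                                       # from luck alone
```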

Here’s a famous example from Tversky and Kahneman (1973):

The instructors in a flight school adopted a policy of consistent positive reinforcement recommended by psychologists. They verbally reinforced each successful execution of a flight maneuver. After some experience with this training approach, the instructors claimed that contrary to psychological doctrine, high praise for good execution of complex maneuvers typically results in a decrement of performance on the next try.

Actually, though:

Regression is inevitable in flight maneuvers because performance is not perfectly reliable and progress between successive maneuvers is slow. Hence, pilots who did exceptionally well on one trial are likely to deteriorate on the next, regardless of the instructors’ reaction to the initial success. The experienced flight instructors actually discovered the regression but attributed it to the detrimental effect of positive reinforcement.

“Performance is not perfectly reliable and progress between successive maneuvers is slow”: That describes pro sports!

As we write in Regression and Other Stories, the point here is that a quantitative understanding of prediction clarifies a fundamental qualitative confusion about variation and causality. From purely mathematical considerations, it is expected that the best pilots will decline, relative to the others, while the worst will improve in their rankings, in the same way that we expect daughters of tall mothers to be, on average, tall but not quite as tall as their mothers, and so on.
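
In standardized units (and under the usual linear, bivariate-normal setup), the whole point fits in one line, with rho the correlation between the two measurements:

```latex
\[
\mathrm{E}\!\left[\, z_{2} \mid z_{1} \,\right] = \rho \, z_{1}, \qquad |\rho| < 1,
\]
% so whatever is extreme at time 1 is expected to be less extreme, in
% standard-deviation units, at time 2 -- no behavioral mechanism required.
```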

I was surprised to see Bill James make this mistake. All the years I’ve read him writing about the law of competitive balance and the plexiglass principle, I always assumed that he’d understood this as an inevitable statistical consequence of variation without needing to try to attribute it to poorly-performing teams trying harder etc.

How did he get this one so wrong? Here are 6 reasons.

This raises a new question, which is how could such a savvy analyst make such a basic mistake? I have six answers:

1. Multiplicity. Statistics is hard, and if you do enough statistics, you’ll eventually make some mistakes. I make mistakes too! It just happens that this “regression to the mean” fallacy is a mistake that James made.

2. It’s a basic mistake and an important mistake, but it’s not a trivial mistake. Regression to the mean is a notoriously difficult topic to teach (you can cruise over to chapter 6 of our book and see how we do; maybe not so great!).

3. Statistics textbooks, including my own, are full of boring details, so I can see that, whether or not Bill James has read any such books, he wouldn’t get so much out of them.

4. In his attribution of regression to the mean, James is making an error of causal reasoning and a modeling error, but it’s not a prediction error. The law of competitive balance and the plexiglass principle give valid predictions, and they represent insights that were not widely available in baseball (and many other fields) before James came along. Conceptual errors aside, James was still moving the ball forward, as it were. When he goes beyond prediction in his post, for example making strategy recommendations, I’m doubtful, but I’m guessing that the main influence of his “law of competitive balance” on readers comes from the predictive part.

5. Hero worship. The man is a living legend. That’s great—he deserves all his fame—but the drawback is that maybe it’s a bit too easy for him to fall for his own hype and not question himself or fully hear criticism. We’ve seen the same thing happen with baseball and political analyst Nate Silver, who continues to do excellent work but sometimes can’t seem to digest feedback from outsiders.

6. Related to point 5 is that James made his breakthroughs by fighting the establishment. For many decades he’s been saying outrageous things and standing up for his outrageous claims even when they’ve been opposed by experts in the field. So he keeps doing it, which in some ways is great but can also lead him astray, by trusting his intuitions too much and not leaving himself more open for feedback.

I guess we could say that, in sabermetrics, James was on a winning streak for a long time so he relaxes. He stands pat. He has less motivation to see what’s going wrong.

P.S. Again, I’m a big fan of Bill James. It’s interesting when smart people make mistakes. When dumb people make mistakes, that’s boring. When someone who’s thought so much about statistics makes such a basic statistical error, that’s interesting to me. And, as noted in item 4 above, I can see how James could have this misconception for decades without it having much negative effect on his work.

P.P.S. Just to clarify: Bill James made two statements. The first was predictive and correct; the second was causal and misinformed.

His first, correct statement is that there is “regression to the mean” or “competitive balance” or “plexiglas” or whatever you want to call it: players or teams that do well in time 1 tend to decline in time 2, and players or teams that do poorly in time 1 tend to improve in time 2. This statement, or principle, is correct, and it can be understood as a general mathematical or statistical pattern that arises when correlations are less than 100%. This pattern is not always true—it depends on the joint distribution of the before and after measurements (see here), but it is typically the case.

His second, misinformed statement is that this is caused by players or teams that are behind at time 1 being more innovative or trying harder and players or teams that are ahead at time 1 being complacent or standing pat. This statement is misinformed because the descriptive phenomenon of regression-to-the-mean or competitive-balance or plexiglas will happen even in the absence of any behavioral changes. And, as discussed in the above post, behavioral changes can go in either direction; there’s no good reason to think that, when both teams make strategic adjustments, these adjustments will on net benefit the team that’s behind.

This is all difficult because it is natural to observe the first, correct predictive statement and from this to mistakenly infer the second, misinformed causal statement. Indeed this inference is such a common error that it is a major topic in statistics and is typically covered in introductory texts. We devote a whole chapter to it in Regression and Other Stories, and if you’re interested in understanding this I recommend you read chapter 6; the book is freely available online.

For the reasons discussed above, I’m not shocked that Bill James made this error: for his purposes, the predictive observation has been more important than the erroneous causal inference, and he figured this stuff out on his own, without the benefit or hindrance of textbooks, and given his past success as a rebel, I can see how it can be hard for him to accept when outsiders point out a subtle mistake. But, as noted in the P.S. above, when smart people get things wrong, it’s interesting; hence this post.

Chess cheating: how to detect it (other than catching someone with a shoe phone)

This post is by Phil Price, not Andrew.

Some of you have surely heard about the cheating scandal that has recently rocked the chess world (or perhaps it’s more correct to say the ‘cheating-accusation scandal’). The whole kerfuffle started when World Champion Magnus Carlsen withdrew from a tournament after losing a game to a guy named Hans Niemann. Carlsen didn’t say at the time why he withdrew, in fact he said “I really prefer not to speak. If I speak, I am in big trouble.” Most people correctly guessed that Carlsen suspected that Niemann had cheated to win. Carlsen later confirmed that suspicion. Perhaps he didn’t say so at the start because he was afraid of being sued for slander.

Carlsen faced Niemann again in a tournament just a week or two after the initial one, and Carlsen resigned on move 2.

In both of those cases, Carlsen and Niemann were playing “over the board” or “OTB”, i.e. sitting across a chess board from each other and moving the pieces by hand. That’s in contrast to “online” chess, in which players compete by moving pieces on a virtual board. Cheating in online chess is very easy: you just run a “chess engine” (a chess-playing program) and enter the moves from your game into the engine as you play, and let it tell you what move to make next. Cheating in OTB chess is not so simple: at high-level tournaments players go through a metal detector before playing and are not allowed to carry a phone or other device. (A chess engine running on a phone can easily beat the best human players. A chess commentator once responded to the claim “my phone can beat the world chess champion” by saying “that’s nothing, my microwave can beat the world chess champion.”) But if the incentives are high enough, some people will take difficult steps in order to win. In at least one tournament it seems that a player was using a chess computer (or perhaps a communication device) concealed in his shoe.

I don’t know if there are specific allegations related to how Niemann might have cheated in OTB games. A shoe device again, which Niemann uses to both enter the moves as they occur and to get the results through vibration? A confederate who enters the moves and signals Niemann somehow (a suppository that vibrates?). I’m not really sure what the options are. It would be very hard to “prove” cheating simply by looking at the moves that are made in a single game: at the highest levels both players can be expected to play almost perfectly, usually making one of the top two or three moves on every move (as evaluated by the computer), so simply playing very very well is not enough to prove anything.


A baseball analytics job using Stan!

Tony Williams writes:

I have nothing to do with this job, but it might be interesting to your readers since they specifically mention Stan as a desired skill.

From the link:

Data Scientist, Baseball Research & Development

The Cleveland Guardians Baseball Research & Development (R&D) group is seeking data scientists at a variety of experience levels . . . You will analyze video, player tracking, and biomechanics data as well as traditional baseball data sources like box scores to help us acquire and develop baseball players into a championship-caliber team. . . .

Qualifications

– Demonstrated experience or advanced degree in a quantitative field such as Statistics, Computer Science, Economics, Machine Learning, or Operations Research.

– Programming skills in a language such as R or Python to work efficiently at scale with large data sets.

– Desire to continue learning about data science applications in baseball.

And then in the Preferred Experience section, along with “Demonstrated research experience in a sports context (baseball is a plus)” and “Experience with computer vision” and a few other things, they have:

– Experience with Bayesian statistics and languages such as Stan.

How cool is that??

And, hey! I just looked it up . . . the Guardians have a winning record this year and they’re headed for the playoffs! Nothing like the Cleveland MLB teams I remember from my childhood . . .

What’s the difference between Derek Jeter and preregistration?

There are probably lots of clever answers to this one, but I’ll go with: One of them was hyped in the media as a clean-cut fresh face that would restore fan confidence in a tired, scandal-plagued entertainment cartel—and the other is a retired baseball player.

Let me put it another way. Derek Jeter had three salient attributes:

1. He was an excellent baseball player, rated by one source at the time of his retirement as the 58th best position player of all time.

2. He was famously overrated.

3. He was a symbol of integrity.

The challenge is to hold 1 and 2 together in your mind.

I was thinking about this after Palko pointed me to a recent article by Rose McDermott that begins:

Pre-registration has become an increasingly popular proposal to address concerns regarding questionable research practices. Yet preregistration does not necessarily solve these problems. It also causes additional problems, including raising costs for more junior and less resourced scholars. In addition, pre-registration restricts creativity and diminishes the broader scientific enterprise. In this way, pre-registration neither solves the problems it is intended to address, nor does it come without costs. Pre-registration is neither necessary nor sufficient for producing novel or ethical work. In short, pre-registration represents a form of virtue signaling that is more performative than actual.

I think this is like saying, “Derek Jeter is no Cal Ripken, he’s overrated, gets too much credit for being in the right place at the right time, he made the Yankees worse, his fans don’t understand how the game of baseball really works, and it was a bad idea to promote him as the ethical savior of the sport.”

Here’s what I think of preregistration: It’s a great idea. It’s also not the solution to problems of science. I have found preregistration to be useful in my own work. I’ve seen lots of great work that is not preregistered.

I disagree with the claim in the above-linked paper that “Under the guidelines of preregistration, scholars are expected to know what they will find before they run the study; if they get findings they do not expect, they cannot publish them because the study will not be considered legitimate if it was not preregistered.” I disagree with that statement in part for the straight-up empirical reason that it’s false; there are counterexamples; indeed a couple years ago we discussed a political science study that was preregistered and yielded unexpected findings which were published and were considered legitimate by the journal and the political science profession.

More generally, I think of preregistration as a floor, not a ceiling. The preregistered data collection and analysis is what you need to do. In addition, you can do whatever else you want.

Preregistration remains overrated if you think it’s gonna fix science. Preregistration facilitates the conditions for better science, but if you preregister a bad design, it’s still a bad design. Suppose you could go back in time and preregister the collected work of the beauty-and-sex-ratio guy, the ESP guy, and the Cornell Food and Brand Lab guy, and then do all those studies. The result wouldn’t be a spate of scientific discoveries; it would just be a bunch of inconclusive results, pretty much no different than the inconclusive results we actually got from that crowd but with the improvement that the inconclusiveness would have been more apparent. As we’ve discussed before, the benefits of procedural reforms such as preregistration are indirect—making it harder for scientists to fool themselves and others with bad designs—but not direct. Are these indirect benefits greater than the costs? I don’t know; maybe McDermott is correct that they’re not. I guess it depends on the context.

I think preregistration can be valuable, and I say that while recognizing that it’s been overrated and inappropriately sold as a miracle cure for scientific corruption. As I wrote a few years ago:

In the long term, I believe we as social scientists need to move beyond the paradigm in which a single study can establish a definitive result. In addition to the procedural innovations [of preregistration and mock reports], I think we have to more seriously consider the integration of new studies with the existing literature, going beyond the simple (and wrong) dichotomy in which statistically significant findings are considered as true and nonsignificant results are taken to be zero. But registration of studies seems like a useful step in any case.

Derek Jeter was overrated. He was at times a drag on the Yankees’ performance. He was still an excellent player and overall was very much a net positive.

“You should always (always) have a Naive model. It’s the simplest, cleanest, most intuitive way to explain whether your system is at least treading water. And if it is (that’s a big IF), how much better than Naive is it.”

Jonathan Falk points us to this bit from baseball analyst Tangotiger, who writes:

Back around 2002 or so, I [Tango] was getting really (really) tired with all of the baseball forecasting systems coming out of the woodwork, each one proclaiming it was better than the next.

I set out not to be the best, but to be the worst. I needed to create a Naive model, so simple, that we can measure all the forecasting systems against it. And so transparent that anybody could recreate it. . . .

The model was straightforward:

1. limit the data to the last three years, giving more weight to the more recent seasons

2. include an aging component

3. apply a regression amount

That’s it. I basically modeled it the way a baseball fan might look at the back of a baseball card (sorry, yet another dated reference), and come up with a reasonable forecast. Very intuitive. And never, ever, would you get some outlandish or out of character forecast. Remember, I wasn’t trying to be the best. I was just trying to create a system that seemed plausible enough to keep its head above water. The replacement level of forecasting systems.

I don’t get exactly what he’s doing here, but the general principle makes sense to me. It’s related to what we call workflow, or the trail of breadcrumbs: If you have a complicated method, you can understand it by tracing a path back to some easy-to-understand baseline model. The point is not to “reject” the baseline model but to use it as a starting point for improvements.
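
For concreteness, here’s a toy version of the kind of naive projection described in Tango’s three steps, for a single rate stat; the 5/4/3 weights, the 1200-PA regression amount, and the aging tweak are my illustrative guesses, not necessarily his actual choices.

```r
# Toy naive projection for a rate stat (say on-base percentage): weight the
# last three seasons, regress toward the league mean, nudge for age.
# Weights, regression amount, and aging rule are illustrative only.
naive_projection <- function(rates, pa, age, league_mean = 0.320) {
  w        <- c(5, 4, 3)                              # most recent season first
  obs_rate <- sum(w * rates * pa) / sum(w * pa)       # weighted observed rate
  obs_pa   <- sum(w * pa) / mean(w)                   # rough effective sample size
  reg_rate <- (obs_rate * obs_pa + league_mean * 1200) / (obs_pa + 1200)
  reg_rate + 0.003 * (27 - age)                       # crude aging adjustment
}

# A hypothetical player: .360 / .345 / .330 over the last three seasons.
naive_projection(rates = c(0.360, 0.345, 0.330),
                 pa    = c(600, 580, 610),
                 age   = 29)
```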

Tango continues with his evaluation of his simple baseline model:

Much to my [Tango’s] surprise, it was not the worst. Indeed, it was one of the best. In some years, it actually was the best.

This had the benefit of what I was after: knocking out all those so-called forecasting systems that were really below replacement level. They had no business calling themselves forecasting systems, and especially trying to sell their inferior product to unsuspecting, and hungry, baseball fans.

What was left were forecasting systems that actually were good.

He summarizes:

You should always (always) have a Naive model. It’s the simplest, cleanest, most intuitive way to explain whether your system is at least treading water. And if it is (that’s a big IF), how much better than Naive is it.

Well put. I’d prefer calling it a “baseline” rather than “naive” model, but I agree with his general point, and I also agree with his implicit point that we don’t make this general point often enough when explaining what we’re doing.

A couple of additional points

The only real place that I’d modify Tango’s advice is to say the following: In sports, as in other applications, there are many goals, and you have to be careful not to tie yourself to just one measure of success. For example, he seems to be talking about predicting performance of individual players or teams, but sometimes we have counterfactual questions, not just straight-up predictions. Also, in practice there can be a fuzzy line between a null/naive/baseline model and a fancy model. For example, Tango talks about using up to three years of data, but what if you have a player with just one year of data? Or a player who only had 50 plate appearances last year? What do you do with minor-league stats? Injuries? Etc. I’m not saying you can’t handle these things, just that decisions need to be made, and there’s no sharp distinction between data-processing decisions and what you might call modeling.

Again, this is not a disagreement with Tango’s point, just an exploration of how it can get complicated when real data and real decisions are involved.

Paul Campos: Should he stick to sports?

“Over the course of my life I have met liberals who used to be conservatives and Catholics who used to be Communists, and even women who used to be considered men. But I have never met a Michigan fan who used to be an Ohio State fan or vice versa. Indeed, the very idea seems in some fundamental way absurd.” – Paul Campos, A Fan’s Life

This book is a mix of sports rants, political rants, hobbyhorses, and rabbit holes. The sports content is mostly about college football, which is not my favorite, but I guess I know enough about it to appreciate the stories. Campos walks the fine line between being a hardcore fan and recognizing the absurdity of it all, reminiscent of some clear-eyed writing about religion, love, and other intense yet inherently ridiculous institutions. Oddly enough, you don’t always see this balanced perspective in arts writing, for example those rock critics who take themselves and their topic all too seriously, or at the other extreme the reviewers who make you wonder why they’re writing about the topic at all. I feel that more needs to be written on sports fandom but this is a start. The book also has some economics and politics.

Hmm . . . a quick google of *sports fandom book* turns up this edited volume which I think I’d hate. . . . OK, I take that back. That book is a collection of 34 essays by different authors, and I’d probably get something valuable out of 15 of them, which, if so, is not a bad ratio. I can’t quite bring myself to spend $200 for it but maybe I’ll check it out from the library when I return.

P.S. Also in the area of overlap between sports, politics, and sociology is Frank Guridy’s The Sports Revolution: How Texas Changed the Culture of American Athletics, which contains approximately zero statistics but is full of interesting stories and perspectives; I think many readers of the blog would find it enjoyable and worth reading.

P.P.S. As a special benefit for those of you who have read this far, here’s a post from 2009, “Sports fans as potential Republicans?” The only data I could conveniently get were from the 1990s; sorry:

[Figure: sport.png, the graph from the 2009 post]

Some interesting discussion there in comments, too. This all seems related to Campos’s book.

Also we came across this intriguing if slightly mysterious graph from Reid Wilson in 2010:

Update on estimates of effects of college football games on election outcomes

Anthony Fowler writes:

As you may recall, Pablo Montagnes and I wrote a paper on college football and elections in 2015 where we looked at additional evidence and concluded that the original Healy, Malhotra, and Mo (2010) result was likely a false positive. You covered this here and here.

Interestingly, the story isn’t completely over. Graham, Huber, Malhotra, and Mo have a forthcoming JOP paper claiming that the evidence is mostly supportive of the original hypothesis. They added some new observations, pooled all the data together, and re-ran some specifications that are very similar to those of the original Healy et al. paper. The results got weaker, but they’re still mostly in the expected direction.

Pablo and I wrote a reply to this paper, available here, which is also forthcoming in the JOP. We ran some simulations showing that their results are in line with what we would expect if the original result was a chance false positive, and their results are much weaker than what we would expect if the original result was a genuine effect of the magnitude reported in the original paper.

They wrote a reply to our reply, which we only learned about recently when it appeared on the JOP site.

We have written a brief reply to their reply. We assume that the JOP won’t be interested in publishing yet another reply, but if you think this is interesting, we would greatly appreciate you covering this topic and sharing our reply.

There is a lot more to discuss. For example, Graham et al. say that they are conducting an independent replication using the principles of open science. But the data and design are very similar to the original paper, so this is neither independent nor a replication. They argue that they pre-registered their analyses, but they had already seen very similar specifications run on very similar data, so it’s not so clear that we should think of these as pre-registered analyses. They appear to have deviated from their pre-analysis plan by failing to report results using only the out-of-sample data (they just show results using the in-sample data and the pooled data, but not the out-of-sample data). They also exercise some degrees of freedom (and deviate from their pre-analysis plan) in deciding what should count as out of sample.

I have three quick comments:

First, much of the above-linked discussion concerns what counts as a preregistered replication. It’s important for people to consider these issues carefully, but they don’t interest me so much, at least not in this setting where, ultimately, the amount of data is not large enough to learn much of anything without some strong theory.

Second, although I’m generally in sympathy with the arguments made by Fowler and Montagnes, I don’t like their framing of the problem in terms of “false positives.” I don’t think the effect of a football game on the outcome of an election is zero. What I do think (until persuaded by strong evidence to the contrary) is that these effects are likely to be small, are highly variable when they’re not small, and won’t show up as large effects in average analyses. In practice, that’s not a lot different than calling these effects “false positives,” but I don’t like going around saying that effects are zero. It’s enough to say that they are not large and predictable, which would be necessary for them to be detectable from the usual statistical analysis.

Third, when reading Fowler and Montagnes’s final points regarding political accountability, I’m reminded of our work on the piranha principle: Once you accept the purportedly large effects of football games, shark attacks, etc., where do you stop? To put it another way, it’s not impossible that college football games have large and consistent effects on election outcomes, but there are serious theoretical problems with such a model of the world, because then you have to either have a theory of what’s so special about college football or else you have a logjam of large effects from all sorts of inputs.
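To get a feel for the kind of simulation Fowler and Montagnes describe above, here's a toy version in R. Every number in it (the original estimate, the standard error, the replication estimate) is invented for illustration; their actual analysis is in the linked reply:

    # Toy check: is a weak replication estimate more consistent with "the original
    # was a chance false positive" or with "the original effect size was real"?
    # All numbers below are invented for illustration.
    set.seed(123)
    n_sims   <- 1e5
    orig_est <- 1.2    # assumed original published estimate
    se_rep   <- 0.4    # assumed standard error of the new, pooled estimate
    obs_rep  <- 0.3    # assumed estimate from the new data

    rep_if_null <- rnorm(n_sims, mean = 0,        sd = se_rep)  # true effect is zero
    rep_if_real <- rnorm(n_sims, mean = orig_est, sd = se_rep)  # true effect as originally reported

    mean(abs(rep_if_null) >= abs(obs_rep))  # ~0.45: an estimate like 0.3 is unsurprising if the effect is zero
    mean(rep_if_real <= obs_rep)            # ~0.01: but it would be a big surprise if the effect were really 1.2

That's the general shape of the argument; the numbers that matter are in the linked papers.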

How to win the Sloan Sports hackathon

Stan developer Daniel Lee writes:

I walked in knowing a few things about the work needed to win hackathons:

– Define a problem.
If you can clearly define a problem, you’ll end up in the top third of the competition. It has to be clear why the problem matters and you have to communicate this effectively.

– Specify a solution.
If you’re able to specify a solution to the problem, you’ll end up in the top 10%. It has to be clear to the judges that this solution solves the problem.

– Implement the solution.
If you’ve gotten this far and you’re now able to actually implement the solution that you’ve outlined, you’ll end up in the top 3. It’s hard to get to this point. We’re talking about understanding the topic well enough to define a problem of interest, having explored enough of the solution space to specify a solution, then applying skills through focused effort to build the solution in a short amount of time. Do that and I’m sure you’ll be a finalist.

– Build interactivity.
If the judges can do something with the solution, specifically evaluate “what if” scenarios, then you’ve gone above and beyond the scope of a hackathon. That should get you a win.

Winning a hackathon takes work and focus. It’s mentally and physically draining to compete in a hackathon. You have to pace yourself well, adjust to different challenges as they come, and have enough time and energy at the end to switch context to present the work.

One additional note: the solution only needs to be a proof of concept and pass a smell test. It’s important to know when to move on.

Positive, negative, or neutral?

We’ve talked in the past about advice being positive, negative, or neutral.

Given that Daniel is giving advice on how to win a competition that has only one winner, you might think I’d call it zero-sum. Actually, though, I’d call it positive-sum, in that the goal of a hackathon is not just to pick a winner, it’s also to get people involved in the field of study. It’s good for a hackathon if its entries are good.

The story

Daniel writes:

I [Daniel] participated in the SSAC22 hackathon. I showed up, found a teammate [Fabrice Mulumba], and won. Here’s a writeup about our project, our strategy for winning, and how we did it.

The Data

All hackathon participants were provided data from the 2020 Stanley Cup Finals. This included:

– Tracking data for 40 players, the puck, and the referees. . . . x, y, z positions with estimates of velocity recorded at ~100 Hz. The data are from chips attached to jerseys and in the puck.

– Play-by-play data. Two separate streams of play-by-play data were included: hand-generated and system-generated. . . .

– Other meta data. Player information, rink information, game time, etc.

Data was provided for each of the 6 games in the series. For a sense of scale: one game has about 1.5M rows of tracking data with 1.5 GB of JSON files across the different types of data.

The Hackathon

There were two divisions for the Hackathon: Student and Open. The competition itself had very little structure. . . . Each team would present to the judges starting at 4 pm and the top teams would present in the finals. . . .

Daniel tells how it came together:

Fabrice and I [Daniel] made a pretty good team. But it almost didn’t happen.

Both Fabrice and I had competed in hackathons before. We first met around 8:30 am, half an hour before the hackathon started. As Fabrice was setting up, I saw that he had on an AfroTech sweatshirt and a Major League Hacking sticker on his laptop. I said hi, asked if he was competing alone, and if he was looking for a teammate. He told me he wanted to compete alone. I was hoping to find a teammate, but had been preparing to compete alone too. While it’s hard to do all the things above alone, it’s actually harder if you have the wrong teammate. We went our separate ways. A few minutes later, we decided to team up.

Something about the team felt right from the start. Maybe I was more comfortable teaming up with one of the few other POC in the room. Maybe there was a familiar cadence and vibe from having parents that immigrated to the US. Maybe it was knowing that the other had been through an intense working session in the past and was voluntarily going through it again. Whatever it was, it worked.

In the few days prior, I had spent a couple hours trying to gain some knowledge about hockey from friends that know the sport. The night before, I found a couple of people that worked for the LA Kings and asked questions about what they thought about and why. I came in thinking we should look at something related to goalie position. Fabrice came in wanting to work on a web app and focus on identifying a process within the game. These ideas melded together and formed the winning project.

For the most part, we worked on separate parts of the problem. We were able to split the work and trust that the other would get their part done. . . .

The Winning Project: Sloan Goalie Card

We focused on a simple question. Does goaltender depth matter?

Having access to x, y, z position of every player meant that we could analyze where the goalie was at the time when shots were taken. Speaking to some hockey people, we found out that this data wasn’t publicly available, so this would be one of the first attempts at this type of analysis.

In the allotted time, we pulled off a quick analysis of goalie depth and built the Sloan Goalie Card web app.
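(To give non-hockey readers a sense of the computation involved, here's a rough sketch in R with toy data. The column names, the rink coordinate, and the nearest-frame matching are my guesses at the general idea, not Daniel and Fabrice's actual code.)

    # Toy stand-ins for the hackathon data (the real files were ~1.5M rows of
    # tracking data per game); x is feet from center ice, recorded at ~100 Hz.
    tracking <- data.frame(
      frame_time = seq(0, 10, by = 0.01),
      x          = 86 + sin(seq(0, 10, by = 0.01))   # one goalie drifting around his crease
    )
    shots <- data.frame(
      shot_time = c(2.4, 7.9),
      goal      = c(0, 1)
    )
    goal_line_x <- 89   # an NHL goal line sits 89 feet from center ice

    # For each shot, grab the goalie's x at the nearest tracking frame and compute
    # his depth: how far out from his own goal line he is standing.
    nearest <- sapply(shots$shot_time,
                      function(t) which.min(abs(tracking$frame_time - t)))
    shots$depth <- goal_line_x - abs(tracking$x[nearest])
    shots

    # From here you can summarize saves and goals as a function of depth, which is
    # roughly the question the Sloan Goalie Card app is built around.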

I don’t know anything about hockey so I can’t comment on the actual project. What I like is Daniel’s general advice.

P.S. I googled *how to win a hackathon*. It’s a popular topic, including posts going back to 2014. Some of the advice seems pretty ridiculous; for example, one of the links promises “Five Easy Steps to Developer Victory”—which makes me wonder what would happen if two competitors tried this advice for the same hackathon. They couldn’t both win, right?

Essentialism!

This article by Albert Burneko doesn’t directly cite the developmental psychology literature on essentialism—indeed, it doesn’t cite any literature at all—but it’s consistent with a modern understanding of children’s thought. As Burneko puts it:

Have you ever met a pre-K child? That is literally all they talk and think about. Sorting things into categories is their whole deal.

I kinda wish they’d stick to sports, though. Maybe some zillionaire could sue them out of existence?

Just show me the data, baseball edition

Andrew’s always enjoining people to include their raw data. Jim Albert, of course, does it right. Here’s a recent post from his always fascinating baseball blog, Exploring Baseball Data with R.

The post “just” plots the raw data and does a bit of exploratory data analysis, concluding that the apparent trends are puzzling. Albert’s blog has it all. The very next post fits a simple Bayesian predictive model to answer the question every baseball fan in NY is asking.

P.S. If you like Albert’s blog, check out his fantastic intro to baseball stats, which only assumes a bit of algebra, yet introduces most of statistics through simulation. It’s always the first book I recommend to anyone who wants a taste of modern statistical thinking and isn’t put off by the subject matter:

  • Jim Albert and Jay Bennett. 2001. Curve Ball. Copernicus.



“Columbia Loses Its No. 2 Spot in the U.S. News Rankings”

Here’s the latest news, from Anemona Hartocollis at the New York Times:

Without fanfare, U.S. News & World Report announced that it had “unranked” Columbia University, which had been in a three-way tie for the No. 2 spot in the 2022 edition of Best Colleges, after being unable to verify the underlying data submitted by the university.

The decision was posted on the U.S. News website a week after Columbia said it was withdrawing from the upcoming 2023 rankings.

The Ivy League university said then that it would not participate in the next rankings because it was investigating accusations by one of its own mathematics professors that the No. 2 ranking was based on inaccurate and misleading data.

So far, so good. Indeed, math professor Michael Thaddeus pointed out many suspicious things about the Columbia data—see here (Arts & Sciences), here (Engineering), and here (more on Engineering)—and the university has yet to seriously question any of his specific claims (the best they could come up with so far was reported as, “The 100 percent figure was rounded up, officials said, and they believed they were allowed some leeway,” which isn’t very encouraging), so at this point it would be pretty hard for me not to believe that Columbia’s ranking was based on inaccurate and misleading data.

So, yeah, Columbia steps back and U.S. News delists us. Fair enough. It’s like what the IOC does if you’ve been doping or the NCAA does if you break one of its regulations. The IOC and NCAA have notorious problems; still, if you’re in the game, you’re supposed to follow the rules.

From the news article:

In its blog post on Thursday, U.S. News said that after learning of the criticism in March, it had asked Columbia to substantiate the data it had reported, including information about the number of instructional full-time and part-time faculty, the number of full-time faculty with the highest degree in their field, the student-faculty ratio, undergraduate class size and education expenditures.

“To date, Columbia has been unable to provide satisfactory responses to the information U.S. News requested,” the post said.

That sounds about right.

But this bothers me . . .

One thing, though. Here’s something from a recent statement issued by Columbia:

“A thorough review cannot be rushed,” the university wrote. “While we are disappointed in U.S. News & World Report’s decision, we consider this a matter of integrity and will take no shortcuts in getting it right.”

The “matter of integrity” thing seems fine—I guess maybe some people will get fired or, more likely, be encouraged to take early retirement or seek jobs elsewhere—but what’s this “While we are disappointed in U.S. News & World Report’s decision” bit?

Why is Columbia “disappointed in U.S. News & World Report’s decision”? What would they want U.S. News to do? What, in Columbia’s view, would be an appropriate action by U.S. News? I’m honestly not sure. Presumably it would be inappropriate for them to keep Columbia at the #2 ranking, as this ranking is based on numbers which have now been revealed to be inaccurate and misleading. And, even more so, it would be inappropriate for Columbia to be moved up to the #1 ranking. So, if U.S. News wasn’t going to delist Columbia, what should they have done? Move the college’s ranking down to #8? #18? #38? Should U.S. News impute the missing values in Columbia’s data? Maybe hire some (non-Columbia) statistician to help on that?

I get that Columbia is disappointed that it turns out they have some employees who were supposed to be putting together accurate numbers on enrollment, etc., and didn’t do that. That’s annoying! But to be disappointed in U.S. News’s decision—that doesn’t make sense at all. What else could U.S. News have possibly done? Keeping Columbia at #2 after learning about all these data problems would be like continuing to label North Korea as a country with “moderate electoral integrity.”

So, anyway, I liked that NYT article, but I wished they’d pushed back and pointed out how ridiculous that “we are disappointed” statement was.

P.S. I very much appreciate the academic freedom by which I can write posts like this, and Thaddeus can report his findings, without fear of retaliation by the university. It’s not like Google, where if you question their numbers, you might get canned. Columbia is a great place, and the existence of real problems here should not lead us to think that everything is bad.

High-intensity exercise, some new news


This post is by Phil Price, not Andrew.

Several months ago I noticed something interesting (to me!) about my heart rate, and I thought about blogging about it…but I didn’t feel like it would be interesting (to you!), so I’ve been hesitant. But then the NYT published something that is kinda related and I thought OK, what the hell, maybe it’s time for an update about this stuff. So here I am.

The story starts way back in 2010, when I wrote a blog article called “Exercise and Weight Loss: Shouldn’t Somebody See if there’s a Relationship?” In that article I pointed out that there had been many claims in the medical/physiology literature that exercise doesn’t lead to weight loss in most people, but that those studies seemed to be overwhelmingly looking at low- and medium-intensity exercise, really not much (or at all) above warmup intensity. When I wrote that article I had just lost about twelve pounds in twelve weeks after starting high-intensity exercise again following a gap of years, and I was making the point that before claiming that exercise doesn’t lead to weight loss, maybe someone should test whether the claim is actually true, rather than assuming that just because low-intensity exercise doesn’t lead to weight loss, no other type of exercise would either.

Eight years later, four years ago, I wrote a follow-up post along the same lines. I had gained some weight when an injury stopped me from getting exercise. As I wrote at the time, “Already this experience would seem to contradict the suggestion that exercise doesn’t control weight: if I wasn’t gaining weight due to lack of exercise, why was I gaining it?” And then I resumed exercise, in particular exercise that had some maximum short-term efforts as I tried to get in shape for a bike trip in the Alps, and I quickly lost the weight again. Even though I wasn’t conducting a formal experiment, this is still an example of what one can learn through “self-experimentation,” which has a rich history in medical research.

Well, it’s not like I’ve kept up with research on this in the meantime, but I did just see a New York Times article called “Why Does a Hard Workout Make You Less Hungry” that summarizes a study published in Nature that implicates a newly discovered “molecule — a mix of lactate and the amino acid phenylalanine — [that] was created apparently in response to the high levels of lactate released during exercise. The scientists named it lac-phe.” As described in the article, the evidence seems pretty convincing that high-intensity exercise helps mice lose weight or keep it off, although the evidence is a lot weaker for humans. That said, the humans they tested do generate the same molecule, and a lot more of it after high-intensity exercise than after lower-intensity exercise. So maybe lac-phe does help suppress appetite in humans too.

As for the interesting-to-me (but not to you!) thing that I noticed about my heart rate, that’s only tangentially related but here’s the story anyway. For most of the past dozen years a friend and I have done bike trips in the Alps, Pyrenees, or Dolomites. Not wanting a climb up Mont Ventoux or Stelvio to turn into a death march due to under-training, I always train hard for a few months in the spring, before the trip. That training includes some high-intensity intervals, in which I go all-out for twenty or thirty seconds, repeatedly within a few minutes, and my heart rate gets to within a few beats per minute of my maximum. While I’m doing this training I lose the several pounds I gained during the winter.

Unfortunately, as you may recall, we have had a pandemic since early 2020. My friend and I did not do bike trips. With nothing to train for, I didn’t do my high-intensity intervals. I still did plenty of bike riding, but didn’t get my heart rate up to its maximum. I gained a few pounds, not a big deal. But a few months ago I decided to get back in shape, thinking I might try to do a big ride in the fall if not the summer. My first high-intensity interval, I couldn’t get to within 8 beats per minute of my usual standard, which had been nearly unchanged over the previous 12 years! Prior to 2020, I wouldn’t give myself credit for an interval if my heart rate hadn’t hit at least 180 bpm; now I maxed out at 172.

My first thought: blame the equipment. Maybe my heart rate monitor isn’t working right, maybe a software update has changed it to average over a longer time interval, maybe something else is wrong. But trying two monitors, and checking against my self-timed pulse rate, I confirmed that it was working correctly: I really was maxing out at 172 instead of 180. Holy cow. I decided to discuss this with my doctor the next time I have a physical, but in the meantime I kept doing occasional maximum-intensity intervals…and my max heart rate started creeping up. A few days ago I hit 178, so it’s up about 6 bpm in the past four months. And I’ve lost those few extra pounds and now I’m pretty much back to my regular weight for my bike trips.

The whole experience has (1) reinforced my already-strong belief that high-intensity exercise makes me lose weight if I’m carrying a few extra pounds, and (2) made me question the conventional wisdom that everyone’s max heart rate decreases with age: maybe if you keep exercising at or very near your maximum heart rate, your maximum heart rate doesn’t decrease, or at least not much? (Of course at some point your maximum heart rate goes to 0 bpm. Whaddyagonnado.)

So, to summarize: (1) Finally someone is taking seriously the possibility that high-intensity exercise might lead to weight loss, and even looking for a mechanism, and (2) when I stopped high-intensity exercise for a couple years, my maximum heart rate dropped…a lot.

Sorry those are not more closely related, but I was already thinking about item 2 when I encountered item 1, so they seem connected to me.