Unsustainable research on corporate sustainability

In a paper to be published in the Journal of Financial Reporting, Luca Berchicci and Andy King shoot down an earlier article claiming that corporate sustainability reliably predicted stock returns. It turns out that this earlier research had lots of problems.

King writes to me:

Getting to the point of publication was an odyssey. At two other journals, we were told that we should not replicate and test previous work but instead fish for even better results and then theorize about those:

“I encourage the authors to consider using the estimates from figure 2 as the dependent variables analyzing which model choices help a researcher to more robustly understand the relation between CSR measures and stock returns. This will also allow the authors to build theory in the paper, which is currently completely absent…”

“In fact, there are some combinations of proxies/ model specifications that are to the left of Khan et al.’s estimate. I am curious as to what proxies/ combinations enhance the results?”

Also, the original authors seem to have attempted to confuse the issues we raise and salvage the standing of their paper (see attached: Understanding the Business Relevance of ESG Issues). We have written a rebuttal (also attached).

Here’s the relevant part of the response, by George Serafeim and Aaron Yoon:

Models estimated in Berchicci and King (2021) suggest that making different variable construction, sample period, and control variable choices can yield different results with regards to the relation between ESG scores and business performance. . . . However, not all models are created equal . . . For example, Khan, Serafeim and Yoon (2016) use a dichotomous instead of a continuous measure because of the weaknesses of ESG data and the crudeness of the KLD data, which is a series of binary variables. Creating a dichotomous variable (i.e., top quintile for example) could be well suited when trying to identify firms on a specific characteristic and the metric identifying that characteristic is likely to be noisy. A continuous measure assumes that for the whole sample researchers can be confident in the distance that each firm exhibits from each other. Therefore, the use of continuous measure is likely to lead to significantly weaker results, as in Berchicci and King (2021) . . .

Noooooooo! Dichotomizing your variable almost always has bad consequences for statistical efficiency. You might want to dichotomize to improve interpretability, but you then should be aware of the loss of efficiency of your estimates, and you should consider approaches to mitigate this loss.
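
To illustrate the general point (this has nothing to do with the particular data in the papers under discussion), here is a minimal simulation sketch in R. Everything in it is made up for illustration: the data-generating process, the effect size, and the sample size. It compares the power to detect an association using a continuous predictor versus a top-quintile dummy built from the same variable.

# Hypothetical simulation: power to detect a linear association using the
# continuous predictor vs. a top-quintile dummy built from it.
set.seed(123)
n_sims <- 1000
n <- 500
z_cont <- z_dich <- rep(NA, n_sims)
for (s in 1:n_sims) {
  x <- rnorm(n)                              # continuous "score"
  y <- 0.2 * x + rnorm(n)                    # outcome with a modest linear relation
  x_top <- as.numeric(x > quantile(x, 0.8))  # dichotomized: top quintile vs. the rest
  z_cont[s] <- coef(summary(lm(y ~ x)))["x", "t value"]
  z_dich[s] <- coef(summary(lm(y ~ x_top)))["x_top", "t value"]
}
mean(abs(z_cont) > 2)  # share of simulations detecting the association, continuous
mean(abs(z_dich) > 2)  # same, after dichotomizing: typically lower

Nothing deep here; it’s just the familiar point that throwing away within-group variation in the predictor costs you precision.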

Berchicci and King’s rebuttal is crisp:

The issue debated in Khan, Serafeim, and Yoon (2016) and Berchicci and King (2022) is whether guidance on materiality from the Sustainable Accounting Standards Board (SASB) can be used to select ESG measures that reliably predict stock returns. Khan, Serafeim, and Yoon (2016) (hereafter “KSY”) estimate that had investors possessed SASB materiality data, they could have selected stock portfolios that delivered vastly higher returns, an additional 300 to 600 basis points per year for a period of 20 years. Berchicci and King (2022) (hereafter “BK”) contend that there is no evidence that SASB guidance could have provided a reliable advantage and contend that KSY’s findings are a statistical artifact.

In their defense of KSY, Yoon and Serafeim (2022) ignore the evidence provided in Berchicci and King and leave its main points unrefuted. Rather than make their case directly, they try to buttress their claim with a selective review of research on materiality. Yet a closer look at this literature reveals that little of it is relevant to the debate. Of the 28 articles cited, only two evaluate the connection between SASB materiality guidance and stock price, and both are self-citations.

Berchicci and King continue:

Indeed, in other forums, Serafeim has made a contrasting argument, contending that KSY is a uniquely important study – a breakthrough that shifted decades of understanding (Porter, Serafeim, and Kramer, 2016). Surely, such an important study should be evaluated on its own merits.

That’s funny. It reminds me of the general point that in research we want our results simultaneously to be surprising and to make perfect sense. In this case, this put Yoon and Serafeim in a bind.

And more:

In BK, we evaluate whether KSY’s results are a fair representation of the true link between material sustainability and stock return. We evaluate over 400 ways that the relationship could be analyzed and reveal that 98% of the models result in estimates smaller than the one reported by KSY and that the median estimate was close to zero. We then show that KSY’s estimate is not robust to simple changes in their model . . . Next, we evaluate the cause of KSY’s strong estimate and uncover evidence that it is a statistical artifact. . . . We then show that their measure also lacks face validity because it judges as materially sustainable firms that were (and continue to be) leading emitters of toxic pollution and greenhouse gasses. In some years, this included a large majority of the firms in extractive industries (e.g. oil, coal, cement, etc.). . . . KSY do not address any of these criticisms and instead rely on a belief that their measure and model are the only ones that should be considered. . . .

Where do they sit on the ladder?

It’s good to see this criticism out there, and as usual it’s frustrating to see such a stubborn response by the original authors. A few years ago we presented a ladder of responses to criticism, from the most responsible to the most destructive:

1. Look into the issue and, if you find there really was an error, fix it publicly and thank the person who told you about it.

2. Look into the issue and, if you find there really was an error, quietly fix it without acknowledging you’ve ever made a mistake.

3. Look into the issue and, if you find there really was an error, don’t ever acknowledge or fix it, but be careful to avoid this error in your future work.

4. Avoid looking into the question, ignore the possible error, act as if it had never happened, and keep making the same mistake over and over.

5. If forced to acknowledge the potential error, actively minimize its importance, perhaps throwing in an “everybody does it” defense.

6. Attempt to patch the error by misrepresenting what you’ve written, introducing additional errors in an attempt to protect your original claim.

7. Attack the messenger: attempt to smear the people who pointed out the error in your work, lie about them, and enlist your friends in the attack.

In this case, the authors of the original article are stuck somewhere around rung 4. Not the worst possible reaction—they’ve avoided attacking the messenger, and they don’t seem to have introduced any new errors—but they haven’t reached the all-important step of recognizing their mistake. Not good for them going forward. How can you make serious research progress if you can’t learn from what you’ve done wrong in the past? You’re building a house on a foundation of sand.

P.S. According to Google, the original article, “Corporate Sustainability: First Evidence on Materiality,” has been cited 861 times. How is it that such a flawed paper has so many citations? Part of this might be the instant credibility conveyed by the Harvard affiliations of the authors, and part of this might be the doing-well-by-doing-good happy-talk finding that “investments in sustainability issues are shareholder-value enhancing.” Kinda like that fishy claim about unionization and stock prices or the claims of huge economic benefits from early childhood stimulation. Forking paths allow you to get the message you want from the data, and this is a message that many people want to hear.

International Workshop on Statistical Modelling – IWSM 2022 in Trieste (Italy)

I am glad to announce that the next International Workshop on Statistical Modelling (IWSM), the major activity of the Statistical Modelling Society, will take place in Trieste, Italy, from July 18 to July 22, 2022, organized by the University of Trieste.

The conference will be preceded by the short course “Statistical Modelling of Football Data” by Ioannis Ntzoufras (AUEB) and Leonardo Egidi (Univ. of Trieste) on July 17. The course is based on Stan and is open to people with a minimal statistical/mathematical background.

Interested participants may register, choosing among the following options:

  • whole conference
  • conference + short course
  • short course

Information about registration and fees can be found here. The call-for-papers deadline for submitting a 4-page abstract is April 4 (likely to be extended). For any further information, visit the IWSM 2022 website.

Stay tuned, and share this event with anyone who may be interested in the conference.

footBayes: an R package for football (soccer) modeling using Stan

footBayes 0.1.0 is on CRAN! The goal of the package is to provide a complete workflow to:

– fit the most well-known football (soccer) models (double Poisson, bivariate Poisson, Skellam, Student’s t) through maximum likelihood and Bayesian HMC methods using Stan;

– visualize the teams’ abilities, posterior predictive (pp) checks, and the rank-league reconstruction;

– predict out-of-sample matches via the posterior predictive distribution.

Here’s a super quick use of the package for the Italian Serie A. For details, check out the vignette and enjoy!

P.S. The vignette has been compiled without plot rendering to save time during the CRAN submission.

library(footBayes)
require(engsoccerdata)
require(dplyr)

# dataset for Italian serie A

italy <- as_tibble(italy)
italy_2000_2002 <- italy %>%
   dplyr::select(Season, home, visitor, hgoal, vgoal) %>%
   dplyr::filter(Season == "2000" | Season == "2001" | Season == "2002")

fit1 <- stan_foot(data = italy_2000_2002,
                  model="double_pois",
                  predict = 36) # double poisson fit (predict last 4 match-days)
foot_abilities(fit1, italy_2000_2002) # plot teams abilities
pp_foot(italy_2000_2002, fit1)   # pp checks
foot_rank(italy_2000_2002, fit1) # rank league reconstruction
foot_prob(fit1, italy_2000_2002) # out-of-sample posterior pred. probabilities


Hierarchical model golf putting success!

The other day we discussed my struggles fitting models to the golf putting data.

My earlier modeling was mostly a success—it’s a popular example, it’s a Stan case study, and it’s in our workflow article. We had an initial dataset that we can fit with a simple one-parameter geometry-based model:

Then we got new data where the first model doesn’t fit, but we can fix that by following Mark Broadie’s suggestion and adding just one more parameter to capture a little bit more of the geometry of the problem:

That was all good but we had convergence problems fitting this model in Stan, and the only way I could get it to fit smoothly was to add a fudge factor, an independent error term at each distance. Including this extra error did not bother me—after all, we would not expect a simple model to fit real data perfectly—but I was annoyed that, to add this error term, I needed to approximate the binomial likelihood with a normal distribution. Such an approximation would give problems going forward if we wanted to model the probability of success given players, golf courses, and weather conditions, in which case we’d have lots of cells with just 1 or 2 observations so the normal approximation wouldn’t work.

So I tried a direct approach, adding an error term to the modeled probability of success—but that couldn’t be done on the probability scale because then the probability could go below 0 or above 1, so I tried an additive error on the logistic scale; in Stan:

p = inv_logit(logit(p_angle .* p_distance) + sigma_eta*eta);

Here, p_angle .* p_distance is the predicted probability of success (the probability of getting both the shot angle and the shot distance within tolerance), and sigma_eta*eta is the vector of errors (with eta given a normal(0,1) prior and sigma_eta representing the scale of the errors). The logistic and inverse logistic transformations keep the probabilities bounded between 0 and 1.

But it didn’t work! Convergence problems again.

And that’s where we were a few days ago. Stuck! Stuck stuck stuck.

There were various suggestions in comments, but none were directly helpful, until this came from Kj:

The problem seems rooted in the model needing the shortest putts probability to be very close to 1 in order to fit the rest of the data. Before the normal hack, the (poorly sampled) model estimates the probability of the shortest putts to be 10^9 in logit space.

The normal hack applies to probability space, and there the error is tiny, so it works fine. But if you look at the error in logit space, the fit remains really bad.

And I was like, Aha! Here’s a solution: a three-parameter model that scales all the probabilities down from 1:

data {
  int J;
  array[J] int n;
  vector[J] x;
  array[J] int y;
  real r;
  real R;
  real overshot;
  real distance_tolerance;
}
transformed data {
  vector[J] threshold_angle = asin((R-r) ./ x);
}
parameters {
  real<lower=0> sigma_angle;
  real<lower=0> sigma_distance;
  real<lower=0, upper=1> epsilon;
}
model {
  vector[J] p_angle = 2*Phi(threshold_angle / sigma_angle) - 1;
  vector[J] p_distance = Phi((distance_tolerance - overshot) ./ ((x + overshot)*sigma_distance)) -
               Phi((- overshot) ./ ((x + overshot)*sigma_distance));
  vector[J] p = p_angle .* p_distance * (1 - epsilon);
  y ~ binomial(n, p);
  [sigma_angle, sigma_distance] ~ normal(0, 1);
}

The key is to make it a multiplier that has to be less than 1. This eliminates the problem with the boundary and the need for the logit.
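
For reference, here’s roughly how one might compile and run this program from R with cmdstanr. This is a sketch, not the exact script behind the fit shown below; the file name and the golf_data list are placeholders for the data declared in the data block above.

# Sketch: compile and sample the three-parameter model with cmdstanr.
# "golf_multiplier.stan" is a placeholder file name; golf_data is a list with
# J, n, x, y, r, R, overshot, and distance_tolerance.
library(cmdstanr)
golf_model <- cmdstan_model("golf_multiplier.stan")
fit <- golf_model$sample(data = golf_data, chains = 4, parallel_chains = 4)
fit$summary(c("sigma_angle", "sigma_distance", "epsilon"))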

Success! I’m so happy. Here’s the fit:

       variable        mean      median    sd   mad          q5         q95  rhat ess_bulk ess_tail
 lp__           -363841.724 -363841.000 1.311 1.483 -363844.000 -363840.000 1.000     1636       NA
 sigma_angle          0.018       0.018 0.000 0.000       0.018       0.018 1.002     1905     2165
 sigma_distance       0.080       0.080 0.001 0.001       0.079       0.081 1.002     1659     1859
 epsilon              0.001       0.001 0.000 0.000       0.001       0.001 1.002     1545     1493

I’m not sure what’s going on with the tail effective sample size; we’ll have to look into this. I suspect it’s caused by some rounding error. Doesn’t really matter, though.

The above model fits the data in the sense of going through the data points, but it’s still just a three-parameter model so to really do things right we might still want to add an error term. We can do this, using the same principle of making the errors multiplicative and constraining them to fall between 0 and 1:

data {
  int J;
  array[J] int n;
  vector[J] x;
  array[J] int y;
  real r;
  real R;
  real overshot;
  real distance_tolerance;
}
transformed data {
  vector[J] threshold_angle = asin((R-r) ./ x);
}
parameters {
  real<lower=0> sigma_angle;
  real<lower=0> sigma_distance;
  real<lower=0> sigma_epsilon;
  vector<lower=0, upper=1>[J] epsilon;
}
model {
  vector[J] p_angle = 2*Phi(threshold_angle / sigma_angle) - 1;
  vector[J] p_distance = Phi((distance_tolerance - overshot) ./ ((x + overshot)*sigma_distance)) -
               Phi((- overshot) ./ ((x + overshot)*sigma_distance));
  vector[J] p = p_angle .* p_distance .* (1 - epsilon);
  epsilon ~ exponential(1/sigma_epsilon);
  y ~ binomial(n, p);
  [sigma_angle, sigma_distance] ~ normal(0, 1);
}

This is a little bit hacky because we’re using the exponential density for the epsilons and then constraining them to be no more than 1, but in practice it will be fine. The scale parameter sigma_epsilon keeps the errors under control. (I tried the model with epsilon ~ normal(0, sigma_epsilon) instead and it gave essentially the same results.) We can also augment the model so it computes residuals:

data {
  int J;
  array[J] int n;
  vector[J] x;
  array[J] int y;
  real r;
  real R;
  real overshot;
  real distance_tolerance;
}
transformed data {
  vector[J] threshold_angle = asin((R-r) ./ x);
  vector[J] raw_proportion = to_vector(y) ./ to_vector(n);
}
parameters {
  real<lower=0> sigma_angle;
  real<lower=0> sigma_distance;
  real<lower=0> sigma_epsilon;
  vector<lower=0, upper=1>[J] epsilon;
}
transformed parameters {
  vector[J] p_angle = 2*Phi(threshold_angle / sigma_angle) - 1;
  vector[J] p_distance = Phi((distance_tolerance - overshot) ./ ((x + overshot)*sigma_distance)) -
               Phi((- overshot) ./ ((x + overshot)*sigma_distance));
  vector[J] p = p_angle .* p_distance .* (1 - epsilon);
}
model {
  epsilon ~ exponential(1/sigma_epsilon);
  y ~ binomial(n, p);
  [sigma_angle, sigma_distance] ~ normal(0, 1);
}
generated quantities {
  vector[J] residual = raw_proportion - p_angle .* p_distance;
}

We needed to move some things into the transformed parameters block so they’d be accessible in the generated quantities calculation. Also, we compute residual relative to p_angle .* p_distance, not relative to p, because the whole point is to look at the fit of the two-parameter model. The error term epsilon is not part of the prediction, in this sense, even though it would appear to be so in the usual framework of the Bayesian model, for example when computing elpd etc.

Anyway, here’s a plot of the fitted model and the posterior mean of its residuals:

This looks a little bit different from our residual plot before:

Our new plot looks a little bit worse, actually! But I guess it’s a price I’m willing to pay to have a model that is more mathematically coherent.

Hmmm, this gets me wondering . . . What are the residuals from our three-parameter model above, the one where p = p_angle .* p_distance * (1 - epsilon), so that there’s a fixed downward multiplier? Let’s take a look:

Hey! This looks fine. So I’m inclined to just stop here for now and not bother with that model with the separate epsilon for each distance.

As has been discussed in the comment thread, there are lots of ways this model could be improved, but now we have a simple three-parameter model that fits the data without that normal-approximation hack, so this is what I’d start with going forward, then allowing these parameters to vary by golfer, hole, and weather condition.

And here are the files.

The important “It exists, and it’s not going away” argument, as it applies to economics, political science, sabermetrics, and many aspects of statistics

Many fields of research can be justified based on the argument that their object of study exists, and that denying its existence won’t make it go away. For example:

Economics: Denying the existence of economics (for example, by trying to set up a command economy) doesn’t resolve the fundamental problems of economics. Issues such as scarcity, opportunity costs, etc., will just arise in other forms; they can’t be legislated away.

Political science: There is no such thing as a political vacuum. Conflicts about power, resource allocation, etc., still need to be resolved, one way or another, even in the absence of a formal government.

Sabermetrics: People make judgments about baseball statistics. As Bill James put it, the alternative to “good statistics” is not “no statistics,” it’s “bad statistics.”

Causal inference: Everybody cares ultimately about causal questions. As Jennifer Hill says, even if you claim to be just studying association or descriptive statistics, really this is motivated by underlying causal questions.

Bayesian inference: Every analysis uses prior information; the only question is whether you want to acknowledge it explicitly.

Defaults: We all have defaults, so let’s try to set them well. Yes, it’s true that no default is perfect, or close to perfect—any default has its zone of effectiveness, outside of which it fails—but defaults are inevitable, so the only way forward is to choose good defaults and then understand where they work and don’t work.

Workflow: Theoretical statistics is the theory of applied statistics. In real life, researchers learn from a dataset by fitting lots of models, including lots of mistakes. Let’s recognize this is what we do and design our procedures accordingly.

Much, perhaps most, of statistical practice is tacit. We make lots of decisions without thinking about them. Let’s study statistics so we can do it better: it exists, and it’s not going away.

Hierarchical model golf putting struggle

Start by reading the golf case study; it’s also in section 10 of the workflow paper.

The final version of the model has a hack; I went in to try to clean it up, and now I’m having problems fitting what should be the cleaner model.

I’m not sure what’s wrong: it could be a problem with my code, it could be a conceptual problem with my model, or it could be what we call “bad geometry,” leading to computational problems that maybe could be fixed with a reparameterization.

Right now I’m in the middle of things, and I thought I’d share that with you. Usually we present completed projects or vague ideas; this time you’ll get to see what it looks like when we’re partway through working things out.

What we did before

Here’s the story. We have data on success rates of golf putts as a function of distance from the hole, from a bunch of pro tournaments. There are lots of ways these data could be analyzed: you could estimate the abilities of individual players, improvement over time, the difficulties of different courses, the effects of bad weather, etc. Here we keep it simple: we aggregate all the data together to estimate Pr(success | distance from hole), fitting a two-parameter curve based on a simple mathematical model from Mark Broadie in which the golfer’s challenges are to get the angle and distance correct, and there’s error in both: the two parameters correspond to the standard deviations of the errors in angle and relative distance.

The model fits the data pretty well but not perfectly:

Also, annoyingly, the chains do not mix well when the model is fit in Stan. You don’t really see it in the above graph because the poor convergence is all happening in a close neighborhood of the fitted model. The convergence problem is thus not a major practical concern here, but it’s still annoying and it gives us concerns if we were to apply the model going forward. And it’s just a simple two-parameter model? What’s going wrong?

Here are the data for the first few bins of distance (the data came to us in bins; we don’t have the distance and success for every shot):

Sample sizes are huge in the initial bins, hence the binomial model tries to fit these points nearly exactly. This in turn makes the model difficult to fit: the constraint for the first few data points essentially ties down the parameters and makes it difficult to move effectively through the posterior distribution.

So I decided I needed to add an error term to grease the wheels of commerce. Here’s what we did:

And the happy result:

Greasing the wheels of commerce—it really worked!

What we’re trying to do

I haven’t yet described where we’re stuck. So far all we did was take a model that wasn’t quite working and add an independent error term to capture modeling error. The independence assumption didn’t quite make sense, but the extra error term did the job in allowing the Hamiltonian Monte Carlo algorithm within Stan to move effectively through the two-dimensional parameter space of interest.

But it was still a hack. What bothered me was not the independent error term (although, yes, that could be improved) but, rather, that I needed to resort to the normal approximation. What I really wanted to do was to keep the binomial model and then add the error on the logistic scale or something like that, something to keep the probabilities bounded between 0 and 1. So I put this in the Stan model:

data {
  int J;
  array[J] int n;
  vector[J] x;
  array[J] int y;
  real r;
  real R;
  real overshot;
  real distance_tolerance;
}
transformed data {
  vector[J] threshold_angle = asin((R-r) ./ x);
}
parameters {
  real<lower=0> sigma_angle;
  real<lower=0> sigma_distance;
  real<lower=0> sigma_eta;
  vector[J] eta;
}
model {
  vector[J] p_angle = 2*Phi(threshold_angle / sigma_angle) - 1;
  vector[J] p_distance = Phi((distance_tolerance - overshot) ./
    ((x + overshot)*sigma_distance)) -
    Phi((- overshot) ./ ((x + overshot)*sigma_distance));
  vector[J] p = inv_logit(logit(p_angle .* p_distance) + sigma_eta*eta);
  y ~ binomial(n, p);
  eta ~ normal(0, 1);
  [sigma_angle, sigma_distance, sigma_eta] ~ normal(0, 1);
}

Let me explain. Most of the model is what we had before: it’s a calculation, given the hyperparameters, of the probability that your shot’s angular error is small enough and that its distance error is small enough that the ball goes in the hole, according to our simple geometric model. We’re assuming the two errors are statistically independent; hence we were using p = p_angle .* p_distance.

In this new version of the model we added the error term, sigma_eta*eta, and to keep everything in the unit interval, we added it to the probability on the logit scale.

The bad news is that when we fit this new model to our data, it doesn’t work. The console fills up with warnings and the chains don’t mix.

Where we are now

We were able to fit the golf data with a two-parameter model plus an error term, but we needed to use the hack of the normal approximation to the binomial distribution. Sticking the error term on the logistic scale is a hack too, but less of a hack . . . unfortunately it’s giving me computational problems! It could just be something simple that I’m missing in my code, or there could be something deeper going on. I guess the next step in debugging is to see if the model fits ok to data simulated from the model. What to do next depends on what happens with the fake-data check.
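
For concreteness, here’s a minimal sketch of that fake-data check in R: simulate binomial data from the two-parameter geometric model at assumed parameter values, then feed the simulated data to the Stan program and see whether the posterior recovers those values. The distances, sample sizes, and assumed parameter values below are placeholders for illustration, not the ones from the real data.

# Fake-data check (sketch): simulate from the two-parameter geometric model.
# All the values below are made up for illustration.
set.seed(1)
J <- 20
x <- seq(2, 21, length.out = J)   # putt distances in feet
n <- rep(1e4, J)                  # attempts per distance
r <- (1.68 / 2) / 12              # ball radius in feet (1.68-inch diameter)
R <- (4.25 / 2) / 12              # hole radius in feet (4.25-inch diameter)
overshot <- 1
distance_tolerance <- 3
sigma_angle_true <- 0.02
sigma_distance_true <- 0.08

threshold_angle <- asin((R - r) / x)
p_angle <- 2 * pnorm(threshold_angle / sigma_angle_true) - 1
p_distance <- pnorm((distance_tolerance - overshot) / ((x + overshot) * sigma_distance_true)) -
  pnorm((-overshot) / ((x + overshot) * sigma_distance_true))
y <- rbinom(J, n, p_angle * p_distance)

# Then pass list(J = J, n = n, x = x, y = y, r = r, R = R, overshot = overshot,
# distance_tolerance = distance_tolerance) to the Stan program and check whether
# the posterior covers sigma_angle_true and sigma_distance_true.

If the model can’t even recover its own parameters from clean simulated data, the problem is computation or code; if it can, the trouble is more likely coming from the model not matching the real data.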

Again, the reason for this post is to give a peek behind the curtain and give a sense of what it feels like to be in the middle of things.

If you want to try any of this out, here’s the data and code.

P.S. Success! See here.

Gambling is fun but it can ruin addicts

As an anti-Caesars dude from way back, I appreciated this op-ed from Ross Douthat:

When future historians ponder the forces that unraveled the American social fabric between the 1960s and the 2020s, I hope they spare some time for one besetting vice in particular: our fatal impulse toward consistency.

This is a good weekend for thinking about that impulse, because Super Bowl Sunday is capping off a transition in big-time sports that has made the symbiosis between professional athletics and professional gambling all but complete. The cascading, state-after-state legalization of sports betting, the ubiquitous ads for online gambling in the football playoffs, the billion dollars that the National Football League hopes to soon be making annually from its deals with sports betting companies — everywhere you look, the thin wall separating the games from the gambling industry is being torn away.

This transformation will separate many millions of non-wealthy Americans from their money, very often harmlessly but in some cases disastrously, with a lot of sustainable-or-are-they gambling addictions falling somewhere in between. . . . once we decided that some forms of gambling should be legally available, in some places, with some people profiting, it became inevitable that restrictions would eventually crumble on a much larger scale. The multi-generational path from Las Vegas and Atlantic City, to Native American casinos, to today’s ubiquitous online gambling looks like one continuous process, with no natural stopping place along the way.

But the trouble is that societal health often depends on law and custom not being perfectly consistent, not taking every permission to its logical conclusion.

In the case of gambling, some limited permission was always necessary: Betting will always be with us, it’s a harmless vice for many people, if you over-police it you’ll end up with an array of injustices.

But the easier it is to gamble, the more unhappy outcomes you’ll get. The more money in the industry, the stronger the incentives to come up with new ways to hook people and then bleed and ruin them. . . . So what you want, then, is for society to be able to say this far and no farther, even if the limiting principle is somewhat arbitrary . . . encouraging Americans to treat the gambling experience as a holiday from the everyday, not seriously wicked but still a little bit shameful or indulgent — which is why it stays under the table, or in Vegas. . . .

Speaking just as a citizen, not as a policy analyst, this makes sense to me. Betting on the Super Bowl is fun! But office pools should be enough, no need for immersive internet betting experiences.

There was just one thing about Douthat’s article that bothered me, and that’s where he writes:

Part of what we’re witnessing from #MeToo-era feminism, for instance, is a backlash against the ruthless logic of an unregulated sexual marketplace, and a quest for some organic form of social regulation, some new set of imperfect-but-still-useful scruples and taboos.

I don’t get that at all. It’s my impression that the me-too movement is all about old-school sexual harassment: rich and powerful men engaging in sexual harassment. “Ruthless” this may be, but I don’t see how it’s the “logic of an unregulated sexual marketplace.” Douthat seems to be saying that me-too is a backlash to something new, but it seems to me that it’s a reaction to something old. What’s new is not sexual harassment; what’s new is that it’s harder to get away with it. Anyway, I agree with Douthat on the general point that rules can be inconsistent and still be useful, so I guess I can just set aside that particular example and focus on the betting story.

As a statistician, I’ve thought a lot about betting. Gambling is closely related to uncertainty, and betting can be fun, as well as being a way to fix ideas and even “put your money where your mouth is.” On the other hand, so much of organized gambling is about ripping people off. A little bit of people being ripped off is OK, I guess, but too much of it is . . . too much. One reason I’m wary of attempts to make betting into a foundational principle of statistics or social science is that it’s tied so closely to successful efforts to con people.

Thinking Bayesianly about the being-behind-at-halftime effect in basketball

Steve Heston reminded me of the claim from business school professors Devin Pope and Jonah Berger, published several years ago, that basketball teams do better when they’re behind by one point at halftime. I discussed this a couple times (see here and here).
Here’s the graph from the original paper:


You can see the usual problem with regression discontinuity analysis here: the big jump at the discontinuity can be seen as an artifact of the noisily-estimated and implausible baseline curve.

The article was published in Management Science in 2011, and, according to Google, it has 233 citations—that’s a lot! From the abstract:

Analysis of more than 18,000 professional basketball games illustrates that being slightly behind at halftime leads to a discontinuous increase in winning percentage. Teams behind by a point at halftime, for example, actually win more often than teams ahead by one, or approximately six percentage points more often than expected. This psychological effect is roughly half the size of the proverbial home-team advantage. Analysis of more than 45,000 collegiate basketball games finds consistent, though smaller, results. . . .

Here are their results:

These look much better than the original graph above, and full credit to Pope and Berger for improving their analysis before publication. The coefficient estimates are 0.058 (with standard error 0.015) for the pros and 0.025 (with standard error 0.010) for the college games. They did their analysis using all NBA games between the 1993/1994 season and March 1, 2009, and all NCAA games between the 1999/2000 season and March 22, 2009, restricting to games that were within 10 points of being tied at halftime.

The statement in the abstract of the paper, “Teams behind by a point at halftime, for example, actually win more often than teams ahead by one . . .” is misleading. If you look at the graph, you can see the probabilities are essentially equal, so they’re leading with a chance pattern in the data. Also, they don’t mention it in the abstract, but in the NCAA games, teams behind by a point at halftime actually win less often than teams ahead by one. The effect being smaller in college than the pros seems surprising, as I’d think that less experienced players would be more subject to psychological factors. But, who knows, and, in any case, the difference between those two estimates is explainable by noise.

They found an effect. Do I “believe it”? I don’t know.

On one hand, yeah, you see something there in the data, and the second and third fitted curves above look reasonable—nothing like those regression discontinuity disasters that arise from time to time. And the idea that being behind at halftime could be a benefit—that’s not ridiculous to me. Players and coaches do have to decide how hard to play during the second half, and it doesn’t seem implausible to think that halftime strategy decisions could be slightly discontinuous with respect to being in the lead or being behind.

On the other hand, there’s the usual story of forking paths, and also, in this case, a potential bias arising from the functional form of the fitted model: they fit a logistic curve, but the true underlying curve won’t be exactly logistic. I’m not completely sure, but it’s worth raising as a concern that this model isn’t quite fitting the effect of being in the lead at halftime vs. being behind at halftime; it’s fitting this with respect to a particular parametric curve, and I’m thinking that effects are small enough that misspecification of the curve could induce systematic error in the estimate. I guess this particular question could be addressed by fake-data simulation.
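
To make that last point concrete, here’s a minimal sketch of such a fake-data check. Everything in it is made up for illustration: the smooth “true” curve, the distribution of halftime margins, and the fitted specification, which is only a stand-in for whatever model Berger and Pope actually used. The idea is to simulate wins from a smooth curve with no jump at zero, fit a parametric curve plus a behind-at-halftime indicator, and see how big an estimated “jump” the misspecification alone can produce.

# Fake-data check (sketch): no true discontinuity at zero, but we fit a model
# that has one. All values below are made up for illustration.
set.seed(2011)
n_sims <- 200
jump_est <- rep(NA, n_sims)
for (s in 1:n_sims) {
  n_games <- 18000
  margin <- sample(-10:10, n_games, replace = TRUE)    # halftime point differential
  p_win <- pnorm(margin / 6)                           # smooth "true" curve, no jump
  win <- rbinom(n_games, 1, p_win)
  behind <- as.numeric(margin < 0)
  fit <- glm(win ~ margin + behind, family = binomial) # stand-in parametric model
  # approximate probability-scale jump at a halftime margin of zero
  p_ahead  <- plogis(coef(fit)["(Intercept)"])
  p_behind <- plogis(coef(fit)["(Intercept)"] + coef(fit)["behind"])
  jump_est[s] <- p_behind - p_ahead
}
mean(jump_est)  # average estimated "jump" when the truth has none
sd(jump_est)    # variation across simulations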

This is a great example for us to consider: not obviously wrong, not obviously right either.

Bringing the Bayes, and thinking about a potential replication study

I wonder if anyone’s followed up? We now have another 13 years of data.

What if I had to guess (or “bet”) what would happen if these new data were analyzed in the same way? What do I think we’d see?

We can consider a series of analyses:

1. Start by taking the 2011 result at face value and being Bayesian. For the NBA, the data estimate is 0.058 with standard error 0.015. What’s our prior? I’d center it at zero—really no reason to think ahead of time that there’d be a jump in probability at zero. What about the prior sd? I’m not sure, but we can start with their statement that 0.06 is approximately half the size of the home-court advantage. If the home-court advantage is 0.12, then I’d think any halftime effect would be much less—let’s say a prior sd of 0.01, so it’s highly unlikely to see an effect of more than 2 percentage points in either direction. My posterior estimate of effect size is then (1/0.015^2)*0.058/((1/0.015^2) + (1/0.01^2)) = 0.018, with standard deviation 1/sqrt((1/0.015^2) + (1/0.01^2)) = 0.008. (These calculations are collected in an R sketch after this list.)

2. But the above analysis doesn’t seem quite right, given the strong disagreement between prior and likelihood. I guess the point is that we shouldn’t quite believe the prior or the likelihood here: we shouldn’t believe the prior because maybe I’m suffering from a failure of imagination and the effect could actually be larger, and we shouldn’t believe the likelihood because it doesn’t account for model errors or selection in which model was used to summarize the data. We could, say, double the sd of the prior and the likelihood, in which case we’d still get a posterior estimate of 0.018, but now with a standard deviation of 0.016.

3. All that’s for the NBA. We should also do the NCAA. Indeed, had the NCAA result been larger, I assume the published article would’ve focused on that part of the analysis. For the college games the estimate is 0.025 with standard error 0.01; combining that with the normal(0, 0.01) prior gives a posterior mean of 0.0125 with standard deviation 0.007; again, following step 2 above let’s double the likelihood and prior uncertainties, so now we have a posterior mean of 0.015 with standard deviation 0.014.

4. If I take those analyses seriously, I’d have to say I’m something like 85% sure that the true effect is positive, where “true effect” is defined as some increase in the average probability of winning, if you’re behind at halftime, compared to what would be expected under a smooth model. Do I really think 85%? Maybe. If you asked me what would I expect is the true effect going forward, if it were possible to get data on zillions of games and estimate this very precisely, I guess I’d give less than an 85% probability of a positive effect. Maybe a 60% probability? And my best estimate of the effect size would be less than 0.01. The point is that the effect size itself varies: past data give some insight into the future, but this particular effect seems so fragile (not in the statistical estimation sense, but fragile in the sense that any effect is some unstable combination of strategy and psychology that could well change in different leagues or different eras of the game) that I wouldn’t want to think of this as some near-constant effect going forward. As they say in psychology, it’s domain-specific, and the “domain” isn’t just sports, or basketball, or even the NBA and NCAA, but rather these leagues at these particular times.

5. What about actual new data? That’s tough, because even 13 years of new data is not a lot; any estimates will be noisy compared to actual effect sizes. Still, I’d be interested in seeing what comes up.

6. Suppose someone did a preregistered analysis on the new data, and further suppose, just for simplicity, that the new sample size is the same as in the Berger and Pope (2011) article. In that case, what’s my probability that the new results are “successful” (as defined in the conventional way, as an estimate that is positive and more than two standard errors from zero)? Even setting aside potential changes in the effect over time, the probability of a “successful replication” is surprisingly low! For the NBA study, the standard error of the data estimate was 0.015, so in a replication we’d need an estimate of at least 0.03. My posterior (see item 2 above) probability of this happening is 1 - pnorm(0.03, 0.018, 0.016) = 23%. For the NCAA, we’d need an estimate of at least 0.02, which has a posterior (see item 3 above) probability of 1 - pnorm(0.02, 0.015, 0.014) = 36%. Considering these two studies as independent events, the probability of both of these happening is 8% and the probability of at least one of them happening is 51%. I think that if a new study were performed and just one of the two comparisons reached the statistical significance threshold, it would be considered a success. So there you have it.
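
Here are those calculations collected in a few lines of R, just the normal-normal conjugate updates and the replication probabilities described in the items above:

# Normal-normal update: precision-weighted average of prior and data estimate.
post <- function(est, se, prior_mean = 0, prior_sd = 0.01) {
  prec <- 1 / se^2 + 1 / prior_sd^2
  c(mean = (est / se^2 + prior_mean / prior_sd^2) / prec, sd = sqrt(1 / prec))
}
post(0.058, 0.015)                   # item 1, NBA: mean 0.018, sd 0.008
post(0.058, 0.030, prior_sd = 0.02)  # item 2, NBA with doubled sds: mean 0.018, sd 0.016
post(0.025, 0.010)                   # item 3, NCAA: mean 0.0125, sd 0.007

# Item 6: chance that a same-sized replication gives an estimate at least two
# standard errors above zero, using the posteriors from items 2 and 3.
p_nba  <- 1 - pnorm(0.03, 0.018, 0.016)  # about 0.23
p_ncaa <- 1 - pnorm(0.02, 0.015, 0.014)  # about 0.36
p_nba * p_ncaa                           # both: about 0.08
p_nba + p_ncaa - p_nba * p_ncaa          # at least one: about 0.51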

The big picture: Basketball

As noted above, I think that any effect of being ahead or behind at halftime will be context-specific, and I’d guess the best way of studying this would be to look at something like how many minutes are played by bench players in the third quarter. Just looking at won-lost isn’t so great because there’s not a lot of information in binary data.

At this point, you might say: But isn’t everything context-specific? The home-court advantage: that’s context-specific too! But we have no problem talking about that, without needing a million qualifiers. The difference is that the home-court advantage is large and persistent, two things we can’t really say with any confidence regarding the behind-at-halftime effect. Yes, the larger of the two estimates reported in the published paper was a 6 percentage point increase in win probability, and that ain’t nothing—but, as discussed above, we have lots of reasons to think the true effect was much smaller.

The big picture: Bayesian analysis of empirical studies

It was interesting to go through all the steps. The experience was similar to when we tried to think hard about probabilities in election forecasts:

No, Bayes does not like Mayor Pete. (Pitfalls of using implied betting market odds to estimate electability.)

Do we really believe the Democrats have an 88% chance of winning the presidential election?

How to think about extremely unlikely events (such as Biden winning Alabama, Trump winning California, or Biden winning Ohio but losing the election)?

Sh*ttin brix in the tail…

Or my favorite simple example in Section 3 of our article, Holes in Bayesian Statistics.

It can be challenging to think Bayesianly with the goal of coming up with a believable and coherent set of inferences, but I think the effort is worth it.

P.S. In comments, Mike points to a new article, Does Losing Lead to Winning? An Empirical Analysis for Four Sports, by Bouke Klein Teeselink, Martijn J. van den Assem, and Dennie van Dolder, that appears to have performed the replication that I was looking for! They find:

When we revisit the phenomenon for basketball, we only find supportive evidence for NBA matches from the period analyzed in Berger and Pope. There is no significant effect for NBA matches from outside this sample period, for NCAA matches, or for matches from the Women’s NBA.

This is as expected given our analysis above. I’m glad I’d not heard of this new paper before writing the above post, as I think it was instructive to think through my priors before seeing any new data.

P.P.S. Just to clarify one issue: Whether or not there’s ultimately a persistent and measurable behind-at-halftime effect, it’s still better to be up by 1 than down by 1. Any reasonable estimate of the being-behind-at-halftime effect will be smaller than the natural difference in probability corresponding to moving up by 2 in the point differential. The discussion of the behind-at-halftime effect is entirely about whether the curve of Pr (win | halftime point differential) is flatter right around zero than in the negative or positive zones.

Not only did this guy not hold the world record in the 100-meter or 110-yard dash for 35 years, he didn’t even run the 110-yard dash in 10.8 seconds, nor did he see a million patients, nor was he on the Michigan football and track teams, nor did Michigan even have a track team when he attended the university. It seems likely that he did know Jack Dempsey, though.

Paul Campos has the story of Fred Bonine of Niles, Michigan. When I’d looked the guy up, I’d been properly suspicious of his athletic claims, but I hadn’t thought to think through that suspicious line about seeing a million patients.

The good news is that Ring Lardner’s stories still appear to hold up.

What is fame? The perspective from Niles, Michigan. Including an irrelevant anecdote about “the man who invented hedging”

Following up on our recent discussions of the different dimensions of fame (see here and here), I thought something could be gained by looking carefully at a narrowly defined subset.

Here’s Wikipedia’s list of notable people from Niles, Michigan. Wikipedia lists them in alphabetical order:

Joanna Beasley (born 1986), musician

Fred Bonine (1863–1941), held world’s record in the 100-meter dash from 1886 until 1921; became internationally known eye doctor who saw over one million patients at his office in Niles

Jake Cinninger (born 1975), musician, Umphrey’s McGee

Greydon Clark (born 1943), film director

John Francis Dodge (1864–1920), automobile industry pioneer

Horace Elgin Dodge (1868–1920), automobile industry pioneer

Edward L. Hamilton (1857–1923), U.S. Representative from 1897 until 1921. Served as chair of the United States House Committee on Territories from 1903 until 1911.

Thomas Fitzgerald (1796–1855), U.S. Senator and probate judge

Tommy James (born 1947), musician, Tommy James and the Shondells

Ring Lardner (1885–1933) Sr., satirist, short story writer and sports columnist

Lillian Luckey (1919–2021), All-American Girls Professional Baseball League player

Michael Mabry (born 1955), graphic designer and illustrator

Dave Schmidt (born 1957), Major League Baseball pitcher

Diane Seuss (born 1956), poet, finalist for Pulitzer Prize

Michael D. West, (born 1953) founder of Geron, now CEO of BioTime

Aaron Montgomery Ward (1844–1913), founder, Montgomery Ward

Some of these people are obscure, once-famous or, in some cases, never-famous. I could care less about the former chair of the United States House Committee on Territories. But a few of them stand out:

Fred Bonine: I’d never heard of the guy before this, but, hey! of all the people on this list, he seems the most impressive to me. Holding the world record for 35 years . . . how is that even possible? And then to see a million patients—that’s really something. If I had to pick one person to represent Niles, Michigan, it would be Fred Bonine.

I was curious about this world record so I noodled around on Wikipedia . . . “In 1886, Bonine set a world’s record with a time of 10.8 seconds in the 110-yard dash. The record stood for 35 years until it was broken in 1921 by Charley Paddock.” 10.8 seconds, that’s not bad! How were things going at the Olympics? In 1896, it seems that the winning time was 11.8 seconds, which seems kinda slow, actually. It didn’t go below 10.8 until 1912. But then we can list the 100 meters world record progression (which Wikipedia amusingly refers to as the “Men’s 100 metres world record progression”), which starts with a bunch of guys doing it in 10.8 at many different places, from 1891 through 1903, then it drops to 10.6 in 1906 and 10.5 in 1911. Bonine’s not listed at all, even though 110 yards is actually a slightly longer distance than 100 meters! Whassup with that! Maybe Bonine’s 10.8 was clocked by someone with an itchy stopwatch finger, I have no idea. I’m kinda thinking he didn’t really do it in 10.8 seconds, but who knows?

Anyway, to continue: the Dodge brothers probably have the most famous names in the list. I doubt many people really care now about their pioneering work in the auto industry, but oldsters still remember their names on millions of cars. When I was a kid we had a Dodge Dart. It was a piece of crap, typical American car always breaking down, and with a hinky three-on-the-tree gearshift.

Ring Lardner—yes, I knew he was from Niles; it was a line in “Here’s Ring Lardner’s Autobiography,” which I will copy here in its entirety, because, why not:

Hardly a man is now alive
Who cares that in March, 1885,
I was born in the city of Niles,
Michigan, which is 94 miles
From Chicago, a city in Illinois.
Sixteen years later, still only a boy,
I graduated from the Niles High School
With a general knowledge of rotation pool.
After my schooling, I thought it best
To give my soul and body a rest.
In 1905 this came to an end,
When I went to work on The Times in Souse Bend.
Thence to Chi, where I labored first
On the Inter-Ocean and then for Hearst,

Then for the Tribune and then to St. Lews,
Where I was editor of Sporting News.
And thence to Boston, where later a can
Was tied to me by the manager man.
1919 was the year
When, in Chicago, I finished my daily newspaper career.
In those 14 years—just a horse’s age—
My stuff was all on the sporting page.
In the last nine years (since it became illegal to drink),
I’ve been connected with The Bell Syndicate, Inc.,
I have four children as well as one Missus,
None of whom can write a poem as good as this is.

Ring Lardner’s one of the greatest writers who’s ever lived! In some sense. Let me say that Lardner’s on the efficient frontier of writers. There’s no better writer of sports fiction, but it’s not just that. Lardner is just special, in some way that’s hard to specify. But I’d trade 10 Damon Runyons for one Ring Lardner, and Damon Runyon is part of our national patrimony. It’s like . . . ummm, how’s this? Many years ago I recall reading an interview with Paul McCartney, and he said that some days he’d wake up and think, “I’m Paul McCartney,” and just reflect on how amazing that was. (McCartney wasn’t presenting this as an ego trip; it was more that he remains astounded by his persistent fame.) Anyway, that’s how I feel about Lardner. He’s Ring Lardner, and that will always be amazing. Ummmm, I better explain this to the non-Lardner-fans out there: the above poem is nothing so amazing. It’s more that, if you’ve read enough Lardner and you know his amazingness, you’ll enjoy the poem as it will remind you of many of his facets.

OK, to continue . . . I’d never heard of Lillian Luckey and I guess I’ll never think about her again, but she was “Listed at 5 ft 1 in (1.55 m), 126 lb (57 kg), she batted and threw right handed.” And she lived to the age of 102! She might be the person on the list who had the best stories.

I’d say Aaron Montgomery Ward was the most consequential person on the list. He practically invented the mail-order store, changing how Americans lived and shopped. He’s the only person on the list who seems to have had a unique historical niche.

And this reminds me of a story. When I was in college at MIT, it was possible to apply to summer jobs through the career center. One year I applied for a summer job at some actuarial firm. I didn’t really know what this was, but I knew they used probability and statistics, a subject that I enjoyed. In the interview, the guy was talking enthusiastically about his boss, a man who, according to my interviewer, “practically invented hedging.” The interviewer started telling me about some scheme his boss had come up with to avoid paying taxes. It sounded kinda creepy to me, but I don’t have a lot of principles, so it’s not like I stalked off in disgust. I just kept on with the interview, trying to sound enthusiastic. I didn’t get the summer job, which in retrospect is kinda funny: how many applicants did they have with perfect grades from MIT? But maybe he was able to detect my lack of interest, I dunno. I never really followed up to find out who the guy was who practically invented hedging.

Tommy James: You may not recall this name, but . . . he wrote the song Crimson and Clover. You’ve heard that! As a matter of fact, that’s what motivated this post: I heard Crimson and Clover on the radio, looked it up on Wikipedia to find its story, then clicked on the name of the songwriter. His page said he was from Niles, Michigan, which rang a bell—the Lardner poem!—so I looked up that city to see who else was from there. And here we are.

So what has this little study taught us about fame? Not so much, I guess: we already knew that fame is multidimensional: there are different ways of being famous. Montgomery Ward is a mostly forgotten name, but in some sort of integral of fame over time, he might be the winner here. Dodge is even more famous, but I’d say it’s thought of more as a company than a personal name. Ward is that way too, but Dodge even more so, as Ward has some personal reputation as a pioneer in business. Tommy James was personally famous only for a short time, but his song has been well known for many decades now. Ring Lardner was a famous journalist back in the 1920s and is now an obscure historical figure, but I’d guess that people will still be reading some of his stories, long after Ward etc. are just historical figures. And Lillian Luckey and Fred Bonine illustrate how you can be locally famous. I think some of this discussion is improved by its narrow focus. Going down the ladder of fame like this gives us some clarity that is lost when comparing biggies such as Norman Lear, Henry Kissinger, and Queen Elizabeth.

P.S. More here.

Djokovic, data sleuthing, and the Case of the Incoherent Covid Test Records

Kaiser Fung tells the story. First the background:

Australia, having pursued a zero Covid policy for most of the pandemic, only allows vaccinated visitors to enter. Djokovic, who’s the world #1 male tennis player, is also a prominent anti-vaxxer. Much earlier in the pandemic, he infamously organized a tennis tournament, which had to be aborted when several players, including himself, caught Covid-19. He is still unvaccinated, and yet he was allowed into Australia to play the Open. . . . When the public learned that Djokovic received a special exemption, the Australian government decided to cancel his visa. . . . This then became messier and messier . . .

In the midst of it all, some enterprising data journalists uncovered tantalizing clues that demonstrate that Djokovic’s story used to obtain the exemption is full of holes. It’s a great example of the sleuthing work that data analysts undertake to understand the data.

Next come the details. I haven’t looked into any of this, so if you want more you can follow the links at Kaiser’s post:

A central plank of the tennis player’s story is that he tested positive for Covid-19 on December 16. This test result provided grounds for an exemption from vaccination . . . The timing of the test result was convenient, raising the question of whether it was faked. . . .

Digital breadcrumbs caught up with Djokovic. As everyone should know by now, every email receipt, every online transaction, every time you use a mobile app, you are leaving a long trail for investigators. It turns out that test results from Serbia include a QR code. QR code is nothing but a fancy bar code. It’s not an encrypted message that can only be opened by authorized people. Since Djokovic’s lawyers submitted the test result in court documents, data journalists from the German newspaper Spiegel, partnering with a consultancy Zerforschung, scanned the QR code, and landed on the Serbian government’s webpage that informs citizens of their test results.

The information displayed on screen was limited and not very informative. It just showed the test result was positive (or negative), and a confirmation code. What caught the journalists’ eyes was that during the investigation, they scanned the QR code multiple times, and saw Djokovic’s test result flip-flop. At 1 pm, on December 10, the test was shown as negative (!) but about an hour later, it appeared as positive. That’s the first red flag.

Kaiser then remarks:

Since statistical sleuthing inevitably involves guesswork, we typically want multiple red flags before we sound the alarm.

He’ll return to the uncertain nature of evidence.

But now let’s continue with the sleuthing:

The next item of interest is the confirmation code which consists of two numbers separated by a dash. The investigators were able to show that the first number is a serial number. This is an index number used by databases to keep track of the millions of test results. In many systems, this is just a running count. If it is a running count, data sleuths can learn some things from it. This is why even so-called metadata can reveal more than you think. . . .

Djokovic’s supposedly positive test result on December 16 has serial number 7371999. If someone else’s test has a smaller number, we can surmise that the person took the test prior to Dec 16, 1 pm. Similarly, if someone took a test after Dec 16, 1 pm, it should have a serial number larger than 7371999. There’s more. The gap between two serial numbers provides information about the duration between the two tests. Further, this type of index is hard to manipulate. If you want to fake a test in the past, there is no index number available for insertion if the count increments by one for each new test! (One can of course insert a fake test right now before the next real test result arrives.)

Wow—this is getting interesting! Kaiser continues:

The researchers compared the gaps in these serial numbers and the official tally of tests conducted within a time window, and felt satisfied that the first part of the confirmation code is an index that effectively counts the number of tests conducted in Serbia. Why is this important?

It turns out that Djokovic’s lawyers submitted another test result to prove that he has recovered. The negative test result was supposedly conducted on December 22. What’s odd is that this test result has a smaller serial number than the initial positive test result, suggesting that the first (positive) test may have come after the second (negative) test. That’s red flag #2!

To get to this point, the detectives performed some delicious work. The landing page from the QR code does not actually include a time stamp, which would be a huge blocker to any of the investigation. But… digital breadcrumbs.

While human beings don’t need index numbers, machines almost always do. The URL of the landing page actually contains a disguised date. For the December 22 test result, the date was shown as 1640187792. Engineers will immediately recognize this as a “Unix date”. A simple decoder returns a human-readable date: December 22, 16:43:12 CET 2021. So this second test was indeed performed on the day the lawyers had presented to the court.

Dates are also a type of index, which can only increment. Surprisingly, the Unix date on the earlier positive test translates to December 26, 13:21:20 CET 2021. If our interpretation of the date values is correct, then the positive test appeared 4 days after the negative test in the system. That’s red flag #3.

To build confidence that they interpreted dates correctly, the investigators examined the two possible intervals: December 16 and 22 (Djokovic’s lawyers), and December 22 and 26 (apparent online data). Remember the jump in serial numbers in each period should correspond to the number of tests performed during that period. It turned out that the Dec 22-26 time frame fits the data better than Dec 16-22!
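
As an aside, the Unix-date conversion mentioned above is a one-liner in most languages. Here is a minimal Python sketch using the value quoted earlier; the only assumption is that CET is UTC+1 (as it is in winter):

```python
from datetime import datetime, timezone, timedelta

# A Unix timestamp counts seconds since 1970-01-01 00:00:00 UTC.
CET = timezone(timedelta(hours=1))  # Central European Time in winter

ts = 1640187792  # value found in the URL for the December 22 test result
print(datetime.fromtimestamp(ts, tz=CET))
# 2021-12-22 16:43:12+01:00, matching the date reported above
```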

But:

The stuff of this project is fun – if you’re into data analysis. The analysts offer quite strong evidence that there may be something smelly about the test results, and they have a working theory about how the tests were faked.

That said, statistics do not nail fraudsters. We can show plausibility or even high probability, but we cannot use statistics alone to rule out the possibility that we are looking at an outlier. Typically, statistical evidence needs to be backed up by physical evidence.

And then:

Some of the reaction to the Spiegel article demonstrates what happens with suggestive data that nonetheless are not infallible.

Some aspects of the story were immediately confirmed by Serbians who have taken Covid-19 tests. The first part of the confirmation number appears to change with each test, and more recent serial numbers are larger than older ones. The second part of the confirmation number, we learned, is a kind of person ID, as it does not vary between successive test results.

One part of the story did not hold up. The date found on the landing page URL does not seem to be the date of the test, but the date on which someone requests a PDF download of the result. This behavior can easily be verified by anyone who has test results in the system.

Kaiser explains:

Because of this one misinterpretation, the data journalists seemed to have lost a portion of readers, who now consider the entire data investigation debunked. Unfortunately, this reaction is typical. It’s even natural in some circles. It’s related to the use of “counterexamples” to invalidate hypotheses. Since someone found the one thing that isn’t consistent with the hypothesis, the entire argument is thought to have collapsed.

However, this type of reasoning should be avoided in statistics, which is not like pure mathematics. One counterexample does not spell doom to a statistical argument. A counterexample may well be an outlier. The preponderance of evidence may still point in the same direction. Remember there were multiple red flags. Misinterpreting the dates does not invalidate the other red flags. In fact, the new interpretation of the dates cannot explain the jumbled serial numbers, which do not change when a new PDF is requested.

This point about weighing the evidence is important, because there are always people who will want to believe. Whether it’s political lies about the election (see background here) or endlessly debunked junk science such as the critical positivity ratio (see here), people just won’t let go. Once their story has been shot down, they’ll look for some other handhold to grab onto.

In any case, the Case of the Incoherent Covid Test Records is a fun example of data sleuthing with some general lessons about statistical evidence.

Kaiser’s discussion is great. It just needs some screenshots to make the storytelling really work.

P.S. In comments, Dieter Menne links to some screenshots, which I’ve added to the post above.

The so-called “lucky golf ball”: The Association for Psychological Science promotes junk science while ignoring the careful, serious work of replication

“Not replicable, but citable” is how Robert Calin-Jageman puts it. His colleague Geoff Cumming tells the story:

The APS [Association for Psychological Science] has just given a kick along to what’s most likely a myth: The Lucky Golf Ball. Alas!

Golf.com recently ran a story titled ‘Lucky’ golf items might actually work, according to study. The story told of Tiger Woods sinking a very long putt to send the U.S. Open to a playoff. “That day, Tiger had two lucky charms in-play: His Tiger headcover, and his legendary red shirt.”

The story cited Damisch et al. (2010), published in Psychological Science, as evidence the lucky charms may have contributed to the miraculous putt success.

Laudably, the APS highlights public mentions of research published in its journals. It posted this summary of the Golf.com story, and included it (‘Our science in the news’) in the latest weekly email to members.

However, this was a misfire, because the Damisch results have failed to replicate, and the pattern of results has prompted criticism of the work. . . .

The original Lucky Golf Ball study

Damisch et al. reported a study in which students in the experimental group were told—with some ceremony—that they were using the lucky golf ball; those in the control group were not. Mean performance was 6.4 of 10 putts holed for the experimental group, and 4.8 for controls—a remarkable difference of d = 0.81 [0.05, 1.60]. (See ITNS, p. 171.) Two further studies using different luck manipulations gave similar results.

The replications

Bob and colleague Tracy Caldwell (Calin-Jageman & Caldwell, 2014) carried out two large preregistered close replications of the Damisch golf ball study. Lysann Damisch kindly helped them make the replications as similar as possible to the original study. Both replications found effects close to zero. . . .

The pattern of Damisch results

The six [confidence intervals in the original published study] are astonishingly consistent, all with p values a little below .05. Greg Francis, in this 2016 post, summarised several analyses of the patterns of results in the original Damisch article. All, including p-curve analysis, provided evidence that the reported results most likely had been p-hacked or selected in some way.

Another failure to replicate

Dickhäuser et al. (2020) reported two high-powered preregistered replications of a different one of the original Damisch studies, in which participants solved anagrams. Both found effects close to zero.

All in all, there’s little evidence for the lucky golf ball. APS should skip any mention of the effect.

I can’t really blame the authors of the original study. 2010 was the dark ages, before people fully realized the problems with making research claims by sifting through “statistically significant” comparisons. Sure, we realized that p-values had problems, and we knew there were ways to do better, but we didn’t understand how this mode of research could not just exaggerate effects and lead to small problems but could allow researchers to find patterns out of absolutely nothing at all. As E. J. Wagenmakers put it a few years later, “disbelief does in fact remain an option.”

I can, however, blame the leaders of the Association for Psychological Science. To promote work that failed to replicate, without mentioning that failure (or the statistical problems with the original study), that’s unscientific, it’s disgraceful, it’s the kind of behavior I expect to see from the Association for Psychological Science. So I’m not surprised. But it still makes me sad. I know lots of psychology researchers, and they do great work. Why can’t the APS write about that? Why can’t they write about the careful work of Calin-Jageman and Caldwell? This is selection bias, a Gresham’s law in which the crappier work gets hyped. Not a good look, APS.

As to Golf.com, sure, they fell down on the job too, but I can excuse them for naively thinking that a paper published in a leading scientific journal should be taken seriously. As the graph above shows, this problem of citing unreplicated work goes far beyond Golf.com.

I get that the APS made a mistake in 2010 by publishing the original golf ball paper. Everybody makes mistakes. But to promote it ten years later, in spite of all the failed replications, that’s not cool. They’re not just passively benefiting from journalists’ credulity; they’re fanning the flames.

Golf: No big deal?

Yeah, sure, golf, no big deal, all in good fun, etc., right? Sure, but . . .

1. The APS dedicated a chunk of space in its flagship journal to that golf article, which has the amusing-in-retrospect title, “Keep Your Fingers Crossed!: How Superstition Improves Performance.” If golf is important enough to write about in the first place, it’s important enough to not want to spread misleading claims about.

2. The same issue—the Association for Psychological Science promoting low-quality publications—arises in topics more important than golf; see for example here and here.

The NFL regression puzzle . . . and my discussion of possible solutions:

Alex Tabarrok writes:

Here’s a regression puzzle courtesy of Advanced NFL Stats from a few years ago and pointed to recently by Holden Karnofsky from his interesting new blog, ColdTakes. The nominal issue is how to figure out whether Aaron Rodgers is underpaid or overpaid given data on salaries and expected points added per game. Assume that these are the right stats and correctly calculated. The real issue is which is the best graph to answer this question:

Brian 1: …just look at this super scatterplot I made of all veteran/free-agent QBs. The chart plots Expected Points Added (EPA) per Game versus adjusted salary cap hit. Both measures are averaged over the veteran periods of each player’s contracts. I added an Ordinary Least Squares (OLS) best-fit regression line to illustrate my point (r=0.46, p=0.002).

Rodgers’ production, measured by his career average Expected Points Added (EPA) per game, is far higher than the trend line says would be worth his $21M/yr cost. The vertical distance between his new contract numbers ($21M/yr and about 11 EPA/G) and the trend line illustrates the surplus performance the Packers will likely get from Rodgers.

According to this analysis, Rodgers would be worth something like $25M or more per season. If we extend his 11 EPA/G number horizontally to the right, it would intercept the trend line at $25M. He’s literally off the chart.

Brian 2: Brian, you ignorant slut. Aaron Rodgers can’t possibly be worth that much money….I’ve made my own scatterplot and regression. Using the exact same methodology and exact same data, I’ve plotted average adjusted cap hit versus EPA/G. The only difference from your chart above is that I swapped the vertical and horizontal axes. Even the correlation and significance are exactly the same.

As you can see, you idiot, Rodgers’ new contract is about twice as expensive as it should be. The value of an 11 EPA/yr QB should be about $10M.

Alex concludes with a challenge:

Ok, so which is the best graph for answering this question? Show your work. Bonus points: What is the other graph useful for?
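
One way to see the mechanical part of what is going on, without settling the puzzle: with a correlation well below 1, the regression of pay on performance and the regression of performance on pay have different slopes (their product is r squared), so the same player gets two different “fair value” readings. Here is a minimal simulation sketch; the numbers are made up and chosen only to give roughly the reported correlation of 0.46.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake QB data, invented for illustration only: performance (EPA per game)
# and salary cap hit (in $M), correlated at roughly r = 0.46.
n = 40
epa = rng.normal(5, 3, n)
salary = 2 + 1.5 * epa + rng.normal(0, 8, n)
print("correlation:", round(np.corrcoef(epa, salary)[0, 1], 2))

cov = np.cov(epa, salary)[0, 1]
b_salary_on_epa = cov / np.var(epa, ddof=1)      # slope of salary ~ EPA
b_epa_on_salary = cov / np.var(salary, ddof=1)   # slope of EPA ~ salary

# Fair pay for an 11 EPA/G player, read off the salary-on-EPA line:
pay1 = salary.mean() + b_salary_on_epa * (11 - epa.mean())
# ... and read off the EPA-on-salary line, inverted:
pay2 = salary.mean() + (11 - epa.mean()) / b_epa_on_salary
print("implied pay, salary-on-EPA line:    ", round(pay1, 1))
print("implied pay, inverted EPA-on-salary:", round(pay2, 1))
```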

I posted this a few months ago and promised my solution. Here it is:
Continue reading

Objectively worse, but practically better: an example from the World Chess Championship

A position from Game 2 of the 2021 World Chess Championship match. White has just played e4.

This post is by Phil Price, not Andrew.

The World Chess Championship is going on right now. There have been some really good games and some really lousy ones — the challenger, Ian Nepomniachtchi (universally known as ‘Nepo’) has played far below his capabilities in a few games. The reigning champ, Magnus Carlsen, is almost certain to retain his title (I’ll offer 12:1 if anyone is interested!).

It would take some real commitment to watch the games in real time in their entirety, but if you choose to do so there is excellent coverage in which strong grandmasters discuss the positions and speculate on what might be played next. They are aided in this by computers that can evaluate the positions “objectively”, and occasionally they will indeed mention what the computer suggests, but much of the time the commenters ignore the computer and discuss their own evaluations.

I suppose it’s worth mentioning that computers are by far the strongest chess-playing entities, easily capable of beating the best human players even if the computer is given a significant disadvantage at the start (such as being down a pawn). Even the best computer programs don’t play perfect chess, but for practical purposes the evaluation of a position by a top computer program can be thought of as the objective truth.

I watched a fair amount of live commentary on Game 2 from Judit Polgar and Anish Giri… I just sort of got caught up in it and spent way more time watching than I had intended. At the point in the commentary shown in the image (1:21 into the YouTube video), the computer evaluation says the players are dead even, but both Polgar and Giri felt that in practice White has a significant advantage. As Giri put it, “disharmonious positions [like the one black is in] require weird solutions…Ian has great pattern recognition, but where has he seen a pattern of pawns [on] f6, e6, c6, queen [on] d7? Which pattern is he trying to recognize? The pattern recognition is, like, it’s broken… I don’t know how I’m playing with black, I’ve never seen such a position before. Fortunately. I don’t want to see it anymore, either.”

In the end, Nepo — the player with Black — managed to draw the position, but I don’t think anyone (including Nepo) would disagree with their assessment that at this point in the game it is much easier to play for White than for Black.

Interestingly, this position was reached following a decision much earlier in the game, when Carlsen played a move that, according to the computer, gave Nepo a slight edge. This was quite early in the game, when both players were still “in their preparation”, meaning that they were playing moves that they had memorized. (At this level, each player knows the types of openings that the other likes to play, so they can anticipate that they will likely play one of a manageable number of sequences, or ‘lines’, for the first eight to fifteen moves. When I say “manageable number” I mean a few hundred.) At that earlier point in the game, when Carlsen made that “bad” move, Giri pointed out that this might take Nepo out of his preparation, since you don’t usually bother looking into lines that assume the other player is going to deliberately give away his advantage.

So: Carlsen deliberately played in a way that was “objectively” worse than his alternatives, but that gave him better practical chances to win. It’s an interesting phenomenon.

“Christopher Columbus And The Replacement-Level Historical Figure”

This post from Patrick Wyman is interesting. Key quote:

Rather than casting Columbus as either the hero or the villain in an epic story about the emergence of a recognizably modern world, we should understand him as a replacement-level historical figure: not among the elite, a Clayton Kershaw or prime Carmelo Anthony; not in the mid-to-upper tier of his profession, like Nelson Cruz, Joe Flacco, or CJ McCollum. He was a notable step below that.

It is better and more accurate to think of Columbus as Bronson Arroyo or Nick Young or Trent Dilfer—an innings-eater, a bench player averaging 9 points in 25 minutes of action, the guy handing off the ball to Jamal Lewis. . . .

Outside of his flamboyance and his tendency to bray loudly about what he perceived to be his own personal brilliance and destiny—the 15th-century equivalent of Bronson Arroyo’s Stone Temple Pilots covers or Nick Young’s iconic GIFs—Columbus’s skill sets and attitudes were almost completely typical of the community of Mediterranean and Atlantic sailors to which he belonged. Columbus was remarkable, if he was remarkable at all, for how deeply unremarkable he was.

The types of experience and skills that sent Columbus out into the Atlantic and then safely home were widespread in his world. That, rather than any personal characteristic of Columbus himself, is the extraordinary and impactful thing we should strive to understand. There were dozens, hundreds, even thousands of potential Columbuses running around the bustling port cities of Europe in the 1480s and 1490s. And beyond the sailors themselves, there were shipbuilders who constructed sophisticated vessels capable of long-distance travel at sea, metallurgists who forged cannon, and the financiers whose command of complex mechanisms of credit and repayment paid for all of it. . . .

Columbus was just a person, a representative of a broader Type of Guy who was common along the sea-lanes of the western Mediterranean and Atlantic in the closing decades of the 15th century. . . . His ideas about commercial profits, the use of violence in acquiring those profits, and the potential of enslavement to meet that goal were likewise pretty standard.

It wasn’t just Columbus. Vasco da Gama was a true psychopath who burned alive Muslim passengers on a captured ship in the Indian Ocean. Cortes and Pizarro slaughtered staggering numbers of people in their conquests of the New World empires. Amerigo Vespucci was probably a pimp and definitely a ruthless slaver. Those are just the headline-grabbers, the other people whose names you might know; they’re the tip of the iceberg. The conquest of the Canary Islands in the 1480s, financed by the same group of people who funded Columbus’s voyage in 1492, had been exceedingly brutal. This was the context that produced Columbus, the tradition to which he belonged and to which he contributed. . . .

Major League Baseball is awash in pitchers who can throw 5 innings of three-run baseball every five days. The pitches they throw might look different, but it doesn’t really matter whether you’re starting Bronson Arroyo or Jason Vargas or Mike Leake or Chase Anderson 30 times in a season. There are plenty of lanky wings in the NBA who can shoot 33 percent from three and play a little defense; Kent Bazemore, Mo Harkless, Al-Farouq Aminu, and Reggie Bullock are all variations on the same theme. Mike Shanahan figured out a long time ago that he could find a competent running back to carry the ball 300 times a season in the draft’s late rounds, and NFL coaches have been following that example for two decades now. In the admittedly rarefied context of the incredible baseline levels of skill and athleticism required to play at that level, these people are all basically interchangeable. The same is true of Columbus in the 1480s and 1490s. . . .

Interesting. This all makes sense, but I’d never really thought of it that way before.

Event frequencies and my dated MLB analogy

Apparently, it’s blog day!

This post is by Lizzie, and I am requesting analogy help (by the way, thanks for your recent help on how to teach simulation to students).

Yesterday morning I watched a little Metro-Vancouver parks worker trundling along in their tractor, as they gathered up the debris strewn across the beach from our recent storm. The storm had been fantastic fun to cycle home during and I snapped some photos on my ride that do not at all do justice to how riled up the ocean looked (one shown). It also triggered the now-almost-normal stream of requests to link “the severe weather effects we are seeing in [insert place] and how this relates to global warming.”

Which led me to trot out my now very old analogy to explain why we cannot generally attribute any one specific weather event (a specific storm, frost, heat wave etc.) to climate change: consider an MLB player, let’s call her Barry…. For the beginning years of her MLB career she was a pretty good hitter and every so often hit a home run. In the later part of her career she starts taking steroids and hits many more home runs on average. You can’t attribute any particular home run to Barry’s steroid use, but you can associate the changing frequency ….

I didn’t come up with this analogy. I copied it from someone who copied it from someone … and on and on until we find someone who thinks he invented it, but I bet he just forgot where he heard it.

And I like it! People generally get the connection and they are sometimes willing to let go of their urge to pressure me to say, ‘Whoa! What a storm that was yesterday. That storm was caused by climate change, folks.’ And the steroid angle fits nicely with our juiced-up climate system, so it’s often a good segue into what’s changing in our climate system.

But my analogy feels really out of date! It feels old and I think I lose people who try to remember back when or figure out what I am talking about. I am wondering if anyone has (and is willing to share) a better one they’re using, or wants to propose one I can use.

Research on heat extremes is moving towards terms such as ‘nearly impossible in the absence of warming‘ or ‘virtually impossible without human-caused climate change‘  so maybe I can shelve my example someday? But I am not ready for that. (For anyone waiting on rapid attribution of the PNW storm, I suspect World Weather Attribution is working on it.)

Estimating basketball shooting ability while it varies over time

Someone named Brayden writes:

I’m an electrical engineer with interests in statistics, so I read your blog from time to time, and I had a question about interpreting some statistics results relating to basketball free throws.

In basketball, free throw shooting has some variance associated with it. Suppose player A is a career 85% free throw shooter on 2000 attempts and player B is a career 85% free throw shooter on 50 attempts, and suppose that in a new NBA season, both players start out their first 50 free throw attempts shooting 95% from the line. Under ideal circumstances (if it was truly a binomial process), we could say that player A is probably just on a lucky streak, since we have so much past data indicating his “true” FT%. With player B, however, we might update what we believe is his “true” FT% is, and be more hesitant to conclude that he’s just on a hot streak, since we have very little data on past performance.

However, in the real basketball world, we have to account for “improvement” or “decline” of a player. With improvement being a possibility, we might have less reason to believe that player A is on a hot streak, and more reason to believe that they improved their free throw shooting over the off-season. So I guess my question is: when you’re trying to estimate a parameter, is there a formal process defined for how to account for a situation where your parameter *might* be changing over time as you observe it? How would you even begin to mathematically model something like that? It seems like you have a tradeoff between sample size being large enough to account for noise, but not too large such that you’re including possible improvements or declines. But how do you find the “sweet spot”?

My reply:

1. Yes, this all can be done. It should not be difficult to write a model in Stan allowing measurement error, differing player abilities, and time-varying abilities. Accuracy can vary over the career and also during the season and during the game. There’s no real tradeoff here; you just put all the variation in the model, with hyperparameters governing how much variation there is at each level. I haven’t done this with basketball, but we did something similar with time-varying public opinion in our election forecasting model.

2. Even the non-time-varying version of the model is nontrivial! Consider your above example, just changing "50 attempts" to "100 attempts" in each case so that the number of successes becomes an integer. With no correlation and no time variation in ability, you get the following data:
player A: 1795 successes out of 2100 tries, a success rate of 85.5%
player B: 180 successes out of 200 tries, a success rate of 90%.
But then you have to combine this with your priors. Let’s assume for simplicity that our priors for the two players are the same. Depending on your prior, you might conclude that player A is probably better, or you might conclude that player B is probably better. For example, if you start with a uniform (0, 1) prior on true shooting ability, the above data would suggest that player B is probably better than player A. But if you start with a normal prior with mean 0.7 and standard deviation 0.1 then the above data would lead you to conclude that player A is more likely to be the better shooter.
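
To make the uniform-prior part of this concrete, here is a minimal sketch using Beta-binomial conjugacy: a uniform prior is Beta(1, 1), so each player’s posterior is Beta(successes + 1, failures + 1), and the comparison can be done by simulation. (An informative prior such as the normal one mentioned above would enter the same calculation, but is easier to handle with a grid approximation or a model in Stan.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Data from the example above (after changing 50 attempts to 100).
made_A, attempts_A = 1795, 2100
made_B, attempts_B = 180, 200

# With a uniform Beta(1, 1) prior, the posterior for each player's true
# free-throw probability is Beta(made + 1, missed + 1).
theta_A = rng.beta(made_A + 1, attempts_A - made_A + 1, size=100_000)
theta_B = rng.beta(made_B + 1, attempts_B - made_B + 1, size=100_000)

print("posterior mean, player A:", theta_A.mean().round(3))  # about 0.854
print("posterior mean, player B:", theta_B.mean().round(3))  # about 0.896
print("Pr(player B has the higher true percentage):",
      (theta_B > theta_A).mean().round(2))
```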

3. Thinking more generally, I agree with you that it represents something of a conceptual leap to think of these parameters varying over time. With the right model, it should be possible to track such variation. Cruder estimation methods that don’t model the variation can have difficulty catching up to the data. We discussed this awhile ago in the context of chess ratings.
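
Point 1 mentions doing this properly with a hierarchical model in Stan. Short of that, here is a toy simulation, with all numbers invented, of the phenomenon in point 3: when true ability drifts, a career-to-date average lags behind it, while even a crude method that downweights old games tracks the change more closely.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented scenario: a player whose true free-throw ability improves
# steadily from 78% to 90% over 400 games, with a few attempts per game.
n_games = 400
attempts = rng.poisson(6, n_games).clip(min=1)
p_true = np.linspace(0.78, 0.90, n_games)
made = rng.binomial(attempts, p_true)

# Crude estimate: career-to-date percentage, which ignores the drift.
career_est = np.cumsum(made) / np.cumsum(attempts)

# A simple drift-aware alternative: exponentially downweight old games.
lam = 0.97
w_made, w_att = 0.0, 0.0
discounted_est = np.empty(n_games)
for t in range(n_games):
    w_made = lam * w_made + made[t]
    w_att = lam * w_att + attempts[t]
    discounted_est[t] = w_made / w_att

print("true ability at the end:          ", round(float(p_true[-1]), 3))
print("career-to-date estimate:          ", round(float(career_est[-1]), 3))
print("discounted (drift-aware) estimate:", round(float(discounted_est[-1]), 3))
```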

P.S. Brayden also shares the above adorable photo of his cat, Fatty.

Fun example of an observational study: Effect of crowd noise on home-field advantage in sports

Kevin Quealy and Ben Shpigel offer “Four Reasons the N.F.L. Shattered Its Scoring Record in 2020”:

No. 1: No fans meant (essentially) no home-field advantage

With fans either barred or permitted at diminished numbers because of public-health concerns, the normal in-game din dropped to a murmur or — at some stadiums — to a near silence. That functionally eliminated any edge that a packed stadium full of screaming fans might provide to a home team. That gap has been steadily closing over the years, but visiting teams never scored more, on average, than they did in 2020.

No. 2: Referees called fewer offensive penalties

A significant force driving the scoring eruption didn’t even involve players. On-field officials, adjusting the standard by which they enforced penalties, called the fewest offensive holding penalties since at least 1998. . . . On the other side of the ball, penalties for defensive pass interference increased for the third consecutive season, to its highest levels since at least 1998, which also extended possessions.

No. 3: Coaches Were Smarter on Fourth Down

At the same time, if teams weren’t successful on third down, more of them recognized the value of going for it on fourth down. Teams went for it 658 times, up from 595 last season, especially on 4th-and-1 . . . Instead of being aggressive solely in the second half, when score and clock decay might dictate it, teams went for it before halftime more than 200 times, significantly more than they did in previous years.

No. 4: The N.F.L.’s quarterback evolution accelerated existing trends

At a position long defined by pocket proficiency, the best of this next generation marries cherished passing attributes — accuracy, arm strength and downfield vision — with mobility, elusiveness and an aptitude for extending plays. . . . Unlike defensive players, who couldn’t simulate tackling drills as they trained away from their teams’ shuttered facilities, many quarterbacks improvised by gathering running backs and receivers— and even some linemen — in parks or at school practice fields to master the scheme and build chemistry.

This is great stuff. A few years ago I complained about a bad sports analysis from Quealy, so I’m really happy to see this new piece, which is thoughtful without being gimmicky.

A research question

There are lots of things above that could be studied further. Here I want to focus on Quealy and Shpigel’s first point, “No fans meant (essentially) no home-field advantage.” They continue:

“You don’t have to worry about the noise levels,” Steelers linebacker Avery Williamson said in an interview in October, when he played for the Jets. “You don’t realize how quiet it actually is on the field when you get out there. You could hear coaches talking across the field. It’s super weird.”

The subdued atmosphere created a more forgiving atmosphere for road teams, reducing the need for quarterbacks to use silent counts and allowing masters of the hard count, like Rodgers, to use his voice to draw opponents offside. Offensive linemen, in turn, could hear the calls more quickly and clearly. False starts dropped to a record low in 2020.

So here’s my question.

How can we study the effects of crowd noise on home-field advantage? The above point no. 1 is a before-after study, which could be thought of as an observational study with two data points, comparing NFL in 2020 to NFL in previous years.

There are various ways to expand the analysis:

1. Within games, focus on particular game situations where crowd noise would be expected to be more of a big deal.

2. Within the NFL, look at the interaction with crowd size or crowd noise: if the above story is true, you’d expect to see larger home-team / visiting-team spreads in noisier stadiums (a regression along these lines is sketched below).

3. Look at other sports.

There are probably some other ideas I haven’t thought of.
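
For idea no. 2, the regression might look something like the sketch below. The file and column names (home margin, attendance, team identifiers) are hypothetical placeholders rather than a real dataset, team fixed effects stand in for adjusting for which teams are playing, and none of this deals with the identification problems raised in the P.S. below.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical game-level data: one row per game, with the home team's
# point margin, the attendance (or a noise measurement), and team IDs.
games = pd.read_csv("nfl_games.csv")  # placeholder file name

# If crowd noise drives home-field advantage, the home margin should
# increase with attendance after adjusting for which teams are playing.
model = smf.ols(
    "home_margin ~ attendance + C(home_team) + C(away_team)",
    data=games,
).fit()
print(model.summary())
```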

P.S. Jonathan Falk writes:

As someone who’s done a couple of studies of home field advantage I find this interesting, but I’m dubious. First, you’re right that this is kind of a two datapoint causal analysis. That’s hard to do convincingly! But second, it is unsurprising that better teams have higher crowds and therefore louder ambient noise. Causality is at least partly two-way, which makes identification critical and… well, identification is mostly rhetoric, right? That said, I’m sure there’s *something* there in football, because if silent counts were equally effective, why wouldn’t teams use them both at home and on the road? It’s one of the very few examples where road teams actually play the game differently in any sport. I am much less convinced that crowd noise has any effect in, say, baseball, because the noise level is high at critical moments irrespective of whether the home team is batting or pitching.
Teams want to sell tickets, and the notion that fans can actually contribute to the probability that their team wins just by showing up and being enthusiastic is a romantic one which lines up nicely with the profit interests of the teams. But there are lots of causes for home field advantage that would have to be eliminated to confidently assert anything about crowd noise: travel jet lag/time zone/circadian rhythm effects, hotel living versus home living, unique conditions to which home teams can both accustom themselves and differentially field a team capable of exploiting (unique stadium sizes in baseball, wind tunnel effects in football, dead spots in the floor at Boston Garden) and, in a few sports, actual rules advantage for the home team (last ups in baseball, last change in hockey).

He was fooled by randomness—until he replicated his study and put it in a multilevel framework. Then he saw what was (not) going on.

An anonymous correspondent who happens to be an economist writes:

I contribute to an Atlanta Braves blog and I wanted to do something for Opening Day. Here’s a very surprising regression I just ran. I took the 50 Atlanta Braves full seasons (excluding strike years and last year) and ran the OLS regression: Wins = A + B Opening_Day_Win.

I was expecting to get B fairly close to 1, ie, “it’s only one game”. Instead I got 79.8 + 7.9 Opening_Day_Win. The first day is 8 times as important as a random day! The 95% CI is 0.5-15.2 so while you can’t quite reject B=1 at conventional significance levels, it’s really close. F-test p-value of .066

I have an explanation for this (other than chance) which is that opening day is unique in that you’re just about guaranteed to have a meeting of your best pitcher against the other team’s, which might well give more information than a random game, but I find this really surprising. Thoughts?

Note: If I really wanted to pursue this, I would add other teams, try random games rather than opening day, and maybe look at days two and three.

Before I had a chance to post anything, my correspondent sent an update, subject-line “Never mind”:

I tried every other day: 7.9 is kinda high, but there are plenty of other days that are higher and a bunch of days are negative. It’s just flat-out random…. (There’s a lesson there somewhere about robustness.) Here’s the graph of the day-to-day coefficients:

The lesson here is, as always, to take the problem you’re studying and embed it in a larger hierarchical structure. You don’t always have to go to the trouble of fitting a multilevel model; it can be enough sometimes to just place your finding as part of the larger picture. This might not get you tenure at Duke, a Ted talk, or a publication in Psychological Science circa 2015, but those are not the only goals in life. Sometimes we just want to understand things.
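
Here is a small simulation sketch of that larger picture, with invented numbers (50 seasons of 162 games, team quality varying from season to season). Regressing season wins on the outcome of any single game gives coefficients centered well above 1, because games within a season share that season’s quality, and the coefficients bounce around a lot from day to day because there are only 50 seasons.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented setup: 50 seasons, 162 games each, with the team's true quality
# (win probability) varying from season to season.
n_seasons, n_games = 50, 162
p_season = rng.normal(0.5, 0.06, n_seasons).clip(0.3, 0.7)
games = rng.binomial(1, p_season[:, None], size=(n_seasons, n_games))
wins = games.sum(axis=1)

# For each "day" d, regress season wins on the outcome of game d alone.
coefs = np.empty(n_games)
for d in range(n_games):
    x = games[:, d]
    coefs[d] = np.cov(x, wins)[0, 1] / np.var(x, ddof=1)

print("mean day-by-day coefficient:", round(coefs.mean(), 1))
print("range across days:", round(coefs.min(), 1), "to", round(coefs.max(), 1))
print("share of days with coefficient above 7:", round((coefs > 7).mean(), 2))
```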

Tokyo Track revisited: no, I don’t think the track surface is “1-2% faster”

This post is by Phil Price, not Andrew.

A few weeks ago I posted about the claim — by the company that made the running track for the Tokyo Olympics — that the bounciness of the track makes it “1-2% faster” than other professional tracks. The claim isn’t absurd: certainly the track surface can make a difference. If today’s athletes had to run on the cinder tracks of yesteryear their speed would surely be slower.

At the time I wrote that post the 400m finals had not yet taken place, but of course they’re done by now so I went ahead and took another quick look at the whole issue…and the bottom line is that I don’t think the track in Tokyo let the runners run noticeably faster than the tracks used in recent Olympics and World Championships. Here’s the story in four plots. All show average speed rather than time: the 200m takes about twice as long as the 100m, so they have comparable average speed. Men are faster, so in each panel (except the bottom right) the curve(s) for men are closer to the top, women are closer to the bottom. Andrew, thanks for pointing out that this is better than having separate rows of plots for women and men, which would add a lot of visual confusion to this display.

The top left plot shows the average speed for the 1st-, 2nd-, 3rd-, and 4th-place finishers in the 100, 200, and 400m, for men and women.  Each of the subsequent plots represents a further aggregation of these data. The upper right just adds the times together and the distances together, so, for instance, the top line is (100 + 200 + 400 meters) / (finishing time of the fastest man in the 100m + finishing time of the fastest man in the 200m + finishing time of the fastest man in the 400 m).  The bottom left aggregates even farther: the total distance run by all of the male finishers divided by the total time of all the male finishers, in all of the races; and the same for the women.

And finally, taking it to an almost ludicrous level of aggregation, the bottom right shows each year’s mean speed — the total distance run by all of the competitors in all of the races that year, divided by the total of all of their times — divided by the average of these yearly mean speeds across all the years. A point at a y-value of 1.01 on this plot would mean that the athletes that year averaged 1% faster than in an average year.
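
For what it’s worth, the aggregation in the last two panels is simple to compute. Here is a hedged sketch; the file and column names are placeholders for whatever results table you have:

```python
import pandas as pd

# Hypothetical results table (placeholder file and column names): one row
# per finisher, with the year, event distance in meters, and time in seconds.
results = pd.read_csv("sprint_finals.csv")

# Aggregate speed per year: total distance divided by total time, pooling
# events and finishers.
totals = results.groupby("year")[["distance_m", "time_s"]].sum()
yearly_speed = totals["distance_m"] / totals["time_s"]

# Normalize by the average across years: a value of 1.01 means that year's
# athletes averaged 1% faster than in a typical year.
print(yearly_speed / yearly_speed.mean())
```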

If someone wants to claim the track allows performances that are 1-2% faster than on previous tracks, they’re going to have to explain why the competitors in the sprints this year were only about 0.4% faster than the average over the past several Olympics and World Championships.

Even that 0.4% looks a bit iffy, considering the men weren’t faster at all. You can make up a ‘just so’ story about the track being better tuned towards women’s lighter bodies and lower forces exerted on the track, but I won’t believe it. 

There’s year-to-year and event-to-event variation in results, depending on exactly what athletes are competing, where they are in their careers, what performance-enhancing drugs they are taking (if any), and other factors too (wind, temperature on race day, etc.).  It’s not inconceivable that the sprint speeds would have been 1-2% slower this year if not for the magical track, which just happened to bring them back up to around the usual values. But that’s sure not the way to bet.

This post is by Phil.