Regarding the use of “common sense” when evaluating research claims

Posted on January 20, 2024 9:38 AM by Andrew

I’ve often appealed to “common sense” or “face validity” when considering unusual research claims. For example, the statement that single women during certain times of the month were 20 percentage points more likely to support Barack Obama, or the claim that losing an election for governor increases politicians’ lifespan by 5-10 years on average, or the claim that a subliminal smiley face flashed on a computer screen causes large changes in people’s attitudes on immigration, or the claim that attractive parents are 36% more likely to have girl babies . . . these claims violated common sense. Or, to put it another way, they violated my general understanding of voting, health, political attitudes, and human reproduction.

I often appeal to common sense, but that doesn’t mean that I think common sense is always correct or that we should defer to common sense. Rather, common sense represents some approximation of a prior distribution or existing model of the world. When our inferences contradict our expectations, that is noteworthy (in a chapter 6 of BDA sort of way), and we want to address this. It could be that addressing this will result in a revision of “common sense.” That’s fine, but if we do decide that our common sense was mistaken, I think we should make that statement explicitly. What bothers me is when people report findings that contradict common sense and don’t address the revision in understanding that would be required to accept that.

In each of the above-cited examples (all discussed at various times on this blog), there was a much more convincing alternative explanation for the claimed results, given some mixture of statistical errors and selection bias (p-hacking or forking paths). That’s not to say the claims are wrong (Who knows?? All things are possible!), but it does tell us that we don’t need to abandon our prior understanding of these things. If we want to abandon our earlier common-sense views, that would be a choice to be made, an affirmative statement that those earlier views are held so weakly that they can be toppled by little if any statistical evidence.

P.S. Perhaps relevant is this recent article by Mark Whiting and Duncan Watts, “A framework for quantifying individual and collective common sense.”

Progress in 2023, Charles edition

Posted on January 19, 2024 3:00 PM by Charles Margossian

Following the examples of Andrew, Aki, and Jessica, and at Andrew’s request:

Published:

Variational Inference with Gaussian Score Matching. Neural Information Processing Systems.
(Chirag Modi, CM, Yuling Yao, Robert Gower, David Blei and Lawrence Saul)
The Shrinkage-Delinkage Trade-off: An Analysis of Factorized Gaussian Approximations for Variational Inference. Uncertainty in Artificial Intelligence (Oral).
(CM and Lawrence Saul)
Adaptive Tuning for Metropolis Adjusted Langevin Trajectories. Artificial Intelligence and Statistics.
(Lionel Riou-Durand, Pavel Sountsov, Jure Vogrinc, CM and Sam Power)

Unpublished:

For how many iterations should we run Markov chain Monte Carlo?
(CM and Andrew Gelman)
Amortized Variational Inference: When and Why?
(CM and Dave Blei)
Nested Rhat: Assessing the convergence of Markov chain Monte Carlo when running many short chains. (Revised).
(CM, Matthew Hoffman, Pavel Sountsov, Lionel Riou-Durand, Aki Vehtari and Andrew Gelman)
General Adjoint-Differentiated Laplace approximation. Technical Report.
(CM)

This year, I also served on the Stan Governing Body, where my primary role was to help bring back the in-person StanCon. StanCon 2023 took place at the University of Washington in St. Louis, MO and we got the ball rolling for the 2024 edition which will be held at Oxford University in the UK.

It was also my privilege to be invited as an instructor at the Summer School on Advanced Bayesian Methods at KU Leuven, Belgium and teach a 3-day course on Stan and Torsten, as well as teach workshops at StanCon 2023 and at the University of Buffalo.

Progress in 2023, Aki Edition

Posted on January 15, 2024 9:05 AM by Aki Vehtari

Following Andrew, here is my (Aki’s) list of published papers and preprints in 2023 (20% together with Andrew)

Published

Manushi Welandawe, Michael Riis Andersen, Aki Vehtari, and Jonathan H. Huggins (2023). Robust, Automated, and Accurate Black-box Variational Inference. Journal of Machine Learning Research, accepted for publication.
arXiv preprint arXiv:2203.15945.
Alex Cooper, Dan Simpson, Lauren Kennedy, Catherine Forbes, and Aki Vehtari (2023). Cross-validatory model selection for Bayesian autoregressions with exogenous regressors. Bayesian Analysis, accepted for publication.
arXiv preprint arXiv:2301.08276.
Noa Kallioinen, Topi Paananen, Paul-Christian Bürkner, and Aki Vehtari (2023). Detecting and diagnosing prior and likelihood sensitivity with power-scaling. Statistics and Computing, 34(57).
Online
arXiv preprint arXiv:2107.14054.
Supplementary code.
priorsense: R package
Martin Modrák, Angie H. Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman, and Aki Vehtari (2023). Simulation-based calibration checking for Bayesian computation: The choice of test quantities shapes sensitivity. Bayesian Analysis, doi:10.1214/23-BA1404.
arXiv preprint arXiv:2211.02383.
Code
SBC R package
Erik Štrumbelj, Alexandre Bouchard-Côté, Jukka Corander, Andrew Gelman, Håvard Rue, Lawrence Murray, Henri Pesonen, Martyn Plummer, and Aki Vehtari (2023). Past, Present, and Future of Software for Bayesian Inference. Statistical Science, accepted for publication. preprint
Marta Kołczyńska, Paul-Christian Bürkner, Lauren Kennedy, and Aki Vehtari (2023). Trust in state institutions in Europe, 1989–2019. Survey Research Methods, accetped for publication.
SocArXiv preprint doi:10.31235/osf.io/3v5g7.
Juho Timonen, Nikolas Siccha, Ben Bales, Harri Lähdesmäki, and Aki Vehtari (2023). An importance sampling approach for reliable and efficient inference in Bayesian ordinary differential equation models. Stat, doi:10.1002/sta4.614.
arXiv preprint arXiv:2205.09059.
Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, Arto Klami (2023). Prior knowledge elicitation: The past, present, and future. Bayesian Analysis, doi:10.1214/23-BA1381.
arXiv preprint arXiv:2112.01380.
Peter Mikula, Oldřich Tomášek, Dušan Romportl, Timothy K. Aikins, Jorge E. Avendaño, Bukola D. A. Braimoh-Azaki, Adams Chaskda, Will Cresswell, Susan J. Cunningham, Svein Dale, Gabriela R. Favoretto, Kelvin S. Floyd, Hayley Glover, Tomáš Grim, Dominic A. W. Henry, Tomas Holmern, Martin Hromada, Soladoye B. Iwajomo, Amanda Lilleyman, Flora J. Magige, Rowan O. Martin, Marina F. de A. Maximiano, Eric D. Nana, Emmanuel Ncube, Henry Ndaimani, Emma Nelson, Johann H. van Niekerk, Carina Pienaar, Augusto J. Piratelli, Penny Pistorius, Anna Radkovic, Chevonne Reynolds, Eivin Røskaft, Griffin K. Shanungu, Paulo R. Siqueira, Tawanda Tarakini, Nattaly Tejeiro-Mahecha, Michelle L. Thompson, Wanyoike Wamiti, Mark Wilson, Donovan R. C. Tye, Nicholas D. Tye, Aki Vehtari, Piotr Tryjanowski, Michael A. Weston, Daniel T. Blumstein, and Tomáš Albrecht (2023). Bird tolerance to humans in open tropical ecosystems. Nature Communications, 14:2146. doi:10.1038/s41467-023-37936-5.
Gabriel Riutort-Mayol, Paul-Christian Bürkner, Michael R. Andersen, Arno Solin, and Aki Vehtari (2023). Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. Statistics and Computing, 33(17):1-28. doi:10.1007/s11222-022-10167-2.
arXiv preprint arXiv:2004.11408.

Pre-prints

Lauren Kennedy, Aki Vehtari, and Andrew Gelman (2023). Scoring multilevel regression and poststratification based population and subpopulation estimates. arXiv preprint arXiv:2312.06334.
Alex Cooper, Aki Vehtari, Catherine Forbes, Lauren Kennedy, and Dan Simpson (2023). Bayesian cross-validation by parallel Markov chain Monte Carlo. arXiv preprint arXiv:2310.07002.
Yann McLatchie and Aki Vehtari (2023). Efficient estimation and correction of selection-induced bias with order statistics. arXiv preprint arXiv:2309.03742.
Yann McLatchie, Sölvi Rögnvaldsson, Frank Weber, and Aki Vehtari (2023). Robust and efficient projection predictive inference. arXiv preprint arXiv:2306.15581.
Frank Weber, Änne Glass, and Aki Vehtari (2023). Projection predictive variable selection for discrete response families with finite support. arXiv preprint arXiv:2301.01660.

jd asked Andrew “which paper from 2023 do you like best?”, and I also find it difficult to choose one. I highlight two papers, but I’m proud of all of them!

“Detecting and diagnosing prior and likelihood sensitivity with power-scaling” is based on an idea that had been on my todo list for a very long time, and seeing that it works so well and can have practical software implementation was really nice.

In “Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming” we didn’t come up with a new GP approximation, but we were able to develop simple diagnostics to tell whether we have enough basis functions. I just love when diagnostics can answer frequently asked questions like “How do I choose the number of basis functions?”

Bayesian BATS to advance Bayesian Thinking in STEM

Posted on January 12, 2024 7:33 AM by Aki Vehtari

Mine Dogucu writes:

We are recruiting our new cohort of STEM instructors who are interested in incorporating Bayesian thinking and methods in their teaching in US universities and colleges.

Please help us spread the word.

stat.uci.edu/bayes-bats/

Our goal is to advance Bayesian Thinking in STEM, hence the name BATS.

BATS is a three-tiered program

The first tier of the program consists of a week-long instructor training bootcamp (on the west coast at University of California Irvine in Summer 2023, and on the east coast at Vassar College in Summer 2024), to build a diverse community of Bayesian educators across different STEM fields.
In the second tier of the project, selected instructors will develop Bayesian teaching and learning materials specifically using scientific data from their fields with the support of the PIs, during the fall semester after their summer boot camp training participation.
In the third tier of the project, selected instructors will disseminate the teaching and learning materials through conferences and publications with the support of the PIs.

The BATS Project Objectives are as follows:

Increase the number of undergraduate students who are exposed to Bayesian methods;
Enhance the capacity of STEM instructors in Bayesian methods through training and community building;
Develop and enrich teaching and learning materials that showcase use of Bayesian methods in STEM fields

Prediction isn’t everything, but everything is prediction

Image

This is Leo.

Explanation or explanatory modeling can be considered to be the use of statistical models for testing causal hypotheses or associations, e.g. between a set of covariates and a response variable. Prediction or predictive modeling, (supposedly) on the other hand, is the act of using a model—or device, algorithm—to produce values of new, existing, or future observations. A lot has been written about the similarities and differences between explanation and prediction, for example Breiman (2001), Shmueli (2010), Billheimer (2019), and many more.

These are often thought to be separate dimensions of statistics, but Jonah and I have been discussing for a long time that in some sense there may actually be no such thing as explanation without prediction. Basically, although prediction itself is not the only goal of inferential statistics, everything—every objective—in inferential statistics can be reframed through the lense of prediction.

Hypothesis testing, ability estimation, hierarchical modeling, treatment effect estimation, causal inference problems, etc., can all be described in our opinion from a (inferential) predictive perspective. So far we have not found an example for which there is no way to reframe it as prediction problem. So I ask you: is there any inferential statistical ambition that cannot be described in predictive terms?

P.S. Like Billheimer (2019) and others, we think that inferential statistics should be considered as inherently predictive and be focused primarily on probabilistic predictions of observable events and quantities, rather than focusing on statistical estimates of unobersvable parameters that do not exist outside of our highly contrived models. Similarly, we also feel that the goal of Bayesian modeling should not be taught to students as finding the posterior distribution of unobservables, but rather as finding the posterior predictive distribution of the observables (with finding the posterior as an intermediate step); even when we don’t only care about predictive accuracy and we still care about understanding how a model works (model checking, GoF measures), we think the predictive modeling interpretation is generally more intuitive and effective.

Bayesians are frequentists.

Posted on January 8, 2024 9:50 AM by Andrew

We wrote about this a few years ago, but it’s a point of confusion that keeps coming up, so I’m posting it again:

Bayesians are frequentists. The Bayesian prior distribution corresponds to the frequentist sample space: it’s the set of problems for which a particular statistical model or procedure will be applied.

Bayesian and frequentist inference are both about averaging over possible problems to which a method might be applied. Just as there are many different Bayesian approaches (corresponding to different sorts of models, that is, different sorts of assumptions about the set of problems over which to average), there are many different frequentist approaches (again, corresponding to different sorts of assumptions about the set of problems over which to average).

I see a direct mapping between the frequentist reference set and the Bayesian prior distribution. Another way to put it is that a frequentist probability, defined as relative frequency in repeated trials, requires some definition of what is a repeated trial. We discuss this, from various angles, in chapter 1 of BDA.

I agree that different groups of statisticians, who are labeled “Bayesians” and “frequentists,” can approach problems in different ways. I just don’t think the differences are so fundamental, because I think that any Bayesian interpretation of probability has to have some frequentist underpinning to be useful. And, conversely, any frequentist sample space corresponds to some set of problems to be averaged over.

For example, consider the statement, “Brazil had a 0.25 probability of winning the 2018 World Cup.” To the extent that this statement is the result of a Bayesian analysis (and not simply a wild guess), it would be embedded in web of data and assumptions regarding other teams, other World Cups, other soccer games, etc. And, to me, this network of reference sets is closely related to the choice in frequentist statistics of what outcomes to average over.

Self-declared frequentists do have a way of handling one-off events: they call them “predictions.” In the classical frequentist world, you’re not allowed to make probabilistic inferences about parameters, but you are allowed to make probabilistic inferences about predictions. Indeed, in that classical framework, the difference between parameters and predictions or missing data is precisely that parameters are unmodeled, whereas predictions and missing data are modeled.

The comments are worth reading too. In particular, Ben Goodrich makes good points in disagreeing with me, and the late Keith O’Rourke and others add some useful perspective too.

This all came to mind because Erik van Zwet pointed me to some online discussion of our recent post. The commenter wrote:

In a recent blog, Andrew Gelman writes “Bayesians moving from defense to offense: I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”

Here is what is perplexing me.

It looks to me that ‘those thousands of medical trials’ are akin to long run experiments. So isn’t this a characteristic of Frequentism? So if bayesians want to use information from long run experiments, isn’t this a win for Frequentists?

What is going offensive really mean here?

Some of the participants in that thread did seem to get the point, but nobody knew about our saying, “Bayesians are frequentists,” so I added it to the lexicon.

So, just to say it again, yes, good Bayesian inference and good frequentist inference are both about having a modeled population (also called a prior distribution, or frequency distribution, or reference set) that is a good match to the applied question being asked.

A prior that corresponds to a relevant frequency distribution is a win for frequentists and for Bayesians too! Remember Hal Stern’s principle that the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use.

And, regarding what I meant by “Bayesians moving from defense to offense,” see our followup here.

Progress in 2023

Posted on January 4, 2024 9:28 AM by Andrew

Published:

[2023]. Bayesian spatial modelling of localised SARS-CoV-2 transmission through mobility networks across England. {\em PLoS Computational Biology} 19, e1011580.
(Thomas Ward, Mitzi Morris, Andrew Gelman, Bob Carpenter, William Ferguson, Christopher Overton, and Martyn Fylesn)

[2023] Generically partisan: Polarization in political communication. {\em Proceedings of the National Academy of Sciences} 120, e2309361120.
(Gustavo Novoa, Margaret Echelbarger, Andrew Gelman, and Susan Gelman)
Supplementary appendix.

[2023] Simulation-based calibration checking for Bayesian computation: The choice of test quantities shapes sensitivity. {\em Bayesian Analysis}.
(Martin Modrák, Angie H. Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman, and Aki Vehtari)

[2023] Causal quartets: Different ways to attain the same average treatment effect. {\em American Statistician}.
(Andrew Gelman, Jessica Hullman, and Lauren Kennedy)

[2023] In pursuit of campus-wide data literacy: A guide to developing a statistics course for students in non-quantitative fields. {\em Journal of Statistics and Data Science Education}.
(Alexis Lerner and Andrew Gelman)

[2023] A new look at p-values for randomized clinical trials. {\em NEJM Evidence}.
(Erik van Zwet, Andrew Gelman, Sander Greenland, Guido Imbens, Simon Schwab, and Steven N. Goodman)

[2023] Past, present, and future of software for Bayesian inference. {\em Statistical Science}.
(Erik Štrumbelj, Alexandre Bouchard-Côté, Jukka Corander, Andrew Gelman, Håvard Rue, Lawrence Murray, Henri Pesonen, Martyn Plummer, and Aki Vehtari)

[2023] Challenges in adjusting a survey that overrepresents people interested in politics. {\em Harvard Data Science Review} {\bf 5} (3).
(Andrew Gelman and Gustavo Novoa)

[2023] Using leave-one-out cross-validation (LOO) in a multilevel regression and poststratification (MRP) workflow: A cautionary tale. {\em Statistics in Medicine}.
(Swen Kuh, Lauren Kennedy, Qixuan Chen, and Andrew Gelman)

[2023] What is a standard error? {\em Journal of Econometrics} {\bf 237}, 105516.
(Andrew Gelman)

[2023] Who wants school vouchers in America? A comprehensive study using multilevel regression and poststratification. {\em Social Sciences} {\bf 12} (8), 430.
(Yu-Sung Su and Andrew Gelman)

[2023] A chain as strong as its strongest link? Understanding the causes and consequences of biases arising from selective analysis and reporting of research results. {\em Journal of Research on Educational Effectiveness}.
(Andrew Gelman)

[2023] Before data analysis: Additional recommendations for designing experiments to learn about the world. {\em Journal of Consumer Psychology}.
(Andrew Gelman)

[2023] Toward a taxonomy of trust for probabilistic machine learning. {\em Science Advances} {\bf 9}, eabn3999.
(Tamara Broderick, Andrew Gelman, Rachael Meager, Anna L. Smith, and Tian Zheng)

[2023] Federated learning as variational inference: A scalable expectation propagation approach. {\em International Conference on Learning Representations (ICLR)}.
(Han Guo, Philip Greengard, Hongyi Wang, Andrew Gelman, Yoon Kim, and Eric P. Xing)

[2023] I love this paper but it’s barely been noticed. Part of a collaborative article, “What are your most underappreciated works?” {\em Econ Journal Watch} {\bf 20}, 466.
(Andrew Gelman)

[2023] From visualization to sensification. {\em Amstat News} 547, 18–19.
(Andrew Gelman and S. Gwynn Sturdevant)

[2023] Fast methods for posterior inference of two-group normal-normal models. {\em Bayesian Analysis}.
(Philip Greengard, Jeremy Hoskins, Charles C. Margossian, Jonah Gabry, Andrew Gelman, and Aki Vehtari)

[2023] “Two truths and a lie” as a class-participation activity. {\em American Statistician} {\bf 77}, 97–101.
(Andrew Gelman)

Unpublished:

Regression, poststratification, and small-area estimation with sampling weights.
(Andrew Gelman)

Understanding posterior recalibration for a simple example.
(Andrew Gelman, Julie Gershunskaya, Terrance Savitsky, and Ben Goodrich)

Bayesian workflow for time-varying transmission in stratified compartmental infectious disease transmission models.
(Judith A. Bouman, Anthony Hauser, Simon L. Grimm, Martin Wohlfender, Samir Bhatt, Elizaveta Semenova, Andrew Gelman, Christian L. Althaus, and Julien Riou)

Artificial intelligence and aesthetic judgment.
(Jessica Hullman, Ari Holtzman, and Andrew Gelman)

The ladder of abstraction in statistical graphics.
(Andrew Gelman)

BISG: When inferring race or ethnicity, does it matter that people often live near their relatives?
(Philip Greengard and Andrew Gelman)

Enjoy.

Hey wassup Detroit Pistons? What’s gonna happen for the rest of the season? Let’s get (kinda) Bayesian. With graphs and code (but not a lot of data; sorry):

Posted on December 29, 2023 9:05 AM by Andrew

Paul Campos points us to this discussion of the record of the Detroit professional basketball team:

The Detroit Pistons broke the NBA record for most consecutive losses in a season last night, with their 27th loss in a row. . . . A team’s record is, roughly speaking, a function of two factors:

(1) The team’s quality. By “quality” I mean everything about the team”s performance that isn’t an outcome of random factors, aka luck — the ability of the players, individually and collectively, the quality of the coaching, and the quality of the team’s management, for example.

(2) Random factors, aka luck.

The relative importance of luck and skill?

The above-linked post continues:

How do we disentangle the relative importance of these two factors when evaluating a team’s performance to some point in the season? . . . The best predictor ex ante of team performance is the evaluation of people who gamble on that performance. I realize that occasionally gambling odds include significant inefficiencies, in the form of the betting public making sentimental rather than coldly rational wagers, but this is very much the exception rather than the rule. . . . the even money over/under for Detroit’s eventual winning percentage this season was, before the first game was played, a winning percentage of .340. To this point, a little more than third of the way through the season, Detroit’s winning percentage has been .0666. . . .

To the extent that the team has had unusually bad luck, then one would expect the team’s final record to be better. But how much better? Here we can again turn to the savants of Las Vegas et. al., who currently set the even money odds of the team’s final record on the basis of the assumption that it will have a .170 winning percentage in its remaining games.

Campos shares a purported Bayesian analysis and summarizes, “if we have just two pieces of information — a prior assumption of a .340 team, and the subsequent information of a .066 performance through thirty games — the combination of these two pieces of information yields a posterior prediction of a .170 winning percentage going forward, which remarkably enough is exactly what the current gambling odds predict! . . . it appears that the estimate being made by professional gamblers is that about two-thirds of Detroit’s worse than expected record is a product of an ex ante overestimate of the team’s quality, while the other third is assumed to be accounted for by bad luck.”

I think that last statement is coming from the fact that (1/3)*0.340 + (2/3)*0.067 is approximately 0.170.

I don’t quite follow his Bayesian logic. But never mind about that for now.

As I said, I didn’t quite follow the Bayesian logic shared by Campos. Here’s my problem. He posts this graph:

I think I understand the “No_Prior_Info” curve in the graph: that’s the y ~ binomial(n, p) likelihood for p, given the data n=30, y=2. But I don’t understand where the “Prior” and “Posterior” curves come from. I guess the Prior distribution has a mean of 0.340 and the Posterior distribution has a mean of 0.170, but where are the widths of these curves coming from?

Part of the confusion here is that we’re dealing with inference for p (the team’s “quality,” as summarized by the probability that they’d win against a randomly-chosen opponent on a random day) and also with predictions of outcomes. For the posterior mean, there’s no difference: under the basic model, the posterior expected proportion of future games won is equal to the posterior mean of p. It gets trickier when we talk about uncertainty in p.

How, then, could we take the beginning-of-season and current betting lines–which we will, for the purposes of our discussion here, identify as the prior and posterior means of p, ignoring systematic biases of bettors–and extract implied prior and posterior distributions? There’s surely enough information here to do this, if we use information from all 30 teams and calibrate properly.

Exploratory analysis

I started by going to the internet, finding various sources on betting odds, team records, and score differentials, and entering the data into this file. The latest Vegas odds I could find on season records were from 19 Dec; everything else came from 27 Dec.

Next step was to make some graphs. First, I looked at point differential and team records so far:

nba <- read.table("nba2023.txt", header=TRUE, skip=1)
nba$ppg <- nba$avg_points
nba$ppg_a <- nba$avg_points_opponent
nba$ppg_diff <- nba$ppg - nba$ppg_a
nba$record <- nba$win_fraction
nba$start_odds <- nba$over_under_beginning/82
nba$dec_odds <- nba$over_under_as_of_dec/82
nba$sched <- - (nba$schedule_strength - mean(nba$schedule_strength)) # signed so that positive value implies a more difficult schedule so far in season
nba$future_odds <- (82*nba$dec_odds - 30*nba$record)/52

pdf("nba2023_1.pdf", height=3.5, width=10)
par(mfrow=c(1,2), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$ppg_a, nba$ppg)
plot(rng, rng, xlab="Points per game allowed", ylab="Points per game scored", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_a, nba$ppg, nba$team, col="blue")
#
par(pty="m")
plot(nba$ppg_diff, nba$record, xlab="Point differential", ylab="Won/lost record so far", bty="l", type="n")
text(nba$ppg_diff, nba$record, nba$team, col="blue")
#
mtext("Points per game and won-lost record as of 27 Dec", line=.5, side=3, outer=TRUE)
dev.off()

Here's a question you should always ask yourself: What do you expect to see?

Before performing any statistical analysis it's good practice to anticipate the results. So what do you think these graphs will look like?
- Ppg scored vs. ppg allowed. What do you expect to see? Before making the graph, I could have imagined it going either way: you might expect a negative correlation, with some teams doing the run-and-gun and others the physical game, or you might expect a positive correlation, because some teams are just much better than others. My impression is that team styles don't vary as much as they used to, so I was guessing a positive correlation.
- Won/lost record vs. point differential. What do you expect to see? Before making the graph, I was expecting a high correlation. Indeed, if I could only use one of these two metrics to estimate a team's ability, I'd be inclined to use point differential.

Aaaand, here's what we found:

Hey, my intuition worked on these! It would be interesting to see data from other years to see if I just got lucky with that first one.

Which is a better predictor of won-loss record: ppg scored or ppg allowed?

OK, this is a slight distraction from Campos's question, but now I'm wondering, which is a better predictor of won-loss record: ppg scored or ppg allowed? From basic principles I'm guessing they're about equally good.

Let's do a couple of graphs:

pdf("nba2023_2.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="m")
rng <- range(nba$ppg_a, nba$ppg)
plot(rng, range(nba$record), xlab="Points per game scored", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg, nba$record, nba$team, col="blue")
#
par(pty="m")
plot(rng, range(nba$record), xlab="Points per game allowed", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_a, nba$record, nba$team, col="blue")
#
par(pty="m")
plot(range(nba$ppg_diff), range(nba$record), xlab="Avg score differential", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_diff, nba$record, nba$team, col="blue")
#
mtext("Predicting won-loss record from ppg, ppg allowed, and differential", line=.5, side=3, outer=TRUE)
dev.off()

Which yields:

So, about what we expected. To round it out, let's try some regressions:

library("rstanarm")
print(stan_glm(record ~ ppg, data=nba, refresh=0), digits=3)
print(stan_glm(record ~ ppg_a, data=nba, refresh=0), digits=3)
print(stan_glm(record ~ ppg + ppg_a, data=nba, refresh=0), digits=3)

The results:

            Median MAD_SD
(Intercept) -1.848  0.727
ppg          0.020  0.006

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.162  0.021 
------
            Median MAD_SD
(Intercept)  3.192  0.597
ppg_a       -0.023  0.005

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.146  0.019 
------
            Median MAD_SD
(Intercept)  0.691  0.335
ppg          0.029  0.002
ppg_a       -0.030  0.002

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.061  0.008

So, yeah, points scored and points allowed are about equal as predictors of won-loss record. Given that, it makes sense to recode as ppg differential and total points:

print(stan_glm(record ~ ppg_diff + I(ppg + ppg_a), data=nba, refresh=0), digits=3)

Here's what we get:

               Median MAD_SD
(Intercept)     0.695  0.346
ppg_diff        0.029  0.002
I(ppg + ppg_a) -0.001  0.001

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.062  0.009

Check. Once we include ppg_diff as a predictor, the average total points doesn't do much of anything. Again, it would be good to check with data from other seasons, as 30 games per team does not supply much of a sample.

Now on to the betting lines

Let's now include the Vegas over-unders in our analysis. First, some graphs:

pdf("nba2023_3.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$start_odds, nba$record)
plot(rng, rng, xlab="Betting line at start", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$start_odds, nba$record, nba$team, col="blue")
#
par(pty="s")
rng <- range(nba$record, nba$dec_odds)
plot(rng, rng, xlab="Won/lost record so far", ylab="Betting line in Dec", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$record, nba$dec_odds, nba$team, col="blue")
#
par(pty="s")
rng <- range(nba$start_odds, nba$dec_odds)
plot(rng, rng, xlab="Betting line at start", ylab="Betting line in Dec", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$start_odds, nba$dec_odds, nba$team, col="blue")
#
mtext("Won-lost record and over-under at start and in Dec", line=.5, side=3, outer=TRUE)
dev.off()

Which yields:

Oops--I forgot to make some predictions before looking. In any case, the first graph is kinda surprising. You'd expect to see an approximate pattern of E(y|x) = x, and we do see that--but not at the low end. The teams that were predicted to do the worst this year are doing even worse than expected. It would be interesting to see the corresponding graph for earlier years. My guess is that this year is special, not only in the worst teams doing so bad, but in them underperforming their low expectations.

The second graph is as one might anticipate: Betters are predicting some regression toward the mean. Not much, though! And the third graph doesn't tell us much beyond the first graph.

Upon reflection, I'm finding the second graph difficult to interpret. The trouble is that "Betting line in Dec" is the forecast win percentage for the year, but 30/82 of that is the existing win percentage. (OK, not every team has played exactly 30 games, but close enough.) What I want to do is just look at the forecast for their win percentage for the rest of the season:

pdf("nba2023_4.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$record, nba$dec_odds)
plot(rng, rng, xlab="Won/lost record so far", ylab="Betting line of record for rest of season", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
fit <- coef(stan_glm(future_odds ~ record, data=nba, refresh=0))
print(fit)
abline(fit, lwd=.5, col="blue")
text(nba$record, nba$future_odds, nba$team, col="blue")
#
dev.off()

Here's the graph:

The fitted regression line has a slope of 0.66:

            Median MAD_SD
(Intercept) 0.17   0.03  
record      0.66   0.05  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.05   0.01

Next step is to predict the Vegas prediction for the rest of the season given the initial prediction and the team's record so far:

print(stan_glm(future_odds ~ start_odds + record, data=nba, refresh=0), digits=2)

            Median MAD_SD
(Intercept) -0.02   0.03 
start_odds   0.66   0.10 
record       0.37   0.06 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00

It's funny--everywhere we look, we see this 0.66. And 30 games is 37% of the season!

Now let's add into the regression the points-per-game differential, as this should include additional information beyond what was in the won-loss so far:

print(stan_glm(future_odds ~ start_odds + record + ppg_diff, data=nba, refresh=0), digits=2)

            Median MAD_SD
(Intercept) 0.06   0.06  
start_odds  0.67   0.09  
record      0.20   0.11  
ppg_diff    0.01   0.00  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00

Hard to interpret this one, as ppg_diff is on a different scale from the rest. Let's quickly standardize it to be on the same scale as the won-lost record so far:

nba$ppg_diff_std <- nba$ppg_diff * sd(nba$ppg_record) / sd(nba$ppg_diff)
print(stan_glm(future_odds ~ start_odds + record + ppg_diff_std, data=nba, refresh=0), digits=2)

             Median MAD_SD
(Intercept)  0.06   0.06  
start_odds   0.67   0.09  
record       0.20   0.11  
ppg_diff_std 0.17   0.10  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00

OK, not enough data to cleanly disentangle won-lost record and point differential as predictors here. My intuition would be that, once you have point differential, the won-lost record tells you very little about what will happen in the future, and the above fitted model is consistent with that intuition, but it's also consistent with the two predictors being equally important, indeed it's consistent with point differential being irrelevant conditional on won-lost record.

What we'd want to do here--and I know I'm repeating myself--is to repeat the analysis using data from previous years.

Interpreting the implied Vegas prediction for the rest of the season as an approximate weighted average of the preseason prediction and the current won-lost record

In any case, the weighting seems clear: approx two-thirds from starting odds and one-third from the record so far, which at least on a naive level seems reasonable, given that the season is about one-third over.

Just for laffs, we can also throw in difficulty of schedule, as that could alter our interpretation of the teams' records so far.

nba$sched_std <- nba$sched * sd(nba$record) / sd(nba$sched)
print(stan_glm(future_odds ~ start_odds + record + ppg_diff_std + sched_std, data=nba, refresh=0), digits=2)

             Median MAD_SD
(Intercept)  0.06   0.06  
start_odds   0.68   0.09  
record       0.21   0.11  
ppg_diff_std 0.17   0.10  
sched_std    0.04   0.03

So, strength of schedule does not supply much information. This makes sense, given that 30 games is enough for the teams' schedules to mostly average out.

The residuals

Now that I've fit the regression, I'm curious about the residuals. Let's look:

fit_5 <- stan_glm(future_odds ~ start_odds + record + ppg_diff_std + sched_std, data=nba, refresh=0)
fitted_5 <- fitted(fit_5)
resid_5 <- resid(fit_5)
#
pdf("nba2023_5.pdf", height=5, width=8)
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="m")
plot(fitted_5, resid_5, xlab="Vegas prediction of rest-of-season record", ylab="Residual from fitted model", bty="l", type="n")
abline(0, 0, lwd=.5, col="gray")
text(fitted_5, resid_5, nba$team, col="blue")
#
dev.off()

And here's the graph:

The residual for Detroit is negative (-0.05*52 = -2.6, so the Pistons are expected to win about 3 games less than their regression prediction based on prior odds and outcome of first 30 games). Cleveland and Boston are also expected to do a bit worse than the model would predict. On the other direction, Vegas is predicting that Memphis will win about 4 games more than predicted from the regression model.

I have no idea whassup with Memphis. The quick generic answer is that the regression model is crude, and bettors have other information not included in the regression.

Reverse engineering an implicit Bayesian prior

OK, now for the Bayesian analysis. As noted above, we aren't given a prior for team j's average win probability, p_j; we're just given a prior point estimate of each p_j.

But we can use the empirical prior-to-posterior transformation, along with the known likelihood function, under the simplifying assumption the 30 win-loss outcomes for each team j are independent with constant probability p_j for team j. This assumption that is obviously wrong, given that teams are playing each other, but let's just go with it here, recognizing that with full data it would be straightforward to extend to an item-response model with an ability parameter for each team (as here).

To continue, the above regression models show that the Vegas "posterior Bayesian" prediction of p_j after 30 games is approximately a weighted average of 0.65*(prior prediction) + 0.35*(data won-loss record). From basic Bayesian algebra (see, for example, chapter 2 of BDA), this tells us that the prior has about 65/35 as much information as data from 30 games. So, informationally, the prior is equivalent to the information from (65/35)*30 = 56 games, about two-thirds of a season worth of information.

Hey--what happened??

But, wait! That approximate 2/3 weighting for the prior and 1/3 weighting of the data from 30 games is the opposite of what Campos reported, which was a 1/3 weighting of the prior and 2/3 of the data. Recall: prior estimated win probability of 0.340, data win rate of 0.067, take (1/3)*0.340 + (2/3)*0.067 and you get 0.158, which isn't far from the implied posterior estimate of 0.170.

What happened here is that the Pistons are an unusual case, partly because the Vegas over-under for their season win record is a few percentage points lower than the linear model predicted, and partly because when the probability is low, a small percentage-point change in the probability corresponds to a big change in the implicit weights.

Again, it would be good to check all this with data from other years.

Skill and luck

There's one more loose end, and that's Campos taking the weights assigned to data and prior and characterizing them as "skill" and "luck" in prediction errors. I didn't follow that part of the reasoning at all so I'll just let it go for now. Part of the problem here is in one place Campos seems to be talking about skill and luck as contributors to the team's record, and in another place he seems to considering them as contributors to the difference between preseason predictions and actual outcomes.

One way to think about skill and luck in a way that makes sense to me is within an item-response-style model in which the game outcome is a stochastic function of team abilities and predictable factors. For example, in the model,

score differential = ability of home team - ability of away team + home-field advantage + error,

the team abilities are in the "skill" category and the error is in the "luck" category, and, ummm, I guess home-field advantage counts as "skill" too? OK, it's not so clear that the error in the model should all be called "luck." If a team plays better against a specific opponent by devising a specific offensive/defensive plan, that's skill, but it would pop up in the error term above.

In any case, once we've defined what is skill and what is luck, we can partition the variance of the total to assign percentages to each.

Another way of looking at this is to consider the extreme case of pure luck. If outcomes determined only by luck, then each game is a coin flip, and we'd see this in the data because the team win proportions after 30 games would follow a binomial distribution with n=30 and p=0.5. The actual team win proportions have mean 0.5 (of course) and sd 0.18, as compared to the theoretical mean of 0.5 and sd of 0.5/sqrt(30) = 0.09. That simple calculation suggests that skill is (0.18/0.09)^2 = 4 times as important as luck when determining the outcome of 30 games.

And maybe I'm getting just getting this all tangled myself. The first shot at any statistical analysis often will have some mix of errors in data, modeling, computing, and general understanding, with that last bit corresponding to the challenge of mapping from substantive concepts to mathematical and statistical models. Some mixture of skill and luck, I guess.

Summary

1. Data are king. In the immortal words of Hal Stern, the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use. I could do more than Campos did, not so much because of my knowledge of Bayesian statistics but because I was using data from all 30 teams.

2. To continue with that point, you can do lots better than me by including data from other years.

3. Transparency is good. All my data and code are above. I might well have made some mistakes in my analyses, and, in any case, many loose ends remain.

4. Basketball isn't so important (hot hand aside). The idea of backing out an effective prior by looking at information updating, that's a more general idea worth studying further. This little example is a good entry point into the potential challenge of such studies.

5. Models can be useful, not just for prediction but also for understanding, as we saw for the problem of partitioning outcomes into skill and luck.

P.S. Last week, when the Pistons were 2-25 or something like that, I was taking with someone who's a big sports fan but not into analytics, the kind of person who Bill James talked about when he said that people interpret statistics as words that describe a situation rather than as numbers that can be added, subtracted, multiplied, and divided. The person I was talking with predicted that the Pistons would win no more than 6 games this year. I gave the statistical argument why this was unlikely: (a) historically there's been regression to the mean, with an improving record among the teams that have been doing the worst and an average decline among the teams at the top of the standings, (b) if a team does unexpectedly poorly, you can attribute some of that to luck. Also, 2/30 = 0.067, and 5/82 = 0.061, so if you bet that the Pistons will win no more than 6 games this season, you're actually predicting they might do worse in the rest of the season. All they need to do is get lucky in 5 of the remaining games. He said, yeah, sure, but they don't look like they can do it. Also, now all the other teams are trying extra hard because nobody wants to be the team that loses to the Pistons. OK, maybe. . . .

Following Campos, I'll just go with the current Vegas odds and give a point prediction the Pistons will end the season with about 11 wins.

P.P.S. Also related is a post from a few years back, “The Warriors suck”: A Bayesian exploration.

P.P.P.S. Unrelatedly, except for the Michigan connection, I recommend these two posts from a couple years ago:

What is fame? The perspective from Niles, Michigan. Including an irrelevant anecdote about “the man who invented hedging”

and

Not only did this guy not hold the world record in the 100 meter or 110-yard dash for 35 years, he didn’t even run the 110-yard dash in 10.8 seconds, nor did he see a million patients, nor was he on the Michigan football and track teams, nor did Michigan even have a track team when he attended the university. It seems likely that he did know Jack Dempsey, though.

Enjoy.

Judgments versus decisions

Posted on December 28, 2023 12:22 PM by Jessica Hullman

This is Jessica. A paper called “Decoupling Judgment and Decision Making: A Tale of Two Tails” by Oral, Dragicevic, Telea, and Dimara showed up in my feed the other day. The premise of the paper is that when people interact with some data visualization, their accuracy in making judgments might conflict with their accuracy in making decisions from the visualization. Given that the authors appear to be basing the premise in part on results from a prior paper on decision making from uncertainty visualizations I did with Alex Kale and Matt Kay, I took a look. Here’s the abstract:

Is it true that if citizens understand hurricane probabilities, they will make more rational decisions for evacuation? Finding answers to such questions is not straightforward in the literature because the terms “judgment” and “decision making” are often used interchangeably. This terminology conflation leads to a lack of clarity on whether people make suboptimal decisions because of inaccurate judgments of information conveyed in visualizations or because they use alternative yet currently unknown heuristics. To decouple judgment from decision making, we review relevant concepts from the literature and present two preregistered experiments (N=601) to investigate if the task (judgment vs. decision making), the scenario (sports vs. humanitarian), and the visualization (quantile dotplots, density plots, probability bars) affect accuracy. While experiment 1 was inconclusive, we found evidence for a difference in experiment 2. Contrary to our expectations and previous research, which found decisions less accurate than their direct-equivalent judgments, our results pointed in the opposite direction. Our findings further revealed that decisions were less vulnerable to status-quo bias, suggesting decision makers may disfavor responses associated with inaction. We also found that both scenario and visualization types can influence people’s judgments and decisions. Although effect sizes are not large and results should be interpreted carefully, we conclude that judgments cannot be safely used as proxy tasks for decision making, and discuss implications for visualization research and beyond. Materials and preregistrations are available at https://osf.io/ufzp5/?view only=adc0f78a23804c31bf7fdd9385cb264f.

There’s a lot being said here, but they seem to be getting at a difference between forming accurate beliefs from some information and making a good (e.g., utility optimal) decision. I would agree there are slightly different processes. But they are also claiming to have a way of directly comparing judgment accuracy to decision accuracy. While I appreciate the attempt to clarify terms that are often overloaded, I’m skeptical that we can meaningfully separate and compare judgments from decisions in an experiment.

Some background

Let’s start with what we found in our 2020 paper, since Oral et al base some of their questions and their own study setup on it. In that experiment we’d had people make incentivized decisions from displays that varied only how they visualized the decision-relevant probability distributions. Each one showed a distribution of expected scores in a fantasy sports game for a team with and without the addition of a new player. Participants had to decide whether to pay for the new player or not in light of the cost of adding the player, the expected score improvement, and the amount of additional monetary award they won when they scored above a certain number of points. We also elicited a (controversial) probability of superiority judgment: What do you think is the probability your team will score more points with the new player than without? In designing the experiment we held various aspects of the decision problem constant so that only the ground truth probability of superiority was varying between trials. So we talked about the probability judgment as corresponding to the decision task.

However, after modeling the results we found that depending on whether we analyzed results from the probability response question or the incentivized decision, the ranking of visualizations changed. At the time we didn’t have a good explanation for this disparity between what was helpful for doing the probability judgment versus the decision, other than maybe it was due to the probability judgment not being directly incentivized like the decision response was. But in a follow-up analysis that applied a rational agent analysis framework to this same study, allowing us to separate different sources of performance loss by calibrating the participants’ responses for the probability task, we saw that people were getting most of the decision-relevant information regardless of which question they were responding to; they just struggled to report it for the probability question. So we concluded that the most likely reason for the disparity between judgment and decision results was probably that the probability of superiority judgment was not the most intuitive judgment to be eliciting – if we really wanted to elicit the beliefs directly corresponding to the incentivized decision task, we should have asked them for the difference in the probability of scoring enough points to win the award with and without the new player. But this is still just speculation, since we still wouldn’t be able to say in such a setup how much the results were impacted by only one of the responses being incentivized.

Oral et al. gloss over this nuance, interpreting our results as finding “decisions less accurate than their direct-equivalent judgments,” and then using this as motivation to argue that “the fact that the best visualization for judgment did not necessarily lead to better decisions reveals the need to decouple these two tasks.”

Let’s consider for a moment by what means we could try to eliminate ambiguity in comparing probability judgments to the associated decisions. For instance, if only incentivizing one of the two responses confounds things, we might try incentivizing the probability judgment with its own payoff function, and compare the results to the incentivized decision results. Would this allow us to directly study the difference between judgments and decision-making?

I argue no. For one, we would need to use different scoring rules for the two different types of response, and things might rank differently depending on the rule (not to mention one rule might be easier to optimize under). But on top of this, I would argue that once you provide a scoring rule for the judgment question, it becomes hard to distinguish that response from a decision by any reasonable definition. In other words, you can’t eliminate confounds that could explain a difference between “judgment” and “decision” without turning the judgment into something indistinguishable from a decision.

What is a decision?

The paper by Oral et al. describes abundant confusion in the literature about the difference between judgment and decision-making, proposing that “One barrier to studying decision making effectively is that judgments and decisions are terms not well-defined and separated.“ They criticize various studies on visualizations for claiming to study decisions when they actually study judgments. Ultimately they describe their view as:

In summary, while decision making shares similarities with judgment, it embodies four distinguishing features: (I) it requires a choice among alternatives, implying a loss of the remaining alternatives, (II) it is future-oriented, (III) it is accompanied with overt or covert actions, and (IV) it carries a personal stake and responsibility for outcomes. The more of these features a judgment has, the more “decision-like” it becomes. When a judgment has all four features, it no longer remains a judgment and becomes a decision. This operationalization offers a fuzzy demarcation between judgment and decision making, in the sense that it does not draw a sharp line between the two concepts, but instead specifies the attributes essential to determine the extent to which a cognitive process is a judgment, a decision, or somewhere in-between [58], [59].

This captures components of other definitions of decision I’ve seen in research related to evaluating interfaces, e.g., as a decision as “a choice between alternatives,” typically involving “high stakes.” However, like these other definitions, I don’t think Oral et al.’s definition very clearly differentiates a decision from other forms of judgment.

Take the “personal stake and responsibility for outcomes” part. How do we interpret this given that we are talking about subjects in an experiment, not decisions people are making in some more naturalistic context?

In the context of an experiment, we control the stakes and one’s responsibility for their action via a scoring rule. We could instead ask people to imagine making some life or death decision in our study and call it high stakes, as many researchers do. But they are in an experiment, and they know it. In the real world people have goals, but in an experiment you have to endow them

So we should incentivize the question to ensure participants have some sense of the consequences associated with what they decide. We can ask them to separately report their beliefs, e.g., what they perceive some decision-relevant probability to be as we did in the 2020 study. But if we want to eliminate confounds between the decision and the judgment, we should incentivize the belief question too, ideally with a proper scoring rule so that it’s in their best interest to tell me the truth. Now both our decision task and our judgment task, from the standpoint of the experiment subject, would both seem to have some personal stake. So we can’t distinguish the decision easily based on its personal stakes.

Oral et al. might argue that the judgment question is still not a decision, because there are three other criteria for a decision according to their definition. Considering (I), will asking for a person’s belief require them to make a choice between alternatives? Yes, it will. Because any format we elicit their response in will naturally constrain it. Even if we just provide a text box to type in a number between 0 and 1, we’re going to get values rounded at some decimal place. So it’s hard to use “a choice among alternatives” as a distinguishing criteria in any actual experiment.

What about (II), being future-oriented? Well, if I’m incentivizing the question then it will be just as future-oriented as my decision is, in that my payoff depends on my response and the ground truth, which is unknown to me until after I respond.

When it comes to (III), overt or covert actions, as in (I), in any actual experiment, my action space will be some form of constrained response space. It’s just that now my action is my choice of which beliefs to report. The action space might be larger, but again there is no qualitative difference between choosing what beliefs to report and choosing what action to report in some more constrained decision problem.

To summarize, by trying to put judgments and decisions on equal footing by scoring both, I’ve created something that seems to achieve Oral et al.’s definition of decision. While I do think there is a difference between a belief and a decision, I don’t think it’s so easy to measure these things without leaving open various other explanations for why the responses differ.

In their paper, Oral et al. sidestep incentivizing participants directly, assuming they will be intrinsically motivated. They report on two experiments where they used a task inspired by our 2020 paper (showing visualizations of expected score distributions and asking, Do you want the team with or without the new player, where the participant’s goal is to win a monetary award that requires scoring a certain number of points). Instead of incentivizing the decision by using the scoring rule to incentivize participants, they told them to try to be accurate. And instead of eliciting the corresponding probabilistic beliefs for the decision, they asked them two questions: Which option (team) is better?, and Which of the teams do you choose? They interpret the first answer as the judgment and the second as the decision.

I can sort of see what they are trying to do here, but this seems like essentially the same task to me. Especially if you assume people are intrinsically motivated to be accurate and plan to evaluate responses using the same scoring rule, as they do. Why would we expect a difference between these two responses? To use a different example that came up in a discussion I was having with Jason Hartline, if you imagine a judge who cares only about doing the right thing (convicting the guilty and acquitting the innocent), who must decide whether to acquit or convict a defendant, why would you expect a difference (in accuracy) when you ask them ‘Is he guilty’ versus ‘Will you acquit or convict?’

In their first experiment using this simple wording, Oral et al. find no difference between responses to the two questions. In a second experiment they slightly changed the wording of the questions to emphasize that one was “your judgment” and one was “your decision.” This leads to what they say is suggestive evidence that people’s decisions are more accurate than their judgments. I’m not so sure.

The takeway

It’s natural to conceive of judgments or beliefs as being distinct from decisions. If we subscribe to a Bayesian formulation of learning from data, we expect the rational person to form beliefs about the state of the world and then choose the utility maximizing action given those beliefs. However, it is not so natural to try to directly compare judgments and decisions on equal footing in an experiment.

More generally, when it comes to evaluating human decision-making (what we generally want to do in research related to interfaces) we gain little by preferring colloquial verbal definitions over the formalisms of statistical decision theory, which provide tools designed to evaluate people’s decisions ex-ante. It’s much easier to talk about judgment and decision-making when we have a formal way of representing a decision problem (i.e., state space, action space, data-generating model, scoring rule), and a shared understanding of what the normative process of learning from data to make a decision is (i.e., start with prior beliefs, update them given some signal, choose the action that maximizes your expected score under the data-generating model). In this case, we could get some insight into how judgments and decisions can differ simply by considering the process implied by expected utility theory.

Explaining that line, “Bayesians moving from defense to offense”

Posted on December 24, 2023 9:05 AM by Andrew

Earlier today we posted something on our recent paper with Erik van Zwet et al., “A New Look at P Values for Randomized Clinical Trials.” The post had the provocative title, “Bayesians moving from defense to offense,” and indeed that title provoked some people!

The discussion thread here at the blog was reasonable enough, but someone pointed me to a thread at Hacker News where there was more confusion, so I thought I’d clarify one or two points.

First, yes, as one commenter puts it, “I don’t know when Bayesians have ever been on defense. They’ve always been on offense.” Indeed, we published the first edition of Bayesian Data Analysis back in 1995, and there was nothing defensive about our tone! We were demonstrating Bayesian solutions to a large set of statistics problems, with no apology.

As a Bayesian, I’m kinda moderate—indeed, Yuling and I published an entire, non-ironic paper on holes in Bayesian statistics, and there’s also a post a few years ago called What’s wrong with Bayes, where I wrote, “Bayesian inference can lead us astray, and we’re better statisticians if we realize that,” and “the problem with Bayes is the Bayesians. It’s the whole religion thing, the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian, the people who refuse to check their models because subjectivity, the people who try to talk you into using a ‘reference prior’ because objectivity. Bayesian inference is a tool. It solves some problems but not all, and I’m exhausted by the ideology of the Bayes-evangelists.”

So, yeah, “Bayesians on the offensive” is not new, and I don’t even always like it. Non-Bayesians have been pretty aggressive too over the years, and not always in a reasonable way; see my discussion with Christian Robert from a few years ago and our followup. As we wrote, “The missionary zeal of many Bayesians of old has been matched, in the other direction, by an attitude among some theoreticians that Bayesian methods were absurd—not merely misguided but obviously wrong in principle.”

Overall, I think there’s much more acceptance of Bayesian methods within statistics than in past decades, in part from the many practical successes of Bayesian inference and in part because recent successes of machine learning have given users and developers of methods more understanding and acceptance of regularization (also known as partial pooling or shrinkage, and central to Bayesian methods) and, conversely, have given Bayesians more understanding and acceptance of regularization methods that are not fully Bayesian.

OK, so what was I talking about?

So . . . Given all the above, what did I mean with my crack about Bayesians moving from defense to offense”? I wasn’t talking about Bayesians being positive about Bayesian statistics in general; rather, I was talking about the specific issue of informative priors.

Here’s how we used to talk, circa 1995: “Bayesian inference is a useful practical tool. Sure, you need to assign a prior distribution, but don’t worry about it: the prior can be noninformative, or in a hierarchical model the hyperparameters can be estimated from data. The most important ‘prior information’ to use is structural, not numeric.”

Here’s how we talk now: “Bayesian inference is a useful practical tool, in part because it allows us to incorporate real prior information. There’s prior information all around that we can use in order to make better inferences.”

My “moving from defense to offense” line was all about the changes in how we think about prior information. Instead of being concerned about prior sensitivity, we respect prior sensitivity and, when the prior makes a difference, we want to use good prior information. This is exactly the same as in any statistical procedure: when there’s sensitivity to data (or, in general, to any input to the model), that’s where data quality is particularly relevant.

This does not stop you from using classical p-values

Regarding the specific paper we were discussing in yesterday’s post, let me emphasize that this work is very friendly to traditional/conventional/classical approaches.

As we say right there in the abstract, “we reinterpret the P value in terms of a reference population of studies that are, or could have been, in the Cochrane Database.”

So, in that paper, we’re not saying to get rid of the p-value. It’s a data summary, people are going to compute it, and people are going to report it. That’s fine! It’s also well known that p-values are commonly misinterpreted (as detailed for example by McShane and Gal, 2017).

Given that people will be reporting p-values, and given how often they are misinterpreted, even by professionals, we believe that it’s a useful contribution to research and practice to consider how to interpret them directly, under assumptions that are more realistic than the null hypothesis.

So if you’re coming into all of this as a skeptic of Bayesian methods, my message to you is not, “Chill out, the prior doesn’t really matter anyway,” but, rather, “Consider this other interpretation of a p-value, averaging over the estimated distribution of effect sizes in the Cochrane database instead of conditioning on the null hypothesis.” So you now have two interpretations of the p-value, conditional on different assumptions. You can use them both.

Bayesians moving from defense to offense: “I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?”

Posted on December 23, 2023 9:45 AM by Andrew

Erik van Zwet, Sander Greenland, Guido Imbens, Simon Schwab, Steve Goodman, and I write:

We have examined the primary efficacy results of 23,551 randomized clinical trials from the Cochrane Database of Systematic Reviews.

We estimate that the great majority of trials have much lower statistical power for actual effects than the 80 or 90% for the stated effect sizes. Consequently, “statistically significant” estimates tend to seriously overestimate actual treatment effects, “nonsignificant” results often correspond to important effects, and efforts to replicate often fail to achieve “significance” and may even appear to contradict initial results. To address these issues, we reinterpret the P value in terms of a reference population of studies that are, or could have been, in the Cochrane Database.

This leads to an empirical guide for the interpretation of an observed P value from a “typical” clinical trial in terms of the degree of overestimation of the reported effect, the probability of the effect’s sign being wrong, and the predictive power of the trial.

Such an interpretation provides additional insight about the effect under study and can guard medical researchers against naive interpretations of the P value and overoptimistic effect sizes. Because many research fields suffer from low power, our results are also relevant outside the medical domain.

Also this new paper from Zwet with Lu Tian and Rob Tibshirani:

Evaluating a shrinkage estimator for the treatment effect in clinical trials

The main objective of most clinical trials is to estimate the effect of some treatment compared to a control condition. We define the signal-to-noise ratio (SNR) as the ratio of the true treatment effect to the SE of its estimate. In a previous publication in this journal, we estimated the distribution of the SNR among the clinical trials in the Cochrane Database of Systematic Reviews (CDSR). We found that the SNR is often low, which implies that the power against the true effect is also low in many trials. Here we use the fact that the CDSR is a collection of meta-analyses to quantitatively assess the consequences. Among trials that have reached statistical significance we find considerable overoptimism of the usual unbiased estimator and under-coverage of the associated confidence interval. Previously, we have proposed a novel shrinkage estimator to address this “winner’s curse.” We compare the performance of our shrinkage estimator to the usual unbiased estimator in terms of the root mean squared error, the coverage and the bias of the magnitude. We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.

Let me just repeat that last sentence:

We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.

From a Bayesian standpoint, this is no surprise. Bayes is optimal if you average over the prior distribution and can be reasonable if averaging over something close to the prior. Especially reasonable in comparison to naive unregularized estimates (as here).

Erik summarizes:

We’ve determined how much we gain (on average over the Cochrane Database) by using our shrinkage estimator. It turns out to be about a factor 2 more efficient (in terms of the MSE) than the unbiased estimator. That’s roughly like doubling the sample size! We’re using similar methods as our forthcoming paper about meta-analysis with a single trial.

People sometimes ask me how I’ve changed as a statistician over the years. One answer I’ve given is that I’ve gradually become more Bayesian. I started out as a skeptic, concerned about Bayesian methods at all; then in grad school I started using Bayesian statistics in applications and realizing it could solve some problems for me; when writing BDA and ARM, still having the Bayesian cringe and using flat priors as much as possible, or not talking about priors at all; then with Aleks, Sophia, and others moving toward weakly informative priors; eventually under the influence of Erik and others trying to use direct prior information. At this point I’ve pretty much gone full Lindley.

Just as a comparison to where my colleagues and I are now, check out my response in 2008 to a question from Sanjay Kaul about how to specify a prior distribution for a clinical trial. I wrote:

I suppose the best prior distribution would be based on a multilevel model (whether implicit or explicit) based on other, similar experiments. A noninformative prior could be ok but I prefer something weakly informative to avoid your inferences being unduly affected by extremely unrealistic possibilities in the tail of the distribuiton.

Nothing wrong with this advice, exactly, but I was still leaning in the direction of noninformativeness in a way that I would not anymore. Sander Greenland replied at the time with a recommendation to use direct prior information. (And, just for fun, here’s a discussion from 2014 on a topic where Sander and I disagree.)

Erik concludes:

I really think it’s kind of irresponsible now not to use the information from all those thousands of medical trials that came before. Is that very radical?

That last question reminds me of our paper from 2008, Bayes: Radical, Liberal, or Conservative?

P.S. Also this:

You can click through to see the whole story.

P.P.S. More here on the “from defense to offense” thing.

Adding intermediate outcomes to an item-response (Bradley-Terry) model

Posted on December 8, 2023 9:22 AM by Andrew

Huib Meulenbelt writes:

Assume we have the following hierarchical Bradley-Terry model:
data {
  int<lower=0> K;                                // players
  int<lower=0> N;                                // number of rallies
  int<lower=1, upper=K> player_a;     // player a
  int<lower=1, upper=K> player_b;     // player b
  int<lower=0> y;                                 // number of rallies won
}
parameters {
  real<lower=0> sigma;
  vector[K] skill;                                   // ability for player K
}
model{
  sigma ~ lognormal(0, 0.5); 
  skill ~ normal(0, sigma);
  y ~ binomial_logit(N, skill[player_a] - skill[player_b]);
}
In this blog post you argue “there’s a lot of information in the score (or vote) differential that’s thrown away if you just look at win/loss.”

I agree completely. Of each rally I obtained the length of the rally and I would like this variable to influence the skill level of the player and opponent. The skill level of the two players are closer to each other when the game ends in 11-7 and the match lasts 500 seconds than when the game ends in 11-7 and the match only lasts 100 seconds.

So, we move from
p[player A wins over player B] = logistic(skill_A – skill_B)

to

p[player A wins over player B] = logistic(f(rally_length) * (skill_A – skill_B))

How would you define this function?

My reply: This is known in psychometrics as an item-response model with a discrimination parameter. The multiplier, f(rally_length) in the above notation, is called the discrimination: the idea is that the higher it is, the more predictive the skill-level difference is of the outcome. If discrimination is zero, the skill difference doesn’t predict at all, and a negative discrimination corresponds to a prediction that goes in the unexpected direction (the worse player being more likely to win).

My answer to the immediate question above is: sure, try it out. You could start with some simple form for the function f and see how it works. Ultimately I’m not thrilled with this model because it is not generative. I expect you can do better by modeling the length of the rally as an intermediate outcome. You can do this in Stan too. I’d recommend starting with just a single parameter per player, but you might need to add another parameter for each player if the rally length varies systematically by player after adjusting for ability.

But the biggest thing is . . . above you say that you agree with me to model score differential and not just win/loss, but then in your model you’re only including win/loss as an outcome. You’re throwing away information! Don’t do that. Whatever model you use, I strongly recommend you use score differential, not win/loss, as your outcome.

Effective Number of Parameters in a Statistical Model

Posted on November 30, 2023 9:16 AM by Andrew

This is my talk for a seminar organized by Joe Suzuki at Osaka University on Tues 10 Sep 2024, 8:50-10:20am Japan time / 19:50-21:20 NY time:

Effective Number of Parameters in a Statistical Model

Andrew Gelman, Department of Statistics, Columbia University

Degrees-of-freedom adjustment for estimated parameters is a general idea in small-sample hypothesis testing, uncertainty estimation, and assessment of prediction accuracy. The effective number of parameters gets interesting In the presence of nonlinearity, constraints, boundary conditions, hierarchical models, informative priors, discrete parameters, and other complicating factors. Many open questions remain, including: (a) defining the effective number of parameters, (b) measuring how the effective number of parameters can depend on data and vary across parameter space, and (c) understanding how the effective number of parameters changes as sample size increases. We discuss using examples from demographics, imaging, pharmacology, political science, and other application areas.

It will be a remote talk—I won’t be flying to Japan—so maybe the eventual link will be accessible to outsiders.

It feels kinda weird to be scheduling a talk nearly a year in advance, but since I had to give a title and abstract anyway, I thought I’d share it with you. My talk will be part of a lecture series they are organizing for graduate students at Osaka, “centered around WAIC/WBIC and its mathematical complexities, covering topics such as Stan usage, regularity conditions, WAIC, WBIC, cross-validation, SBIC, and learning coefficients.” I’m not gonna touch the whole BIC thing.

Now that we have loo, I don’t see any direct use for “effective number of parameters” in applied statistics, but the concept still seems important for understanding fitted models, in vaguely the same way that R-squared is useful for understanding, even though it does not answer any direct question of interest. I thought it could be fun to give a talk on all the things that confuse me about effective number of parameters, because I think it’s a concept that we often take for granted without fully thinking through.

Agreeing to give the talk could also motivate me to write a paper on the topic, which I’d like to do, given that it’s been bugging me for about 35 years now.

Book on Stan, R, and Python by Kentaro Matsuura

Posted on November 27, 2023 3:00 PM by Bob Carpenter

A new book on Stan using CmdStanR and CmdStanPy by Kentaro Matsuura has landed. And I mean that literally as you can see from the envelope (thanks, Kentaro!). Even the packaging from Japan is beautiful—it fit the book perfectly. You may also notice my Pentel Pointliner pen (they’re the best, but there’s a lot of competition) and my Mnemosyne pad (they’re the best, full stop), both from Japan.

Kentaro Matsuura. 2022. Bayesian Statistical Modeling with Stan, R, and Python. Springer.

If you click through to Amazon using the above link, the “Read Sample” button takes you to a list where you can read a sample, which includes the table of contents and a brief intro to notation.

Yes, it comes with source code

There’s a very neatly structured GitHub package, Bayesian statistical modeling with Stan R and Python, with all of the data and source code for the book.

The book just arrived, but from thumbing through it, I really like the way it’s organized. It uses practical simulation code and realistic data to illustrate points of workflow and show users how to get unstuck from common problems. This is a lot like the way Andrew teaches this material. Unlike how Andrew teaches, it starts from the basics, like what is a probability distribution. Luckily for the reader, rather than a dry survey trying to cover everything, it hits a few insightful highlights with examples—this is the way to go if you don’t want to just introduce distributions as you go.

The book is also generous with its workflow advice and tips on dealing with problems like non-identifiability or challenges like using discrete parameters. There’s even an advanced section at the end that works up to Gaussian processes and the application of Thompson sampling (not to reinforce Andrew’s impression that I love Thompson sampling—I just don’t have a better method for sequential decision making in “bandit” problems [scare quotes also for Andrew]).

CmdStanR and CmdStanPy interfaces

This is Kentaro’s second book on Stan. The first is in Japanese and it came out before CmdStanR and CmdStanPy. I’d recommend both this book and using CmdStanR or CmdStanPy—they are our go-to recommendations for using Stan these days (along with BridgeStan if you want transforms, log densities, and gradients). After moving to Flatiron Institute, I’ve switched from R to Python and now pretty much exclusively use Python with CmdStanPy, NumPy/SciPy (basic math and stats functions), plotnine (ggplot2 clone), and pandas (R data frame clone).

Random comment on form

In another nod to Andrew, I’ll make an observation about a minor point of form. If you’re going to use code in a book set in LaTeX, use sourcecodepro. It’s a Lucida Console-like font that’s much easier to read than courier. I’d just go with mathpazo for text and math in Palatino, but I can see why people like Times because it’s so narrow. Somehow Matsuura managed to solve the dreaded twiddle problem in his displayed Courier code so the twiddles look natural and not like superscripts—I’d love to know the trick to that. Overall, though, the graphics are abundant, clear, and consistently formatted, though Andrew might not like some of the ggplot2 defaults.

Comments from the peanut gallery

Brian Ward, who’s leading Stan language development these days and also one of the core devs for CmdStanPy and BridgeStan, said that he was a bit unsettled seeing API choices he’s made set down in print. Welcome to the club :-). This is why we’re so obsessive about backward compatibility.

Multilevel models for taxonomic data structures

Posted on November 13, 2023 9:01 AM by Andrew

This post from John Cook about the hierarchy of U.S. census divisions (nation, region, division, state, county, census tract, block group, census block, household, family, person) reminded me of something that I’ve discussed before but have never really worked out, which is the construction of families of hierarchical models for taxonomic data structures.

A few ideas come into play here:

1. Taxonomic structures come up a lot in social science. Cook gives the example of geography. Two other examples are occupation and religion, each of which can be categorized in many layers. Check out the Census Bureau’s industry and occupation classifications or the many categories of religion, religious denomination, etc.

2. Hierarchical modeling. If you have a national survey of, say, 2000 people classified by county, you won’t want to just do a simple “8 schools“-style hierarchical model, because if you do that you’ll pool the small counties pretty much all the way to the national average. You’ll want to include county-level predictors. One way of doing this is to include indicators for state, division, and region: that is, a hierarchical model over the taxonomy. This would be a simple example of the “unfolding flower” structure that expands to include more model structure as more data come in, thus avoiding the problem that the usual forms of weak priors are very strong as the dimensionality of the model increases while the data remain fixed (see Section 3 of my Bayesian model-building by pure thought paper from 1996).

I think there’s a connection here to quantum mechanics, in the way that, when a system is heated, new energy levels appear.

3. Taxonomies tend to have fractal structure. Some nodes have lots of branches and lots of depth, others just stop. For example, in the taxonomy of religious denominations, Christians can be divided into Catholics, Protestant, and others, then Protestants can be divided into mainline and evangelical, each which can be further subdivided, whereas some of the others, such as Mormons, might not be divided at all. Similarly, some index entries in a book will have lots of sub-entries and can go on for multiple columns, within which some sub-entries have many sub-sub-entries, etc., and then other entries are just one line. You get the sort of long-tailed distribution that’s characteristic of self-similar processes. Mandelbrot wrote about this back in 1955! This might be his first publication of any kind about fractals, a topic that he productively chewed on for several decades more.

My colleagues and I have worked with some taxonomic models for voting based on geography:
– Section 5 of our 2002 article on the mathematics and statistics of voting power,
– Our recent unpublished paper, How democracies polarize: A multilevel perspective.
These are sort of like spatial models or network models, but structured specifically based on a hierarchical taxonomy. Both papers have some fun math.

I think there are general principles here, or at least something more to be said on the topic.

Hey! Here are some amazing articles by George Box from around 1990. Also there’s some mysterious controversy regarding his center at the University of Wisconsin.

Posted on November 10, 2023 9:30 AM by Andrew

The webpage is maintained by John Hunter, son of Box’s collaborator William Hunter, and I came across it because I was searching for background on the paper-helicopter example that we use in our classes to teach principles of experimental design and data analysis.

There’s a lot to say about the helicopter example and I’ll save that for another post.

Here I just want to talk about how much I enjoyed reading these thirty-year-old Box articles.

A Box Set from 1990

Many of the themes in those articles continue to resonate today. For example:

• The process of learning. Here’s Box from his 1995 article, “Total Quality: Its Origins and its Future”:

Scientific method accelerated that process in at least three ways:

1. By experience in the deduction of the logical consequences of the group of facts each of which was individually known but had not previously been brought together.

2. By the passive observation of systems already in operation and the analysis of data coming from such systems.

3. By experimentation – the deliberate staging of artificial experiences which often might ordinarily never occur.

A misconception is that discovery is a “one shot” affair. This idea dies hard. . . .

• Variation over time. Here’s Box from his 1989 article, “Must We Randomize Our Experiment?”:

We all live in a non-stationary world; a world in which external factors never stay still. Indeed the idea of stationarity – of a stable world in which, without our intervention, things stay put over time – is a purely conceptual one. The concept of stationarity is useful only as a background against which the real non-stationary world can be judge. For example, the manufacture of parts is an operation involving machines and people. But the parts of a machine are not fixed entities. They are wearing out, changing their dimensions, and losing their adjustment. The behavior of the people who run the machines is not fixed either. A single operator forgets things over time and alters what he does. When a number of operators are involved, the opportunities for change because of failures to communicate are further multiplied. Thus, if left to itself any process will drift away from its initial state. . . .

Stationarity, and hence the uniformity of everything depending on it, is an unnatural state that requires a great deal of effort to achieve. That is why good quality control takes so much effort and is so important. All of this is true, not only for manufacturing processes, but for any operation that we would like to be done consistently, such as the taking of blood pressures in a hospital or the performing of chemical analyses in a laboratory. Having found the best way to do it, we would like it to be done that way consistently, but experience shows that very careful planning, checking, recalibration and sometimes appropriate intervention, is needed to ensure that this happens.

Here an example, from Box’s 1992 article, “How to Get Lucky”:

For illustration Figure 1(a) shows a set of data designed so seek out the source of unacceptably large variability which, it was suspected, might be due to small differences in five, supposedly identical, heads on a machine. To test this idea, the engineer arranged that material from each of the five heads was sampled at roughly equal intervals of time in each of six successive eight-hour periods. . . . the same analysis strongly suggested that real differences in means occurred between the six eight-hour periods of time during which the experiment was conducted. . . .

• Workflow. Here’s Box from his 1999 article, “Statistics as a Catalyst to Learning by Scientific Method Part II-Discussion”:

Most of the principles of design originally developed for agricultural experimentation would be of great value in industry, but the most industry experimentation differed from agricultural experimentation in two major respects. These I will call immediacy and sequentially.

What I mean by immediacy is that for most of our investigations the results were available, if not within hours, then certainly within days and in rare cases, even within minutes. This was true whether the investigation was conducted in a laboratory, a pilot plant or on the full scale. Furthermore, because the experimental runs were usually made in sequence, the information obtained from each run, or small group of runs, was known and could be acted upon quickly and used to plan the next set of runs. I concluded that the chief quarrel that our experimenters had with using “statistics” was that they thought it would mean giving up the enormous advantages offered by immediacy and sequentially. Quite rightly, they were not prepared to make these sacrifices. The need was to find ways of using statistics to catalyze a process of investigation that was not static, but dynamic.

There’s lots more. It’s funny to read these things that Box wrote back then, that I and others have been saying over and over again in various informal contexts, decades later. It’s a problem with our statistical education (including my own textbooks) that these important ideas are buried.

More Box

A bunch of articles by Box, with some overlap but not complete overlap with the above collection, is at the site of the University of Wisconsin, where he worked for many years. Enjoy.

Some kinda feud is going on

John Hunter’s page also has this:

The Center for Quality and Productivity Improvement was created by George Box and Bill Hunter at the University of Wisconsin-Madison in 1985.

In the first few years reports were published by leading international experts including: W. Edwards Deming, Kaoru Ishikawa, Peter Scholtes, Brian Joiner, William Hunter and George Box. William Hunter died in 1986. Subsequently excellent reports continued to be published by George Box and others including: Gipsie Ranney, Soren Bisgaard, Ron Snee and Bill Hill.

These reports were all available on the Center’s web site. After George Box’s death the reports were removed. . . .

It is a sad situation that the Center abandonded the ideas of George Box and Bill Hunter. I take what has been done to the Center as a personal insult to their memory. . . .

When diagonoised with cancer my father dedicated his remaining time to creating this center with George to promote the ideas George and he had worked on throughout their lives: because it was that important to him to do what he could. They did great work and their work provided great benefits for long after Dad’s death with the leadership of Bill Hill and Soren Bisgaard but then it deteriorated. And when George died the last restraint was eliminated and the deterioration was complete.

Wow. I wonder what the story was. I asked someone I know who works at the University of Wisconsin and he had no idea. Box died in 2013 so it’s not so long ago; there must be some people who know what happened here.

For how many iterations should we run Markov chain Monte Carlo?

Posted on November 7, 2023 9:50 PM by Andrew

So, Charles and I wrote this article. It’s for the forthcoming Handbook of Markov Chain Monte Carlo, 2nd edition. We still have an opportunity to revise it, so if you see anything wrong or misleading, or if there are other issues you think we should mention, please let us know right here in comments!

Remember, it’s just one chapter of an entire book—it’s intended to be an updated version of my chapter with Kenny, Inference from Simulations and Monitoring Convergence, from the first edition. For our new chapter, Charles liked the snappier, “How many iterations,” title.

In any case, our chapter does not have to cover MCMC, just inference and monitoring convergence. But I think Charles is right that, once you think about inference and monitoring convergence, you want to step back and consider design, in particular how many chains to run, where do you start them, and how many iterations per chain.

Again, all comments will be appreciated.

Hydrology Corner: How to compare outputs from two models, one Bayesian and one non-Bayesian?

Posted on October 31, 2023 9:56 AM by Andrew

Zac McEachran writes:

I am a Hydrologist and Flood Forecaster at the National Weather Service in the Midwest. I use some Bayesian statistical methods in my research work on hydrological processes in small catchments.

I recently came across a project that I want to use a Bayesian analysis for, but I am not entirely certain what to look for to get going on this. My issue: NWS uses a protocol for calibrating our river models using a mixed conceptual/physically-based model. We want to assess whether a new calibration is better than an old calibration. This seems like a great application for a Bayesian approach. However, a lot of the literature I am finding (and methods I am more familiar with) are associated with assessing goodness-of-fit and validation for models that were fit within a Bayesian framework, and then validated in a Bayesian framework. I am interested in assessing how a non-Bayesian model output compares with another non-Bayesian model output with respect to observations. Someday I would like to learn to use Bayesian methods to calibrate our models but one step at a time!

My response: I think you need somehow to give a Bayesian interpretation to your non-Bayesian model output. This could be as simple as taking 95% prediction intervals and interpreting them as 95% posterior intervals from a normally-distributed posterior. Or if the non-Bayesian fit only gives point estimates, then do some boostrapping or something to get an effective posterior. Then you can use external validation or cross validation to compare the predictive distributions of your different models, as discussed here; also see Aki’s faq on cross validation.

A Hydrologist and Flood Forecaster . . . how cool is that?? Last time we had this level of cool was back in 2009 when we were contacted by someone who was teaching statistics to firefighters.

Blue Rose Research is hiring (yet again) !

Posted on October 27, 2023 10:00 AM by shira

Blue Rose Research has a few roles that we’re actively hiring for as we gear up to elect more Democrats in 2024, and advance progressive causes!

A bit about our work:

For the 2022 US election, we used engineering and statistics to advise major progressive organizations on directing hundreds of millions of dollars to the right ads and states.
We tested thousands of ads and talking points in the 2022 election cycle and partnered with orgs across the space to ensure that the most effective messages were deployed from the state legislative level all the way up to Senate and Gubernatorial races and spanning the issue advocacy space as well.
We were more accurate than public polling in identifying which races were close across the Senate, House, and Gubernatorial maps.
And we’ve built up a technical stack that enables us to continue to build on innovative machine learning, statistical, and engineering solutions.

Now as we are looking ahead to 2024, we are hiring for the following positions:

Data Scientist – Machine Learning (Salary: $130k – $170k)
Research Analyst (Salary: $60k – $90k)
And our overall jobs board will be updated here as additional roles open up

All positions are remote, with optional office time with the team in New York City.

Please don’t hesitate to reach out with any questions ([email protected]).

Postdoc on Bayesian methodological and applied work! To optimize patient care! Using Stan! In North Carolina!

Posted on October 24, 2023 5:32 PM by Andrew

Sam Berchuck writes:

I wanted to bring your attention to a postdoc opportunity in my group at Duke University in the Department of Biostatistics & Bioinformatics. The full job ad is here: https://forms.stat.ufl.edu/statistics-jobs/entry/10978/.

The postdoc will work on Bayesian methodological and applied work, with a focus on modeling complex longitudinal biomedical data (including electronic health records and mobile health data) to create data-driven approaches to optimize patient care among patients with chronic diseases. The position will be particularly interesting to people interested in applying Bayesian statistics in real-world big data settings. We are looking for people who have experience in Bayesian inference techniques, including Stan!

Interesting. In addition to the Stan thing, I’m interested in data-driven approaches to optimize patient care. This is an area where a Bayesian approach, or something like it, is absolutely necessary, as you typically just won’t have enough data to make firm conclusions about individual effects, so you have to keep track of uncertainty. Sounds like a wonderful opportunity.

Statistical Modeling, Causal Inference, and Social Science

Category Archives: Bayesian Statistics