Cherry blossoms—not just another prediction competition

It’s back! As regular readers know, the Cherry Blossom Prediction Competition will run throughout February 2024. We challenge you to predict the bloom date of cherry trees at five locations throughout the world and win prizes.

We’ve been promoting the competition for three years now—but we haven’t really emphasized the history of the problem. You might be surprised to know that bloom date prediction interested several famous nineteenth century statisticians. Co-organizer Jonathan Auerbach explains:

The “law of the flowering plants” states that a plant blooms after being exposed to a predetermined quantity of heat. The law was discovered in the mid-eighteenth century by René Réaumur, an early adopter of the thermometer—but it was popularized by Adolphe Quetelet, who devoted a chapter to the law in his Letters addressed to HRH the Grand Duke of Saxe Coburg and Gotha (Letter Number 33). See this tutorial for details.

Kotz and Johnson list Letters as one of eleven major breakthroughs in statistics prior to 1900, and the law of the flowering plants appears to have been well known throughout the nineteenth century, influencing statisticians such as Francis Galton and Florence Nightingale. But its popularity waned among statisticians as statistics transitioned from a collection of “fundamental” constants to a collection of principles for quantifying uncertainty. In fact, Ian Hacking mocks the law as a particularly egregious example of stamp-collecting statistics.

But the law is widely used today! Charles Morren, Quetelet’s collaborator, later coined the term phenology, the name of the field that currently studies life-cycle events, such as bloom dates. Phenologists keep track of accumulated heat or growing degree days to predict bloom dates, crop yields, and the emergence of insects. Predictions are made using a methodology that is largely unchanged since Quetelet’s time—despite the large amounts of data now available and amenable to machine learning.
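To make the accumulated-heat idea concrete, here is a minimal R sketch of a growing-degree-day calculation. The base temperature and the bloom threshold are invented for illustration, not taken from any actual phenology model.

# Toy growing-degree-day (GDD) calculation: accumulate daily heat above a base
# temperature and report the first day the running total crosses a bloom threshold.
# The base (5 C) and threshold (100 GDD) are illustrative values only.
gdd_bloom_day <- function(tmax, tmin, base = 5, threshold = 100) {
  daily_gdd <- pmax((tmax + tmin) / 2 - base, 0)
  which(cumsum(daily_gdd) >= threshold)[1]
}

set.seed(1)
tmin <- rnorm(120, mean = seq(0, 12, length.out = 120), sd = 2)  # fake late-winter/spring minima
tmax <- tmin + runif(120, 4, 10)                                 # fake daily maxima
gdd_bloom_day(tmax, tmin)  # predicted day of the season at which the threshold is reached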

Fun with Dååta: Reference librarian edition

Rasmus Bååth reports the following fun story in a blog post, The source of the cake dataset (it’s a hierarchical modeling example included with the R package lme4).

Rasmus writes,

While looking for a dataset to illustrate a simple hierarchical model I stumbled upon another one: The cake dataset in the lme4 package which is described as containing “data on the breakage angle of chocolate cakes made with three different recipes and baked at six different temperatures [as] presented in Cook (1938).”

The search is on.

… after a fair bit of flustered searching, I realized that this scholarly work, despite its obvious relevance to society, was nowhere to be found online.

The plot thickens like cake batter until Megan N. O’Donnell, a reference librarian (officially, Research Data Services Lead!) at Iowa State, home of the original thesis, gets involved. She replies to Rasmus’s query,

Sorry for the delay — I got caught up in a deadline. The scan came out fairly well, but page 16 is partially cut off. I’ll put in a request to have it professionally scanned, but that will take some time. Hopefully this will do for now.

Rasmus concludes,

She (the busy Research Data Services Lead with a looming deadline) is apologizing to me (the random Swede with an eccentric cake thesis digitization request) that it took a few days to get me everything I asked for!?

Reference librarians are amazing! Read the whole story and download the actual manuscript from Rasmus’s original blog post. The details of the experimental design are quite interesting, including the device used to measure cake breakage angle, a photo of which is included in the post.
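For readers who want to poke at the data themselves, here is a minimal sketch of the kind of hierarchical model the dataset is typically used to illustrate, assuming the column names in lme4’s bundled cake data (angle, recipe, temp, replicate); it’s a starting point, not a definitive analysis.

library(lme4)

data(cake)  # breakage angle of chocolate cakes, from Cook (1938)
str(cake)   # replicate, recipe, temperature (factor), temp (numeric), angle

# Fixed effects for recipe and baking temperature, with a random intercept
# for each replicate batch within recipe.
fit <- lmer(angle ~ recipe + temp + (1 | recipe:replicate), data = cake)
summary(fit)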

I think it’d be fun to organize a class around generating new, small scale and amusing data sets like this one. Maybe it sounds like more fun than it would actually be—data collection is grueling. Andrew says he’s getting tired of teaching data communication, and he’s been talking a lot more about the importance of data collection on the blog, so maybe next year…

P.S. In a related note, there’s something called a baath cake that’s popular in Goa and confused my web search.

The paradox of replication studies: A good analyst has special data analysis and interpretation skills. But it’s considered bad or surprising when different analysts, given the same data, come to different conclusions.

Benjamin Kircup writes:

I think you will be very interested to see this preprint that is making the rounds: Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology (ecoevorxiv.org)

I see several ties to social science, including the study of how data interpretation varies across scientists studying complex systems, but also the sociology of science. This is a pretty deep introspection for a field, and possibly damning. The garden of forking paths is wide. They cite you first, which is perhaps a good sign.

Ecologists frequently pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be? It would all be mechanistic, rote, unimaginative, uninteresting. In general, actually, that’s the perception many have of typical biostatistics. It leaves insights on the table by being terribly rote and using the most conservative kinds of analytic tools (yet another t-test, etc). The price of this is that different people will reach different conclusions with the same data – and that’s not typically discussed, but raises questions about the literature as a whole.

One point: apparently the peer reviews didn’t systematically reward finding large effect sizes. That’s perhaps counterintuitive and suggests that the community isn’t rewarding bias, at least in that dimension. It would be interesting to see what you would do with the data.

The first thing I noticed is that the paper has about a thousand authors! This sort of collaborative paper kind of breaks the whole scientific-authorship system.

I have two more serious thoughts:

1. Kircup makes a really interesting point, that analysts “pride themselves on data analysis and interpretation skills. If there weren’t any variability, what skill would there be?”—but then it’s considered bad or surprising that different analysts, given the same data, come to different conclusions. There really does seem to be a fundamental paradox here. On one hand, different analysts do different things—Pete Palmer and Bill James have different styles, and you wouldn’t expect them to come to the same conclusions. On the other hand, we expect strong results to appear no matter who is analyzing the data.

A partial resolution to this paradox is that much of the skill of data analysis and interpretation comes in what questions to ask. In these replication projects (I think Bob Carpenter calls them “bake-offs”), several different teams are given the same question and the same data and then each do their separate analysis. David Rothschild and I did one of these; it was called We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results, and we were the only analysts of that Florida poll from 2016 that estimated Trump to be in the lead. Usually, though, data and questions are not fixed, despite what it might look like when you read the published paper. Still, there’s something intriguing about what we might call the Analyst’s Paradox.

2. Regarding his final bit (“apparently the peer reviews didn’t systematically reward finding large effect sizes”), I think Kircup is missing the point. Peer reviews don’t systematically reward finding large effect sizes. What they systematically reward is finding “statistically significant” effects, i.e., those that are at least two standard errors from zero. But by restricting yourself to those, you automatically overestimate effect sizes, as I discussed at interminable length in papers such as Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors and The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. So they are rewarding bias, just indirectly.
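Here is a minimal R sketch of that selection effect, with invented numbers: a true effect of 0.1 estimated with standard error 0.1, where keeping only the “statistically significant” estimates exaggerates the effect and occasionally gets the sign wrong.

# The significance filter as a bias machine: condition on |estimate| > 1.96*SE
# and the surviving estimates systematically overstate the true effect.
set.seed(123)
true_effect <- 0.1
se <- 0.1
est <- rnorm(1e5, mean = true_effect, sd = se)  # replications of an unbiased estimator
sig <- abs(est) > 1.96 * se                     # the ones that reach "significance"
mean(est[sig])                      # conditional on significance: well above 0.1
mean(abs(est[sig])) / true_effect   # type M (exaggeration) ratio
mean(est[sig] < 0)                  # type S rate: significant estimates with the wrong sign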

The importance of measurement, and how you can draw ridiculous conclusions from your statistical analyses if you don’t think carefully about measurement . . . Leamer (1983) got it.

[Graph from the paper discussed below: data and fitted regression curve for life expectancy, including a region with a purported life expectancy of 91.]

Jacob Klerman writes:

I have noted your recent emphasis on the importance of measurement (e.g., “Here are some ways to make your study replicable…”). For reasons not relevant here, I was rereading Leamer (1983), Let’s Take the Con Out of Econometrics—now 40 years old. It’s a fun, if slightly dated, paper that you seem to be aware of.

Leamer also makes the measurement point (emphasis added):

When the sampling uncertainty S gets small compared to the misspecification uncertainty M, it is time to look for other forms of evidence, experiments or nonexperiments. Suppose I am interested in measuring the width of a coin, and I provide rulers to a room of volunteers. After each volunteer has reported a measurement, I compute the mean and standard deviation, and I conclude that the coin has width 1.325 millimeters with a standard error of .013. Since this amount of uncertainty is not to my liking, I propose to find three other rooms full of volunteers, thereby multiplying the sample size by four, and dividing the standard error in half. That is a silly way to get a more accurate measurement, because I have already reached the point where the sampling uncertainty S is very small compared with the misspecification uncertainty M. If I want to increase the true accuracy of my estimate, it is time for me to consider using a micrometer. So too in the case of diet and heart disease. Medical researchers had more or less exhausted the vein of nonexperimental evidence, and it became time to switch to the more expensive but richer vein of experimental evidence.

Interesting. Good to see examples where ideas we talk about today were already discussed in the classic literature. I indeed think measurement is important and is under-discussed in statistics. Economists are very familiar with the importance of measurement, both in theory (textbooks routinely discuss the big challenges in defining, let alone measuring, key microeconomic quantities such as “the money supply”) and in practice (data gathering can often be a big deal, involving archival research, data quality checking, etc., even if unfortunately this is not always done), but once the data are in, data quality and issues of bias and variance of measurement often seem to be forgotten. Consider, for example, this notorious paper where nobody at any stage in the research, writing, reviewing, revising, or editing process seemed to be concerned about that region with a purported life expectancy of 91 (see the above graph)—and that doesn’t even get into the bizarre fitted regression curve. But, hey, p less than 0.05. Publishing and promoting such a result based on the p-value represents some sort of apogee of trusting implausible theory over realistic measurement.
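Leamer’s coin example is easy to simulate. In the sketch below (all numbers invented), the rulers share a common systematic error that averaging over more volunteers cannot remove, so quadrupling the sample size halves the reported standard error while the actual accuracy barely budges.

# Sampling uncertainty vs. misspecification uncertainty, Leamer-style.
# True coin width 1.300 mm; the measurement setup has a shared bias that
# recruiting more volunteers does nothing to fix.
set.seed(7)
true_width <- 1.300
bias <- rnorm(1, 0, 0.05)  # common flaw in the rulers/protocol, unknown to us
measure <- function(n_volunteers, reading_sd = 0.02) {
  readings <- true_width + bias + rnorm(n_volunteers, 0, reading_sd)
  c(estimate = mean(readings), se = sd(readings) / sqrt(n_volunteers))
}
measure(25)   # one room of volunteers
measure(100)  # four rooms: the reported SE halves, but the estimate stays about `bias` away from 1.300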

Also, if you want a good story about why it’s a mistake to think that your uncertainty should just go like 1/sqrt(n), check out this story which is also included in our forthcoming book, Active Statistics.

Resources for teaching and learning survey sampling, from Scott Keeter at Pew Research

Art Owen informed me that he’ll be teaching sampling again at Stanford, and he was wondering about ideas for students gathering their own data.

I replied that I like the idea of sampling from databases, biological sampling, etc. You can point out to students that a “blood sample” is indeed a sample!

Art replied:

Your blood example reminds me that there is a whole field (now very old) on bulk sampling. People sample from production runs, from cotton samples, from coal samples and so on. Widgets might get sampled from the beginning, middle and end of the run. David Cox wrote some papers on sampling to find the quality of cotton as measured by fiber length. The process is to draw a blue line across the sample and see the length of fibers that intersect the line. This gives you a length-biased sample that you can nicely de-bias. There’s also an interesting example out there about tree sampling, literally on a tree, where branches get sampled at random and fruit is counted. I’m not sure if it’s practical.

Last time I found an interesting example where people would sample ocean tracts to see if there was a whale. If they saw one, they would then sample more intensely in the neighboring tracts. Then the trick was to correct for the bias that brings. It’s in the Sampling book by S. K. Thompson. There are also good mark-recapture examples for wildlife.

I hesitate to put a lot of regression in a sampling class; it is all too easy for every class to start looking like a regression/prediction/machine learning class. We need room for the ideas about where and how data arise, and it’s too easy to crowd those out by dwelling on the modeling ideas.

I’ll probably toss in some space-filling sampling plans and other ways to downsize data sets as well.

The old Cochran style was: get an estimator, show it is unbiased, find an expression for its variance, find an estimate of that variance, show this estimate is unbiased and maybe even find and compare variances of several competing variance estimates. I get why he did it but it can get dry. I include some of that but I don’t let it dominate the course. Choices you can make and their costs are more interesting.
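As an aside, the de-biasing Art mentions for length-biased samples has a neat closed form: if units are sampled with probability proportional to length, then E[1/X] = 1/mu, so the harmonic mean of the sampled lengths estimates the population mean. A quick simulation sketch with made-up fiber lengths:

# Length-biased sampling: long fibers are more likely to cross the blue line,
# so the naive sample mean is biased upward; the harmonic mean corrects it.
set.seed(42)
fibers <- rgamma(1e5, shape = 4, rate = 2)                     # population, true mean = 2
picked <- sample(fibers, 5000, replace = TRUE, prob = fibers)  # length-biased draws
mean(picked)          # biased upward
1 / mean(1 / picked)  # harmonic-mean correction, close to 2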

I connected Art to Scott Keeter at Pew Research, who wrote:

Fortunately, we are pretty diligent about keeping track of what we do and writing it up. The examples below have lengthy methodology sections and often there is companion material (such as blog posts or videos) about the methodological issues.

We do not have a single overview methodological piece about this kind of work but the next best thing is a great lecture that Courtney Kennedy gave at the University of Michigan last year, walking through several of our studies and the considerations that went into each one:

Here are some links to good examples, with links to the methods sections or extra features:

Our recent study of Jewish Americans, the second one we’ve done. We switched modes for this study (thus different sampling strategy), and the report materials include an analysis of mode differences https://www.pewresearch.org/religion/2021/05/11/jewish-americans-in-2020/

Appendix A: Survey methodology

Jewish Americans in 2020: Answers to frequently asked questions

Our most recent survey of the US Muslim population:

U.S. Muslims Concerned About Their Place in Society, but Continue to Believe in the American Dream


A video on the methods:
https://www.pewresearch.org/fact-tank/2017/08/16/muslim-americans-methods/

This is one of the most ambitious international studies we’ve done:

Religion in India: Tolerance and Segregation


Here’s a short video on the sampling and methodology:
https://www.youtube.com/watch?v=wz_RJXA7RZM

We then had a quick email exchange:

Me: Thanks. Post should appear in Aug.

Scott: Thanks. We’ll probably be using sampling by spaceship and data collection with telepathy by then.

Me: And I’ll be charging the expenses to my NFT.

In a more serious vein, Art looked into Scott’s suggestions and followed up:

I [Art] looked at a few things at the Pew website. The quality of presentation is amazingly good. I like the discussions of how you identify who to reach out to. Also the discussion of how to pose the gender identity question is something that I think would interest students. I saw some of the forms and some of the data on response rates. I also found Courtney Kennedy’s video on non-probability polls. I might avoid religious questions for in-depth followup in class. Or at least, I would have to be careful in doing it, so nobody feels singled out.

Where could I find some technical documents about the American Trends Panel? I would be interested to teach about sample reweighting, e.g., raking and related methods, as it is done for real.

I’m wondering about getting survey data for a class. I might not be able to require them to get a Pew account and then agree to terms and conditions. Would it be reasonable to share a downsampled version of a Pew data set with a class? Something about attitudes to science would be interesting for students.

To which Scott replied:

Here is an overview I wrote about how the American Trends Panel operates and how it has changed over time in response to various challenges:

Growing and Improving Pew Research Center’s American Trends Panel

This relatively short piece provides some good detail about how the panel works:
https://www.pewresearch.org/fact-tank/2021/09/07/how-do-people-in-the-u-s-take-pew-research-center-surveys-anyway/

We use the panel to conduct lots of surveys, but most of them are one-off efforts. We do make an effort to track trends over time, but that’s usually the way we used to do it when we conducted independent sample phone surveys. However, we sometimes use the panel as a panel – tracking individual-level change over time. This piece explains one application of that approach:
https://www.pewresearch.org/fact-tank/2021/01/20/how-we-know-the-drop-in-trumps-approval-rating-in-january-reflected-a-real-shift-in-public-opinion/

When we moved from mostly phone surveys to mostly online surveys, we wanted to assess the impact of the change in mode of interview on many of our standard public opinion measures. This study was a randomized controlled experiment to try to isolate the impact of mode of interview:

From Telephone to the Web: The Challenge of Mode of Interview Effects in Public Opinion Polls

Survey panels have some real benefits but they come with a risk – that panelists change as a result of their participation in the panel and no longer fully resemble the naïve population. We tried to assess whether that is happening to our panelists:

Measuring the Risks of Panel Conditioning in Survey Research

We know that all survey samples have biases, so we weight to try to correct those biases. This particular methodology statement is more detailed than is typical and gives you some extra insight into how our weighting operates. Unfortunately, we do not have a public document that breaks down every step in the weighting process:

Methodology

Most of our weighting parameters come from U.S. government surveys such as the American Community Survey and the Current Population Survey. But some parameters are not available on government surveys (e.g., religious affiliation) so we created our own higher quality survey to collect some of these for weighting:

How Pew Research Center Uses Its National Public Opinion Reference Survey (NPORS)

This one is not easy to find on our website but it’s a good place to find wonky methodological content, not just about surveys but about our big data projects as well:

Home


We used to publish these through Medium but decided to move them in-house.

By the way, my colleagues in the survey methods group have developed an R package for the weighting and analysis of survey data. This link is to the explainer for weighting data but that piece includes links to explainers about the basic analysis package:
https://www.pewresearch.org/decoded/2020/03/26/weighting-survey-data-with-the-pewmethods-r-package/

Lots here to look at!

It’s been a while since I’ve taught a course on survey sampling. I used to teach such a course—it was called Design and Analysis of Sample Surveys—and I enjoyed it. But . . . in the class I’d always have to spend some time discussing basic statistics and regression modeling, and this always was the part of the class that students found the most interesting! So I eventually just started teaching statistics and regression modeling, which led to my Regression and Other Stories book. The course I’m now teaching out of that book is called Applied Regression and Causal Inference. I still think survey sampling is important; it was just hard to find an audience for the course.

Progress in 2023, Leo edition

Following Andrew, Aki, Jessica, and Charles, and based on Andrew’s proposal, I list my research contributions for 2023.

Published:

  1. Egidi, L. (2023). Seconder of the vote of thanks to Narayanan, Kosmidis, and Dellaportas and contribution to the Discussion of ‘Flexible marked spatio-temporal point processes with applications to event sequences from association football’. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(5), 1129.
  2. Marzi, G., Balzano, M., Egidi, L., & Magrini, A. (2023). CLC Estimator: a tool for latent construct estimation via congeneric approaches in survey research. Multivariate Behavioral Research, 58(6), 1160-1164.
  3. Egidi, L., Pauli, F., Torelli, N., & Zaccarin, S. (2023). Clustering spatial networks through latent mixture models. Journal of the Royal Statistical Society Series A: Statistics in Society, 186(1), 137-156.
  4. Egidi, L., & Ntzoufras, I. (2023). Predictive Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 929-934). Pearson.
  5. Macrì Demartino, R., Egidi, L., & Torelli, N. (2023). Power priors elicitation through Bayes factors. In SEAS IN. Book of short papers 2023 (pp. 923-928). Pearson.

Preprints:

  1. Consonni, G., & Egidi, L. (2023). Assessing replication success via skeptical mixture priors. arXiv preprint arXiv:2401.00257. Submitted.

Software:

  • CLC estimator: free and open-source app to estimate latent unidimensional constructs via congeneric approaches in survey research (Marzi et al., 2023)
  • footBayes package (CRAN version 0.2.0)
  • pivmet package (CRAN version 0.5.0)

I hope and guess that the paper dealing with the replication crisis, “Assessing replication success via skeptical mixture priors” with Guido Consonni, could have good potential for the Bayesian assessment of replication success in the social and hard sciences; this paper can be seen as an extension of the paper by Leonhard Held and Samuel Pawel entitled “The Sceptical Bayes Factor for the Assessment of Replication Success.” Moreover, I am glad that the paper “Clustering spatial networks through latent mixture models,” focused on a model-based clustering approach defined in a hybrid latent space, has finally been published in JRSS A.

Regarding software, the footBayes package, a tool to fit the most well-known soccer (football) models through Stan and maximum likelihood methods, has been substantially developed and enriched with new functionality (2024 objectives: incorporate CmdStan with VI/Pathfinder algorithms and write a paper on the package in JSS/R Journal format).

Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024)

Here’s the link:

Learning from mistakes

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

We learn so much from mistakes! How can we structure our workflow so that we can learn from mistakes more effectively? I will discuss a bunch of examples where I have learned from mistakes, including data problems, coding mishaps, errors in mathematics, and conceptual errors in theory and applications. I will also discuss situations where researchers have avoided good learning opportunities. We can then try to use all these cases to develop some general understanding of how and when we learn from errors in the context of the fractal nature of scientific revolutions.

The video is here.

It’s sooooo frustrating when people get things wrong, the mistake is explained to them, and they still don’t make the correction or take the opportunity to learn from their mistakes.

To put it another way . . . when you find out you made a mistake, you learn three things:

1. Now: Your original statement was wrong.

2. Implications for the future: Beliefs and actions that flow from that original statement may be wrong. You should investigate your reasoning going forward and adjust to account for your error.

3. Implications for the past: Something in your existing workflow led to your error. You should trace your workflow, see how that happened, and alter your workflow accordingly.

In poker, they say to evaluate the strategy, not the play. In quality control, they say to evaluate the process, not the individual outcome. Similarly with workflow.

As we’ve discussed many many times in this space (for example, here), it makes me want to screeeeeeeeeeam when people forego this opportunity to learn. Why do people, sometimes very accomplished people, give up this opportunity? I’m speaking here of people who are trying their best, not hacks and self-promoters.

The simple answer for why even honest people will avoid admitting clear mistakes is that it’s embarrassing for them to admit error, they don’t want to lose face.

The longer answer, I’m afraid, is that at some level they recognize issues 1, 2, and 3 above, and they go to some effort to avoid confronting item 1 because they really really don’t want to face item 2 (their beliefs and actions might be affected, and they don’t want to hear that!) and item 3 (they might be going about everything all wrong, and they don’t want to hear that either!).

So, paradoxically, the very benefits of learning from error are scary enough to some people that they’ll deny or bury their own mistakes. Again, I’m speaking here of otherwise-sincere people, not of people who are willing to lie to protect their investment or make some political point or whatever.

In my talk, I’ll focus on my own mistakes, not those of others. My goal is for you in the audience to learn how to improve your own workflow so you can catch errors faster and learn more from them, in all three senses listed above.

P.S. Planning a talk can be good for my research workflow. I’ll get invited to speak somewhere, then I’ll write a title and abstract that seems like it should work for that audience, then the existence of this structure gives me a chance to think about what to say. For example, I’d never quite thought of the three ways of learning from error until writing this post, which in turn was motivated by the talk coming up. I like this framework. I’m not claiming it’s new—I guess it’s in Pólya somewhere—just that it will help my workflow. Here’s another recent example of how the act of preparing an abstract helped me think about a topic of continuing interest to me.

Regarding the use of “common sense” when evaluating research claims

I’ve often appealed to “common sense” or “face validity” when considering unusual research claims. For example, the statement that single women during certain times of the month were 20 percentage points more likely to support Barack Obama, or the claim that losing an election for governor increases politicians’ lifespan by 5-10 years on average, or the claim that a subliminal smiley face flashed on a computer screen causes large changes in people’s attitudes on immigration, or the claim that attractive parents are 36% more likely to have girl babies . . . these claims violated common sense. Or, to put it another way, they violated my general understanding of voting, health, political attitudes, and human reproduction.

I often appeal to common sense, but that doesn’t mean that I think common sense is always correct or that we should defer to common sense. Rather, common sense represents some approximation of a prior distribution or existing model of the world. When our inferences contradict our expectations, that is noteworthy (in a chapter 6 of BDA sort of way), and we want to address this. It could be that addressing this will result in a revision of “common sense.” That’s fine, but if we do decide that our common sense was mistaken, I think we should make that statement explicitly. What bothers me is when people report findings that contradict common sense and don’t address the revision in understanding that would be required to accept that.

In each of the above-cited examples (all discussed at various times on this blog), there was a much more convincing alternative explanation for the claimed results, given some mixture of statistical errors and selection bias (p-hacking or forking paths). That’s not to say the claims are wrong (Who knows?? All things are possible!), but it does tell us that we don’t need to abandon our prior understanding of these things. If we want to abandon our earlier common-sense views, that would be a choice to be made, an affirmative statement that those earlier views are held so weakly that they can be toppled by little if any statistical evidence.

P.S. Perhaps relevant is this recent article by Mark Whiting and Duncan Watts, “A framework for quantifying individual and collective common sense.”

Progress in 2023, Charles edition

Following the examples of Andrew, Aki, and Jessica, and at Andrew’s request:

Published:

Unpublished:

This year, I also served on the Stan Governing Body, where my primary role was to help bring back the in-person StanCon. StanCon 2023 took place at Washington University in St. Louis, MO, and we got the ball rolling for the 2024 edition, which will be held at Oxford University in the UK.

It was also my privilege to be invited as an instructor at the Summer School on Advanced Bayesian Methods at KU Leuven, Belgium and teach a 3-day course on Stan and Torsten, as well as teach workshops at StanCon 2023 and at the University of Buffalo.

Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins on 26 Apr)

Storytelling and Scientific Understanding

Andrew Gelman and Thomas Basbøll

Storytelling is central to science, not just as a tool for broadcasting scientific findings to the outside world, but also as a way that we as scientists understand and evaluate theories. We argue that, for this purpose, a story should be anomalous and immutable; that is, it should be surprising, representing some aspect of reality that is not well explained by existing models of the world, and have details that stand up to scrutiny.

We consider how this idea illuminates some famous stories in social science involving soldiers in the Alps, Chinese boatmen, and trench warfare, and we show how it helps answer literary puzzles such as why Dickens had all those coincidences, why authors are often so surprised by what their characters come up with, and why the best alternative history stories have the feature that, in these stories, our “real world” ends up as the deeper truth. We also discuss connections to chatbots and human reasoning, stylized facts and puzzles in science, and the millionth digit of pi.

At the center of our framework is a paradox: learning from anomalies seems to contradict usual principles of science and statistics, where we seek representative or unbiased samples. We resolve this paradox by placing learning-within-stories into a hypothetico-deductive (Popperian) framework, in which storytelling is a form of exploration of the implications of a hypothesis. This has direct implications for our work as, respectively, a statistician and a writing coach.

Progress in 2023, Jessica Edition

Since Aki and Andrew are doing it… 

Published:

Unpublished/Preprints:

Performed:

If I had to choose a favorite (beyond the play, of course) it would be the rational agent benchmark paper, discussed here. But I also really like the causal quartets paper. The first aims to increase what we learn from experiments in empirical visualization and HCI through comparison to decision-theoretic benchmarks. The second aims to get people to think twice about what they’ve learned from an average treatment effect. Both have influenced what I’ve worked on since.

A feedback loop can destroy correlation: This idea comes up in many places.

The people who go by “Slime Mold Time Mold” write:

Some people have noted that not only does correlation not imply causality, no correlation also doesn’t imply no causality. Two variables can be causally linked without having an observable correlation. Two examples of people noting this previously are Nick Rowe offering the example of Milton Friedman’s thermostat and Scott Cunningham’s Do Not Confuse Correlation with Causality chapter in Causal Inference: The Mixtape.

We realized that this should be true for any control system or negative feedback loop. As long as the control of a variable is sufficiently effective, that variable won’t be correlated with the variables causally prior to it. We wrote a short blog post exploring this idea if you want to take a closer look. It appears to us that in any sufficiently effective control system, causally linked variables won’t be correlated. This puts some limitations on using correlational techniques to study anything that involves control systems, like the economy, or the human body. The stronger version of this observation, that the only case where causally linked variables aren’t correlated is when they are linked together as part of a control system, may also be true.

Our question for you is, has anyone else made this observation? Is it recognized within statistics? (Maybe this is all implied by Peston’s 1972 “The Correlation between Targets and Instruments”? But that paper seems totally focused on economics and has only 14 citations. And the two examples we give above are both economists.) If not, is it worth trying to give this some kind of formal treatment or taking other steps to bring this to people’s attention, and if so, what would those steps look like?

My response: Yes, this has come up before. It’s a subtle point, as can be seen in some of the confused comments to this post. In that example, the person who brought up the feedback-destroys-correlation example was economist Rachael Meager, and it was a psychologist, a law professor, and some dude who describes himself as “a professor, writer and keynote speaker specializing in the quality of strategic thinking and the design of decision processes” who missed the point. So it’s interesting that you brought up an example of feedback from the economics literature.
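Here’s a minimal R sketch of the thermostat point, with invented numbers: the outside temperature causally drives the inside temperature, but a nearly perfect controller wipes out the correlation.

# Feedback destroys correlation: the disturbance has a direct causal effect on
# the outcome, but the controller counteracts it almost exactly.
set.seed(123)
n <- 1e4
outside <- rnorm(n, 0, 5)                       # disturbance (outside temperature)
heater <- -outside + rnorm(n, 0, 0.5)           # effective controller responds to the disturbance
inside <- outside + heater + rnorm(n, 0, 0.5)   # outcome depends causally on both
cor(outside, inside)  # near zero despite the direct causal link
cor(heater, inside)   # also near zero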

Also, as I like to say, correlation does not even imply correlation.

The point you are making about feedback is related to the idea that, at equilibrium in an idealized setting, price elasticity of demand should be -1, because if it’s higher or lower than that, it would make sense to alter the price accordingly and slide up or down that curve to maximize total $.

I’m not up on all this literature; it’s the kind of thing that people were writing about a lot back in the 1950s related to cybernetics. It’s also related to the idea that clinical trials exist on a phase transition where the new treatment exists but has not yet been determined to be better or worse than the old. This is sometimes referred to as “equipoise,” which I consider to be a very sloppy concept.

The other thing is that everybody knows how correlations can be changed by selection (Simpson’s paradox, the example of high school grades and SAT scores among students who attend a moderately selective institution, those holes in the airplane wings, etc etc.). Knowing about one mechanism for correlations to be distorted can perhaps make people less attuned to other mechanisms such as the feedback thing.

So, yeah, a lot going on here.

“Theoretical statistics is the theory of applied statistics”: A scheduled conference on the topic

Ron Kenett writes:

We are planning a conference on 11/4 that might be of interest to your blog followers.

It is a hybrid format event on the foundations of applied statistics. Discussion inputs will be most welcome.

The registration link and other information are here.

I think that “11/4” refers to 11 Apr 2024; if not, I guess that someone will correct me in comments.

Kenett’s paper on the theory of applied statistics reminds me of my dictum that theoretical statistics is the theory of applied statistics. For example of how this principle can inform both theory and applications, see this comment at the linked post:

There are lots of ways of summarizing a statistical analysis, and it’s good to have a sense of how the assumptions map to the conclusions. My problem with the paper [on early-childhood intervention; see pages 17-18 of this paper here for background] was that they presented a point estimate of an effect size magnitude (42% earnings improvement from early childhood intervention) which, if viewed classically, is positively biased (type M error) and, if viewed Bayesianly, corresponds to a substantively implausible prior distribution in which an effect of 84% is as probable as an effect of 0%.

If we want to look at the problem classically, I think researchers who use biased estimates should (i) recognize the bias, and (ii) attempt to adjust for it. Adjusting for the bias requires some assumption about plausible effect sizes; that’s just the way things are, so make the assumption and be clear about what assumption you’re making.

If we want to look at the problem Bayesianly, I think researchers should have to justify all aspects of their model, including their prior distribution. Sometimes the justification is along the lines of, “This part of the model doesn’t materially impact the final conclusions so we can go sloppy here,” which can be fine, but it doesn’t apply in a case like this where the flat prior is really driving the headline estimate.

The point is that theoretical concepts such as “unbiased estimation” or “prior distribution” don’t exist in a vacuum; they are theoretically relevant to the extent that they connect to applied practice.

I assume that such issues will be discussed at the conference.

“My quick answer is that I don’t care much about permutation tests because they are testing a null null hypothesis that is of typically no interest”

Riley DeHaan writes:

I’m a psych PhD student and I have a statistical question that’s been bothering me for some time and wondered if you’d have any thoughts you might be willing to share.

I’ve come across some papers employing z-scores of permutation null distributions as a primary metric in neuroscience (for an example, see here).

The authors computed a coefficient of interest in a multiple linear regression and then permuted the order of the predictors to obtain a permutation null distribution of that coefficient. “The true coefficient for functional connectivity is compared to the distribution of null coefficients to obtain a z-score and P-value.” The authors employed this permutation testing approach to avoid the need to model potentially complicated autocorrelations between the observations in their sample and then wanted a statistic that provided a measure of effect size rather than relying solely on p-values.

Is there any meaningful interpretation of a z-score of a permutation null distribution under the alternative hypothesis? Is this a commonly used approach? This approach would not appear to find meaningfully normalized estimates of effect size given the variability of the permutation null distribution may not have anything to do with the variance of the statistic of interest under its own distribution. In this case, I’m not sure a z-score based on the permutation null provides much information beyond significance. The variability of the permutation null distribution will also be a function of the sample size in this case. Could we argue that permutation null distributions would in many cases (I’m thinking about simple differences in means rather than regression coefficients) tend to overestimate the variability of the true statistic given permutation tests are conservative compared to tests based on known distributions of the statistic of interest? This z-score approach would then tend to produce conservative effect sizes. I’m not finding references to this approach online beyond this R package.

My reply: My quick answer is that I don’t care much about permutation tests because they are testing a null null hypothesis that is of typically no interest. Related thoughts are here.

P.S. If you, the reader of this blog, care about permutation tests, that’s fine! Permutation tests have a direct mathematical interpretation. They just don’t interest me, that’s all.
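For readers who do want to see the mechanics being discussed, here’s a minimal sketch with simulated data: a permutation null for one regression coefficient, the z-score of the observed coefficient against that null, and the permutation p-value. (It permutes a single predictor, which is not necessarily the exact scheme in the paper Riley cites.)

# Permutation null distribution for a regression coefficient, plus the z-score
# and p-value computed from it.
set.seed(1)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
y <- 0.3 * x + 0.5 * z + rnorm(n)

obs <- coef(lm(y ~ x + z))[2]  # observed coefficient on x
perm <- replicate(2000, {
  xp <- sample(x)              # permute x, breaking its link to y but keeping z intact
  coef(lm(y ~ xp + z))[2]
})

z_score <- (obs - mean(perm)) / sd(perm)  # observed coefficient relative to the permutation null
p_value <- mean(abs(perm) >= abs(obs))    # two-sided permutation p-value
c(z_score = unname(z_score), p_value = p_value)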

Prediction isn’t everything, but everything is prediction

This is Leo.

Explanation or explanatory modeling can be considered to be the use of statistical models for testing causal hypotheses or associations, e.g. between a set of covariates and a response variable. Prediction or predictive modeling, (supposedly) on the other hand, is the act of using a model—or device, algorithm—to produce values of new, existing, or future observations. A lot has been written about the similarities and differences between explanation and prediction, for example Breiman (2001), Shmueli (2010), Billheimer (2019), and many more.

These are often thought to be separate dimensions of statistics, but Jonah and I have been discussing for a long time that in some sense there may actually be no such thing as explanation without prediction. Basically, although prediction itself is not the only goal of inferential statistics, everything—every objective—in inferential statistics can be reframed through the lens of prediction.

Hypothesis testing, ability estimation, hierarchical modeling, treatment effect estimation, causal inference problems, etc., can all, in our opinion, be described from an (inferential) predictive perspective. So far we have not found an example for which there is no way to reframe it as a prediction problem. So I ask you: is there any inferential statistical ambition that cannot be described in predictive terms?

P.S. Like Billheimer (2019) and others, we think that inferential statistics should be considered as inherently predictive and be focused primarily on probabilistic predictions of observable events and quantities, rather than on statistical estimates of unobservable parameters that do not exist outside of our highly contrived models. Similarly, we also feel that the goal of Bayesian modeling should not be taught to students as finding the posterior distribution of unobservables, but rather as finding the posterior predictive distribution of the observables (with finding the posterior as an intermediate step); even when we don’t only care about predictive accuracy and we still care about understanding how a model works (model checking, GoF measures), we think the predictive modeling interpretation is generally more intuitive and effective.
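For concreteness, the reframing we have in mind is just the usual posterior predictive distribution, which averages the sampling model for a new observable over the posterior for the unobservables:

p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta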

Progress in 2023

Published:

Unpublished:

Enjoy.

“It’s About Time” (my talk for the upcoming NY R conference)

I speak at Jared’s NYR conference every year (see here for some past talks). It’s always fun. Here’s the title/abstract for the talk I’ll be giving this year.

It’s About Time

Statistical processes occur in time, but this is often not accounted for in the methods we use and the models we fit. Examples include imbalance in causal inference, generalization from A/B tests even when there is balance, sequential analysis, adjustment for pre-treatment measurements, poll aggregation, spatial and network models, chess ratings, sports analytics, and the replication crisis in science. The point of this talk is to motivate you to include time as a factor in your statistical analyses. This may change how you think about many applied problems!

The continuing challenge of poststratification when we don’t have full joint data on the population.

Torleif Halkjelsvik at the Norwegian Institute of Public Health writes:

Norway has very good register data (education/income/health/drugs/welfare/etc.) but it is difficult to obtain complete tables at the population level. It is however easy to get independent tables from different registries (e.g., age by gender by education as one data source and gender by age by welfare benefits as another). What if I first run a multilevel model to regularize predictions for a vast set of variables, but in the second step, instead of a full table, use a raking approach based on several independent post-stratification tables? Would that be a valid approach? And have you seen examples of this?

My reply: I think the right way to frame this is as a poststratification problem where you don’t have the full poststratification table, you only have some margins. The raking idea you propose could work, but to me it seems awkward in that it’s mixing different parts of the problem together. Instead I’d recommend first imputing a full poststrat table and then using this to do your poststratification. But then the question is how to do this. One approach is iterative proportional fitting (Deming and Stephan, 1940). I don’t know any clean examples of this sort of thing in the recent literature, but there might be something out there.
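To make the IPF idea concrete, here is a bare-bones sketch for a two-way table with invented margins; in the real problem you would run this over the full set of available margins and could seed it with regularized predictions or known structural zeros.

# Iterative proportional fitting: rescale a seed table until its margins match
# known population margins. Cells set to zero in the seed stay zero.
ipf2 <- function(seed, row_m, col_m, tol = 1e-10, max_iter = 1000) {
  tab <- seed
  for (i in seq_len(max_iter)) {
    tab <- tab * (row_m / rowSums(tab))              # scale rows to the row margins
    tab <- sweep(tab, 2, col_m / colSums(tab), `*`)  # scale columns to the column margins
    if (max(abs(rowSums(tab) - row_m)) < tol &&
        max(abs(colSums(tab) - col_m)) < tol) break
  }
  tab
}

seed <- matrix(1, nrow = 3, ncol = 2)  # flat seed; replace with sample- or model-based values
row_m <- c(300, 500, 200)              # e.g., margin for age group
col_m <- c(600, 400)                   # e.g., margin for education
round(ipf2(seed, row_m, col_m), 1)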

Halkjelsvik responded:

It is an interesting idea to impute a full poststrat table, but I wonder whether it is actually better than directly calculating weights using the proportions in the data itself. Cells that should be empty in the population (e.g., women, 80-90 years old, high education, sativa spray prescription) may not be empty in the imputed table when using iterative proportional fitting (IPF), and these “extreme” cells may have quite high or low predicted values. By using the data itself, such cells will be empty, and they will not “steal” any of the marginal proportions when using IPF. This is of course a problem in itself if the data is limited (if there are empty cells in the data that are not empty in the population).

Me: If you have information that certain cells are empty or nearly so, that’s information that you should include in the poststrat table. I think the IPF approach will be similar to the weighting; it is just more model-based. So if you think the IPF will give some wrong answers, that suggests you have additional information. I recommend you try to write down all the additional information you have and use all of it in constructing the poststratification table. This should allow you to do better than with any procedure that does not use this info.

Halkjelsvik:

After playing with a few scenarios (on a piece of paper, no simulation) I see that my suggested raking/weighting approach (which also would involve iterative proportional fitting) directly on the sample data is not a good idea in contexts where MRP is most relevant. That is, if the sample cell sizes are small and regularization matters, then the subgroups of interest (e.g. geographical regions) will likely have too little data on rare demographic combinations. The approach you suggested (full population table imputation based on margins) appears more reasonable, and the addition of “extra information” is obviously a good idea. But how about a hybrid: Instead of manually accounting for “extra information” (e.g., non-existing demographic combinations) this extra information can be derived directly from the proportions of the sample itself (across subgroups of interest) and can be used as “seed” values (i.e., before accounting for margins at the local level). Using information from the sample to create the initial (seed) values for the IPF may be a good way to avoid imputing positive values in cells that are structural zeros, given that the sample is sufficiently large to avoid too many “sample zeros” that are not true “structural zeros”.

So the following could be an approach for my problem?

1. Obtain regularized predictions from sample.

2. Produce a full poststrat seed table directly from “global” cell values in the sample (or from other available “global” data, e.g. if available only at national level). That is, regions start with identical seed structures.

3. Adjust the poststrat table by iterative proportional fitting based on local margins (but I have read that there may be convergence problems when there are many zeros in seed cells).

Me: I’m not sure! I really want to have a fully worked-out example, a case study of MRP where the population joint distribution (the poststratification table) is not known and it needs to be estimated from data. We’re always so sloppy in those settings. I’d like to do it with a full Bayesian model in Stan and then compare various approximations.

Statistical Practice as Scientific Exploration (my talk on 4 Mar 2024 at the Royal Society conference on the promises and pitfalls of preregistration)

Here’s the conference announcement:

Discussion meeting organised by Dr Tom Hardwicke, Professor Marcus Munafò, Dr Sophia Crüwell, Professor Dorothy Bishop FRS FMedSci, Professor Eric-Jan Wagenmakers.

Serious concerns about research quality have provoked debate across scientific disciplines about the merits of preregistration — publicly declaring study plans before collecting or analysing data. This meeting will initiate an interdisciplinary dialogue exploring the epistemological and pragmatic dimensions of preregistration, identifying potential limits of application, and developing a practical agenda to guide future research and optimise implementation.

And here’s the title and abstract of my talk, which is scheduled for 14h10 on Mon 4 Mar 2024:

Statistical Practice as Scientific Exploration

Much has been written on the philosophy of statistics: How can noisy data, mediated by probabilistic models, inform our understanding of the world? Researchers when using and developing statistical methods can be seen to be acting as scientists, forming, evaluating, and elaborating provisional theories about the data and processes they are modelling. This perspective has the conceptual value of pointing toward ways that statistical theory can be expanded to incorporate aspects of workflow that were formerly tacit or informal aspects of good practice, and the practical value of motivating tools for improved statistical workflow.

I won’t really be talking about preregistration, in part because I’ve already said so much on that topic here on this blog; see for example here and various links at that post. Instead I’ll be talking about the statistical workflow, which is typically presented as a set of procedures applied to data but which I think is more like a process of scientific exploration and discovery. I addressed some of these ideas in this talk from a couple years ago. But, don’t worry, I’m sure I’ll have lots of new material. Not to mention all the other speakers at the conference.

Explainable AI works, but only when we don’t need it

This is Jessica. I went to NeurIPS last week, mostly to see what it was like. While waiting for my flight home at the airport I caught a talk that did a nice job of articulating some fundamental limitations with attempts to make deep machine learning models “interpretable” or “explainable.”

It was part of an XAI workshop. My intentions in checking out the XAI workshop were not entirely pure, as it’s an area I’ve been skeptical of for a while. Formalizing aspects of statistical communication is very much in line with my interests, but I tried and failed to get into XAI and related work on interpretability a few years ago when it was getting popular. The ML contributions have always struck me as more of an academic exercise than a real attempt at aligning human expectations with model capabilities. When human-computer interaction people started looking into it, there started to be a little more attention to how people actually use explanations, but the methods used to study human reliance on explanations there have not been well grounded (e.g., “appropriate reliance” is often defined as agreeing with the AI when it’s right and not agreeing when it’s wrong, which can be shown to be incoherent in various ways).

The talk, by Ulrike Luxburg, which gave a sort of impossibility result for explainable AI, was refreshing. First, she distinguished two very different scenarios for explanation: cooperative ones, where the principal furnishing the explanations and the user consuming them both want the best quality/most accurate explanations, versus adversarial scenarios, where the principal’s interests are not aligned with the goal of accurate explanation. For example, a company that needs to explain why it denied someone a loan has little motivation to explain the actual reason behind that prediction, because it’s not in its interest to give people fodder to minimally change their features and push the prediction to a different label. Her first point was that there is little value in trying to guarantee good explanations in the adversarial case, because existing explanation techniques (e.g., feature attribution methods like SHAP or LIME) give very different explanations for the same prediction, and the same explanation technique is often highly sensitive to small differences in the function to be explained (e.g., slight changes to parameters in training). There are so many degrees of freedom in selecting among inductive biases that the principal can easily produce something faithful by some definition while hiding important information. Hence laws guaranteeing a right to explanation miss this point.

In the cooperative setting, maybe there is hope. But, turns out something like the anthropic principle of statistics operates here: we have techniques that we can show work well in the simple scenarios where we don’t really need explanations, but when we do really need them (e.g., deep neural nets over high dimensional feature spaces) anything we can guarantee is not going to be of much use.

There’s an analogy to clustering: back when unsupervised learning was very hot, everyone wanted guarantees for clustering algorithms but to make them required working in settings where the assumptions were very strong, such that the clusters would be obvious upon inspecting the data. In explainable AI, we have various feature attribution methods that describe which features led to the prediction on a particular instance. SHAP, which borrows Shapley values from game theory to allocate credit among features, is very popular. Typically SHAP provides the marginal contribution of each feature, but Shapley Interaction Values have been proposed to allow for local interaction effects between pairs of features. Luxburg presented a theoretical result from this paper which extends Shapley Interaction Values to n-Shapley Values, which explain individual predictions with interaction terms up to order n given some number of total features d. They are additive in that they always sum to the output of the function we’re trying to explain over all subsets of combinations of variables less than or equal to n. Starting from the original Shapley values (where n=1), n-Shapley Values successively add higher-order variable interactions to the explanations.

The theoretical result shows that n-Shapley Values recover generalized additive models (GAMs), which are GLMs where the outcome depends linearly on smoothed functions of the inputs: g(E[Y]) = β_0 + f_1(x_1) + f_2(x_2) + … + f_m(x_m). GAMs are considered inherently interpretable, but they are also underdetermined. For n-Shapley Values to recover a faithful representation of the function as a GAM, the order of the explanation just needs to be at least as large as the maximum order of variable interaction in the model.
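As a reference point for what “inherently interpretable” means here, a first-order GAM can be fit and inspected directly. This is a generic sketch with simulated data using the mgcv package, not anything from the paper:

library(mgcv)

# A first-order GAM: one smooth per feature and no interactions, so each
# feature's contribution can be read off its fitted smooth.
set.seed(1)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
y <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, 0, 0.2)

fit <- gam(y ~ s(x1) + s(x2))
plot(fit, pages = 1)  # the per-feature smooths are the entire "explanation"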

However, GAMs lose their interpretability as we add interactions. When we have large numbers of features, as is typically the case in deep learning, what is the value of the explanation?  We need to look at interactions between all combinatorial subsets of the features. So when simple explanations like standard SHAP are applied to complex functions, you’re getting an average over billions of features, and there’s no reduction to be made that would give you something meaningful. The fact that in the simple setting of a GAM of order 1 we can prove SHAP does the right thing does not mean we’re anywhere close to having “solved” explainability. 

The organizers of the workshop obviously invited this rather negative talk on XAI, so perhaps the community is undergoing self-reflection that will temper the overconfidence I associate with it. Although, the day before the workshop I also heard someone complaining that his paper on calibration got rejected from the same workshop, with an accompanying explanation that it wasn’t about LIME or SHAP. Something tells me XAI will live on.

I guess one could argue there’s still value in taking a pragmatic view, where if we find that explanations of model predictions, regardless of how meaningless, lead to better human decisions in scenarios where humans must make the final decision regardless of the model accuracy (e.g., medical diagnoses, loan decisions, child welfare cases), then there’s still some value in XAI. But who would want to dock their research program on such shaky footing? And of course we still need an adequate way of measuring reliance, but I will save my thoughts on that for another post.

 Another thing that struck me about the talk was a kind of tension around just trusting one’s instincts that something is half-baked versus taking the time to get to the bottom of it. Luxburg started by talking about how her strong gut feeling as a theorist was that trying to guarantee AI explainability was not going to be possible. I believed her before she ever got into the demonstration, because it matched my intuition. But then she spent the next 30 minutes discussing an XAI paper. There’s a decision to be made sometimes, about whether to just trust your intuition and move on to something that you might still believe in versus to stop and articulate the critique. Others might benefit from the latter, but then you realize you just spent another year trying to point out issues with a line of work you stopped believing in a long time ago. Anyway, I can relate to that. (Not that I’m complaining about the paper she presented – I’m glad she took the time to figure it out as it provides a nice example). 

I was also reminded of the kind of awkward moment that happens sometimes where someone says something rather final and damning, and everyone pauses for a moment to listen to it. Then the chatter starts right back up again like it was never said. Gotta love academics!