Cross-validation FAQ

Here it is! It’s from Aki.

Aki linked to it last year in a post, “Moving cross-validation from a research idea to a routine step in Bayesian data analysis.” But I thought the FAQ deserved its own post. May it get a million views.

Here’s its current table of contents:

1 What is cross-validation?
1.1 Using cross-validation for a single model
1.2 Using cross-validation for many models
1.3 When not to use cross-validation?
2 Tutorial material on cross-validation
3 What are the parts of cross-validation?
4 How is cross-validation related to overfitting?
5 How to use cross-validation for model selection?
6 How to use cross-validation for model averaging?
7 When is cross-validation valid?
8 Can cross-validation be used for hierarchical / multilevel models?
9 Can cross-validation be used for time series?
10 Can cross-validation be used for spatial data?
11 Can other utility or loss functions be used than log predictive density?
12 What is the interpretation of ELPD / elpd_loo / elpd_diff?
13 Can cross-validation be used to compare different observation models / response distributions / likelihoods?

P.S. Also relevant is this discussion from the year before, “Rob Tibshirani, Yuling Yao, and Aki Vehtari on cross validation.”

Causal inference and the aggregation of micro effects into macro effects: The effects of wages on employment

James Traina writes:

I’m an economist at the SF Fed. I’m writing to ask for your first thoughts or suggested references on a particular problem that’s pervasive in my field: Aggregation of micro effects into macro effects.

This is an issue that has been studied since the 80s. For example, the individual-level estimates of wages on employment using quasi-experimental tax variation are much smaller than aggregate-level estimates using time series variation. More recently, there has been an active debate on how to port individual-level estimates of government transfers on consumption to macro policy.

Given your expertise, I was wondering if you had insight into how you or other folks in the stats / causal inference field would approach this problem structure more generally.

My reply: Here’s a paper from 2006, Multilevel (hierarchical) modeling: What it can and cannot do. The short answer is that you can estimate micro and macro effects in the same model, but you don’t necessarily have causal identification at both levels. It depends on the design.

You’ll also want a theoretical model. For example, in your model, if you want to talk about “the effects of wages,” it can help to consider potential interventions that could affect local wages. Such interventions could be a minimum-wage law, it could be inflation that reduces real (not nominal) wages, it could be national economic conditions that make the labor market more or less competitive, etc. You can also think about potential interventions at an individual level, such as a person getting education or training, marrying or having a child, the person’s employer changing its policies, whatever.

I don’t know enough about your application to give more detail. The point is that “wages” is not in itself a treatment. Wages is a measured variable, and different wage-affecting treatments can have different effects on employment. You can think of these as instruments, even if you’re not actually doing an instrumental variables analysis. Also, treatments that affect individual wages will be different from treatments that affect aggregate wages, so it’s no surprise that they would have different effects on employment. There’s no strong theoretical reason to think that the effects would be the same.

Finally, I don’t understand how government transfers connect to wages in your problem. Government transfers do not directly affect wages, do they? So I feel like I’m missing some context here.


What data to include in an analysis? Not always such an easy question. (Elliott Morris / Nate Silver / Rasmussen polls edition)

Someone pointed me to a recent post by Nate Silver, “Polling averages shouldn’t be political litmus tests, and they need consistent standards, not make-it-up-as-you-go,” where Nate wrote:

The new Editorial Director of Data Analytics at ABC News, G. Elliott Morris, who was brought in to work with the remaining FiveThirtyEight team, sent a letter to the polling firm Rasmussen Reports demanding that they answer a series of questions about their political views and polling methodology or be banned from FiveThirtyEight’s polling averages, election forecasts and news coverage. I found several things about the letter to be misguided. . . .

First, I strongly oppose subjecting pollsters to an ideological or political litmus test. . . . Why, unless you’re a dyed-in-the-wool left-leaning partisan, would having a “relationship with several right-leaning blogs and online media outlets” lead one to “doubt the ethical operation of the polling firm”? . . .

Rasmussen has indeed had strongly Republican-leaning results relative to the consensus for many years. Despite that strong Republican house effect, however, they’ve had roughly average accuracy overall because polls have considerably understated Republican performance in several recent elections (2014, 2016, 2020). . . . Is that a case of two wrongs making a right — Rasmussen has had a Republican bias, but other polls have had a Democratic bias, so they come out of the wash looking OK? Yeah, probably. Still, there are ways to adjust for that — statistical ways like a house effects adjustment . . .

Second, even if you’re going to remove Rasmussen from the averages going forward, it’s inappropriate to write them out of the past . . . It’s bad practice to revise data that’s already been published, based on decisions you made long after that data was published. For one thing, it makes your numbers less reliable as a historical record. For another, it can lead to overconfidence when using that data to train or build models. . . .

Third, I think it’s clear that the letter is an ad hoc exercise to exclude Rasmussen, not an effort to develop a consistent set of standards. . . . The thing about running a polling average is that you need a consistent and legible set of rules that can be applied to hundreds of pollsters you’ll encounter over the course of an election campaign. Going on a case-by-case basis is a) extremely time-consuming . . . and b) highly likely to result in introducing your own biases . . . Perhaps Morris’s questions were getting at some larger theme or more acute problem. But if so, he should have stated it more explicitly in his letter. . . .

Nate raises several interesting questions here:

1. Is there any good reason for a relationship with “right-leaning” outlets such as Fox News and Steve Bannon to cause one to “doubt the ethical operation of the polling firm”?

2. Does it ever make sense to remove a biased poll, rather than including it in your analysis with a statistical correction?

3. If you are changing your procedure going forward, is it a mistake to make those changes retroactively on past work?

4. Is it appropriate to send a letter to one polling organization without going through the equivalent process with all the other pollsters whose data you’re using?

Any followups?

I’ll go through the above questions one at a time, but first I was curious if Nate or Elliott had said anything more on the topic.

I found these two items on twitter:

This from Elliott: “asking pollsters detailed methodological questions is not (or shouldn’t be!) controversial. it’s standard practice in most media organizations, and aggregators should probably even be publishing responses for the public and using them as a way to gauge potential measurement error,” linking to a list of questions that CNN asks of all pollsters.

This from Nate, referring to Elliott’s letter to Rasmussen as a “Spanish Inquisition” and linking to this article from the Washington Examiner which, among other things, reported this from a Rasmussen poll:

Whaaaaa? As a check, I googled *abortion roe wade polling* and found some recent items:

Gallup: “As you may know, the Supreme Court overturned its 1973 Roe versus Wade decision concerning abortion, meaning there is no Constitutional protection for abortion rights and each state could set its own laws to allow, restrict or ban abortions. Do you think overturning Roe versus Wade was a good thing or a bad thing?”: 38% “good thing,” 61% “bad thing,” 1% no opinion.

CBS/YouGov: “Last year, the U.S. Supreme Court ended the constitutional right to abortion by overturning Roe v. Wade. Do you approve or disapprove of the Court overturning Roe v. Wade?”: 44% “approve,” 56% “disapprove.”

USA Today (details here): “It’s been a year since the Supreme Court overturned the Roe v. Wade decision, eliminating a constitutional right to an abortion at some stages of pregnancy. Do you support or oppose the court decision to overturn Roe v. Wade?”: 30% “support,” 58% “oppose,” 12% undecided.

There’s other polling out there, all pretty much consistent with the above. And then there’s Rasmussen, which stands out. Would I want to include Rasmussen’s “Majority Now Approve SCOTUS Abortion Ruling” in a polling average? I’m not sure.

Some of it could be their question wording: “Last year, the Supreme Court overturned the 1973 Roe v. Wade decision, so that each state can now determine its own laws regarding abortion. Do you approve or disapprove of the court overturning Roe v. Wade?” This isn’t far from the Gallup question, but it does remove the “Constitutional protection” phrase, and I guess that could make a difference. Also, they’re just counting “likely voters,” and much could depend on where those respondents come from.

Whether or not it makes sense to take the Rasmussen organization seriously (I remain concerned about their numbers that added up to 108%), I think it’s kinda journalistic malpractice for the Washington Examiner to report their claim of “Support for overturning Roe v. Wade is up since last year. 52% to 44%, US likely voters approve,” without even noting how much that disagrees with all other polling out there. My first thought was that, yeah, the Washington Examiner is a partisan outlet, but even partisans benefit from accurate news, right? I guess the point is that the role of an operation such as the Washington Examiner is not so much to inform readers as to circulate talking points and get them out into the general discussion—indeed, thanks to Nate and then me, it happened here!

1. Is there any good reason for a relationship with “right-leaning” outlets such as Fox News and Steve Bannon to cause one to “doubt the ethical operation of the polling firm”?

OK, now on to Nate’s questions. First, should we doubt the ethics of a pollster who hangs out with Fox News and Steve Bannon? My answer here is . . . it depends!

On one hand, . . . Should we discredit my statistical work because I teach at Columbia University, an institution whose most famous professor was Dr. Oz and which notoriously promulgated false statistics for its college rankings? Lots of people teach at Columbia; similarly, lots of people go on Fox News: there’s an appeal to reaching an audience of millions. Going on Fox might be a bad idea, but does it cast doubt on a pollster’s ethics?

As I said, it depends. If a pollster or quantitative social scientist is consistently using crap statistics to push election denial, then, yes, I do doubt their ethics. The relevant point here is not that Fox and Bannon are “right-leaning” but rather that they’ve been fueling election denial misinformation, and distorted election statistics are part of the process.

So, yeah, I agree with Nate that Elliott’s phrase, “several right-leaning blogs and online media outlets,” doesn’t tell the whole story—as Nate put it, “Perhaps Morris’s questions were getting at some larger theme or more acute problem.” There is a larger theme and more acute problem, and that’s refuted claims about the election that have been endorsed by major political and media figures. Given what Rasmussen’s been doing in this area, I think Nate’s been a bit too quick to take their side of the story on this, to refer to Elliott’s inquiries as an “inquisition,” etc. You don’t have to be a “dyed-in-the-wool left-leaning partisan” to doubt the ethical operation of a polling firm that is promoting lies about the election.

How close does a pollster need to be to election deniers so that I don’t trust it at all? I don’t know. I guess it depends on context, which is a good reason for Elliott to ask specific questions to Rasmussen about their polling methodology. If they’re open about what they’re doing, that’s a good sign; if they give no details, that’s gonna make it harder to trust them. Rasmussen has no duty to respond to those questions, Fivethirtyeight has no duty to include its polls in their analyses, etc etc all down the line.

2. Does it ever make sense to remove a biased poll, rather than including it in your analysis with a statistical correction?

Discarding a data point is equivalent to including it but giving it a weight of zero or, from a Bayesian point of view, allowing it to be biased with an infinite-variance prior on the bias. So we can restate Nate’s very reasonable implied question (why discard Rasmussen polls? Why not just include your skepticism in your model?) as the question: Why not just give the Rasmussen polls a very small weight or, from a Bayesian point of view, allow them to have a bias that has a very large uncertainty?

There are two answers here. The first is that if the weight is very small or the bias has a huge uncertainty, then it’s pretty much equivalent to not including the survey at all. Remember 13. The second answer is that if these surveys are really being manipulated, then there’s no reason to think the bias is consistent. To put it another way: if you don’t think the Rasmussen polls are providing useful information, then you might not want to include them for the same reason that you wouldn’t include a rotten onion in your stew. Sure, one bad onion won’t destroy the taste—it’ll be diluted amid all the other flavors (including those of all the non-rotten onions you’ve thrown in)—but what’s the point?
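To make the equivalence concrete, here’s a toy precision-weighting sketch with made-up numbers (not any pollster’s actual results): give each poll a house-effect bias term, and watch the suspect poll’s weight in the average go to zero as the prior variance on its bias grows.

# Toy numbers, not any pollster's actual results.
polls   <- c(0.52, 0.51, 0.48)   # estimated candidate support from three polls
se      <- c(0.02, 0.02, 0.02)   # sampling standard errors
bias_sd <- c(0, 0, 0.10)         # prior sd of a house-effect bias; the third poll is the suspect one

# With bias ~ N(0, bias_sd^2), each poll's information has total variance se^2 + bias_sd^2.
w <- 1 / (se^2 + bias_sd^2)
round(w / sum(w), 2)                # the suspect poll already gets little weight
round(sum(w * polls) / sum(w), 3)   # the combined estimate barely uses it

bias_sd[3] <- 1e6                   # effectively an infinite-variance prior on the bias
w <- 1 / (se^2 + bias_sd^2)
round(w / sum(w), 2)                # that weight is now essentially zero: same as excluding the poll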

This second answer is as much procedural as substantive: by excluding a pollster entirely, Fivethirtyeight is saying they don’t want to be using numbers that they can’t, on some level, trust. They’re making the procedural point that they have some rules for the polls they include, some red lines that cannot be crossed.

From the other direction, Nate’s plea for Fivethirtyeight to continue including Rasmussen’s polls in its analyses is also a procedural and perception-based argument: he’s making the procedural point that “you need a consistent and legible set of rules” and can’t be making case-by-case decisions.

The funny thing is . . . Nate and Elliott are kind of saying the same thing! Elliott’s saying they’ll be removing Rasmussen unless they follow the rules and Nate’s saying that too. I looked up Fivethirtyeight’s rules for pollsters from when Nate was running the organization and it says “Pollsters must also be able to answer basic questions about their methodology, including but not limited to the polling medium used (e.g., landline calls, text, etc.), the source of their voter files, their weighting criteria, and the source of the poll’s funding.” And they don’t include “‘Nonscientific’ polls that don’t attempt to survey a representative sample of the population or electorate.” So I guess a lot depends on the details; see item 4 below.

3. If you are changing your procedure going forward, is it a mistake to make those changes retroactively on past work?

I have a lot of sympathy for Nate’s argument here. He created the Fivethirtyeight polling averages, then combined this with his interest in sports analytics, worked his butt off for over a decade . . . and now the new team is talking about changing things. It would be kind of like if CRC Press hired someone to create a fourth edition of Bayesian Data Analysis, and the new author decided to remove chapter 6 because it didn’t match his philosophy. I’d be furious! OK, that’s not a perfect analogy because my coauthors and I have copyright on BDA, but the point is that Nate was Fivethirtyeight for a while, so it’s frustrating to think of the historical record being changed.

That said, it’s not clear to me that Elliott is planning to change the historical record. From his quoted letter: “If banned, Rasmussen Reports would also be removed from our historical averages of polls and from our pollster ratings. Your surveys would no longer appear in reporting and we would write an article explaining our reasons for the ban.” It could be that the polls would still be in the database, just flagged and not included in the averages. I think that would be OK.

To put it another way, I think it’s ok to go back and clean up old data, as long as you’re transparent about it.

From a slightly different angle, Nate writes, “There’s also an implicit conflict here about the degree to which journalists should gatekeep or shield the public from potential sources of ‘misinformation.'” I’m not exactly sure of Elliott’s motivations here, but my guess is that his goal is not so much to “shield the public” but rather to come up with more accurate forecasts. Nate argues that including a Republican-biased poll should lead to more accurate forecasts by balancing other polls with systematic polling errors favoring the Democrats. I guess that if Fivethirtyeight going forward is not going to include Rasmussen polls, they’ll have to adjust for possible systematic errors in some other way. That would make sense to me, actually. If you do want to adjust for the possibility of errors on the scale of 2016 or 2020 (polls that showed the Democrats getting approximately 2.5 percentage points more support than they actually received in the vote), then it would make sense to make that adjustment straight up, without relying on Rasmussen to do it for you.
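As a rough sketch of what making that adjustment “straight up” could look like (illustrative numbers only, not the actual Fivethirtyeight or Economist models), you can carry an explicit shared-polling-error term rather than counting on opposite house effects to cancel:

# Illustrative numbers only.
poll_avg     <- 0.53     # hypothetical national polling average for the Democratic candidate
avg_se       <- 0.01     # uncertainty in the polling average itself
sys_error_sd <- 0.025    # allowance for a shared polling error on the scale of the 2016/2020 misses

total_sd <- sqrt(avg_se^2 + sys_error_sd^2)
round(poll_avg + c(-1.96, 1.96) * total_sd, 3)   # interval that admits a possible shared miss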

4. Is it appropriate to send a letter to one polling organization without going through the equivalent process with all the other pollsters whose data you’re using?

I have no idea what’s been going on between Fivethirtyeight and Rasmussen and between Fivethirtyeight and other polling organizations. The quoted letter from Elliott to Rasmussen begins, “I am emailing you to send a final notice . . .”, so it seems safe to assume this is just one in a series of communications, and we haven’t seen the others that came before.

Nate writes, “I think it’s clear that the letter is an ad hoc exercise to exclude Rasmussen, not an effort to develop a consistent set of standards.” My guess is that it’s neither an ad hoc exercise to exclude Rasmussen, nor an effort to develop a consistent set of standards, but rather that it’s an effort to apply an imperfect set of standards. Rules such as “Pollsters must also be able to answer basic questions about their methodology, including but not limited to . . .” and “‘Nonscientific’ polls that don’t attempt to survey a representative sample” are imperfect—but that’s the nature of rules.

I guess what I’m saying is that it’s hard to compare Fivethirtyeight’s interactions with Rasmussen with their interactions with other pollsters, given that (a) we don’t know what their interactions with Rasmussen are, and (b) we don’t know what their interactions with other pollsters are.

Let me just say that this sort of thing is always challenging, as there’s no way to have completely consistent rules. For example, we have good reasons to doubt that Brian Wansink ever used his famous bottomless soup bowl in any actual experiment. Do we apply this level of scrutiny to the apparatus described in every peer-reviewed research article? No, first because this would require an immense amount of effort, and second because “this level of scrutiny” is not even defined. It’s judgment calls all the way down. Fivethirtyeight has a necessarily ambiguous policy on what polls they will include in their analyses—there’s no way for such a policy to not have some ambiguity—and Nate and Elliott are making different judgment calls on whether Rasmussen violates the policy.

Having this discussion

Unfortunately there hasn’t been much of a conversation on this poll-inclusion issue, which I guess is no surprise given that Nate (indirectly) called Elliott a bullshitter and explicitly writes, “I don’t intend this a back-and-forth.” Which is too bad, given that we’ve had good conversations on forecasting before.

It’s easier for me to have this discussion because I know both Nate and Elliott. I don’t know either of them well on a personal level, but I’ve collaborated with both of them (for example, here and here) and I think they both do great work. I’ve criticized Nate’s forecasting procedure; then again, I’ve also criticized Elliott’s, even though (or especially because) it was done in collaboration with me.

To say I like both of them is not an attempt to put myself above the fray or to characterize their disagreements as minor. People often get themselves into positions where they are legitimately angry at each other—it’s happened to me plenty of times! The main point of the present post is that the decisions Elliott is making regarding which polls to include in his analysis, and the questions that Nate is asking, are challenging, with no easy answers.

P.S. Here’s a brief summary of statistical concerns with the 2020 presidential election forecasts from the Economist and Fivethirtyeight. tl;dr: both had problems, in different ways.

Some challenges with existing election forecasting methods

With the presidential election season coming up (not that it’s ever ended), here’s a quick summary of the problems/challenges with two poll-based forecasting methods from 2020.

How this post came about: I have a post scheduled about a dispute between election forecasters Elliott Morris and Nate Silver about whether the site Fivethirtyeight.com should be including polls from the Rasmussen organization in their analyses.

At the end of the post I had a statistical discussion about the weaknesses of existing election forecasting methods . . . and then I realized that this little appendix was the most interesting thing in my post!

Whether Fivethirtyeight includes Rasmussen polls is a very minor issue, first because Rasmussen is only one pollster and second because if you do include their polls, any reasonable approach would be to give them a very low weight or a very large adjustment for bias. So in practice for the forecast it doesn’t matter so much if you include those polls, although I can see that from a procedural standpoint it can be challenging to come up with a rule to include or exclude them.

Now for the more important and statistically interesting stuff.

Key issues with the Fivethirtyeight forecast from 2020

They start with a polling average and then add weights and adjustments; see here for some description. I think the big challenge here is that the approach of adding fudge factors makes it difficult to add uncertainty without creating weird artifacts in the joint distribution, as discussed here and here. Relatedly, they don’t have a good way to integrate information from state and national polls. The issue here is not that they made a particular technical error; rather, they’re using a method that starts in a simple and interpretable way but then just gets harder and harder to hold together.

Key issues with the Economist forecast from 2020

From the other direction, the weakness of the Economist forecast (which I was involved in) was a lack of robustness to modeling and conceptual errors. Consider that we had to overhaul our forecast during the campaign. Also, our forecasts had some problems with uncertainties: weird things relating to some choices in how we modeled between-state correlations of polling errors and time trends. I don’t think there’s any reason that a Bayesian forecast should necessarily be overconfident and non-robust to conceptual errors in the model, but that’s what seems to have happened with us. In contrast, the Fivethirtyeight approach was more directly empirical, which as noted above had its own problems but didn’t have a bias toward overconfidence.

Key issues with both forecasts

Both of the 2020 presidential election forecasts had difficulty handling data other than horse-race polls. The challenging information included: economic and political “fundamentals,” which were included in the forecasts but with some awkwardness, in part because these variables themselves change over time during the campaign; known polling biases such as differential nonresponse; knowledge of systematic polling errors in previous elections; issues specific to the election at hand (street protests, covid, Clinton’s email server, Trump’s sexual assaults, etc.); issue attitudes in general, to the extent they were not absorbed into horse-race polling; estimates of turnout; vote suppression; and all sorts of other data sources such as new-voter registration numbers. All these came up as possible concerns with forecasts, and it’s not so easy to include them in a forecast. No easy answers here—at some level we just need to be transparent and people can take our forecasts as data summaries—but these concerns arise in every election.

The Economist is hiring a political data scientist to do election modeling!

Dan Rosenheck writes:

Our data-journalism team conducts original quantitative research, deploying cutting-edge statistical methods to ask and answer important, relevant questions about politics, economics and society. We are looking to hire a full-time political data scientist. . . .

The data scientist will oversee all of our poll aggregators and predictive models for elections. This entails learning all of the code and calculations behind our existing politics-related technical projects, such as our forecasting systems for presidential and legislative elections in the United States, France and Germany; proposing and implementing improvements to them; and setting up and maintaining data pipelines to keep them updated regularly once they launch. The data scientist will also have the opportunity to design and build new models and trackers on newsworthy subjects.

It’s a permanent, full-time staff position, to replace Elliott Morris, who worked on the 2020 forecast with Merlin and me (see here for one of our blog posts and here and here for relevant academic articles).

Sounds like a great job for a statistician or a political scientist, and I hope I’ll have the opportunity to work with whoever the Economist’s new data scientist is. We built a hierarchical model and fit it in Stan!

Joe Simmons, Leif Nelson, and Uri Simonsohn agree with us regarding the much publicized but implausible and unsubstantiated claims of huge effects from nudge interventions

We wrote about this last year in our post, PNAS GIGO QRP WTF: This meta-analysis of nudge experiments is approaching the platonic ideal of junk science and our followup PNAS article, No reason to expect large and consistent effects of nudge interventions:

The article in question is called “The effectiveness of nudging: A meta-analysis of choice architecture interventions across behavioral domains” . . . From the abstract of the paper:

Our results show that choice architecture interventions [“nudging”] overall promote behavior change with a small to medium effect size of Cohen’s d = 0.45 . . .

Wha . . .? An effect size of 0.45 is not “small to medium”; it’s huge. Huge as in implausible that these little interventions would shift people, on average, by half a standard deviation. I mean, sure, if the data really show this, then it would be notable—it would be big news—because it would be a huge effect.

The claim of an average effect of 0.45 standard deviations does not, by itself, make the article’s conclusions wrong—it’s always possible that such large effects exist—but it’s a bad sign, and labeling it as “small to medium” points to a misconception that reminds us of the process whereby these upwardly biased estimates get published.

The article in question referred to about 200 previous papers, including 11 articles authored or coauthored by the discredited food researcher Brian Wansink, and also a notorious retracted paper coauthored by the controversial dishonesty researcher Dan Ariely.

But, I continued last year:

I would not believe the results of this meta-analysis even if it did not include any of the above 12 papers, as I don’t see any good reason to trust the individual studies that went into the meta-analysis. It’s a whole literature of noisy data, small sample sizes, and selection on statistical significance, hence massive overestimates of effect sizes. This is not a secret: look at the papers in question and you will see, over and over again, that they’re selecting what to report based on whether the p-value is less than 0.05. The problem here is not the p-value—I’d have a similar issue if they were to select on whether the Bayes factor is greater than 3, for example—rather, the problem is the selection, which induces noise (through the reduction of continuous data to a binary summary) and bias (by not allowing small effects to be reported at all).

tl;dr. It’s a literature full of junk, and the inclusion of 12 discredited papers is a problem, not just in itself, but in that it indicates the lack of care or quality control that went into putting together the papers that went into the meta-analysis. Either the authors carefully vetted each paper in the meta-analysis—in which case, they did a rotten job of vetting, given the Wansink and Ariely papers that got in—or they didn’t vet, in which case I don’t know why we’re supposed to believe the others there. So, no, I don’t believe any of the conclusions of that meta-analysis. GIGO.
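To see the mechanics of that selection bias, here is a minimal simulation sketch with made-up numbers: small true effects, noisy studies, and reporting conditional on p < 0.05 yield published estimates that are inflated by an order of magnitude.

set.seed(123)
n_studies <- 10000
true_d    <- 0.05                  # small true standardized effect
se_d      <- sqrt(2 / 50)          # approximate standard error of Cohen's d with 50 per arm

d_hat <- rnorm(n_studies, true_d, se_d)
p_val <- 2 * pnorm(-abs(d_hat / se_d))

mean(d_hat)                        # all studies: close to the true 0.05
mean(abs(d_hat[p_val < 0.05]))     # "statistically significant" studies only: roughly 0.45 to 0.5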

Still and all,

I indeed think that “nudging” has been oversold, but the underlying idea—“choice architecture” or whatever you want to call it—is important. Defaults can make a big difference sometimes.

Here’s how we put it in our PNAS article:

Nudge interventions may work, under certain conditions, but their effectiveness can vary to a great degree, and the conditions under which they work are barely identified in the literature. . . . the authors [of the meta-analysis under discussion] focus their conclusions on this average value and on subgroups, leaving aside the large degree of unexplained heterogeneity in apparent effects across published studies. For example, despite the analyses above being consistent with a large proportion of studies having near-zero underlying effects, the authors conclude that nudges work “across a wide range of behavioral domains, population segments, and geographical locations.”

We summarized:

As a scientific field, instead of focusing on average effects, we need to understand when and where some nudges have huge positive effects and why others are not able to repeat those successes. Until then, with a few exceptions [e.g., defaults], we see no reason to expect large and consistent effects when designing nudge experiments or running interventions.

New post by Simmons, Nelson, and Simonsohn

More recently, the psychology researchers Joe Simmons, Leif Nelson, and Uri Simonsohn wrote a post agreeing with us! They refer to the meta-analysis as estimating a “meaningless mean” (a point which I agree with regarding nudges, for reasons discussed above, and also more generally, as discussed here).

In their post, they link to our PNAS article and write, “letters published in PNAS, responding to this article, have proposed that the overall average effect may be much smaller – perhaps as low as zero – when correcting for publication bias.” Just to be clear, in our article, we do not propose an “overall average effect” of nudging; rather, as indicated in the above-quoted passage, we argue against the idea of talking about an “average effect” and we fault the article under discussion for attempting to use their meta-analysis to make such broad claims.

Simmons et al. go beyond what we did by looking in detail at a few of the papers cited in the meta-analysis. They find that it does not make much sense to combine the effect sizes from these different studies. I’m not surprised—see my comments above about vetting—but there’s a big difference between “I’m not surprised” and actually seeing the details, so I think Simmons et al. have made a useful contribution by getting down and dirty and reading those papers.

What they found does not change our conclusions—as Simmons et al. say in footnote 10 of their post, their conclusions are consistent with the point we make in our PNAS article that “instead of focusing on average effects, we need to understand when and where some nudges have huge positive effects and why others are not able to repeat those successes”—but, again, it’s good to see this common-sense conclusion backed up by a careful look at some of those individual studies.

You might ask, Why didn’t the authors of the original meta-analysis look so carefully at the individual papers they were citing? A quick answer is that, had they done so, they wouldn’t have been able to make those dramatic claims that got their paper published in PNAS. A longer answer is selection bias: the kinds of researchers who read the individual papers carefully won’t be publishing articles making those sorts of ridiculous claims, and unfortunately in the current academic and science-communication media environment, it’s often the outrageous claims that get the attention. Not always—skepticism can sometimes get publicity too—but I think the balance still favors extreme and implausible claims over reasoned common sense.

So, yeah, it’s good to see Simmons et al. putting in the effort to read those papers. You might say that this was way too much effort to spend following up on a piece of GIGO, but recall the Javert paradox and recall what they say about the dead horse. Sometimes it’s valuable for researchers to put in their expertise to explain how other people can get things so wrong. I’m a statistician so I focused on the statistical problems of working with a bunch of estimates with large and unknown biases, estimating different things; Simmons et al. are psychologists so they’re focusing on the substance of how those studies are different. And we’re in agreement that it’s annoying when researchers seem to think that statistical tools such as meta-analysis can resolve major data problems or major conceptual problems. Again, GIGO.

blme: Bayesian Linear Mixed-Effects Models

The problem:

When fitting multilevel models, the group-level variance parameters can be difficult to estimate. Posterior distributions are wide, and point estimates are noisy. The maximum marginal likelihood estimate of the variance parameter can often be zero, which is a problem for software such as lme4 that is based on this marginal mode. For models with multiple varying coefficients (varying-intercept, varying-slope models), the bigger the group-level covariance matrix, the more likely it is that its max marginal likelihood estimate will be degenerate. This leads to computational problems as well as problems with the estimated coefficients, as they get not just partially pooled but completely pooled toward the fitted model.

The solution:

Priors. Zero-avoiding or boundary-avoiding priors to avoid zero or degenerate estimates of group-level variances, also informative priors to get more reasonable estimates when the number of groups is small.

The research papers:

[2013] A nondegenerate estimator for hierarchical variance parameters via penalized likelihood estimation. Psychometrika 78, 685–709. (Yeojin Chung, Sophia Rabe-Hesketh, Andrew Gelman, Jingchen Liu, and Vincent Dorie)

[2014] Weakly informative prior for point estimation of covariance matrices in hierarchical models. Journal of Educational and Behavioral Statistics 40, 136–157. (Yeojin Chung, Andrew Gelman, Sophia Rabe-Hesketh, Jingchen Liu, and Vincent Dorie)

The R package:

blme: Bayesian Linear Mixed-Effects Models, by Vince Dorie, Doug Bates, Martin Maechler, Ben Bolker, and Steven Walker
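Here’s a small illustrative sketch of the kind of comparison the papers describe, using simulated data (check the package documentation for the exact prior specifications): with few groups and a small group-level variance, lmer can return a group-level standard deviation of exactly zero, while blmer with a boundary-avoiding gamma prior keeps the estimate off the boundary.

library(lme4)
library(blme)

set.seed(1)
# Few groups, few observations per group, small true group-level sd:
# the setting where the marginal-likelihood mode is often exactly zero.
J <- 8; n_per <- 5
group <- rep(1:J, each = n_per)
y <- 0.2 * rnorm(J)[group] + rnorm(J * n_per)
dat <- data.frame(y = y, group = factor(group))

fit_ml <- lmer(y ~ 1 + (1 | group), data = dat)
fit_b  <- blmer(y ~ 1 + (1 | group), data = dat, cov.prior = gamma)  # boundary-avoiding prior

VarCorr(fit_ml)   # the group sd may be estimated as exactly 0
VarCorr(fit_b)    # the group sd is pulled away from the boundary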

Going forward:

blme is great but we’d also like to have full Bayes. Stan does full Bayes but can be slow if you have a lot of data and a lot of groups. Just for example, suppose you have longitudinal data with 5 observations on each of 100,000 people. Then a hierarchical model will have hundreds of thousands of parameters—that’s a lot! On the other hand, the Bayesian central limit theorem should be working in your favor (see appendix B of BDA, for example). So some combination of approximate and full Bayesian inference should work.

Also, lme4, and even blme, can have trouble when you have lots of variance parameters running around. And lme4 has its own issues, which blme unfortunately inherits, regarding computation with empty groups and the like; these should not really be a problem for Bayesian inference with informative priors.

Right now, though, we don’t have this best-of-both-worlds Bayesian solution that does full Bayes when computationally feasible and uses appropriate approximations otherwise. So blme is part of our toolbox. Thanks to Vince!

Multilevel regression and poststratification (MRP) vs. survey sample weighting

Marnie Downes and John Carlin write:

Multilevel regression and poststratification (MRP) is a model-based approach for estimating a population parameter of interest, generally from large-scale surveys. It has been shown to be effective in highly selected samples, which is particularly relevant to investigators of large-scale population health and epidemiologic surveys facing increasing difficulties in recruiting representative samples of participants. We aimed to further examine the accuracy and precision of MRP in a context where census data provided reasonable proxies for true population quantities of interest. We considered 2 outcomes from the baseline wave of the Ten to Men study (Australia, 2013–2014) and obtained relevant population data from the 2011 Australian Census. MRP was found to achieve generally superior performance relative to conventional survey weighting methods for the population as a whole and for population subsets of varying sizes. MRP resulted in less variability among estimates across population subsets relative to sample weighting, and there was some evidence of small gains in precision when using MRP, particularly for smaller population subsets. These findings offer further support for MRP as a promising analytical approach for addressing participation bias in the estimation of population descriptive quantities from large-scale health surveys and cohort studies.

This article appeared in 2020 but I just happened to hear about it now.

Here’s the result from the first example considered by Downes and Carlin:

For the dichotomous labor-force outcome, MRP produced very accurate population estimates, particularly at the national level and for the larger states, where the employment rate was estimated within 1% in each case. For the smallest states of ACT and NT, MRP overestimated the employment rate by approximately 5%. Post-hoc analyses revealed that these discrepancies could be explained partly by important interaction terms that were evident in population data but not included in multilevel models due to insufficient data. For example, based on Census data, there was a much higher proportion of Indigenous Australians living in NT (25%) compared with all other states (<5%), but only 3% (n = 2) of the Ten to Men sample recruited from NT identified as Indigenous. There were also differences in the labor-force status of Indigenous Australians by state according to the Census: 90% of Indigenous Australians residing in ACT were employed compared with 79% residing in NT. Due to insufficient data available, it was not possible to obtain a meaningful estimate of this Indigenous status-by-state interaction effect.

And here’s their second example:

For the continuous outcome of hours worked, the performance of MRP was less impressive, with population quantities consistently overestimated by approximately 2 hours at the national level and for the larger states and by up to 4 hours for the smaller states. MRP still however, outperformed both unweighted and weighted estimation in most cases. The inaccuracy of all 4 estimation methods for this outcome likely reflects that the 2011 Census data for hours worked was not a good proxy for the true population quantities being estimated by the Ten to Men baseline survey conducted in 2013–2014. It is entirely plausible that the number of hours worked in all jobs in a given week could fluctuate considerably due to temporal factors and a wide range of individual-level covariates not included in our multilevel model. This was also evidenced by the large amount of residual variation in the multilevel model for this outcome.

Downes and Carlin summarize what they learned from the examples:

The increased consistency among state-level estimates achieved by MRP can be attributed to the partial pooling of categorical covariate parameter estimates toward their mean in multilevel modeling. This was particularly evident in the estimation of labor-force status for the smaller states of TAS, ACT, and NT, where MRP estimates fell part of the way between the unweighted state estimates and the national MRP estimate, with the degree of shrinkage reflecting the relative amount of information available about the individual state and all the states combined.

We did not observe, in this study, the large gains in precision achieved with MRP seen in our previous case study and simulation study. The multilevel models fitted here were more complex, including a larger number of covariates and multiple interaction effects. While we have sacrificed precision, this increased model complexity appears to have achieved increased accuracy. We did see small gains in precision when using MRP, particularly for the smaller states, and we might expect these gains to be larger for smaller sample sizes where the benefits of partial pooling in multilevel modeling would be greater.

Also:

The employment outcome measures considered in this study are not health outcomes per se; rather, they were chosen in the absence of any health outcomes for which census data were available to provide a comparison in terms of accuracy. We have no reason to expect MRP would behave any differently for outcome measures more commonly under investigation in population health or epidemiologic studies.

MRP can often lead to a very large number of poststratification cells. Our multilevel models generated 60,800 unique poststratification cells. With a total population size of 4,990,304, almost three-fourths of these cells contained no population data. This sparseness is not a problem, however, due to the smoothing of the multilevel model and the population cell counts used simply as weights in poststratification.

They conclude:

Results of this case-study analysis further support previous findings that MRP provides generally superior performance in both accuracy and precision relative to the use of conventional sample weighting for addressing potential participation bias in the estimation of population descriptive quantities from large-scale health surveys. Future research could involve the application of MRP to more complex problems such as estimating changes in prevalence over time in a longitudinal study or developing some user-friendly software tools to facilitate more widespread usage of this method.

It’s great to see people looking at these questions in detail. Mister P is growing up!
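For readers who haven’t seen the mechanics, here’s a minimal sketch of the poststratification step, with hypothetical data frames and variable names (a simple rstanarm multilevel logistic regression, not the more complex models in the paper; it assumes the census cell table uses the same category codings as the survey):

library(rstanarm)

# survey: individual-level data with a binary outcome 'employed' and covariates age_cat, state
# census: one row per poststratification cell, same covariates, plus the cell count N
fit <- stan_glmer(employed ~ (1 | age_cat) + (1 | state),
                  data = survey, family = binomial, refresh = 0)

# Predicted outcome probability for every census cell, for every posterior draw.
cell_pred <- posterior_epred(fit, newdata = census)          # draws x cells

# National estimate: weight cells by their population counts.
natl_draws <- cell_pred %*% census$N / sum(census$N)
quantile(natl_draws, c(0.05, 0.5, 0.95))

# State estimates: weight within each state's cells.
state_est <- sapply(split(seq_len(nrow(census)), census$state), function(idx)
  median(cell_pred[, idx, drop = FALSE] %*% census$N[idx] / sum(census$N[idx])))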

Poll aggregation: “Weighting” will never do the job. You need to be able to shift the estimates, not just reweight them.

Palko points us to this post, Did Republican-Leaning Polls “Flood the Zone” in 2022?, by Joe Angert et al., who write:

As polling averages shifted towards Republicans in the closing weeks of the 2022 midterms, one interpretation was that Americans were reverting to the usual pattern of favoring out-party candidates. Other observers argued that voter intentions were not changing and that the shift was driven by the release of a disproportionate number of pro-Republican polls – an argument supported by the unexpectedly favorable results for Democratic candidates on Election Day.

They continue:

We are not alleging a conspiracy among Republican pollsters to influence campaign narratives. . . . Even so, our results raise new concerns about the use of polling averages to assess campaign dynamics. A shift from one week to another may reflect changes in underlying voter preferences but can also reflect differences in the types of polls used to construct polling averages. This concern is particularly true for sites that aggregate polls without controlling for house effects (pollster-specific corrections for systematic partisan lean). . . .

Our results are also salient for aggregators who use pollster house effects to adjust raw polling data. In theory, these corrections remove poll-specific partisan biases, allowing polling averages to be compared week-to-week, even given changes in the types of polls being released. However, in most cases, aggregators use black-box models to estimate and incorporate house effects, making it impossible to assess the viability of this strategy. . . .

There’s a statistical point here, too, which is that additive “house effects” can appropriately shift individual polls so that even biased polls can supply information, but “weighting” can never do this. You need to move the numbers, not just reweight them.
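A toy example of the point, with made-up numbers: if every available poll leans the same way, no set of nonnegative weights can recover the truth, but subtracting an estimated house effect can.

# Made-up example: true support is 0.50 but every poll leans about 2 points high.
polls <- c(0.52, 0.53, 0.52, 0.51)
house_effect <- 0.02                   # estimated, say, from past elections

range(polls)                           # any weighted average stays inside this range, above 0.50
mean(polls - house_effect)             # shifting by the estimated house effect gets back to 0.50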

BDA: Bayesian Dolphin Analysis

Matthieu Authier writes:

Here is a simulation study using regularized regression with post-stratification to estimate dolphin bycatch from non-representative samples. The Stan code is accessible here.

We’ve also used RRP on a case study with samples from France, where we know that independent observers are preferentially allowed on boats when dolphin bycatch is low (a report on that is being written at the moment, and it will be the second part of the dead dolphin duology for RRP). RRP is giving more plausible estimates in this case.

For those not familiar with the jargon, “bycatch” is “the non-intentional capture or killing of non-target species in commercial or recreational fisheries.”

It’s great to see MRP and Stan being used for all sorts of real problems.

P.S. Authier has an update:

The article has been published and vastly improved thanks to peer review. The simulation study is here and the case study is here.

Stan was instrumental in both cases to be able to fit the models.

The model we developed is now used to update bycatch estimates in the Bay of Biscay.

Open problem: How to make residual plots for multilevel models (or for regularized Bayesian and machine-learning predictions more generally)?

Adam Sales writes:

I’ve got a question that seems like it should be elementary, but I haven’t seen it addressed anywhere (maybe I’m looking in the wrong places?)

When I try to use binned residual plots to evaluate a multilevel logistic regression, I often see a pattern like this (from my student, fit with glmer):

I think the reason is because of partial pooling/shrinkage of group-level intercepts being shrunk towards the grand mean.

I was able to replicate the effect (albeit kind of mirror-imaged—the above plot was from a very complex model) with fake data:

library(lme4)   # glmer
library(arm)    # binnedplot

# Simulate binary outcomes with a group-level intercept and no other predictors
makeData <- function(ngroup=100, groupSizeMean=10, reSD=2){
  groupInt <- rnorm(ngroup, sd=reSD)                 # true group-level intercepts
  groupSize <- rpois(ngroup, lambda=groupSizeMean)   # number of observations per group
  groups <- rep(1:ngroup, times=groupSize)
  n <- sum(groupSize)
  data.frame(group=groups, y=rbinom(n, size=1, prob=plogis(groupInt[groups])))
}
dat <- makeData()
mod <- glmer(y ~ (1|group), data=dat, family=binomial)
# Binned residuals plotted against the fitted probabilities
binnedplot(predict(mod, type='response'), resid(mod, type='response'))

Model estimates (i.e., point estimates of the parameters from a hierarchical model) of extreme group effects are shrunk towards 0 (the grand mean intercept in this case), except at the very edges, where the 0-1 bound forces the residuals to be small in magnitude (I expect the pattern would be linear on the log-odds scale).

When I re-fit the same model on the same data with rstanarm and looked at the fitted values I got basically the same result.

On the other hand, when looking at 9 random posterior draws the pattern mostly goes away:

Now here come the questions: is this really a general phenomenon, like I think it is? If so, what does it mean for the use of binned residual plots for multilevel logistic regression, or really any time there's shrinkage or partial pooling? Can binned residual plots be helpful for models fit with glmer, or only by plotting individual posterior draws from a Bayesian posterior distribution?

My reply: Yes, the positive slope for resid vs expected value . . . that would never happen in least-squares regression, so, yeah, it has to do with partial pooling. We should think about what's the right practical advice to give here. Residual plots are important.

As you note with your final graph above, the plots should have the right behavior (no slope when the model is correct) when plotting the residuals relative to the simulated parameter values. This is what Xiao-Li, Hal, and I called "realized discrepancies" in our 1996 paper on posterior predictive checking, but then in our 2000 paper on diagnostic checks for discrete-data regression models using posterior predictive simulations, Yuri, Francis, Ivan, and I found that the use of realized discrepancies added lots of noise in residual plots.

What we'd like is an approach that gives us the clean comparisons but without the noise.
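As a sketch of what that looks like in practice, continuing the simulated example above (and assuming an rstanarm refit of the same model), one can plot binned residuals against individual posterior draws rather than against the point-estimate fitted values:

library(rstanarm)
library(arm)

fit <- stan_glmer(y ~ (1 | group), data = dat, family = binomial, refresh = 0)

# Fitted probabilities for each posterior draw (a draws x observations matrix).
epred <- posterior_epred(fit)

# Binned residual plots for 9 random draws, each computed against that draw's
# own fitted probabilities; this is where the shrinkage artifact mostly fades.
par(mfrow = c(3, 3))
for (s in sample(nrow(epred), 9)) {
  binnedplot(epred[s, ], dat$y - epred[s, ])
}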

Nationally poor, locally rich: Income and local context in the 2016 presidential election

Thomas Ogorzalek, Spencer Piston, and Luisa Godinez Puig write:

When social scientists examine relationships between income and voting decisions, their measures implicitly compare people to others in the national economic distribution. Yet an absolute income level . . . does not have the same meaning in Clay County, Georgia, where the 2016 median income was $22,100, as it does in Old Greenwich, Connecticut, where the median income was $224,000. We address this limitation by incorporating a measure of one’s place in her ZIP code’s income distribution. We apply this approach to the question of the relationship between income and whites’ voting decisions in the 2016 presidential election, and test for generalizability in elections since 2000. The results show that Trump’s support was concentrated among nationally poor whites but also among locally affluent whites, complicating claims about the role of income in that election. This pattern suggests that social scientists would do well to conceive of income in relative terms: relative to one’s neighbors.
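To operationalize “one’s place in her ZIP code’s income distribution,” one could compute a within-ZIP income percentile alongside the national one; here is a minimal sketch with hypothetical column names:

# Hypothetical data frame 'survey' with columns 'income' and 'zip'.
survey$natl_income_pct  <- rank(survey$income) / nrow(survey)
survey$local_income_pct <- ave(survey$income, survey$zip,
                               FUN = function(x) rank(x) / length(x))

# Both measures could then enter a vote-choice model, e.g.:
# glm(trump_vote ~ natl_income_pct + local_income_pct, family = binomial, data = survey)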

Good to see that people are continuing to work on this Red State Blue State stuff.

P.S. Regarding the graph above: They should’ve included the data too. It would’ve been easy to put in points for binned data just on top of the plots they already made. Clear benefit requiring close to zero effort.

Replacing the “zoo of named tests” by linear models

Gregory Gilderman writes:

The semi-viral tweet thread by Jonas Lindeløv linked below advocates abandoning the “zoo of named tests” for Stats 101 in favor of mathematically equivalent (I believe this is the argument) varieties of linear regression:

As an adult learner of statistics, perhaps only slightly beyond the 101 level, and an R user, I have wondered what the utility of some of these tests are when regression seems to get the same job done.

I believe this is of wider interest than my own curiosity and would love to hear your thoughts on your blog.

My reply: I don’t agree with everything in Lindeløv’s post—in particular, he doesn’t get into the connection between analysis of variance and multilevel models, and sometimes he’s a bit too casual with the causal language—but I like the general flow, the idea of trying to use a modeling framework and to demystify the zoo of tests. Lindeløv doesn’t mention Regression and Other Stories, but I think he’d like it, as it follows the general principle of working through linear models rather than presenting all these tests as if they are separate things.

Also, I agree 100% with Lindeløv that things like the Wilcoxon test are best understood as linear models applied to rank-transformed data. This is a point we made in the first edition of BDA way back in 1995, and we’ve also blogged it on occasion, for example here. So, yeah, I’m glad to see Lindeløv’s post and I hope that people continue to read it.
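As a quick illustration of the rank-transform equivalence (simulated data; the correspondence is approximate but very close for moderate sample sizes):

set.seed(1)
outcome <- c(rnorm(40), rnorm(40, 0.5))
group   <- factor(rep(c("a", "b"), each = 40))

wilcox.test(outcome ~ group)              # the named test (Wilcoxon rank-sum / Mann-Whitney)
summary(lm(rank(outcome) ~ group))        # a linear model on the rank-transformed outcome

The p-value for the group coefficient in the regression on ranks comes out very close to the Wilcoxon p-value.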

Multilevel modeling to make better decisions using data from schools: How can we do better?

Michael Nelson writes:

I wanted to point out a paper, Stabilizing Subgroup Proficiency Results to Improve the Identification of Low-Performing Schools, by Lauren Forrow, Jennifer Starling, and Brian Gill.

The authors use Mr. P to analyze proficiency scores of students in subgroups (disability, race, FRL, etc.). The paper’s been getting a good amount of attention among my education researcher colleagues. I think this is really cool—it’s the most attention Mr. P’s gotten from ed researchers since your JREE article. This article isn’t peer reviewed, but it’s being seen by far more policymakers than any journal article would.

All the more relevant that the authors’ framing of their results is fishy. They claim that some schools identified as underperforming, based on mean subgroup scores, actually aren’t, because they would’ve gotten higher means if the subgroup n’s weren’t so small. They’re selling the idea that adjustment by poststratification (which they brand as “score stabilization”) may rescue these schools from their “bad luck” with pre-adjustment scores. What they don’t mention is that schools with genuinely underperforming (but small) subgroups could be misclassified as well-performing if they have “good luck” with post-adjustment scores. In fact, they don’t use the word “bias” at all, as in: “Individual means will have less variance but will be biased toward the grand mean.” (I guess that’s implied when they say the adjusted scores are “more stable” rather than “more accurate,” but maybe only to those with technical knowledge.)

And bias matters as much as variance when institutions are making binary decisions based on differences in point estimates around a cutpoint. Obviously, net bias up or down will be 0, in the long run, and over the entire distribution. But bias will always be net positive at the bottom of the distribution, where the cutpoint is likely to be. Besides, relying on net bias and long-run performance to make practical, short-run decisions seems counter to the philosophy I know you share, that we should look at individual differences not averages whenever possible. My fear is that, in practice, Mr. P might be used to ignore or downplay individual differences–not just statistically but literally, given that we’re talking about equity among student subgroups.

To the authors’ credit, they note in their limitations section that they ought to have computed uncertainty intervals. They didn’t, because they didn’t have student-level data, but I think that’s a copout. If, as they note, most of the means that moved from one side of the cutoff to the other are quite near it already, you can easily infer that the change is within a very narrow interval. Also to their credit, they acknowledge that binary choices are bad and nuance is good. But, also to their discredit, the entire premise of their paper is that the education system will, and presumably should, continue using cutpoints for binary decisions on proficiency. (That’s the implication, at least, of the US Dept. of Ed disseminating it.) They could’ve described a nuanced *application* of Mr. P, or illustrated the absurd consequences of using their method within the existing system, but they didn’t.

Anyway, sorry this went so negative, but I think the way Mr. P is marketed to policymakers, and its potential unintended consequences, are important.

Nelson continues:

I’ve been interested in this general method (multilevel regression with poststratification, MRP) for a while, or at least the theory behind it. (I’m not a Bayesian so I’ve never actually used it.)

As I understand it, MRP takes the average over all subgroups (their grand mean) and moves the individual subgroup means toward that grand mean, with smaller subgroups getting moved more. You can see this in the main paper’s graphs, where low means go up and high means go down, especially on the left side (smaller n’s). The grand mean will be more precise and more accurate (due to something called superefficiency), while the individual subgroup means will be much more precise but can also be much more biased toward the grand mean. The rationale for using the biased means is that very small subgroups give you very little information beyond what the grand mean is already telling you, so you should probably just use the grand mean instead.

In my view, that’s an iffy rationale for using biased subgroup proficiency scores, though, which I think the authors should’ve emphasized more. (Maybe they’ll have to in the peer-reviewed version of the paper.) Normally, bias in individual means isn’t a big deal: we take for granted that, over the long run, upward bias will be balanced out by downward bias. But, for this method and this application, the bias won’t ever go away, at least not where it matters. If what we’re looking at is just the scores around the proficiency cutoff, that’s generally going to be near the bottom of the distribution, and means near the bottom will always go up. As a result, schools with “bad luck” (as the authors say) will be pulled above the cutoff where they belong, but so will schools with subgroups that are genuinely underperforming.

I have a paper under review that derives a method for correcting a similar problem for effect sizes—it moves individual estimates not toward a grand mean but toward the true mean, in a direction and distance determined by a measure of the data’s randomness.

I kinda see what Nelson is saying, but I still like the above-linked report because I think that in general it is better to work with regularized, partially-pooled estimates than with raw estimates, even if those raw estimates are adjusted for noise or multiple comparisons or whatever.

To help convey this, let me share a few thoughts regarding hierarchical modeling in this general context of comparing averages (in this case, from different schools, but similar issues arise in medicine, business, politics, etc.).

1. Many years ago, Rubin made the point that, when you start with a bunch of estimates and uncertainties, classical multiple comparisons adjustments effectively work by increasing the standard errors so that fewer comparisons are statistically significant, whereas Bayesian methods move the estimates around. Rubin’s point was that you can get the right level of uncertainty much more effectively by moving the intervals toward each other rather than by keeping their centers fixed and then making them wider. (I’m thinking now that a dynamic visualization would be helpful to make this clear.)

It’s funny because Bayesian estimates are often thought of as trading bias for variance, but in this case the Bayesian estimate is so direct, and it’s the multiple comparisons approaches that do the tradeoff, getting the desired level of statistical significance by effectively making all the intervals wider and thus weakening the claims that can be made from data. It’s kinda horrible that, under the classical approach, your inferences for particular groups and comparisons will on expectation get vaguer as you get data from more groups.

We explored this idea in our 2000 article, Type S error rates for classical and Bayesian single and multiple comparison procedures (see here for freely-available version) and more thoroughly in our 2011 article, Why we (usually) don’t have to worry about multiple comparisons. In particular, see the discussion on pages 196-197 of that latter paper (see here for freely-available version).
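Here’s a toy simulation of Rubin’s point, with invented numbers (this is my own sketch, not an analysis from either of those articles): the Bonferroni-adjusted intervals keep their centers and get wider as the number of groups grows, while the partially pooled intervals move their centers toward each other, stay short, and still cover the true effects at roughly the nominal rate.

```python
# Toy comparison (invented numbers): Bonferroni widens fixed-center intervals,
# while hierarchical partial pooling shrinks the centers and keeps intervals narrow.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
J, tau, se = 50, 1.0, 1.0            # groups, between-group sd, within-group se (assumed known)
theta = rng.normal(0, tau, J)        # true group effects
y = rng.normal(theta, se)            # observed group estimates

# Classical intervals: centers stay at y, widths grow with the number of comparisons
z_raw = norm.ppf(0.975)
z_bonf = norm.ppf(1 - 0.025 / J)     # Bonferroni-adjusted critical value

# Bayesian partial pooling (grand mean 0, tau and se treated as known for simplicity)
shrink = (1 / se**2) / (1 / se**2 + 1 / tau**2)
post_mean = shrink * y               # centers move toward the grand mean
post_sd = np.sqrt(1 / (1 / se**2 + 1 / tau**2))

print("half-widths: raw %.2f, Bonferroni %.2f, posterior %.2f"
      % (z_raw * se, z_bonf * se, z_raw * post_sd))
print("coverage of true effects: raw %.2f, Bonferroni %.2f, posterior %.2f"
      % (np.mean(np.abs(y - theta) < z_raw * se),
         np.mean(np.abs(y - theta) < z_bonf * se),
         np.mean(np.abs(post_mean - theta) < z_raw * post_sd)))
```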


2. MRP, or multilevel modeling more generally, does not “move the individual subgroup means toward that grand mean.” It moves the error terms toward zero, which implies that it moves the local averages toward their predictions from the regression model. For example, if you’re predicting test scores given various school-level predictors, then multilevel modeling partially pools the individual school means toward the fitted model. It would not in general make sense to partially pool toward the grand mean—not in any sort of large study that includes all sorts of different schools. (Yes, in Rubin’s classic 8-schools study, the estimates were pooled toward the average, but these were 8 similar schools in suburban New Jersey, and there were no available school-level predictors to distinguish them.)
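Here’s a minimal numerical sketch of that point, with invented school data and the variance components treated as known: each subgroup mean gets pulled toward its prediction from the group-level regression, and the smaller the subgroup, the harder the pull. Nothing gets pulled toward the grand mean unless the regression prediction happens to coincide with it.

```python
# Toy illustration (invented data): partial pooling moves each school mean toward
# its regression prediction from school-level covariates, not toward the grand mean.
import numpy as np

rng = np.random.default_rng(2)
J = 200
x = rng.normal(0, 1, J)                  # a school-level predictor
true_mean = 50 + 5 * x + rng.normal(0, 2, J)
n = rng.integers(5, 100, J)              # subgroup sizes vary a lot
ybar = true_mean + rng.normal(0, 10 / np.sqrt(n))   # observed subgroup means, within-school sd 10

# Group-level regression fit (ordinary least squares on the observed means)
X = np.column_stack([np.ones(J), x])
beta, *_ = np.linalg.lstsq(X, ybar, rcond=None)
pred = X @ beta

# Precision-weighted partial pooling with sigma=10 (within) and tau=2 (between) taken as known
sigma, tau = 10.0, 2.0
w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
pooled = w * ybar + (1 - w) * pred       # small-n schools are pulled hardest, toward pred, not toward ybar.mean()

j = np.argmin(n)
print("school with smallest n:", ybar[j], "->", pooled[j],
      "regression prediction:", pred[j], "grand mean:", ybar.mean())
```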

3. I agree with Nelson that it’s a mistake to summarize results using statistical significance, and this can lead to artifacts when comparing different models. There’s no good reason to make decisions based on whether a 95% interval includes zero.

4. I like multilevel models, but point estimates from any source—multilevel modeling or otherwise—have unavoidable problems when the goal is to convey uncertainty. See our 1999 article, All maps of parameter estimates are misleading.

In summary, I like the Forrow et al. article. The next step should be to go beyond point estimates and statistical significance and to think more carefully about decision making under uncertainty in this educational context.

Thinking about statistical modeling, overfitting, and generalization error

Allister Bernard writes:

I recently came across some research on generalization error and deep learning (references below). These papers explore how generalization error in deep neural networks improves as model capacity increases, contrary to what one would expect from the bias-variance tradeoff. I had assumed this improvement in such overparameterized models was the effect of regularization (implicit and/or explicit). However, Zhang et al. show that regularization is highly unlikely to be the source of these gains.

References:
Zhang et al. https://cacm.acm.org/magazines/2021/3/250713-understanding-deep-learning-still-requires-rethinking-generalization/fulltext
Nakkiran et al. https://openai.com/blog/deep-double-descent/
Belkin et al. https://www.pnas.org/content/pnas/116/32/15849.full.pdf

Your note on the most important statistical ideas of the past 50 years highlights the gains achieved with overparameterized models (and regularization). It has worried me that all the hype around deep learning seems to gloss over how overparameterized these models have become. This is not to diminish the gains these models have made in a number of fields, especially image recognition and NLP; those achievements are truly impressive.

Here are my two questions:

1. I am curious whether there is any work from the statistics community on why we see this improvement in generalization error. Most of the research I have seen is from the ML/CS community. Belkin et al. point out that this behavior is observed in other types of overparameterized models, such as random forests.
Another possible explanation is that these improvements may be dependent on the problem domain. Feldman et al. propose a possible reason behind this phenomenon (https://arxiv.org/pdf/2008.03703.pdf).

2. Your blog has highlighted the dangers of the garden of forking paths, and I am curious whether we may have another, similar phenomenon here that is not well understood.

From a practical perspective, I wonder if a lot of these tools may get applied to domains where they are not applicable and end up having effects in the real world (as opposed to the theoretical world). There is currently no reason not to do so as we don’t understand where these ideas will/will not work. Besides, it is now very easy to use some of these tools via off the shelf packages.

My reply:

I took a look at the first paper linked above, and I don’t quite get what they are doing. In particular, they say, “Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice,” but then they define regularization as follows: “When the complexity of a model is very high, regularization introduces algorithmic tweaks intended to reward models of lower complexity. Regularization is a popular technique to make optimization problems ‘well posed’: when an infinite number of solutions agree with the data, regularization breaks ties in favor of the solution with lowest complexity.” This is not the regularization that I do at all! When I talk about regularization, I’m thinking about partial pooling, or more generally approaches to get more stable predictions. From a Bayesian perspective, regularization is not “algorithmic tweaks”; it’s just inference under a model. Also, regularization is not just about “breaking ties,” which implies some sort of lexicographic decision rule, nor does regularization necessarily lead to estimates with less complexity. It leads to estimates that are less variable, but that’s something different. For example, hierarchical modeling is not less complex (let alone “with lowest complexity”) than least squares, but it gives more stable predictions.
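To illustrate what I mean by regularization being inference under a model rather than an algorithmic tweak, here’s a small numerical check on simulated data (my own toy example, not from the papers above): the ridge estimate with penalty lambda = sigma^2/tau^2 is exactly the posterior mean under independent normal(0, tau^2) priors on the coefficients.

```python
# Toy check (invented data): ridge regression with penalty lam = sigma^2 / tau^2
# equals the posterior mean under independent normal(0, tau^2) priors on the coefficients.
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 10
X = rng.normal(size=(n, p))
sigma, tau = 1.0, 0.5
beta_true = rng.normal(0, tau, p)
y = X @ beta_true + rng.normal(0, sigma, n)

lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mean under y ~ N(X beta, sigma^2 I), beta ~ N(0, tau^2 I)
post_prec = X.T @ X / sigma**2 + np.eye(p) / tau**2
beta_post = np.linalg.solve(post_prec, X.T @ y / sigma**2)

print(np.allclose(beta_ridge, beta_post))   # True: same estimate, two descriptions
```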

That said, my above comment is expressed in general terms, and I’m no expert on deep learning or various other machine learning techniques. I’m sympathetic with the general idea of comparing success with training and test data, and I also recognize the challenge of these evaluations, given that cross-validation tests are themselves a function of the available data.

One thing I’ve been thinking about a lot in recent years is poststratification: the idea that you’re fitting a model on data set A and then using it to make predictions in scenario B. The most important concern here might not be overfitting to the data, so much as appropriately modeling the differences between A and B.
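Here’s a minimal sketch of that last idea with made-up numbers: fit on sample A, then average the cell-level predictions over target population B’s covariate distribution rather than over A’s.

```python
# Toy poststratification sketch (made-up numbers): fit a model on sample A,
# then average cell predictions over target population B's covariate distribution.
import numpy as np

rng = np.random.default_rng(4)

# Sample A over-represents cell 0 of a binary covariate; the outcome depends on the cell
cells = np.array([0, 1])
p_sample_A = np.array([0.8, 0.2])       # cell shares in the data we fit on
p_target_B = np.array([0.4, 0.6])       # cell shares in the population we care about

n = 1000
cell_A = rng.choice(cells, size=n, p=p_sample_A)
y = 1.0 + 2.0 * cell_A + rng.normal(0, 1, n)

# "Model": cell means estimated from sample A
cell_means = np.array([y[cell_A == c].mean() for c in cells])

naive_estimate = y.mean()                         # implicitly weighted by sample A's cell shares
poststratified = np.sum(cell_means * p_target_B)  # reweighted to scenario B

print("sample-A average: %.2f   poststratified to B: %.2f" % (naive_estimate, poststratified))
```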

“The butterfly and the piranha: Understanding the generalizability and reproducibility crisis from statistical and political perspectives” (my talk at the University of Minnesota political science department on Monday)

The talk is Mon 13 Mar, 11:30am Minnesota time, and it will be remote:

The butterfly and the piranha: Understanding the generalizability and reproducibility crisis from statistical and political perspectives

Researchers often act as if causal identification + statistical significance = discovery. This belief is appealing but incorrect, and it can lead to an unfortunate feedback loop by which important aspects of measurement get neglected in social science. From a statistical perspective, we can understand these problems using the framework of multilevel regression and poststratification (MRP), a method originally developed for survey research but which also applies to causal inference and generalization in other settings.

Now consider various flawed quantitative social research claiming large effects on voting and political attitudes based on factors such as hormones, football games, shark attacks, and subliminal smiley faces. We argue there is a political dimension to the continued appeal of what might be called the foolish-voter model. We explore the connections between the statistical problems of ungeneralizable or unreplicable claims, and the political positions supported by those claims.

Here’s the zoom link.

A bit of harmful advice from “Mostly Harmless Econometrics”

John Bullock sends along this from Joshua Angrist and Jorn-Steffen Pischke’s Mostly Harmless Econometrics—page 223, note 2:

They don’t seem to know about the idea of adjusting for the group-level mean of pre-treatment predictors (as in this 2006 paper with Joe Bafumi).
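For readers who haven’t seen that trick, here’s a minimal sketch on simulated data (my own toy example, not the analysis in the book or in the 2006 paper): adding the group-level mean of an individual predictor as its own regressor separates the within-group slope from the between-group slope.

```python
# Toy sketch (simulated data): including the group mean of a predictor as its own
# regressor separates the within-group slope from the between-group slope.
import numpy as np

rng = np.random.default_rng(5)
J, n_per = 100, 20
group = np.repeat(np.arange(J), n_per)
group_effect = rng.normal(0, 1, J)

x = group_effect[group] + rng.normal(0, 1, J * n_per)                   # x is correlated with the group effect
y = 0.5 * x + 2.0 * group_effect[group] + rng.normal(0, 1, J * n_per)   # true within-group slope = 0.5

xbar = np.array([x[group == j].mean() for j in range(J)])[group]

# Pooled regression: the single coefficient on x mixes within- and between-group variation
b_pooled, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)

# Adding the group mean: the coefficient on x is now the within-group slope
X2 = np.column_stack([np.ones_like(x), x, xbar])
b_adj, *_ = np.linalg.lstsq(X2, y, rcond=None)

print("pooled slope on x: %.2f   slope on x after adjusting for group mean: %.2f"
      % (b_pooled[1], b_adj[1]))
```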

I like Angrist and Pischke’s book a lot so am happy to be able to help out by patching this little hole.

I’d also like to do some further analysis updating that paper with Bafumi using Bayesian analysis.

Rohrer and Arslan’s nonet: More ideas regarding interactions in statistical models

Ruben Arslan writes:

I liked the causal quartet you recently posted and wanted to forward a similar homage (in style if not content) Julia Rohrer and I recently made to accompany this paper. We had to go to a triple triptych though, so as not to compress it too much.

The paper in question is called Precise Answers to Vague Questions: Issues With Interactions.

What to do when a regression coefficient doesn’t make sense? The connection with interactions.

In addition to the cool graph, I like Rohrer and Arslan’s paper a lot because it addresses a very common problem in statistical modeling, a problem I’ve talked about a lot but which, as far as I can remember, I only wrote up once, on page 200 in Regression and Other Stories, in the middle of chapter 12, where it wouldn’t be noticed by anybody.

Here it is:

When you fit a regression to observational data and you get a coefficient that makes no sense, you should be able to interpret it using interactions.

Here’s my go-to example, from a meta-analysis published in 1999 on the effects of incentives to increase the response rate in sample surveys:

What jumps out here is that big fat coefficient of -6.9 for Gift. The standard error is small, so it’s not an issue of sampling error either. As we wrote in our article:

Not all of the coefficient estimates in Table 1 seem believable. In particular, the estimated effect for gift versus cash incentive is very large in the context of the other effects in the table. For example, from Table 1, the expected effect of a postpaid gift incentive worth $10 in a low-burden survey is 1.4 + 10(.34) – 6.9 = -2.1%, actually lowering the response rate.

Ahhhh, that makes no sense! OK, yeah, with some effort you could tell a counterintuitive story where this negative effect could be possible, but there’d be no good reason to believe such a story. As we said:

It is reasonable to suspect that this reflects differences between the studies in the meta-analysis, rather than such a large causal effect of incentive form.

That is, the studies where a gift incentive was tried happened to be studies where the incentive was less effective. Each study in this meta-analysis was a randomized experiment, but the treatments were not chosen randomly between studies, so there’s no reason to think that treatment interactions would happen to balance out.
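Here’s a toy simulation of that story, with invented numbers rather than the actual 1999 data: every study’s randomized comparison is fine, but gift incentives happen to have been tried in studies where incentives help less, so the across-study coefficient on gift comes out large and negative even though the within-study gift effect is positive.

```python
# Toy simulation (invented numbers, not the 1999 meta-analysis): each study's randomized
# estimate is unbiased, but gift incentives were tried mostly in studies where incentives
# work less well, so the across-study coefficient on "gift" comes out negative.
import numpy as np

rng = np.random.default_rng(6)
S = 200
gift = rng.binomial(1, 0.5, S)                     # which incentive form each study used
effectiveness = np.where(gift == 1,                # study-level interaction: gift studies are
                         rng.normal(2, 1, S),      # ones where any incentive helps less
                         rng.normal(8, 1, S))

true_within_effect = effectiveness * (1.0 + 0.2 * gift)   # gift is mildly *better* within a study
est = true_within_effect + rng.normal(0, 1, S)            # each study's randomized estimate

X = np.column_stack([np.ones(S), gift])
b, *_ = np.linalg.lstsq(X, est, rcond=None)
print("across-study coefficient on gift: %.1f (negative, despite a positive within-study effect)" % b[1])
```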

Some lessons from our example

First, if a coefficient makes no sense, don’t just suck it up and accept it. Instead, think about what this really means; use the unexpected result as a way to build a better model.

Second, avoid rigid priors when fitting models to observational data. There could be a causal effect that you know must be positive—but, in an observational setting, the effect could be tangled with an interaction so that the relevant coefficient is negative.

Third, these problems don’t have to involve sign flipping. That is, even if a coefficient doesn’t go in the “wrong direction,” it can still be way off: partly from the familiar problems of forking paths and selection on statistical significance, but also from interactions. For example, remember that indoor-coal-heating-and-lifespan analysis? That’s an observational study! (And calling it a “natural experiment” or “regression discontinuity” doesn’t change that.) So the treatment can be tangled in an interaction, even aside from issues of selection and variation.

So, yeah, interactions are important, and I think the Rohrer and Arslan paper is a good step forward in thinking about that.

Overestimated health effects of air pollution

Last year I wrote a post, “Why the New Pollution Literature is Credible” . . . but I’m still guessing that the effects are being overestimated.

Since then, Vincent Bagilet and Léo Zabrocki-Hallak wrote an article, Why Some Acute Health Effects of Air Pollution Could Be Inflated, that begins:

Hundreds of studies show that air pollution affects health in the immediate short-run, and play a key role in setting air quality standards. Yet, estimated effect sizes vary widely across studies. Analyzing the results published in epidemiology and economics, we first find that a substantial share of estimates are likely to be inflated due to publication bias and a lack of statistical power. Second, we run real data simulations to identify the design parameters causing these issues. We show that this exaggeration may be driven by the small number of exogenous shocks leveraged, by the limited strength of the instruments used or by sparse outcomes. These concerns likely extend to studies in other fields relying on comparable research designs. Our paper provides a principled workflow to evaluate and avoid the risk of exaggeration when conducting an observational study.

Their article also includes the above graph. It’s good to see this work being done and to see these type M results applied to different scientific fields.
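For readers new to the type M idea, here’s a minimal sketch of the exaggeration-ratio calculation, in the spirit of the retrodesign approach but with made-up numbers rather than anything from Bagilet and Zabrocki-Hallak’s simulations.

```python
# Type M (exaggeration) sketch with made-up numbers: if the true effect is small relative
# to the standard error, estimates that reach statistical significance overstate it.
import numpy as np

rng = np.random.default_rng(7)
true_effect, se = 0.5, 1.0           # a low-power setting: true effect is half the s.e.
est = rng.normal(true_effect, se, 1_000_000)
significant = np.abs(est) > 1.96 * se

power = significant.mean()
exaggeration = np.mean(np.abs(est[significant])) / true_effect
print("power: %.2f   expected exaggeration ratio among significant results: %.1f"
      % (power, exaggeration))
```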

P.S. I’m putting this in the Multilevel Modeling category because that’s what’s going on; they’re in essence partially pooling information across multiple studies, and individual researchers could do better by partially pooling within their studies, rather than selecting the biggest results.

Software to sow doubts as you meta-analyze

This is Jessica. Alex Kale, Sarah Lee, TJ Goan, Beth Tipton, and I write,

Scientists often use meta-analysis to characterize the impact of an intervention on some outcome of interest across a body of literature. However, threats to the utility and validity of meta-analytic estimates arise when scientists average over potentially important variations in context like different research designs. Uncertainty about quality and commensurability of evidence casts doubt on results from meta-analysis, yet existing software tools for meta-analysis do not necessarily emphasize addressing these concerns in their workflows. We present MetaExplorer, a prototype system for meta-analysis that we developed using iterative design with meta-analysis experts to provide a guided process for eliciting assessments of uncertainty and reasoning about how to incorporate them during statistical inference. Our qualitative evaluation of MetaExplorer with experienced meta-analysts shows that imposing a structured workflow both elevates the perceived importance of epistemic concerns and presents opportunities for tools to engage users in dialogue around goals and standards for evidence aggregation.

One way to think about good interface design is that we want to reduce sources of “friction,” like the cognitive effort users have to exert when they go to do some task; in other words, to minimize the so-called gulf of execution. But then there are tasks like meta-analysis where being on auto-pilot can result in misleading results. We don’t necessarily want to create tools that encourage certain mindsets, like when users get overzealous about suppressing sources of heterogeneity across studies in order to get some average that they can interpret as the ‘true’ fixed effect. So what do you do instead? One option is to create a tool that undermines the analyst’s attempts to combine disparate sources of evidence every chance it gets.

This is essentially the philosophy behind MetaExplorer. This project started when I was approached by an AI firm pursuing a contract with the Navy, where systematic review and meta-analysis are used to make recommendations to higher-ups about training protocols or other interventions that could be adopted. Five years later, a project that I had naively figured would take a year (this was my first time collaborating with a government agency) culminated in a tool that differs from other software out there primarily in its heavy emphasis on sources of heterogeneity and uncertainty. It guides the user through making their goals explicit, such as the target context they care about; extracting effect estimates and supporting information from a set of studies; identifying characteristics of the studied populations and analysis approaches; and noting concerns about asymmetries, flaws in analysis, or mismatch between the studied and target context. These sources of epistemic uncertainty get propagated to a forest plot view where the analyst can see how an estimate varies as studies are regrouped or omitted. It’s limited to small meta-analyses of controlled experiments, and we have various ideas based on our interviews of meta-analysts that could improve its value for training and collaboration. But maybe some of the ideas will be useful either to those doing meta-analysis or building software. Codebase is here.
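The see-how-the-estimate-moves-when-studies-are-omitted idea is also easy to prototype outside any particular tool. Here’s a minimal leave-one-out sketch using a DerSimonian-Laird-style random-effects estimate and made-up study results; this is just an illustration of the sensitivity-check idea, not MetaExplorer’s implementation.

```python
# Minimal leave-one-out sensitivity check for a random-effects meta-analysis
# (made-up study estimates; DerSimonian-Laird tau^2, not MetaExplorer's implementation).
import numpy as np

def dl_pooled(y, v):
    """DerSimonian-Laird random-effects pooled estimate for effects y with variances v."""
    w = 1 / v
    fixed = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - fixed) ** 2)
    tau2 = max(0.0, (Q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (v + tau2)
    return np.sum(w_re * y) / np.sum(w_re)

y = np.array([0.10, 0.25, 0.40, 0.05, 0.90])   # study effect estimates (invented)
v = np.array([0.02, 0.03, 0.04, 0.02, 0.05])   # their sampling variances (invented)

print("all studies: %.2f" % dl_pooled(y, v))
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    print("without study %d: %.2f" % (i + 1, dl_pooled(y[keep], v[keep])))
```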