Explaining that line, “Bayesians moving from defense to offense”

Earlier today we posted something on our recent paper with Erik van Zwet et al., “A New Look at P Values for Randomized Clinical Trials.” The post had the provocative title, “Bayesians moving from defense to offense,” and indeed that title provoked some people!

The discussion thread here at the blog was reasonable enough, but someone pointed me to a thread at Hacker News where there was more confusion, so I thought I’d clarify one or two points.

First, yes, as one commenter puts it, “I don’t know when Bayesians have ever been on defense. They’ve always been on offense.” Indeed, we published the first edition of Bayesian Data Analysis back in 1995, and there was nothing defensive about our tone! We were demonstrating Bayesian solutions to a large set of statistics problems, with no apology.

As a Bayesian, I’m kinda moderate—indeed, Yuling and I published an entire, non-ironic paper on holes in Bayesian statistics, and there’s also a post a few years ago called What’s wrong with Bayes, where I wrote, “Bayesian inference can lead us astray, and we’re better statisticians if we realize that,” and “the problem with Bayes is the Bayesians. It’s the whole religion thing, the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian, the people who refuse to check their models because subjectivity, the people who try to talk you into using a ‘reference prior’ because objectivity. Bayesian inference is a tool. It solves some problems but not all, and I’m exhausted by the ideology of the Bayes-evangelists.”

So, yeah, “Bayesians on the offensive” is not new, and I don’t even always like it. Non-Bayesians have been pretty aggressive too over the years, and not always in a reasonable way; see my discussion with Christian Robert from a few years ago and our followup. As we wrote, “The missionary zeal of many Bayesians of old has been matched, in the other direction, by an attitude among some theoreticians that Bayesian methods were absurd—not merely misguided but obviously wrong in principle.”

Overall, I think there’s much more acceptance of Bayesian methods within statistics than in past decades, in part from the many practical successes of Bayesian inference and in part because recent successes of machine learning have given users and developers of methods more understanding and acceptance of regularization (also known as partial pooling or shrinkage, and central to Bayesian methods) and, conversely, have given Bayesians more understanding and acceptance of regularization methods that are not fully Bayesian.

OK, so what was I talking about?

So . . . Given all the above, what did I mean by my crack about “Bayesians moving from defense to offense”? I wasn’t talking about Bayesians being positive about Bayesian statistics in general; rather, I was talking about the specific issue of informative priors.

Here’s how we used to talk, circa 1995: “Bayesian inference is a useful practical tool. Sure, you need to assign a prior distribution, but don’t worry about it: the prior can be noninformative, or in a hierarchical model the hyperparameters can be estimated from data. The most important ‘prior information’ to use is structural, not numeric.”

Here’s how we talk now: “Bayesian inference is a useful practical tool, in part because it allows us to incorporate real prior information. There’s prior information all around that we can use in order to make better inferences.”

My “moving from defense to offense” line was all about the changes in how we think about prior information. Instead of being concerned about prior sensitivity, we respect prior sensitivity and, when the prior makes a difference, we want to use good prior information. This is exactly the same as in any statistical procedure: when there’s sensitivity to data (or, in general, to any input to the model), that’s where data quality is particularly relevant.
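To make that concrete, here’s a minimal sketch of the standard normal-normal calculation (the numbers are invented for illustration, not taken from any particular application): the same noisy estimate combined first with a nearly flat prior and then with an informative prior that says effects in this area tend to be small.

```python
# Minimal sketch of how an informative prior changes a noisy estimate.
# All numbers are invented; this is just the standard normal-normal
# (conjugate) posterior calculation.

def posterior(y, se, prior_mean, prior_sd):
    """Posterior mean and sd for a normal estimate y with standard error se."""
    prec_data = 1 / se**2
    prec_prior = 1 / prior_sd**2
    post_var = 1 / (prec_data + prec_prior)
    post_mean = post_var * (prec_data * y + prec_prior * prior_mean)
    return post_mean, post_var**0.5

y, se = 0.80, 0.40   # a noisy estimate, two standard errors from zero

# Nearly flat prior: the posterior just reproduces the raw estimate.
print(posterior(y, se, prior_mean=0, prior_sd=100))    # about (0.80, 0.40)

# Informative prior saying effects in this area are usually small:
# the estimate is pulled sharply toward zero and its uncertainty shrinks.
print(posterior(y, se, prior_mean=0, prior_sd=0.25))   # about (0.22, 0.21)
```

With the flat prior the data dominate; with the informative prior the answer changes a lot, and that is exactly the situation where the quality of the prior information matters.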

This does not stop you from using classical p-values

Regarding the specific paper we were discussing in yesterday’s post, let me emphasize that this work is very friendly to traditional/conventional/classical approaches.

As we say right there in the abstract, “we reinterpret the P value in terms of a reference population of studies that are, or could have been, in the Cochrane Database.”

So, in that paper, we’re not saying to get rid of the p-value. It’s a data summary, people are going to compute it, and people are going to report it. That’s fine! It’s also well known that p-values are commonly misinterpreted (as detailed for example by McShane and Gal, 2017).

Given that people will be reporting p-values, and given how often they are misinterpreted, even by professionals, we believe that it’s a useful contribution to research and practice to consider how to interpret them directly, under assumptions that are more realistic than the null hypothesis.

So if you’re coming into all of this as a skeptic of Bayesian methods, my message to you is not, “Chill out, the prior doesn’t really matter anyway,” but, rather, “Consider this other interpretation of a p-value, averaging over the estimated distribution of effect sizes in the Cochrane database instead of conditioning on the null hypothesis.” So you now have two interpretations of the p-value, conditional on different assumptions. You can use them both.
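If you want to see the two interpretations side by side, here’s a small simulation sketch. It does not use the Cochrane-based distribution of effect sizes from our paper; it just assumes, for illustration, that true standardized effects are drawn from a normal distribution concentrated near zero, and then asks what reaching p < 0.05 tends to mean under that assumption, as opposed to under the null.

```python
# Two readings of "p < 0.05," via simulation.  The effect-size distribution
# below is an invented stand-in, NOT the Cochrane-based estimate from the
# van Zwet et al. paper.
import numpy as np

rng = np.random.default_rng(1)
n_sims = 1_000_000

# Reading 1: condition on the null hypothesis.
# By construction, P(p < 0.05 | true effect = 0) = 0.05; nothing to simulate.

# Reading 2: average over an assumed population of true effects.
true_effect = rng.normal(0, 0.5, n_sims)   # invented: most true effects are small
z = rng.normal(true_effect, 1)             # estimate with standard error 1
signif = np.abs(z) > 1.96                  # two-sided p < 0.05

print("share of studies reaching p < 0.05:",
      round(signif.mean(), 3))
print("P(true effect has the same sign as the estimate | p < 0.05):",
      round((np.sign(true_effect[signif]) == np.sign(z[signif])).mean(), 3))
print("P(|true effect| > 0.5 | p < 0.05):",
      round((np.abs(true_effect[signif]) > 0.5).mean(), 3))
```

Under the null, the 5% number is true by construction; averaged over a population of mostly small effects, the same cutoff carries different information, and that is the kind of reinterpretation the paper works out with a serious estimate of the effect-size distribution.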

Dave Krantz

“One of the things that makes scientific research hard is that one is usually not sure what hat one should be wearing in the given situation.” — David H. Krantz, 1938-2023

It’s been a bunch of years since I talked or corresponded with Dave Krantz—a quick check of my email reveals one email from 2013, an exchange from 2012, one item from 2011, one item from 2010, one from 2009, one from 2008, and, before that I guess he was still at Columbia and I was talking to him regularly in person.

After he passed away, I was asked to contribute something to his obituary. This is what I sent:

Dave had deep knowledge about everything. I treasure the many long conversations we had at Columbia. Just about every day something comes up in my teaching or research for which I’d love to have Dave’s perspective. Fortunately, he wrote well and had long conversations with many people, so his ideas and perspectives will remain with us.

I was surprised, when searching online, not to find any academic memoirs or obituaries of Dave—maybe the Society for Judgment and Decision Making is preparing something? To help fill the gap, I thought I’d post some of the last emails I received from Dave, as they give a sense of his depth and range.

The last email came on 11 Jul 2013. I’d sent Dave an email discussion I’d had with Dan Kahan—hey, that’s another person I haven’t heard from in many years!—regarding hypothesis testing. In one of those messages, I’d written this to Dan:

Regarding your [Dan’s] later point about evaluating competing hypotheses, that’s another (interesting) story. Dave Krantz once was telling me his research regarding how people think of evidence comparing hyps A and B. His point was that there are 4 types of information:
– supporting A
– supporting B
– opposing A
– opposing B
In usual likelihood-ratio or Bayes theory, all 4 sorts of evidence can be placed on a common scale, but in his expts people treated them differently. I think one example was a comparison between two settings: (a) weak evidence in favor of A, weak evidence in favor of B; (b) strong evidence against A, strong evidence against B. This was in a situation where the truth really had to be A or B, there was no “other” option. At least, that’s how I remember it. Anyway, I think Dave told me that people had much different reactions to settings (a) and (b). Which I believe.

And this was Dave’s reply to me (addressing the entire thread, not just my snippet above):

Thanks for sharing this interesting discussion with me. I have a few comments.

First, I do of course agree with Kahan’s point about multiple working hypotheses and good experimental design. John Platt’s Science paper on “strong inference” used to be the standard discussion of this point; but I think it has been forgotten by now …

Nonetheless, there are situations where one uses evidence to reject a hypothesis even though it does not support any alternative. In lab experiments, the best examples are those in which a study is designed to discriminate two or more scientifically interesting theories but the data reject all of them strongly. One usually concludes that something has gone seriously wrong with method or with modeling. “Something gone wrong” is a diffuse alternative, which was not originally under consideration in the design and which does not lend itself to generating conditional probabilities of the observations, given the hypothesis. Sometimes one is led a step farther: something is wrong and THAT is really interesting — new hypotheses, never before considered, are generated. It is the use of evidence to generate diffuse alternatives or interesting novel alternatives that undercuts the likelihood principle as well as the more general Platt principle of “strong inference.”

A legal example would be discovering that the prime suspect has an airtight confirmed alibi. One (provisionally) rejects the hypothesis that he or she is guilty, but the alternative, “someone else” is too diffuse to be useful.

Second, I think that your characterization of NHST is exactly right.

Third, in answer to your questions about different sorts of evidence, I think you are conflating several things that I’ve worked on; and also, it is necessary to take framing into account as to what is evidence for or against. With one prime suspect, an alibi is evidence against. With two prime suspects, an alibi for one is (logically) evidence pointing to the other, and can be framed either way.

One of the differences between strong evidence for each of two incompatible hypotheses, A and B, versus weak evidence for each, is that a moderate piece of evidence pointing to one “should” weigh fairly heavily in the weak case but not so much in the strong case. When I say “should” I am referring both to my own intuitions and to the Dempster-Shafer rule assuming cognitively independent evidence; but NOT to empirical studies. I did do some of those, but the results were too confusing and messy to publish.

The best recent work about evidence for and against, with framing, is by Elke Weber and Eric Johnson (Query Theory). You probably know about this.

I can describe my own forays into empirical studies of evidence judgment (joint work with Laura Briggs and with Rahul Dodhia) and empirical studies of the use of weak evidence in probability judgment (modifications of Tversky’s Support Theory, with Dan Osherson); but most of it is not that relevant. The most important is the study that Briggs and I published (J. Behavioral Decision Making, 1992, 5, 77-106). It is particularly important for separating evidence judgment from probability judgment and for showing at least some possibility of focusing judgment on “designated” evidence — something that is critical in many tasks, from juror decision making to preparation of arguments.

Fourth, and finally, I’m sure that I don’t need to tell Dan Kahan to be careful in judging rationality of rules of evidence. In particular, legal systems have multiple goals: not just making judgments of evidence in particular cases but also system goals. One of the latter is to deter the use of inflammatory prejudicial information within the legal system.

If X embezzled in the past, this is pretty good evidence that X will embezzle in the future — thus, relevant to hiring decisions — but it is not good evidence that X embezzled on the present occasion; and it may well have more prejudicial than probative value, taking into account the temperament of judges and juries, and also taking into account system goals as noted above. On the other hand, if X has demonstrated scrupulous honesty in the past, this is decent character evidence against an accusation on the present occasion; but it then becomes admissible to refute such character evidence by pointing to past occasions of embezzlement.

As to things in general — I’ve been well, mostly these days in Nashville, but planning to continue teaching every Winter term at Columbia for the next few years. I hope you and your family are all well.

My second-to-last discussion with Dave was on 14 May 2012. I forwarded to him a blog post entitled “I hate to get all Gerd Gigerenzer on you here, but . . .”, which was critical of a researcher in the judgment and decision making space. We had an exchange of several emails back and forth. Some of this involved discussion of the contributions of Gigerenzer and some of it involved Dave’s skepticism of Tversky and Kahneman’s “two-system” theory, and I connected this to my later-published work with Basbøll on the nature of stories. Dave summarized the discussion in this way:

Everyone seems to be capable of failing to apply things that they know well when they are not being tip-top-alert.

I have long puzzled about the ontogeny of mathematical reasoning. On the one hand, we almost all have one heart, two kidneys (I am down to 1 2/3), etc., and we are capable of knowing one another’s minds on the basis of our own. We do make mistakes in this; but we make mistakes in inferences about our own goals and feelings as well. So if most humans are so similar, why do we see such a range of performance in mathematics, dance, musical composition, etc.? Some of it I believe derives from huge positive-feedback loops. Each good move is satisfying and motivates the person making it to develop further.

There are many tricky little problems that have been used in the laboratory to show how stupid people are. I have hardly ever fallen for them, but it isn’t because I’m smarter, it is because I’ve developed my mathematical persona and have confidence in it, and when I see such problems I don’t respond to them as Dave Krantz, I respond to them as Dave Krantz-plus-math hat. Someone who lacks confidence in his or her math hat may respond by putting on a dunce hat instead, or more simply, no hat at all. If the situation does not cue me to put the math hat on — as, for example, with Tversky and Kahneman’s original conjunction fallacy problems — then I behave just like most people. One of the things that makes scientific research hard is that one is usually not sure what hat one should be wearing in the given situation.

People don’t like this account much and it may be wrong. Certainly it strains to account for the feats of Mozart, Capablanca, and Picasso. But “supposed feats” would be more accurate. The documentation is thin.

Dave was an incredibly thoughtful person. OK, not always—like all of us, he was capable of failing to apply things that he knew well when he was not being tip-top-alert—but he was tip-top-alert most of the times that I ever saw him. Generous, amazing insights, big picture, the whole thing.

P.S. Here are some Krantz-related items from the blog archives:

12 Jul 2005: Dave Krantz on decision analysis and quantum physics, leading to a Jim Thompson reference and then back to Penrose’s theory that consciousness is inherently quantum-mechanical

13 Jul 2007: Goals and plans in decision making

5 Oct 2007: More on significance testing in economics

25 Oct 2007: Dave Krantz on utility and value

19 May 2009: I got the idea of type M errors from Dave Krantz; apparently it was a well-known concept in psychometrics (although without the “type M” name)

Suppose you realize that a paper on an important topic, with thousands of citations, is fatally flawed? Where should the correction be published?

OK, here’s a hypothetical scenario. You’re a researcher. You look carefully at one of the most-cited papers in an important subfield—perhaps the most influential paper published there in the past decade. It turns out that the paper is fatally flawed. Unfortunately, it seems very unlikely that the authors of the original paper will do an “Our bad, we retract.” You can write a short article detailing the flaws in that paper. But what do you do with your short article?

Here are some options:

– Publish it in a journal such as Econ Journal Watch or Sociological Science that specializes in criticism of published work. The trouble here is (a) there aren’t many such outlets, and (b) a publication there might be barely noticed.

– Publish it in the same journal that published the original article. The trouble here is that the journal might defer to the authors of the original article, for example publishing your criticism as a letter with a reply by the original authors, which in theory could be fine but in practice could just be a way to muddy the waters. Or the journal might just flat-out refuse to publish your article at all, taking the position that they just want to publish original work (even if wrong) and not commentary.

– Publish it in a top journal in the field. That should be possible, given that you’re criticizing a very influential paper. But, again, top journals often don’t want to publish criticism, and there can also be a circling-the-wagons thing where they really don’t want to see criticism of influential work.

– Publish it on a preprint server and blog it. This can create some short-term stir, but I think that if the criticism’s not in the scholarly literature, it will fade, except in an extreme scenario such as Wansink, Ariely, or Tol where revelations of sloppy practice are so clear that a researcher’s entire corpus is called into question.

– Don’t publish it right away. Instead, do further research and publish a new paper on the topic with a new conclusion, incidentally shooting down that old paper that’s all wrong. Publish that new paper in a top journal. This could be the best option, but (a) it takes a lot of work, (b) it delays the revelation of the problems with the earlier paper, and (c) it shouldn’t be necessary: this implies higher standards for corrections than for original work.

P.S. Joshua in comments writes:

What about another option – contacting the authors of the paper with an elaboration of your critique, and offering to collaborate on a follow up. CC the journal that published the original paper.

My reply: Sure, I guess that’s worth a try. It’s generally not worked for me. In my experience, the original researchers and the journals that published the original papers are too committed to the result. One problem here is that I’ll typically hear about a paper in the first place because it got inappropriately uncritical publicity, and, when that happens, most authors just do not want to hear that they did something wrong, and they go to great efforts to avoid confronting the issue.

For example, I tried several times to contact the authors of the notorious early-childhood-intervention-in-Jamaica paper, using various channels including two different intermediaries whom I knew personally, and I never received any response. The authors of that paper did not seem to be interested in exploring what went on or doing better.

Indeed, I’ve had struggles with this sort of thing for decades. As a student, I collaborated with someone who had a paper in preparation for journal publication that had a fatal error. I told him about it and explained the error, and he refused to do anything about it: the paper was already accepted and he just wanted to take the W on his C.V. and move on. Also as a student, I found a problem in a published paper—not an error, exactly, but a bad analysis for which there was a clear direction for improvement. I contacted the authors by letter and phone, and they refused to share their data or to even consider that they might not have done the best analysis. The story for that one is here.

Oh, and here’s another story where I contacted a colleague at another institution who’d promoted work of which I was skeptical, but he didn’t want to engage in any serious way. And here’s another such story, where it was only possible to collaborate asynchronously, as it were. That last case was the best, because the data were available.

Other times I’ve contacted authors and had fruitful exchanges, and they wanted to figure out how to do better; an example is here. So it can happen.

AI bus route fail: Typically the most important thing is not how you do the optimization but rather what you decide to optimize.

Robert Farley shares this amusing story of a city that contracted out the routing of its school buses to a company that “uses artificial intelligence to generate the routes with the intent of reducing the number of routes. Last year, JCPS had 730 routes last year, and that was cut to 600 beginning this year . . .” The result was reported to be a “transportation disaster.”

I don’t know if you can blame AI here . . . Cutting from 730 routes to 600 is a reduction of nearly 18%, and that’s gonna be a major problem! To first approximation we might expect the remaining routes to be more than 20% longer on average, but that’s just an average: you can bet it will be much worse for some routes. No surprise that the bus drivers hate it.

As Farley says, “In theory, developing a bus route algorithm is something that AI could do well . . . [to] optimize the incredibly difficult problem of getting thousands of kids to over 150 schools in tight time windows,” but:

1. Effective problem solving for the real world requires feedback, and it’s not clear that any feedback was involved in this system: the company might have just taken the contract, run their program, and sent the output to the school district without ever checking that the results made sense, not to mention getting feedback from bus drivers and school administrators. I wonder how many people at the company take the bus to work themselves every day!

2. It sounds like the goal was to reduce the number of routes, not to produce routes that worked. If you optimize on factor A, you can pay big on factor B. Again, this is a reason for getting feedback and solving the problem iteratively.

3. Farley describes the AI solution as “high modernist thinking.” That’s a funny and insightful way to put it! I have no idea what sort of “artificial intelligence” was used in this bus routing program. It’s an optimization problem, and typically the most important thing is not how you do the optimization but rather what you decide to optimize.
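I have no inside knowledge of what the contractor actually did; as a toy illustration of point 2, here’s a sketch with made-up numbers in which packing stops onto as few buses as possible (objective A) drives up the longest ride (factor B).

```python
# Toy illustration with made-up numbers (nothing to do with the real JCPS system):
# the same stop-assignment problem under two different objectives.
import random

random.seed(0)
stops = [random.randint(2, 8) for _ in range(120)]   # kids waiting at each of 120 stops
CAPACITY = 66                                         # seats per bus (invented)
MIN_PER_STOP = 3                                      # rough minutes added per stop (invented)

def assign(stops, max_stops=None):
    """Greedy first-fit assignment of stops to routes."""
    routes = []
    for kids in stops:
        for route in routes:
            if sum(route) + kids <= CAPACITY and (max_stops is None or len(route) < max_stops):
                route.append(kids)
                break
        else:
            routes.append([kids])
    return routes

def report(label, routes):
    longest_ride = max(len(r) for r in routes) * MIN_PER_STOP
    print(f"{label}: {len(routes)} routes, longest ride about {longest_ride} minutes")

report("objective A, fewest routes    ", assign(stops))               # buses packed full
report("objective B, cap stops per bus", assign(stops, max_stops=8))  # more routes, shorter rides
```

Same data, same “optimizer,” different objective, very different bus rides. That choice of objective is where the real decisions get made.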

In that sense, the biggest problem with “AI” here is not that it led to a bad solution—if you try to optimize the wrong thing, I’d guess that any algorithm not backed up by feedback will fail—but rather that it had an air of magic which led people to accept its results unquestioningly. “AI,” like “Bayesian,” can serve as a slogan that leads people to turn off their skepticism. They might as well have said they used quantum computing or room-temperature superconductors or whatever.

I guess the connection to “high modernist thinking” is (a) the idea that we can and should replace the old with the new, clear the “slums” and build clean shiny new buildings, etc., and (b) the idea of looking only at surfaces, kinda like how Theranos conned lots of people by building fake machines that looked like clean Apple-brand devices. In this case, I have no reason to think the bus routing program is a con; it sounds more like an optimization program plus good marketing, and this was just one more poorly-planned corporate/government contract, with “AI” just providing a plausible cover story.

This empirical paper has been cited 1616 times but I don’t find it convincing. There’s no single fatal flaw, but the evidence does not seem so clear. How to think about this sort of thing? What to do? First, accept that evidence might not all go in one direction. Second, make lots of graphs. Also, an amusing story about how this paper is getting cited nowadays.

1. When can we trust? How can we navigate social science with skepticism?

2. Why I’m not convinced by that Quebec child-care study

3. Nearly 20 years on

1. When can we trust? How can we navigate social science with skepticism?

The other day I happened to run across a post from 2016 that I think is still worth sharing.

Here’s the background. Someone pointed me to a paper making the claim that “Canada’s universal childcare hurt children and families. . . . the evidence suggests that children are worse off by measures ranging from aggression to motor and social skills to illness. We also uncover evidence that the new child care program led to more hostile, less consistent parenting, worse parental health, and lower‐quality parental relationships.”

I looked at the paper carefully and wasn’t convinced. In short, the evidence went in all sorts of different directions, and I felt that the authors had been trying too hard to fit it all into a consistent story. It’s not that the paper had fatal flaws—it was not at all in the category of horror classics such as the beauty-and-sex-ratio paper, the ESP paper, the himmicanes paper, the air-rage paper, the pizzagate papers, the ovulation-and-voting paper, the air-pollution-in-China paper, etc etc etc.—it just didn’t really add up to me.

The question then is, if a paper can appear in a top journal, have no single killer flaw but still not be convincing, can we trust anything at all in the social sciences? At what point does skepticism become nihilism? Must I invoke the Chestertonian principle on myself?

I don’t know.

What I do think is that the first step is to carefully assess the connection between published claims, the analysis that led to these claims, and the data used in the analysis. The above-discussed paper has a problem that I’ve seen a lot, which is an implicit assumption that all the evidence should go in the same direction, a compression of complexity which I think is related to the cognitive illusion that Tversky and Kahneman called “the law of small numbers.” The first step in climbing out of this sort of hole is to look at lots of things at once, rather than treating empirical results as a sort of big bowl of fruit where the researcher can just pick out the juiciest items and leave the rest behind.

2. Why I’m not convinced by that Quebec child-care study

Here’s what I wrote on that paper back in 2016:

Yesterday we discussed the difficulties of learning from a small, noisy experiment, in the context of a longitudinal study conducted in Jamaica where researchers reported that an early-childhood intervention program caused a 42%, or 25%, gain in later earnings. I expressed skepticism.

Today I want to talk about a paper making an opposite claim: “Canada’s universal childcare hurt children and families.”

I’m skeptical of this one too.

Here’s the background. I happened to mention the problems with the Jamaica study in a talk I gave recently at Google, and afterward Hal Varian pointed me to this summary by Les Picker of a recent research article:

In Universal Childcare, Maternal Labor Supply, and Family Well-Being (NBER Working Paper No. 11832), authors Michael Baker, Jonathan Gruber, and Kevin Milligan measure the implications of universal childcare by studying the effects of the Quebec Family Policy. Beginning in 1997, the Canadian province of Quebec extended full-time kindergarten to all 5-year olds and included the provision of childcare at an out-of-pocket price of $5 per day to all 4-year olds. This $5 per day policy was extended to all 3-year olds in 1998, all 2-year olds in 1999, and finally to all children younger than 2 years old in 2000.

(Nearly) free child care: that’s a big deal. And the gradual rollout gives researchers a chance to estimate the effects of the program by comparing, at each age, children who were and were not eligible for the program.

The summary continues:

The authors first find that there was an enormous rise in childcare use in response to these subsidies: childcare use rose by one-third over just a few years. About a third of this shift appears to arise from women who previously had informal arrangements moving into the formal (subsidized) sector, and there were also equally large shifts from family and friend-based child care to paid care. Correspondingly, there was a large rise in the labor supply of married women when this program was introduced.

That makes sense. As usual, we expect elasticities to be between 0 and 1.

But what about the kids?

Disturbingly, the authors report that children’s outcomes have worsened since the program was introduced along a variety of behavioral and health dimensions. The NLSCY contains a host of measures of child well being developed by social scientists, ranging from aggression and hyperactivity, to motor-social skills, to illness. Along virtually every one of these dimensions, children in Quebec see their outcomes deteriorate relative to children in the rest of the nation over this time period.

More specifically:

Their results imply that this policy resulted in a rise of anxiety of children exposed to this new program of between 60 percent and 150 percent, and a decline in motor/social skills of between 8 percent and 20 percent. These findings represent a sharp break from previous trends in Quebec and the rest of the nation, and there are no such effects found for older children who were not subject to this policy change.

Also:

The authors also find that families became more strained with the introduction of the program, as manifested in more hostile, less consistent parenting, worse adult mental health, and lower relationship satisfaction for mothers.

I just find all this hard to believe. A doubling of anxiety? A decline in motor/social skills? Are these day care centers really that horrible? I guess it’s possible that the kids are ruining their health by giving each other colds (“There is a significant negative effect on the odds of being in excellent health of 5.3 percentage points.”)—but of course I’ve also heard the opposite, that it’s better to give your immune system a workout than to be preserved in a bubble. They also report “a policy effect on the treated of 155.8% to 394.6%” in the rate of nose/throat infection.

OK, here’s the research article.

The authors seem to be considering three situations: “childcare,” “informal childcare,” and “no childcare.” But I don’t understand how these are defined. Every child is cared for in some way, right? It’s not like the kid’s just sitting out on the street. So I’d assume that “no childcare” is actually informal childcare: mostly care by mom, dad, sibs, grandparents, etc. But then what do they mean by the category “informal childcare”? If parents are trading off taking care of the kid, does this count as informal childcare or no childcare? I find it hard to follow exactly what is going on in the paper, starting with the descriptive statistics, because I’m not quite sure what they’re talking about.

I think what’s needed here is some more comprehensive organization of the results. For example, consider this paragraph:

The results for 6-11 year olds, who were less affected by this policy change (but not unaffected due to the subsidization of after-school care) are in the third column of Table 4. They are largely consistent with a causal interpretation of the estimates. For three of the six measures for which data on 6-11 year olds is available (hyperactivity, aggressiveness and injury) the estimates are wrong-signed, and the estimate for injuries is statistically significant. For excellent health, there is also a negative effect on 6-11 year olds, but it is much smaller than the effect on 0-4 year olds. For anxiety, however, there is a significant and large effect on 6-11 year olds which is of similar magnitude as the result for 0-4 year olds.

The first sentence of the above excerpt has a cover-all-bases kind of feeling: if results are similar for 6-11 year olds as for 2-4 year olds, you can go with “but not unaffected”; if they differ, you can go with “less affected.” Various things are pulled out based on whether they are statistically significant, and they never return to the result for anxiety, which would seem to contradict their story. Instead they write, “the lack of consistent findings for 6-11 year olds confirm that this is a causal impact of the policy change.” “Confirm” seems a bit strong to me.

The authors also suggest:

For example, higher exposure to childcare could lead to increased reports of bad outcomes with no real underlying deterioration in child behaviour, if childcare providers identify negative behaviours not noticed (or previously acknowledged) by parents.

This seems like a reasonable guess to me! But the authors immediately dismiss this idea:

While we can’t rule out these alternatives, they seem unlikely given the consistency of our findings both across a broad spectrum of indices, and across the categories that make up each index (as shown in Appendix C). In particular, these alternatives would not suggest such strong findings for health-based measures, or for the more objective evaluations that underlie the motor-social skills index (such as counting to ten, or speaking a sentence of three words or more).

Health, sure: as noted above, I can well believe that these kids are catching colds from each other.

But what about that motor-skills index? Here are their results from the appendix:

[Image: table from the paper’s appendix showing coefficients for the individual motor-social skills items]

I’m not quite sure whether + or – is desirable here, but I do notice that the coefficients for “can count out loud to 10” and “spoken a sentence of 3 words or more” (the two examples cited in the paragraph above) go in opposite directions. That’s fine—the data are the data—but it doesn’t quite fit their story of consistency.

More generally, the data are addressed in a scattershot manner. For example:

We have estimated our models separately for those with and without siblings, finding no consistent evidence of a stronger effect on one group or another. While not ruling out the socialization story, this finding is not consistent with it.

This appears to be the classic error of interpretation of a non-rejection of a null hypothesis.
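For readers who want the reasoning spelled out: “no consistent evidence of a difference” is weak evidence of no difference whenever power is modest. Here’s a quick sketch with invented numbers (nothing here is taken from the Baker et al. data).

```python
# Non-rejection is not evidence of no effect: a quick power calculation by
# simulation, with invented numbers (nothing here comes from Baker et al.).
import numpy as np

rng = np.random.default_rng(2)
n_sims = 100_000

true_diff = 0.15      # a real difference between subgroups (invented)
se = 0.10             # standard error of the estimated difference (invented)

est = rng.normal(true_diff, se, n_sims)
significant = np.abs(est / se) > 1.96

print("share of studies where the real subgroup difference shows up as 'significant':",
      round(significant.mean(), 2))
# With these numbers that share is only about a third, so finding "no
# consistent evidence" of a subgroup difference is exactly what we'd expect
# even if a real difference were there.
```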

And here’s their table of key results:

[Image: the paper’s table of key results]

As quantitative social scientists we need to think harder about how to summarize complicated data with multiple outcomes and many different comparisons.

I see that the current standard ways to summarize this sort of data are:

(a) Focus on a particular outcome and a particular comparison (choosing these ideally, though not usually, using preregistration), present that as the main finding and then tag all else as speculation.

Or, (b) Construct a story that seems consistent with the general pattern in the data, and then extract statistically significant or nonsignificant comparisons to support your case.

Plan (b) is what was done here, and I think it has problems: lots of stories can fit the data, and there’s a real push toward sweeping any anomalies aside.

For example, how do you think about that coefficient of 0.308 with standard error 0.080 for anxiety among the 6-11-year-olds? You can say it’s just bad luck with the data, or that the standard error calculation is only approximate and the real standard error should be higher, or that it’s some real effect caused by what was happening in Quebec in these years—but the trouble is that any of these explanations could be used just as well to explain the 0.234 with standard error 0.068 for 2-4-year-olds, which directly maps to one of their main findings.

Once you start explaining away anomalies, there’s just a huge selection effect in which data patterns you choose to take at face value and which you try to dismiss.

So maybe approach (a) is better—just pick one major outcome and go with it? But then you’re throwing away lots of data, that can’t be right.

I am unconvinced by the claims of Baker et al., but it’s not like I’m saying their paper is terrible. They have an identification strategy, and clean data, and some reasonable hypotheses. I just think their statistical analysis approach is not working. One trouble is that statistics textbooks tend to focus on stand-alone analyses—getting the p-value right, or getting the posterior distribution, or whatever, and not on how these conclusions fit into the big picture. And of course there’s lots of talk about exploratory data analysis, and that’s great, but EDA is typically not plugged into issues of modeling, data collection, and inference.

What to do?

OK, then. Let’s forget about the strengths and the weaknesses of the Baker et al. paper and instead ask, how should one evaluate a program like Quebec’s nearly-free preschool? I’m not sure. I’d start from the perspective of trying to learn what we can from what might well be ambiguous evidence, rather than trying to make a case in one direction or another. And lots of graphs, which would allow us to see more in one place, that’s much better than tables and asterisks. But, exactly what to do, I’m not sure. I don’t know whether the policy analysis literature features any good examples of this sort of exploration. I’d like to see something, for this particular example and more generally as a template for program evaluation.

3. Nearly 20 years on

So here’s the story. I heard about this work in 2016, from a press release issued in 2006; the article appeared in preprint form in 2005, was published in a top economics journal in 2008, and was based on data collected in the late 1990s. And here we are discussing it again in 2023.

It’s kind of beating a dead horse to discuss a 20-year-old piece of research, but you know what they say about dead horses. Also, according to Google Scholar, the article has 1616 citations, including 120 citations in 2023 alone, so, yeah, still worth discussing.

That said, not all the references refer to the substance of the paper. For example, the very first paper on Google Scholar’s list of citers is a review article, Explaining the Decline in the US Employment-to-Population Ratio, and when I searched to see what they said about this Canada paper (Baker, Gruber, and Milligan 2008), here’s what was there:

Additional evidence on the effects of publicly provided childcare comes from the province of Quebec in Canada, where a comprehensive reform adopted in 1997 called for regulated childcare spaces to be provided to all children from birth to age five at a price of $5 per day. Studies of that reform conclude that it had significant and long-lasting effects on mothers’ labor force participation (Baker, Gruber, and Milligan 2008; Lefebvre and Merrigan 2008; Haeck, Lefebvre, and Merrigan 2015). An important feature of the Quebec reform was its universal nature; once fully implemented, it made very low-cost childcare available for all children in the province. Nollenberger and Rodriguez-Planas (2015) find similarly positive effects on mothers’ employment associated with the introduction of universal preschool for three-year-olds in Spain.

They didn’t mention the bit about “the evidence suggests that children are worse off” at all! Indeed, they’re just kinda lumping this in with positive studies on “the effects of publicly provided childcare.” Yes, it’s true that this new article specifically refers to “similarly positive effects on mothers’ employment,” and that earlier paper, while negative about the effect of universal child care on kids, did say, “Maternal labor supply increases significantly.” Still, when it comes to sentiment analysis, that 2008 paper just got thrown into the positivity blender.

I don’t know how to think about this.

On one hand, I feel bad for Baker et al.: they did this big research project, they achieved the academic dream of publishing it in a top journal, it’s received over 1600 citations and remains relevant today—but, when it got cited, its negative message was completely lost! I guess they should’ve given their paper a more direct title. Instead of “Universal Child Care, Maternal Labor Supply, and Family Well‐Being,” they should’ve called it something like: “Universal Child Care: Good for Mothers’ Employment, Bad for Kids.”

On the other hand, for the reasons discussed above, I don’t actually believe their strong claims about the child care being bad for kids, so I’m kinda relieved that, even though the paper is being cited, some of its message has been lost. You win some, you lose some.

Concerns about misconduct at Harvard’s department of government and elsewhere

The post below addresses a bunch of specifics about Harvard, but for the key point, jump to the last paragraph of the post.

Problems at Harvard

A colleague pointed me to this post by Christopher Brunet, “The Curious Case of Claudine Gay,” and asked what I thought. It was interesting. I’ve met or corresponded with almost all the people involved, one way or another (sometimes just by email). Here’s my take:

– There’s a claim that Harvard political science professor Ryan Enos falsified a dataset. I looked at this a while ago. I thought I’d blogged it but I couldn’t find it in a Google search. There’s a pretty good summary by journalist Jesse Singal here. I corresponded with both Singal and Brunet on this one. As I wrote, “I’d say that work followed standard operating procedure of that era which indeed was to draw overly strong conclusions from quantitative data using forking paths.” I don’t think it’s appropriate to say that someone falsified data, just because they did an analysis that (a) had data issues and (b) came to a conclusion that doesn’t make you happy. Data issues come up all the time.

– There’s a claim that Gay “swept this [Enos investigation] under the rug” (see here). This reminds me of my complaint about the University of California not taking seriously the concerns about the publications of Matthew “Why We Sleep” Walker (see here). A common thread is that universities don’t like to discipline their tenured professors! Also, though, I wasn’t convinced by the claim that Enos committed research misconduct. The Walker case seems more clear to me. But, even with the Walker case, it’s possible to come up with excuses.

– There’s a criticism that Gay’s research record is thin. That could be. I haven’t read her papers carefully. I guess that a lot of academic administrators are known more for their administration than their research records. Brunet writes, “A prerequisite for being a Dean at Harvard is having a track record of research excellence.” I guess that’s the case sometimes, maybe not other times. Lee Bollinger did a lot as president of the University of Michigan and then Columbia, but I don’t think he’s known for his research. He published some law review articles once upon a time? Brunet refers to Gay as an “affirmative action case,” but that seems kind of irrelevant given that lots of white people reach academic heights without doing influential research.

– There’s a criticism of a 2011 paper by Dustin Tingley, which has the line, “Standard errors clustered at individual level and confidence intervals calculated using a parametric bootstrap running for 1000 iterations,” but Brunet says, “when you actually download his R code, there is no bootstrapping.” I guess, maybe? I clicked through and found the R code here, but I don’t know how the “zelig” package works. Brunet writes that Tingley “grossly misrepresented the research processes by claiming his reported SEs are bootstrap estimates clustered at the individual level. As per the Zelig documentation, no such bootstrapping functionality ever existed in his chosen probit regression package.” I googled and it seemed that zelig did have bootstrapping, but maybe not with clustering. I have no idea what’s going on here: it could be a misunderstanding of the software on Brunet’s part, a misunderstanding on Tingley’s part, or some statistical subtlety. I’m not really into this whole clustered standard errors thing anyway. My guess is that there was some confusion regarding what is a “bootstrap,” and it makes sense that a journalist coming at this from the outside might miss some things. The statistical analysis in this 2011 paper can be questioned, as is usually the case with anyone’s statistical analysis when they’re working on an applied research frontier. For example, from p.12 of Tingley’s paper: “Looking at the second repetition of the experiment, after which subjects had some experience with the strategic context, there was a significant difference in rejection rates across the treatments in the direction predicted by the model (51% when delta = 0.3 and 63% when delta = 0.7) (N = 396, t = 1.37, p = .08). Pooling offers of all sizes together I find reasonable support for Hypothesis 1 that a higher proportion of offers will be rejected, leading to both players paying a cost, when the shadow of the future was higher.” I’m not a fan of this sort of statistical-significance-based claim, including labeling p = .08 as “reasonable support” for a hypothesis, but this is business as usual in the social sciences.

– There’s a bunch of things about internal Harvard politics. I have zero knowledge one way or another regarding internal Harvard politics. What Brunet is saying there could be true, or maybe not, as he’s relying on various anonymous sources and other people with axes to grind. For example, he writes, “Gay did not recuse herself from investigating Enos. Rather, she used the opportunity to aggressively cover up his research misconduct.” I have no idea what standard policy is here. If she had recused herself, maybe she could be criticized for avoiding the topic. For another example, Brunet writes, “Claudine Gay allowed Michael Smith to get away scot-free in the Harvard-Epstein ties investigation — she came in and nicely whitewashed it all away. Claudine Gay has Epstein coverup stink on her, and Michael Smith has major Epstein stink on him,” and this could be a real issue, or it could just be a bunch of associations, as he doesn’t actually quote from the Harvard-Epstein ties investigation to which he refers. Regarding Jorge Dominguez: as Brunet says, the guy had been around for decades—indeed, I heard about his sexual harassment scandal back when I was a Ph.D. student at Harvard, I think it was in the student newspaper at the time, and I also remember being stunned, not so much that it happened, but that the political science faculty at the time just didn’t seem to care—so it’s kind of weird how first Brunet (rightly) criticizes the Government “department culture” that allowed a harasser to stay around for so long, and then he criticizes Smith for “protecting Dominguez” and criticizes Gay for being “partly responsible for having done nothing to address Dominguez’s abuses”—but then he also characterizes Smith as having “decided to throw [Dominguez] under the bus.” You can’t have it both ways! Responding to a decades-long harassment campaign is not “throwing someone under the bus.” Regarding Roland Fryer, Brunet quotes various politically-motivated people complimenting Fryer, which is fine—the guy did some influential research—but no context is added by referring to Fryer as “a mortal threat to some of the most powerful black people at Harvard” and referring to Gay as “a silky-smooth corporate operator.” Similarly, the Harvey Weinstein thing is something that can go both ways: if Gay criticizes a law professor who chooses to defend Weinstein, then she “was driven by pure spite. She is a petty and vicious little woman.” If she had supported the prof, I can see the argument the other way: so much corruption, she turns a blind eye to Epstein and then to Weinstein, why is she attacking Fryer but defending the law professor who is defending the “scumbag,” etc.

It’s everywhere

Here’s my summary. I think if you look carefully at just about any university social-science department, you’ll be likely to find some questionable work, some faculty who do very little research, and some administrators who specialize in administration rather than research, as well as lots and lots of empirical papers with data challenges and less than state-of-the-art statistical analyses. You also might well find some connections to funders who made their money in criminal enterprises, business and law professors who work for bad guys, and long-tolerated sexual harassers. I also expect you can find all of this in private industry and government; we just might not hear about it. Universities have a standard of openness that allows us to see the problems, in part because universities have lots of graduates who can spill the beans without fear of repercussions. Also, universities produce public documents. For example, the aforementioned Matthew Walker wrote Why We Sleep. The evidence of his research misconduct is right out there. In a government or corporate context, the bad stuff can stay inside internal documents.

Executive but no legislative and no judicial

There’s also the problem that universities, and corporations, have an executive branch but no serious legislative or judicial branches. I’ve seen a lot of cases of malfeasance within universities where nothing is done, or where whatever is done is too little, too late. I attribute much of this problem to the lack of legislative and judicial functions. Stepping back, we could think of this as a problem with pure utilitarianism. In a structural system of government, each institution plays a role. The role of the judicial system is to judge without concern about policy consequences. In the university (or a corporation), there is only the executive, and it’s hard for the executive to make a decision without thinking about consequences. Executives will accept malfeasance of all sorts because they decide that the cost of addressing the malfeasance is greater than the expected benefit. I’m not even just talking here about research misconduct, sexual harassment, or illegal activities by donors; other issues that arise range from misappropriation of grant money to violation of internal procedures to corruption in the facilities department.

To get back to research for a moment, there’s also the incentive structure that favors publication. Many years ago I had a colleague who showed me a paper he’d written that was accepted for publication in a top journal. I took a look and realized it had a fatal error–not a mathematical error, exactly, more of a conceptual error so that his method wasn’t doing what he was claiming it was doing. I pointed it out to him and said something like, “Hey, you just dodged a bullet–you almost published a paper that was wrong.” I assumed he’d contact the journal and withdraw the article. But, no, he just let the publication process go as scheduled: it gave him another paper on his C.V. And, back then, C.V.’s were a lot shorter; one publication could make a real difference! That’s just one story; the point is that, yes, of course a lot of fatally flawed work is out there.

So, yeah, pull up any institutional rock and you’re likely to find some worms crawling underneath. It’s good for people to pull up rocks! So, fair enough for Brunet to write these posts. And it’s good to have lots of people looking into these things, from all directions. The things that I don’t buy are his claims that there is clear research misconduct by Enos and Tingley, and his attempt to tie all these things together to Gay or to Harvard more generally. There’s a paper from 2014 with some data problems, a paper from 2011 by a different professor from the same (large) department that used some software that does bootstrapping, a professor in a completely different department who got donations from a criminal, a political science professor and an economics professor with sexual harassment allegations, a law professor who was defending a rich, well-connected rapist . . . and Brunet is criticizing Gay for being too lenient in some of these cases and too strict in others. Take anyone who’s an administrator at a large institution and you’ll probably find a lot of judgment calls.

Lots of dots

To put it another way, it’s fine to pick out a paper published in 2014 with data problems and a paper published in 2011 with methods that are not described in full detail. Without much effort it should be possible to find hundreds of examples from Harvard alone that are worse. Much worse. Here are just a few of the more notorious examples:

“Stereotype susceptibility: Identity salience and shifts in quantitative performance,” by Shih, Pittinsky, and Ambady (1999)

“This Old Stereotype: The Pervasiveness and Persistence of the Elderly Stereotype,” by Cuddy, Norton, and Fiske (2005)

“Rule learning by cotton-top tamarins,” by Hauser, Weiss, and Marcus (2006)

“Signing at the beginning makes ethics salient and decreases dishonest self-reports in comparison to signing at the end,” by Shu, Mazar, Gino, Ariely, and Bazerman (2012)

“Jesus said to them, ‘My wife…'”: A New Coptic Papyrus Fragment,” by King (2014)

“Physical and situational inequality on airplanes predicts air rage,” by DeCelles and Norton (2016).

“the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%” (not actually in a published paper, but in a press release featuring two Harvard professors from 2016)

“Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis,” by Mehra, Ruschitzka, and Patel (2020).

It’s not Brunet’s job, or mine, or anyone’s, to look at all these examples, and it’s fine for Brunet to focus on two much more ambiguous cases of problematic research papers. The point of my examples above is to put some of his connect-the-dots exercises into perspective. Harvard, and any other top university, will have hundreds of “dots”—bad papers, scandals, harassment, misconduct, etc.—that can be connected in many different ways.

A problem of quality control

We can see this as a problem of quality control. A large university is going to have some rate of iffy research, sexual harassment, tainted donations (and see here for a pointer to a horrible Harvard defense of that), faculty who work for bad people, etc., and it’s not really set up to handle this. Indeed, a top university such as Harvard or USC could well be more likely to have such problems: its faculty are more successful, so even their weak work could get publicity; its faculty are superstars, so they might be more likely to get away with sexual harassment (but it seems that even the non-tenure-track faculty at such places can be protected by the old boys’ network); top universities could be more likely to get big donations from rich criminals; and they could also have well-connected business and law professors who’d like to make money defending bad guys (back at the University of California we had a professor who was working for the O. J. Simpson defense team!). I’ve heard a rumor that top universities can even cheat on their college rankings. And, again, universities have no serious legislative or judicial institutions, so the administrators at any top university will find themselves dealing with an unending stream of complaints regarding research misconduct, sexual harassment, tainted donations, and questionable outside activities by faculty, not to mention everyday graft of various sorts. I’m pretty sure all this is happening in companies too; we just don’t usually hear so much about it. Regarding the case of Harvard’s political science department, I appreciate Brunet’s efforts to bring attention to various issues, even if I am not convinced by several of his detailed claims and am not at all convinced by his attempt to paint this all as a big picture involving Gay.

Why I like hypothesis testing (it’s another way to say “fake-data simulation”):

Following up on the post that appeared this morning . . . “Using simulation from the null hypothesis to study statistical artifacts” is another way of saying “hypothesis testing.” I like hypothesis testing in this sense—indeed, it’s all over the place in chapter 6 of BDA, and I do it in my research all the time. The goal of this sort of hypothesis testing (I usually call it “model checking” to distinguish it from the bad stuff that I don’t like) is to (a) see the ways in which the model does not fit the data, and (b) possibly learn that we have insufficient data to detect certain departures of interest from the model. The goal is not to “reject” the null hypothesis, which I already know is false, and there’s no “type 1 and type 2 error.”

When apparently interesting features in data can be explained as unsurprising chance results conditional on a null hypothesis (here’s another example), that tells us something, not something about general “reality” or, as we say in statistics, the “population,” but something about the limitations of a statistical claim that’s being made.

That’s why I often say that hypothesis tests are most useful when they don’t reject: it’s not that non-rejection is evidence that the null hypothesis is true; rather, it tells us something about what we can’t reliably learn from the available data and model.
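To make this concrete, here’s a minimal sketch of the kind of null-hypothesis simulation I have in mind, using a toy example made up just for illustration: is the longest streak in a binary sequence surprising under a model of independent fair coin flips?

longest_run_of_1s <- function(x) {
  r <- rle(x)
  max(c(0, r$lengths[r$values == 1]))
}

set.seed(123)
y <- rbinom(100, 1, 0.5)        # stand-in for the observed data
T_obs <- longest_run_of_1s(y)

# simulate replicated data under the null model and compute the same statistic
T_rep <- replicate(10000, longest_run_of_1s(rbinom(100, 1, 0.5)))

# how often does the null model produce a streak at least as long as the observed one?
mean(T_rep >= T_obs)

A large value here says the apparent streak is unsurprising under the null model; a small value shows one specific way in which the model does not fit the data. Either way, the point is what we learn about the model and the data, not “rejection.”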

And remember, I really really really like fake-data simulation, and I can’t stop talking about it (see also here).

On a proposal to scale confidence intervals so that their overlap can be more easily interpreted

Greg Mayer writes:

Have you seen this paper by Frank Corotto, recently posted to a university depository?

It advocates a way of doing box plots using “comparative confidence intervals” based on Tukey’s HSD in lieu of traditional error bars. I would question whether the “Error Bar Overlap Myth” is really a myth (i.e. a widely shared and deeply rooted but imaginary way of understanding the world) or just a more or less occasional misunderstanding, but whatever its frequency, I thought you might be interested, given your longstanding aversion to box plots, and your challenge to the world to find a use for them. (I, BTW, am rather fond of box plots.)

My reply: Clever but I can’t imagine ever using this method or recommending it to others. The abstract connects the idea to Tukey, and indeed the method reminds me of some of Tukey’s bad ideas from the 1950s involving multiple comparisons. I think the problem here is in thinking of “statistical significance” as a goal in the first place!

I’m not saying it was a bad idea for this paper to be written. The concept could be worth thinking about, even if I would not recommend it as a method. Not every idea has to be useful. Interesting is important too.

(back to basics:) How is statistics relevant to scientific discovery?

Following up on today’s post, “Why I continue to support the science reform movement despite its flaws,” it seems worth linking to this post from 2019, about the way in which some mainstream academic social psychologists have moved beyond denial, to a more realistic view that accepts that failure is a routine, indeed inevitable part of science, and that, just because a claim is published, even in a prestigious journal, that doesn’t mean it has to be correct:

Once you accept that the replication rate is not 100%, nor should it be, and once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists whom academic insiders used to refer to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that ultimately would be discovered by someone else sometime in the future, thus, the speeding along of a process that we’d hope would happen anyway, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery. . . .

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. . . .

Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start. . . .

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making.

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work. . . .

We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

– Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

– React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

– Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

– Avoid the two-tier system. Give respect to a student project or Arxiv paper just as you would to a paper published in Science or Nature.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

There’s more at the link, and also let me again plug my recent article, Before data analysis: Additional recommendations for designing experiments to learn about the world.

Oooh, I’m not gonna touch that tar baby!

Someone pointed me to a controversial article written a couple years ago. The article remains controversial. I replied that it’s a topic that I’ve not followed in any detail and that I’ll just defer to the experts. My correspondent pointed to some serious flaws in the article and asked that I link to it here on the blog. He wrote, “I was unable to find any peer responses to it. Perhaps the discussants on your site will have some insights.”

My reply is the title of this post.

P.S. Not enough information is given in this post to figure out what is the controversial article here, so please don’t post guesses in the comments! Thank you for understanding.

Simulations of measurement error and the replication crisis: Update

Last week we ran a post, “Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?”, reporting some questions that neuroscience student Federico D’Atri asked about a paper that Eric Loken and I wrote a few years ago. It’s one of my favorite papers so it was good to get feedback on it. D’Atri had run a simulation and had some questions, and in my post I shared some old code from the paper. Eric and I then looked into it. We discussed with D’Atri and here’s what we found:

1. No, Loken and I did not have a mistake in our paper. (More on that below.)

2. The code I posted on the blog was not the final code for our paper. Eric had made the final versions. From Eric, here’s:
(a) the final code that we used to make the figures in the paper, where we looked at regression slopes, and
(b) a cleaned version of the code I’d posted, where we looked at correlations.
The code I posted last week was something in my files, but it was not the final version of the code, hence the confusion about what was being conditioned on in the analysis.

Regarding the code, Eric reports:

All in all we get the same results whether it’s correlations or t-tests of slopes. At small samples, and for small effects, the majority of the stat sig cors/slopes/t-tests are larger in the error than the non error (when you compare them paired). The graph’s curve does pop up through 0.5 and higher. It’s a lot higher if r = 0.08, and it’s not above 50% if the r is 0.4. It does require a relatively small effect, but we also have .8 reliability.

3. Some interesting questions remain. Federico writes:

I don’t think there’s an error in the code used to produce the graphs in the paper; rather, I personally find that certain sentences in the paper may lead to some misunderstandings. I also concur with the main point made in their paper, that large estimates obtained from a small sample in high-noise conditions should not be trusted, and I believe they do a good job of delivering this message.

What Andrew and Eric show is the proportion of larger correlations achieved when noisy measurements are selected for statistical significance, compared to the estimate one would obtain in the same scenario without measurement error and without selecting for statistical significance. What I had initially thought was that there was an equal level of selection for statistical significance applied to both scenarios. They essentially show that under conditions of insufficient power to detect the true underlying effect, doing enough selection based on statistical significance, can produce an overestimation much higher than the attenuation caused by measurement error.

This seems quite intuitive to me, and I would like to clarify it with an example. Consider a true underlying correlation in ideal conditions of 0.15 and a sample size of N = 25, and the extreme scenario where measurement error is infinite (in which case the noisy x and y will be uncorrelated). In this case, the measurements of x and y under ideal conditions will be totally uncorrelated with those obtained under noisy conditions, and hence so will the correlation estimates in the two different scenarios. If I select for significance the correlations obtained under noisy conditions, I am only looking at correlations greater than 0.38 (for α = 0.05, two-tailed test), which I’ll be comparing to an average correlation of 0.15, since the two estimates are completely unrelated. It is clear then that the first estimate will almost always be greater than the second. The greater the noise, the more uncorrelated the correlation estimates obtained in the two different scenarios become, making it less likely that obtaining a large estimate in one case would also result in a large estimate in the other case.

My criticism is not about the correctness of the code (which is correct as far as I can see), but rather how relevant this scenario is in representing a real situation. Indeed, I believe it is very likely that the same hypothetical researchers who made selections for statistical significance in ‘noisy’ measurement conditions would also select for significance in ideal measurement conditions, and in that case they would obtain an even higher frequency of effect overestimation when selecting for statistical significance (once also selecting on the direction of the true effect), as well as a greater ease in achieving statistically significant results.

However, I think it could be possible that in research environments where measurement error is greater (and isn’t modeled), there might be an incentive, or a greater co-occurrence, of selection for statistical significance and poor research practices. Without evidence of this, though, I find it more interesting to compare the two scenarios assuming similar selection criteria.

Also, I’m aware that in situations deviating from the simple assumptions of the case we are considering here (simple correlation between x and y and uncorrelated measurement errors), complexities can arise. For example, as you probably know better than me, in multiple regression scenarios where two predictors, x1 and x2, are correlated and their measurement errors are also correlated (which can occur with certain types of measures, such as self-reporting where individuals prone to overestimating x1 may also tend to overestimate x2), and only x1 is correlated with y, there is an inflation of Type I error for x2, and asymptotically β2 is biased away from zero.

Eric adds:

Glad we resolved the initial confusion about our article’s main point and associated code. When you [Federico] first read our article, you were interested in different questions than the one we covered. It’s a rich topic, with lots of work to be done, and you seem to have several ideas. Our article addressed the situation where someone might acknowledge measurement error, but then say “my finding is all the more impressive because if not for the measurement error I would have found an even bigger effect.” We target the intuition that if a dataset could be made error free by waving a wand, that the data would necessarily show a larger correlation. Of course the “iron law” (attenuation) holds in large samples. Unsurprisingly, however, in smaller samples, data with measurement error can have a larger realized correlation. And after conditioning on the statistical significance of the observed correlations, a majority of them could be larger than the corresponding error free correlation. We treated the error free effect (the “ideal study”) as the counterfactual (“if only I had no error in my measurements”), and thus filtered on the statistical significance of the observed error prone correlations. When you tried to reproduce that graph, you applied the filter differently, but you now find that what we did was appropriate for the question we were answering.

By the way, we deliberately kept the error modest. In our scenario, the x and y values have about 0.8 reliability—widely considered excellent measurement. I agree that if the error grows wildly, as with your hypothetical case, then the observed values are essentially uncorrelated with the thing being measured. Our example though was pretty realistic—small true effect, modest measurement error, range of sample sizes. I can see though that there are many factors to explore.

Different questions are of interest in different settings. One complication is that, when researchers say things like, “Despite limited statistical power . . .” they’re typically not recognizing that they have been selecting on statistical significance. In that way, they are comparing to the ideal setting with no selection.

And, for reasons discussed in my original paper with Eric, researchers often don’t seem to think about measurement error at all! They have the (wrong) impression that having a “statistically significant” result gives retroactive assurance that their signal-to-noise ratio is high.

That’s what got us so frustrated to start with: not just that noisy studies get published all the time, but that many researchers seem to not even realize that noise can be a problem. We’ve had lots and lots of correspondence with researchers who seem to feel that if they’ve found a correlation between X and Y, where X is some super-noisy measurement with some connection to theoretical concept A, and Y is some super-noisy measurement with some connection to theoretical concept B, then they’ve proved that A causes B, or that they’ve discovered some general connection between A and B.
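If you want to play with this yourself, here’s a minimal R sketch of the setup described above. It’s my own stripped-down version, not the code from our paper, and the particular numbers (true correlation 0.15, N = 25, reliability 0.8) are just illustrative:

set.seed(1)
n_sims <- 10000
N <- 25        # sample size for each simulated study
rho <- 0.15    # true correlation between the error-free x and y
rel <- 0.8     # reliability of each noisy measurement

r_true <- r_obs <- p_obs <- numeric(n_sims)
for (s in 1:n_sims) {
  x <- rnorm(N)
  y <- rho*x + sqrt(1 - rho^2)*rnorm(N)              # error-free variables
  x_noisy <- sqrt(rel)*x + sqrt(1 - rel)*rnorm(N)    # add measurement error
  y_noisy <- sqrt(rel)*y + sqrt(1 - rel)*rnorm(N)
  r_true[s] <- cor(x, y)
  test <- cor.test(x_noisy, y_noisy)
  r_obs[s] <- test$estimate
  p_obs[s] <- test$p.value
}

# condition on the noisy correlation being "statistically significant" and positive
sig <- p_obs < 0.05 & r_obs > 0
mean(sig)                        # how often that happens
mean(r_obs[sig] > r_true[sig])   # how often the noisy estimate exceeds the error-free one

With settings like these, the last line should come out above 0.5, consistent with Eric’s summary above: among the simulated studies that reach significance, the error-prone estimate is usually larger than the corresponding error-free correlation from the same study. Changing rho, N, or the reliability shows how the answer depends on those choices.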

So, yeah, we encourage further research in this area.

Hey! Here are some amazing articles by George Box from around 1990. Also there’s some mysterious controversy regarding his center at the University of Wisconsin.

The webpage is maintained by John Hunter, son of Box’s collaborator William Hunter, and I came across it because I was searching for background on the paper-helicopter example that we use in our classes to teach principles of experimental design and data analysis.

There’s a lot to say about the helicopter example and I’ll save that for another post.

Here I just want to talk about how much I enjoyed reading these thirty-year-old Box articles.

A Box Set from 1990

Many of the themes in those articles continue to resonate today. For example:

• The process of learning. Here’s Box from his 1995 article, “Total Quality: Its Origins and its Future”:

Scientific method accelerated that process in at least three ways:

1. By experience in the deduction of the logical consequences of the group of facts each of which was individually known but had not previously been brought together.

2. By the passive observation of systems already in operation and the analysis of data coming from such systems.

3. By experimentation – the deliberate staging of artificial experiences which often might ordinarily never occur.

A misconception is that discovery is a “one shot” affair. This idea dies hard. . . .

• Variation over time. Here’s Box from his 1989 article, “Must We Randomize Our Experiment?”:

We all live in a non-stationary world; a world in which external factors never stay still. Indeed the idea of stationarity – of a stable world in which, without our intervention, things stay put over time – is a purely conceptual one. The concept of stationarity is useful only as a background against which the real non-stationary world can be judged. For example, the manufacture of parts is an operation involving machines and people. But the parts of a machine are not fixed entities. They are wearing out, changing their dimensions, and losing their adjustment. The behavior of the people who run the machines is not fixed either. A single operator forgets things over time and alters what he does. When a number of operators are involved, the opportunities for change because of failures to communicate are further multiplied. Thus, if left to itself any process will drift away from its initial state. . . .

Stationarity, and hence the uniformity of everything depending on it, is an unnatural state that requires a great deal of effort to achieve. That is why good quality control takes so much effort and is so important. All of this is true, not only for manufacturing processes, but for any operation that we would like to be done consistently, such as the taking of blood pressures in a hospital or the performing of chemical analyses in a laboratory. Having found the best way to do it, we would like it to be done that way consistently, but experience shows that very careful planning, checking, recalibration and sometimes appropriate intervention, is needed to ensure that this happens.

Here’s an example, from Box’s 1992 article, “How to Get Lucky”:

For illustration Figure 1(a) shows a set of data designed to seek out the source of unacceptably large variability which, it was suspected, might be due to small differences in five, supposedly identical, heads on a machine. To test this idea, the engineer arranged that material from each of the five heads was sampled at roughly equal intervals of time in each of six successive eight-hour periods. . . . the same analysis strongly suggested that real differences in means occurred between the six eight-hour periods of time during which the experiment was conducted. . . .

• Workflow. Here’s Box from his 1999 article, “Statistics as a Catalyst to Learning by Scientific Method Part II-Discussion”:

Most of the principles of design originally developed for agricultural experimentation would be of great value in industry, but most industrial experimentation differed from agricultural experimentation in two major respects. These I will call immediacy and sequentiality.

What I mean by immediacy is that for most of our investigations the results were available, if not within hours, then certainly within days and in rare cases, even within minutes. This was true whether the investigation was conducted in a laboratory, a pilot plant or on the full scale. Furthermore, because the experimental runs were usually made in sequence, the information obtained from each run, or small group of runs, was known and could be acted upon quickly and used to plan the next set of runs. I concluded that the chief quarrel that our experimenters had with using “statistics” was that they thought it would mean giving up the enormous advantages offered by immediacy and sequentiality. Quite rightly, they were not prepared to make these sacrifices. The need was to find ways of using statistics to catalyze a process of investigation that was not static, but dynamic.

There’s lots more. It’s funny to read these things that Box wrote back then, that I and others have been saying over and over again in various informal contexts, decades later. It’s a problem with our statistical education (including my own textbooks) that these important ideas are buried.

More Box

A bunch of articles by Box, with some overlap but not complete overlap with the above collection, is at the site of the University of Wisconsin, where he worked for many years. Enjoy.

Some kinda feud is going on

John Hunter’s page also has this:

The Center for Quality and Productivity Improvement was created by George Box and Bill Hunter at the University of Wisconsin-Madison in 1985.

In the first few years reports were published by leading international experts including: W. Edwards Deming, Kaoru Ishikawa, Peter Scholtes, Brian Joiner, William Hunter and George Box. William Hunter died in 1986. Subsequently excellent reports continued to be published by George Box and others including: Gipsie Ranney, Soren Bisgaard, Ron Snee and Bill Hill.

These reports were all available on the Center’s web site. After George Box’s death the reports were removed. . . .

It is a sad situation that the Center abandoned the ideas of George Box and Bill Hunter. I take what has been done to the Center as a personal insult to their memory. . . .

When diagnosed with cancer my father dedicated his remaining time to creating this center with George to promote the ideas George and he had worked on throughout their lives: because it was that important to him to do what he could. They did great work and their work provided great benefits for long after Dad’s death with the leadership of Bill Hill and Soren Bisgaard but then it deteriorated. And when George died the last restraint was eliminated and the deterioration was complete.

Wow. I wonder what the story was. I asked someone I know who works at the University of Wisconsin and he had no idea. Box died in 2013 so it’s not so long ago; there must be some people who know what happened here.

“You need 16 times the sample size to estimate an interaction than to estimate a main effect,” explained

This has come up before here, and it’s also in Section 16.4 of Regression and Other Stories (chapter 16: “Design and sample size decisions,” Section 16.4: “Interactions are harder to estimate than main effects”). But there was still some confusion about the point so I thought I’d try explaining it in a different way.

The basic reasoning

The “16” comes from the following four statements:

1. When estimating a main effect and an interaction from balanced data using simple averages (which is equivalent to least squares regression), the estimate of the interaction has twice the standard error as the estimate of a main effect.

2. It’s reasonable to suppose that an interaction will have half the magnitude of a main effect.

3. From 1 and 2 above, we can suppose that the true effect size divided by the standard error is 4 times higher for the interaction than for the main effect.

4. To achieve any desired level of statistical power for the interaction, you will need 4^2 = 16 times the sample size that you would need to attain that level of power for the main effect.

Statements 3 and 4 are unobjectionable. They somewhat limit the implications of the “16” statement, which does not in general apply to Bayesian or regularized estimates, nor does it consider goals other than statistical power (equivalently, the goal of estimating an effect to a desired relative precision). I don’t consider these limitations a problem; rather, I interpret the “16” statement as relevant to that particular set of questions, in the way that the application of any mathematical statement is conditional on the relevance of the framework under which it can be proved.

Statements 1 and 2 are a bit more subtle. Statement 1 depends on what is considered a “main effect,” and statement 2 is very clearly an assumption regarding the applied context of the problem being studied.

First, statement 1. Here’s the math for why the estimate of the interaction has twice the standard error of the estimate of the main effect. The scenario is an experiment with N people, of which half get treatment 1 and half get treatment 0, so that the estimated main effect is ybar_1 – ybar_0, comparing average under treatment and control. We further suppose the population is equally divided between two sorts of people, a and b, and half the people in each group get each treatment. Then the estimated interaction is (ybar_1a – ybar_0a) – (ybar_1b – ybar_0b).

The estimate of the main effect, ybar_1 – ybar_0, has standard error sqrt(sigma^2/(N/2) + sigma^2/(N/2)) = 2*sigma/sqrt(N); for simplicity I’m assuming a constant variance within groups, which will typically be a good approximation for binary data, for example. The estimate of the interaction, (ybar_1a – ybar_0a) – (ybar_1b – ybar_0b), has standard error sqrt(sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4)) = 4*sigma/sqrt(N). I’m assuming that the within-cell standard deviation does not change after we’ve divided the population into 4 cells rather than 2; this is not exactly correct (to the extent that the effects are nonzero, we should expect the within-cell standard deviations to get smaller as we subdivide), but it is common in applications for the within-cell standard deviation to be essentially unchanged after adding the interaction. This is equivalent to saying that you can add an important predictor without the R-squared going up much, and it’s the usual story in research areas such as psychology, public opinion, and medicine, where individual outcomes are highly variable and so we look for effects in averages.

The biggest challenge with the reasoning in the above two paragraphs is not the bit about sigma being smaller when the cells are subdivided (this is typically a minor concern, and it’s easy enough to account for if necessary), nor is it the definition of interaction. Rather, the challenge comes, perhaps surprisingly, from the definition of main effect.

Above I define the “main effect” as the average treatment effect in the population, which seems reasonable enough. There is an alternative, though. You could also define the main effect as the average treatment effect in the baseline category. In the notation above, the main effect would then be defined as ybar_1a – ybar_0a. In that case, the standard error of the estimated interaction is only sqrt(2) times the standard error of the estimated main effect.

Typically I’ll frame the main effect as the average effect in the population, but there are some settings where I’d frame it as the average effect in the baseline category. It depends on how you’re planning to extrapolate the inferences from your model. The important thing is to be clear in your definition.
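Before moving on, here’s a quick simulation to check the factor of 2 in statement 1. It’s just a sanity check of my own, with arbitrary values N = 400 and sigma = 10, and zero true effects (which don’t matter here, since the standard errors of these differences of means don’t depend on the means):

set.seed(1)
N <- 400
sigma <- 10
sims <- replicate(10000, {
  z <- rep(c(0, 1), each = N/2)              # treatment indicator
  g <- rep(rep(c("a", "b"), each = N/4), 2)  # group, balanced within each treatment
  y <- rnorm(N, 0, sigma)
  main <- mean(y[z == 1]) - mean(y[z == 0])
  inter <- (mean(y[z == 1 & g == "a"]) - mean(y[z == 0 & g == "a"])) -
           (mean(y[z == 1 & g == "b"]) - mean(y[z == 0 & g == "b"]))
  c(main = main, inter = inter)
})
sd(sims["main", ])   # approx 2*sigma/sqrt(N) = 1.0
sd(sims["inter", ])  # approx 4*sigma/sqrt(N) = 2.0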

Now on to statement 2. I’m supposing an interaction that is half the magnitude of the main effect. For example, if the main effect is 20 and the interaction is 10, that corresponds to an effect of 25 in group a and 15 in group b. To me, that’s a reasonable baseline: the treatment effect is not constant but it’s pretty stable, which is kinda what I think about when I hear “main effect.”

But there are other possibilities. Suppose that the effect is 30 in group a and 10 in group b, so the effect is consistently positive but now varies by a factor of 3 across the two groups. In this case, the main effect is 20 and the interaction is 20. The main effect and the interaction are of equal size, and so you only need 4 times the sample size to estimate the interaction as to estimate the main effect.

Or suppose the effect is 40 in group a and 0 in group b. Then the main effect is 20 and the interaction is 40, and in that case you need the same sample size to estimate the main effect as to estimate the interaction. This can happen! In such a scenario, I don’t know that I’d be particularly interested in the “main effect”—I think I’d frame the problem in terms of effect in group a and effect in group b, without any particular desire to average over them. It will depend on context.

Why this is important

Before going on, let me copy something from our earlier post explaining the importance of this result: From the statement of the problem, we’ve assumed the interaction is half the size of the main effect. If the main effect is 2.8 on some scale with a standard error of 1 (and thus can be estimated with 80% power; see for example page 295 of Regression and Other Stories, where we explain why, for 80% power, the true value of the parameter must be 2.8 standard errors away from the comparison point), and the interaction is 1.4 with a standard error of 2, then the z-score of the interaction has a mean of 0.7 and a sd of 1, and the probability of seeing a statistically significant effect difference is pnorm(0.7, 1.96, 1) = 0.10. That’s right: if you have 80% power to estimate the main effect, you have 10% power to estimate the interaction.

And 10% power is really bad. It’s worse than it looks. 10% power kinda looks like it might be OK; after all, it still represents a 10% chance of a win. But that’s not right at all: if you do get “statistical significance” in that case, your estimate is a huge overestimate:

> raw <- rnorm(1e6, .7, 1)
> significant <- raw > 1.96
> mean(raw[significant])
[1] 2.4

So, the 10% of results which do appear to be statistically significant give an estimate of 2.4, on average, which is over 3 times higher than the true effect.

So, yeah, you don’t want to be doing studies with 10% power, which implies that when you’re estimating that interaction, you have to forget about statistical significance; you need to just accept the uncertainty.

Explaining using a 2 x 2 table

Now to return to the main-effects-and-interactions thing:

One way to look at all this is by framing the population as a 2 x 2 table, showing the averages among control and treated conditions within groups a and b:

           Control  Treated  
Group a:  
Group b:  

For example, here’s an example where the treatment has a main effect of 20 and an interaction of 10:

           Control  Treated  
Group a:     100      115
Group b:     150      175

In this case, there’s a big “group effect,” not necessarily causal (I had vaguely in mind a setting where “Group” is an observational factor and “Treatment” is an experimental factor), but still a “main effect” in the sense of a linear model. Here, the main effect of group is 55. For the issues we’re discussing here, the group effect doesn’t really matter, but we need to specify something here in order to fill in the table.

If you’d prefer, you can set up a “null” setting where the two groups are identical, on average, under the control condition:

           Control  Treated  
Group a:     100      115
Group b:     100      125

Again, each of the numbers in these tables represents the population average within the four cells, and “effects” and “interactions” correspond to various averages and differences of the four numbers. We’re further assuming a balanced design with equal sample sizes and equal variances within each cell.

What would it look like if the interaction were twice the size of the main effect, for example a main effect of 20 and an interaction of 40? Here’s one possibility of the averages within each cell:

           Control  Treated  
Group a:     100      100
Group b:     100      140

If that’s what the world is like, then indeed you need exactly the same sample size (that is, the total sample size in the four cells) to estimate the interaction as to estimate the main effect.

When using regression with interactions

To reproduce the above results using linear regression, you’ll want to code the Group and Treatment variables on a {-0.5, 0.5} scale. That is, Group = -0.5 for a and +0.5 for b, and Treatment = -0.5 for control and +0.5 for treatment. That way, the main effect of each variable corresponds to the other variable equaling zero (thus, the average of a balanced population), and the interaction corresponds to the difference of treatment effects, comparing the two groups.

Alternatively we could code each variable on a {-1, 1} scale, in which case the main effects are divided by 2 and the interaction is divided by 4, but the standard errors are also divided in the same way, so the z-scores don’t change, and you still need the same multiple of the sample size to estimate the interaction as to estimate the main effect.

Or we could code each variable as {0, 1}, in which case, as discussed above, the main effect for each predictor is then defined as the effect of that predictor when the other predictor equals 0.
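To see the standard errors directly, here’s a little regression version of the same check, using the {-0.5, 0.5} coding and the toy numbers from the first table above (cell means 100, 115, 150, 175, so main effect 20, interaction 10, group effect 55); the residual sd of 10 and N = 400 are arbitrary choices of mine:

set.seed(1)
N <- 400
treat <- rep(c(-0.5, 0.5), each = N/2)
group <- rep(rep(c(-0.5, 0.5), each = N/4), 2)
y <- 135 + 20*treat + 55*group + 10*treat*group + rnorm(N, 0, 10)
fit <- lm(y ~ treat*group)
summary(fit)$coefficients[, c("Estimate", "Std. Error")]
# the standard error for treat:group comes out about twice the standard
# error for treat (or for group), matching the calculation above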

Why do I make the default assumptions that I do in the above analyses?

The scenario I have in mind is studies in psychology or medicine where a and b are two groups of the population, for example women and men, or young and old people, and researchers start with a general idea, a “main effect,” but there is also interest in how these effects vary, that is, “interactions.” In my scenario, neither a nor b is a baseline, and so it makes sense to think of the main effect as some sort of average (which, as discussed here, can take many forms).

In the world of junk science, interactions represent a way out, a set of forking paths that allow researchers to declare a win in settings where their main effect does not pan out. Three examples we’ve discussed to death in this space are the claim of an effect of fat arms on men’s political attitudes (after interacting with parental SES), an effect of monthly cycle on women’s political attitudes (after interacting with partnership status), and an effect of monthly cycle on women’s clothing choices (after interacting with weather). In all these examples, the main effect was the big story and the interaction was the escape valve. The point of “You need 16 times the sample size to estimate an interaction than to estimate a main effect” is not to say that researchers shouldn’t look for interactions or that they should assume interactions are zero; rather, the point is that they should not be looking for statistically-significant interactions, given that their studies are, at best, barely powered to estimate main effects. Thinking about interactions is all about uncertainty.

In more solid science, interactions also come up: there are good reasons to think that certain treatments will be more effective on some people and in some scenarios. Again, though, in a setting where you’re thinking of interactions as variations on a theme of the main effect, your inferences for interactions will be highly uncertain, and the “16” advice should be helpful both in design and analysis.

Summary

In a balanced experiment, when the treatment effect is 15 in Group a and 25 in Group b (that is, the main effect is twice the size of the interaction), the estimate of the interaction will have twice the standard error as the estimate of the main effect, and so you’d need a sample size of 16*N to estimate the interaction at the same relative precision as you can estimate the main effect from the same design but with a sample size of N.

With other scenarios of effect sizes, the result is different. If the treatment effect is 10 in Group a and 30 in Group b, you’d need 4 times the sample size to estimate the interaction as to estimate the main effect. If the treatment effect is 0 in Group a and 40 in Group b, you’d need equal sample sizes.

Hydrology Corner: How to compare outputs from two models, one Bayesian and one non-Bayesian?

Zac McEachran writes:

I am a Hydrologist and Flood Forecaster at the National Weather Service in the Midwest. I use some Bayesian statistical methods in my research work on hydrological processes in small catchments.

I recently came across a project that I want to use a Bayesian analysis for, but I am not entirely certain what to look for to get going on this. My issue: NWS uses a protocol for calibrating our river models using a mixed conceptual/physically-based model. We want to assess whether a new calibration is better than an old calibration. This seems like a great application for a Bayesian approach. However, a lot of the literature I am finding (and methods I am more familiar with) are associated with assessing goodness-of-fit and validation for models that were fit within a Bayesian framework, and then validated in a Bayesian framework. I am interested in assessing how a non-Bayesian model output compares with another non-Bayesian model output with respect to observations. Someday I would like to learn to use Bayesian methods to calibrate our models but one step at a time!

My response: I think you need somehow to give a Bayesian interpretation to your non-Bayesian model output. This could be as simple as taking 95% prediction intervals and interpreting them as 95% posterior intervals from a normally-distributed posterior. Or if the non-Bayesian fit only gives point estimates, then do some bootstrapping or something to get an effective posterior. Then you can use external validation or cross validation to compare the predictive distributions of your different models, as discussed here; also see Aki’s faq on cross validation.
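Here’s a minimal sketch of what that comparison could look like in R, under the simple normal interpretation suggested above. Everything named here is hypothetical: y_holdout stands for a vector of held-out observations, and pred_old and pred_new stand for data frames holding each calibration’s point predictions and 95% prediction bounds.

# interpret a point prediction plus 95% interval as a normal predictive distribution
normal_lpd <- function(y, point, lower, upper) {
  sd_approx <- (upper - lower)/(2*1.96)   # width of a 95% normal interval
  sum(dnorm(y, mean = point, sd = sd_approx, log = TRUE))
}

lpd_old <- normal_lpd(y_holdout, pred_old$point, pred_old$lower, pred_old$upper)
lpd_new <- normal_lpd(y_holdout, pred_new$point, pred_new$lower, pred_new$upper)
lpd_new - lpd_old   # positive values favor the new calibration

For a skewed quantity such as streamflow it might make more sense to do this on the log scale, and a fuller cross validation would repeat the comparison over multiple held-out periods rather than a single one.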

A Hydrologist and Flood Forecaster . . . how cool is that?? Last time we had this level of cool was back in 2009 when we were contacted by someone who was teaching statistics to firefighters.

We were gonna submit something to Nature Communications, but then we found out they were charging $6290 for publication. For that amount of money, we could afford 37% of an invitation to a conference featuring Grover Norquist, Gray Davis, and a rabbi, or 1/160th of the naming rights for a sleep center at the University of California, or 4735 Jamaican beef patties.

My colleague and I wrote a paper, and someone suggested we submit it to the journal Nature Communications. Sounds fine, right? But then we noticed this:

Hey! We wrote the damn article, right? They should be paying us to publish it, not the other way around. Ok, processing fees yeah yeah, but $6290??? How much labor could it possibly take to publish one article? This makes no damn sense at all. I guess part of that $6290 goes to paying for that stupid website where they try to con you into paying several thousand dollars to put an article on their website that you can put on Arxiv for free.

Ok, then the question arises: What else could we get for that $6290? A trawl through the blog archive gives some possibilities:

– 37% of an invitation to a conference featuring Grover Norquist, Gray Davis, and a rabbi

– 1/160th of the naming rights for a sleep center at the University of California

– 4735 Jamaican beef patties

I guess that, among all these options, the Nature Communications publication would do the least damage to my heart. Still, I couldn’t quite bring myself to commit to forking over $6290. So we’re sending the paper elsewhere.

At this point I’m still torn between the other three options. 4735 Jamaican beef patties sounds good, but 1/160th of a sleep center named just for me, that would be pretty cool. And 37% of a chance to meet Grover Norquist, Gray Davis, and a rabbi . . . that’s gotta be the most fun since Henry Kissinger’s 100th birthday party. (Unfortunately I was out of town for that one, but I made good use of my invite: I forwarded it to Kissinger superfan Cass Sunstein, and it seems he had a good time, so nothing was wasted.) So don’t worry, that $6290 will go to a good cause, one way or another.

Postdoc on Bayesian methodological and applied work! To optimize patient care! Using Stan! In North Carolina!

Sam Berchuck writes:

I wanted to bring your attention to a postdoc opportunity in my group at Duke University in the Department of Biostatistics & Bioinformatics. The full job ad is here: https://forms.stat.ufl.edu/statistics-jobs/entry/10978/.

The postdoc will work on Bayesian methodological and applied work, with a focus on modeling complex longitudinal biomedical data (including electronic health records and mobile health data) to create data-driven approaches to optimize patient care among patients with chronic diseases. The position will be particularly interesting to people interested in applying Bayesian statistics in real-world big data settings. We are looking for people who have experience in Bayesian inference techniques, including Stan!

Interesting. In addition to the Stan thing, I’m interested in data-driven approaches to optimize patient care. This is an area where a Bayesian approach, or something like it, is absolutely necessary, as you typically just won’t have enough data to make firm conclusions about individual effects, so you have to keep track of uncertainty. Sounds like a wonderful opportunity.

Bloomberg News makes an embarrassing calibration error

Palko points to this amusing juxtaposition:

I was curious so I googled to find the original story, “Forecast for US Recession Within Year Hits 100% in Blow to Biden,” by Josh Wingrove, which begins:

A US recession is effectively certain in the next 12 months in new Bloomberg Economics model projections . . . The latest recession probability models by Bloomberg economists Anna Wong and Eliza Winger forecast a higher recession probability across all timeframes, with the 12-month estimate of a downturn by October 2023 hitting 100% . . .

I did some further googling but could not find any details of the model. All I could find was this:

With probabilities that jump around this much, you can expect calibration problems.

This is just a reminder that for something to be a probability, it’s not enough that it be a number between 0 and 1. Real-world probabilities don’t exist in isolation; they are ensnared in a web of interconnections. Recall our discussion from last year:

Justin asked:

Is p(aliens exist on Neptune that can rap battle) = .137 valid “probability” just because it satisfies mathematical axioms?

And Martha sagely replied:

“p(aliens exist on Neptune that can rap battle) = .137” in itself isn’t something that can satisfy the axioms of probability. The axioms of probability refer to a “system” of probabilities that are “coherent” in the sense of satisfying the axioms. So, for example, the two statements

“p(aliens exist on Neptune that can rap battle) = .137” and “p(aliens exist on Neptune) = .001”

are incompatible according to the axioms of probability, because the event “aliens exist on Neptune that can rap battle” is a sub-event of “aliens exist on Neptune”, so the larger event must (as a consequence of the axioms) have probability at least as large as the probability of the smaller event.

The general point is that a probability can only be understood as part of a larger joint distribution; see the second-to-last paragraph of the boxer/wrestler article. I think that confusion on this point has led to lots of general confusion about probability and its applications.

Beyond that, seeing this completely avoidable slip-up from Bloomberg gives us more respect for the careful analytics teams at other news outlets such as the Economist and Fivethirtyeight, both of which are far from perfect, but at least we’re all aware that it would not make sense to forecast a 100% probability of recession in this sort of uncertain situation.

P.S. See here for another example of a Bloomberg article with a major quantitative screw-up. In this case the perpetrator was not the Bloomberg in-house economics forecasting team, it was a Bloomberg Opinion columnist who is described as “a former editorial director of Harvard Business Review,” which at first kinda sounds like he’s an economist at the Harvard business school, but I guess what it really means is that he’s a journalist without strong quantitative skills.

Academia corner: New candidate for American Statistical Association’s Founders Award, Enduring Contribution Award from the American Political Science Association, and Edge Foundation just dropped

Bethan Staton and Chris Cook write:

A Cambridge university professor who copied parts of an undergraduate’s essays and published them as his own work will remain in his job, despite an investigation upholding a complaint that he had committed plagiarism. 

Dr William O’Reilly, an associate professor in early modern history, submitted a paper that was published in the Journal of Austrian-American History in 2018. However, large sections of the work had been copied from essays by one of his undergraduate students.

The decision to leave O’Reilly in post casts doubt on the internal disciplinary processes of Cambridge, which rely on academics judging their peers.

Dude’s not a statistician, but I think this alone should be enough to make him a strong candidate for the American Statistical Association’s Founders Award.

And, early modern history is not quite the same thing as political science, but the copying thing should definitely make him eligible for the Aaron Wildavsky Enduring Contribution Award from the American Political Science Association. Long after all our research has been forgotten, the robots of the 21st century will be able to sift through the internet archive and find this guy’s story.

Or . . . what about the Edge Foundation? Plagiarism isn’t quite the same thing as misrepresenting your data, but it’s close enough that I think this guy would have a shot at joining that elite club. I’ve heard they no longer give out flights to private Caribbean islands, but I’m sure there are some lesser perks available.

According to the news article:

Documents seen by the Financial Times, including two essays submitted by the third-year student, show nearly half of the pages of O’Reilly’s published article — entitled “Fredrick Jackson Turner’s Frontier Thesis, Orientalism, and the Austrian Militärgrenze” — had been plagiarised.

Jeez, some people are so picky! Only half the pages were plagiarized, right? Or maybe not? Maybe this prof did a “Quentin Rowan” and constructed his entire article based on unacknowledged copying from other sources. As Rowan said:

It felt very much like putting an elaborate puzzle together. Every new passage added has its own peculiar set of edges that had to find a way in.

I guess that’s how it felt when they were making maps of the Habsburg empire.

On the plus side, reading about this story motivated me to take a look at the Journal of Austrian-American History, and there I found this cool article by Thomas Riegler, “The Spy Story Behind The Third Man.” That’s one of my favorite movies! I don’t know how watchable it would be to a modern audience—the story might seem a bit too simplistic—but I loved it.

P.S. I laugh but only because that’s more pleasant than crying. Just to be clear: the upsetting thing is not that some sleazeball managed to climb halfway up the greasy pole of academia by cheating. Lots of students cheat, some of these students become professors, etc. The upsetting thing is that the organization closed ranks to defend him. We’ve seen this sort of thing before, over and over—for example, Columbia never seemed to make any effort whatsoever to track down whoever was faking its U.S. News numbers—, so this behavior by Cambridge University doesn’t surprise me, but it still makes me sad. I’m guessing it’s some combination of (a) the perp is plugged in, the people who make the decisions are his personal friends, (b) a decision that the negative publicity for letting this guy stay on at his job is not as bad as the negative publicity for firing him.

Can you imagine what it would be like to work in the same department as this guy?? Fun conversations at the water cooler, I guess. “Whassup with the Austrian Militärgrenze, dude?”

Meanwhile . . .

There are people who actually do their own research, and they’re probably good teachers too, but they didn’t get that Cambridge job. It’s hard to compete with an academic cheater, if the institution he’s working for seems to act as if cheating is just fine, and if professional societies such as the American Statistical Association and the American Political Science Association don’t seem to care either.

Why aren’t there more fake reviews on yelp etc?

Bert Gunter writes:

This article in today’s NYTimes is a hoot, and might be grist for the lighter side of your blog … or maybe the heavier if you want to get into statistical fake detection, which is a big deal these days I guess.

The news article in question is called, “Five Stars, Zero Clue: Fighting the ‘Scourge’ of Fake Online Reviews,” subtitled, “Third parties pay writers for posts praising or panning hotels, restaurants and other places they never visited. How review sites like Yelp and Tripadvisor are trying to stop the flood.”

I agree with Gunter; it’s a fun article. Here’s my question: why aren’t there more fake reviews? Sure, lots of people are honest and would not cheat, but what I don’t understand is why the cheaters don’t cheat more. For example, suppose some crappy restaurant goes to the trouble of posting two fake five-star reviews. Why not go whole hog and post 100? Or would that be too easy to detect?

Or maybe we’ve reached an equilibrium. Right now if you’re looking for a place to eat, you can look at the reviews on google/yelp/tripadvisor/etc, and . . . they don’t give you zero information, but they provide a very weak signal. Not necessarily from cheating, just that tastes differ. But cheating muddies the waters enough that it adds one more reason you can’t use these reviews for much. So maybe the answer to the question, Why don’t they cheat more?, is that not much is to be gained by it.