Postdoc at Washington State University on law-enforcement statistics

This looks potentially important:

The Center for Interdisciplinary Statistical Education and Research (CISER) at Washington State University (WSU) is excited to announce that it has an opening for a Post-Doctoral Research Associate (statistical scientist) supporting a new state-wide public data project focused on law enforcement. The successful candidate will be part of a team of researchers whose mission is to modernize public safety data collection through standardization, automation, and evaluation. The project will actively involve law enforcement agencies, state and local policymakers, researchers, and the public in data exploration and discovery. This effort will be accomplished in part by offering education and training opportunities fostering community-focused policing and collaborative learning sessions. The statistical scientist in this role will develop comprehensive educational materials, workshops, online courses, and training manuals designed to equip and empower law enforcement agencies, state and local policymakers, researchers, and the public with data and statistical literacy skills that will enable them to maximize the utility of the data project.

Data, education, and policy. Interesting.

Storytelling and Scientific Understanding (my talk with Thomas Basbøll at Johns Hopkins on 26 Apr)

Storytelling and Scientific Understanding

Andrew Gelman and Thomas Basbøll

Storytelling is central to science, not just as a tool for broadcasting scientific findings to the outside world, but also as a way that we as scientists understand and evaluate theories. We argue that, for this purpose, a story should be anomalous and immutable; that is, it should be surprising, representing some aspect of reality that is not well explained by existing models of the world, and have details that stand up to scrutiny.

We consider how this idea illuminates some famous stories in social science involving soldiers in the Alps, Chinese boatmen, and trench warfare, and we show how it helps answer literary puzzles such as why Dickens had all those coincidences, why authors are often so surprised by what their characters come up with, and why the best alternative history stories have the feature that, in these stories, our “real world” ends up as the deeper truth. We also discuss connections to chatbots and human reasoning, stylized facts and puzzles in science, and the millionth digit of pi.

At the center of our framework is a paradox: learning from anomalies seems to contradict usual principles of science and statistics, where we seek representative or unbiased samples. We resolve this paradox by placing learning-within-stories into a hypothetico-deductive (Popperian) framework, in which storytelling is a form of exploration of the implications of a hypothesis. This has direct implications for our work as a statistician and a writing coach.

Bad stuff going down at the American Sociological Association

I knew the Association for Psychological Science, the American Psychological Association, the American Political Science Association, the American Statistical Association, and the National Academy of Sciences had problems. It turns out the American Sociological Association does some bad things too.

Philip Cohen has the story. It starts back in 2019, when the American Sociological Association, along with “many other paywall-dependent academic societies” (in Cohen’s words) sent an open letter to the president to oppose open science. Here’s Cohen:

At the time, there was a rumor that OSTP [the U.S. Office of Science and Technology Policy] would require agencies to make public the results of research funded by the federal government without a 12-month delay — the cherished “embargo” that allowed these associations to profit from delaying access to public knowledge . . .

They wrote: “We are writing to express our concerns about a possible change in federal policies that could significantly threaten a vibrant American scientific enterprise.” That is, by requiring free access to research, OSTP would threaten the “financial stability that enables us to support peer review that ensures the quality and integrity of the research enterprise.” If ASA lost their journal subscription profits, in other words, American science would die. “To take action to shorten the 12-month embargo… risks the continued international leadership for the U.S. scientific enterprise.”

Uh huh. I agree with Cohen that this is some combination of ridiculous and offensive. He continues:

Despite a petition signed by many ASA members, and a resolution from its own Committee on Publications “to express opposition to the decision by the ASA to sign the December 18, 2019 letter” — which the ASA leadership never even publicly acknowledged — ASA has not uttered a word to alter its anachronistic and unpopular position.

It’s starting to make me wonder if academic cartels sometimes act like . . . cartels?

Just to be clear, this does not seem to be a problem with academic sociology as a profession. As Cohen notes, the ASA’s own Committee on Publications opposed the ASA’s horrible recommendation to keep science closed.

Putting it all into perspective

We live in a world where political leaders start wars, companies and governments dump toxic waste, church leaders cover up child abuse, etc. In comparison, universities and academic societies faking statistics, rewarding plagiarism and other scientific misconduct, restricting data, and otherwise mucking up the process of scholarly inquiry . . . that barely registers on the scale of institutionalized evil.

So what is it that’s so irritating about academic institutions behaving badly? I can think of a few things:

1. I work in academia so I’m made aware of these issues and feel some bit of collective responsibility for them.

2. Academia is more open than much of business, government, and organized religion, so it’s easier for us to see the problems.

3. So much of the enabling of cheating in academia just seems so pointless. It’s not cool when companies pollute, but, hey, you can see the reason$ they’ll want to do so. But what does the American Sociological Association get out of fighting against open science, what does the University of California get out of tolerating research misconduct, what do the American Statistical Association and the American Political Science Association get out of giving awards to plagiarists? Nothing. That’s what’s so damn pitiful.

When Lysenko did his part to destroy Soviet agriculture, at least he personally got something out of it. These American Sociological Association etc. dudes, they get nothing.

It’s really pitiful, when you think about it. These people aren’t evil, they’re pathetic.

The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled.

Dorothy Bishop has the story about “a chemistry lab in CNRS-Université Sorbonne Paris Nord”:

More than 20 scientific articles from the lab of one principal investigator have been shown to contain recycled and doctored graphs and electron microscopy images. That is, results from different experiments that should have distinctive results are illustrated by identical figures, with changes made to the axis legends by copying and pasting numbers on top of previous numbers. . . . the problematic data are well-documented in a number of PubPeer comments on the articles (see links in Appendix 1 of this document).

The response by CNRS [Centre National de la Recherche Scientifique] to this case . . . was to request correction rather than retraction of what were described as “shortcomings and errors”, to accept the scientist’s account that there was no intentionality, despite clear evidence of a remarkable amount of manipulation and reuse of figures; a disciplinary sanction of exclusion from duties was imposed for just one month.

I’m not surprised. The sorts of people who will cheat on their research are likely to be the same sorts of people who will instigate lawsuits, start media campaigns, and attack in other ways. These are researchers who’ve already shown a lack of scruple and a willingness to risk their careers; in short, they’re loose cannons, scary people, so it can seem like the safest strategy to not try to upset them too much, not trap them into a corner where they’ll fight like trapped rats. I’m not speaking specifically of this CNRS researcher—I know nothing of the facts of this case beyond what’s reported in Bishop’s post—I’m just speaking to the mindset of the academic administrators who would just like the problem to go away so they can get on with their regular jobs.

But Bishop and her colleagues were annoyed. If even blatant examples of scientific misconduct cannot be handled straightforwardly, what does this say about the academic and scientific process more generally? Is science just a form of social media, where people can make any sort of claim and evidence doesn’t matter?

They write:

So what should happen when fraud is suspected? We propose that there should be a prompt investigation, with all results transparently reported. Where there are serious errors in the scientific record, then the research articles should immediately be retracted, any research funding used for fraudulent research should be returned to the funder, and the person responsible for the fraud should not be allowed to run a research lab or supervise students. The whistleblower should be protected from repercussions.

In practice, this seldom happens. Instead, we typically see, as in this case, prolonged and secret investigations by institutions, journals and/or funders. There is a strong bias to minimize the severity of malpractice, and to recommend that published work be “corrected” rather than retracted.

Bishop and her colleagues continue:

One can see why this happens. First, all of those concerned are reluctant to believe that researchers are dishonest, and are more willing to assume that the concerns have been exaggerated. It is easy to dismiss whistleblowers as deluded, overzealous or jealous of another’s success. Second, there are concerns about reputational risk to an institution if accounts of fraudulent research are publicised. And third, there is a genuine risk of litigation from those who are accused of data manipulation. So in practice, research misconduct tends to be played down.

But:

This failure to act effectively has serious consequences:

1. It gives credibility to fictitious results, slowing down the progress of science by encouraging others to pursue false leads. . . . [and] erroneous data pollutes the databases on which we depend.

2. Where the research has potential for clinical or commercial application, there can be direct damage to patients or businesses.

3. It allows those who are prepared to cheat to compete with other scientists to gain positions of influence, and so perpetuate further misconduct, while damaging the prospects of honest scientists who obtain less striking results.

4. It is particularly destructive when data manipulation involves the Principal Investigator of a lab. . . . CNRS has a mission to support research training: it is hard to see how this can be achieved if trainees are placed in a lab where misconduct occurs.

5. It wastes public money from research grants.

6. It damages public trust in science and trust between scientists.

7. It damages the reputation of the institutions, funders, journals and publishers associated with the fraudulent work.

8. Whistleblowers, who should be praised by their institution for doing the right thing, are often made to feel that they are somehow letting the side down by drawing attention to something unpleasant. . . .

What happened next?

It’s the usual bad stuff. They received a series of stuffy bureaucratic responses, none of which addressed any of items 1 through 8 above, let alone the problem of the data, which appear to have been blatantly faked. Just disgusting.

But I’m not surprised. We’ve seen it many times before:

– The University of California’s unresponsive response when informed of research misconduct by their star sleep expert.

– The American Political Science Association refusing to retract an award given to an author for a book with plagiarized material, or even to retroactively have the award shared with the people whose material was copied without acknowledgment.

– The London Times never acknowledging the blatant and repeated plagiarism by its celebrity chess columnist.

– The American Statistical Association refusing to retract an award given to a professor who plagiarized multiple times, including from wikipedia (in an amusing case where he created negative value by introducing an error into the material he’d copied, so damn lazy that he couldn’t even be bothered to proofread his pasted material).

– Cornell University . . . ok they finally canned the pizzagate dude, but only after emitting some platitudes. Kind of amazing that they actually moved on that one.

– The Association for Psychological Science: this one’s personal for me, as they ran an article that flat-out lied about me and then refused to correct it just because, hey, they didn’t want to.

– Lots and lots of examples of people finding errors or fraud in published papers and journals refusing to run retractions or corrections or even to publish letters pointing out what went wrong.

Anyway, this is one more story.

What gets my goat

What really annoys me in these situations is how the institutions show loyalty to the people who did research misconduct. When researcher X works at or publishes with institution Y, and it turns out that X did something wrong, why does Y so often try to bury the problem and attack the messenger? Y should be mad at X; after all, it’s X who has leveraged the reputation of Y for his personal gain. I’d think that the leaders of Y would be really angry at X, even angrier than people from the outside. But it doesn’t happen that way. The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled. I’m sure that Dan Davies would have something to say about all this.

What to trust in the newspaper? Example of “The Simple Nudge That Raised Median Donations by 80%”

Greg Mayer points to this news article, “The Simple Nudge That Raised Median Donations by 80%,” which states:

A start-up used the Hebrew word “chai” and its numerical match, 18, to bump up giving amounts. . . . It’s a common donation amount among Jews — $18, $180, $1,800 or even $36 and other multiples.

So Daffy lowered its minimum gift to $18 and then went further, prompting any donor giving to any Jewish charity to bump gifts up by some related amount. Within a year, median gifts had risen to $180 from $100. . . .

I see several warning signs here:

1. “Within a year, median gifts had risen to $180 from $100.” This is a before/after change, not a direct comparison of outcomes.

2. No report, just a quoted number which could easily have been made up. Yes, the numbers in a report can be fabricated too, but that takes more work and is more risk. Making up numbers when talking with a reporter, that’s easy.

3. The people who report the number are motivated to claim success; the reporter is motivated to report a success. The article is filled with promotion for this company: it’s a short piece that mentions “Daffy” six times, including this bit, which reads like a straight-up ad:

If you have children, grandchildren, nieces or nephews, there’s another possibility. Daffy has a family plan that allows children to prompt their adult relatives to support a cause the children choose. Why not put the app on their iPhones or iPads so they can make suggestions and let, for example, a 12-year-old make $12 donations to 12 nonprofits each year?

Why not, indeed? Even better, why not have them make their donations directly to Daffy and cut out the middleman?? Look, I’m not saying that the people behind Daffy are doing anything wrong; it’s just that this is public relations, not journalism.

4. Use of the word “nudge” in the headline is consistent with business-press hype. Recall that “nudge” is a subfield whose proponents are well connected in the media and routinely make exaggerated claims.

So, yeah, an observational comparison with no documentation, in an article that’s more like an advertisement, that’s kinda sus. Not that the claim is definitely wrong, there’s just no good reason for us to take it seriously.

In some cases academic misconduct doesn’t deserve a public apology

This is Jessica. As many of you probably saw, Claudine Gay resigned as president of Harvard this week. Her tenure as president is apparently the shortest on record, and accusations of plagiarism involving some of her published papers and her dissertation seem to have been a major contributor to this decision, coming after the initial backlash against the response Gay gave, alongside MIT and Penn presidents Kornbluth and Magill, to questions from Republican congresswoman Stefanik about blatantly anti-semitic remarks on their campuses in the wake of Oct. 7.

The plagiarism counts are embarrassing for Gay and for Harvard, for sure, as were the very legalistic reactions of all three presidents when asked about anti-semitism on their campuses. In terms of plagiarism as a phenomenon that crops up in academia, I agree with Andrew that it tells us something about the author’s lack of ability or effort to take the time to understand the material. I suspect it happens a lot under the radar, and I see it as a professor (more often now with ChatGPT in the mix, and no, it does not always lead to explicit punishment, to comment on what some are saying online about double standards for faculty and students). What I don’t understand is how in Gay’s case this is misconduct at the level that warrants a number of headline stories in major mainstream news media and the resignation of an administrator who has put aside her research career anyway.

On the one hand, I can see how it is temptingly easy to rationalize why the president of what is probably the most revered university on earth cannot be associated with any academic misconduct without somehow bringing shame on the institution. “She’s the president of Harvard, how can it not be shocking?!” is one narrative, I suppose. But this kind of response to this situation is exactly what bothers me in the wake of her resignation. I will try to explain.

Regarding the specifics, I saw a few of the plagiarized passages early on, and I didn’t see much reason to invest my time in digging further, if this was the best that could be produced by those who were obviously upset about it (I agree with Phil here that they seem like a “weak” form of plagiarism). What makes me uncomfortable about this situation is how so many people, under the guise of being “objective,” did feel the need to invest their time in the name of establishing some kind of truth in the situation. This is the moment of decision that I wish to call attention to. It’s as though in the name of being “neutral” and “evidence based” we are absolved from having to consider why we feel so compelled in certain cases to get to the bottom of it, but not so much in other cases.

It’s the same thing that makes so much research bad: the inability to break frame, to turn on the premise rather than the minor details. To ask, how did we get here? Why are we all taking for granted that this is the thing to be concerned with? 

Situations like what happened to Gay bring a strong sense of deja vu for me. I’m not sure how much my personal reaction is related to being female in a still largely male-dominated field myself, but I suspect it contributes. There’s a scenario that plays out from time to time where someone who is not in the majority in some academic enterprise is found to have messed up. At first glance, it seems fairly minor, somewhat relatable at least, no worse than what many others have done. But, somehow, it can’t be forgotten in some cases. Everyone suddenly exerts effort they would normally have trouble producing for a situation that doesn’t concern them that much personally to pore over the details with a fine-tooth comb to establish that there really was some fatal flaw here. The discussion goes on and becomes hard to shut out, because there is always someone else who is somehow personally offended by it. And the more it gets discussed, the more it seems overwhelmingly like a real thing to be dealt with, to be decided. It becomes an example for the sake of being principled. Once this palpable sense that ‘this is important,’ ‘this is a message about our principles,’ sets in, then the details cannot be overlooked. How else can we be sure we are being rational and objective? We have to treat it like evidence and bring to bear everything we know about scrutinizing evidence.

What is hard for me to get over is that these stories that stick around and capture so much attention are far more often stories about some member of the racial or gender non-majority who ended up in a high place. It’s as if the resentment that a person from the outside has gotten in sets in without the resenter even becoming aware of it, and suddenly a situation that seems like it should have been cooperative gets much more complicated. This is not to say that people who are in the majority in a field don’t get called out or targeted sometimes; they do. Just that there’s a certain dynamic that seems to set in more readily when someone perceived as not belonging to begin with messes up. As Jamelle Watson-Daniels writes on X/Twitter of the Gay situation: “the legacy and tradition of orchestrated attacks against the credibility of Black scholars all in the name of hunting down and exposing them as… the ultimate imposters.” This is the undertone I’m talking about here.

I’ve been a professor for about 10 years, and in that time I’ve repeatedly seen this sort of hyper-attention turned on women and/or others in the non-majority who violated some minor code. In many instances, it creates a situation that divides those who are confused by the level of detail-orientedness relative to the crime from those who can’t see any other way than to make the incident into an example. Gay is just the most recent reminder.

What makes this challenging for me to write about personally is that I am a big believer in public critique, and admitting one’s mistakes. I have advocated for both on this blog. To take an example that comes up from time to time, I don’t think that because of uneven power dynamics, public critique of papers with lead student authors should be shut down, or that we owe authors extensive private communications before we openly criticize. That goes against the sort of open discussion of research flaws that we are already often incentivized to avoid. For the same reason, I don’t think that critiques made by people with ulterior motives should be dismissed. I think there were undoubtedly ulterior motives here, and I am not arguing that the information about accounts of plagiarism here should not have been shared at all. 

I also think making decisions driven by social values (which often comes up under the guise of DEI) is very complex. At least in academic computer science, we seem to be experiencing a moment of heightened sensitivity to what is perceived as “moral” and “ethical”, and often these things are defined very simplistically and tolerance for disagreement is low.

And I also think that there are situations where a transgression may seem minor but it is valuable to mind all the details and use it as an example! I was surprised, for example, at how little interest there seemed to be in the recent Nature Human Behaviour paper which claimed to present all confirmatory analyses but couldn’t produce the evidence that the premise of the paper suggests should be readily available. This seemed to me like an important teachable moment given what the paper was advocating to begin with.

So anyway, lots of reasons why this is hard to write about, and lots of fodder for calling me a hypocrite if you want. But I’m writing this post because the plagiarism is clearly not the complete story here. I don’t know the full details of the Gay investigation (and admit I haven’t spent too much time researching this: I’ve seen a bunch of the plagiarism examples, but I don’t have a lot of context on her entire career). So it’s possible I’m wrong and she did some things that were truly more awful than the average Harvard president. But I haven’t heard about them yet. And either way my point still stands: there are situations with similar dynamics to this where my dedication to scientific integrity and public critique and getting to the bottom of technical details do not disappear, but are put on the backburner to question a bigger power dynamic that seems off.

And so, while normally I think everyone caught doing academic misconduct should acknowledge it, for the reasons above, at least at the moment, it doesn’t bother me that Gay’s resignation letter doesn’t mention the plagiarism. I think not acknowledging it was the right thing to do.

The continuing challenge of poststratification when we don’t have full joint data on the population.

Torleif Halkjelsvik at the Norwegian Institute of Public Health writes:

Norway has very good register data (education/income/health/drugs/welfare/etc.) but it is difficult to obtain complete tables at the population level. It is however easy to get independent tables from different registries (e.g., age by gender by education as one data source and gender by age by welfare benefits as another). What if I first run a multilevel model to regularize predictions for a vast set of variables, but in the second step, instead of a full table, use a raking approach based on several independent post-stratification tables? Would that be a valid approach? And have you seen examples of this?

My reply: I think the right way to frame this is as a poststratification problem where you don’t have the full poststratification table, you only have some margins. The raking idea you propose could work, but to me it seems awkward in that it’s mixing different parts of the problem together. Instead I’d recommend first imputing a full poststrat table and then using this to do your poststratification. But then the question is how to do this. One approach is iterative proportional fitting (Deming and Stephan, 1940). I don’t know any clean examples of this sort of thing in the recent literature, but there might be something out there.
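To make the iterative proportional fitting idea concrete, here is a minimal sketch in Python of the Deming-Stephan algorithm, imputing a joint table from two known margins. The variable names and the toy counts are invented for illustration; they are not from Halkjelsvik’s registries, and the sketch assumes the margins are consistent and no row or column of the seed table is entirely zero.

```python
import numpy as np

def ipf(seed, row_margin, col_margin, tol=1e-10, max_iter=1000):
    """Rescale `seed` so its row and column sums match the target margins.
    Assumes the margins share a total and no seed row/column is all zero."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        # Scale rows to match the row margin, then columns to match the column margin.
        table *= row_margin[:, None] / table.sum(axis=1, keepdims=True)
        table *= col_margin[None, :] / table.sum(axis=0, keepdims=True)
        if (np.abs(table.sum(axis=1) - row_margin).max() < tol
                and np.abs(table.sum(axis=0) - col_margin).max() < tol):
            break
    return table

# Toy example: 3 education levels by 2 benefit statuses (counts are invented).
seed = np.ones((3, 2))                       # uniform starting table
row_margin = np.array([500., 300., 200.])    # e.g., counts by education
col_margin = np.array([850., 150.])          # e.g., counts by benefit status
print(ipf(seed, row_margin, col_margin))
```

With a uniform seed and only two margins this just recovers the independence table; the seed matters once you have extra information about the joint structure, which is where the discussion below goes.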

Halkjelsvik responded:

It is an interesting idea to impute a full poststrat table, but I wonder whether it is actually better than directly calculating weights using the proportions in the data itself. Cells that should be empty in the population (e.g., women, 80-90 years old, high education, sativa spray prescription) may not be empty in the imputed table when using iterative proportional fitting (IPF), and these “extreme” cells may have quite high or low predicted values. By using the data itself, such cells will be empty, and they will not “steal” any of the marginal proportions when using IPF. This is of course a problem in itself if the data is limited (if there are empty cells in the data that are not empty in the population).

Me: If you have information that certain cells are empty or nearly so, that’s information that you should include in the poststrat table. I think the IPF approach will be similar to the weighting; it is just more model-based. So if you think the IPF will give some wrong answers, that suggests you have additional information. I recommend you try to write down all the additional information you have and use all of it in constructing the poststratification table. This should allow you to do better than with any procedure that does not use this info.

Halkjelsvik:

After playing with a few scenarios (on a piece of paper, no simulation) I see that my suggested raking/weighting approach (which also would involve iterative proportional fitting) directly on the sample data is not a good idea in contexts where MRP is most relevant. That is, if the sample cell sizes are small and regularization matters, then the subgroups of interest (e.g. geographical regions) will likely have too little data on rare demographic combinations. The approach you suggested (full population table imputation based on margins) appears more reasonable, and the addition of “extra information” is obviously a good idea. But how about a hybrid: Instead of manually accounting for “extra information” (e.g., non-existing demographic combinations) this extra information can be derived directly from the proportions of the sample itself (across subgroups of interest) and can be used as “seed” values (i.e., before accounting for margins at the local level). Using information from the sample to create the initial (seed) values for the IPF may be a good way to avoid imputing positive values in cells that are structural zeros, given that the sample is sufficiently large to avoid too many “sample zeros” that are not true “structural zeros”.

So the following could be an approach for my problem?

1. Obtain regularized predictions from sample.

2. Produce a full poststrat seed table directly from “global” cell values in the sample (or from other available “global” data, e.g. if available only at the national level). That is, regions start with identical seed structures.

3. Adjust the poststrat table by iterative proportional fitting based on local margins (but I have read that there may be convergence problems when there are many zeros in seed cells).

Me: I’m not sure! I really want to have a fully worked-out example, a case study of MRP where the population joint distribution (the poststratification table) is not known and it needs to be estimated from data. We’re always so sloppy in those settings. I’d like to do it with a full Bayesian model in Stan and then compare various approximations.
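For what it’s worth, here is a rough sketch of how Halkjelsvik’s steps 2 and 3 plus the final poststratification step could fit together, reusing the ipf function from the sketch above. All of the numbers (sample proportions, regional margins, cell predictions) are invented placeholders; this shows the shape of the computation, not the fully worked-out Bayesian example I’d want.

```python
import numpy as np

# cell_pred_g[j, k]: regularized prediction from step 1 for cell (j, k) in one
# region g; seed[j, k]: the sample's global proportion for that cell. Both are
# invented numbers on the same 3 x 2 grid as the toy example above.
seed = np.array([[0.30, 0.20],
                 [0.25, 0.05],
                 [0.20, 0.00]])            # bottom-right cell: a structural zero in the sample
cell_pred_g = np.array([[0.12, 0.30],
                        [0.10, 0.28],
                        [0.08, 0.25]])     # e.g., predicted prevalence of some outcome

# Steps 2-3: rake the global seed to this region's known margins.
joint_g = ipf(seed,
              row_margin=np.array([400., 350., 250.]),
              col_margin=np.array([900., 100.]))

# Poststratification: population-weighted average of the cell-level predictions.
weights = joint_g / joint_g.sum()
regional_estimate = (weights * cell_pred_g).sum()
print(regional_estimate)
```

Because IPF only rescales the seed multiplicatively, cells that start at zero stay at zero, which is the property Halkjelsvik wants for structural zeros; the flip side is that a sample zero that isn’t truly structural gets locked in too.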

In judo, before you learn the cool moves, you first have to learn how to fall. Maybe we should be training researchers the same way: first learn how things can go wrong, and only when you get that lesson down do you learn the fancy stuff.

I want to follow up on a suggestion from a few years ago:

In judo, before you learn the cool moves, you first have to learn how to fall. Maybe we should be training researchers, journalists, and public relations professionals the same way. First learn about Judith Miller and Thomas Friedman, and only when you get that lesson down do you get to learn about Woodward and Bernstein.

Martha in comments modified my idea:

Yes! But I’m not convinced that “First learn about Judith Miller and Thomas Friedman, and only when you get that lesson down do you get to learn about Woodward and Bernstein” or otherwise learning about people is the way to go. What is needed is teaching that involves lots of critiquing (especially by other students), with the teacher providing guidance (e.g., criticize the work or the action, not the person; no name calling; etc.) so students learn to give and accept criticism as a normal part of learning and working.

I responded:

Yes, learning in school involves lots of failure, getting stuck on homeworks, getting the wrong answer on tests, or (in grad school) having your advisor gently tone down some of your wild research ideas. Or, in journalism school, I assume that students get lots of practice in calling people and getting hung up on.

So, yes, students get the experience of failure over and over. But the message we send, I think, is that once you’re a professional it’s just a series of successes.

Another commenter pointed to this inspiring story from psychology researchers Brian Nosek, Jeffrey Spies, and Matt Motyl, who ran an experiment, thought they had an exciting result, but, just to be sure, they tried a replication and found no effect. This is a great example of how to work and explore as a scientist.

Background

Scientific research is all about discovery of the unexpected: to do research, you need to be open to new possibilities, to design experiments to force anomalies, and to learn from them. The sweet spot for any researcher is at Cantor’s corner.

Buuuut . . . researchers are also notorious for being stubborn. In particular, here’s a pattern we see a lot:
– Research team publishes surprising result A based on some “p less than .05” empirical results.
– This publication gets positive attention and the researchers and others in their subfield follow up with open-ended “conceptual replications”: related studies that also attain the “p less than .05” threshold.
– Given the surprising nature of result A, it’s unsurprising that other researchers are skeptical of A. The more theoretically-minded skeptics, or agnostics, demonstrate statistical reasons why these seemingly statistically-significant results can’t be trusted. The more empirically-minded skeptics, or agnostics, run preregistered replication studies, which fail to replicate the original claim.
– At this point, the original researchers do not apply the time-reversal heuristic and conclude that their original study was flawed (forking paths and all that). Instead they double down, insist their original findings are correct, and they come up with lots of little explanations for why the replications aren’t relevant to evaluating their original claims. And they typically just ignore or brush aside the statistical reasons why their original study was too noisy to ever show what they thought they were finding.

I’ve conjectured that one reason scientists often handle criticism in such scientifically-unproductive ways is . . . the peer-review process, which goes like this:

As scientists, we put a lot of effort into writing articles, typically with collaborators: we work hard on each article, try to get everything right, then we submit to a journal.

What happens next? Sometimes the article is rejected outright, but, if not, we’ll get back some review reports which can have some sharp criticisms: What about X? Have you considered Y? Could Z be biasing your results? Did you consider papers U, V, and W?

The next step is to respond to the review reports, and typically this takes the form of: We considered X, and the result remained significant. Or, We added Y to the model, and the result was in the same direction, marginally significant, so the claim still holds. Or, We adjusted for Z and everything changed . . . hmmmm . . . we then also thought about factors P, Q, and R. After including these, as well as Z, our finding still holds. And so on.

The point is: each of the remarks from the reviewers is potentially a sign that our paper is completely wrong, that everything we thought we found is just an artifact of the analysis, that maybe the effect even goes in the opposite direction! But that’s typically not how we take these remarks. Instead, almost invariably, we think of the reviewers’ comments as a set of hoops to jump through: We need to address all the criticisms in order to get the paper published. We think of the reviewers as our opponents, not our allies (except in the case of those reports that only make mild suggestions that don’t threaten our hypotheses).

When I think of the hundreds of papers I’ve published and the, I dunno, thousand or so review reports I’ve had to address in writing revisions, how often have I read a report and said, Hey, I was all wrong? Not very often. Never, maybe?

Where we’re at now

As scientists, we see serious criticism on a regular basis, and we’re trained to deal with it in a certain way: to respond while making minimal, ideally zero, changes to our scientific claims.

That’s what we do for a living; that’s what we’re trained to do. We think of every critical review report as a pain in the ass that we have to deal with, not as a potential sign that we screwed up.

So, given that training, it’s perhaps little surprise that when our work is scrutinized in post-publication review, we have the same attitude: the expectation that the critic is nitpicking, that we don’t have to change our fundamental claims at all, that if necessary we can do a few supplemental analyses and demonstrate the robustness of our findings to those carping critics.

How to get to a better place?

How can this situation be improved? I’m not sure. In some ways, things are getting better: the replication crisis has happened, and students and practitioners are generally aware that high-profile, well-accepted findings often do not replicate. In other ways, though, I fear we’re headed in the wrong direction: students are now expected to publish peer-reviewed papers throughout grad school, so right away they’re getting on the minimal-responses-to-criticism treadmill.

It’s not clear to me how to best teach people how to fall before they learn fancy judo moves in science.

Explaining that line, “Bayesians moving from defense to offense”

Earlier today we posted something on our recent paper with Erik van Zwet et al., “A New Look at P Values for Randomized Clinical Trials.” The post had the provocative title, “Bayesians moving from defense to offense,” and indeed that title provoked some people!

The discussion thread here at the blog was reasonable enough, but someone pointed me to a thread at Hacker News where there was more confusion, so I thought I’d clarify one or two points.

First, yes, as one commenter puts it, “I don’t know when Bayesians have ever been on defense. They’ve always been on offense.” Indeed, we published the first edition of Bayesian Data Analysis back in 1995, and there was nothing defensive about our tone! We were demonstrating Bayesian solutions to a large set of statistics problems, with no apology.

As a Bayesian, I’m kinda moderate—indeed, Yuling and I published an entire, non-ironic paper on holes in Bayesian statistics, and there’s also a post a few years ago called What’s wrong with Bayes, where I wrote, “Bayesian inference can lead us astray, and we’re better statisticians if we realize that,” and “the problem with Bayes is the Bayesians. It’s the whole religion thing, the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian, the people who refuse to check their models because subjectivity, the people who try to talk you into using a ‘reference prior’ because objectivity. Bayesian inference is a tool. It solves some problems but not all, and I’m exhausted by the ideology of the Bayes-evangelists.”

So, yeah, “Bayesians on the offensive” is not new, and I don’t even always like it. Non-Bayesians have been pretty aggressive too over the years, and not always in a reasonable way; see my discussion with Christian Robert from a few years ago and our followup. As we wrote, “The missionary zeal of many Bayesians of old has been matched, in the other direction, by an attitude among some theoreticians that Bayesian methods were absurd—not merely misguided but obviously wrong in principle.”

Overall, I think there’s much more acceptance of Bayesian methods within statistics than in past decades, in part from the many practical successes of Bayesian inference and in part because recent successes of machine learning have given users and developers of methods more understanding and acceptance of regularization (also known as partial pooling or shrinkage, and central to Bayesian methods) and, conversely, have given Bayesians more understanding and acceptance of regularization methods that are not fully Bayesian.

OK, so what was I talking about?

So . . . Given all the above, what did I mean by my crack about “Bayesians moving from defense to offense”? I wasn’t talking about Bayesians being positive about Bayesian statistics in general; rather, I was talking about the specific issue of informative priors.

Here’s how we used to talk, circa 1995: “Bayesian inference is a useful practical tool. Sure, you need to assign a prior distribution, but don’t worry about it: the prior can be noninformative, or in a hierarchical model the hyperparameters can be estimated from data. The most important ‘prior information’ to use is structural, not numeric.”

Here’s how we talk now: “Bayesian inference is a useful practical tool, in part because it allows us to incorporate real prior information. There’s prior information all around that we can use in order to make better inferences.”

My “moving from defense to offense” line was all about the changes in how we think about prior information. Instead of being concerned about prior sensitivity, we respect prior sensitivity and, when the prior makes a difference, we want to use good prior information. This is exactly the same as in any statistical procedure: when there’s sensitivity to data (or, in general, to any input to the model), that’s where data quality is particularly relevant.

This does not stop you from using classical p-values

Regarding the specific paper we were discussing in yesterday’s post, let me emphasize that this work is very friendly to traditional/conventional/classical approaches.

As we say right there in the abstract, “we reinterpret the P value in terms of a reference population of studies that are, or could have been, in the Cochrane Database.”

So, in that paper, we’re not saying to get rid of the p-value. It’s a data summary, people are going to compute it, and people are going to report it. That’s fine! It’s also well known that p-values are commonly misinterpreted (as detailed for example by McShane and Gal, 2017).

Given that people will be reporting p-values, and given how often they are misinterpreted, even by professionals, we believe that it’s a useful contribution to research and practice to consider how to interpret them directly, under assumptions that are more realistic than the null hypothesis.

So if you’re coming into all of this as a skeptic of Bayesian methods, my message to you is not, “Chill out, the prior doesn’t really matter anyway,” but, rather, “Consider this other interpretation of a p-value, averaging over the estimated distribution of effect sizes in the Cochrane database instead of conditioning on the null hypothesis.” So you now have two interpretations of the p-value, conditional on different assumptions. You can use them both.
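To see what this kind of reinterpretation looks like in practice, here is a small simulation sketch in Python. The normal distribution of signal-to-noise ratios below is an arbitrary placeholder, not the Cochrane-based distribution estimated in the paper; the point is only the mechanics of conditioning on the p-value while averaging over a population of true effects rather than conditioning on the null.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
n = 1_000_000

# Hypothetical population of studies: true standardized effects (signal-to-noise
# ratios) drawn from a placeholder distribution, not the Cochrane-based one.
snr = rng.normal(0.0, 1.5, size=n)
z = snr + rng.standard_normal(n)          # observed z-statistic for each study
p = 2 * stats.norm.sf(np.abs(z))          # two-sided p-value against the null

# Condition on the p-value rather than on the null: among studies landing just
# under 0.05, how often does the estimate have the right sign, and by how much
# does it overstate the true effect?
sel = (p > 0.01) & (p < 0.05)
correct_sign = np.mean(np.sign(z[sel]) == np.sign(snr[sel]))
exaggeration = np.median(np.abs(z[sel] / snr[sel]))
print(f"P(correct sign | 0.01 < p < 0.05): {correct_sign:.2f}")
print(f"median exaggeration factor:        {exaggeration:.2f}")
```

Swap in an empirically estimated distribution of effect sizes and you get statements of the form in the paper: given a p-value in some range, here is the probability the estimate has the right sign and how much it tends to overstate the true effect.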

Suppose you realize that a paper on an important topic, with thousands of citations, is fatally flawed? Where should the correction be published?

OK, here’s a hypothetical scenario. You’re a researcher. You look carefully at one of the most-cited papers in an important subfield—perhaps the most influential paper published there in the past decade. It turns out that the paper is fatally flawed. Unfortunately, it seems very unlikely that the authors of the original paper will do an “Our bad, we retract.” You can write a short article detailing the flaws in that paper. But what do you do with your short article?

Here are some options:

– Publish it in a journal such as Econ Journal Watch or Sociological Science that specializes in criticism of published work. The trouble here is (a) there aren’t many such outlets, and (b) a publication there might be barely noticed.

– Publish it in the same journal that published the original article. The trouble here is that the journal that published the original article might defer to the authors of the original article, for example publishing your criticism as a letter with a reply by the original authors, which in theory could be fine but in practice could just be a way to muddy the waters. Or the journal might just flat-out refuse to publish your article at all, taking the position that they just want to publish original work (even if wrong) and not commentary.

– Publish it in a top journal in the field. That should be possible, given that you’re criticizing a very influential paper. But, again, top journals often don’t want to publish criticism, and there can also be a circling-the-wagons thing where they really don’t want to see criticism of influential work.

– Publish it on a preprint server and blog it. This can create some short-term stir, but I think that if the criticism’s not in the scholarly literature, it will fade, except in an extreme scenario such as Wansink, Ariely, or Tol where revelations of sloppy practice are so clear that a researcher’s entire corpus is called into question.

– Don’t publish it right away. Instead, do further research and publish a new paper on the topic with a new conclusion, incidentally shooting down that old paper that’s all wrong. Publish that new paper in a top journal. This could be the best option, but (a) it takes a lot of work, (b) it delays the revelation of the problems with the earlier paper, and (c) it shouldn’t be necessary: this implies higher standards for corrections than for original work.

P.S. Joshua in comments writes:

What about another option – contacting the authors of the paper with an elaboration of your critique, and offering to collaborate on a follow up. CC the journal that published the original paper.

My reply: Sure, I guess that’s worth a try. It’s generally not worked for me. In my experience, the original researchers and the journals that published the original papers are too committed to the result. One problem here is that I’ll typically hear about a paper in the first place because it got inappropriately uncritical publicity, and, when that happens, most authors just do not want to hear that they did something wrong, and they go to great efforts to avoid confronting the issue.

For example, I tried several times to contact the authors of the notorious early-childhood-intervention-in-Jamaica paper, using various channels including two different intermediaries whom I knew personally, and I never received any response. The authors of that paper did not seem to be interested in exploring what went on or doing better.

Indeed, I’ve had struggles with this sort of thing for decades. As a student, I collaborated with someone who had a paper in preparation for journal publication that had a fatal error. I told him about it and explained the error, and he refused to do anything about it: the paper was already accepted and he just wanted to take the W on his C.V. and move on. Also as a student, I found a problem in a published paper—not an error, exactly, but a bad analysis for which there was a clear direction for improvement. I contacted the authors by letter and phone, and they refused to share their data or to even consider that they might not have done the best analysis. The story for that one is here.

Oh, and here’s another story where I contacted a colleague at another institution who’d promoted work of which I was skeptical, but he didn’t want to engage in any serious way. And here’s another such story, where it was only possible to collaborate asynchronously, as it were. That last case was the best, because the data were available.

Other times I’ve contacted authors and had fruitful exchanges, and they wanted to figure out how to do better; an example is here. So it can happen.

Celebrity scientists and the McKinsey bagmen

Josh Marshall writes:

Trump doesn’t think of truth or lies the way you or I do. Most imperfect people, which is to say all of us, exist in a tension between what we believe is true and what is good for or pleasing to us. If we have strong character we hew closely to the former, both in what we say to others and what we say to ourselves. The key to understanding Trump is that it’s not that he hews toward the latter. It’s that the tension doesn’t exist. What he says is simply what works for him. Whether it’s true is irrelevant and I suspect isn’t even part of Trump’s internal dialog. It’s like asking an actor whether she really loved her husband like she claimed in her blockbuster movie or whether she was lying. It’s a nonsensical question. She was acting.

The analogy to the actor is a good one.

Regarding the general sort of attitude and behavior discussed here, though, I don’t think Trump stands out as much as Marshall implies. Even setting aside other politicians, who in the matter of lying often seem to differ from the former president more in degree than kind, I feel like I’ve seen the same sort of thing with researchers, which is one reason I think Clarke’s law (“Any sufficiently crappy research is indistinguishable from fraud”) is so often relevant.

When talking about researchers who don’t seem to care about saying the truth, I’m not just talking about various notorious flat-out data fakers. I’m also talking about researchers who just do unreplicable crap or who make claims in the titles and abstracts of their papers that aren’t supported by their data. We get lots of statements that are meaningless or flat-out false.

Does the truth matter to these people? I don’t know. I think they believe in some things they view as deeper truths: (a) their vague models of how the world works are correct, and (b) they are righteous people. Once you start there, all the false statements don’t matter, as they are all being done in the service of a larger truth.

I don’t think everyone acts this way—I have the impression that most people, as Marshall puts it, “exist in a tension between what we believe is true and what is good for or pleasing to us.” There’s just a big chunk of people—including many academic researchers, journalists, politicians, etc.—who don’t seem to feel that tension. As I’ve sometimes put it, they choose what to say or what to write based on the music, not the words. And they see the rest of us as “schoolmarms” or “Stasi”—pedants who get in the way of the Great Men of science. Not the same as Donald Trump by a longshot, but I see some similarities in that it’s kinda hard to pin them down when it comes to factual beliefs. It’s much more about whose-side-are-you-on.

Also incentives: it’s not so much that people lie because of incentives, as that incentives affect the tough calls they make, and incentives affect who succeeds in climbing the greasy pole of success.

“Has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?”

Kelsey Piper writes:

I’m writing about the replication crisis for Vox and I was wondering if you saw this blog post from one of the DARPA replication project participants, particularly the section that argues:

I frequently encounter the notion that after the replication crisis hit there was some sort of great improvement in the social sciences, that people wouldn’t even dream of publishing studies based on 23 undergraduates any more (I actually saw plenty of those), etc. Stuart Ritchie’s new book praises psychologists for developing “systematic ways to address” the flaws in their discipline. In reality there has been no discernible improvement.

Your blog post yesterday about scientists who don’t care about doing science struck a similar tone, and I was curious: do you think we’re in a better place w/r/t the replication crisis than we were ten years ago? Or has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?

My discussion of that above-quoted blog post appeared a couple years ago. I agreed with some of that post and disagreed with other parts.

Regarding Piper’s question, “has the replication crisis been a long series of conversations that haven’t influenced publishing and research practices much if at all?,” I don’t think the influence has been zero! For one thing, this crisis has influenced my own research practices, and I assume it’s influenced many others as well. And it’s my general impression that journals such as Psychological Science and PNAS don’t publish as much junk as they used to. I haven’t done any formal study of this, though.

P.S. For some other relevant recent discussions, see More on possibly rigor-enhancing practices in quantitative psychology research and (back to basics:) How is statistics relevant to scientific discovery?.

“U.S. Watchdog Halts Studies at N.Y. Psychiatric Center After a Subject’s Suicide”

I don’t know anyone involved in this story and don’t really have anything to add. I just wanted to post on it because it sits at the intersection of science, statistics, and academia. The New York State Psychiatric Institute is involved in a lot of the funded biostatistics research at Columbia University. Ultimately we want to save lives and improve people’s health, but in the meantime we do the work and take the funding without always thinking too much about the people involved. I don’t have any specific study in mind here; I’m just thinking in general terms.

This empirical paper has been cited 1616 times but I don’t find it convincing. There’s no single fatal flaw, but the evidence does not seem so clear. How to think about this sort of thing? What to do? First, accept that evidence might not all go in one direction. Second, make lots of graphs. Also, an amusing story about how this paper is getting cited nowadays.

1. When can we trust? How can we navigate social science with skepticism?

2. Why I’m not convinced by that Quebec child-care study

3. 20 years on

1. When can we trust? How can we navigate social science with skepticism?

The other day I happened to run across a post from 2016 that I think is still worth sharing.

Here’s the background. Someone pointed me to a paper making the claim that “Canada’s universal childcare hurt children and families. . . . the evidence suggests that children are worse off by measures ranging from aggression to motor and social skills to illness. We also uncover evidence that the new child care program led to more hostile, less consistent parenting, worse parental health, and lower‐quality parental relationships.”

I looked at the paper carefully and wasn’t convinced. In short, the evidence went in all sorts of different directions, and I felt that the authors had been trying too hard to fit it all into a consistent story. It’s not that the paper had fatal flaws—it was not at all in the category of horror classics such as the beauty-and-sex-ratio paper, the ESP paper, the himmicanes paper, the air-rage paper, the pizzagate papers, the ovulation-and-voting paper, the air-pollution-in-China paper, etc etc etc.—it just didn’t really add up to me.

The question then is, if a paper can appear in a top journal, have no single killer flaw but still not be convincing, can we trust anything at all in the social sciences? At what point does skepticism become nihilism? Must I invoke the Chestertonian principle on myself?

I don’t know.

What I do think is that the first step is to carefully assess the connection between published claims, the analysis that led to these claims, and the data used in the analysis. The above-discussed paper has a problem that I’ve seen a lot, which is an implicit assumption that all the evidence should go in the same direction, a compression of complexity which I think is related to the cognitive illusion that Tversky and Kahneman called “the law of small numbers.” The first step in climbing out of this sort of hole is to look at lots of things at once, rather than treating empirical results as a sort of big bowl of fruit where the researcher can just pick out the juiciest items and leave the rest behind.

2. Why I’m not convinced by that Quebec child-care study

Here’s what I wrote on that paper back in 2016:

Yesterday we discussed the difficulties of learning from a small, noisy experiment, in the context of a longitudinal study conducted in Jamaica where researchers reported that an early-childhood intervention program caused a 42%, or 25%, gain in later earnings. I expressed skepticism.

Today I want to talk about a paper making an opposite claim: “Canada’s universal childcare hurt children and families.”

I’m skeptical of this one too.

Here’s the background. I happened to mention the problems with the Jamaica study in a talk I gave recently at Google, and afterward Hal Varian pointed me to this summary by Les Picker of a recent research article:

In Universal Childcare, Maternal Labor Supply, and Family Well-Being (NBER Working Paper No. 11832), authors Michael Baker, Jonathan Gruber, and Kevin Milligan measure the implications of universal childcare by studying the effects of the Quebec Family Policy. Beginning in 1997, the Canadian province of Quebec extended full-time kindergarten to all 5-year olds and included the provision of childcare at an out-of-pocket price of $5 per day to all 4-year olds. This $5 per day policy was extended to all 3-year olds in 1998, all 2-year olds in 1999, and finally to all children younger than 2 years old in 2000.

(Nearly) free child care: that’s a big deal. And the gradual rollout gives researchers a chance to estimate the effects of the program by comparing, at each age, children who were and were not eligible for the program.

The summary continues:

The authors first find that there was an enormous rise in childcare use in response to these subsidies: childcare use rose by one-third over just a few years. About a third of this shift appears to arise from women who previously had informal arrangements moving into the formal (subsidized) sector, and there were also equally large shifts from family and friend-based child care to paid care. Correspondingly, there was a large rise in the labor supply of married women when this program was introduced.

That makes sense. As usual, we expect elasticities to be between 0 and 1.

But what about the kids?

Disturbingly, the authors report that children’s outcomes have worsened since the program was introduced along a variety of behavioral and health dimensions. The NLSCY contains a host of measures of child well being developed by social scientists, ranging from aggression and hyperactivity, to motor-social skills, to illness. Along virtually every one of these dimensions, children in Quebec see their outcomes deteriorate relative to children in the rest of the nation over this time period.

More specifically:

Their results imply that this policy resulted in a rise of anxiety of children exposed to this new program of between 60 percent and 150 percent, and a decline in motor/social skills of between 8 percent and 20 percent. These findings represent a sharp break from previous trends in Quebec and the rest of the nation, and there are no such effects found for older children who were not subject to this policy change.

Also:

The authors also find that families became more strained with the introduction of the program, as manifested in more hostile, less consistent parenting, worse adult mental health, and lower relationship satisfaction for mothers.

I just find all this hard to believe. A doubling of anxiety? A decline in motor/social skills? Are these day care centers really that horrible? I guess it’s possible that the kids are ruining their health by giving each other colds (“There is a significant negative effect on the odds of being in excellent health of 5.3 percentage points.”)—but of course I’ve also heard the opposite, that it’s better to give your immune system a workout than to be preserved in a bubble. They also report “a policy effect on the treated of 155.8% to 394.6%” in the rate of nose/throat infection.

OK, here’s the research article.

The authors seem to be considering three situations: “childcare,” “informal childcare,” and “no childcare.” But I don’t understand how these are defined. Every child is cared for in some way, right? It’s not like the kid’s just sitting out on the street. So I’d assume that “no childcare” is actually informal childcare: mostly care by mom, dad, sibs, grandparents, etc. But then what do they mean by the category “informal childcare”? If parents are trading off taking care of the kid, does this count as informal childcare or no childcare? I find it hard to follow exactly what is going on in the paper, starting with the descriptive statistics, because I’m not quite sure what they’re talking about.

I think what’s needed here is some more comprehensive organization of the results. For example, consider this paragraph:

The results for 6-11 year olds, who were less affected by this policy change (but not unaffected due to the subsidization of after-school care) are in the third column of Table 4. They are largely consistent with a causal interpretation of the estimates. For three of the six measures for which data on 6-11 year olds is available (hyperactivity, aggressiveness and injury) the estimates are wrong-signed, and the estimate for injuries is statistically significant. For excellent health, there is also a negative effect on 6-11 year olds, but it is much smaller than the effect on 0-4 year olds. For anxiety, however, there is a significant and large effect on 6-11 year olds which is of similar magnitude as the result for 0-4 year olds.

The first sentence of the above excerpt has a cover-all-bases kind of feeling: if results are similar for 6-11 year olds as for 2-4 year olds, you can go with “but not unaffected”; if they differ, you can go with “less affected.” Various things are pulled out based on whether they are statistically significant, and they never return to the result for anxiety, which would seem to contradict their story. Instead they write, “the lack of consistent findings for 6-11 year olds confirm that this is a causal impact of the policy change.” “Confirm” seems a bit strong to me.

The authors also suggest:

For example, higher exposure to childcare could lead to increased reports of bad outcomes with no real underlying deterioration in child behaviour, if childcare providers identify negative behaviours not noticed (or previously acknowledged) by parents.

This seems like a reasonable guess to me! But the authors immediately dismiss this idea:

While we can’t rule out these alternatives, they seem unlikely given the consistency of our findings both across a broad spectrum of indices, and across the categories that make up each index (as shown in Appendix C). In particular, these alternatives would not suggest such strong findings for health-based measures, or for the more objective evaluations that underlie the motor-social skills index (such as counting to ten, or speaking a sentence of three words or more).

Health, sure: as noted above, I can well believe that these kids are catching colds from each other.

But what about that motor-skills index? Here are their results from the appendix:

[Screenshot: the appendix table of estimates for the individual components of the motor-social skills index.]

I’m not quite sure whether + or – is desirable here, but I do notice that the coefficients for “can count out loud to 10” and “spoken a sentence of 3 words or more” (the two examples cited in the paragraph above) go in opposite directions. That’s fine—the data are the data—but it doesn’t quite fit their story of consistency.

More generally, the data are addressed in a scattershot manner. For example:

We have estimated our models separately for those with and without siblings, finding no consistent evidence of a stronger effect on one group or another. While not ruling out the socialization story, this finding is not consistent with it.

This appears to be the classic error of interpreting the non-rejection of a null hypothesis as evidence in favor of that null.

And here’s their table of key results:

[Screenshot: their table of key results.]

As quantitative social scientists we need to think harder about how to summarize complicated data with multiple outcomes and many different comparisons.

As I see it, the current standard ways to summarize this sort of data are:

(a) Focus on a particular outcome and a particular comparison (choosing these ideally, though not usually, using preregistration), present that as the main finding and then tag all else as speculation.

Or, (b) Construct a story that seems consistent with the general pattern in the data, and then extract statistically significant or nonsignificant comparisons to support your case.

Plan (b) is what was done here, and I think it has problems: lots of stories can fit the data, and there’s a real push toward sweeping any anomalies aside.

For example, how do you think about that coefficient of 0.308 with standard error 0.080 for anxiety among the 6-11-year-olds? You can say it’s just bad luck with the data, or that the standard error calculation is only approximate and the real standard error should be higher, or that it’s some real effect caused by what was happening in Quebec in these years—but the trouble is that any of these explanations could be used just as well to explain the 0.234 with standard error 0.068 for 2-4-year-olds, which directly maps to one of their main findings.
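To put a number on that (this is my own back-of-the-envelope calculation, not anything from the paper, and it treats the two estimates as roughly independent), the gap between the anomalous coefficient and the headline coefficient is small relative to its standard error:

import math

# Anxiety coefficients as reported, treated here as independent estimates
est_6_11, se_6_11 = 0.308, 0.080   # 6-11-year-olds (the anomaly)
est_2_4, se_2_4 = 0.234, 0.068     # 2-4-year-olds (one of their main findings)

diff = est_6_11 - est_2_4
se_diff = math.sqrt(se_6_11**2 + se_2_4**2)   # approximate; ignores any correlation
print(f"difference = {diff:.3f}, se = {se_diff:.3f}, ratio = {diff / se_diff:.1f}")
# prints: difference = 0.074, se = 0.105, ratio = 0.7

So the estimate that contradicts their story and the estimate that supports it are statistically indistinguishable from each other, which is part of what makes the selective explaining-away so uncomfortable.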

Once you start explaining away anomalies, there’s just a huge selection effect in which data patterns you choose to take at face value and which you try to dismiss.

So maybe approach (a) is better—just pick one major outcome and go with it? But then you’re throwing away lots of data, and that can’t be right.

I am unconvinced by the claims of Baker et al., but it’s not like I’m saying their paper is terrible. They have an identification strategy, and clean data, and some reasonable hypotheses. I just think their statistical analysis approach is not working. One trouble is that statistics textbooks tend to focus on stand-alone analyses (getting the p-value right, or getting the posterior distribution, or whatever) and not on how these conclusions fit into the big picture. And of course there’s lots of talk about exploratory data analysis, and that’s great, but EDA is typically not plugged into issues of modeling, data collection, and inference.

What to do?

OK, then. Let’s forget about the strengths and the weaknesses of the Baker et al. paper and instead ask, how should one evaluate a program like Quebec’s nearly-free preschool? I’m not sure. I’d start from the perspective of trying to learn what we can from what might well be ambiguous evidence, rather than trying to make a case in one direction or another. And lots of graphs, which would allow us to see more in one place, that’s much better than tables and asterisks. But, exactly what to do, I’m not sure. I don’t know whether the policy analysis literature features any good examples of this sort of exploration. I’d like to see something, for this particular example and more generally as a template for program evaluation.
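As a rough illustration of what I have in mind (made-up numbers, not the Baker et al. estimates), here is a sketch of a compact grid of plots: one panel per age group, all the outcomes shown together as estimates with uncertainty intervals, so you can take in the whole pattern at once instead of hunting through tables for asterisks.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
outcomes = ["anxiety", "hyperactivity", "aggression", "motor-social", "excellent health", "injury"]
groups = ["0-4 year olds", "6-11 year olds"]

fig, axes = plt.subplots(1, len(groups), figsize=(8, 3), sharex=True, sharey=True)
for ax, group in zip(axes, groups):
    est = rng.normal(0, 0.15, size=len(outcomes))   # made-up point estimates
    se = np.full(len(outcomes), 0.08)               # made-up standard errors
    pos = np.arange(len(outcomes))
    ax.errorbar(est, pos, xerr=2 * se, fmt="o")     # estimate +/- 2 standard errors
    ax.axvline(0, linestyle="--", linewidth=1)
    ax.set_title(group)
    ax.set_yticks(pos)
    ax.set_yticklabels(outcomes)
    ax.set_xlabel("estimated effect")
fig.tight_layout()
plt.show()

The point is not this particular layout but the principle: show every comparison, with its uncertainty, in one place.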

3. Nearly 20 years on

So here’s the story. I heard about this work in 2016, from a press release issued in 2006; the article was published in a top economics journal in 2008, appeared in preprint form in 2005, and was based on data collected in the late 1990s. And here we are discussing it again in 2023.

It’s kind of beating a dead horse to discuss a 20-year-old piece of research, but you know what they say about dead horses. Also, according to Google Scholar, the article has 1616 citations, including 120 citations in 2023 alone, so, yeah, still worth discussing.

That said, not all the references refer to the substance of the paper. For example, the very first paper on Google Scholar’s list of citers is a review article, Explaining the Decline in the US Employment-to-Population Ratio, and when I searched to see what they said about this Canada paper (Baker, Gruber, and Milligan 2008), here’s what was there:

Additional evidence on the effects of publicly provided childcare comes from the province of Quebec in Canada, where a comprehensive reform adopted in 1997 called for regulated childcare spaces to be provided to all children from birth to age five at a price of $5 per day. Studies of that reform conclude that it had significant and long-lasting effects on mothers’ labor force participation (Baker, Gruber, and Milligan 2008; Lefebvre and Merrigan 2008; Haeck, Lefebvre, and Merrigan 2015). An important feature of the Quebec reform was its universal nature; once fully implemented, it made very low-cost childcare available for all children in the province. Nollenberger and Rodriguez-Planas (2015) find similarly positive effects on mothers’ employment associated with the introduction of universal preschool for three-year-olds in Spain.

They didn’t mention the bit about “the evidence suggests that children are worse off” at all! Indeed, they’re just kinda lumping this in with positive studies on “the effects of publicly provided childcare.” Yes, it’s true that this new article specifically refers to “similarly positive effects on mothers’ employment,” and that earlier paper, while negative about the effect of universal child care on kids, did say, “Maternal labor supply increases significantly.” Still, when it comes to sentiment analysis, that 2008 paper just got thrown into the positivity blender.

I don’t know how to think about this.

On one hand, I feel bad for Baker et al.: they did this big research project, they achieved the academic dream of publishing it in a top journal, it’s received 1616 citations and remains relevant today—but, when it got cited, its negative message was completely lost! I guess they should’ve given their paper a more direct title. Instead of “Universal Child Care, Maternal Labor Supply, and Family Well‐Being,” they should’ve called it something like: “Universal Child Care: Good for Mothers’ Employment, Bad for Kids.”

On the other hand, for the reasons discussed above, I don’t actually believe their strong claims about the child care being bad for kids, so I’m kinda relieved that, even though the paper is being cited, some of its message has been lost. You win some, you lose some.

Cohort effects in literature (David Foster Wallace and other local heroes)

I read this review by Patricia Lockwood of a book by David Foster Wallace. I’d never read the book being reviewed, but that was no problem because the review itself was readable and full of interesting things. What struck me was how important Wallace seemed to be to her. I’ve heard of Wallace and read one or two things by him, but from my perspective he’s just one of many, many writers, with no special position in the world. I think it’s a generational thing. Wallace hit the spot for people of Lockwood’s age, a couple decades younger than me. To get a sense of how Lockwood feels about Wallace’s writing, I’d have to consider someone like George Orwell or Philip K. Dick, who to me had special things to say.

My point about Orwell and Dick (or, for Lockwood, Wallace) is not that they stand out from all other writers. Yes, Orwell and Dick are great writers with wonderful styles and a lot of interesting things to say—but that description characterizes many many others, from Charles Dickens and Mark Twain through James Jones, Veronica Geng, Richard Ford, Colson Whitehead, etc etc. Orwell and Dick just seem particularly important to me; it’s hard to say exactly why. So there was something fascinating about seeing someone else write about a nothing-special (from my perspective) writer but with that attitude that, good or bad, he’s important.

It kinda reminds me of how people used to speculate on what sort of music would’ve been made by the Beatles had they not broken up. In retrospect, the question just seems silly: they were a group of musicians who wrote some great songs, lots of great songs have been written by others since then, and there’s no reason to think that future Beatles compositions would’ve been any more amazing than the fine-but-not-earthshaking songs they wrote on their own or that others were writing during that period. What’s interesting to me here is not to think about the Beatles but to put myself into that frame of mind in which the Beatles were so important that the question, What would they have done next?, seemed worth asking.

That’s why I call Wallace, and some of the other writers discussed above, “local heroes,” with their strongest appeal localized in cohort and time rather than in space. “Voice of a generation” would be another way to put it, but I like the framing of locality because it opens the door to considering dimensions other than cohort and time.

Modest pre-registration

This is Jessica. In light of the hassles that can arise when authors make clear that they value pre-registration by writing papers about its effectiveness but then they can’t find their pre-registration, I have been re-considering how I feel about the value of the public aspects of pre-registration. 

I personally find pre-registration useful, especially when working with graduate students (as I am almost always doing). It gets us to agree on what we are actually hoping to see and how we are going to define the key quantities we compare. I trust my Ph.D. students, but when we pre-register we are more likely to find the gaps between our goals and the analyses that we can actually do because we have it all in a single document that we know cannot be further revised after we start collecting data.

Shravan Vasishth put it well in a comment on a previous post:

My lab has been doing pre-registrations for several years now, and most of the time what I learned from the pre-registration was that we didn’t really adequately think about what we would do once we have the data. My lab and I are getting better at this now, but it took many attempts to do a pre-registration that actually made sense once the data were in. That said, it’s still better to do a pre-registration than not, if only for the experimenter’s own sake (as a sanity pre-check). 

The part I find icky is that as soon as pre-registration gets discussed outside the lab, it often gets applied and interpreted as a symbol that the research is rigorous. Like the authors who pre-register must be doing “real science.” But there’s nothing about pre-registration to stop sloppy thinking, whether that means inappropriate causal inference, underspecification of the target population, overfitting to the specific experimental conditions, etc.

The Protzko et al. example could be taken as unusual, in that we might not expect the average reviewer to feel the need to double check the pre-registration when they see that the author list includes Nosek and Nelson. On the other hand, we could see it as particularly damning evidence of how pre-registration can fail in practice, when some of the researchers we associate with the highest standards of methodological rigor don’t appear to take the claims they make about their own practices seriously enough to make sure they can back them up when asked.

My skepticism about how seriously we should take public declarations of pre-registration is influenced by my experience as author and reviewer, where, at least in the venues I’ve published in, when you describe your work as pre-registered it wins points with reviewers, increasing the chances that someone will comment about the methodological rigor, that your paper will win an award, etc. However, I highly doubt the modal reviewer or reader is checking the preregistration. At least, no reviewer has ever asked a single question about the pre-registration in any of the studies I’ve ever submitted, and I’ve been using pre-registration for at least 5 or 6 years. I guess it’s possible they are checking it and it’s just all so perfectly laid out in our documents and followed to a T that there’s nothing to question. But I doubt that… surely at some point we’ve forgotten to fully report a pre-specified exploratory analysis, or the pre-registration wasn’t clear, or something else like that. Not a single question ever seems fishy.

Something I dislike about authors’ incentives when reporting on their methods in general is that reviewers (and readers) can often be unimaginative. So what the authors say about their work can set the tone for how the paper is received. I hate when authors describe their own work in a paper as “rigorous” or “highly ecologically valid” or “first to show” rather than just allowing the details to speak for themselves. It feels like cheap marketing. But I can understand why some do it, because one really can impress some readers by saying such things. Hence, winning points for mentioning pre-registration, with no real checks and balances in place, can be a real issue.

How should we use pre-registration in light of all this? If nobody cares to do the checking, but extra credit is being handed out when authors slap the “pre-registered” label on their work, maybe we want to pre-register more quietly.

At the extreme, we could pre-register amongst ourselves, in our labs or whatever, without telling everyone about it. Notify our collaborators by email or slack or whatever else when we’ve pinned down the analysis plan and are ready to collect the data but not expect anyone else to care, except maybe when they notice that our research is well-engineered in general, because we are the kind of authors who do our best to keep ourselves honest and use transparent methods and subject our data to sensitivity analyses etc. anyways.

I’ve implied before on the blog that pre-registration is something I find personally useful but see externally as a gesture toward transparency more than anything else. If we can’t trust authors when they claim to pre-register, but we don’t expect the reviewing or reading standards in our communities to evolve to the point where checking to see what it actually says becomes mainstream, then we could just omit the signaling aspect altogether and continue to trust that people are doing their best. I’m not convinced we would lose much in such a world as pre-registration is currently practiced in the areas I work in. Maybe the only real way to fix science is to expect people to find reasons to be self-motivated to do good work. And if they don’t, well, it’s probably going to be obvious in other ways than just a lack of pre-registration. Bad reasoning should be obvious and if it’s not, maybe we should spend more time training students on how to recognize it.

But of course this seems unrealistic, since you can’t stop people from saying things in papers that they think reviewers will find relevant. And many reviewers have already shown they find it relevant to hear about a pre-registration. Plus of course the only real benefit we can say with certainty that pre-registration provides is that if one pre-registers, others can verify to what extent the analysis was planned beforehand and therefore less subject to authors exploiting degrees of freedom, so we’d lose this.

An alternative strategy is to be more specific about pre-registration while crowing about it less. Include the pre-registration link in your manuscript, but stop with all the label-dropping that often occurs in the abstract, the introduction, and sometimes the title itself, announcing that the study is pre-registered. (I have to admit, I have been guilty of this, but from now on I intend to remove such statements from papers I’m on.)

Pre-registration statements should be more specific, in light of the fact that we can’t expect reviewers to catch deviations themselves. E.g., if you follow your pre-registration to a T, say something like “For each of our experiments, we report all sample sizes, conditions, data exclusions, and measures for the main analyses that were described in our pre-registration documents. We do not report any analyses that were not included in our pre-registration.” That makes it clear what you are knowingly claiming regarding the pre-registration status of your work. 

Of course, some people may say reasonably specific things even when they can’t back them up with a pre-registration document. But being specific at least acknowledges that a pre-registration is actually a bundle of details that we must mind if we’re going to claim to have done it, because they should impact how it’s assessed. Plus maybe the act of typing out specific propositions would remind some authors to check what their pre-registration actually says. 

If you don’t follow your pre-registration to a T, which I’m guessing is more common in practice, then there are a few strategies I could see using:

– Put in a dedicated paragraph before you describe results detailing all deviations from what you pre-registered. If it’s a whole lot of stuff, perhaps the act of writing this paragraph will convince you to just skip reporting on the pre-registration altogether because it clearly didn’t work out.

– Label each individual comparison/test as pre-registered versus not as you walk through the results. Personally I think this makes things harder to keep track of than a single dedicated paragraph, but maybe there are occasionally situations where it’s better.

(back to basics:) How is statistics relevant to scientific discovery?

Following up on today’s post, “Why I continue to support the science reform movement despite its flaws,” it seems worth linking to this post from 2019, about the way in which some mainstream academic social psychologists have moved beyond denial, to a more realistic view that accepts that failure is a routine, indeed inevitable part of science, and that, just because a claim is published, even in a prestigious journal, that doesn’t mean it has to be correct:

Once you accept that the replication rate is not 100%, nor should it be, and once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists whom academic insiders used to refer to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that would eventually be discovered by someone else, a speeding along of a process that we’d hope would happen anyway, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery. . . .

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. . . .

Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start. . . .

What, then, is—or should be—the role of statistics and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making.

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work. . . .

We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

– Reduce the costs of failed experimentation by being more clear when research-based claims are speculative.

– React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

– Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

– Avoid the two-tier system. Give respect to a student project or arXiv paper just as you would to a paper published in Science or Nature.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

There’s more at the link, and also let me again plug my recent article, Before data analysis: Additional recommendations for designing experiments to learn about the world.

Why I continue to support the science reform movement despite its flaws

I was having a discussion with someone about problems with the science reform movement (as discussed here by Jessica), and he shared his opinion that “Scientific reform in some corners has elements of millenarian cults. In their view, science is not making progress because of individual failings (bias, fraud, qrps) and that if we follow a set of rituals (power analysis, preregistration) devised by the leaders then we can usher in a new era where the truth is revealed (high replicability).”

My quick reaction was that this reminded me of an annoying thing where people use “religion” as a term of insult. When this came up before, I wrote that maybe it’s time to retire use of the term “religion” to mean “uncritical belief in something I disagree with.”

But then I was thinking about this all from another direction, and I think there’s something there there. Not the “millenarian cults” thing, which I think was an overreaction on my correspondent’s part.

Rather, I see a paradox. From his perspective, my correspondent sees the science reform movement as having a narrow perspective, an enforced conformity that leads it into unforced errors such as publishing a high-profile paper promoting preregistration without actually itself following preregistered analysis plans. OK, he doesn’t see all of the science reform movement as being so narrow—for one thing, I’m part of the science reform movement and I wasn’t part of that project!—but he sees some core of the movement as being stuck in narrow rituals and leader-worship.

But I think it’s kind of the opposite. From my perspective, the core of the science reform movement (the Open Science Framework, etc.) has had to make all sorts of compromises with conservative forces in the science establishment, especially within academic psychology, in order to keep them on board. To get funding, institutional support, buy-in from key players, . . . that takes a lot of political maneuvering.

I don’t say this lightly, and I’m not using “political” as a put-down. I’m a political scientist, but personally I’m not very good at politics. Politics takes hard work, requiring lots of patience and negotiation. I’m impatient and I hate negotiation; I’d much rather just put all my cards face-up on the table. For some activities, such as blogging and collaborative science, these traits are helpful. I can’t collaborate with everybody, but when the connection’s there, it can really work.

But there’s more to the world than this sort of small-group work. Building and maintaining larger institutions, that’s important too.

So here’s my point: Some core problems with the open-science movement are not a product of cult-like groupthink. Rather, it’s the opposite: this core has been structured out of a compromise with some groups within psychology who are tied to old-fashioned thinking, and this politically-necessary (perhaps) compromise has led to some incoherence, in particular the attitude or hope that, by just including some preregistration here and getting rid of some questionable research practices there, everyone could pretty much continue with business as usual.

Summary

The open-science movement has always had a tension between burn-it-all-down and here’s-one-quick-trick. Put them together and it kinda sounds like a cult that can’t see outward, but I see it as more the opposite, as an awkward coalition representing fundamentally incoherent views. But both sides of the coalition need each other: the reformers need the old institutional powers to make a real difference in practice, and the oldsters need the reformers because outsiders are losing confidence in the system.

The good news

The good news for me is that both groups within this coalition should be able to appreciate frank criticism from the outside (they can listen to me scream and get something out of it, even if they don’t agree with all my claims) and should also be able to appreciate research methods: once you accept the basic tenets of the science reform movement, there are clear benefits to better measurement, better design, and better analysis. In the old world of p-hacking, there was no real reason to do your studies well, as you could get statistical significance and publication with any old random numbers, along with a few framing tricks. In the new world of science reform, even imperfect science reform, this sort of noise mining isn’t so effective, and traditional statistical ideas of measurement, design, and analysis become relevant again.

So that’s one reason I’m cool with the science reform movement. I think it’s in the right direction: its dot product with the ideal direction is positive. But I’m not so good at politics so I can’t resist criticizing it too. It’s all good.

Reactions

I sent the above to my correspondent, who wrote:

I don’t think it is a literal cult in the sense that carries the normative judgments and pejorative connotations we usually ascribe to cults and religions. The analogy was more of a shorthand to highlight a common dynamic that emerges when you have a shared sense of crisis, ritualistic/procedural solutions, and a hope that merely performing these activities will get past the crisis and bring about a brighter future. This is a spot where group-think can, and at times possibly should, kick in. People don’t have time to each individually and critically evaluate the solutions, and often the claim is that they need to be implemented broadly to work. Sometimes these dynamics reflect a real problem with real solutions, sometimes they’re totally off the rails. All this is not to say I’m opposed to scientific reform; I’m very much for it in the general sense. There’s no shortage of room for improvement in how we turn observations into understanding, from improving statistical literacy and theory development to transparency and fostering healthier incentives. I am, however, wary of the uncritical belief that the crisis is simply one of failed replications and that the performance of “open science rituals” is sufficient for reform, across the breadth of things we consider science. As a minor point, I don’t think the vast majority of prominent figures in open science intend for these dynamics to occur, but I do think they all should be wary of them.

There does seem to be a problem that many researchers are too committed to the “estimate the effect” paradigm and don’t fully grapple with the consequences of high variability. This is particularly disturbing in psychology, given that just about all psychology experiments study interactions, not main effects. Thus, a claim that effect sizes don’t vary much is a claim that effect sizes vary a lot in the dimension being studied, but have very little variation in other dimensions. Which doesn’t make a lot of sense to me.

Getting back to the open-science movement, I want to emphasize the level of effort it takes to conduct and coordinate these big group efforts, along with the effort required to keep together the coalition of skeptics (who see preregistration as a tool for shooting down false claims) and true believers (who see preregistration as a way to defuse skepticism about their claims) and get these papers published in top journals. I’d also say it takes a lot of effort for them to get funding, but that would be kind of a cheap shot, given that I too put in a lot of effort to get funding!

Anyway, to continue, I think that some of the problems with the science reform movement are that it effectively promises different things to different people. And another problem is with these massive projects that inevitably include things that not all the authors will agree with.

So, yeah, I have a problem with simplistic science reform prescriptions, for example recommendations to increase sample size without any nod toward effect size and measurement. But much much worse, in my opinion, are the claims of success we’ve seen from researchers and advocates who are outside the science-reform movement. I’m thinking here about ridiculous statements such as the unfounded claim of 17 replications of power pose, or the endless stream of hype from the nudgelords, or the “sleep is your superpower” guy, or my personal favorite, the unfounded claim from Harvard that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

It’s almost enough to stop here with the remark that the scientific reform movement has been lucky in its enemies.

But I also want to say that I appreciate that the “left wing” of the science reform movement—the researchers who envision replication and preregistration and the threat of replication and preregistration as a tool to shoot down bad studies—have indeed faced real resistance within academia and the news media to their efforts, as lots of people will hate the bearers of bad news. And I also appreciate the “right wing” of the science reform movement—the researchers who envision replication and preregistration as a way to validate their studies and refute the critics—in that they’re willing to put their ideas to the test. Not always perfectly, but you have to start somewhere.

While I remain annoyed at certain aspects of the mainstream science reform movement, especially when it manifests itself in mass-authored articles such as the notorious recent non-preregistered paper on the effects of preregistration, or that “Redefine statistical significance” article, or various p-value hardliners we’ve encountered over the decades, I also respect the political challenges of coalition-building that are evident in that movement.

So my plan remains to appreciate the movement while continuing to criticize its statements that seem wrong or do not make sense.

I sent the above to Jessica Hullman, who wrote:

I can relate to being surprised by the reactions of open science enthusiasts to certain lines of questioning. In my view, how to fix science is about as complicated a question as we will encounter. The certainty and level of comfort with making bold claims that many advocates of open science seem to have is hard for me to understand. Maybe that is just the way the world works, or at least the way it works if you want to get your ideas published in venues like PNAS or Nature. But the sensitivity to what gets said in public venues against certain open science practices or people reminds me very much of established academics trying to hush talk about problems in psychology, as though questioning certain things is off limits. I’ve been surprised on the blog for example when I think aloud about something like preregistration being imperfect and some commenters seem to have a visceral negative reaction to seeing something like that written. To me that’s the opposite of how we should be thinking.

As an aside, someone I’m collaborating with recently described to me his understanding of the strategy for getting published in PNAS. It was 1. Say something timely/interesting, 2. Don’t be wrong. He explained that ‘Don’t be wrong’ could be accomplished by preregistering and large sample size. Naturally I was surprised to hear #2 described as if it’s really that easy. Silly me for spending all this time thinking so hard about other aspects of methods!

The idea of necessary politics is interesting; not what I would have thought of but probably some truth to it. For me many of the challenges of trying to reform science boil down to people being heuristic-needing agents. We accept that many problems arise from ritualistic behavior, but we have trouble overcoming that, perhaps because no matter how thoughtful/nuanced some may prefer to be, there’s always a larger group who want simple fixes / aren’t incentivized to go there. It’s hard to have broad appeal without being reductionist I guess.

Hey! Here’s how to rewrite and improve the title and abstract of a scientific paper:

Last week in class we read and then rewrote the title and abstract of a paper. We did it again yesterday, this time with one of my recent unpublished papers.

Here’s what I had originally:

title: Unifying design-based and model-based sampling inference by estimating a joint population distribution for weights and outcomes

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

Not terrible, but we can do better. Here’s the new version:

title: MRP using sampling weights

abstract: A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights come from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

How did we get there?

The title. The original title was fine—it starts with some advertising (“Unifying design-based and model-based sampling inference”) and follows up with a description of how the method works (“estimating a joint population distribution for weights and outcomes”).

But the main point of the title is to get the notice of potential readers, people who might find the paper useful or interesting (or both!).

This pushes the question back one step: Who would find this paper useful or interesting? Anyone who works with sampling weights. Anyone who uses public survey data or, more generally, surveys collected by others, which typically contain sampling weights. And anyone who’d like to follow my path in survey analysis, which would be all the people out there who use MRP (multilevel regression and poststratification). Hence the new title, which is crisp, clear, and focused.

My only problem with the new title, “MRP using sampling weights,” is that it doesn’t clearly convey that the paper involves new research. It makes it look like a review article. But that’s not so horrible; people often like to learn from review articles.

The abstract. If you look carefully, you’ll see that the new abstract is the same as the original abstract, except that we replaced the middle part:

But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses.

with this:

But what if you don’t know where the weights come from?

Here’s what happened. We started by rereading the original abstract carefully. That abstract has some long sentences that are hard to follow. The first sentence is already kinda complicated, but I decided to keep it, because it clearly lays out the problem, and also I think the reader of an abstract will be willing to work a bit when reading the first sentence. Getting to the abstract at all is a kind of commitment.

The second sentence, though, that’s another tangle, and at this point the reader is tempted to give up and just skate along to the end—which I don’t want! The third sentence isn’t horrible, but it’s still a little bit long (starting with the nearly-contentless “It is also not clear how one is supposed to account for” and ending with the unnecessary “in such analyses”). Also, we don’t even really talk much about clustering in the paper! So it was a no-brainer to collapse these into a sentence that was much more snappy and direct.

Finally, yeah, the final sentence of the abstract is kinda technical, but (a) the paper’s technical, and we want to convey some of its content in the abstract!, and (b) after that new, crisp, replacement second sentence, I think the reader is ready to take a breath and hear what the paper is all about.

General principles

Here’s a general template for a research paper:
1. What is the goal or general problem?
2. Why is it important?
3. What is the challenge?
4. What is the solution? What must be done to implement this solution?
5. If the idea in this paper is so great, why wasn’t the problem already solved by someone else?
6. What are the limitations of the proposed solution? What is its domain of applicability?

We used these principles in our rewriting of my title and abstract. The first step was for me to answer the above 6 questions:
1. Goal is to do survey inference with sampling weights.
2. It’s important for zillions of researchers who use existing surveys which come with weights.
3. The challenge is that if you don’t know where the weights come from, you can’t just follow the recommended approach to condition in the regression model on the information that is predictive of inclusion into the sample.
4. The solution is to condition on the weights themselves, which involves the additional step of estimating a joint population distribution for the weights and other predictors in the model. (A toy sketch of this idea appears just after this list.)
5. The problem involves a new concept (imagining a population distribution for weights, which is not a coherent assumption, because, in the real world, weights are constructed based on the data) and some new mathematical steps (not inherently sophisticated as mathematics, but new work from a statistical perspective). Also, the idea of modeling the weights is not completely new; there is some related literature, and one of our contributions is to take weights (which are typically constructed from a non-Bayesian design-based perspective) and use them in a Bayesian analysis.
6. Survey weights do not include all design information, so the solution offered in the paper can only be approximate. In addition the method requires distributional assumptions on the weights; also it’s a new method so who knows how useful it will be in practice.
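To give a sense of what the solution in step 4 can look like, here is a deliberately oversimplified toy sketch: simulated data, an ordinary least-squares fit standing in for the multilevel model, and a crude binned estimate of the population distribution of the weights. It is only meant to convey the flavor of “model the outcome given the weight, estimate the population distribution of the weights, then poststratify,” not to reproduce the method in the paper.

import numpy as np

rng = np.random.default_rng(1)

# Simulate a population in which the outcome is related to the design variable
# that drives sample inclusion (and hence the weights).
N = 100_000
x = rng.normal(size=N)                    # design variable, unknown to the analyst
y = 1.0 + 0.5 * x + rng.normal(size=N)    # outcome; true population mean is about 1

# Units with higher x are more likely to be sampled; weights = 1 / Pr(inclusion).
p_incl = 1 / (1 + np.exp(-(x - 3)))
in_sample = rng.random(N) < p_incl
w = 1 / p_incl[in_sample]                 # the analyst sees only y_s and w
y_s = y[in_sample]

# Step 1: regress the outcome on the (log) weight, a stand-in for a richer model.
logw = np.log(w)
X = np.column_stack([np.ones_like(logw), logw])
beta, *_ = np.linalg.lstsq(X, y_s, rcond=None)

# Step 2: estimate the population distribution of the weights, letting each
# sampled unit stand for w_i population units, binned into 20 strata.
edges = np.quantile(logw, np.linspace(0, 1, 21))
stratum = np.clip(np.digitize(logw, edges) - 1, 0, 19)
share = np.array([w[stratum == k].sum() for k in range(20)])
share = share / share.sum()
center = np.array([logw[stratum == k].mean() for k in range(20)])

# Step 3: poststratify the fitted regression over that estimated distribution.
estimate = np.sum(share * (beta[0] + beta[1] * center))

print(f"true population mean of y:    {y.mean():.3f}")
print(f"naive unweighted sample mean: {y_s.mean():.3f}")
print(f"model + poststratification:   {estimate:.3f}")

The paper’s actual approach is richer (a joint, quasi-Bayesian model for the outcome and the weights, with poststratification on the two variables), but the overall flow is the same: treat the weights as data, model them, and poststratify.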

We can’t put all of that in the abstract, but we were able to include some versions of the answers to questions 1, 3, and 4. Questions 5 and 6 are important, but it’s ok to leave them to the paper, as this is where readers will typically search for limitations and connections to the literature.

Maybe we should include the answer to question 2 in the abstract, though. Perhaps we could replace “But what if you don’t know where the weights come from?” with “But what if you don’t know where the weights come from? This is often a problem when analyzing surveys collected by others.”

Summary

By thinking carefully about goals and audience, we improved the title and abstract of a scientific paper. You should be able to do this in your own work!

The rise and fall of Seth Roberts and the Shangri-La diet

Here’s a post that’s suitable for the Thanksgiving season.

I no longer believe in the Shangri-La diet. Here’s the story.

Background

I met Seth Roberts back in the early 1990s when we were both professors at the University of California. He sometimes came to the statistics department seminar and we got to talking about various things; in particular we shared an interest in statistical graphics. Much of my work in this direction eventually went toward the use of graphical displays to understand fitted models. Seth went in another direction and got interested in the role of exploratory data analysis in science, the idea that we could use graphs not just to test or even understand a model but also as the source of new hypotheses. We continued to discuss these issues over the years.

At some point when we were at Berkeley the administration was encouraging the faculty to teach freshman seminars, and I had the idea of teaching a course on left-handedness. I’d just read the book by Stanley Coren and thought it would be fun to go through it with a class, chapter by chapter. But my knowledge of psychology was minimal so I contacted the one person I knew in the psychology department and asked him if he had any suggestions of someone who’d like to teach the course with me. Seth responded that he’d be interested in doing it himself, and we did it.

Seth was an unusual guy—not always in a good way, but some of his positive traits were friendliness, inquisitiveness, and an openness to consider new ideas. He also struggled with mood swings, social awkwardness, and difficulties with sleep, and he attempted to address these problems with self-experimentation.

After we taught the class together we got together regularly for lunch and Seth told me about his efforts in self-experimentation involving sleeping hours and mood. Most interesting to me was his discovery that seeing life-sized faces in the morning helped with his mood. I can’t remember how he came up with this idea, but perhaps he started by following the recommendation that is often given to people with insomnia to turn off TV and other sources of artificial light in the evening. Seth got in the habit of taping late-night talk-show monologues and then watching them in the morning while he ate breakfast. He found himself happier, did some experimentation, and concluded that we had evolved to talk with people in the morning, and that life-sized faces were necessary. Seth lived alone, so the more natural approach of talking over breakfast with a partner was not available.

Seth’s self-experimentation went slowly, with lots of dead-ends and restarts, which makes sense given the difficulty of his projects. I was always impressed by Seth’s dedication in this, putting in the effort day after day for years. Or maybe it did not represent a huge amount of labor for him, perhaps it was something like a diary or blog which is pleasurable to create, even if it seems from the outside to be a lot of work. In any case, from my perspective, the sustained focus was impressive. He had worked for years to solve his sleep problems and only then turned to the experiments on mood.

Seth’s academic career was unusual. He shot through college and graduate school to a tenure-track job at a top university, then continued to do publication-quality research for several years until receiving tenure. At that point he was not a superstar but I think he was still considered a respected member of the mainstream academic community. But during the years that followed, Seth lost interest in that thread of research. He told me once that his shift was motivated by teaching introductory undergraduate psychology: the students, he said, were interested in things that would affect their lives, and, compared to that, the kind of research that leads to a productive academic career did not seem so appealing.

I suppose that Seth could’ve tried to do research in clinical psychology (Berkeley’s department actually has a strong clinical program) but instead he moved in a different direction and tried different things to improve his sleep and then, later, his skin, his mood, and his diet. In this work, Seth applied what he later called his “insider/outsider perspective”: he was an insider in that he applied what he’d learned from years of research on animal behavior, an outsider in that he was not working within the existing paradigm of research in physiology and nutrition.

At the same time he was working on a book project, which I believe started as a new introductory psychology course focused on science and self-improvement but ultimately morphed into a trade book on ways in which our adaptations to Stone Age life were not serving us well in the modern era. I liked the book but I don’t think he found a publisher. In the years since, this general concept has been widely advanced and many books have been published on the topic.

When Seth came up with the connection between morning faces and depression, this seemed potentially hugely important. Were the faces really doing anything? I have no idea. On one hand, Seth was measuring his own happiness and doing his own treatments on his own hypothesis, so the potential for expectation effects is huge. On the other hand, he said the effect he discovered was a surprise to him and he also reported that the treatment worked with others. Neither he nor, as far as I know, anyone else has attempted a controlled trial of this idea.

In his self-experimentation, Seth lived the contradiction between the two tenets of evidence-based medicine: (1) Try everything, measure everything, record everything; and (2) Make general recommendations based on statistical evidence rather than anecdotes.

Seth’s ideas were extremely evidence-based in that they were based on data that he gathered himself or that people personally sent in to him, and he did use the statistical evidence of his self-measurements, but he did not put in much effort to reduce, control, or adjust for biases in his measurements, nor did he systematically gather data on multiple people.

The Shangri-La diet

Seth’s next success after curing his depression was losing 40 pounds on an unusual diet that he came up with, in which you can eat whatever you want as long as each day you drink a cup of unflavored sugar water, at least an hour before or after a meal. The way he theorized that his diet worked was that the carefully-timed sugar water had the effect of reducing the association between calories and flavor, thus lowering your weight set-point and making you uninterested in eating lots of food.

I asked Seth once if he thought I’d lose weight if I were to try his diet in a passive way, drinking the sugar water at the recommended time but not actively trying to reduce my caloric intake. He said he supposed not, that the diet would make it easier to lose weight but I’d probably still have to consciously eat less.

I described Seth’s diet to one of my psychologist colleagues at Columbia and asked what he thought of it. My colleague said he thought it was ridiculous. And, as with the depression treatment, Seth never had an interest in running a controlled trial, even for the purpose of convincing the skeptics.

I had a conversation with Seth about this. He said he’d tried lots of diets and none had worked for him. I suggested that maybe he was just ready at last to eat less and lose weight, and he said he’d been ready for a while but this was the first diet that allowed him to eat less without difficulty. I suggested that maybe the theory underlying Seth’s diet was compelling enough to act as a sort of placebo, motivating him to follow the protocol. Seth responded that other people had tried his diet and lost weight with it. He also reminded me that it’s generally accepted that “diets don’t work” and that people who lose weight while dieting will usually gain it all back. He felt that his diet was different in that it didn’t tell you what foods to eat or how much; rather, it changed your set point so that you didn’t want to eat so much. I found Seth’s arguments persuasive. I didn’t feel that his diet had been proved effective, but I thought it might really work, I told people about it, and I was happy about its success. Unlike my Columbia colleague, I didn’t think the idea was ridiculous.

Media exposure and success

Seth’s breakout success happened gradually, starting with a 2005 article on self-experimentation in Behavioral and Brain Sciences, a journal that publishes long articles followed by short discussions from many experts. Some of his findings from the ten experiments discussed in the article:

Seeing faces in the morning on television decreased mood in the evening and improved mood the next day . . . Standing 8 hours per day reduced early awakening and made sleep more restorative . . . Drinking unflavored fructose water caused a large weight loss that has lasted more than 1 year . . .

As Seth described it, self-experimentation generates new hypotheses and is also an inexpensive way to test and modify them. The article does not seem to have had a huge effect within research psychology (Google Scholar gives it 93 cites), but two of its contributions—the idea of systematic self-experimentation and the weight-loss method—have spread throughout the popular culture in various ways. Seth’s work was featured in a series of increasingly prominent blogs, which led to a newspaper article by the authors of Freakonomics and ultimately a successful diet book (not enough to make Seth rich, I think, but he had simple tastes and no desire to be rich, as far as I know). Meanwhile, Seth started a blog of his own, which led to a message board for his diet that he told me had thousands of participants.

Seth achieved some measure of internet fame, with fans including Nassim Taleb, Steven Levitt, Dennis Prager, Tucker Max, Tyler Cowen, . . . and me! In retrospect, I don’t think having all this appreciation was good for him. On his blog and elsewhere Seth reported success with various self-experiments, the last of which was a claim of improved brain function after eating half a stick of butter a day. Even while maintaining interest in Seth’s ideas on mood and diet, I was entirely skeptical of his new claims, partly because of his increasing rate of claimed successes. It took Seth close to 10 years of sustained experimentation to fix his sleep problems, but in later years it seemed that all sorts of different things he tried were effective. His apparent success rate was implausibly high. What was going on? One problem is that sleep hours and weight can be measured fairly objectively, whereas if you measure brain function by giving yourself little quizzes, it doesn’t seem hard at all for a bit of unconscious bias to drive all your results. I also wonder if Seth’s blog audience was a problem: if you have people cheering on your every move, it can be that much easier to fool yourself.

Seth also started to go down some internet rabbit holes. On one hand, he was a left-wing Berkeley professor who supported universal health care, Amnesty International, and other liberal causes. On the other hand, his paleo-diet enthusiasm brought him close to various internet right-wingers, and he was into global warming denial and kinda sympathetic to Holocaust denial, not because he was a Nazi or anything but just because he had a distrust-of-authority thing going on. I guess that if he’d been an adult back in the 1950s and 1960s he would’ve been on the extreme left, but more recently it’s been the far right where the rebels are hanging out. Seth also had sympathy for some absolutely ridiculous and innumerate research on sex ratios, and he loved the since-discredited work of food behavior researcher Brian Wansink; see here and here. The point here is not that Seth believed things that turned out to be false—that happens to all of us—but rather that he had a soft spot for extreme claims that were wrapped in the language of science.

Back to Shangri-La

A few years ago, Seth passed away, and I didn’t think of him too often, but then a couple years ago my doctor told me that my cholesterol level was too high. He prescribed a pill, which I’m still taking every day, and he told me to switch to a mostly-plant diet and lose a bunch of weight.

My first thought was to try the Shangri-La diet: that cup of unflavored sugar water, at least an hour before or after any meal. Or maybe I did the spoonful of unflavored olive oil; I can’t remember which. Anyway, I tried it for a few days, also following the advice to eat less. And then after a few days, I thought: if the point is to eat less, why not just do that? So that’s what I did. No sugar water or olive oil needed.

What’s the point of this story? Not that losing the weight was easy for me. For a few years before that fateful conversation, my doctor had been bugging me to lose weight, and I’d vaguely wanted that to happen, but it hadn’t. What worked was me having this clear goal and motivation. And it’s not like I’m starving all the time. I’m fine; I just changed my eating patterns, and I take in a lot less energy every day.

But here’s a funny thing. Suppose I’d stuck with the sugar water and everything else had been the same. Then I’d have lost all this weight, exactly when I’d switched to the new diet. I’d be another enthusiastic Shangri-La believer, and I’d be telling you, truthfully, that only since switching to that diet had I been able to comfortably eat less. But I didn’t stick with Shangri-La and I lost the weight anyway, so I won’t make that attribution.

OK, so after that experience I had a lot less belief in Seth’s diet. The flip side of being convinced by his earlier self-experiment was becoming unconvinced after my own self-experiment.

And that’s where I stood until I saw this post at the blog Slime Mold Time Mold about informal experimentation:

For the potato diet, we started with case studies like Andrew Taylor and Penn Jillette; we recruited some friends to try nothing but potatoes for several days; and one of the SMTM authors tried the all-potato diet for a couple weeks.

For the potassium trial, two SMTM hive mind members tried the low-dose potassium protocol for a couple of weeks and lost weight without any negative side effects. Then we got a couple of friends to try it for just a couple of days to make sure that there weren’t any side effects for them either.

For the half-tato diet, we didn’t explicitly organize things this way, but we looked at three very similar case studies that, taken together, are essentially an N = 3 pilot of the half-tato diet protocol. No idea if the half-tato effect will generalize beyond Nicky Case and M, but the fact that it generalizes between them is pretty interesting. We also happened to know about a couple of other friends who had also tried versions of the half-tato diet with good results.

My point here is not to delve into the details of these new diets, but rather to point out that they are like the Shangri-La diet in being different from other diets, associated with some theory, evaluated through before-after studies on some people who wanted to lose weight, and apparently successful.

At this point, though, my conclusion is not that unflavored sugar water is effective in making it easy to lose weight, or that unflavored oil works, or that potatoes work, or that potassium works. Rather, the hypothesis that’s most plausible to me is that, if you’re at the right stage of motivation, anything can work.

Or, to put it another way, I now believe that the observed effect of the Shangri-La diet, the potato diet, etc., comes from a mixture of placebo and selection. The placebo is that just about any gimmick can help you lose weight, and keep the weight off, if it somehow motivates you to eat less. The selection is that, once you’re ready to try something like this diet, you might be ready to eat less.
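
To make this concrete, here is a minimal simulation sketch. It is my own toy example, with a made-up “motivation” variable and arbitrary numbers, not anything drawn from Seth’s or the Slime Mold Time Mold data: the diet has zero causal effect, only highly motivated people start it, and motivation alone drives weight loss, so a before-after comparison among the diet’s adopters shows a large apparent benefit.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Latent readiness to eat less; people start the gimmick diet only when it's high.
motivation = rng.normal(0, 1, n)
starts_diet = motivation > 1.0

# Weight change over the following months: driven by motivation plus noise,
# with zero causal contribution from the diet itself.
weight_change = -3.0 * motivation + rng.normal(0, 2, n)

print("mean weight change, diet starters: %5.1f lbs" % weight_change[starts_diet].mean())
print("mean weight change, everyone else: %5.1f lbs" % weight_change[~starts_diet].mean())

Under these assumptions the diet starters lose several pounds on average while everyone else stays roughly flat, even though turning the diet “on” or “off” in the simulation changes nothing. That is the selection part of the story; the placebo part would just add a further motivation boost from believing in the gimmick.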

But what about “diets don’t work”? I guess that diets don’t work for most people at most times. But the people trying these diets are not “most people at most times.” They’re people with a high motivation to eat less and lose weight.

I’m not saying I have an ironclad case here. I’m pretty much now in the position of my Columbia colleague who felt that there’s no good reason to believe that Seth’s diet is more effective than any other arbitrary series of rules that somewhere includes the suggestion to eat less. And, yes, I have the same impression of the potato diet and the other ideas mentioned above. It’s just funny that it took so long for me to reach this position.

Back to Seth

I wouldn’t say the internet killed Seth Roberts, but ultimately I don’t think it did him any favors to become an internet hero, in the same way that it’s not always good for an ungrounded person to become an academic hero, or an athletic hero, or a musical hero, or a literary hero, or a military hero, or any other kind of hero. The stuff that got you to heroism can be a great service to the world, but what comes next can be a challenge.

Seth ended up believing in his own hype. In this case, the hype was not that he was an amazing genius; rather, the hype was about his method, the idea that he had discovered modern self-experimentation (to the extent that this rediscovery can be attributed to anybody, it should be to Seth’s undergraduate adviser, Allen Neuringer, in this article from 1981). Maybe even without his internet fame Seth would’ve gone off the deep end and started to believe he was regularly making major discoveries; I don’t know.

From a scientific standpoint, Seth’s writings are an example of the principle that honesty and transparency are not enough. He clearly described what he did, but his experiments got to be so flawed as to be essentially useless.

After I posted my obituary of Seth (from which I took much of the beginning of this post), there were many moving tributes in the comments, and I concluded by writing, “It is good that he found an online community of people who valued him.” That’s how I felt at the time, but in retrospect, maybe not. If I could’ve done it all over again, I never would’ve promoted his diet, a promotion that led to all the rest.

I’d guess that the wide dissemination of Seth’s ideas was a net benefit to the world. Even if his diet idea is bogus, it seems to have made a difference to a lot of people. And even if the discoveries he reported from his self-experimentation (eating half a stick of butter a day improving brain function and all the rest) were nothing but artifacts of his hopeful measurement protocols, the idea of self-experimentation was empowering to people—and I’m assuming that even his true believers (other than himself) weren’t actually doing the butter thing.

Setting aside the effects on others, though, I don’t think that this online community was good for Seth in his own work or for his personal life. In some ways he was ahead of his time, as nowadays we’re hearing a lot about people getting sucked into cult-like vortexes of misinformation.

P.S. Lots of discussion in comments, including this from the Slime Mold Time Mold bloggers.