We Count! The 2020 Census

Like many readers of this blog, I’m a statistician who works with Census Bureau data to answer policy questions. So I’ve been following the controversy surrounding the added citizenship question.

Andy thought I should write an article for a wider audience, so I published a short piece in The Indypendent. But much more discussion could happen here:

  1. How are citizenship data used in Voting Rights Act cases? For redistricting cases, American Community Survey citizenship data are either imprecise (1 year only) or out-of-date (aggregated over time).
  2. Should we use administrative data? How? 2020 will be the first census to use administrative data for nonresponse follow-up.
  3. Should we adjust for undercount? How? In 1999, the Supreme Court ruled against using adjusted counts for apportionment. See some of Andy’s thoughts here.

Whole books have been written on census politics. Given the stakes, it deserves our attention.

Thoughts?

How to read (in quantitative social science). And by implication, how to write.

I happened to come across this classic from 2014. For convenience I’ll just repeat it all here:


It all started when I was reading Chris Blattman’s blog and noticed this:

One of the most provocative and interesting field experiments I [Blattman] have seen in this year:

Poor people often do not make investments, even when returns are high. One possible explanation is that they have low aspirations and form mental models of their future opportunities which ignore some options for investment.

This paper reports on a field experiment to test this hypothesis in rural Ethiopia. Individuals were randomly invited to watch documentaries about people from similar communities who had succeeded in agriculture or business, without help from government or NGOs. A placebo group watched an Ethiopian entertainment programme and a control group were simply surveyed.

. . . Six months after screening, aspirations had improved among treated individuals and did not change in the placebo or control groups. Treatment effects were larger for those with higher pre-treatment aspirations. We also find treatment effects on savings, use of credit, children’s school enrolment and spending on children’s schooling, suggesting that changes in aspirations can translate into changes in a range of forward-looking behaviours.

What was my reaction? When I saw Chris describe this as “provocative and interesting,” my first thought was—hey, this could be important! I have a lot of respect for Chris Blattman, both regarding his general judgment and his expertise more particularly in research on international development.

My immediate next reaction was a generalized skepticism, the sort of thing I feel when encountering any sort of claim in social science. I read the above paragraphs with a somewhat critical eye and noticed some issues: potential multiple comparisons (“forking paths”) and comparisons between significant and non-significant, also possible issues with “story time.” So now I wanted to see more.

Blattman’s post links to an article, “The Future in Mind: Aspirations and Forward-Looking Behaviour in Rural Ethiopia,” by Bernard Tanguy, Stefan Dercon, Kate Orkin, and Alemayehu Seyoum Taffesse. Here’s the final sentence of the abstract:

The result that a one-hour documentary shown six months earlier induces actual behavioural change suggests a challenging, promising avenue for further research and poverty-related interventions.

OK, maybe. But now I’m really getting skeptical. How much effect can we really expect to get from a one-hour movie? And now I’m looking more carefully at what Chris wrote: “provocative and interesting.” Hmmm . . . Chris doesn’t actually say he believes it!

Now it’s time to read the Tanguy et al. article. Unfortunately the link only gives the abstract, with no pointer to the actual paper that I can see. So I google the title, *The Future in Mind: Aspirations and Forward-Looking Behaviour in Rural Ethiopia*, and it works! The first link is this pdf; it’s a version of the paper from April 2014, but that should be good enough.

How to read a research paper

But now the real work begins. I go into the paper and look for their comparisons: treatment group minus control group, controlling for pre-treatment information. Where to look? I cruise over to the Results section, that would be section 4.1, “Empirical strategy: direct effects,” which begins, “We first examine direct effects on individuals from the experiment.” It looks like I’m interested in model (4.3), and it seems that the results appear in tables 6 through 12. And here’s the real punchline:

Overall, despite a relatively soft intervention – a one-hour documentary screening – we find clear evidence of behavioural changes six months after treatment. These results are also in line with our analysis of which components of the aspirations index are affected by treatment.

OK, so let’s take a look at tables 6-12. We’ll start with table 6:

[Screenshot of Table 6 from the paper]

I’ll focus on the third and sixth columns of numbers, as this is where they are controlling for pre-treatment predictors. And for now I’ll look separately at outcomes straight after screening and after six months. And it looks like I’m supposed to take the difference between treatment and placebo groups. But then there’s a problem: of the four results presented (aspirations and expectations, immediate and after 6 months), only one is statistically significant, and that only at p=.05. So now I’m wondering whassup.

Table 7 considers the participants’ assessment of the films. I don’t care so much about this but I’ll take a quick look:

[Screenshot of Table 7 from the paper]

Huh? Given the sizes of the standard errors, I don’t understand how these comparisons can be statistically significant. Maybe there was some transcription error? 0.201 should’ve been 0.0201, etc?
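The check I'm doing in my head here is just estimate divided by standard error. Here is a minimal sketch of it in Python, assuming a normal approximation. The estimate of 0.25 is made up for illustration and is not taken from the paper's tables; only the 0.201 versus 0.0201 comparison comes from the paragraph above.

```python
import math


def z_and_p(estimate, std_error):
    """Return the z-score and the two-sided normal p-value for an estimate."""
    z = estimate / std_error
    # Two-sided tail area under the standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical estimate (NOT from the paper) with the standard error as printed,
# and with the standard error as it would be if a digit had been dropped.
for se in (0.201, 0.0201):
    z, p = z_and_p(0.25, se)
    print(f"se={se}: z={z:.2f}, two-sided p={p:.3g}")
```

With the standard error as printed, an estimate of this size has a z-score of about 1.2, nowhere near the usual threshold; with the standard error ten times smaller, it is overwhelmingly significant. That is why a transcription error seems like a plausible guess.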

Tables 8 and 10, nothing’s statistically significant. This of course does not mean that nothing’s there, it just tells us that the noise is large compared to any signal. No surprise, perhaps, as there’s lots of variation in these survey responses.

Table 9, I’ll ignore, as it’s oriented 90 degrees off and it’s hard to read, also it’s a bunch of estimates of interactions. And given that I don’t really see much going on in the main effects, it’s hard for me to believe there will be much evidence for interactions.

Table 11 is also rotated 90 degrees, also it’s about a “hypothetical demand for credit.” Could be important but I’m not gonna knock myself out trying to read a bunch of tiny numbers (868.15, 1245.80, etc.). Quick scan: three comparisons, one is statistically significant.

And Table 12, nothing statistically significant here either.

At this point I’m desperate for a graph but there’s not much here to quench my thirst in that regard. Just a few cumulative distributions of some survey responses at baseline. Nothing wrong with that but it doesn’t really address the main questions.

So where are we? I just don’t see the evidence for the big claims, actually I don’t even see the evidence for the little claims in the paper. Again, I’m not saying the claims are wrong or even that they have not been demonstrated, I just couldn’t find the relevant information in a quick read.

How to write a research paper

Now let’s flip it around. Given my thought process as described above, how would you write an article so I could more directly get to the point?

You’d want to focus on the path leading from your data and assumptions to your key empirical claims. What would really help would be a graph (“Figure 1” of the paper, or possibly “Figure 2”) showing the data and the fitted model. Maybe it would be a scatterplot where each dot represents a person, with two different colors representing treated and control groups, plotting outcome vs. a pre-treatment summary, with fitted regression lines overlaid.
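To make the suggestion concrete, here is a rough sketch of that kind of graph in Python. Everything in it is an assumption for illustration: the data are simulated (I don't have the paper's data), the effect size and the function names `simulate`, `fit_line`, and `plot_groups` are mine, and the plotting step assumes matplotlib is installed.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def simulate(n=200, effect=0.3, seed=1):
    """Fake data: (pre-treatment score, outcome, treated?) for n people."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        pre = rng.gauss(0, 1)
        treated = rng.random() < 0.5
        post = 0.5 * pre + (effect if treated else 0.0) + rng.gauss(0, 1)
        rows.append((pre, post, treated))
    return rows


def plot_groups(rows, path="figure1.png"):
    """Scatterplot of outcome vs. pre-treatment score, one color per group,
    with a fitted regression line for each group."""
    import matplotlib
    matplotlib.use("Agg")  # write to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for flag, color, label in [(False, "tab:blue", "control"),
                               (True, "tab:red", "treated")]:
        x = [r[0] for r in rows if r[2] == flag]
        y = [r[1] for r in rows if r[2] == flag]
        a, b = fit_line(x, y)
        ax.scatter(x, y, s=10, color=color, alpha=0.5, label=label)
        ends = [min(x), max(x)]
        ax.plot(ends, [a + b * v for v in ends], color=color)
    ax.set_xlabel("pre-treatment summary")
    ax.set_ylabel("outcome")
    ax.legend()
    fig.savefig(path)
```

Calling `plot_groups(simulate())` writes the figure to `figure1.png`. The vertical gap between the two lines is the estimated treatment effect, and the cloud of dots shows at a glance how large that gap is relative to the person-to-person variation.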

It shouldn’t take forensics to find the basis for the article’s key claim. And the claims themselves should be presented crisply.

Consider two approaches to writing an article. Both are legitimate:
1. There is a single key finding, a headline result, with everything else being a modification or elaboration of it.
2. There are many little findings, we’re seeing a broad spectrum of results.

Either of these can work, indeed my collaborators and I have published papers of both types.

But I think it’s a good idea to make it clear, right away, where your paper is heading. If it’s the first sort of paper, please state clearly what is the key finding and what is the evidence for it. If it’s the second sort of paper, I’d suggest laying out all the results (positive and negative) in some sort of grid so they can all be visible at once. Otherwise, as a reader, I struggle through the exposition, trying to figure out which results are the most important and what to focus on.

That sort of organization can help the reader and is also relevant when considering questions of multiple comparisons.

Beyond this, it would be helpful to make it clear what you don’t yet know. Not just: The comparison is statistically significant in setting A but not in setting B (or “aspirations had improved among treated individuals and did not change in the placebo or control groups”), but a more direct statement about where the key remaining uncertainties are.

In using the Tanguy et al. paper as an opening to talk about how to read and write research articles, I’m not at all trying to say that it’s a particularly bad example; it’s just an example that was at hand. And, in any case, the authors’ primary goal is not to communicate to me. If their style satisfies their aim of communicating to economists and development specialists, that’s what’s most important. They, and other readers, will I hope take my advice here in more general terms, as serving the goals of statistical communication.

My role in all this

A couple months ago I got into a dispute with political scientist Larry Bartels, who expressed annoyance that I expressed skepticism about a claim he’d made (“Fleeting exposure to ‘irrelevant stimuli’ powerfully shapes our assessments of policy arguments”), without having fully read the research reports upon which his claim was based. In my response, I argued that it was fully appropriate for me to express skepticism based on partial information; or, to put it another way, that my skepticism based on partial information was as valid as his dramatic positive statements (“Here’s how a cartoon smiley face punched a big hole in democratic theory”) which themselves were only based on partial information.

That said, Bartels had a point, which is that a casual reader of a blog post might just take away the skepticism without the nuance. So let me repeat that I have not investigated this Tanguy et al. article in detail, indeed the comments above represent my entire experience of it.

To put it another way, the purpose of this post is not to present a careful investigation into claims about the effect of watching a movie about rural economic development; rather, this is all about the experience of reading a research article and, by implication, suggestions of how to write such an article to make it more accessible to critical readers.

In the meantime, if any reader wants to supply further information to clarify this particular example, feel free. If there’s something important that I’ve missed, I’d like to know; if anything, it would make my argument even stronger, by demonstrating the difficulties I’ve had in reading a research paper.

P.S. From a few years back, here’s some other advice on writing research articles.

Another U.S. government advisor from Columbia University!

Cool! We’ve had Alexander Hamilton, John Jay, Dwight Eisenhower, Richard Clarida, Jeff Sachs, those guys from the movie Inside Job, and now . . . Dr. Oz. Government service at its finest.

The pizzagate guy was from Cornell, though.

Zero-excluding priors are probably a bad idea for hierarchical variance parameters

(This is Dan, but in quick mode)

I was on the subway when I saw Andrew’s last post and it doesn’t strike me as a particularly great idea.

So let’s take a look at the suggestion for 8 schools using a centred parameterization.  This is not as comprehensive as doing a proper simulation study, but the results are sufficiently alarming to put the brakes on Andrew’s proposal. Also, I can do this quickly (with a little help from Jonah Gabry).

We know that this will lead to divergences under any priors that include a decent amount of mass at zero, so Andrew’s suggestion was to use a boundary avoiding prior.  In particular a Gamma(2,\beta) prior, which has mean 2\beta^{-1} and variance 2\beta^{-2}.  So let’s make a table!

Here I’ve run RStan under the default settings, so it’s possible that some of the divergences are false alarms; however, you can see that you need to pull the prior quite far to the right in order to reduce them to double digits. A better workflow would then look at pairs plots to see where the divergences are (see Betancourt’s fabulous workflow tutorial), but I’m lazy. I’ve also included a column for warning if there’s a chain that isn’t sampling the energy distribution appropriately.

| beta | Prior mean | 0.01 quantile | Divergences (/4000) | Low BFMI warning? | 2.5% for tau | 50% for tau | 97.5% for tau |
|---|---|---|---|---|---|---|---|
| 10.00 | 0.20 | 0.01 | 751 | FALSE | 0.04 | 0.19 | 0.57 |
| 5.00 | 0.40 | 0.03 | 342 | FALSE | 0.11 | 0.34 | 1.05 |
| 2.00 | 1.00 | 0.07 | 168 | FALSE | 0.30 | 0.99 | 2.81 |
| 1.00 | 2.00 | 0.15 | 195 | FALSE | 0.68 | 1.76 | 5.23 |
| 0.50 | 4.00 | 0.30 | 412 | FALSE | 0.64 | 2.78 | 8.60 |
| 0.20 | 10.00 | 0.74 | 34 | FALSE | 1.51 | 5.32 | 14.71 |
| 0.10 | 20.00 | 1.49 | 40 | FALSE | 1.74 | 6.92 | 18.71 |
| 0.05 | 40.00 | 2.97 | 17 | FALSE | 2.21 | 8.45 | 23.41 |
| 0.01 | 200.00 | 14.86 | 13 | FALSE | 2.88 | 9.37 | 27.86 |

You can see from this that you can’t reliably remove divergences and there is strong prior dependence based on how strongly the prior avoids the boundary.
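If you want to see where the "Prior mean" and "0.01 quantile" columns of the Gamma(2,\beta) table come from, here is a minimal Python sketch. It is not the code from the repo (that was run in R). It just uses the closed-form CDF of a Gamma distribution with shape 2 and rate \beta, F(x) = 1 - e^{-\beta x}(1 + \beta x), and inverts it by bisection. The function names are mine.

```python
import math


def gamma2_cdf(x, beta):
    """CDF of a Gamma distribution with shape 2 and rate beta."""
    return 1.0 - math.exp(-beta * x) * (1.0 + beta * x)


def gamma2_quantile(p, beta, tol=1e-10):
    """Invert gamma2_cdf by bisection (p strictly between 0 and 1)."""
    lo, hi = 0.0, 1.0 / beta
    # Expand the upper bracket until it covers the target probability.
    while gamma2_cdf(hi, beta) < p:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if gamma2_cdf(mid, beta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for beta in (10.0, 5.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.01):
    prior_mean = 2.0 / beta
    q01 = gamma2_quantile(0.01, beta)
    print(f"beta={beta:6.2f}  prior mean={prior_mean:7.2f}  0.01 quantile={q01:6.2f}")
```

The 0.01 quantile is a quick way of seeing how hard each prior pushes away from zero: only 1% of the prior mass sits below it.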

But it’s actually worse than that. These priors heavily bias the group standard deviation! To see this, compare to the inference from two of the priors we normally use (which need to be fitted using a non-centred parameterization). If the prior for \tau is a half-Cauchy with dispersion parameter 5, the (2.5%, 50%, 97.5%) posterior values for \tau are (0.12, 2.68, 11.90).  If instead it’s a half-normal with standard deviation 5, the right tail shrinks a little to (0.13, 2.69, 9.23). None of the boundary avoiding priors give estimates anything like this.

So please don’t use boundary avoiding priors on the hierarchical variance parameters. It probably doesn’t work.

Git repo for the code is here.

Edit: At Andrew’s request, a stronger boundary avoiding prior. This one is an inverse gamma, which has a very very light left tail (the density goes to zero like e^{-1/\tau} for small \tau) and a very heavy right tail (the density goes like \tau^{-3} for large \tau).  The table has the same columns as the one above. For want of a better option, I kept the shape parameter at 2, although I welcome other options.

| beta | Prior mean | 0.01 quantile | Divergences (/4000) | Low BFMI warning? | 2.5% for tau | 50% for tau | 97.5% for tau |
|---|---|---|---|---|---|---|---|
| 0.50 | 0.50 | 0.07 | 95 | TRUE | 0.09 | 0.32 | 1.91 |
| 1.00 | 1.00 | 0.15 | 913 | TRUE | 0.11 | 0.44 | 4.60 |
| 2.00 | 2.00 | 0.30 | 221 | FALSE | 0.29 | 1.10 | 5.17 |
| 5.00 | 5.00 | 0.74 | 64 | FALSE | 0.99 | 2.60 | 8.61 |
| 10.00 | 10.00 | 1.48 | 57 | FALSE | 1.41 | 4.17 | 12.38 |
| 20.00 | 20.00 | 2.96 | 4 | FALSE | 2.94 | 6.92 | 16.44 |
| 30.00 | 30.00 | 4.44 | 0 | FALSE | 4.08 | 8.86 | 19.99 |
| 40.00 | 40.00 | 5.92 | 0 | FALSE | 5.03 | 10.31 | 22.13 |
| 50.00 | 50.00 | 7.40 | 0 | FALSE | 5.88 | 11.56 | 24.98 |
| 100.00 | 100.00 | 14.79 | 0 | FALSE | 9.93 | 17.77 | 34.84 |

You can see that you need to use strong boundary avoidance to kill off the divergences, and the estimates for \tau still seem a bit off (if you compare the prior and posterior quantiles, you can see that the data really want \tau to be smaller).

How about zero-excluding priors for hierarchical variance parameters to improve computation for full Bayesian inference?

So. For a while now we’ve moved away from the uniform (or, worse, inverse-gamma!) prior distributions for hierarchical variance parameters. We’ve done half-Cauchy, folded t, and other options; now we’re favoring unit half-normal.

We also have boundary-avoiding priors for point estimates, so that in 8-schools-type problems, the posterior mode won’t be zero. Something like the gamma(2) family keeps the mode away from zero while allowing it to be arbitrarily close to zero, depending on the likelihood. See here for details; I love this stuff.

Until just now, I thought of these as two separate problems: you can use the zero-avoiding prior if your goal is point estimation, but if you want to use full Bayes there’s no need for zero-avoidance because you’re averaging over the posterior distribution anyway.

But then we were talking about the well-known problem of parameterization in hierarchical models (see for example section 5.2 of this paper, or, for a more modern, Stan-centered take, this discussion from Mike Betancourt): simulation can go slow if you use the standard (“centered”) parameterization and there’s a big posterior mass near zero, and you’ll want to switch to the noncentered parameterization (as here, or you can just check the Stan manual).

The difficulty arises in that, depending on the data, sometimes the centered parameterization can be slow, other times the noncentered parameterization can be slow.
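For anyone who hasn't seen the two parameterizations side by side, here is a minimal sketch in plain Python (not Stan). The function name `theta_from_eta` and the numbers are mine, for illustration only. In the centered version the sampler works directly with theta ~ normal(mu, tau); in the noncentered version it works with eta ~ normal(0, 1) and theta is computed as mu + tau * eta.

```python
import random


def theta_from_eta(mu, tau, eta):
    """Noncentered parameterization: map unit-scale eta to theta = mu + tau * eta."""
    return [mu + tau * e for e in eta]


rng = random.Random(0)
eta = [rng.gauss(0.0, 1.0) for _ in range(8)]  # one eta per group; always unit scale

for tau in (5.0, 0.5, 0.0):
    theta = theta_from_eta(mu=1.0, tau=tau, eta=eta)
    spread = max(theta) - min(theta)
    print(f"tau={tau}: spread of theta = {spread:.3f}")
```

As tau shrinks toward zero, the thetas collapse onto mu while the etas keep the same unit scale. That is the reason the noncentered form tends to behave better when there is a lot of posterior mass near tau = 0, and it is also the region a zero-excluding prior would wall off.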

So, what about using a zero-excluding prior, even for full Bayes, just to get rid of the need for the non-centered parameterization? You can only really get this to work if you exclude zero for real (a weak gamma(2) prior won’t be enough). But in lots of problems, I’d guess most problems, we’d know enough about scale to do this. We’re in folk theorem territory.

It’s almost like we’re moving toward some discreteness in our modeling and computing, where each variance parameter is either being zeroed out or is being modeled as distinct from zero. So, instead of trying to get Stan (or whatever you’re using) to cover the entire space, you split up the problem externally into “variance = 0” and “variance excluded from 0” components.

Gotta think more about this one. Interesting how this one example keeps revealing further insights. God is in every leaf of every tree.

P.S. Dan Simpson responds.

Doomsday! Problems with interpreting a confidence interval when there is no evidence for the assumed sampling model

Mark Brown pointed me to a credulous news article in the Washington Post, “We have a pretty good idea of when humans will go extinct,” which goes:

A Princeton University astrophysicist named J. Richard Gott has a surprisingly precise answer to that question . . . to understand how he arrived at it and what it means for our survival, we first need to take a brief but fascinating detour through the science of probability and astronomy, one that begins 500 years ago with the Polish mathematician Nicholas Copernicus. . . .

Assuming that you and I are not so special as to be born at either the dawn of a very long-lasting human civilization or the twilight years of a short-lived one, we can apply Gott’s 95 percent confidence formula to arrive at an estimate of when the human race will go extinct: between 5,100 and 7.8 million years from now. . . . But for either of those scenarios to be true we must be observing humanity’s existence from a highly privileged point in time: either at the dawn of a technologically advanced, galaxy-hopping supercivilization, or at the end of days for an Earthbound civilization on the brink of extinguishing itself. According to the Copernican Principle, neither one of those scenarios is likely.

That’s old-school, uncritical, gee-whiz science writing for ya. The particular claim is not new; indeed we discussed it right here 13 years ago (see also here and here). I guess we can use the same reasoning to suggest that we’ll be busy debunking this story for many decades to come!

I replied to Mark by sending him a link to my paper with Christian Robert along with this quote:

Mark replied:

The method has a built in bias for residual age to be stochastically increasing in current age. Apply the method to a 90 year old man and the 50% prediction confidence interval for his residual life will be (30,270). Apply it to his 9 year old great grandson and you get (3,27). They would argue that we know something about human lifetimes, and this method is to be used when you essentially know nothing.

As I read it there is a random lifetime of interest, L. We observe A = LU, where U is independent of L and is uniform (0,1). The residual life is R = L(1-U) = A(1-U)/U. From the percentiles of U/(1-U) we get a 100(1-2b) percent prediction confidence interval for R: (A b/(1-b), A (1-b)/b). If the assumption A = LU is correct then the coverage probability is correct without assumptions on L. That being said if the method is used on a large sample of individuals as in the example above ((A/3, 3A) being a 50% prediction interval), its error rate will probably be a lot larger than 50%.

The method is hard to update. In the Berlin Wall example if you return 5 years later and the wall is still standing what’s the prediction interval for residual life, given this information? The prediction intervals are log symmetric (the prediction interval for log R is symmetric about log(A)). That doesn’t seem reasonable to me.

Hey, living to the age of 270—that’d be cool! I just have to make it to 90, then I’m most of the way there.

In all seriousness, this is a nonparametric frequentist statistical procedure, which is fine—but then this puts a big burden on the sampling model. The key quote from the above news article is, “Assuming that you and I are not so special as to be born at either the dawn of a very long-lasting human civilization or the twilight years of a short-lived one.” That’s an assumption that pretty much assumes the answer already.
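Mark's formula is easy to check numerically. Here is a minimal Python sketch: `gott_interval` implements the interval (A b/(1-b), A (1-b)/b), and `coverage` simulates the sampling model A = LU. The exponential distribution for L is an arbitrary choice of mine (the whole point is that coverage does not depend on it, as long as U really is uniform and independent of L).

```python
import random


def gott_interval(age, b):
    """100(1-2b)% prediction interval for residual life, given current age A."""
    return age * b / (1 - b), age * (1 - b) / b


def coverage(b, n_sims=20000, seed=0):
    """Fraction of simulated cases whose residual life falls in the interval,
    under the assumed model A = L * U with U uniform(0,1) independent of L."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        lifetime = rng.expovariate(1.0)  # arbitrary lifetime distribution
        u = rng.random()
        age = lifetime * u
        residual = lifetime - age
        lo, hi = gott_interval(age, b)
        hits += lo <= residual <= hi
    return hits / n_sims


print(gott_interval(90, 0.25))  # the 90 year old man
print(gott_interval(9, 0.25))   # his 9 year old great grandson
print(coverage(0.25))           # close to 0.5 when A = LU actually holds
```

So the procedure does what it says on the tin, but only under the sampling model. Swap in a population where ages are not a uniform fraction of lifetimes, such as a room full of 90 year olds, and the nominal 50% has no reason to hold.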

Walter Benjamin on storytelling

After we discussed my paper with Thomas Basbøll, “When do stories work? Evidence and illustration in the social sciences,” Jager Hartman wrote to me:

Here is a link to the work by Walter Benjamin I think of when I think of storytelling. He uses storytelling throughout his works and critiques done on his works are interesting with regards to story telling to convey a message. However, I find this work really highlights differences between information, storytelling, and messages to be conveyed.

Benjamin’s article is called “The Storyteller: Reflections on the Works of Nikolai Leskov,” and it begins:

Familiar though his name may be to us, the storyteller in his living immediacy is by no means a present force. He has already become something remote from us and something that is getting even more distant. To present someone like Leskov as a storyteller does not mean bringing him closer to us but, rather, increasing our distance from him. Viewed from a certain distance, the great, simple outlines which define the storyteller stand out in him, or rather, they become visible in him, just as in a rock a human head or an animal’s body may appear to an observer at the proper distance and angle of vision. This distance and this angle of vision are prescribed for us by an experience which we may have almost every day. It teaches us that the art of storytelling is coming to an end. Less and less frequently do we encounter people with the ability to tell a tale properly. More and more often there is embarrassment all around when the wish to hear a story is expressed. It is as if something that seemed inalienable to us, the securest among our possessions, were taken from us: the ability to exchange experiences.

I’d heard the name Walter Benjamin but had never read anything by him, so I ran this by Basbøll who replied:

My own view is that stories can be usefully compared to models, i.e. we can think of storytelling as analogous to statistical modeling.

The storyteller is a point of agency, not in the story itself (the storyteller need not be a character in the story) but in the communication of the story, the telling of it. The storyteller has authority to decide “what happened” roughly as the modeler has the authority to decide what comparison to run on the data.

We can think of the narrator’s “poetic license” here like the statistician’s “degrees of freedom”. While we allow the narrator to “construct” the narrative, a story is not compelling if you “catch” the storyteller just making things up, without any consideration for how this affects the overall plausibility of the story. Do note that this happens even in fiction. It’s not really about true and false, but about a good or bad story. If it’s just one thing happening after another without rhyme or reason we lose interest.

Likewise, the statistician can’t just run any number of comparisons on the data to find something “significant”. Here, again, it’s not that the model has to be “true”; but it must be good in the sense of providing a useful representation of the probability space. Perhaps in storytelling we could talk of a “plausibility space”—which is actually more usefully thought of as a time dimension. (Anything is possible—but not in any order!) Perhaps that’s why Bakhtin coined the word “chronotope”, a time-space.

Like models, stories can be subjected to criticism. That is, we can question the decisions that were made by the modeler or teller. Often, a story can be entirely misleading even though it recounts only things that actually happened. The deception lies in what is left out.

A story can also be inadequately contextualized, which leads us to make unwarranted moral judgments about the people involved. Sometimes merely adding context, about what came either before or after the main events in the account, completely inverts the distribution of heroes and villains in the narrative. I think the corresponding error of judgment can be found in the way models sometimes lead us to make judgments about causality. A story often assigns praise and blame. A model usually suggests cause and effect.

I wonder: what corresponds to “replication” in storytelling? Model studies can be replicated by gathering fresh data and seeing if it holds on them too. Often the “effect” disappears. Perhaps in storytelling there is a similar quality to be found in retelling it to a new audience. Not contextualization in the sense I just meant, but re-contextualizing the story against a completely different set of background experiences.

This is something Erving Goffman pointed out in his preface to Asylums. As we read his description of life in a closed psychiatric ward, he reminds us that he is seeing things from a middle-class, male perspective. “Perhaps I suffered vicariously,” he says, “about conditions that lower-class patients handled with little pain.” A story makes sense or nonsense (some stories are supposed to shock us with the senselessness of the events; that is their meaning) relative to a particular set of life experiences.

Models, too, derive their meaning from the background experiences of those who apply them to understand what is going on. Kenneth Burke called literature “equipment for living”. We use stories in our lives all the time, understanding our experiences by “fitting” our stories to them. Models too are part of our equipment for getting around. After all, one of the most familiar models is a map. Another is our sense of the changing seasons.

I replied that I want to write (that is, think systematically about) all this sometime. Right now (Oct 2017) I feel too busy to focus on this so I put this post at the end of the queue so as to be reminded next year (that is, now) to think again about statistical modeling, scientific learning, and stories.

“We continuously increased the number of animals until statistical significance was reached to support our conclusions” . . . I think this is not so bad, actually!

Jordan Anaya pointed me to this post, in which Casper Albers shared this snippet from a recently published article in Nature Communications:

The subsequent twitter discussion is all about “false discovery rate” and statistical significance, which I think completely misses the point.

The problems

Before I get to why I think the quoted statement is not so bad, let me review various things that these researchers seem to be doing wrong:

1. “Until statistical significance was reached”: This is a mistake. Statistical significance does not make sense as an inferential or decision rule.

2. “To support our conclusions”: This is a mistake. The point of an experiment should be to learn, not to support a conclusion. Or, to put it another way, if they want support for their conclusion, that’s fine, but that has nothing to do with statistical significance.

3. “Based on [a preliminary data set] we predicted that about 20 unites are sufficient to statistically support our conclusions”: This is a mistake. The purpose of a pilot study is to demonstrate the feasibility of an experiment, not to estimate the treatment effect.

OK, so, yes, based on the evidence of the above snippet, I think this paper has serious problems.

Sequential data collection is ok

That all said, I don’t have a problem, in principle, with the general strategy of continuing data collection until the data look good.

I’ve thought a lot about this one. Let me try to explain here.

First, the Bayesian argument, discussed for example in chapter 8 of BDA3 (chapter 7 in earlier editions). As long as the factors that predict data inclusion are also included in your model, you should be ok. In this case, the relevant variable is time: If there’s any possibility of time trends in your underlying process, you want to allow for that in your model. A sequential design can yield a dataset that is less robust to model assumptions, and a sequential design changes how you’ll do model checking (see chapter 6 of BDA), but from a Bayesian standpoint, you can handle these issues. Gathering data until they look good is not, from a Bayesian perspective, a “questionable research practice.”

Next, the frequentist argument, which can be summarized as, “What sorts of things might happen (more formally, what is the probability distribution of your results) if you as a researcher follow a sequential data collection rule?”

Here’s what will happen. If you collect data until you attain statistical significance, then you will attain statistical significance, unless you have to give up first because you run out of time or resources. But . . . so what? Statistical significance by itself doesn’t tell you anything at all. For one thing, your result might be statistically significant in the unexpected direction, so it won’t actually confirm your scientific hypothesis. For another thing, we already know the null hypothesis of zero effect and zero systematic error is false, so we know that with enough data you’ll find significance.
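You can watch this happen in a simulation. The following is a minimal Python sketch under assumptions I've made up for illustration: independent normal measurements with a small nonzero true effect, a z-test checked after every new observation, and a cap on the sample size standing in for running out of time or resources. The function name and defaults are mine.

```python
import math
import random


def collect_until_significant(true_effect, sd=1.0, min_n=10, max_n=10000,
                              z_crit=1.96, seed=0):
    """Add one observation at a time; stop as soon as |mean| > z_crit * se,
    or give up at max_n. Returns (n, estimate, std_error, significant)."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    n = 0
    mean = 0.0
    se = float("inf")
    while n < max_n:
        y = rng.gauss(true_effect, sd)
        n += 1
        total += y
        total_sq += y * y
        if n < min_n:
            continue  # don't test until there are a few observations
        mean = total / n
        var = max((total_sq - n * mean * mean) / (n - 1), 0.0)
        se = math.sqrt(var / n)
        if abs(mean) > z_crit * se:
            return n, mean, se, True
    return n, mean, se, False


for seed in range(5):
    n, est, se, sig = collect_until_significant(true_effect=0.05, seed=seed)
    print(f"seed={seed}: stopped at n={n}, estimate={est:.3f}, se={se:.3f}, significant={sig}")
```

The thing to look at is the estimate at the moment of stopping, not the fact that the rule stopped. The stopping rule guarantees a "significant" result sooner or later, so on its own that result carries essentially no information.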

Now, suppose you run your experiment a really long time and you end up with an estimated effect size of 0.002 with a standard error of 0.001 (on some scale in which an effect of 0.1 is reasonably large). Then (a) you’d have to say whatever you’ve discovered is trivial, (b) it could easily be explained by some sort of measurement bias that’s crept into the experiment, and (c) in any case, if it’s 0.002 on this group of people, it could well be -0.001 or -0.003 on another group. So in that case you’ve learned nothing useful, except that the effect almost certainly isn’t large—and that thing you’ve learned has nothing to do with the statistical significance you’ve obtained.

Or, suppose you run an experiment a short time (which seems to be what happened here) and get an estimate of 0.4 with a standard error of 0.2. Big news, right! No. Enter the statistical significance filter and type M errors (see for example section 2.1 here). That’s a concern. But, again, it has nothing to do with sequential data collection. The problem would still be there with a fixed sample size (as we’ve seen in zillions of published papers).
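Here is a minimal Python sketch of the significance filter with a fixed sample size, so no sequential design is involved. The standard error of 0.2 is from the paragraph above; the true effect of 0.1 is my assumption for illustration, borrowing the "reasonably large" scale mentioned earlier. The function name is mine.

```python
import random


def significance_filter(true_effect, se, n_sims=20000, z_crit=1.96, seed=0):
    """Simulate replicated estimates ~ normal(true_effect, se) and keep only
    the statistically significant ones. Returns (power, exaggeration ratio),
    where the exaggeration ratio (type M error) is the average absolute
    significant estimate divided by the absolute true effect."""
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [est for est in estimates if abs(est) > z_crit * se]
    power = len(significant) / n_sims
    mean_abs = sum(abs(est) for est in significant) / len(significant)
    return power, mean_abs / abs(true_effect)


power, type_m = significance_filter(true_effect=0.1, se=0.2)
print(f"power = {power:.3f}, exaggeration ratio = {type_m:.1f}")
```

With these numbers, any estimate that clears the significance bar has to exceed 1.96 × 0.2 ≈ 0.39 in absolute value, so every published "finding" overstates a true effect of 0.1 by a factor of nearly 4 or more, by construction.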


Based on the snippet we’ve seen, there are lots of reasons to be skeptical of the paper under discussion. But I think the criticism based on sequential data collection misses the point. Yes, sequential data collection gives the researchers one more forking path. But I think the proposal to correct for this with some sort of type 1 or false discovery adjustment rule is essentially impossible and would be pointless even if it could be done, as such corrections are all about the uninteresting null hypothesis of zero effect and zero systematic error. Better to just report and analyze the data and go from there—and recognize that, in a world of noise, you need some combination of good theory and good measurement. Statistical significance isn’t gonna save your ass, no matter how it’s computed.

P.S. Clicking through, I found this amusing article by Casper Albers, “Valid Reasons not to participate in open science practices.” As they say on the internet: Read the whole thing.

P.P.S. Next open slot is 6 Nov but I thought I’d post this right away since the discussion is happening online right now.

Anthony West’s literary essays

A while ago I picked up a collection of essays by Anthony West, a book called Principles and Persuasions that came out in 1957, was briefly reprinted in 1970, and I expect has been out of print ever since. It’s a wonderful book, one of my favorite collections of literary essays, period. West was a book reviewer for the New Yorker for a long time so there must’ve been material for many more volumes but given the unenthusiastic response to this one collection, I guess it makes sense that no others were printed.

West is thoughtful and reasonable and a fluid writer, with lots of insights. The book includes interesting and original takes on well-trodden authors such as George Orwell, Charles Dickens, T. E. Lawrence, and Graham Greene, along with demolitions of Edwin O’Connor (author of The Last Hurrah) and the now-forgotten Reinhold Niebuhr, and lots more. West employs historical exposition, wit, and political passion where appropriate. I really enjoyed this book and am sad that there’s no more of this stuff by West that’s easily accessible. Reading it also gave me nostalgia for an era in which writers took their time to craft beautiful book reviews—not like now, here I am writing 400 posts per year along with articles, books, teaching, fundraising, etc., we’re just so busy and there’s this sense that few people will read anything we write from beginning to end again, so why bother? Here I am typing this on the computer but for the purpose of literature I wish we could blow up all the computers and return to a time when we had more free hours to read. There’s something particularly appealing about West’s book in that he’s not a famous author or even a famous critic; he’s completely forgotten and I guess wasn’t considered so important even back then.

And, yes, I know this post would be more meaningful if I could pull out some quotes to show you what West had to say. But when I was reading it I didn’t happen to have any sticky notes and it’s hard to flip through and find striking bits. And, don’t get me wrong, West was great but there were some things he couldn’t do. For example I doubt he ever wrote anything comparable to those unforgettable last three paragraphs of Homage to Catalonia. But that’s fine, not everyone can do that. I loved West’s book and it made me want to live in 1957.

P.S. Anthony West is the son of H. G. Wells and Rebecca West. Those two famous parents were never married to each other so that explains why Anthony’s last name isn’t Wells, but it seems odd that he didn’t just go with Fairfield. I guess I’ll have to read Anthony’s autobiographical novel to get more insight into the question of his name.

A model for scientific research programmes that include both “exploratory phenomenon-driven research” and “theory-testing science”

John Christie points us to an article by Klaus Fiedler, What Constitutes Strong Psychological Science? The (Neglected) Role of Diagnosticity and A Priori Theorizing, which begins:

A Bayesian perspective on Ioannidis’s (2005) memorable statement that “Most Published Research Findings Are False” suggests a seemingly inescapable trade-off: It appears as if research hypotheses are based either on safe ground (high prior odds), yielding valid but unsurprising results, or on unexpected and novel ideas (low prior odds), inspiring risky and surprising findings that are inevitably often wrong. Indeed, research of two prominent types, sexy hypothesis testing and model testing, is often characterized by low priors (due to astounding hypotheses and conjunctive models) as well as low-likelihood ratios (due to nondiagnostic predictions of the yin-or-yang type). However, the trade-off is not inescapable: An alternative research approach, theory-driven cumulative science, aims at maximizing both prior odds and diagnostic hypothesis testing. The final discussion emphasizes the value of pluralistic science, within which exploratory phenomenon-driven research can play a similarly strong part as strict theory-testing science.

I like a lot of this paper. I think Fiedler’s making a mistake working in the false positive, false negative framework—I know that’s how lots of people have been trained to think about science, but I think it’s an awkward framework that can lead to serious mistakes. That said, I like what Fiedler’s saying. I think it would be a great idea for someone to translate it into my language, in which effects are nonzero and variable.

And the ideas apply far beyond psychology, I think to social and biological sciences more generally.

A coding problem in the classic study, Nature and Origins of Mass Opinion

Gaurav Sood writes:

In your 2015 piece, you mention: “In my research I’ve been strongly committed, in many different ways, to the model in which voter preferences and attitudes should be taken seriously.”

One of the reasons people in political science think voters are confused is because of data presented in a book by Zaller—Nature and Origins of Mass Opinion.

Recently Paul Sniderman re-analyzed the data, taking issue with how “conflicts” are coded. The point is narrow but vital. To make it easy, I have taken screenshots of the two relevant pages and included their links here and here [I’ve updated the links — ed.].

The chapter also touches upon another point that is in your wheelhouse—how key claims go unscrutinized. When writing a super brief review of the book, here are a few lines I came up with on that point: “What is more startling and sobering is that something regularly taught in graduate courses and so well cited is so under-scrutinized and so underthought. The citation/scrutiny ratio is pretty high. And tells a bunch about biases of academics and chances of scientific progress. It is a strange fate to be cited but not be scrutinized.”

I’ll have to take a look at Sniderman’s book now and then talk these ideas over with my colleagues in the political science department. I’m writing this post in early Oct and it’s scheduled for the end of Apr so this should allow enough time for me to get some sense of what’s going on.

In any case, Sood’s remark about “the citation/scrutiny ratio” is interesting in its own right. It often seems that people love to be cited but hate to be scrutinized, most famously when researchers in psychology have complained about “bullying” when outsiders do close readings of their articles and point out things that don’t make sense.

On the other hand, some people love scrutiny: they feel their work is strong and they welcome when outsiders make criticisms and reveal flaws. That’s how I feel: citation and scrutiny should go together.

I don’t really know Zaller so I can’t say how he’ll react to Sniderman’s comments. A quick web search led to this article by Larry Bartels who writes that an “apparent evolution of Zaller’s views is a testament to his open-mindedness and intellectual seriousness.” So that’s encouraging. I also came across an article by Sniderman and John Bullock, “A Consistency Theory of Public Opinion and Political Choice: The Hypothesis of Menu Dependence,” that seems relevant to this discussion.

Early p-hacking investments substantially boost adult publication record

In a post with the title “Overstated findings, published in Science, on long-term health effects of a well-known early childhood program,” Perry Wilson writes:

In this paper [“Early Childhood Investments Substantially Boost Adult Health,” by Frances Campbell, Gabriella Conti, James Heckman, Seong Hyeok Moon, Rodrigo Pinto, Elizabeth Pungello, and Yi Pan], published in Science in 2014, researchers had a great question: Would an intensive, early-childhood intervention focusing on providing education, medical care, and nutrition lead to better health outcomes later in life?

The data they used to answer this question might appear promising at first, but looking under the surface, one can see that the dataset can’t handle what is being asked of it. This is not a recipe for a successful study, and the researchers’ best course of action might have been to move on to a new dataset or a new question.

Yup, that happens. What, according to Wilson, happened in this case?

What the authors of this Science paper did instead was to torture the poor data until it gave them an answer.

Damn. Wilson continues with a detailed evisceration. You can read the whole thing; here I’ll just excerpt some juicy bits:

Red Flag 1: The study does not report the sample size.

I couldn’t believe this when I read the paper the first time. In the introduction, I read that 57 children were assigned to the intervention and 54 to control. But then I read that there was substantial attrition between enrollment and age 35 (as you might expect). But all the statistical tests were done at age 35. I had to go deep into the supplemental files to find out that, for example, they had lab data on 12 of the 23 males in the control group and 20 of the 29 males in the treatment group. That’s a very large loss-to-follow-up. It’s also a differential loss-to-follow-up, meaning more people were lost in one group (the controls in this case) than in the other (treatment). If this loss is due to different reasons in the two groups (it likely is), you lose the benefit of randomizing in the first place.

The authors state that they accounted for this using inverse probability weighting. . . . This might sound good in theory, but it is entirely dependent on how good your model predicting who will follow-up is. And, as you might expect, predicting who will show up for a visit 30 years after the fact is a tall order. . . . In the end, the people who showed up to this visit self-selected. The results may have been entirely different if the 40 percent or so of individuals who were lost to follow-up had been included.

Red Flag 2: Multiple comparisons accounted for! (Not Really)

Referring to challenges with this type of analysis, the authors write in their introduction:

“Numerous treatment effects are analyzed. This creates an opportunity for ‘cherry picking’—finding spurious treatment effects merely by chance if conventional one-hypothesis-at-a-time approaches to testing are used. We account for the multiplicity of the hypotheses being tested using recently developed stepdown procedures.”

. . . The stepdown procedure they refer to does indeed account for multiple comparisons. But only if you use it on, well, all of your comparisons. The authors did not do this . . .

One problem here is that, as the economists like to say, incentives matter. Campbell et al. put some work into this study, and it was only going to get published in a positive form if they found statistically significant results. So they found statistically significant results.

Two of the authors of the paper (Heckman and Pinto) replied:

Dr. Perry Wilson’s “Straight Talk” dismisses our study—the first to study the benefits of an early childhood program on adult health—as a statistical artifact, where we “torture the poor data” to get findings we liked. His accusation that we tortured data is false. Our paper, especially our detailed 100-page appendix, documents our extensive sensitivity and robustness analyses and contradicts his claims.

I’ve done robustness studies too, I admit, and one problem is that these are investigations designed not to find anything surprising. A typical robustness study is like a police investigation where the cops think they already know who did it, so they look in a careful way so as not to uncover any inconvenient evidence. I’m not saying that robustness studies are necessarily useless, just that the incentives there are pretty clear, and the actual details of such studies (what analyses you decide to do, and how you report them) are super-flexible, even more so than original studies which have forking path issues of their own.

Heckman and Pinto continue with some details, to which Wilson responds. I have not read the original paper in detail, and I’ll just conclude with my general statement that uncorrected multiple comparisons are the norm in this sort of study which involves multiple outcomes, multiple predictors, and many different ways of adjusting for missing data. Everybody was doing it back in 2014 when that paper was published, and in particular I’ve seen similar issues in other papers on early childhood intervention by some of the same authors. So, sure, of course there are uncorrected multiple comparisons issues.

I better unpack this one a bit. If “everybody was doing it back in 2014,” then I was doing it back in 2014 too. And I was! Does that mean I think that all the messy, non-preregistered studies of the past are to be discounted? No, I don’t. After all, I’m still analyzing non-probability samples—it’s called “polling,” or “doing surveys,” despite what Team Buggy-Whip might happen to be claiming in whatever evidence-less press release they happen to be spewing out this month—and I think we can learn from surveys. I do think, though, that you have to be really careful when trying to interpret p-values and estimates in the presence of uncontrolled forking paths.

For example, check out the type M errors and selection bias here, from the Campbell et al. paper:

The evidence is especially strong for males. The mean systolic blood pressure among the control males is 143 millimeters of mercury (mm Hg), whereas it is only 126 mm Hg among the treated. One in four males in the control group is affected by metabolic syndrome, whereas none in the treatment group are affected.

Winner’s curse, anyone?

The right thing to do, I think, is not to pick a single comparison and use it to get a p-value for the publication and an estimate for the headlines. Rather, our recommendation is to look at, and report, and graph, all relevant comparisons, and form estimates using hierarchical modeling.
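
As a toy version of what that can look like, here is a crude empirical-Bayes stand-in for a normal hierarchical model. It is a sketch only: a real analysis would fit the full model (in Stan, say) and carry the uncertainty in the group-level parameters. The six estimates and standard errors below are hypothetical numbers, not values from Campbell et al.

```python
from statistics import mean, variance

def partial_pool(estimates, std_errors):
    """Shrink each comparison toward the overall mean.

    Model sketch: theta_j ~ Normal(mu, tau), estimate_j ~ Normal(theta_j, se_j),
    with mu and tau^2 replaced by simple method-of-moments guesses.
    """
    mu = mean(estimates)
    # Between-comparison variance: observed spread minus sampling noise, floored at 0.
    tau2 = max(0.0, variance(estimates) - mean(se ** 2 for se in std_errors))
    pooled = []
    for y, se in zip(estimates, std_errors):
        weight = tau2 / (tau2 + se ** 2)  # 0 = complete pooling, 1 = no pooling
        pooled.append(mu + weight * (y - mu))
    return mu, tau2, pooled

if __name__ == "__main__":
    # Hypothetical estimates for six outcomes, each with standard error 0.2.
    raw = [0.40, 0.05, -0.10, 0.15, 0.02, -0.05]
    ses = [0.20] * 6
    mu, tau2, pooled = partial_pool(raw, ses)
    print("overall mean:", mu, "tau^2:", tau2)
    for y, p in zip(raw, pooled):
        print(y, "->", p)
```

The point of the sketch is the direction of the adjustment: the comparison that looked most impressive on its own gets pulled hardest toward the rest, which is the opposite of picking it out for the headline.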

Reanalyzing data can be hard, and I suspect that Wilson’s right that the data at hand are too noisy and messy to shed much light on the researchers’ questions about long-term effects of early-childhood intervention.

And, just to be clear: if the data are weak, you can’t necessarily do much. It’s not like, if Campbell et al. had done a better analysis, then they’d have this great story. Rather, if they’d done a better analysis, it’s likely they would’ve had uncertain conclusions: they’d just have to report that they can’t really say much about the causal effect here. And, unfortunately, it would’ve been a lot harder to get that published in the tabloids.

On to policy

Early childhood intervention sounds like a great idea. Maybe we should do it. That’s fine with me. There can be lots of reasons to fund early childhood intervention. Just don’t claim the data say more than they really do.

The syllogism that ate social science

I’ve been thinking about this one for a while and expressed it most recently in this blog comment:

There’s the following reasoning which I’ve not seen explicitly stated but is I think how many people think. It goes like this:
– Researcher does a study which he or she thinks is well designed.
– Researcher obtains statistical significance. (Forking paths are involved, but the researcher is not aware of this.)
– Therefore, the researcher thinks that the sample size and measurement quality were sufficient. After all, the purpose of a high sample size and good measurements is to get your standard error down. If you achieved statistical significance, the standard error was by definition low enough. Thus in retrospect the study was just fine.

So part of this is self-interest: It takes less work to do a sloppy study and it can still get published. But part of it is, I think, genuine misunderstanding, an attitude that statistical significance retroactively solves all potential problems of design and data collection.

Type M and S errors are a way of getting at this, the idea that just cos an estimate is statistically significant, it doesn’t mean it’s any good. But I think we need to somehow address the above flawed reasoning head-on.

Economic growth -> healthy kids?

Joe Cummins writes:

Anaka Aiyar and I have a new working paper on economic growth and child health. Any comments from you or your readers would be much appreciated.

In terms of subject matter, it fits in pretty nicely with the Demography discussions on the blog (Deaton/Case, age adjustment, interpreting population level changes in meaningful ways). And methodologically we were concerned about a lot of the problems that have been discussed on the blog: the abuse of p-values; trying to take measurement seriously; the value of replication and reanalysis of various forms; and attempting to visually display complex data in useful ways. There is even a bit of the Secret Weapon in Figure 2. In general, we hope that we built a convincing statistical argument that our estimates are more informative, interpretable and useful than previous estimates.

Would love to hear what your readers do and don’t find interesting or useful (and of course, if we messed something up, we want to know that too!).

Replication files are here.

Here’s their abstract:

For the last several years, there has been a debate in the academic literature regarding the association between economic growth and child health in under-developed countries, with many arguing the association is strong and robust and several new papers arguing the association is weak or nonexistent. Focusing on child growth faltering as a process that unfolds over the first several years of life, we provide new evidence tracing out the relationship between macroeconomic trends and the trajectory of child growth through age 5. Using two novel regression models that each harness different kinds of within- and between-country variation, and data on over 600,000 children from 38 countries over more than 20 years, our estimates of the association are relatively small but precise, and are consistent across both estimators. We estimate that a 10% increase in GDP around the time of a child’s birth is associated with a decrease in the rate of loss of HAZ of about 0.002 SD per month over the first two years of life, which generates a cumulative effect of around 0.04 SD by age 3 that then persists through age 5. Our estimates are small compared to most previously published statistically significant estimates, more precisely estimated than previous insignificant estimates, and relate to a broader population of children than previous estimates focused on dichotomous outcomes.

It’s a frustrating thing that this sort of careful, policy-relevant work (I haven’t read the paper carefully so I can’t comment on the quality of the analysis, one way or another, but it certainly seems careful and policy-relevant) doesn’t get so much attention compared to headline-bait like pizzagate or himmicanes or gay genes or whatever. And I’m part of this! A careful quantitative analysis . . . what can I say about that? Not much, without doing a bunch of work.

But at least I’m posting on this, so I hope some of you who work in this area will take a look and offer your thoughts.

Don’t do the Wilcoxon (reprise)

František Bartoš writes:

I’ve read your and various others statistical books and from most of them, I gained a perception, that nonparametric tests aren’t very useful and are mostly a relic from pre-computer ages.

However, this week I witnessed a discussion about this (in Psych. methods discussion group on FB) and most of the responses were very supportive of the nonparametric test.

I was trying to find more support on your blog, but I wasn’t really successful. Could you consider writing a post with a comparison of parametric and nonparametric tests?

My reply:

1. In general I don’t think statistical hypothesis tests—parametric or otherwise—are helpful because they are typically used to reject a null hypothesis that nobody has any reason to believe, of exactly zero effect and exactly zero systematic error.

2. I also think that nonparametric tests are overrated. I wrote about this a few years ago, in a post entitled Don’t do the Wilcoxon, which is a restatement of a brief passage from our book, Bayesian Data Analysis. The point (click through for the full story) is that Wilcoxon is essentially equivalent to first ranking the data, then passing the ranks through a z-score transformation, and then running a classical z-test. As such, this procedure could be valuable in some settings (those settings where you feel that the ranks contain most of the information in the data, and where otherwise you’re ok with a z-test). But, if it’s working for you, what makes it work is that you’re discarding information using the rank transformation. As I wrote in the above-linked post, just do the transformation if you want and then use your usual normal-theory methods; don’t get trapped into thinking there’s something specially rigorous about the method being nonparametric.
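
Here is a minimal sketch of that recipe, under one reading of “z-score transformation”: ranks are mapped to normal scores through the inverse normal cdf at (rank - 0.5)/n. That mapping, the function names, and the example numbers are my assumptions for illustration. This is not the exact Wilcoxon statistic, just the transform-then-normal-theory procedure described above.

```python
import math
from statistics import NormalDist, mean, variance

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def rank_then_z_test(group_a, group_b):
    """Rank the pooled data, convert ranks to normal scores, run a two-sample z-test.

    Needs at least two observations per group and some variation in the scores.
    Returns (z, two_sided_p).
    """
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf((r - 0.5) / n) for r in average_ranks(pooled)]
    a, b = scores[:len(group_a)], scores[len(group_a):]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return z, 2 * (1 - std_normal.cdf(abs(z)))

if __name__ == "__main__":
    control = [1.1, 2.3, 2.9, 3.8, 4.4]
    treated = [5.2, 6.1, 7.7, 8.0, 9.5]
    print(rank_then_z_test(control, treated))
    # Any monotone rescaling leaves the ranks, and so the answer, unchanged.
    print(rank_then_z_test([math.exp(v) for v in control],
                           [math.exp(v) for v in treated]))
```

The second call shows what the rank step throws away: after exponentiating every value the answer is identical, because only the ordering survives the transformation.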

A quick rule of thumb is that when someone seems to be acting like a jerk, an economist will defend the behavior as being the essence of morality, but when someone seems to be doing something nice, an economist will raise the bar and argue that he’s not being nice at all.

Like Pee Wee Herman, act like a jerk
And get on the dance floor let your body work

I wanted to follow up on a remark from a few years ago about the two modes of pop-economics reasoning:

You take some fact (or stylized fact) about the world, and then you either (1) use people-are-rational-and-who-are-we-to-judge-others reasoning to explain why some weird-looking behavior is in fact rational, or (2) use technocratic reasoning to argue that some seemingly reasonable behavior is, in fact, inefficient.

The context, as reported by Felix Salmon, was a Chicago restaurant whose owner, Grant Achatz, was selling tickets “at a fixed price and are then free to be resold at an enormous markup on the secondary market.” Economists Justin Wolfers and Betsey Stevenson objected. They wanted Achatz to increase his prices. By keeping prices low, he was, apparently, violating the principles of democracy: “‘It’s democratic in theory, but not in practice,’ said Wolfers . . . Bloomberg’s Mark Whitehouse concludes that Next should ‘consider selling tickets to the highest bidder and giving the extra money to charity.'”

I summarized as follows:

In this case, Wolfers and Whitehouse are going through some contortions to argue (2). In a different mood, however, they might go for (1). I don’t fully understand the rules for when people go with argument 1 and when they go with 2, but a quick rule of thumb is that when someone seems to be acting like a jerk, an economist will defend the behavior as being the essence of morality, but when someone seems to be doing something nice, an economist will raise the bar and argue that he’s not being nice at all.

I’m guessing that if Grant Achatz were to implement the very same pricing policy but talk about how he’s doing it solely out of greed, a bunch of economists would show up and explain how this was actually the most moral and democratic option.

In comments, Alex wrote:

(1) and (2) are typically distinguished in economics textbooks as examples of positive and normative reasoning, respectively. The former aims at describing the observed behavior in terms of a specific model (e.g. rationality), seemingly without any attempt at subjective judgement. The latter takes the former as given and applies a subjective social welfare function to the outcomes in order to judge, whether the result could be improved upon with, say, different institutional arrangement or a policy intervention.

To which I replied:

Yup, and the usual rule seems to be to use positive reasoning when someone seems to be acting like a jerk, and normative reasoning when someone seems to be doing something nice. This seems odd to me. Why assume that, just because someone is acting like a jerk, that he is acting so efficiently that his decisions can’t be improved, only understood? And why assume that, just because someone seems to be doing something nice, that “unintended consequences” etc. ensure he’s not doing a good job of it? To me, this is contrarianism run wild. I’m not saying that Wolfers is a knee-jerk contrarian; rather I’m guessing that he’s following default behaviors without thinking much about it.

This is an awkward topic to write about. I’m not saying I think economists are mean people; they just seem to have a default mode of thought which is a little perverse.

In the traditional view of Freudian psychiatrists, no behavior can be taken at face value, and it takes a Freudian analyst to decode the true meaning. Similarly, in the world of pop economics, or neoclassical economics, any behavior that might seem good, or generous (for example, not maxing out your prices at a popular restaurant) is seen to be damaging to the public good—“unintended consequences” and all that—while any behavior that might seem mean, or selfish, is actually for the greater good.

Let’s unpack this in five directions, from the perspective of the philosophy of science, the sociology of scientific professions, politics, the logic of rhetoric, and the logic of statistics.

From the standpoint of the philosophy of science, pop economics or neoclassical economics is, like Freudian theory, unfalsifiable. Any behavior can be explained as rational (motivating economists’ mode 1 above) or as being open to improvement (motivating economists’ mode 2 of reasoning). Economists can play two roles: (1) to reassure people that the current practices are just fine and to use economic theory to explain the hidden benefits arising from seemingly irrational or unkind decisions; or (2) to improve people’s lives through rational and cold but effective reasoning (the famous “thinking like an economist”). For flexible Freudians, just about any behavior can be explained by just about any childhood trauma; and for modern economists, just about any behavior can be interpreted as a rational adaptation—or not. In either case, specific applications of the method can be falsified—after all, Freudians and neoclassical economists alike are free to make empirically testable predictions—but the larger edifice is unfalsifiable, as any erroneous prediction can simply be explained as an inappropriate application of the theory.

From a sociological perspective, the flexibility of pop-economics reasoning, like the flexibility of Freudian theory, can be seen as a plus, in that it implies a need for trained specialists, priests who can know which childhood trauma to use as an explanation, or who can decide whether to use economics’s explanation 1 or 2. Again, recall economists’ claims that they think in a different, more piercing, way than other scholars, an attitude that is reminiscent of old-school Freudians’ claim to look squarely at the cold truths of human nature that others can’t handle.

The political angle is more challenging. Neoclassical economics is sometimes labeled as conservative, in that explanation 1 (the everything-is-really-ok story) can be used to justify existing social and economic structures; on the other hand, such arguments can also be used to justify existing structures with support on the left. And, for that matter, economist Justin Wolfers, quoted above, is I believe a political liberal in the U.S. context. So it’s hard for me to put this discussion on the left or the right; maybe best just to say that pop-econ reasoning is flexible enough to go in either political direction, or even both at once.

When it comes to analyzing the logic of economic reasoning, I keep thinking about Albert Hirschman’s book, The Rhetoric of Reaction. I feel that the ability to bounce back and forth between arguments 1 and 2 is part of what gives pop economics, or microeconomics more generally, some of its liveliness and power. If you only apply argument 1—explaining away all of human behavior, however ridiculous, as rational and desirable—then you’re kinda talking yourself out of a job: as an economist, you become a mere explainer, not a problem solver. On the other hand, if you only apply argument 2—studying how to approach optimal behavior in situation after situation—then you become a mere technician. By having the flexibility of which argument to use in any given setting, you can be unpredictable. Unpredictability is a source of power and can also make you more interesting.

Finally, I can give a statistical rationale for the rule of thumb given in the title of this post. It’s Bayesian reasoning; that is, partial pooling. If you look at the population distribution of all the things that people do, some of these actions have positive effects, some have negative effects, and most effects are small. So if you receive a noisy signal that someone did something positive, the appropriate response is to partially pool toward zero and to think of reasons why this apparently good deed was, on net, not so wonderful at all. Conversely, when you hear about something that sounds bad, you can partially pool toward zero from the other direction.

Just look at the crowd. Say, “I meant to do that.”

Proposed new EPA rules requiring open data and reproducibility

Tom Daula points to this news article by Heidi Vogt, “EPA Wants New Rules to Rely Solely on Public Data,” with subtitle, “Agency says proposal means transparency; scientists see public-health risk.” Vogt writes:

The Environmental Protection Agency plans to restrict research used in developing regulations, the agency said Tuesday . . . The new proposal would exclude the many research studies that don’t make their raw data public and limit the use of findings that can’t be reproduced by others. The EPA said this would boost transparency. . . .

The move prompted an uproar from scientists who say it would exclude so much research that the resulting rules would endanger Americans’ health. Ahead of the announcement, a coalition of 985 scientists issued a statement decrying the plan.

“This proposal would greatly weaken EPA’s ability to comprehensively consider the scientific evidence,” they said in a letter issued Monday. The group said the EPA has long been very transparent in explaining the scientific basis for decisions and that requiring public data would exclude essential studies that involve proprietary information or confidential personal data. . . .

The administrator made his announcement flanked by two lawmakers who introduced that legislation: Sen. Mike Rounds (R., S.D.) and Rep. Lamar Smith (R., Texas).

Mr. Smith has argued that confidential data such as patient records could be redacted or given only to those who agree to keep it confidential.

Scientists have said this sort of process would still exclude many studies and make others costly to use in regulation. Gretchen Goldman, research director for the Center for Science and Democracy, has said studies are already rigorously reviewed by scientific journals and that those peer reviews rarely require raw data to assess the science.

Richard Denison, lead scientist at the Environmental Defense Fund, said the rule could exclude studies that track real-life situations that it would be unethical to reproduce. He gave as an example the monitoring of the Deepwater Horizon oil spill in the Gulf of Mexico in 2010.

“The only way to reproduce that work would be to stage another such oil spill, clearly nonsensical,” he said in a statement.

As for providing all the raw data, Mr. Denison said that would prevent the use of medical records that must be kept confidential by law.

The American Association for the Advancement of Science—the world’s largest general scientific society and the publisher of the journal Science—said the rule would also exclude many studies that rely on outside funders, because they sometimes limit access to the underlying data.

Daula expressed this view:

If journals required data and code to replicate then it wouldn’t matter. Having a big player demand such transparency may spur journals to adopt such a policy. Thoughts? Controversial politically, but seems in line with ideas advanced on your blog.

I have mixed feelings about this proposal. Overall it seems like a good idea, as long as exceptions for special cases are carved out.

1. Going forward, I strongly support the idea that decisions should be made based on open data and reproducible studies.

2. That said, there are lots of decisions that need to be made based on existing, imperfect studies. So in practice some compromises need to be made.

3. Regarding the example given by the guy from the Environmental Defense Fund, I don’t know how the monitoring was done of the Deepwater Horizon oil spill. But why can’t these data be open, and why can’t the analysis be reproducible?

4. There seems to be some confusion over the nature of “reproducibility,” which has different meanings in different contexts. A simple psychology experiment can actually be reproduced (although there’s never such a thing as an exact replication, given that any attempted replication will include new people and a new context). In some examples of environmental science, you can re-run a lab or field experiment; in other cases (as when studying global warming or massive oil spills), there’s no way to replicate. But the data processing and analysis should still be replicable. I haven’t seen the proposed EPA rules, so I’m not sure what’s meant by “limit the use of findings that can’t be reproduced by others.”

I’d hope that for a study such as the Deepwater Horizon monitoring, there’d be no requirement that a new oil spill be reproduced—but it does seem reasonable for the data to be fully available and the data processing and analysis to be replicable.

5. I’m disappointed to see the research director for the Center for Science and Democracy saying that studies are already rigorously reviewed by scientific journals and that those peer reviews rarely require raw data to assess the science.

No kidding, peer reviews rarely require raw data to assess the science! And that’s a big problem. So, no, I don’t think the existence of purportedly rigorous peer review (if you want an environmental science example, see here) is any reason to dismiss a call for open data and reproducibility.

Also, I’d think that any organization called the “Center for Science and Democracy” would favor openness.

6. I can understand the reasoning by which these science organizations are opposing this EPA plan: The current EPA administrator is notorious for secrecy, and from newspaper reports it seems pretty clear that the EPA is making a lot of decisions based on closed-door meetings with industry. But, if the problem is a closed, secretive government, I don’t think the solution is to defend closed, secretive science.

7. Specific objections raised by the scientists were: (a) “requiring public data would exclude essential studies that involve proprietary information or confidential personal data,” and (b) “rule would also exclude many studies that rely on outside funders, because they sometimes limit access to the underlying data.” I suppose exceptions would have to be made in these cases, but I do think that lots of scrutiny should be applied to claims based on unshared data and unreplicable experiments.

The current state of the Stan ecosystem in R

(This post is by Jonah)

Last week I posted here about the release of version 2.0.0 of the loo R package, but there have been a few other recent releases and updates worth mentioning. At the end of the post I also include some general thoughts on R package development with Stan and the growing number of Stan users who are releasing their own packages interfacing with rstan or one of our other packages.


rstanarm and brms: Version 2.17.4 of rstanarm and version 2.2.0 of brms were both released to provide compatibility with the new features in loo v2.0.0. Two of the new vignettes for the loo package show how to use it with rstanarm models, and we have also just released a draft of a vignette on how to use loo with brms and rstan for many “non-factorizable” models (i.e., observations not conditionally independent). brms is also now officially supported by the Stan Development Team (welcome Paul!) and there is a new category for it on the Stan Forums.

rstan: The next release of the rstan package (v2.18) is not out yet (we need to get Stan 2.18 out first), but it will include a loo() method for stanfit objects in order to save users a bit of work. Unfortunately, we can’t save you the trouble of having to compute the point-wise log-likelihood in your Stan program though! There will also be some new functions that make it a bit easier to extract HMC/NUTS diagnostics (thanks to a contribution from Martin Modrák).


bayesplot: A few weeks ago we released version 1.5.0 of the bayesplot package, which also integrates nicely with loo 2.0.0. In particular, the diagnostic plots using the leave-one-out cross-validated probability integral transform (LOO-PIT) from our paper Visualization in Bayesian Workflow (preprint on arXiv, code on GitHub) are easier to make with the latest bayesplot release. Also, TJ Mahr continues to improve the bayesplot experience for ggplot2 users by adding (among other things) more functions that return the data used for plotting in a tidy data frame.

shinystan: Unfortunately, there hasn’t been a shinystan release in a while because I’ve been busy with all of these other packages, papers, and various other Stan-related things. We’ll try to get out a release with a few bug fixes soon. (If you’re annoyed by the lack of new features in shinystan recently let me know and I will try to convince you to help me solve that problem!)

(Update: I forgot to mention that despite the lack of shinystan releases, we’ve been working on better introductory materials. To that end, Chelsea Muth, Zita Oravecz, and I recently published an article User-friendly Bayesian regression modeling: A tutorial with rstanarm and shinystan (view).)

Other tools

loo: We released version 2.0.0, a major update to the loo package. See my previous blog post.

projpred: Version 0.8.0 of the projpred package for projection predictive variable selection for GLMs was also released shortly after the loo update in order to take advantage of the improvements to the Pareto smoothed importance sampling algorithm. projpred can already be used quite easily with rstanarm models and we are working on improving its compatibility with other packages for fitting Stan models.

rstantools: Unrelated to the loo update, we also released version 1.5.0 of the rstantools package, which provides functions for setting up R packages interfacing with Stan. The major changes in this release are that usethis::create_package() is now called to set up the package (instead of utils::package.skeleton), fewer manual changes to files are required by users after calling rstan_package_skeleton(), and we have a new vignette walking through the process of setting up a package (thanks Stefan Siegert!). Work is being done to keep improving this process, so be on the lookout for more updates soonish.

Stan related R packages from other developers

There are now well over fifty packages on CRAN that depend in some way on one of our R packages mentioned above! You can find most of them by looking at the “Reverse dependencies” section on the CRAN page for rstan, but that doesn’t count the ones that depend on bayesplot, shinystan, loo, etc., but not rstan.

Unfortunately, given the growing number of these packages, we haven’t been able to look at each one of them in detail. For obvious reasons we prioritize giving feedback to developers who reach out to us directly to ask for comments and to those developers who make an effort to follow our recommendations for developers of R packages interfacing with Stan (included with the rstantools package since its initial release in 2016). If you are developing one of these packages and would like feedback please let us know on the Stan Forums. Our time is limited but we really do make a serious effort to answer every single question asked on the forums (thank you to the many Stan users who also volunteer their time helping on the forums!).

My primary feelings about this trend of developing Stan-based R packages are ones of excitement and gratification. It’s really such an honor to have so many people developing these packages based on all the work we’ve done! There are also a few things I’ve noticed that I hope will change going forward. I’ll wrap up this post by highlighting two of these issues that I hope developers will take seriously:

(1) Unit testing

(2) Naming user-facing functions

The number of these packages that have no unit tests (or very scant testing) is a bit scary. Unit tests won’t catch every possible bug (we have lots of tests for our packages and people still find bugs all the time), but there is really no excuse for not unit testing a package that you want other people to use. If you care enough to do everything required to create your package and get it on CRAN, and if you care about your users, then I think it’s fair to say that you should care enough to write tests for your package. And there’s really no excuse these days with the availability of packages like testthat to make this process easier than it used to be! Can anyone think of a reasonable excuse for not unit testing a package before releasing it to CRAN and expecting people to use it? (Not a rhetorical question. I really am curious given that it seems to be relatively common or at least not uncommon.) I don’t mean to be too negative here. There are also many packages that seem to have strong testing in place! My motivation for bringing up this issue is that it is in the best interest of our users.

Regarding function naming: this isn’t nearly as big of a deal as unit testing, it’s just something I think developers (including myself) of packages in the Stan R ecosystem can do to make the experience better for our users. rstanarm and brms both import the generic functions included with rstantools in order to be able to define methods with consistent names. For example, whether you fit a model with rstanarm or with brms, you can call log_lik() on the fitted model object to get the pointwise log-likelihood (it’s true that we still have a bit left to do to get the names across rstanarm and brms more standardized, but we’re actively working on it). If you are developing a package that fits models using Stan, we hope you will join us in trying to make it as easy as possible for users to navigate the Stan ecosystem in R.

A few words on a few words on Twitter’s 280 experiment.

Gur Huberman points us to this post by Joshua Gans, “A few words on Twitter’s 280 experiment.” I hate twitter but I took a look anyway, and I’m glad I did, as Gans makes some good points and some bad points, and it’s all interesting.

Gans starts with some intriguing background:

Twitter have decided to run an experiment. They are giving random users twice the character limit — 280 rather than 140 characters. Their motivation was their observation that in Japanese, Korean and Chinese 140 characters conveys alot more information and so people tend to tweet more often. Here is their full statement.

The instructive graph is this:


The conclusion drawn is that Japanese tweeters do not hit their character limit as much as English tweeters. They also claim they see more people tweeting in the less constrained languages. Their conclusion is that not having as tight a character limit makes expression easier and so you get more of it.

Interesting.  Gans continues:

What Twitter have just told us is that the world gave them a natural experiment and they liked what they saw. . . . What was Twitter’s reaction to this? To do an experiment. In other words, they are worried that the natural experiment isn’t telling them enough. Since it is about as clean a natural experiment as you are likely to get in society, we can only speculate what they are missing. Are they concerned that this is something cultural? (They had three cultures do this so that is strange). Moreover, many of those users must also speak English so one has to imagine something could be learned from that.

I’m not quite sure what he means by a “culture,” but this generally seems like a useful direction to explore.  One thing, though:  Gans seems to think it’s a big mystery why Twitter would want to do an experiment rather than just draw inferences from observational data.  But an experiment here is much different from the relevant observational data.  In the observational data, the U.S. condition is unchanged; in the experiment, the U.S. condition is changed.  That’s a big deal!  We’re talking about two different comparisons:

observational:  U.S. with a 140 character limit vs. Japan with a 140 character limit.

experimental:  U.S. with a 140 character limit vs. U.S. with a 280 character limit.

These comparisons are a lot different!  It doesn’t matter how “clean” is the observational comparison (which I think Gans somewhat misleadingly calls a “natural experiment”); these are two different comparisons.

Gans continues:

My point is: the new experiment must be testing a hypothesis. But what is that hypothesis?

Huh?  There’s no requirement at all that an experiment “must be testing a hypothesis.”  An experiment is a way to gather data.  You can use experimental data to test hypotheses, or to estimate parameters, or to make predictions, or to make decisions.  All these can be useful.  But none of them is necessary.  In particular, I’m guessing that Twitter wants to make decisions (also to get some publicity, goodwill, etc.).  No need for there to be any testing of a hypothesis.

Gans does have some interesting thoughts on the specifics:

The obvious way [to do an experiment] would be to announce, say, a three month trial across the whole of English speaking twitter and observe changes. That would replicate the natural experiment to a degree. Or, alternatively, you might pick a language with a small number of users and conduct the experiment there. . . .

That is not what Twitter did. They decided to randomise across a subset of English users — giving them 280 characters — and leaving the rest out. That strikes me as a bad idea because those random people are not contained. They mix with the 140 people. . . .

Why is this a terrible idea? Because it is not an experiment that tests what Twitter was likely missing from the information they gained already. Instead, it is an experiment that tests the hypothesis — what if we gave some people twice the limit and threw all of them together with those without? The likelihood that Twitter learns anything with confidence to move to a 280 limit from everyone is very low from this.

All this seems odd to me.  Gans’s concern is spillover, and that’s a real concern, but any design has issues.  His proposed three-month trial has no spillover but is confounded with time trends.  If it’s not one thing it’s another.  My point is that I don’t think it’s right to say that a design is “terrible” just because there’s spillover, any more than you should say that the design is terrible if it is confounded with time, any more than you should describe an observational comparison which is confounded with country as if it is “as clean as you are likely to get.”
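The trade-off between the two designs can be sketched with a small simulation. Everything here is a hypothetical assumption made for illustration: the true effect, the size of the time trend, the spillover fraction, the activity scale, and the function name. None of it comes from Twitter's data.

```python
import random
from statistics import mean


def simulate_designs(true_effect=1.0, time_trend=0.5, spillover=0.3,
                     n_users=20000, seed=1):
    """Compare two imperfect designs for estimating a treatment effect.

    All parameter values are made up for illustration.
    Returns (before_after_estimate, randomized_estimate).
    """
    rng = random.Random(seed)
    baseline = 10.0  # hypothetical average activity per user

    # Design A: switch everyone at once and compare before vs. after.
    # No spillover, but the time trend is mixed into the estimate.
    before = [baseline + rng.gauss(0.0, 1.0) for _ in range(n_users)]
    after = [baseline + time_trend + true_effect + rng.gauss(0.0, 1.0)
             for _ in range(n_users)]
    before_after = mean(after) - mean(before)

    # Design B: randomize users within the same period.
    # The time trend cancels, but control users pick up part of the
    # effect by interacting with treated users (spillover).
    control = [baseline + time_trend + spillover * true_effect
               + rng.gauss(0.0, 1.0) for _ in range(n_users)]
    treated = [baseline + time_trend + true_effect + rng.gauss(0.0, 1.0)
               for _ in range(n_users)]
    randomized = mean(treated) - mean(control)

    return before_after, randomized


before_after, randomized = simulate_designs()
```

With these made-up numbers the before/after estimate overshoots the true effect by roughly the time trend, and the randomized estimate undershoots it by roughly the spillover fraction. Neither design is clean; which bias is worse depends on quantities that would have to be argued for in the real setting.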

Yes, identify the problems in data and consider what assumptions are necessary to learn from these problems. No, don’t be so sure that what people are doing is a bad idea. Remember that Twitter has goals beyond testing hypotheses—indeed I’d guess that Twitter isn’t interested in hypothesis testing at all!  It’s a business decision and Twitter has lots of business goals. Just to start, see this comment from Abhishek on the post in question.

Finally, Gans writes:

What we should be complaining about is why they are running such an awful experiment and how they came to such a ludicrous decision on that.

Huh?  We should be complaining because a company is suboptimally allocating resources?  I don’t get it.  We can laugh at them, but why complain?

P.S.  Yes, I recognize the meta-argument, that if I think Gans has no reason to complain that Twitter did an experiment that’s different from the experiment he would’ve preferred, then, similarly, I have no reason to complain that Gans wrote a blog post different from the post that I would’ve preferred.  Fair enough.

What I’m really saying here is that I disagree with much of what Gans writes.  Or, to be more precise, I like Gans’s big picture—he’s looking at a data analysis (the above graph) and thinking of it as an observational study, and he’s looking at a policy change (the 280-character rule) and thinking of it as an experiment—but I think he’s getting stuck in the weeds, not fully recognizing the complexity of the situation and thinking that there’s some near-ideal experiment and hypothesis out there.

I appreciate that Gans is stepping back, taking a real-world business decision that’s in the news and trying to evaluate from first principles. We certainly shouldn’t assume that any decision made by Twitter, say, is automatically a wise choice, nor should we assume that change is bad.  It’s a good idea to look at a policy change and consider what can be learned from it.  (For more on this point, see Section 4 of this review.)  I’d just like to step back a few paces further and place this data gathering in the context of various goals of Twitter and its users.

So I thank Gans for getting this discussion started, and I thank Huberman for passing it over to us.

P.P.S.  I wrote this post in Sep 2017 and it’s scheduled to appear in Apr 2018, at which time, who knows, tweets might be 1000 characters long.  I still prefer blogs.

The cargo cult continues

Juan Carlos Lopez writes:

Here’s a news article: . . .

Here’s the paper: . . .

[Details removed to avoid embarrassing the authors of the article in question.]

I [Lopez] am especially bothered by the abstract of this paper, which makes bold claims in the context of a small and noisy study whose measurements are not closely tied to the underlying constructs of interest—at best, they are quantifying a very context-dependent, special case.

Anyhow, I think you can get the gist of the article (and its problems) by only reading the abstract, Table 1, and Figure 1.

My reply:

Yes, there’s no need to take the paper seriously: it’s an exercise in noise mining, and if anyone would ever go to the trouble of replicating it—which I doubt will ever happen—I expect they’d see some other set of interactions pop up as statistically significant. In the news article, one of the authors describes the results in the paper as “surprising”—without realizing that it’s no surprise at all that if you shuffle around a bunch of random numbers, out will pop some random statistically significant comparisons.
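Here is a minimal sketch of that point, using only the Python standard library. It runs many two-group comparisons on pure noise and counts how many come out "statistically significant." The sample sizes, number of comparisons, and the normal-approximation test are illustrative assumptions and have nothing to do with the paper in question.

```python
import random
from statistics import NormalDist, mean, variance


def count_significant_noise_comparisons(n_comparisons=200, n_per_group=50,
                                        alpha=0.05, seed=0):
    """Count comparisons with p < alpha when there is no true effect at all."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    significant = 0
    for _ in range(n_comparisons):
        # Both groups are drawn from the same distribution: pure noise.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Two-sample z-test (normal approximation to the t-test).
        standard_error = ((variance(group_a) + variance(group_b))
                          / n_per_group) ** 0.5
        z = (mean(group_a) - mean(group_b)) / standard_error
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            significant += 1
    return significant


n_significant = count_significant_noise_comparisons()
```

With 200 comparisons and a 0.05 threshold, you would expect around 10 to clear the bar by chance alone, and a different seed would give a different set of "findings."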

The whole thing is a disaster, from data collection to analysis to writeup to publication to publicity—for the general reasons discussed here, and I think I’d be doing the authors a favor, at some level, to tell them that—but for the usual reasons of avoiding conflict I won’t bother doing this. It really makes me sad, not angry. This particular paper that you sent me is not on a particularly important or exciting topic (it’s just quirky enough to get into the news), it’s just routine cargo-cult science that we see every day. For lots of people, it’s their career and they just don’t know better.

Lopez followed up with another question:

In the setting of, say, a research seminar presentation, how do you answer the question “Why are you not including p-values in your Results section”?

Some context for my question: I’m a Ph.D. candidate at a university where most people are still using p-values in the usual ways which you criticize in McShane et al. (2017). I have trouble answering the question above in a way that doesn’t derail the entire discussion. Recently, I’ve discovered that the most effective way to avoid a long—and sometimes counterproductive—discussion on the topic is to appeal to authority by saying I’m following the ASA guidelines. This has become my go-to, 30-second answer.

My response: I don’t object to people including p-values—they do tell you something! My objection is when p-values are used to select a subset of results. I say: give all the results, not just a subset.