
No to inferential thresholds

Harry Crane points us to this new paper, “Why ‘Redefining Statistical Significance’ Will Not Improve Reproducibility and Could Make the Replication Crisis Worse,” and writes:

Quick summary: Benjamin et al. claim that FPR would improve by factors greater than 2 and replication rates would double under their plan. That analysis ignores the existence and impact of “P-hacking” on reproducibility. My analysis accounts for P-hacking and shows that FPR and reproducibility would improve by much smaller margins and quite possibly could decline depending on some other factors.

I am not putting forward a specific counterproposal here. I am instead examining the argument in favor of redefining statistical significance in the original Benjamin et al. paper.

From the concluding section of Crane’s paper:

The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.

Defenders of the proposal will inevitably criticize this conclusion as “perpetuating the status quo,” as one of them already has [12]. Such a rebuttal is in keeping with the spirit of the original RSS [redefining statistical significance] proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors. The RSS proposal is just the latest in a long line of recommendations aimed at resolving the crisis while perpetuating the cult of statistical significance [22] and propping up the flailing and failing scientific establishment under which the crisis has thrived.

I like Crane’s style. I can’t say that I tried to follow the details, because his paper is all about false positive rates, and I think that whole false positive thing is inappropriate in most science and engineering contexts that I’ve seen, as I’ve written many times (see, for example, here and here).

I think the original sin of all these methods is the attempt to get certainty or near-certainty from noisy data. These thresholds are bad news—and, as Hal Stern and I wrote a while ago, it’s not just because of the 0.049 or 0.051 thing. Remember this: a z-score of 3 gives you a (two-sided) p-value of 0.003, and a z-score of 1 gives you a p-value of 0.32. One of these is super significant—“p less than 0.005”! Wow!—and the other is the ultimate statistical nothingburger. But if you have two different studies, and one gives p=0.003 and the other gives p=0.32, the difference between them is not at all remarkable. You could easily get both these results from the exact same underlying effect, based on nothing but sampling variation, or measurement error, or whatever.
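To make the arithmetic concrete, here is the comparison in R. This is a minimal sketch under the simplifying assumption of two independent estimates with equal standard errors; it is not taken from the Gelman and Stern paper itself.

    # Two studies: z = 3 and z = 1
    z1 <- 3
    z2 <- 1
    2 * pnorm(-abs(z1))   # two-sided p-value, roughly 0.003
    2 * pnorm(-abs(z2))   # two-sided p-value, roughly 0.32

    # The difference between the two estimates (assuming independence and
    # equal standard errors) has z = (3 - 1) / sqrt(2), about 1.4:
    z_diff <- (z1 - z2) / sqrt(2)
    2 * pnorm(-abs(z_diff))   # roughly 0.16: not at all remarkable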

So, scientists and statisticians: All that thresholding you’re doing? It’s not doing what you think it’s doing. It’s just a magnification of noise.

So I’m not really inclined to follow the details of Crane’s argument regarding false positive rates etc., but I’m supportive of his general attitude that thresholds are a joke.

Post-publication review, not “ever expanding regulation”

Crane’s article also includes this bit:

While I am sympathetic to the sentiment prompting the various responses to RSS [1, 11, 15, 20], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic ‘calls to action’. Those calling for bigger and better scientific regulations ought not forget that another regulation—the 5% significance level—lies at the heart of the crisis.

As a coauthor of one of the cited papers ([15], to be precise), let me clarify that we are not “calling for bigger and better scientific regulations,” nor are we advocating for pre-registered studies (although we do believe such studies have their place), nor are we proposing to “ban” anything, nor are we offering any “generic calls to action.” Of all the things on that list, the only thing we’re suggesting is “better study design”—and our suggestions for better study design are in no way a call for “ever expanding scientific regulation.”

Spatial models for demographic trends?

Jon Minton writes:

You may be interested in a commentary piece I wrote early this year, which was published recently in the International Journal of Epidemiology, where I discuss your work on identifying an aggregation bias in one of the key figures in Case & Deaton’s (in)famous 2015 paper on rising morbidity and mortality in middle-aged White non-Hispanics in the US.

Colour versions of the figures are available in the ‘supplementary data’ link in the above. (The long delay between writing, submitting, and the publication of the piece in IJE in some ways supports the arguments I make in the commentary, that timeliness is key, and blogs – and arxiv – allow for a much faster pace of research and analysis.)

An example of the more general approach I try to promote for looking at outcomes which vary by age and year is provided below, where I used data from the Human Mortality Database to produce a 3D printed ‘data cube’ of log mortality by age and year, whose features I then discuss. [See here and here.]

Seeing the data arranged in this way also makes it possible to see when the data quality improves, for example, as you can see the texture of the surface change from smooth (imputed within 5/10 year intervals) to rough.

I agree with your willingness to explore data visually to establish ground truths which your statistical models then express and explore more formally. (For example, in your identification of cohort effects in US voting preferences.) To this end I continue to find heat maps and contour plots of outcomes arranged by year and age a simple but powerful approach to pattern-finding, which I am now using as a starting point for statistical model specification.

The arrangement of data by year and age conceptually involves thinking about a continuous ‘data surface’ much like a spatial surface.

Given this, what are your thoughts on using spatial models which account for spatial autocorrelation, such as in R’s CARBayes package, to model demographic data as well?

My reply:

I agree that visualization is important.

Regarding your question about a continuous surface: yes, this makes sense. But my instinct is that we’d want something tailored to the problem; I doubt that a CAR model makes sense in your example. Those models are rotationally symmetric, which doesn’t seem like a property you’d want here.

If you do want to fit Bayesian CAR models, I suggest you do it in Stan.
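For what it’s worth, here is a minimal sketch of what an intrinsic CAR (ICAR) prior looks like in Stan, assuming the age-by-year grid has already been converted to an undirected edge list; this is just the generic construction, not a claim that it is the right structure for a demographic surface (as I said above, I’d want something more tailored).

    // Sketch only: the data names and the simple normal likelihood are assumptions
    data {
      int<lower=1> N;                              // number of age-year cells
      int<lower=0> N_edges;
      array[N_edges] int<lower=1, upper=N> node1;  // edge list: node1[k] adjacent to node2[k]
      array[N_edges] int<lower=1, upper=N> node2;
      vector[N] y;                                 // e.g., log mortality rates
    }
    parameters {
      vector[N] phi;                               // structured (smooth) surface
      real<lower=0> sigma;                         // observation noise
    }
    model {
      // ICAR prior: penalize squared differences between neighboring cells
      target += -0.5 * dot_self(phi[node1] - phi[node2]);
      // soft sum-to-zero constraint so phi is identified
      sum(phi) ~ normal(0, 0.001 * N);
      y ~ normal(phi, sigma);
      sigma ~ normal(0, 1);
    }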

Minton responded:

I agree that additional structure and different assumptions to those made by CAR would be needed. I’m thinking more about the general principle of modeling continuous age-year-rate surfaces. In the case of fertility modeling, for example, I was able to follow enough of this paper (my background is as an engineer rather than statistician) to get a sense that it formalises the way I intuit the data.

In the case of fertility, I also agree with using cohort and age as the surface’s axes rather than year and age. I produced the figure in this poster, where I munged Human Fertility Database and (less quality assured but more comprehensive) Human Fertility Collection data together and re-arranged year-age fertility rates by cohort to produce slightly crude estimates of cumulative cohort fertility levels. The thick solid line shows at which age different cohorts ‘achieve’ replacement fertility levels (2.05), which for most countries veers off into infinity if not achieved by around the age of 43. The USA is unusual in regaining replacement fertility levels after losing them, which I assume is a secondary effect of high migration, with migrant cohorts bringing with them a different fertility schedule than non-migrants. The tiles are arranged from most to least fertile in the last recorded year, but the trends show these ranks will change over time, and the USA may move to top place.

Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.

I had an email exchange with someone the other day. He had a paper with some graphs that I found hard to read, and he replied by telling me about the software he used to make the graphs. It was fine software, but the graphs were, nonetheless, unreadable.

Which made me realize that people are thinking about graphics software the wrong way. People are thinking that the software makes the graph for you. But that’s not quite right. The software allows you to make a graph for yourself.

Think of graphics software like a hammer. A hammer won’t drive in a nail for you. But if you have a nail and you know where to put it, you can use the hammer to drive in the nail yourself.

This is what I told my correspondent:

Writing takes thought. You can’t just plug your results into a computer program and hope to have readable, useful paragraphs.
Similarly, graphics takes thought. You can’t just plug your results into a graphics program and hope to have readable, useful graphs.

Tips when conveying your research to policymakers and the news media

Following up on a conversation regarding publicizing scientific research, Jim Savage wrote:

Here’s a report that we produced a few years ago on prioritising potential policy levers to address the structural budget deficit in Australia. In the report we hid all the statistical analysis, aiming at an audience that would feel comfortable reading a broadsheet newspaper.

In terms of impact, the report really hit the mark—front page of every national newspaper, and was the centre of political discourse for weeks. Longer-term, our big proposals were more or less adopted by both sides of politics.

Some strategies that we used that I think paid off (I can’t claim credit for these—my old boss John was a master at the dark arts):

– A surprise to no insiders. We spent about a year on the report, talking to policymakers and those who’d be hostile to our ideas (lobby groups, mainly) throughout. By the time it was released, the insiders knew what to say about it, and we had good arguments against the detractors.

– Prioritising in terms of political cost (as well as potential budget gains and economic costs) was well received.

– The “supporting analysis” deck was a hit with political staffers and journalists. We provided Excel files containing all the plots to any media outfit that asked. Anything that makes journalists’ jobs easier, sadly, will get more media time.

– Briefing, briefing, briefing. In the two weeks before release, we took a 2-page summary (only charts) around to any journalist/politician who’d listen. That gave them time to write their pieces well in advance.

Apparently this is PR 101, but it was completely new to me. And I think the approach gave the paper a great run among those we wanted to influence.

These are interesting ideas that we can all think about when we have some policy-relevant results to convey from our research.

This seemed worth blogging, on the theory that our blog readers are, on average, doing good things and so we should spread these useful public relations tips. Positive-sum advice, I hope.

Computing marginal likelihoods in Stan, from Quentin Gronau and E. J. Wagenmakers

Gronau and Wagenmakers write:

The bridgesampling package facilitates the computation of the marginal likelihood for a wide range of different statistical models. For models implemented in Stan (such that the constants are retained), executing the code bridge_sampler(stanfit) automatically produces an estimate of the marginal likelihood.

Full story is at the link.
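Here is a minimal sketch of the workflow in R, assuming the bridgesampling and rstan packages; the toy model and data are made up for illustration. The point of “such that the constants are retained” is that the Stan program should use target += ..._lpdf() statements rather than sampling statements, so that the normalizing constants stay in the target density.

    library(rstan)
    library(bridgesampling)

    # Toy Stan program written with target += so constants are retained
    stan_code <- "
    data { int<lower=1> N; vector[N] y; }
    parameters { real mu; }
    model {
      target += normal_lpdf(mu | 0, 1);   // prior, constants kept
      target += normal_lpdf(y | mu, 1);   // likelihood, constants kept
    }
    "
    fit <- stan(model_code = stan_code,
                data = list(N = 20, y = rnorm(20, 0.3, 1)),
                iter = 10000)             # plenty of draws helps the bridge estimate
    bridge_sampler(fit)                   # estimate of the log marginal likelihood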

My talk tomorrow (Fri) 10am at Columbia

I’m speaking for the statistics undergraduates tomorrow (Fri 17 Nov) 10am in room 312 Mathematics Bldg. I’m not quite sure what I’ll talk about: maybe I’ll do again my talk on statistics and sports, maybe I’ll speak on the statistical crisis in science. Anyone can come; we’d especially like to attract undergraduates—not just statistics majors—to learn more about our field.

No no no no no on “The oldest human lived to 122. Why no person will likely break her record.”

I came across this news article by Brian Resnick entitled:

The oldest human lived to 122. Why no person will likely break her record. Even with better medicine, living past 120 years will be extremely unlikely.

I was skeptical, and I really didn’t buy it after reading the research article, “Evidence for a limit to human lifespan,” by Xiao Dong, Brandon Milholland and Jan Vijg, that appeared in Nature.

As I wrote in an email to Resnick: “No no no no no on ‘The oldest human lived to 122. Why no person will likely break her record.'”

So much of it seems ridiculous to me.

The news article says, “In all, they determined the probability that someone will reach age 125 in any given year ‘is less than 1 in 10,000.’ Or put another way: A 125-year-old human is a once-in-10,000-year occurrence.”

But the headline refers to someone living to 122 or 123, not to 125. And that’s already happened once, right?

tl;dr: If someone has a mathematical model claiming that something that actually did happen, is extremely unlikely to happen, this to me is evidence that the model is flawed. I can see how Nature—which is a bit of a “tabloid”—would publish such a thing, but I was unhappy to see a neutral journalist falling for this. I recommend a bit of skepticism.

The news article concludes: “Calment, meanwhile, should rest easy in her grave that her record will be around for a long, long time.”

I wouldn’t be so sure.

I clicked through, and the paper has various weird things. For example, they report that maximum reported age of death has been decreasing in recent years, but if you look carefully these estimates have huge uncertainties (that’s what it means when they say P=0.27 and P=0.70). Their curves look pretty but are basically overfitting; that is, they’re correct when they write that one “could explain these results simply as fluctuations.” They write, “we modelled the MRAD as a Poisson distribution; we found that the probability of an MRAD exceeding 125 in any given year is less than 1 in 10,000.” But there’s no reason at all that this model should make sense.

To summarize: There’s nothing wrong with them rooting around in the data and looking for patterns; we can learn a lot that way. But it’s a mistake to present such speculations as anything more than speculation. I don’t think statements such as “In fact, the human race is not very likely to break that record, ever,” are doing anyone any favors.

To put it another way: If you saw such extreme claims from a political advocacy group, you’d be skeptical, right? I recommend the same skepticism when you see something in a scientific publication. Please please please don’t think that, just cos something’s published in Nature, that this is a guarantee that it’s sound science. You really have to look carefully at the paper. And this one isn’t so hard to look at; they’re not doing anything really technical here.

I also sent this to science journalist Ed Yong, who was quoted in the news article. Yong replied:

So Vijg was clear to me in our interview that the change after the mid-90s shouldn’t be seen as a decrease since it’s non-significant. He billed it as a plateau; it’s more that the significant increase before that point no longer continues.

I did ask him about things like outliers and the choice of 1995 as a breakpoint. He said that the results are the same even if you take out Calment as the most obvious outlier, and whichever year you pick as the breakpoint. From him:

There simply is no significant increase from the early 1990s onwards. I am sure that some people will argue that the upward trend may continue soon enough. While we agree that the data are noisy, which is to be expected, the statistics are clear. Fortunately, all databases are public so everyone who wishes can do the math and disagree with us.

To which I responded:

Let me put it this way, then: My problem is in going from “A linear regression with a small number of data points has a trend coefficient which, when fit to the past twenty years, is not statistically significantly different from zero” to “they determined the probability that someone will reach age 125 in any given year ‘is less than 1 in 10,000’” and “In fact, the human race is not very likely to break that record, ever.”

Also, I think it’s a bit strange for them to say both “the data are noisy” and “the statistics are clear.” Their Poisson distribution seems to come out of nowhere.

There was also this quote from Vijg: “When Calment died at 122, everyone said it’ll only be a matter of time before we have someone who’s 125 or 130.”

That also seems a bit misleading in that there’s a big difference between 122 and 125, and a really big difference between 125 and 130! Each year becomes harder to achieve (at least, until there’s some medical breakthrough).

From a news perspective, this is not serious science, it’s just a fun feature story. I think Vijg is misunderstanding the difference between interpolation and extrapolation, but, hey, that’s how he got published in Nature!

3 more articles (by others) on statistical aspects of the replication crisis

A bunch of items came in today, all related to the replication crisis:

– Valentin Amrhein points us to this fifty-authored paper, “Manipulating the alpha level cannot cure significance testing – comments on Redefine statistical significance,” by Trafimow, Amrhein, et al., who make some points similar to those made by Blake McShane et al. here.

– Torbjørn Skardhamar points us to this paper, “The power of bias in economics research,” by Ioannidis, Stanley, and Doucouliagos, which is all about type M errors, but for a different audience (economics instead of psychology and statistics), so that’s a good thing.

– Jonathan Falk points us to this paper, “Consistency without Inference: Instrumental Variables in Practical Application,” by Alwyn Young, which argues, convincingly, that instrumental variables estimates are typically too noisy to be useful. Here’s the link to the replication crisis: If IV estimates are so noisy, how is it that people thought they were ok for so long? Because researchers had so many unrecognized degrees of freedom that they were able to routinely obtain statistical significance from IV estimates—and, traditionally, once you have statistical significance, you just assume, retrospectively, that your design had sufficient precision.

It’s good to see such a flood of articles of this sort. When it’s one or two at a time, the defenders of the status quo can try to ignore, dodge, or parry the criticism. But when it’s coming in from all directions, this perhaps will lead us to a new, healthy consensus.

“What is a sandpit?”

From Private Eye 1399, in Pseuds Corner:

What is a sandpit?

Sandpits are residential interactive workshops over five days involving 20-30 participants; the director, a team of expert mentors, and a number of independent stakeholders. Sandpits have a highly multidisciplinary mix of participants, some active researchers and others potential users of research outcomes, to drive lateral thinking and radical approaches to address research challenges. [continues for three pages]

Here’s the webpage, from the Engineering and Physical Sciences Research Council (U.K.).

That’s right, social scientists aren’t the only ones who have to put up with this sort of b.s.

And get this:

Due to group dynamics and continual evaluation it is not possible to ‘dip in and out’ of the process. Participants must stay for the whole duration of the event.

I just hope they let the participants go into town for the occasional meal, and they don’t stick them with cafeteria food for five straight days. Lateral thinking, indeed.

High five: “Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking.”

Eric Tassone writes:

Have you seen this? “Suns Tracking High Fives to Measure Team Camaraderie.”

I hate that “Iron Law” thing

Dahyeon Jeong wrote:

While I was reading your post today, “Some people are so easy to contact and some people aren’t,” I came across your older posts, including “Edlin’s rule for routinely scaling down published estimates.”

In this post you write:

Also, yeah, that Iron Law thing sounds horribly misleading. I’d not heard that particular term before, but I was aware of the misconception. I’ll wait on posting more about this now, as a colleague and I are already in the middle of a writing a paper on the topic.

I was especially curious about this, so I’ve searched your blog and CV, but I didn’t find a relevant follow-up post/article on this topic. If there’s indeed no post on this, I would really look forward to reading it at some point in the future.

Jeong’s email was in 2016, and my quote above is from 2014. In the meantime, Eric Loken and I finally wrote that paper: it came out early this year. Here’s our article, and here and here are some relevant blog posts.

So we do make progress. Slowly.

Fitting multilevel models when predictors and group effects correlate

Ryan Bain writes:

I came across your ‘Fitting Multilevel Models When Predictors and Group Effects Correlate‘ paper that you co-authored with Dr. Bafumi and read it with great interest. I am a current postgraduate student at the University of Glasgow writing a dissertation examining explanations of Euroscepticism at the individual and country level since the onset of the economic crisis. I employ multilevel modeling with two levels: individuals within states. As I am examining predictors of Euroscepticism at the country level, I employ random effects as individuals are clustered within countries. My supervisor pointed me in the direction of your paper as a means for controlling for omitted variable bias by ensuring that my country-level predictors are not correlated with my random effect parameter.

I recently discovered an article by Jonathan Kelley, M. D. R. Evans, Jennifer Lowman and Valerie Lykes: ‘Group-mean-centering independent variables in multi-level models is dangerous’. After working through a series of examples, the paper suggests that the practice be abandoned. The authors demonstrate that, after group mean centering individual-level independent variables, group mean centering country-level variables in regression models results in incorrect estimates of the coefficients for country-level (and individual-level) predictors. The authors summarise their doubts about the method on pg.15 in the ‘5 Summary’ section. However, all of their criticisms about the use of the method and the adverse consequences that group mean centering has on estimates of country-level predictors are based on models that also have the individual-level predictors group mean centered.

The authors of the article only briefly reference the purpose of group mean centering as a means of controlling for omitted variable bias at the contextual level, on pg.3 stating: “Raudenbush and Bryk (2002) also posit that group-mean centering can reduce bias in random component variance estimates”. That passing reference is all that the authors make in regards to the use of group mean centering for this purpose.

They also cite other authors who criticise the method but, again, all of their issues with the method relate to models in which individual-level predictors are centered. In ‘Centering Predictor Variables in Cross-Sectional Multilevel Models: A New Look at an Old Issue’ by Craig K. Enders and Davood Tofighi (2007), for example, the authors state on pg.121 that: “the centering of Level 2 (e.g., organizational level) variables is far less complex than the centering decisions required at Level 1, as it is only necessary to choose between the raw metric and CGM [centered at the grand mean]; CWC [centering within cluster, which the authors refer to as group mean centering] is not an option because each member of a given cluster shares the same value on the Level 2 predictor. Centering decisions at Level 2 generally mimic prescribed practice from the OLS regression literature (Aiken & West, 1991), so the focus of this article is on centering at Level 1. Throughout the remainder of the article, we assume that all Level 2 predictors are centered at their grand mean.”

Could you please provide any guidance on this matter? The Kelley et al. (2016) article has made me doubt the use of group mean centering for controlling for omitted variable bias, yet I am not sure if that was its intention for models in which only the country-level predictors were group mean centered.

My reply:

Yes, rather than thinking about centering the group means, I prefer to think about it as adding new predictors at the group level. In sociology they sometimes talk about individual and contextual effects, but more generally we can just speak predictively and say that the individual predictor and its group-level average can both be predictive of the outcome.
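Here is a minimal sketch in R of what I mean, using rstanarm; the variable names (y, x, country) are hypothetical, not from Bain’s data.

    library(rstanarm)

    # Add the country-level average of x as its own predictor
    dat$x_country_mean <- ave(dat$x, dat$country)   # per-country mean of x

    fit <- stan_glmer(
      y ~ x + x_country_mean + (1 | country),
      data = dat
    )

    # The coefficient on x is the within-country (individual) association;
    # the coefficient on x_country_mean captures the additional country-level
    # ("contextual") association. Nothing needs to be centered.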

Bain adds:

What I believe has happened with this paper is that the authors assert that the group mean centered individual level coefficients are inappropriate because the within effect introduces additional level 2 error. But the authors do not mean the within effect (they stay clear of this terminology but it is what their argument is referring to). They are actually discussing the difference between the within and between effect. Throughout their article the authors examined the mean of the correlated random effects (cre) model in their analysis which represents the between-within difference.

Essentially, because the authors have examined the effects of the mean of the cre model, they’ve compared and contrasted the coefficient of the mean of the individual-level variable of interest in the cre model with the original coefficient in a random effects model. With their focus on the mean – the difference between the within and between effect – they believed that this was the coefficient representing the within effect, which is why they (incorrectly) argued that the within effect is confounded with the level 2 error (the mean is what they focused on, and it obviously is confounded with the level 2 error in the cre model).

What should this student do? His bosses want him to p-hack and they don’t even know it!

Someone writes:

I’m currently a PhD student in the social sciences department of a university. I recently got involved with a group of professors working on a project which involved some costly data-collection. None of them have any real statistical prowess, so they came to me to perform their analyses, which I was happy to do. The problem? They want me to p-hack it, and they don’t even know it.

The project reads like one of your blog posts. The professors want to send this to a high-impact journal (they said Science, Nature, and The Lancet were their first three). There is no research question, and very little underlying theory. They essentially dumped the data on me and told me to email them when “you find something significant.” The worst part is, there is no malicious intent here and I don’t think they even know that they’re just fishing for p < .05. These are genuinely good, smart people who just want to do a cool study and get some recognition. I don’t know if you have any advice for handling this sort of situation.

My recommendation is to do the best analysis you can, given your time constraints. If there are many potential things to look at, you might want to fit a multilevel model.
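For concreteness, here is a minimal sketch in R of the kind of multilevel model I have in mind, assuming the many candidate outcomes have been stacked into long format; all variable names here are hypothetical.

    library(rstanarm)

    # long_data: one row per person-by-outcome, with
    #   value    the measured response
    #   outcome  which of the many outcomes the row belongs to
    #   treat    the predictor of interest
    fit <- stan_lmer(
      value ~ treat + (1 + treat | outcome),
      data = long_data
    )

    # Each outcome gets its own coefficient for treat, partially pooled toward
    # the overall average, instead of a pile of separate significance tests.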

In any case, write up what you did, make graphs of data and fitted model, give the manuscript to the professors and let them decide where to submit it.

You’ll have a lot more control over the project if you write up your findings as a real paper, with a title, abstract, paragraphs, data and methods section, results, conclusions, and graphs. Don’t just send them a bunch of printouts as if you’re some kind of cog in the machine. Write something up.

My guess is that your colleagues/supervisors will appreciate this: Writing up results is a lot of work, and a student who can write is valuable. Here are some tips on writing research articles.

It’s fine if these profs want to change your paper, or rewrite it, or incorporate it into their own work (as long as they give you appropriate coauthorship). If in all this manipulation they want to submit something you don’t like, for example if they start pulling out p-values and telling bogus stories, then tell them you’re not happy with this! Explain your problems forthrightly. Ultimately it might come to a breakup, but give these colleagues of yours a chance to do things right, and give yourself a chance to make a contribution. And if it doesn’t work out, walk away: at least you got some practice with data analysis and writing.

Stan Roundup, 10 November 2017

We’re in the heart of the academic season and there’s a lot going on.

  • James Ramsey reported a critical performance regression bug in Stan 2.17 (this affects the latest CmdStan and PyStan, not the latest RStan). Sean Talts and Daniel Lee diagnosed the underlying problem as being with the change from char* to std::string arguments—you can’t pass char* and rely on the implicit std::string constructor without the penalty of memory allocation and copying. The reversion goes back to how things were before with const char* arguments. Ben Goodrich is working with Sean Talts to cherry-pick the performance regression fix to Stan that led to a very slow 2.17 release for the other interfaces. RStan 2.17 should be out soon, and it will be the last pre-C++11 release. We’ve already opened the C++11 floodgates on our development branches (yoo-hoo!).

  • Quentin F. Gronau, Henrik Singmann, E. J. Wagenmakers released the bridgesampling package in R. Check out the arXiv paper. It runs with output from Stan and JAGS.

  • Andrew Gelman and Bob Carpenter’s proposal was approved by Coursera for a four-course introductory concentration on Bayesian statistics with Stan: 1. Bayesian Data Analysis (Andrew), 2. Markov Chain Monte Carlo (Bob), 3. Stan (Bob), 4. Multilevel Regression (Andrew). The plan is to finish the first two by late spring and the second two by the end of the summer in time for Fall 2018. Advait Rajagopal, an economics Ph.D. student at the New School, is going to be leading the exercise writing, managing the Coursera platform, and will also TA the first few iterations. We’ve left open the option for us or others to add a prequel and sequel, 0. Probability Theory, and 5. Advanced Modeling in Stan.

  • Dan Simpson is in town and dropped a casual hint that order statistics would clean up the discretization and binning issues that Sean Talts and crew were having with the simulation-based algorithm testing framework (aka the Cook-Gelman-Rubin diagnostics). Lo-and-behold, it works. Michael Betancourt worked through all the math on our (chalk!) board and I think they are now ready to proceed with the paper and recommendations for coding in Stan. As I’ve commented before, one of my favorite parts of working on Stan is watching the progress on this kind of thing from the next desk.

  • Michael Betancourt tweeted about using Andrei Kascha‘s javascript-based vector field visualization tool for visualizing Hamiltonian trajectories and with multiple trajectories, the Hamiltonian flow. Richard McElreath provides a link to visualizations of the fields for light, normal, and heavy-tailed distributions. The Cauchy’s particularly hypnotic, especially with many fewer particles and velocity highlighting.

  • Krzysztof Sakrejda finished the fixes for standalone function generation in C++. This lets you generate a double- and int-only version of a Stan function for inclusion in R (or elsewhere). This will go into RStan 2.18.

  • Sebastian Weber reports that the Annals of Applied Statistics paper, Bayesian aggregation of average data: An application in drug development, was finally formally accepted after two years in process. I think Michael Betancourt, Aki Vehtari, Daniel Lee, and Andrew Gelman are co-authors.

  • Aki Vehtari posted a case study for review on extreme-value analysis and user-defined functions in Stan [forum link — please comment there].

  • Aki Vehtari, Andrew Gelman and Jonah Gabry have made a major revision of the Pareto smoothed importance sampling paper, with an improved algorithm, new Monte Carlo error and convergence rate results, and new experiments with varying sample sizes and different functions. The next loo package release will use the new version.

  • Bob Carpenter (it’s weird writing about myself in the third person) posted a case study for review on Lotka-Volterra predator-prey population dynamics [forum link — please comment there].

  • Sebastian and Sean Talts led us through the MPI design decisions about whether to go with our own MPI map-reduce abstraction or just build the parallel map function we’re going to implement in the Stan language. Pending further review from someone with more MPI experience, the plan’s to implement the function directly, then worry about generalizing when we have more than one function to implement.

  • Matt Hoffman (inventor of the original NUTS algorithm and co-founder of Stan) dropped in on the Stan meeting this week and let us know he’s got an upcoming paper generalizing Hamiltonian Monte Carlo sampling and that his team at Google’s working on probabilistic modeling for Tensorflow.

  • Mitzi Morris, Ben Goodrich, Sean Talts and I sat down and hammered out the services spec for running the generated quantities block of a Stan program over the draws from a previous sample. This will decouple the model fitting process and the posterior predictive inference process (because the generated quantities block generates a ỹ according to p(ỹ | θ), where ỹ is a vector of predictive quantities and θ is the vector of model parameters). Mitzi then finished the coding and testing and it should be merged soon. She and Ben Bales are working on getting it into CmdStan and Ben Goodrich doesn’t think it’ll be hard to add to RStan.

  • Mitzi Morris extended the spatial case study with leave-one-out cross-validation and WAIC comparisons of the simple Poisson model, a heterogeneous random effects model, a spatial random effects model, and a combined heterogeneous and spatial model with two different prior configurations. I’m not sure if she posted the updated version yet (no, because Aki is also in town and suggested checking Pareto khats, which said no).

  • Sean Talts split out some of the longer tests for less frequent application to get distribution testing time down to 1.5 hours to improve flow of pull requests.

  • Sean Talts is taking another one for the team by leading the charge to auto-format the C++ code base and then proceed with pre-commit autoformat hooks. I think we’re almost there after a spirited discussion of readability and our ability to assess it.

  • Sean Talts also added precompiled headers to our unit and integration tests. This is a worthwhile speedup when running lots of tests and part of the order of magnitude speedup Sean’s eked out.

ps. some edits made by Aki

Noisy, heterogeneous data scoured from diverse sources make his meta-analyses stronger.

Kyle MacDonald writes:

I wondered if you’d heard of Purvesh Khatri’s work in computational immunology, profiled in this Q&A with Esther Landhuis at Quanta yesterday.

Elevator pitch is that he believes noisy, heterogeneous data scoured from diverse sources make his meta-analyses stronger. The thing that gave me the willies was this line:

“We start with dirty data,” he says. “If a signal sticks around despite the heterogeneity of the samples, you can bet you’ve actually found something.”

On the one hand, that seems like an almost verbatim restatement of your “what doesn’t kill my statistical significance makes it stronger” fallacy. On the other hand, he seems to use his methods purely to look for things to test empirically, rather than to draw conclusions based on the analysis, which is good, and might mean that the fallacy doesn’t apply. I also like his desire to look for connections that isolated groups might miss:

I realized that heart transplant surgeons, kidney transplant surgeons and lung transplant surgeons don’t really talk to each other!

I’d be interested in hearing your thoughts: worth the noise if he’s finding connections that no one would have thought to test?

My response:

I haven’t read Khatri’s research articles and I know next to nothing about this field of research so I can’t really say. Based on the above-quoted news article, the work looks great.

Regarding your question: On one hand, yes, it seems mistaken to have more confidence in one’s findings because the data were noisier. On the other hand, it’s not clear that by “dirty data,” he means “noisy data.” It seems that he just means “diverse data” from different settings. And there I agree that it should be better to include and model the variation (multilevel modeling!) than to study some narrow scenario. It also looks like good news that he uses training and holdout sets. That’s something we can’t always do in social science but should be possible in genetics where data are so plentiful.

“A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state,” and other notes on “Whither Science?” by Danko Antolovic

So. I got this email one day, promoting a book that came with the following blurb:

Whither Science?, by Danko Antolovic, is a series of essays that explore some of the questions facing modern science.

A short read at only 41 pages, Whither Science? looks into the fundamental questions about the purposes, practices and future of science. As a global endeavor, which influences all of contemporary life, science is still a human creation with historical origins and intellectual foundations. And like all things human, it has its faults, which must be accounted for.

It sounded like this guy might be a crank, but they sent me a free copy so I took a look. I read the book, and I liked it. It’s written in an unusual style, kinda like what you might expect from someone with a physics/chemistry background writing about social science and philosophy. But that’s ok.

Antolovic deserves to be recognized as the next Nassim Taleb—by which I mean a plain-speaking yet deep revealer of true structures, a philosophical autodidact with a unique combination of views.

The book is worth reading.

p.6, “Today, the practitioner of science is almost without exception an employee of a larger corporate entity (a university or a company) or of a national government. He is hemmed in by the tangible constraints of his terms of employment and funding, and by the less tangible ones of departmental, institutional and funding politics. He labors in a crowded field, in which there are increasingly fewer stones left unturned, and he climbs the ladder of corporate seniority until he retires.”

p.7, “After the Second World War, science went from being the province of the few to becoming the career path of many.”

“Since scientific development is fundamentally important to the well-being of modern societies, it is easy to see the benefits of exalting this decidedly un-adventurous walk of life with the help of a heroic foundation story. In the eyes of the supporting public, and in those of prospective practitioners, present-day science is the heir and descendant of the heroic achievements that dispelled the darkness of superstition, changed our image of the universe, and wonder-worked what we today know as the industrial world. And so it is, but we should examine the heir on his own merits.”

Well put. I have nothing to add.

“Market economy is usually held up as the paragon of a robust and efficient mechanism by which to produce and distribute wealth. For it to function, it must have a sufficiently large number of economic “players” (individuals and companies), and a pool of as of yet unowned resources – energy or raw materials – that are available for the taking. Players invest their labor, and their already owned wealth, to appropriate the resources; they work the raw resources into things that they and others consider valuable, and they trade with each other in the quest for greater wealth.”

“We must point out that the pool of unowned resources is an essential factor for the competitive market to exist: that is what the market players compete for, either directly, by extracting the resources themselves, or indirectly, by trading with others in the wealth derived from these resources.”

Compare to the hypothetical desert island whose inhabitants survive economically by taking in each other’s laundry. Or various poor countries, or poor regions of countries, that just don’t have enough unowned resources to go around. What economist Tyler Cowen calls Zero Marginal Product zones.

Just as fishing technology has allowed humans to grab all the fish, and oil drilling and coal mining technology threaten to remove that pool of unowned fossil fuel resources, so does economic development threaten to kill the golden goose etc.

I’ll have to think about this one. If it’s really true that economic exchange relies on that pool of unowned resources, then the market economy is self-defeating. Cultural contradictions of capitalism but in a different way. This is interesting because economists often recommend solving problems of unowned resources by giving them owners. Rhinos, fish, the disaster that was post-communist Russia. But to put it in Antolovic’s terms, “If the resources are intentionally distributed among the players, again by political means, we have a form of planned economy.”

So in that way a mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state.

I wonder what Jeff Sachs would think of this.

p.8, “The wealth of the participants is not tokenized by money, but by a less rigidly defined currency, which we will refer to as prestige. . . . Participants use their existing prestige to appropriate the funding resources, which they convert into further prestige via the process of performing scientific research. Direct trading in research results is proscribed as unethical, since the results must nominally be original and attributable to a researcher. However, scientific results are merely ancillary to the accumulation of prestige, and prestige is freely traded for labor and further prestige: this is the politics of who collaborates with whom, who is hired in which department or research laboratory etc. Typically, those with less prestige offer their labor to those with more, with the objective of increasing their own prestige and share of resources by association.”

Yup. That has the “anthropologist on Mars” ring of truth. It describes what goes on, what I and others do.

It’s important to be clear-eyed without being cynical. Prestige is the currency of science, but that does not mean that prestige is the reason we do things, nor does it mean that science is all about prestige. We have many goals in doing science, including discovery, serving societal goals, teamwork, and the joys of the scientific endeavor itself. Given that people will do crossword puzzles for diversion, it’s not such a stretch to think that science can be fun too.

One can draw an analogy to acting, where one could say the currency is fame or reputation; or professional team sports, where players are motivated both to win and to improve their personal statistics. To recognize certain goals should not be taken as to deny the existence of others.

To get back to science and its coins of prestige, I take Antolovic’s point to be, not that scientists are hypocrites to claim to seek discovery when they are nothing but careerists, or that scientists think themselves rational but are actually ruled by the same instincts, urges, and motivations that drive a society of bonobos, but rather that the accumulation and trading in of prestige is at this point a necessity for most scientists; it is baked into the scientific economy.

Consider my own case. I know myself well enough to recognize that I have an innate desire for prestige and acclaim. As a child I enjoyed being praised, and for decades now I’ve been thrilled when people come up to me and say they loved my talk, or that they’ve learned so much from my books. OK, fine. But that’s not why I do what I do. It’s more of a pleasant byproduct. I don’t choose what to work on based on what will give me more praise or happy feedback, except to the extent that I want my work to be useful to others—I am a statistician, after all!—in which case the beneficiaries of my labors might well choose to thank me, which is fine.

But—and here’s where Antolovic’s argument comes in—I do seek prestige, not so much for its own sake but because of what it can buy. Again, the prestige-as-money argument. I know some people for whom accumulation of money is a major goal in itself, but most of us want money for what it can buy, and for the security it can provide. Similarly, I seek the prestige and publications which will allow me to attract top collaborators and do the best work I can, and to get the funding to hire the programmers that can allow Stan to realize its destiny, thus advancing science and technology in ways that I would like.

Prestige is the coin. It is true that my collaborators and I accumulate prestige, which we convert into grant funding and then into research results. We play the game because we want to do science. Prestige is not, by and large, the goal in itself. Antolovic writes, “Infantile gratification of personal vanity cannot remain the primary motivation for doing science.” But I think he’s missing the point here. Prestige buys us money, and money amplifies our research efforts, so we go for prestige for sensible instrumental reasons. Maybe also infantile gratification, but that’s not the primary motivation. Any more than the primary motivation of businessmen is the infantile desire to hold shiny coins and green pieces of paper.

One striking feature of the current crisis in science is the panic of people such as that embodied-cognition guy who’d built up great stores of the stuff—thousands and thousands of citations!—only to see science moving away from the germanium standard, as it were. (I don’t enjoy the dilution of my own prestige, of course—my list of journals I’ve published in is looking more and more like a collection of vinyl records—but there’s nothing much I can do about it.)

The economic analogy works well. The realization that one can easily print more money leads to inflation, then a need for more money, then hyperinflation. Just look at the C.V.’s of recent computer science Ph.D.’s: there’s a pressure to publish dozens of conference papers a year. The field of statistics is more bimetallic, or multimetallic, with publications in various different sorts of journals. And, perhaps unsurprisingly, economics itself has, relatively speaking, remained a bastion of hard money, with the top five or so journals keeping much of their gold-standard status. (Which leads to troubles of its own, as in the career of Bruno Frey, and the recent brouhaha involving alleged insider favors in the American Economic Review.)

But I digress. What I want to say here is that I appreciate Antolovic’s insightful application of economic ideas to scientific research, and I hope that readers can get the point without getting lost in cynicism. Moving Stan forward costs a lot of money. Programmers need to be paid, and that means that I end up spending a lot of my time asking people for money.

To draw yet another analogy, the currency of baking is not flour or yeast but, by and large, money. A successful baker can raise the funds to buy higher-quality ingredients, to expand the bakery, to try out new recipes, and so forth, allowing more money to be raised, etc. Or he or she can run a small shop with no grander goals but will still need to make enough money to live on. But the goal of just about everyone involved (setting aside the pure hacks) is to make bread. The system must ultimately be evaluated based on the quality and quantity of bread produced (along with related concerns such as variety and sustainability).

p.12, “It is our thesis that the past half century or so has proven the bazaar-like approach to science a failure. This period has filled libraries with scientific publications to the point of bursting, while offering disappointingly little toward what has always been the underlying premise of the techno-scientific endeavor: betterment of the human condition. The great killing diseases of our time, cardiovascular disease and cancer, have remained with us through this period, and no fundamental approach to curing them is in sight. As the average population age creeps up, degenerative diseases of body and mind are becoming an ever greater economic drain, yet progress in that area moves at glacial pace. Even new infectious pathogens, such as HIV and the Ebola virus, seem to be more than what contemporary science can readily counter, despite very considerable advances in molecular and cell biology.”

Rather than argue the details of this, I want to remark on how refreshing this perspective is, to criticize the “bazaar-like approach” to anything. In a famous internet document from 1997, The Cathedral and the Bazaar, Eric Raymond contrasted the top-down and bottom-up or self-organizing approaches to construction and argued strongly and persuasively in favor of the latter. The cathedral is central planning, bureaucracy, and projects that take centuries to complete, at which point the original goals have become irrelevant. The bazaar is evolution, it’s competition, it’s small groups working together when they need to, and going their own way when appropriate.

In the context of scientific research, the cathedral is big research labs and PNAS; the bazaar is Arxiv and internet comment sections. Or is it the other way around?

Big research looks like a cathedral only from a distance; close-up it’s thousands of competing research groups. Meanwhile, Arxiv is run by a small group, and much of the discussion on the internet has been absorbed within the walls of Facebook.

Anyway, I don’t plan any cathedral/bazaar manifesto myself, I just wanted to register my interest in Antolovic’s refusal to hold a reflexive pro-bazaar position. Instead, he recommends scientific management at the national level, aimed at particular goals, rather than the current loose system where goals are stated but then money is given to research teams with little outside direction or management. I don’t know how well this will work, but the possibility seems worth looking into.

p.18, “Perhaps it is understandable that the supernatural has greater emotional traction in the human mind than the natural. The supernatural is the product of the mind itself, a story told to both stir and assuage the anxieties of a social animal: supernatural causes are always personal, they are somebody, good or evil. Empirical explanation, on the other hand, endeavors to discover causes that are unfamiliar, emotionally indifferent and invariably impersonal; there is, at the core of it, certain disappointing banality to every factual explanation.”

Well put.

p.19, “Religion, specifically Roman Christianity, is of course the arch-villain of the foundational narrative of science, but from the perspective of the empiricist, the conflict, or at least the intellectual part of the conflict, is entirely avoidable: insofar that religion asserts that certain doctrines are factually true without presenting factual evidence, that assertion is intellectually worthless. Any theological speculations that do not make factual claims are open to consideration, discussion or disregard, as one may wish, but science has no inherent conflict with them.”

“Intellectually worthless” is a bit too strong: if I come up with an assertion without presenting factual evidence, I still may be making a contribution if my assertion is taken as a hypothesis or if it inspires others to useful thought. Just as one can argue, for example, that Jules Verne could’ve made a useful intellectual contribution to undersea exploration, even had he decided to insist on the factual existence of Captain Nemo.

p.21, “Objections raised by romantic movements are substantive and conscientious, and they speak from the authority of their historical present. They do not represent a reflexive “opposition to progress,” but rather they are a legitimate effort of the human mind to come to terms with the full implications of the changing image of the world, emotional as much as rational; we regard the romantic periods as an integral part of the story of empiricism.”

p.22, “A new scientific theory must account for those facts that were understood under the old one before it ventures to offer new explanations.”

Not quite! Sometimes science can make progress, working around well-known anomalies that resist clean explanation in any existing framework. Indeed, it could well be that certain aspects of the real world will never be explainable by human theories. 1/137, anyone?

p.25, “Putting it in a straightforward way, secular ethics asks: How should I treat others? Should I “do unto others” as I would wish to be treated (or at least give them decent consideration), or should I do unto others whatever it takes to attain my own goals, goals which, in the absence of a credible supernatural authority, I am free to set however I please?”

Well put. Here’s my definition, from my first ethics column in Chance: “An ethics problem arises when you are considering an action that (a) benefits you or some cause you support, (b) hurts or reduces benefits to others, and (c) violates some rule.”

Antolovic continues, “Empirical observation convinces me that societies in which the golden rule is generally followed are happier, free of strife, and productive; reason tells me that I can live a good life in such a society, and it can guide me in contributing to its welfare, if I so choose. But reason also tells me that, under right circumstances and with right effort, I can acquire much more for myself by manipulating, destroying, robbing and enslaving others; the same reason will help me accomplish that objective also.”

p.29, “Since its 16th century beginnings, science has reached far and wide into the world of phenomena, and for perhaps a century now, it has continuously exploited its proven methods of investigation, making available an ever greater power over that phenomenal world. However, only a small fraction of its effort has been expended toward understanding the one thing which is both the source of scientific inquiry and the recipient of its fruits: the human mind.”

Not anymore, right? Neuroscience is a big deal these days. And psychology’s been a big deal for a while. Even more “external” social sciences have been turning inward; consider, for example, the claim by economists that theirs is the science of human behavior.

And then there’s computer science, machine learning, artificial intelligence.

But this: “We accept, and always have accepted, that procreative aggression of young males – the bellicosity of the rut – will be harnessed for state’s purposes, making them into cannon fodder for whatever cause is being fought about at the moment. We accept that civilized peoples can and will be coaxed back into the depths of pre-civilized horde loyalty and set against some conveniently chosen outside group as the ‘enemy.’ We observe public words and actions of decision makers of nuclear-armed nations, and we recognize in them thinly disguised impulses of the dominant animal in a primate horde – and we accept that as natural. We allow the fruits of technological progress to be used for vertiginous enrichment of individuals who are devoid of all but a boundless drive for acquisition, and we do not see this drive as a pathology, a personality disorder: rather, we see it as a trait to be envied and lived out vicariously through admiration.”

Ouch. As a human, I feel the shock of recognition.

p.32, “First truly scientific insights into the mind came with the work of Sigmund Freud. Freud’s method of investigation was not empirical observation, but rather introspection, but he used introspection as if it were empirical observation of external phenomena. He regarded the patients’ introspective monologues as authentic and reliable observables of the mind, although he did not treat them as literal reports, but as material to be analyzed. . . . Freud’s work has in it much that is speculative, and it does not (yet) exhibit the rigor of a developed scientific discipline . . .”

He’s no Freud-worshipper: “Proponents of psychoanalysis in their turn believe that dark subconscious impulses and conflicts can be resolved by reason, once they have been brought into the light of consciousness by analysis. In reality, psychoanalytical approach has been shown to have limited success even in its original role as a clinical therapy for neuroses, and it is entirely impractical to think that the ‘talking cure’ could be employed to lead the broader mankind out of instinctual darkness.”

But: “The contribution of [Freud’s] work lies in having proposed both a methodology and a set of working hypotheses in an area of science which is still deficient in that respect today.”

p.33, “Human governance throughout history has amounted mostly to murderous rule by individuals whose only claim to power was that they wanted it badly enough to fight for it; this accompanied by equally murderous sycophancy of the ruled, usually directed against the heretic, the infidel, the traitor to the cause, the ‘other.’ In modern times, unfettered overconsumption is practiced by most of the western populations, accompanied by equally grotesque over-accumulation of wealth and economic power by a few individuals. All of these behaviors can be readily recognized as driven by primitive instincts that were unilaterally freed from their natural constraints, their effects amplified by human power over nature.”

The rest of Antolovic’s book is interesting too.

Using D&D to reduce ethnic prejudice

OK, not quite D&D—I just wrote that to get Bob’s attention. It is a role-playing game, though!

Here’s the paper, “Seeing the World Through the Other’s Eye: An Online Intervention Reducing Ethnic Prejudice,” by Gabor Simonovits, Gabor Kezdi, and Peter Kardos:

We report the results of an intervention that targeted anti-Roma sentiment in Hungary using an online perspective-taking game. We evaluated the impact of this intervention using a randomized experiment in which a sample of young adults played this perspective-taking game, or an unrelated online game. Participation in the perspective-taking game markedly reduced prejudice, with an effect-size equivalent to half the difference between voters of the far-right and the center-right party. The effects persisted for at least a month, and, as a byproduct, the intervention also reduced antipathy toward refugees, another stigmatized group in Hungary, and decreased vote intentions for Hungary’s overtly racist, far-right party by 10%. Our study offers a proof-of-concept for a general class of interventions that could be adapted to different settings and implemented at low costs.

Simonovits wrote:

The paper is similar to some existing social psychology studies on perspective taking but we made an effort to improve on the credibility of the analysis by (1) using a relatively large sample (2) registering and following a pre-analysis plan (3) using pre-treatment measures to explore differential attrition and (4) estimating long term effects of the treatment. It got desk-rejected from PNAS and Psych Science but was just accepted for publication in APSR.

I agree that: (1) a large sample can’t hurt, (2) preregistration makes this sort of result much more believable, (3) using pre-treatment variables can be crucial in getting enough precision to estimate what you care about, and (4) richer outcome measures can help a lot.

When people proudly take ridiculous positions

Tom Wolfe on evolution:

I think it’s misleading to say that human beings evolved from animals. I mean, actually, nobody knows whether they did or not.

This is just sad. Does Wolfe really think this? My guess is he’s trying to do a solid for his political allies.

Jerry Coyne writes:

Somewhere on his mission to tear down the famous, elevate the neglected outsider and hit the exclamation-point key as often as possible, Wolfe has forgotten how to think.

Well put. But I think Wolfe does know how to think.

You know what they say, right? “Any prosecutor can convict a guilty man. It takes a great prosecutor to convict an innocent man.” Similarly, I think Wolfe takes it as a point of pride that, as a great writer, he can make the case for something as ridiculous as anti-Darwinism.

And, after all, who goes to Tom Wolfe to learn about science? The man’s an entertainer.

This is not to defend Wolfe’s statement, which is flat-out ridiculous, comparable to that of Kenneth Ludmerer, a professor of history and medicine at Washington University in St. Louis who testified that he had “no opinion” on whether cigarette smoking contributes to the development of lung cancer in human beings—and he said that in 2002, that’s right, 38 years after the Surgeon General’s report. I just think we should take it in context: Wolfe doesn’t give a damn about science but he cares a lot about politics, so he probably thinks it’s charming to say something ridiculous with a straight face, his way to give a poke in the eye to those pesky experts who know more than he does about something.

That’s right. Tom Wolfe is a low-rent G. K. Chesterton (or, to put it in modern terms, a witty, intelligent, socially conscious version of Michael Kinsley).

Using Stan to improve rice yields

Matt Espe writes:

Here is a new paper citing Stan and the rstanarm package.

Yield gap analysis of US rice production systems shows opportunities for improvement. Matthew B. Espe, Kenneth G. Cassman, Haishun Yang, Nicolas Guilpart, Patricio Grassini, Justin Van Wart, Merle Anders, Donn Beighley, Dustin Harrell, Steve Linscombe, Kent McKenzie, Randall Mutters, Lloyd T. Wilson, Bruce A. Linquist. Field Crops Research. Volume 196, September 2016, Pages 276–283.

Many thanks to everyone on the development team for some excellent tools!

I’ve not read the paper, but, hey, if Stan can improve U.S. rice yields by a factor of 1.5, that’s cool. Then all our research will have been worth it.

The Statistical Crisis in Science—and How to Move Forward (my talk next Monday 6pm at Columbia)

I’m speaking Mon 13 Nov, 6pm, at Low Library Rotunda at Columbia:

The Statistical Crisis in Science—and How to Move Forward

Using examples ranging from elections to birthdays to policy analysis, Professor Andrew Gelman will discuss ways in which statistical methods have failed, leading to a replication crisis in much of science, as well as directions for improvements through statistical methods that make use of more information.

Online reservation is required; follow the link (registration is currently full and closed). This will be a talk for a general audience.