Fitting the Besag, York, and Mollie spatial autoregression model with discrete data

Rudy Banerjee writes:

I am trying to use the Besag, York & Mollie 1991 (BYM) model to study the sociology of crime, where space/time plays a vital role. Since many of the variables and parameters are discrete in nature, is it possible to develop a BYM that uses an Integer Auto-regressive (INAR) process instead of just an AR process?

I’ve seen INAR(1) modeling, even a spatial INAR or SINAR paper, but they seem to be different from the way BYM is specified in the Bayes framework.

Does it even make sense to have a BYM that is INAR? I can think of discrete jumps in independent variables that affect the dependent variable in discrete jumps. (Also, do these models violate convexity requirements often required for statistical computing?)

My reply:

1. To see how to fit this sort of model in a flexible way, see this Stan case study, Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data, from Mitzi Morris.

2. Rather than trying to get cute with your discrete modeling, I’d suggest a simple two-level approach, where you use an underlying continuous model (use whatever space-time process you want, BYM or whatever) and then you can have a discrete data model (for example, negative binomial, that is, overdispersed Poisson) on top of that.
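To make the two-level structure concrete, here is a minimal simulation sketch in Python (not Stan, and not the BYM model itself): a smooth “structured” trend plus independent noise stands in for the underlying continuous spatial process, and a negative binomial (overdispersed Poisson) observation model sits on top. The one-dimensional “map” and all the numbers are hypothetical.

```python
import math
import random

random.seed(1)

def sample_negative_binomial(mean, phi):
    """Negative binomial via the Poisson-Gamma mixture:
    rate ~ Gamma(shape=phi, scale=mean/phi), then y ~ Poisson(rate)."""
    rate = random.gammavariate(phi, mean / phi)
    # Poisson sampler (Knuth's method; fine for small rates)
    L = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Hypothetical 1-D "map": each area i gets a latent continuous effect
# built from a smooth spatial trend plus independent noise, standing in
# for the BYM structured + unstructured components.
n_areas = 20
spatial = [math.sin(i / 3.0) for i in range(n_areas)]        # structured
hetero = [random.gauss(0.0, 0.3) for _ in range(n_areas)]    # unstructured
log_rate = [0.5 + s + h for s, h in zip(spatial, hetero)]

# The discrete data model sits on top of the continuous process.
counts = [sample_negative_binomial(math.exp(lr), phi=5.0) for lr in log_rate]
print(counts)  # simulated overdispersed counts, one per area
```

The point of the sketch is only the layering: everything continuous happens on the log-rate scale, and discreteness enters solely through the observation model.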

Stan development in RStudio

Check this out! RStudio now has special features for Stan:

– Improved, context-aware autocompletion for Stan files and chunks

– A document outline, which allows for easy navigation between Stan code blocks

– Inline diagnostics, which help to find issues while you develop your Stan model

– The ability to interrupt Stan parallel workers launched within the IDE

This is awesome—especially that last feature. RStudio is my hero.

And don’t forget this: If you don’t currently have Stan on your computer, you can play with this demo version on the web, thanks to RStudio Cloud.

David Brooks discovers Red State Blue State Rich State Poor State!

The New York Times columnist writes:

Our political conflict is primarily a rich, white civil war. It’s between privileged progressives and privileged conservatives. You could say that tribalism is the fruit of privilege. People with more stresses in their lives necessarily pay less attention to politics. . . .

I’ve had some differences with Brooks in the past, but when he agrees with me, I’m not gonna complain.

As David Park, Boris Shor, Joe Bafumi, Jeronimo Cortina, and I wrote ten years ago:

The cultural divide of the two Americas looms larger at high incomes . . . A theme throughout this book is that the cultural differences between states—the things that make red and blue America feel like different places—boil down to differences among richer people in these states.


Consistent with Ronald Inglehart’s hypothesis of postmaterialism, survey data show social issues to be more important to higher-income voters. This can be viewed as a matter of psychology and economics, with social policy as a luxury that can be afforded once you have achieved some material comfort. Our results stand in contradiction to the commonly held idea that social issues distract lower-income voters from their natural economic concerns.

It took a few years, but it seems that our ideas have finally become conventional wisdom.

The AAA tranche of subprime science, revisited

Tom Daula points us to this article, “Mortgage-Backed Securities and the Financial Crisis of 2008: A Post Mortem,” by Juan Ospina and Harald Uhlig. Not our usual topic at this blog, but then there’s this bit on page 11:

We break down the analysis by market segment defined by loan type (Prime, Alt-A, and Subprime). Table 5 shows the results and documents the third fact: the subprime AAA-rated RMBS did particularly well. AAA-rated Subprime Mortgage Backed Securities were the safest securities among the non-agency RMBS market. As of December 2013 the principal-weighted loss rates of AAA-rated subprime securities were on average 0.42% [2.2% for Prime AAA same page]. We do not deny that even the seemingly small loss of 0.42% should be considered large for any given AAA security.

Nonetheless, we consider this to be a surprising fact given the conventional narrative for the causes of the financial crisis and its assignment of the considerable blame to the subprime market and its mortgage-backed securities. An example of this narrative is provided by Gelman and Loken (2014):

We have in mind an analogy with the notorious AAA-class bonds created during the mid-2000s that led to the subprime mortgage crisis. Lower-quality mortgages [that is, mortgages with high probability of default and, thus, high uncertainty] were packaged and transformed into financial instruments that were (in retrospect, falsely) characterized as low risk.

OK, our paper wasn’t actually about mortgages; it was about statistics. We were just using mortgages as an example. But if Ospina and Uhlig are correct, we were mistaken in using AAA-rated subprime mortgages as an example of a bad bet. Analogies are tricky things!

P.S. Daula adds:

Overall, I think it fits your data collection/measurement theme, and how doing that well can provide novel insights. In that vein, they provide a lot of detail to replicate the results, in case folks disagree. There’s the technical appendix which (p.39) “serves as guide for replication and for understanding the contents of our database” as well as (p.7) a replication kit available from the authors. As to the latter, (p.15) footnote 15 guides the reader to exactly where to look for the one bit of modeling in the paper (“For a detailed list of the covariates employed, refer to MBS Project/Replication/DefaultsAnalysis/Step7”).

Toward better measurement in K-12 education research

Billy Buchanan, Director of Data, Research, and Accountability, Fayette County Public Schools, Lexington, Kentucky, expresses frustration with the disconnect between the large and important goals of education research, on one hand, and the gaps in measurement and statistical training, on the other.

Buchanan writes:

I don’t think that every classroom educator, instructional coach, principal, or central office administrator needs to be an expert on measurement. I do, however, think that if we are training individuals to be researchers (e.g., PhDs) we have a duty to make sure they are able to conduct the best possible research, understand the various caveats and limitations to their studies, and especially understand how measurement – as the foundational component of all research across all disciplines (yes, even qualitative research) – affects the inferences derived from their data. . . .

In essence, if a researcher wants to use an existing vetted measurement tool for research purposes, they should already have access to the technical documentation and can provide the information up front so as our staff reviews the request we can also evaluate whether or not the measurement tool is appropriate for the study. If a researcher wants to use their own measure, we want them to be prepared to provide sufficient information about the validity of their measurement tool so we can ensure that they publish valid results from their study; this also has an added benefit for the researcher by essentially motivating them to generate another paper – or two or three – from their single study.

He provides some links to resources, and then he continues:

I would like to encourage other K-12 school districts to join with us in requiring researchers to do the highest quality research possible. I, for one, at least feel that the students, families, and communities that we serve deserve nothing less than the best and believe if you feel the same that your organization would adopt similar practices. You can find our policies here: Fayette County Public Schools Research Requests. Or if you would like to use our documentation/information and/or contribute to what we currently have, you can submit an issue to our documentation’s GitHub Repository.

He had a sudden cardiac arrest. How does this change the probability that he has a particular genetic condition?

Megan McArdle writes:

I have a friend with a probability problem I don’t know how to solve. He’s 37 and just keeled over with sudden cardiac arrest, and is trying to figure out how to assess the probability that he has a given condition as his doctors work through his case. He knows I’ve been sharply critical of doctors’ failures to properly assess the Type I/Type II tradeoff, so he reached out to me, but we quickly got into math questions above my pay grade, so I volunteered to ask if you would sketch out the correct statistical approach.

To be clear, he’s an engineer, so he’s not asking you to do the work for him! Just to sketch out in a few words how he might approach information gathering and setting up a problem like “given that you’ve had sudden cardiac arrest, what’s the likelihood that a result on a particular genetic test is a false positive?”

My reply:

I agree that the conditional probability should change, given the knowledge that he had the cardiac arrest. Unfortunately it’s hard for me to be helpful here because there are too many moving parts: of course the probability of the heart attack conditional on having the condition or not, but also the relevance of the genetic test to his health condition. This is the kind of problem that is addressed in the medical decision making literature, but I don’t think I have anything useful to add here, beyond emphasizing that the calculation of any such probability is an intermediate step in this person’s goal of figuring out what he should do next regarding his heart condition.

I’m posting the question here, in case any of you can point to useful materials on this. In addition to the patient’s immediate steps in staying alive and healthy, this is a general statistical issue that has to be coming up in medical testing all the time, in that tests are often done in the context of something that happened to you, so maybe there is some general resource on this topic?
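For what it’s worth, the basic structure of the two-stage update can be sketched with Bayes’ rule, using entirely made-up numbers: first condition the prevalence on the cardiac arrest, then use that posterior as the prior when interpreting the genetic test (assuming, unrealistically, that the test result and the arrest are conditionally independent given the condition). Every number below is hypothetical.

```python
# All numbers are hypothetical, purely to show the structure of the update.
p_condition = 0.001            # baseline prevalence in 37-year-olds
p_arrest_given_cond = 0.10     # P(sudden cardiac arrest | condition)
p_arrest_given_not = 0.0005    # P(sudden cardiac arrest | no condition)

# Step 1: update the prior on the condition, given the arrest.
num = p_condition * p_arrest_given_cond
den = num + (1 - p_condition) * p_arrest_given_not
p_cond_given_arrest = num / den

# Step 2: feed that posterior in as the prior for the genetic test.
sensitivity = 0.95             # P(test + | condition)
false_positive = 0.02          # P(test + | no condition)
num2 = p_cond_given_arrest * sensitivity
den2 = num2 + (1 - p_cond_given_arrest) * false_positive
p_cond_given_both = num2 / den2

print(round(p_cond_given_arrest, 3))
print(round(p_cond_given_both, 3))
```

With these invented numbers the arrest alone moves the probability from 0.1% to roughly 17%, and a positive test then moves it to roughly 90%; the qualitative lesson is just that conditioning on the event that triggered the test can change the interpretation of a “positive” enormously.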

Understanding Chicago’s homicide spike; comparisons to other cities

Michael Masinter writes:

As a longtime blog reader sufficiently wise not to post beyond my academic discipline, I hope you might take a look at what seems to me to be a highly controversial attempt to use regression analysis to blame the ACLU for the recent rise in homicides in Chicago. A summary appears here with followup here.

I [Masinter] am skeptical, but loathe to make the common mistake of assuming that my law degree makes me an expert in all disciplines. I’d be curious to know what you think of the claim.

The research article is called “What Caused the 2016 Chicago Homicide Spike? An Empirical Examination of the ‘ACLU Effect’ and the Role of Stop and Frisks in Preventing Gun Violence,” and is by Paul Cassell and Richard Fowles.

I asked Masinter what were the reasons for his skepticism, and he replied:

Two reasons, one observational, one methodological, and a general sense of skepticism.

Other cities have restricted the use of stop and frisk practices without a corresponding increase in homicide rates.

Quite a few variables appear to be collinear. The authors claim that Bayesian Model Averaging controls for that, but I lack the expertise to assess their claim.

More generally, their claim of causation appears a bit like post hoc ergo propter hoc reasoning dressed up in quantitative analysis.

Perhaps I am wrong, but I have come to be skeptical of such large claims.

I asked Masinter if it would be ok to quote him, and he responded:

Yes, but please note my acknowledged lack of expertise. Too many of my law prof colleagues think our JD qualifies us as experts in all disciplines, a presumption magnified by our habit of advocacy.

I replied, “Hey, don’t they know that the only universal experts are M.D.’s and, sometimes, economists?”, and Masinter then pointed me to this wonderful article by Arthur Allen Leff from 1974:

With the publication of Richard A. Posner’s Economic Analysis of Law, that field of learning known as “Law and Economics” has reached a stage of extended explicitness that requires and permits extended and explicit comment. . . . I was more than half way through the book before it came to me: as a matter of literary genre (though most likely not as a matter of literary influence) the closest analogue to Economic Analysis of Law is the picaresque novel.

Think of the great ones, Tom Jones, for instance, or Huckleberry Finn, or Don Quixote. In each case the eponymous hero sets out into a world of complexity and brings to bear on successive segments of it the power of his own particular personal vision. The world presents itself as a series of problems; to each problem that vision acts as a form of solution; and the problem having been dispatched, our hero passes on to the next adventure. The particular interactions are essentially invariant because the central vision is single. No matter what comes up or comes by, Tom’s sensual vigor, Huck’s cynical innocence, or the Don’s aggressive romanticism is brought into play . . .

Richard Posner’s hero is also eponymous. He is Economic Analysis. In the book we watch him ride out into the world of law, encountering one after another almost all of the ambiguous villains of legal thought, from the fire-spewing choo-choo dragon to the multi-headed ogre who imprisons fair Efficiency in his castle keep for stupid and selfish reasons. . . .

One should not knock the genre. To hold the mind-set constant while the world is played in manageable chunks before its searching single light is a powerful analytic idea, the literary equivalent of dropping a hundred metals successively into the same acid to see what happens. The analytic move, just as a strategy, has its uses, no matter which mind-set is chosen, be it ethics or psychology or economics or even law. . . .

Leff quotes Posner:

Efficiency is a technical term: it means exploiting economic resources in such a way that human satisfaction as measured by aggregate consumer willingness to pay for goods and services is maximized. Value too is defined by willingness to pay.

And then Leff continues:

Given this initial position about the autonomy of people’s aims and their qualified rationality in reaching them, one is struck by the picture of American society presented by Posner. For it seems to be one which regulates its affairs in rather a bizarre fashion: it has created one grand system—the market, and those market-supportive aspects of law (notably “common,” judge-made law)—which is almost flawless in achieving human happiness; it has created another—the political process, and the rest of “the law” (roughly legislation and administration)—which is apparently almost wholly pernicious of those aims.

An anthropologist coming upon such a society would be fascinated. It would seem to him like one of those cultures which, existing in a country of plenty, having developed mechanisms to adjust all intracultural disputes in peace and harmony, lacking any important enemies, nevertheless comes up with some set of practices, a religion say, simultaneously so barbaric and all-pervasive as to poison almost every moment of what would otherwise appear to hold potential for the purest existential joy. If he were a bad anthropologist, he would cluck and depart. If he were a real anthropologist, I suspect he would instead stay and wonder what it was about the culture that he was missing. That is, he would ponder why they wanted that religion, what was in it for them, what it looked like and felt like to be inside the culture. And he would be especially careful not to stand outside and scoff if, like Posner, he too leaned to the proposition that “most people in most affairs of life are guided by what they conceive to be their self-interest and . . . choose means reasonably (not perfectly) designed to promote it.”

I’m reminded of our discussion from a few years ago regarding the two ways of thinking in economics.

Economists are often making one of two arguments:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-schools teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

Despite going in opposite directions, Arguments 1 and 2 have a similarity, in that both are used to make the point that economists occupy a superior position from which they can evaluate the actions of others.

But Leff said it all back in 1974.

P.S. We seem to have mislaid our original topic, which is the study of the change in homicide rates in Chicago. I read the article by Cassell and Fowles and their analysis of Chicago seemed reasonable to me. At the same time, I agree with Masinter that the comparisons to other cities didn’t seem so compelling.

Limitations of “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”

“If you will believe in your heart and confess with your lips, surely you will be saved one day” (The Mountain Goats, paraphrasing Romans 10:9)

One of the weird things about working with people a lot is that it doesn’t always translate into multiple opportunities to see them talk. I’m pretty sure the only time I’ve seen Andrew talk was at a fancy lecture he gave at Columbia. He talked about many things that day, but the one that stuck with me (because I’d not heard it phrased that well before; as a side note, this is my memory of the gist of what he was saying, so do not hold him to this opinion!) was that the problem with p-values and null-hypothesis significance testing wasn’t so much that the procedure was bad. The problem is that people are taught to believe that there exists a procedure that can, given any set of data, produce a “yes/no” answer to a fairly difficult question. So the problem isn’t the specific decision rule that NHST produces, so much as the idea that a universally applicable decision rule exists at all. (And yes, I know the maths. But the problem with p-values was never the maths.)

This popped into my head again this week as Aki, Andrew, Yuling, and I were working on a discussion to Gronau and Wagenmakers’ (GW) paper “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”.

Our discussion is titled “Limitations of ‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection'” and it extends various points that Aki and I have made at various points on this blog.

To summarize our key points:

  1. It is a bad thing for GW to introduce LOO model selection in a way that doesn’t account for its randomness. In their very specialized examples this turns out not to matter because they choose such odd data that the LOO estimates have zero variance. But it is nevertheless bad practice.
  2. Stacking is a way to get model weights that is more in line with the LOO-predictive concept than GW’s ad hoc pseudo-BMA weights. Although stacking is also not consistent for nested models, in the cases considered in GW’s paper it consistently picks the correct model. In fact, the model weight for the true model in each of their cases is w_0=1 independent of the number of data points.
  3. By not recognizing this, GW missed an opportunity to discuss the limitations of the assumptions underlying LOO (namely that the observed data is representative of the future data, and each individual data point is conditionally exchangeable).  We spent some time laying these out and proposed some modifications to their experiments that would make these limitations clearer.
  4. Because LOO is formulated under much weaker assumptions than those used in this paper (namely, LOO does not assume that the data is generated by one of the models under consideration, the so-called “M-Closed assumption”), it is a little odd that GW only assess its performance under this assumption. This assumption almost never holds. If you’ve ever used the famous George Box quote, you’ve explicitly stated that the M-Closed assumption does not hold!
  5. GW’s assertion that when two models can make identical predictions (such as in the case of nested models), the simplest model should be preferred is not a universal truth, but rather a specific choice that is being made. This can be enforced for LOO methods, but like all choices in statistical modelling, it shouldn’t be made automatically or by authority, but should instead be critically assessed in the context of the task being performed.

All of this has made me think about the idea of doing model selection. Or, more specifically, it’s made me question whether or not we should try to find universal tools for solving this problem. Is model selection even possible? (Danielle Navarro from UNSW has a particularly excellent blog post outlining her experiences with various existing model selection methods that you all should read.)

So I guess my very nebulous view is that we can’t do model selection, but we can’t not do model selection, but we also can’t not not do model selection.

In the end we need to work out how to do model selection for specific circumstances and to think critically about our assumptions. LOO helps us do some of that work.

To close off, I’m going to reproduce the final section of our paper because what’s the point of having a blog post (or writing a discussion) if you can’t have a bit of fun.

Can you do open science with M-Closed tools?

One of the great joys of writing a discussion is that we can pose a very difficult question that we have no real intention of answering. The question that is well worth pondering is the extent to which our chosen statistical tools influence how scientific decisions are made. And it’s relevant in this context because a key difference between model selection tools based on LOO and tools based on marginal likelihoods is what happens when none of the models could reasonably generate the data.

In this context, marginal likelihood-based model selection tools will, as the amount of data increases, choose the model that best represents the data, even if it doesn’t represent the data particularly well. LOO-based methods, on the other hand, are quite comfortable expressing that they cannot determine a single model that should be selected. To put it more bluntly, marginal likelihood will always confidently select the wrong model, while LOO is able to express that no one model is correct.

We leave it for each individual statistician to work out how the shortcomings of marginal likelihood-based model selection balance with the shortcomings of cross-validation methods. There is no simple answer.

Stan on the web! (thanks to RStudio)

This is big news. Thanks to RStudio, you can now run Stan effortlessly on the web.

So you can get started on Stan without any investment in set-up time, no need to install C++ on your computer, etc.

As Ben Goodrich writes, “RStudio Cloud is particularly useful for Stan tutorials where a lot of time can otherwise be wasted getting C++ toolchains installed and configured on everyone’s laptops.”

To get started with a simple example, just click here and log in.

We’ve pre-loaded this particular RStudio session with a regression model and an R script to simulate fake data and run the model. In your online RStudio Cloud session (which will appear within your browser when you click the above link), just go to the lower-right window with Files, and click on simple_regression.stan and simple_regression.R. This will open up those files. Run simple_regression.R and it will simulate the data, run the Stan program, and produce a graph.

Now you can play around.

Create your own Stan program: just work in the upper-left window, click on File, New File, Stan File, then click on File, Save As, and give it a .stan extension. The RStudio editor already has highlighting and autocomplete for Stan files.

Or edit the existing Stan program (simple_regression.stan, sitting there in the lower right of your RStudio Cloud window), change it how you’d like, then edit the R script or create a new one. You can upload data to your session too.

When you run a new or newly edited Stan program, it will take some time to compile. But then next time you run it, R will access the compiled version and it will run faster.

You can also save your session and get back to it later.

Some jargon for ya—but I mean every word of it!

This is a real game-changer as it significantly lowers the barriers to entry for people to start using Stan.


Ultimately I recommend you set up Stan on your own computer so you have full control over your modeling, but RStudio’s Cloud is a wonderful way to get started.

Here’s what RStudio says:

Each project is allocated 1GB of RAM. Each account is allocated one private space, with up to 3 members and 5 projects. You can submit a request to the RStudio Cloud team for more capacity if you hit one of these space limits, and we will do our best to accommodate you. If you are using a Professional 2 account, you will not encounter these space limits.

In addition to the private space (where you can collaborate with a selected group of other users), every user also gets a personal workspace (titled “Your Workspace”), where there is virtually no limit to the number of projects you can create. Only you can work on projects in your personal workspace, but you can grant view & copy access to them to any other RStudio Cloud user.

This is just amazing. I’m not the most computer-savvy user, but I was able to get this working right away.

Ben adds:

It also comes with
* The other 500-ish examples in the examples/ subdirectory
* Most of the R packages that use Stan, including McElreath’s rethinking package from GitHub and all the ones under stan-dev, namely
– rstanarm (comes with compiled Stan programs for regression)
– brms (generates Stan programs for regression)
– projpred (for selecting a submodel of a GLM)
– bayesplot and shinystan (for visualizing posterior output)
– loo (for model comparison using expected log predictive density)
– rstantools (for creating your own R packages like rstanarm)
* Saving new / modified compiled Stan programs to the disk to use across sessions first requires the user to do rstan_options(auto_write = TRUE)

I’m so excited. You can now play with Stan yourself with no start-up effort. Or, if you’re already a Stan user, you can demonstrate it to your friends. Also, you can put your own favorite models in an RStudio Cloud environment (as Ben did for my simple regression model) and then share the link with other people, who can use your model on your data, upload their own data, alter your model, etc.

P.S. It seems that for now this is really only for playing around with very simple models, to give you a taste of Stan, and then you’ll have to install it on your computer to do more. See this post from Aki. That’s fine, as this demo is intended to be a show horse, not a work horse. I think there is also a way to pay RStudio for cycles on the cloud and then you can run bigger Stan models through the web interface. So that could be an option too, for example if you want to use Stan as a back-end for some computing that you’d like others to access remotely.

Why are functional programming languages so popular in the programming languages community?

Matthijs Vákár writes:

Re the popularity of functional programming and Church-style languages in the programming languages community: there is a strong sentiment in that community that functional programming provides important high-level primitives that make it easy to write correct programs. This is because functional code tends to be very short and easy to reason about because of referential transparency, making it quick to review and maintain. This becomes even more important in domains where correctness is non-trivial, like in the presence of probabilistic choice and conditioning (or concurrency).

The sense – I think – is that for the vast majority of your code, correctness and programmer-time are more important considerations than run-time efficiency. A typical workflow should consist of quickly writing an obviously correct but probably inefficient program, profiling it to locate the bottlenecks and finally – but only then – thinking hard about how to use computational effects like mutable state and concurrency to optimise those bottlenecks without sacrificing correctness.

There is also a sense that compilers will get smarter with time, which should result in pretty fast functional code, allowing the programmer to think more about what she wants to do rather than about the how. I think you could also see the lack of concern for performance of inference algorithms in this light. This community is primarily concerned with correctness over efficiency. I’m sure they will get to more efficient inference algorithms eventually. (Don’t forget that they are not statisticians by training so it may take some time for knowledge about inference algorithms to percolate into the PL community.)

Justified or not, there is a real conviction in the programming languages community that functional ideas will become more and more important in mainstream programming. People will point to the increase of functional features in Scala, Java, C# and even C++. Also, people note that OCaml and Haskell these days are getting to the point where they are really quite fast. Jane Street would not be using OCaml if it weren’t competitive. In fact, one of the reasons they use it – as I understand it – is that their code involves a lot of maths which functional languages lend themselves well to writing. Presumably, this is also part of the motivation of using functional programming for prob prog? In a way, functional programming feels closer to maths.

Another reason I think people care about language designs in the style of Church is that they make it easy to write certain programs that are hard in Stan. For instance, perhaps you care about some likelihood-free model, like a model where some probabilistic choices are made, then a complicated deterministic program is run and only then the results are observed (conditioned on). An example that we have here in Oxford is a simulation of the standard model of particle physics. This effectively involves pushing forward your priors through some complicated simulator and then observing. It would be difficult to write down a likelihood for this model. Relatedly, when I was talking to Erik Meijer, I got the sense that Facebook wants to integrate prob prog into existing large software systems. The resulting models would not necessarily have a smooth likelihood wrt the Lebesgue measure. I am told that in some cases inference in these models is quite possible with rejection sampling or importance sampling or some methods that you might consider savage. The models might not necessarily be very hard statistically, but they are hard computationally in the sense that they do involve running a non-trivial program. (Note though that this has less to do with functional programming than it does with combining probabilistic constructs with more general programming constructs!)

A different example that people often bring up is given by models where observations might be of variable length, like in language models. Of course, you could do it in Stan with padding, but some people don’t seem to like this.

Of course, there is the question of whether all of this effort is wasted if eventually inference is completely intractable in basically all but the simplest models. What I would hope to eventually see – and many people in the PL community with me – is a high-level language in which you can do serious programming (like in OCaml) and have access to serious inference capabilities (like in Stan). Ideally, the compiler should decide/give hints as to which inference algorithms to try — use NUTS when you can, but otherwise back off and try something else. And there should be plenty of diagnostics to figure out when to distrust your results.

Finally, though, something I have to note is that programming language people are the folks writing compilers. And compilers are basically the one thing that functional languages do best because of their good support for user-defined data structures like trees and recursing over such data structures using pattern matching. Obviously, therefore, programming language folks are going to favour functional languages. Similarly, because of their rich type systems and high-level abstractions like higher-order functions, polymorphism and abstract data types, functional languages serve as great hosts for DSLs like PPLs. They make it super easy for the implementer to write a PPL even if they are a single academic and do not have the team required to write a C++ project.

The question now is whether they also genuinely make things easier for the end user. I believe they ultimately have the potential to do so, provided that you have a good optimising compiler, especially if you are trying to write a complicated mathematical program.

I replied:

On one hand, people are setting up models in Church etc. that they may never be able to fit—it’s not just that Church etc are too slow to fit these models, it’s that they’re unidentified or just don’t make sense or have never really been thought out. (My problem here is not with the programming language called Church; I’m just here concerned about the difficulty with fitting any model correctly the first time, hence the need for flexible tools for model fitting and exploration.)

But, on the other hand, people are doing inference and making decisions all the time, using crude regressions or t-tests or whatever.

To me, it’s close to meaningless for someone to say they can write “an obviously correct but probably inefficient program” if the program is fitting a model that you can’t really fit!

Matthijs responded:

When I say “write an obviously correct but probably inefficient program”, I am talking about usual functional programming in a deterministic setting. I think many programming language (PL) folks do not fully appreciate what correctness for a probabilistic program means. I think they are happy when they can prove that their program asymptotically does the right thing. I do not think they quite realise yet that that gives no guarantees about what to think about your results when you run your program for a finite amount of time. It sounds silly, but I think at least part of the community is still to realise that. Personally, I expect that people will wake up eventually and will realise that correctness in a probabilistic setting is a much more hairy beast. In particular, I expect that the PL community will realise that run-time diagnostics and model checking are the things that they should have been doing all along. I am hopeful though that at some point enough knowledge of statistics will trickle through to them such that genuinely useful collaboration is possible. I strongly feel that there is mostly just a lot of confusion rather than wilful disregard.

The Golden Rule of Nudge

Nudge unto others as you would have them nudge unto you.

Do not recommend to apply incentives to others that you would not want for yourself.


I was reading this article by William Davies about Britain’s Kafkaesque immigration policies.

The background, roughly, is this: Various English politicians promised that the net flow of immigrants to the U.K. would decrease. But the British government had little control over the flow of immigrants to the U.K., because most of them were coming from other E.U. countries. So they’d pretty much made a promise they couldn’t keep. The only way the government could even try to reach their targets was to convince people currently living in the U.K. to leave. And one way to do this was to make their lives more difficult. Apparently the government focused this effort on people living in Britain who’d originally come from former British colonies in the Caribbean, throwing paperwork at them and threatening their immigration status.

Davies explains:

The Windrush generation’s immigration status should never have been in question, and the cause of their predicament is recent: the 2014 Immigration Act, which contained the flagship policies of the then home secretary, Theresa May. Foremost among them was the plan to create a ‘hostile environment’, with the aim of making it harder for illegal immigrants to work and live in the UK . . .

It’s almost as if, on discovering that law alone was too blunt an instrument for deterring and excluding immigrants, May decided to weaponise paperwork instead. The ‘hostile environment’ strategy was never presented just as an effective way of identifying and deporting illegal immigrants: more important, it was intended as a way of destroying their ability to build normal lives. The hope, it seemed, was that people might decide that living in Britain wasn’t worth the hassle. . . .

The thread linking benefit sanctions and the ‘hostile environment’ is that both are policies designed to alter behaviour dressed up as audits. Neither is really concerned with accumulating information per se: the idea is to use a process of constant auditing to punish and deter . . .

And then he continues:

The coalition government was fond of the idea of ‘nudges’, interventions that seek to change behaviour by subtle manipulation of the way things look and feel, rather than through regulation. Nudgers celebrate the sunnier success stories, such as getting more people to recycle or to quit smoking, but it’s easy to see how the same mentality might be applied in a more menacing way.

Nudge for thee but not for me?

This all reminds me of a general phenomenon, that “incentives matter” always seems like a good motto for other people, but we rarely seem to want it for ourselves. For example, there’s lots of talk about how being worried about losing your job is a good motivator to work hard (or, conversely, that a secure job is a recipe for laziness), but we don’t want job insecurity for ourselves. (Sure, lots of people who aren’t tenured professors or secure government employees envy those of us with secure jobs, and maybe they think we don’t deserve them, but I don’t think these people are generally asking for less security in their own jobs.)

More generally, it’s my impression that people often think that “nudges” are a good way to get other people to behave in desirable ways, without these people wanting to be “nudged” themselves. For example, did Jeremy Bentham put himself in a Panopticon to ensure his own good behavior?

There are exceptions, though. I like to stick other people in my office so it’s harder for me to drift off and be less productive. And lots of smokers are supportive of policies that make it less convenient to smoke, as this can make it easier for them to quit and harder for them to relapse.


With all that in mind, I’d like to propose a Golden Rule of Nudge: Nudge unto others as you would have them nudge unto you.

Do not recommend to apply incentives to others that you would not want for yourself.

Perhaps you could try a big scatterplot with one dot per dataset?

Joe Nadeau writes:

We are studying variation in both means and variances in metabolic conditions. We have access to nearly 200 datasets that involve a range of metabolic traits and vary in sample size, mean effects, and variance. Some traits differ in mean but not variance, others in variance but not mean, still others in both, and of course some in neither. These studies are based on animal models where genetics and environmental conditions are well-controlled and where pretty rigorous study designs are possible. We plan to report the full range of results, but would like to rank results according to confidence (power?) for each dataset. Some are obviously more robust than others. Confidence limits, which will be reported, don't seem quite right for ranking. We feel an obligation to share some sense of confidence, which should be based on sample size, variance in the contrast groups – between genotypes, between diets, between treatments, …… . We are of course aware of the trickiness in studies like these. Our question is based on getting the rest right, and wanting to share with readers and reviewers a sense of the ‘strengths, limitations’ across the datasets. Suggestions?

My reply: Perhaps you could try a big scatterplot with one dot per dataset? Also I would doubt that there are really traits that do not differ in mean or variance. Maybe it would help to look at the big picture rather than to be categorizing individual cases, which can be noisy. Similarly, rather than ranking the results, I think it would be better to just consider ways of displaying all of them.
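One way to set such a plot up, sketched here with simulated data (the choice of axes is my suggestion, not anything from Nadeau's email): put the estimated mean difference on one axis, the log variance ratio on the other, and let dot size carry the sample size, so precision is visible without any ranking:

```python
import math
import random
import statistics

random.seed(2)

# Made-up stand-ins for the ~200 datasets: two groups (e.g. genotypes
# or diets) per dataset, varying in size, mean shift, and variance.
datasets = []
for _ in range(200):
    n = random.randint(10, 100)
    shift = random.gauss(0, 1)              # mean effect
    scale = random.lognormvariate(0, 0.3)   # variance effect
    g1 = [random.gauss(0, 1) for _ in range(n)]
    g2 = [random.gauss(shift, scale) for _ in range(n)]
    datasets.append((g1, g2))

# One dot per dataset: mean difference on x, log variance ratio on y,
# with total sample size available for sizing/shading the dots.
points, sizes = [], []
for g1, g2 in datasets:
    mean_diff = statistics.mean(g2) - statistics.mean(g1)
    log_var_ratio = math.log(statistics.variance(g2) / statistics.variance(g1))
    points.append((mean_diff, log_var_ratio))
    sizes.append(len(g1) + len(g2))

# e.g. with matplotlib: plt.scatter(*zip(*points), s=sizes)
```

Datasets with neither a mean nor a variance effect then show up as a cloud around the origin rather than as a separate "no effect" category.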

Podcast interview on polling (mostly), also some Bayesian stuff

Hugo Bowne-Anderson interviewed me for a DataCamp podcast. Transcript is here.

Rising test scores . . . reported as stagnant test scores

Joseph Delaney points to a post by Kevin Drum pointing to a post by Bob Somerby pointing to a magazine article by Natalie Wexler that reported on the latest NAEP (National Assessment of Educational Progress) test results.

In an article entitled, “Why American Students Haven’t Gotten Better at Reading in 20 Years,” Wexler asks, “what’s the reason for the utter lack of progress in reading scores?”

The odd thing, though, is that reading scores have clearly gone up in the past twenty years, as Somerby points out in text and Drum shows in this graph:

Drum summarizes:

Asian: +15 points
White: +5 points
Hispanic: +10 points
Black: +5 points

. . . Using the usual rule of thumb that 10 points equals one grade level, black and white kids have improved half a grade level; Hispanic kids have improved a full grade level; and Asian kids have improved 1½ grade levels.

Why does this sort of thing get consistently misreported? Delaney writes:

My [Delaney’s] opinion: because there is a lot of money in education and it won’t be possible to “disrupt” education and redirect this money if the current system is doing well. . . .

It also moves the goalposts. If everything is falling apart then it isn’t such a crisis if the disrupted industry has teething issues once they strip cash out of it to pay for the heroes who are reinventing the system.

But if current educational systems are doing well, and slowly improving through incremental change, then it is a lot harder to argue that there is a crisis in education, isn’t it?

Could be. The other thing is that it can be hard to get unstuck from a conventional story. We discussed a similar example a few years ago: that time it was math test scores, which economist Roland Fryer stated had been “largely constant over the past thirty years,” even while he’d included a graph showing solid improvements.

Bayesian inference and religious belief

We’re speaking here not of Bayesianism as a religion but of the use of Bayesian inference to assess or validate the evidence regarding religious belief, in short, the probability that God !=0 or the probability that the Pope is Catholic or, as Tyler Cowen put it, the probability that Lutheranism is true.

As a statistician and social scientist I have a problem with these formulations in large part because the states in question do not seem at all clearly defined—for instance, Lutheranism is a set of practices as much as it is a set of doctrines, and what does it mean for a practice to be “true”; also the doctrines themselves are vague enough that it seems pretty much meaningless to declare them true or false.

Nonetheless, people have wrestled with these issues, and it may be that something can be learned from such explorations.

So the following email from political philosopher Kevin Vallier might be of interest, following up on my above-linked reply to Cowen. Here’s Vallier:

Philosophers of religion have been using Bayesian reasoning to determine the rationality of theistic belief in particular for many years. Richard Swinburne introduced Bayesian analysis in his many books defending the rationality of theistic and Christian belief. But many others, theist and atheist, use it as well. So there actually are folks out there who think about their religious commitments in Bayesian terms, just not very many of us. But I bet there are at least 1000 philosophers, theologians, and lay apologists who find thinking in Bayesian terms useful on religious matters and debates.

A nice, accessible use of Bayesian reasoning with regard to religious belief is Richard Swinburne’s The Existence of God, 2nd ed.

Wow—I had no idea there were 1000 philosophers in the world, period. Learn something new every day.

In all seriousness . . . As indicated in my post replying to Cowen, I’m interested in the topic not so much out of an interest in religion but rather because of the analogy to political and social affiliation. It seems to me that many affiliations are nominally about adherence to doctrine but actually are more about belonging. This idea is commonplace (for example, when people speak of political “tribes”), but complications start to arise when the doctrines have real-world implications, as in our current environment of political polarization.

Present each others’ posters

It seems that I’ll be judging a poster session next week. So this seems like a good time to repost this from 2009:

I was at a conference that had an excellent poster session. I realized the session would have been even better if the students with posters had been randomly assigned to stand next to and explain other students’ posters. Some of the benefits:

1. The process of reading a poster and learning about its material would be more fun if it was a collaborative effort with the presenter.

2. If you know that someone else will be presenting your poster, you’ll be motivated to make the poster more clear.

3. When presenting somebody else’s poster, you’ll learn the material. As the saying goes, the best way to learn a subject is to teach it.

4. The random assignment will lead to more interdisciplinary understanding and, ultimately, collaboration.

I think just about all poster sessions should be done this way.

My post elicited some comments to which I replied:

– David wrote that my idea “misses the potential benefit to the owner of the poster of getting critical responses to their work.” The solution: instead of complete randomization, randomize the poster presenters into pairs, then put pairs next to each other. Student A can explain poster B, student B can explain poster A, and spectators can give their suggestions to the poster preparers.

– Mike wrote that “one strong motivation for presenters is the opportunity to stand in front of you (and other members of the evaluation committee) and explain *their* work to you. Personally.” Sure, but I don’t think it’s bad if instead they’re explaining somebody else’s work. If I were a student, I think I’d enjoy explaining my fellow students’ work to an outsider. The ensuing conversation might even result in some useful new ideas.

– Lawrence suggested that “the logic of your post apply to conference papers, too.” Maybe so.

“Fudged statistics on the Iraq War death toll are still circulating today”

Mike Spagat shares this story entitled, “Fudged statistics on the Iraq War death toll are still circulating today,” which discusses problems with a paper published in a scientific journal in 2006, and errors that a reporter inadvertently included in a recent news article. Spagat writes:

The Lancet could argue that if [Washington Post reporter Philip] Bump had only read the follow-up letters it published, he never would have reprinted the discredited graph. But this argument is akin to saying that there is no need for warning labels on cigarettes because people can just read the scientific literature on smoking and consider themselves warned. But in practice, many people will just assume the graph is kosher because it sits on the Lancet website with no warning attached. . . .

That said, this particular chapter at least has a happy ending. I wrote to Bump and the Washington Post and they fixed the story, in the process demonstrating an admirable respect for evidence and a commitment to the truth. The Lancet would do well to follow their example.

The Lancet declined to comment on this piece.

And here’s Spagat’s summary of the whole process:

1. The Lancet publishes a false graph.
2. The problems of the graph are exposed, several even in letters to the Lancet.
3. The Lancet just leaves the graph up.
4. A Washington Post reporter stumbles onto the false graph, thinks it’s cool and reprints it.
5. I tell the reporter that he just published a false graph.
6. The reporter does a mea culpa and pulls the graph down.
7. I write up this sequence of events for The Conversation [see above link].
8. The Conversation sends it to the Lancet.
9. The Lancet declines to comment and leaves the false graph up.
10. The Conversation publishes the piece.
11. Someone else sees and believes in the graph?

Over the years I’ve had bad experiences trying to get both academic journals and newspapers/magazines to make corrections but the journals have been the more reluctant of the two. A low water mark was when an editor at Public Opinion Quarterly told me, after a long exchange, that POQ policy is that they don’t correct errors.

The Lancet, of course, publishes lots of good stuff too (including, for example, this article). So it’s too bad to see them duck this one. Or maybe there’s more to the story. Anyone can feel free to add information in the comments.

“Ivy League Football Saw Large Reduction in Concussions After New Kickoff Rules”

I noticed this article in the newspaper today:

A simple rule change in Ivy League football games has led to a significant drop in concussions, a study released this week found.

After the Ivy League changed its kickoff rules in 2016, adjusting the kickoff and touchback lines by just five yards, the rate of concussions per 1,000 kickoff plays fell to two from 11, according to the study, which was published Monday in the Journal of the American Medical Association. . . .

Under the new system, teams kicked off from the 40-yard line, instead of the 35, and touchbacks started from the 20-yard line, rather than the 25.

The result? A spike in the number of touchbacks — and “a dramatic reduction in the rate of concussions” . . .

The study looked at the rate of concussions over three seasons before the rule change (2013 to 2015) and two seasons after it (2016 to 2017). Researchers saw a larger reduction in concussions during kickoffs after the rule change than they did with other types of plays, like scrimmages and punts, which saw only a slight decline. . . .

I was curious so I followed the link to the research article, “Association Between the Experimental Kickoff Rule and Concussion Rates in Ivy League Football,” by Douglas Wiebe, Bernadette D’Alonzo, Robin Harris, et al.

From the Results section:

Kickoffs resulting in touchbacks increased from a mean of 17.9% annually before the rule change to 48.0% after. The mean annual concussion rate per 1000 plays during kickoff plays was 10.93 before the rule change and 2.04 after (difference, −8.88; 95% CI, −13.68 to −4.09). For other play types, the concussion rate was 2.56 before the rule change and 1.18 after (difference, −1.38; 95% CI, −3.68 to 0.92). The difference-in-differences analysis showed that 7.51 (95% CI, −12.88 to −2.14) fewer concussions occurred for every 1000 kickoff plays after vs before the rule change.
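As a quick check (mine, using only the counts in the paper's table; the non-kickoff totals are the sums of the table's other rows), the published rates per 1000 plays and the difference in differences can be reproduced directly:

```python
# Counts from the published table: (concussions, plays), before and after.
kickoff_before, kickoff_after = (26, 2379), (3, 1467)
other_before, other_after = (100, 39107), (30, 25526)

def rate(y, n):
    # Concussions per 1000 plays.
    return 1000 * y / n

k1, k2 = rate(*kickoff_before), rate(*kickoff_after)   # 10.93 and 2.04
o1, o2 = rate(*other_before), rate(*other_after)       # 2.56 and 1.18

# Difference in differences, about -7.50 per 1000 plays, in line with
# the -7.51 reported in the article's Results section.
diff_in_diff = (k2 - k1) - (o2 - o1)
```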

I took a look at the table and noticed some things.

First, the number of concussions was pretty high and the drop was obviously not statistical noise: 126 (that is, 42 per year) for the first three years, down to 33 (16.5 per year) for the last two years. With the exception of punts and FG/PATs, the number of cases was large enough that the drop was clear.

Second, I took a look at the confidence intervals. The confidence interval for “other play types combined” includes zero: see the bottom line of the table. Whassup with that?

I shared this example with my classes this morning and we also had some questions about the data:

– How is “concussion” defined? Could the classification be changing over time? I’d expect, what with increased concussion awareness, that concussions would be diagnosed more than before, which would make the declining trend in the data even more impressive. But I don’t really know.

– Why data only since 2013? Maybe that’s only how long they’ve been recording concussions.

– We’d like to see the data for each year. To calibrate the effect of a change over time, you want to see year-to-year variation, in this case a time series of the 5 years of data. Obviously the years of the concussions are available, and they might have even been used in the analysis. In the published article, it says, “Annual concussion counts were modeled by year and play type, with play counts as exposures, using maximum likelihood Poisson regression . . .” I’m not clear on what exactly was done here.

– We’d also like to see similar data from other conferences, not just the Ivy League, to see changes in games that did not follow these rules.

– Even simpler than all that, we’d like to see the raw data on which the above analysis was based. Releasing the raw data—that would be trivial. Indeed, the dataset may already be accessible—I just don’t know where to look for it. Ideally we’d move to a norm in which it was just expected that every publication came with its data and code attached (except when not possible for reasons of privacy, trade secrets, etc.). It just wouldn’t be a question.

The above requests are not meant to represent devastating criticisms of the research under discussion. It’s just hard to make informed decisions without the data.

Checking the calculations

Anyway, I was concerned about the last row of the above table so I thought I’d do my best to replicate the analysis in R.

First, I put the data from the table into a file, football_concussions.txt:

y1 n1 y2 n2
kickoff 26 2379 3 1467
scrimmage 92 34521 28 22467
punt 6 2496 2 1791
field_goal_pat 2 2090 0 1268

Then I read the data into R, added a new row for “Other plays” summing all the non-kickoff data, and computed the classical summaries. For each row, I computed the raw proportions and standard errors, the difference between the proportions, and the standard error of that difference. I also computed the difference in differences, comparing the change in the concussion rate for kickoffs to the change in concussion rate for non-kickoff plays, as this comparison was mentioned in the article’s results section. I multiplied all the estimated differences and standard errors by 1000 to get the data in rates per thousand.

Here’s the (ugly) R code:

concussions <- read.table("football_concussions.txt", header=TRUE)
concussions <- rbind(concussions, apply(concussions[2:4,], 2, sum))
rownames(concussions)[5] <- "other_plays"
compare <- function(data) {
  p1 <- data$y1/data$n1
  p2 <- data$y2/data$n2
  diff <- p2 - p1
  se_diff <- sqrt(p1*(1-p1)/data$n1 + p2*(1-p2)/data$n2)
  diff_in_diff <- diff[1] - diff[5]
  se_diff_in_diff <- sqrt(se_diff[1]^2 + se_diff[5]^2)
  return(list(diffs=data.frame(data, diff=diff*1000, 
     diff_ci_lower=(diff - 2*se_diff)*1000, diff_ci_upper=(diff + 2*se_diff)*1000),
     diff_in_diff=diff_in_diff*1000, se_diff_in_diff=se_diff_in_diff*1000))
}
print(lapply(compare(concussions), round, 2))

And here's the result:

                y1    n1 y2    n2  diff diff_ci_lower diff_ci_upper
kickoff         26  2379  3  1467 -8.88        -13.76         -4.01
scrimmage       92 34521 28 22467 -1.42         -2.15         -0.69
punt             6  2496  2  1791 -1.29         -3.80          1.23
field_goal_pat   2  2090  0  1268 -0.96         -2.31          0.40
other_plays    100 39107 30 25526 -1.38         -2.05         -0.71

[1] -7.5

[1] 2.46

The differences and 95% intervals computed this way are similar to, although not identical to, the results in the published table--but there are some differences that baffle me. Let's go through the estimates, line by line:

- Kickoffs. Estimated difference is identical; the 95% interval is correct to two significant digits. This unimportant discrepancy could be coming because I used a binomial model and the published analysis used a Poisson regression.

- Plays from scrimmage. Estimated difference is the same for both analyses; lower confidence bound is almost the same. But the upper bounds are different: I have -0.69; the published analysis is -0.07. I'm guessing they got -0.70 and they just made a typo when entering the result into the paper. Not as bad as the famous Excel error, but these slip-ups indicate a problem with workflow. A problem I often have in my workflow too, as in the short term it is often easier to copy numbers from one place to another than to write a full script.

- Punts. Estimated difference and confidence interval work out to within 2 significant digits.

- Field goals and extra points. Same point estimate, but confidence bounds are much different, with the published intervals much wider than mine. This makes sense: There's a zero count in these data, and I was using the simple sqrt(p_hat*(1-p_hat)) standard deviation formula which gives too low a value when p_hat=0. Their Poisson regression would be better.

- All non-kickoffs. Same point estimate but much different confidence intervals. I think they made a mistake here: for one thing, their confidence interval doesn't exclude zero, even though the raw numbers show very strong evidence of a difference (30 concussions after the rule change and 100 before; even if you proportionally scale down the 100 to 60 or so, that's still sooo much more than 30; it could not be the result of noise). I have no idea what happened here; perhaps this was some artifact of their regression model?

This last error is somewhat consequential, in that it could lead readers to believe that there's no strong evidence that the underlying rate of concussions was declining for non-kickoff plays.

- Finally, the diff-in-diff is three standard errors away from zero, implying that the relatively larger decline in concussion rate in kickoffs, compared to non-kickoffs, could not be simply explained by chance variation.
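The zero-count problem in the field-goal/PAT row can be seen concretely (my numbers; the "rule of three" is a standard quick approximation, not something from the paper):

```python
import math

# Field goals / extra points after the rule change: 0 concussions
# in 1268 plays (the zero-count row of the table).
y, n = 0, 1268
p_hat = y / n

# The plug-in binomial standard error collapses to exactly zero
# when y = 0, wrongly suggesting no uncertainty at all.
se_plugin = math.sqrt(p_hat * (1 - p_hat) / n)

# The "rule of three": with zero events in n trials, an approximate
# 95% upper bound on the event rate is 3/n.
upper_bound_per_1000 = 1000 * 3 / n  # about 2.37 per 1000 plays
```

A Poisson or Bayesian interval does something similar automatically, which is presumably why the published bounds for this row are wider than my plug-in ones.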

Difference in difference or ratio of ratios?

But is the simple difference the right thing to look at?

Consider what was going on. Before the rule change, concussion rates were much higher for kickoffs than for other plays. After the rule change, concussion rates declined for all plays.

As a student in my class pointed out, it really makes more sense to compare the rates of decline than the absolute declines.

Put it this way. If the probabilities of concussion are:
- p_K1: Probability of concussion for a kickoff play in 2013-2015
- p_K2: Probability of concussion for a kickoff play in 2016-2017
- p_N1: Probability of concussion for a non-kickoff play in 2013-2015
- p_N2: Probability of concussion for a non-kickoff play in 2016-2017.

Following the published paper, we estimated the difference in differences, (p_K2 - p_K1) - (p_N2 - p_N1).

But I think it makes more sense to think multiplicatively, to work with the ratio of ratios, (p_K2/p_K1) / (p_N2/p_N1).

Or, on the log scale, (log p_K2 - log p_K1) - (log p_N2 - log p_N1).

What's the estimate and standard error of this comparison?

The key step is that we can use relative errors. From the Poisson distribution, the relative sd of an estimated rate is 1/sqrt(y), so this is the approximate sd on the log scale. So, the estimated difference in difference of log probabilities is (log(3/1467) - log(26/2379)) - (log(30/25526) - log(100/39107)) = -0.90. That's a big number: exp(0.90) = 2.46, which means that the concussion rate fell over twice as fast in kickoff plays as in non-kickoff plays.

But the standard error of this difference in difference of logs is sqrt(1/3 + 1/26 + 1/30 + 1/100) = 0.64. The observed d-in-d-of-logs is -0.90, which is less than two standard errors from zero, thus not conventionally statistically significant.

So, from these data alone, we cannot confidently conclude that the relative decline in concussion rates is more for kickoffs than for other plays. That estimate of 3/1467 is just too noisy.
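Written out as code (the same arithmetic as in the paragraphs above):

```python
import math

def log_rate(y, n):
    return math.log(y / n)

# Difference in differences on the log scale, using the same four
# counts as above (kickoffs and non-kickoffs, before and after).
dd_log = (log_rate(3, 1467) - log_rate(26, 2379)) \
       - (log_rate(30, 25526) - log_rate(100, 39107))
# dd_log is about -0.90, and exp(0.90) is about 2.46: the kickoff
# rate fell roughly two and a half times faster in relative terms.

# Delta-method standard error: the relative sd of a Poisson count y
# is about 1/sqrt(y), so the variances 1/y add across the four cells.
se_dd = math.sqrt(1/3 + 1/26 + 1/30 + 1/100)
# se_dd is about 0.64, so dd_log is only about 1.4 standard errors
# from zero.
```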

We could also see this using a Poisson regression with four data points. We create the data frame using this (ugly) R code:

data <- data.frame(y = c(concussions$y1[c(1,5)], concussions$y2[c(1,5)]),
                   n = c(concussions$n1[c(1,5)], concussions$n2[c(1,5)]),
                   time = c(0, 0, 1, 1),
                   kickoff = c(1, 0, 1, 0))

Which produces this:

    y     n time kickoff
1  26  2379    0       1
2 100 39107    0       0
3   3  1467    1       1
4  30 25526    1       0

Then we load rstanarm and set it to run chains in parallel:

library("rstanarm")
options(mc.cores = parallel::detectCores())

And run the regression:

fit <- stan_glm(y ~ time*kickoff, offset=log(n), family=poisson, data=data)

Which yields:

             Median MAD_SD
(Intercept)  -6.0    0.1  
time         -0.8    0.2  
kickoff       1.4    0.2  
time:kickoff -0.9    0.6

The coefficient for time is negative: concussion rates have gone down a lot. The coefficient for kickoff is positive: kickoffs have higher concussion rates. The coefficient for the interaction of time and kickoff is negative: concussion rates have been going down faster for kickoffs. But that last coefficient is less than two standard errors from zero. This is just confirming with our regression analysis what we already saw with our simple standard error calculations.
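Because this Poisson regression has four observations and four free parameters, it is saturated, so the maximum-likelihood coefficients are exactly contrasts of the observed log rates. A quick check (mine, not from the paper) recovers essentially the numbers stan_glm printed:

```python
import math

def log_rate(y, n):
    return math.log(y / n)

# Observed log rates for the four cells, keyed by (time, kickoff).
lr = {
    (0, 0): log_rate(100, 39107),  # other plays, before
    (0, 1): log_rate(26, 2379),    # kickoffs, before
    (1, 0): log_rate(30, 25526),   # other plays, after
    (1, 1): log_rate(3, 1467),     # kickoffs, after
}

# In a saturated model the ML coefficients are contrasts of log rates.
intercept = lr[(0, 0)]                    # about -6.0
time = lr[(1, 0)] - lr[(0, 0)]            # about -0.8
kickoff = lr[(0, 1)] - lr[(0, 0)]         # about 1.45
interaction = (lr[(1, 1)] - lr[(0, 1)]) - (lr[(1, 0)] - lr[(0, 0)])
# interaction is about -0.9, matching the time:kickoff coefficient.
```

The small gaps from the stan_glm medians (e.g. 1.45 vs. the printed 1.4 for kickoff) are just prior shrinkage and simulation noise in the Bayesian fit.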

OK, but don't I think that the concussion rate really was going down faster, even proportionally, in kickoffs than in other plays? Yes, I do, for three reasons. In no particular order:
- The data do show a faster relative decline, albeit with uncertainty;
- The story makes sense given that the Ivy League directly changed the rules to make kickoffs safer;
- Also, there were many more touchbacks, and this directly reduces the risks.

That's fine. In summary:


- There seem to be two mistakes in the published paper: (1) some error in the analysis of the non-kickoff plays, which led to an inappropriately wide confidence interval, and (2) a comparison of absolute rather than relative probabilities.

- It would be good to see the raw data for each year.

- The paper's main finding---a sharp drop in concussions, especially in kickoff plays---is supported both by the data and our substantive understanding.

- Concussion rates fell faster for kickoff plays, but they also dropped a lot in non-kickoff plays. I think the published article should've emphasized this finding more than they did, but perhaps they were hindered here by the error they made in their analysis leading to an inappropriately wide confidence interval for non-kickoff plays. The decline in concussions for non-kickoff plays is consistent with a general rise in concussion awareness.

P.S. I corrected a mistake in the above Poisson regression code, pointed out by commenter TJ (see here).

Strategic choice of where to vote in November

Darcy Kelley sends along a link to this site, Make My Vote Matter, in which you can enter two different addresses where you might vote, and it will tell you in which (if any) of these addresses has elections that are predicted to be close. The site is aimed at students; according to the site, they can register to vote at either their home or school address.

“Six Signs of Scientism”: where I disagree with Haack

I came across this article, “Six Signs of Scientism,” by philosopher Susan Haack from 2009. I think I’m in general agreement with Haack’s views—science has made amazing progress over the centuries but “like all human enterprises, science is ineradicably fallible and imperfect. At best its progress is ragged, uneven, and unpredictable; moreover, much scientific work is unimaginative or banal, some is weak or careless, and some is outright corrupt . . .”—and I’ll go with her definition of “scientism” as “a kind of over-enthusiastic and uncritically deferential attitude towards science, an inability to see or an unwillingness to acknowledge its fallibility, its limitations, and its potential dangers.”

But I felt something was wrong with Haack’s list of the six signs of scientism, which she summarizes as:

1. Using the words “science,” “scientific,” “scientifically,” “scientist,” etc., honorifically, as generic terms of epistemic praise.

2. Adopting the manners, the trappings, the technical terminology, etc., of the sciences, irrespective of their real usefulness.

3. A preoccupation with demarcation, i.e., with drawing a sharp line between genuine science, the real thing, and “pseudo-scientific” imposters.

4. A corresponding preoccupation with identifying the “scientific method,” presumed to explain how the sciences have been so successful.

5. Looking to the sciences for answers to questions beyond their scope.

6. Denying or denigrating the legitimacy or the worth of other kinds of inquiry besides the scientific, or the value of human activities other than inquiry, such as poetry or art.

Yes, these six signs can be a problem. But, to me, these six signs are associated with a particular sort of scientism, which one might call active scientism, a Richard Dawkins-like all-out offensive against views that are considered anti-scientific.

I spend a lot of time thinking about something different: a passive scientism which is not concerned about turf (thus, not showing signs 1 and 3 above); not concerned about the scientific method, indeed often displaying little interest in the foundations of scientific logic and reasoning (thus, not showing sign 4 above); and not showing any imperialistic inclinations to bring the humanities into the scientific orbit (thus, not showing signs 5 or 6 above). Passive scientism does involve adopting the trappings and terminology of science in a thoughtless way, so there is a bit of sign 2 above, but that’s it.

A familiar example of passive scientism is “pizzagate”: the work, publication, and promotion of the studies conducted by the Food and Brand Lab at Cornell University. Other examples include papers published in the Proceedings of the National Academy of Sciences on himmicanes, air rage, and ages ending in 9.

In all these cases, topics were being studied that clearly can be studied by science. So the “scientism” here is not coming into the decision to pursue this research. Rather, the scientism arises from blind trust in various processes associated with science, such as randomized treatment assignment, statistical significance, and peer review.

In this particular manifestation of scientism—claims that bounce around between scientific journals, textbooks, and general media outlets such as NPR and Ted talks—there is no preoccupation with identifying the scientific method or with demarcation, but rather the near-opposite: an all-too-calm acceptance of wacky claims that happen to be in proximity to various tokens of science.

So I think that when talking about scientism, we need to consider passive as well as active scientism. For every Dawkins, there is a Gladwell.