
Multilevel models with group-level predictors

Kari Lock Morgan writes:

I’m writing now though with a multilevel modeling question that has been nagging me for quite some time now. In your book with Jennifer Hill, you include a group-level predictor (for example, 12.15 on page 266), but then end up fitting this as an individual-level predictor with lmer. How can this be okay? It seems as if lmer can’t really still be fitting the model specified in 12.15? In particular, I’m worried about analyzing a cluster randomized experiment where the treatment is applied at the cluster level and outcomes are at the individual level – intuitively, of course it should matter that the treatment was applied at the cluster level, not the individual level, and therefore somehow this should enter into how the model is fit? However, I can’t seem to grasp how lmer would know this, unless it is implicitly looking at the covariates to see if they vary within groups or not, which I’m guessing it’s not? In your book you act as if fitting the model with county-level uranium as an individual predictor is the same as fitting it as a group-level predictor, which makes me think perhaps I am missing something obvious?

My reply: It indeed is annoying that lmer (and, for that matter, stan_lmer in its current incarnation) only allows individual-level predictors, so that any group-level predictors need to be expanded to the individual level (for example, u_full <- u[group]). But from the standpoint of fitting the statistical model, it doesn't matter. Regarding the question, how does the model "know" that, in this case, u_full is actually an expanded group-level predictor: The answer is that it "figures it out" based on the dependence between u_full and the error terms. It all works out.
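To make the expansion step concrete, here is a minimal sketch in Python (not the R used with lmer). The group values, the group assignments, and every variable name other than u, group, and u_full are made up for illustration. It only shows the bookkeeping: each individual inherits the predictor value of their group, so the expanded predictor never varies within a group.

```python
import numpy as np

# Hypothetical data: 4 groups (for example, counties) and 10 individuals.
u = np.array([0.5, -1.2, 0.3, 2.0])               # one predictor value per group
group = np.array([0, 0, 1, 1, 1, 2, 3, 3, 3, 3])  # group index of each individual (0-based in Python)

# Python analogue of the R expression u_full <- u[group]:
# look up each individual's group value.
u_full = u[group]

# The expanded predictor is constant within every group,
# so its within-group range (max minus min) is zero.
within_group_range = np.array(
    [u_full[group == j].max() - u_full[group == j].min() for j in range(len(u))]
)

print(u_full)
print(within_group_range)
```

Note that R indexes from 1 and Python from 0, so group labels copied from an R data frame would need to be shifted by one before being used this way.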

He’s a history teacher and he has a statistics question

Someone named Ian writes:

I am a History teacher who has become interested in statistics! The main reason for this is that I’m reading research papers about teaching practices to find out what actually “works.”

I’ve taught myself the basics of null hypothesis significance testing, though I confess I am no expert (Maths was never my strong point at school!). But I also came across your blog after I heard about this “replication crisis” thing.

I wanted to ask you a question, if I may.

Suppose a randomised controlled experiment is conducted with two groups and the mean difference turns out to be statistically significant at the .05 level. I’ve learnt from my self-study that this means:

“If there were genuinely no difference in the population, the probability of getting a result this big or bigger is less than 5%.”

So far, so good (or so I thought).

But from my recent reading, I’ve gathered that many people criticise studies for using “small samples.” What was interesting to me is that they criticise this even after a significant result has been found.

So they’re not saying “Your sample size was small so that may be why you didn’t find a significant result.” They’re saying: “Even though you did find a significant result, your sample size was small so your result can’t be trusted.”

I was just wondering whether you could explain why one should distrust significant results with small samples? Some people seem to be saying it’s because it may have been a chance finding. But isn’t that what the p-value is supposed to tell you? If p is less than 0.05, doesn’t that mean I can assume it (probably) wasn’t a “chance finding”?

My reply: See my paper, “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it,” recently published in the Personality and Social Psychology Bulletin. The short answer is that (a) it’s not hard to get p less than 0.05 just from chance, via forking paths, and (b) when effect sizes are small and a study is noisy, any estimate that reaches “statistical significance” is likely to be an overestimate, perhaps a huge overestimate.
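Point (b) can be checked with a quick simulation. The sketch below is in Python and uses made-up numbers: a hypothetical true effect of 0.1 and a hypothetical standard error of 0.5, standing in for a small effect measured in a noisy study. It draws many estimates, keeps only those that clear the usual 1.96-standard-error threshold, and compares their typical size to the true effect. It does not attempt to simulate forking paths, point (a).

```python
import random
import statistics


def simulate_estimates(true_effect, se, n_sims, seed=1):
    """Draw n_sims noisy estimates and return all of them plus the 'significant' ones."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    # "Statistically significant" here means more than 1.96 standard errors from zero.
    significant = [est for est in estimates if abs(est) > 1.96 * se]
    return estimates, significant


true_effect = 0.1  # hypothetical small true effect
se = 0.5           # hypothetical standard error (a noisy study)

estimates, significant = simulate_estimates(true_effect, se, n_sims=100_000)

share_significant = len(significant) / len(estimates)
# How many times larger than the truth is a typical significant estimate?
exaggeration = statistics.mean(abs(est) for est in significant) / true_effect
# How often does a significant estimate have the wrong sign?
wrong_sign = sum(est < 0 for est in significant) / len(significant)

print(f"share significant:  {share_significant:.3f}")
print(f"exaggeration ratio: {exaggeration:.1f}")
print(f"wrong-sign share:   {wrong_sign:.2f}")
```

Under these assumptions any estimate that reaches significance must be at least 0.98 in absolute value, nearly ten times the true effect, so the overestimate is built in by the selection on significance.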

An actual quote from a paper published in a medical journal: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.”

Someone writes:

So the NYT yesterday has a story about this study. I am directed to it and am immediately concerned about all the things that make this study somewhat dubious. Forking paths in the definition of the independent variable, sample selection in who wore the accelerometers, ignorance of the undoubtedly huge importance of interactions in the controls, etc, etc. blah blah blah. But I am astonished at the bald statement at the start of the study: “The data, analytic methods, and study materials will not be made available to other researchers for purposes of reproducing the results or replicating the procedure.” Why shouldn’t everyone, including the NYT, stop reading right there? How does a journal accept the article? The dataset itself is public and they didn’t create it! They’re just saying Fuck You.

I was, like, Really? So I followed the link. And, indeed, here it is:

The Journal of the American Heart Association published this? And the New York Times promoted it?

As a heart patient myself, I’m annoyed. I’d give it a subliminal frowny face, but I don’t want to go affecting your views on immigration.

P.S. My correspondent adds:

By the way, I started Who is Rich? this week and it’s great.


Young Investigator Special Competition for Time-Sharing Experiment for the Social Sciences

Sociologists Jamie Druckman and Jeremy Freese write:

Time-Sharing Experiments for the Social Sciences is Having A Special Competition for Young Investigators

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak Panel (see http:/ for more information). While anyone can submit a proposal to TESS at any time through our regular mechanism, we are having a Special Competition for Young Investigators. Graduate students and individuals who received their PhD in 2016 or after are eligible.

To give some examples of experiments we’ve done: one TESS experiment showed that individuals are more likely to support a business refusing service to a gay couple versus an interracial couple, but were no more supportive of religious reasons for doing so versus nonreligious reasons. Another experiment found that participants were more likely to attribute illnesses of obese patients as due to poor lifestyle choices and of non-obese patients to biological factors, which, in turn, resulted in participants being less sympathetic to overweight patients—especially when patients are female. TESS has also fielded an experiment about whether the opinions of economists influence public opinion on different issues, and the study found that they do on relatively technical issues but not so much otherwise.

The proposals that win our Special Competition will be able to be fielded at up to twice the size of a regular TESS study. We will begin accepting proposals for the Special Competition on January 1, 2019, and the deadline is March 1, 2019. Full details about the competition are available at

Ethics in statistical practice and communication: Five recommendations.

I recently published an article summarizing some of my ideas on ethics in statistics, going over these recommendations:

1. Open data and open methods,

2. Be clear about the information that goes into statistical procedures,

3. Create a culture of respect for data,

4. Publication of criticisms,

5. Respect the limitations of statistics.

The full article is here.

Predicting spread of flu

Aleks points us to this page on flu prediction. I haven’t looked into it but it seems like an important project.

Fitting the Besag, York, and Mollie spatial autoregression model with discrete data

Rudy Banerjee writes:

I am trying to use the Besag, York & Mollie 1991 (BYM) model to study the sociology of crime and space/time plays a vital role. Since many of the variables and parameters are discrete in nature is it possible to develop a BYM that uses an Integer Auto-regressive (INAR) process instead of just an AR process?

I’ve seen INAR(1) modeling, even a spatial INAR or SINAR paper, but they seem to be different from the way BYM is specified in the Bayes framework.

Does it even make sense to have a BYM that is INAR? I can think of discrete jumps in independent variables that affect the dependent variable in discrete jumps. (Also, do these models violate convexity requirements often required for statistical computing?)

My reply:

1. To see how to fit this sort of model in a flexible way, see this Stan case study, Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data, from Mitzi Morris.

2. Rather than trying to get cute with your discrete modeling, I’d suggest a simple two-level approach, where you use an underlying continuous model (use whatever space-time process you want, BYM or whatever) and then you can have a discrete data model (for example, negative binomial, that is, overdispersed Poisson) on top of that.
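Here is a rough sketch of the two-level idea in point 2, written in Python as a forward simulation rather than a fitted model. Everything specific in it is an assumption for illustration: a one-dimensional AR(1) process stands in for whatever continuous space-time process you prefer (it is not BYM), and the parameter values rho, sigma, baseline, and dispersion are made up. The discrete data enter only in the last step, where counts are drawn from a negative binomial, built as a gamma-Poisson mixture, whose mean is set by the continuous latent process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Level 1: a continuous latent process (hypothetical AR(1) stand-in for a spatial model).
n_areas = 50
rho, sigma = 0.8, 0.3  # hypothetical autocorrelation and innovation scale
phi = np.zeros(n_areas)
phi[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - rho**2))  # draw from the stationary distribution
for i in range(1, n_areas):
    phi[i] = rho * phi[i - 1] + rng.normal(0.0, sigma)

# Link the latent process to an expected count on the log scale.
baseline = 20.0  # hypothetical average count per area
mu = np.exp(np.log(baseline) + phi)

# Level 2: a discrete data model on top, here negative binomial (overdispersed Poisson).
# Gamma-Poisson mixture: mean mu, variance mu + mu**2 / dispersion.
dispersion = 5.0  # hypothetical; larger values approach a plain Poisson
rates = rng.gamma(shape=dispersion, scale=mu / dispersion)
y = rng.poisson(rates)

print(y[:10])
```

The point of the design is that the discreteness lives entirely in the data model, so the latent process can stay continuous and be swapped for a proper areal model without touching the count likelihood.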

Stan development in RStudio

Check this out! RStudio now has special features for Stan:

– Improved, context-aware autocompletion for Stan files and chunks

– A document outline, which allows for easy navigation between Stan code blocks

– Inline diagnostics, which help to find issues while you develop your Stan model

– The ability to interrupt Stan parallel workers launched within the IDE

This is awesome, especially that last feature. RStudio is my hero.

And don’t forget this: If you don’t currently have Stan on your computer, you can play with this demo version on the web, thanks to RStudio Cloud.

David Brooks discovers Red State Blue State Rich State Poor State!

The New York Times columnist writes:

Our political conflict is primarily a rich, white civil war. It’s between privileged progressives and privileged conservatives. You could say that tribalism is the fruit of privilege. People with more stresses in their lives necessarily pay less attention to politics. . . .

I’ve had some differences with Brooks in the past, but when he agrees with me, I’m not gonna complain.

As David Park, Boris Shor, Joe Bafumi, Jeronimo Cortina, and I wrote ten years ago:

The cultural divide of the two Americas looms larger at high incomes . . . A theme throughout this book is that the cultural differences between states—the things that make red and blue America feel like different places—boil down to differences among richer people in these states.


Consistent with Ronald Inglehart’s hypothesis of postmaterialism, survey data show social issues to be more important to higher-income voters. This can be viewed as a matter of psychology and economics, with social policy as a luxury that can be afforded once you have achieved some material comfort. Our results stand in contradiction to the commonly held idea that social issues distract lower-income voters from their natural economic concerns.

It took a few years, but it seems that our ideas have finally become conventional wisdom.

The AAA tranche of subprime science, revisited

Tom Daula points us to this article, “Mortgage-Backed Securities and the Financial Crisis of 2008: A Post Mortem,” by Juan Ospina and Harald Uhlig. Not our usual topic at this blog, but then there’s this bit on page 11:

We break down the analysis by market segment defined by loan type (Prime, Alt-A, and Subprime). Table 5 shows the results and documents the third fact: the subprime AAA-rated RMBS did particularly well. AAA-rated Subprime Mortgage Backed Securities were the safest securities among the non-agency RMBS market. As of December 2013 the principal-weighted loss rates AAA-rated subprime securities were on average 0.42% [2.2% for Prime AAA same page]. We do not deny that even the seemingly small loss of 0.42% should be considered large for any given AAA security.

Nonetheless, we consider this to be a surprising fact given the conventional narrative for the causes of the financial crisis and its assignment of the considerable blame to the subprime market and its mortgage-backed securities. An example of this narrative is provided by Gelman and Loken (2014):

We have in mind an analogy with the notorious AAA-class bonds created during the mid-2000s that led to the subprime mortgage crisis. Lower-quality mortgages [that is, mortgages with high probability of default and, thus, high uncertainty] were packaged and transformed into financial instruments that were (in retrospect, falsely) characterized as low risk.

OK, our paper wasn’t actually about mortgages; it was about statistics. We were just using mortgages as an example. But if Ospina and Uhlig are correct, we were mistaken in using AAA-rated subprime mortgages as an example of a bad bet. Analogies are tricky things!

P.S. Daula adds:

Overall, I think it fits your data collection/measurement theme, and how doing that well can provide novel insights. In that vein, they provide a lot of detail to replicate the results, in case folks disagree. There’s the technical appendix which (p.39) “serves as guide for replication and for understanding the contents of our database” as well as (p.7) a replication kit available from the authors. As to the latter, (p.15) footnote 15 guides the reader to exactly where to look for the one bit of modeling in the paper (“For a detailed list of the covariates employed, refer to MBS Project/Replication/DefaultsAnalysis/Step7”).

Toward better measurement in K-12 education research

Billy Buchanan, Director of Data, Research, and Accountability, Fayette County Public Schools, Lexington, Kentucky, expresses frustration with the disconnect between the large and important goals of education research, on one hand, and the gaps in measurement and statistical training, on the other.

Buchanan writes:

I don’t think that every classroom educator, instructional coach, principal, or central office administrator needs to be an expert on measurement. I do, however, think that if we are training individuals to be researchers (e.g., PhDs) we have a duty to make sure they are able to conduct the best possible research, understand the various caveats and limitations to their studies, and especially understand how measurement – as the foundational component of all research across all disciplines (yes, even qualitative research) – affects the inferences derived from their data. . . .

In essence, if a researcher wants to use an existing vetted measurement tool for research purposes, they should already have access to the technical documentation and can provide the information up front so as our staff reviews the request we can also evaluate whether or not the measurement tool is appropriate for the study. If a researcher wants to use their own measure, we want them to be prepared to provide sufficient information about the validity of their measurement tool so we can ensure that they publish valid results from their study; this also has an added benefit for the researcher by essentially motivating them to generate another paper – or two or three – from their single study.

He provides some links to resources, and then he continues:

I would like to encourage other K-12 school districts to join with us in requiring researchers to do the highest quality research possible. I, for one, at least feel that the students, families, and communities that we serve deserve nothing less than the best and believe if you feel the same that your organization would adopt similar practices. You can find our policies here: Fayette County Public Schools Research Requests. Or if you would like to use our documentation/information and/or contribute to what we currently have, you can submit an issue to our documentation’s GitHub Repository (

He had a sudden cardiac arrest. How does this change the probability that he has a particular genetic condition?

Megan McArdle writes:

I have a friend with a probability problem I don’t know how to solve. He’s 37 and just keeled over with sudden cardiac arrest, and is trying to figure out how to assess the probability that he has a given condition as his doctors work through his case. He knows I’ve been sharply critical of doctors’ failures to properly assess the Type I/Type II tradeoff, so he reached out to me, but we quickly got into math questions above my pay grade, so I volunteered to ask if you would sketch out the correct statistical approach.

To be clear, he’s an engineer, so he’s not asking you to do the work for him! Just to sketch out in a few words how he might approach information gathering and setting up a problem like “given that you’ve had sudden cardiac arrest, what’s the likelihood that a result on a particular genetic test is a false positive?”

My reply:

I agree that the conditional probability should change, given the knowledge that he had the cardiac arrest. Unfortunately it’s hard for me to be helpful here because there are too many moving parts: of course the probability of the heart attack conditional on having the condition or not, but also the relevance of the genetic test to his health condition. This is the kind of problem that is addressed in the medical decision making literature, but I don’t think I have anything useful to add here, beyond emphasizing that the calculation of any such probability is an intermediate step in this person’s goal of figuring out what he should do next regarding his heart condition.

I’m posting the question here, in case any of you can point to useful materials on this. In addition to the patient’s immediate steps in staying alive and healthy, this is a general statistical issue that has to be coming up in medical testing all the time, in that tests are often done in the context of something that happened to you, so maybe there is some general resource on this topic?
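For what it's worth, the bare conditional-probability bookkeeping can be sketched as follows, in Python. All of the numbers are hypothetical placeholders, not medical estimates, and the function name is made up. The sketch also assumes that, given whether or not the person has the condition, the test result is independent of having had the arrest, which may not hold in practice. It applies Bayes' rule twice: first updating on the cardiac arrest, then on a positive test.

```python
def update_on_event_then_test(prior, p_event_if_cond, p_event_if_not,
                              sensitivity, specificity):
    """Return P(condition | event) and P(condition | event, positive test).

    Assumes the test result is independent of the event given condition status.
    """
    # Step 1: Bayes' rule for the event (for example, sudden cardiac arrest).
    joint_cond = prior * p_event_if_cond
    joint_not = (1.0 - prior) * p_event_if_not
    post_event = joint_cond / (joint_cond + joint_not)

    # Step 2: Bayes' rule for a positive test, starting from the updated probability.
    true_pos = post_event * sensitivity
    false_pos = (1.0 - post_event) * (1.0 - specificity)
    post_event_and_test = true_pos / (true_pos + false_pos)
    return post_event, post_event_and_test


# Hypothetical inputs, chosen only to show the mechanics.
prior = 0.002             # prevalence of the condition before anything happened
p_event_if_cond = 0.05    # chance of the event if he has the condition
p_event_if_not = 0.0005   # chance of the event if he does not
sensitivity = 0.90        # P(test positive | condition)
specificity = 0.95        # P(test negative | no condition)

post_event, post_both = update_on_event_then_test(
    prior, p_event_if_cond, p_event_if_not, sensitivity, specificity
)

# Same positive test for someone who has NOT had the event, for comparison.
post_test_only = (prior * sensitivity) / (
    prior * sensitivity + (1.0 - prior) * (1.0 - specificity)
)

print(f"P(condition | event):                {post_event:.3f}")
print(f"P(condition | event, positive test): {post_both:.3f}")
print(f"P(false positive | event, positive): {1.0 - post_both:.3f}")
print(f"P(condition | positive test only):   {post_test_only:.3f}")
```

The comparison in the last line is the whole point: the same positive result means something very different for a person who has already had the event than for someone screened at random, because the event has already moved the starting probability.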

Understanding Chicago’s homicide spike; comparisons to other cities

Michael Masinter writes:

As a longtime blog reader sufficiently wise not to post beyond my academic discipline, I hope you might take a look at what seems to me to be a highly controversial attempt to use regression analysis to blame the ACLU for the recent rise in homicides in Chicago. A summary appears here with followup here.

I [Masinter] am skeptical, but loath to make the common mistake of assuming that my law degree makes me an expert in all disciplines. I’d be curious to know what you think of the claim.

The research article is called “What Caused the 2016 Chicago Homicide Spike? An Empirical Examination of the ‘ACLU Effect’ and the Role of Stop and Frisks in Preventing Gun Violence,” and is by Paul Cassell and Richard Fowles.

I asked Masinter what were the reasons for his skepticism, and he replied:

Two reasons, one observational, one methodological, and a general sense of skepticism.

Other cities have restricted the use of stop and frisk practices without a corresponding increase in homicide rates.

Quite a few variables appear to be collinear. The authors claim that Bayesian Model Averaging controls for that, but I lack the expertise to assess their claim.

More generally, their claim of causation appears a bit like post hoc propter hoc reasoning dressed up in quantitative analysis.

Perhaps I am wrong, but I have come to be skeptical of such large claims.

I asked Masinter if it would be ok to quote him, and he responded:

Yes, but please note my acknowledged lack of expertise. Too many of my law prof colleagues think our JD qualifies us as experts in all disciplines, a presumption magnified by our habit of advocacy.

I replied, “Hey, don’t they know that the only universal experts are M.D.’s and, sometimes, economists?”, and Masinter then pointed me to this wonderful article by Arthur Allen Leff from 1974:

With the publication of Richard A. Posner’s Economic Analysis of Law, that field of learning known as “Law and Economics” has reached a stage of extended explicitness that requires and permits extended and explicit comment. . . . I was more than half way through the book before it came to me: as a matter of literary genre (though most likely not as a matter of literary influence) the closest analogue to Economic Analysis of Law is the picaresque novel.

Think of the great ones, Tom Jones, for instance, or Huckleberry Finn, or Don Quixote. In each case the eponymous hero sets out into a world of complexity and brings to bear on successive segments of it the power of his own particular personal vision. The world presents itself as a series of problems; to each problem that vision acts as a form of solution; and the problem having been dispatched, our hero passes on to the next adventure. The particular interactions are essentially invariant because the central vision is single. No matter what comes up or comes by, Tom’s sensual vigor, Huck’s cynical innocence, or the Don’s aggressive romanticism is brought into play . . .

Richard Posner’s hero is also eponymous. He is Economic Analysis. In the book we watch him ride out into the world of law, encountering one after another almost all of the ambiguous villains of legal thought, from the fire-spewing choo-choo dragon to the multi-headed ogre who imprisons fair Efficiency in his castle keep for stupid and selfish reasons. . . .

One should not knock the genre. To hold the mind-set constant while the world is played in manageable chunks before its searching single light is a powerful analytic idea, the literary equivalent of dropping a hundred metals successively into the same acid to see what happens. The analytic move, just as a strategy, has its uses, no matter which mind-set is chosen, be it ethics or psychology or economics or even law. . . .

Leff quotes Posner:

Efficiency is a technical term: it means exploiting economic resources in such a way that human satisfaction as measured by aggregate consumer willingness to pay for goods and services is maximized. Value too is defined by willingness to pay.

And then Leff continues:

Given this initial position about the autonomy of people’s aims and their qualified rationality in reaching them, one is struck by the picture of American society presented by Posner. For it seems to be one which regulates its affairs in rather a bizarre fashion: it has created one grand system—the market, and those market-supportive aspects of law (notably “common,” judge-made law)—which is almost flawless in achieving human happiness; it has created another—the political process, and the rest of “the law” (roughly legislation and administration)—which is apparently almost wholly pernicious of those aims.

An anthropologist coming upon such a society would be fascinated. It would seem to him like one of those cultures which, existing in a country of plenty, having developed mechanisms to adjust all intracultural disputes in peace and harmony, lacking any important enemies, nevertheless comes up with some set of practices, a religion say, simultaneously so barbaric and all-pervasive as to poison almost every moment of what would otherwise appear to hold potential for the purest existential joy. If he were a bad anthropologist, he would cluck and depart. If he were a real anthropologist, I suspect he would instead stay and wonder what it was about the culture that he was missing. That is, he would ponder why they wanted that religion, what was in it for them, what it looked like and felt like to be inside the culture. And he would be especially careful not to stand outside and scoff if, like Posner, he too leaned to the proposition that “most people in most affairs of life are guided by what they conceive to be their self-interest and . . . choose means reasonably (not perfectly) designed to promote it.”

I’m reminded of our discussion from a few years ago regarding the two ways of thinking in economics.

Economists are often making one of two arguments:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-schools teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

Despite going in opposite directions, Arguments 1 and 2 have a similarity, in that both are used to make the point that economists stand in a superior position from which they can evaluate the actions of others.

But Leff said it all back in 1974.

P.S. We seem to have mislaid our original topic, which is the study of the change in homicide rates in Chicago. I read the article by Cassell and Fowles and their analysis of Chicago seemed reasonable to me. At the same time, I agree with Masinter that the comparisons to other cities didn’t seem so compelling.

Limitations of “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”

“If you will believe in your heart and confess with your lips, surely you will be saved one day” – The Mountain Goats paraphrasing Romans 10:9

One of the weird things about working with people a lot is that it doesn’t always translate into multiple opportunities to see them talk. I’m pretty sure the only time I’ve seen Andrew talk was at a fancy lecture he gave at Columbia. He talked about many things that day, but the one that stuck with me (because I’d not heard it phrased that well before; as a side note, this is my memory of the gist of what he was saying, so do not hold him to this opinion!) was that the problem with p-values and null-hypothesis testing wasn’t so much that the procedure was bad. The problem is that people are taught to believe that there exists a procedure that can, given any set of data, produce a “yes/no” answer to a fairly difficult question. So the problem isn’t the specific decision rule that NHST produces, so much as the idea that a universally applicable decision rule exists at all. (And yes, I know the maths. But the problem with p-values was never the maths.)

This popped into my head again this week as Aki, Andrew, Yuling, and I were working on a discussion to Gronau and Wagenmakers’ (GW) paper “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”.

Our discussion is titled “Limitations of ‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection’” and it extends various points that Aki and I have made at various times on this blog.

To summarize our key points:

  1. It is a bad thing for GW to introduce LOO model selection in a way that doesn’t account for its randomness. In their very specialized examples this turns out not to matter because they choose such odd data that the LOO estimates have zero variance. But it is nevertheless bad practice.
  2. Stacking is a way to get model weights that is more in line with the LOO-predictive concept than GW’s ad hoc pseudo-BMA weights. Although stacking is also not consistent for nested models, in the cases considered in GW’s paper it consistently picks the correct model. In fact, the model weight for the true model in each of their cases is w_0=1 independent of the number of data points.
  3. By not recognizing this, GW missed an opportunity to discuss the limitations of the assumptions underlying LOO (namely that the observed data is representative of the future data, and each individual data point is conditionally exchangeable). We spent some time laying these out and proposed some modifications to their experiments that would make these limitations clearer.
  4. Because LOO is formulated under much weaker assumptions than are used in this paper, namely LOO does not assume that the data is generated by one of the models under consideration (the so-called “M-Closed assumption”), it is a little odd that GW only assess its performance under this assumption. This assumption almost never holds. If you’ve ever used the famous George Box quote, you’ve explicitly stated that the M-Closed assumption does not hold!
  5. GW’s assertion that when two models can support identical models (such as in the case of nested models), the simplest model should be preferred is not a universal truth, but rather a specific choice that is being made. This can be enforced for LOO methods, but like all choices in statistical modelling, it shouldn’t be made automatically or by authority, but should instead be critically assessed in the context of the task being performed.
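To give a feel for the contrast in point 2, here is a toy sketch in Python of the two weighting schemes. It is not a reproduction of GW's examples or of our discussion paper. The setup is an assumption made for simplicity: two fixed predictive distributions with no fitted parameters, N(0,1) (which generated the data) and N(1,1), so the leave-one-out predictive density of each point is just the model's density at that point. Pseudo-BMA weights are taken as proportional to the exponentiated summed log predictive densities; stacking weights maximize the summed log density of the weighted mixture, found here by a simple grid search since there are only two models.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
y = rng.normal(0.0, 1.0, size=n)  # data generated from model A


def normal_logpdf(x, mean, sd=1.0):
    """Log density of a normal distribution."""
    return -0.5 * np.log(2.0 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2


# Pointwise log predictive densities, one column per model.
# Column 0: model A, N(0, 1). Column 1: model B, N(1, 1).
# With no fitted parameters, these coincide with the LOO predictive densities.
lpd = np.column_stack([normal_logpdf(y, 0.0), normal_logpdf(y, 1.0)])

# Pseudo-BMA weights: proportional to exp(summed log predictive density).
elpd = lpd.sum(axis=0)
w_pbma = np.exp(elpd - elpd.max())  # subtract the max for numerical stability
w_pbma /= w_pbma.sum()

# Stacking weights: maximize sum_i log(w * p_A(y_i) + (1 - w) * p_B(y_i)) over w in [0, 1].
dens = np.exp(lpd)
grid = np.linspace(0.0, 1.0, 1001)
objective = np.array(
    [np.sum(np.log(w * dens[:, 0] + (1.0 - w) * dens[:, 1])) for w in grid]
)
w_a = grid[np.argmax(objective)]
w_stack = np.array([w_a, 1.0 - w_a])

print("pseudo-BMA weights:", w_pbma)
print("stacking weights:  ", w_stack)
```

In a real analysis the pointwise densities would come from an actual LOO computation on fitted models, and the randomness of those estimates (point 1) would need to be carried along rather than ignored as it is here.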

All of this has made me think about the idea of doing model selection. Or, more specifically, it’s made me question whether or not we should try to find universal tools for solving this problem. Is model selection even possible? (Danielle Navarro from UNSW has a particularly excellent blog post outlining her experiences with various existing model selection methods that you all should read.)

So I guess my very nebulous view is that we can’t do model selection, but we can’t not do model selection, but we also can’t not not do model selection.

In the end we need to work out how to do model selection for specific circumstances and to think critically about our assumptions. LOO helps us do some of that work.

To close off, I’m going to reproduce the final section of our paper because what’s the point of having a blog post (or writing a discussion) if you can’t have a bit of fun.

Can you do open science with M-Closed tools?

One of the great joys of writing a discussion is that we can pose a very difficult question that we have no real intention of answering. The question that is well worth pondering is the extent to which our chosen statistical tools influence how scientific decisions are made. And it’s relevant in this context because a key difference between model selection tools based on LOO and tools based on marginal likelihoods is what happens when none of the models could reasonably generate the data.

In this context, marginal likelihood-based model selection tools will, as the amount of data increases, choose the model that best represents the data, even if it doesn’t represent the data particularly well. LOO-based methods, on the other hand, are quite comfortable expressing that they cannot determine a single model that should be selected. To put it more bluntly, marginal likelihood will always confidently select the wrong model, while LOO is able to express that no one model is correct.

We leave it for each individual statistician to work out how the shortcomings of marginal likelihood-based model selection balance with the shortcomings of cross-validation methods. There is no simple answer.

Stan on the web! (thanks to RStudio)

This is big news. Thanks to RStudio, you can now run Stan effortlessly on the web.

So you can get started on Stan without any investment in set-up time, no need to install C++ on your computer, etc.

As Ben Goodrich writes, “RStudio Cloud is particularly useful for Stan tutorials where a lot of time can otherwise be wasted getting C++ toolchains installed and configured on everyone’s laptops.”

To get started with a simple example, just click here and log in.

We’ve pre-loaded this particular RStudio session with a regression model and an R script to simulate fake data and run the model. In your online RStudio Cloud session (which will appear within your browser when you click the above link), just go to the lower-right window with Files, and click on simple_regression.stan and simple_regression.R. This will open up those files. Run simple_regression.R and it will simulate the data, run the Stan program, and produce a graph.

Now you can play around.

Create your own Stan program: just work in the upper-left window, click on File, New File, Stan File, then click on File, Save As, and give it a .stan extension. The RStudio editor already has highlighting and autocomplete for Stan files.

Or edit the existing Stan program (simple_regression.stan, sitting there in the lower right of your RStudio Cloud window), change it how you’d like, then edit the R script or create a new one. You can upload data to your session too.

When you run a new or newly edited Stan program, it will take some time to compile. But then next time you run it, R will access the compiled version and it will run faster.

You can also save your session and get back to it later.

Some jargon for ya—but I mean every word of it!

This is a real game-changer as it significantly lowers the barriers to entry for people to start using Stan.


Ultimately I recommend you set up Stan on your own computer so you have full control over your modeling, but RStudio’s Cloud is a wonderful way to get started.

Here’s what RStudio says:

Each project is allocated 1GB of RAM. Each account is allocated one private space, with up to 3 members and 5 projects. You can submit a request to the RStudio Cloud team for more capacity if you hit one of these space limits, and we will do our best to accommodate you. If you are using a Professional 2 account, you will not encounter these space limits.

In addition to the private space (where you can collaborate with a selected group of other users), every user also gets a personal workspace (titled “Your Workspace”), where there is virtually no limit to the number of projects you can create. Only you can work on projects in your personal workspace, but you can grant view & copy access to them to any other RStudio Cloud user.

This is just amazing. I’m not the most computer-savvy user, but I was able to get this working right away.

Ben adds:

It also comes with
* The other 500-ish examples in the examples/ subdirectory
* Most of the R packages that use Stan, including McElreath’s rethinking package from GitHub and all the ones under stan-dev, namely
– rstanarm (comes with compiled Stan programs for regression)
– brms (generates Stan programs for regression)
– projpred (for selecting a submodel of a GLM)
– bayesplot and shinystan (for visualizing posterior output)
– loo (for model comparison using expected log predictive density)
– rstantools (for creating your own R packages like rstanarm)
* Saving new / modified compiled Stan programs to the disk to use across sessions first requires the user to do rstan_options(auto_write = TRUE)

I’m so excited. You can now play with Stan yourself with no start-up effort. Or, if you’re already a Stan user, you can demonstrate it to your friends. Also, you can put your own favorite models in an RStudio Cloud environment (as Ben did for my simple regression model) and then share the link with other people, who can use your model on your data, upload their own data, alter your model, etc.

P.S. It seems that for now this is really only for playing around with very simple models, to give you a taste of Stan, and then you’ll have to install it on your computer to do more. See this post from Aki. That’s fine, as this demo is intended to be a show horse, not a work horse. I think there is also a way to pay RStudio for cycles on the cloud and then you can run bigger Stan models through the web interface. So that could be an option too, for example if you want to use Stan as a back-end for some computing that you’d like others to access remotely.

Why are functional programming languages so popular in the programming languages community?

Matthijs Vákár writes:

Re the popularity of functional programming and Church-style languages in the programming languages community: there is a strong sentiment in that community that functional programming provides important high-level primitives that make it easy to write correct programs. This is because functional code tends to be very short and easy to reason about because of referential transparency, making it quick to review and maintain. This becomes even more important in domains where correctness is non-trivial, like in the presence of probabilistic choice and conditioning (or concurrency).

The sense – I think – is that for the vast majority of your code, correctness and programmer-time are more important considerations than run-time efficiency. A typical workflow should consist of quickly writing an obviously correct but probably inefficient program, profiling it to locate the bottlenecks and finally – but only then – thinking hard about how to use computational effects like mutable state and concurrency to optimise those bottlenecks without sacrificing correctness.

There is also a sense that compilers will get smarter with time, which should result in pretty fast functional code, allowing the programmer to think more about what she wants to do rather than about the how. I think you could also see the lack of concern for performance of inference algorithms in this light. This community is primarily concerned with correctness over efficiency. I’m sure they will get to more efficient inference algorithms eventually. (Don’t forget that they are not statisticians by training so it may take some time for knowledge about inference algorithms to percolate into the PL community.)

Justified or not, there is a real conviction in the programming languages community that functional ideas will become more and more important in mainstream programming. People will point to the increase of functional features in Scala, Java, C# and even C++. Also, people note that OCaml and Haskell these days are getting to the point where they are really quite fast. Jane Street would not be using OCaml if it weren’t competitive. In fact, one of the reasons they use it – as I understand it – is that their code involves a lot of maths which functional languages lend themselves well to writing. Presumably, this is also part of the motivation of using functional programming for prob prog? In a way, functional programming feels closer to maths.

Another reason I think people care about language designs in the style of Church is that they make it easy to write certain programs that are hard in Stan. For instance, perhaps you care about some likelihood-free model, like a model where some probabilistic choices are made, then a complicated deterministic program is run and only then the results are observed (conditioned on). An example that we have here in Oxford is a simulation of the standard model of particle physics. This effectively involves pushing forward your priors through some complicated simulator and then observing. It would be difficult to write down a likelihood for this model. Relatedly, when I was talking to Erik Meijer, I got the sense that Facebook wants to integrate prob prog into existing large software systems. The resulting models would not necessarily have a smooth likelihood wrt the Lebesgue measure. I am told that in some cases inference in these models is quite possible with rejection sampling or importance sampling or some methods that you might consider savage. The models might not necessarily be very hard statistically, but they are hard computationally in the sense that they do involve running a non-trivial program. (Note though that this has less to do with functional programming than it does with combining probabilistic constructs with more general programming constructs!)
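To make the “push the prior through a simulator, then condition” idea concrete, here is a minimal rejection-sampling sketch. Everything in it is an assumption for illustration: the `simulator` function is a toy stand-in (a count of successes in 20 trials), the prior is Uniform(0, 1), and the observed value of 15 is made up. This toy simulator does have a tractable likelihood, which is exactly what makes it easy to check; the cases Matthijs describes would not.

```python
import random

def simulator(theta, rng, n_trials=20):
    """Stand-in for a complicated program: count successes in n_trials."""
    return sum(1 for _ in range(n_trials) if rng.random() < theta)

def rejection_sample(observed, n_draws, rng):
    """Keep the prior draws whose simulated output equals the observation."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.random()            # draw from a Uniform(0, 1) prior
        if simulator(theta, rng) == observed:
            accepted.append(theta)      # condition on the observed result
    return accepted

rng = random.Random(0)
draws = rejection_sample(observed=15, n_draws=50_000, rng=rng)
posterior_mean = sum(draws) / len(draws)
print(len(draws), round(posterior_mean, 3))
```

No likelihood is ever written down: the only requirement is that the program can be run forward. The price is that most draws are thrown away, which is presumably why one might call such methods savage.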

A different example that people often bring up is given by models where observations might be of variable length, like in language models. Of course, you could do it in Stan with padding, but some people don’t seem to like this.
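The padding workaround can be sketched as follows, in plain Python rather than Stan. The helper names (`pad_sequences`, `log_likelihood`) and the data are hypothetical, and the normal likelihood is just a placeholder: the point is only that ragged observations are stored in a rectangular array together with their true lengths, and the likelihood reads only the real entries.

```python
import math

def pad_sequences(sequences, pad_value=0.0):
    """Pad ragged sequences to a common length and record the true lengths."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

def log_likelihood(padded, lengths, mean, sd):
    """Normal log likelihood that only uses the first lengths[i] entries of row i."""
    total = 0.0
    for row, n in zip(padded, lengths):
        for x in row[:n]:               # padding entries are never touched
            total += -0.5 * math.log(2 * math.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)
    return total

observations = [[1.2, 0.7], [0.3], [2.1, 1.9, 1.4]]
padded, lengths = pad_sequences(observations)
print(padded)
print(lengths)
```

Since the padded entries never enter the sum, the choice of padding value does not affect the result. The bookkeeping is what some people find unappealing.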

Of course, there is the question whether all of this effort is wasted if eventually inference is completely intractable in basically all but the simplest models. What I would hope to eventually see – and many people in the PL community with me – is a high-level language in which you can do serious programming (like in OCaml) and have access to serious inference capabilities (like in Stan). Ideally, the compiler should decide/give hints as to which inference algorithms to try — use NUTS when you can, but otherwise back-off and try something else. And there should be plenty of diagnostics to figure out when to distrust your results.

Finally, though, something I have to note is that programming language people are the folks writing compilers. And compilers are basically the one thing that functional languages do best because of their good support for user-defined data structures like trees and recursing over such data structures using pattern matching. Obviously, therefore, programming language folks are going to favour functional languages. Similarly, because of their rich type systems and high-level abstractions like higher-order functions, polymorphism and abstract data types, functional languages serve as great hosts for DSLs like PPLs. They make it super easy for the implementer to write a PPL even if they are a single academic and do not have the team required to write a C++ project.

The question now is whether they also genuinely make things easier for the end user. I believe they ultimately have the potential to do so, provided that you have a good optimising compiler, especially if you are trying to write a complicated mathematical program.

I replied:

On one hand, people are setting up models in Church etc. that they may never be able to fit—it’s not just that Church etc are too slow to fit these models, it’s that they’re unidentified or just don’t make sense or have never really been thought out. (My problem here is not with the programming language called Church; I’m just here concerned about the difficulty with fitting any model correctly the first time, hence the need for flexible tools for model fitting and exploration.)

But, on the other hand, people are doing inference and making decisions all the time, using crude regressions or t-tests or whatever.

To me, it’s close to meaningless for someone to say they can write “an obviously correct but probably inefficient program” if the program is fitting a model that you can’t really fit!

Matthijs responded:

When I say “write an obviously correct but probably inefficient program”, I am talking about usual functional programming in a deterministic setting. I think many programming language (PL) folks do not fully appreciate what correctness for a probabilistic program means. I think they are happy when they can prove that their program asymptotically does the right thing. I do not think they quite realise yet that that gives no guarantees about what to think about your results when you run your program for a finite amount of time. It sounds silly, but I think at least part of the community is still to realise that. Personally, I expect that people will wake up eventually and will realise that correctness in a probabilistic setting is a much more hairy beast. In particular, I expect that the PL community will realise that run-time diagnostics and model checking are the things that they should have been doing all along. I am hopeful though that at some point enough knowledge of statistics will trickle through to them such that genuinely useful collaboration is possible. I strongly feel that there is mostly just a lot of confusion rather than wilful disregard.

The Golden Rule of Nudge

Nudge unto others as you would have them nudge unto you.

Do not recommend to apply incentives to others that you would not want for yourself.


I was reading this article by William Davies about Britain’s Kafkaesque immigration policies.

The background, roughly, is this: Various English politicians promised that the net flow of immigrants to the U.K. would decrease. But the British government had little control over the flow of immigrants to the U.K., because most of them were coming from other E.U. countries. So they’d pretty much made a promise they couldn’t keep. The only way the government could even try to reach their targets was to convince people currently living in the U.K. to leave. And one way to do this was to make their lives more difficult. Apparently the government focused this effort on people living in Britain who’d originally come from former British colonies in the Caribbean, throwing paperwork at them and threatening their immigration status.

Davies explains:

The Windrush generation’s immigration status should never have been in question, and the cause of their predicament is recent: the 2014 Immigration Act, which contained the flagship policies of the then home secretary, Theresa May. Foremost among them was the plan to create a ‘hostile environment’, with the aim of making it harder for illegal immigrants to work and live in the UK . . .

It’s almost as if, on discovering that law alone was too blunt an instrument for deterring and excluding immigrants, May decided to weaponise paperwork instead. The ‘hostile environment’ strategy was never presented just as an effective way of identifying and deporting illegal immigrants: more important, it was intended as a way of destroying their ability to build normal lives. The hope, it seemed, was that people might decide that living in Britain wasn’t worth the hassle. . . .

The thread linking benefit sanctions and the ‘hostile environment’ is that both are policies designed to alter behaviour dressed up as audits. Neither is really concerned with accumulating information per se: the idea is to use a process of constant auditing to punish and deter . . .

And then he continues:

The coalition government was fond of the idea of ‘nudges’, interventions that seek to change behaviour by subtle manipulation of the way things look and feel, rather than through regulation. Nudgers celebrate the sunnier success stories, such as getting more people to recycle or to quit smoking, but it’s easy to see how the same mentality might be applied in a more menacing way.

Nudge for thee but not for me?

This all reminds me of a general phenomenon, that “incentives matter” always seems like a good motto for other people, but we rarely seem to want it for ourselves. For example, there’s lots of talk about how being worried about losing your job is a good motivator to work hard (or, conversely, that a secure job is a recipe for laziness), but we don’t want job insecurity for ourselves. (Sure, lots of people who aren’t tenured professors or secure government employees envy those of us with secure jobs, and maybe they think we don’t deserve them, but I don’t think these people are generally asking for less security in their own jobs.)

More generally, it’s my impression that people often think that “nudges” are a good way to get other people to behave in desirable ways, without these people wanting to be “nudged” themselves. For example, did Jeremy Bentham put himself in a Panopticon to ensure his own good behavior?

There are exceptions, though. I like to stick other people in my office so it’s harder for me to drift off and be less productive. And lots of smokers are supportive of policies that make it less convenient to smoke, as this can make it easier for them to quit and harder for them to relapse.


With all that in mind, I’d like to propose a Golden Rule of Nudge: Nudge unto others as you would have them nudge unto you.

Do not recommend to apply incentives to others that you would not want for yourself.

Perhaps you could try a big scatterplot with one dot per dataset?

Joe Nadeau writes:

We are studying variation in both means and variances in metabolic conditions. We have access to nearly 200 datasets that involve a range of metabolic traits and vary in sample size, mean effects, and variance. Some traits differ in mean but not variance, others in variance but not mean, still others in both, and of course some in neither. These studies are based on animal models where genetics and environmental conditions are well-controlled and where pretty rigorous study designs are possible. We plan to report the full range of results, but would like to rank results according to confidence (power?) for each dataset. Some are obviously more robust than others. Confidence limits, which will be reported, don’t seem quite right for ranking. We feel an obligation to share some sense of confidence, which should be based on sample size, variance in the contrast groups – between genotypes, between diets, between treatments, …… . We are of course aware of the trickiness in studies like these. Our question is based on getting the rest right, and wanting to share with readers and reviewers a sense of the ‘strengths, limitations’ across the datasets. Suggestions?

My reply: Perhaps you could try a big scatterplot with one dot per dataset? Also I would doubt that there are really traits that do not differ in mean or variance. Maybe it would help to look at the big picture rather than to be categorizing individual cases, which can be noisy. Similarly, rather than ranking the results, I think it would be better to just consider ways of displaying all of them.

Podcast interview on polling (mostly), also some Bayesian stuff

Hugo Bowne-Anderson interviewed me for a DataCamp podcast. Transcript is here.

Rising test scores . . . reported as stagnant test scores

Joseph Delaney points to a post by Kevin Drum pointing to a post by Bob Somerby pointing to a magazine article by Natalie Wexler that reported on the latest NAEP (National Assessment of Educational Progress) test results.

In an article entitled, “Why American Students Haven’t Gotten Better at Reading in 20 Years,” Wexler asks, “what’s the reason for the utter lack of progress in reading scores?”

The odd thing, though, is that reading scores have clearly gone up in the past twenty years, as Somerby points out in text and Drum shows in this graph:

Drum summarizes:

Asian: +15 points
White: +5 points
Hispanic: +10 points
Black: +5 points

. . . Using the usual rule of thumb that 10 points equals one grade level, black and white kids have improved half a grade level; Hispanic kids have improved a full grade level; and Asian kids have improved 1½ grade levels.
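The conversion is simple enough to check in a few lines. This sketch just applies the quoted rule of thumb (10 points per grade level) to the four gains listed above; the variable names are mine.

```python
POINTS_PER_GRADE_LEVEL = 10  # the rule of thumb quoted above

# Point gains as summarized by Drum.
gains = {"Asian": 15, "White": 5, "Hispanic": 10, "Black": 5}

# Convert each gain in points to a gain in grade levels.
grade_levels = {group: points / POINTS_PER_GRADE_LEVEL for group, points in gains.items()}

for group, levels in grade_levels.items():
    print(f"{group}: +{levels:g} grade levels")
```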

Why does this sort of thing get consistently misreported? Delaney writes:

My [Delaney’s] opinion: because there is a lot of money in education and it won’t be possible to “disrupt” education and redirect this money if the current system is doing well. . . .

It also moves the goalposts. If everything is falling apart then it isn’t such a crisis if the disrupted industry has teething issues once they strip cash out of it to pay for the heroes who are reinventing the system.

But if current educational systems are doing well, and slowly improving through incremental change, then it is a lot harder to argue that there is a crisis in education, isn’t it?

Could be. The other thing is that it can be hard to get unstuck from a conventional story. We discussed a similar example a few years ago: that time it was math test scores, which economist Roland Fryer stated had been “largely constant over the past thirty years,” even while he’d included a graph showing solid improvements.