
SciLua 2 includes NUTS

The most recent release of SciLua includes an implementation of Matt’s sampler, NUTS (link is to the final JMLR paper, which is a revision of the earlier arXiv version).

According to the author of SciLua, Stefano Peluchetti:

Should be quite similar to your [Stan's] implementation with some differences in the adaptation strategy.

If you have time to have a look and give it a try I would appreciate your feedback. My website includes the LuaJIT binaries for 3 arches [Windows, Mac OS X, Linux], and the Xsys and Sci libraries needed to run NUTS.

I’ve also been working on a higher-level interface with some test models and datasets (the user can specify a simple script to describe the model, but it’s always Lua syntax, not a custom DSL).

Please notice that multi-pass forward automatic differentiation is used (I experimented with a single-pass, tape based, version but the speed-up was not big and for high-dimensional problems one wants to use reverse diff in any case, on my TODO).

For complex models you might want to increase loopunroll a little, and maybe callunroll as well. Feel free to ask any question.

I haven’t had time to try it myself. It looks like it’s implemented using a just-in-time compiler (JIT), like Julia, with models specified directly in Lua rather than in a custom modeling language.
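Peluchetti’s remark about multi-pass forward automatic differentiation refers to the fact that forward mode computes one directional derivative per pass, so a full gradient in n dimensions costs n evaluations of the function (which is why reverse mode wins for high-dimensional problems). Here is a minimal dual-number sketch of the idea in Python; it illustrates the general technique only, not SciLua’s implementation.

```python
import math

class Dual:
    """Dual number a + b*eps with eps**2 == 0; the eps coefficient
    carries a derivative through arithmetic automatically."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def exp(x):
    # chain rule: d/dt exp(x(t)) = exp(x) * x'
    return Dual(math.exp(x.val), math.exp(x.val) * x.der)

def grad(f, x):
    """Gradient of scalar-valued f by n forward passes, one per
    coordinate: the 'multi-pass' cost of forward-mode autodiff."""
    return [f([Dual(xj, float(i == j)) for j, xj in enumerate(x)]).der
            for i in range(len(x))]

# f(x, y) = x*y + exp(x):  df/dx = y + exp(x),  df/dy = x
g = grad(lambda v: v[0] * v[1] + exp(v[0]), [0.0, 3.0])
print(g)  # [4.0, 0.0]
```

Reverse mode, by contrast, recovers the whole gradient in a single backward sweep over a recorded tape, which is the item on Peluchetti’s TODO list.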

Yummy Mr. P!


Chris Skovron writes:

A colleague sent the attached image from Indonesia.

For whatever reason, it seems appropriate that Mr. P is a delicious salty snack with the tagline “good times.”

Indeed. MRP has made the New York Times and Indonesian snack food. What more can we ask for?

A linguist has a question about sampling when the goal is causal inference from observational data

Nate Delaney-Busch writes:

I’m a PhD student of cognitive neuroscience at Tufts, and a question came up recently with my colleagues about the difficulty of random sampling in cases of highly controlled stimulus sets, and I thought I would drop a line to see if you had any reading suggestions for us.

Let’s say I wanted to disentangle the effects of word length from word frequency on the speed at which people can discriminate words from pseudowords, controlling for nuisance factors (say, part of speech, age of acquisition, and orthographic neighborhood size – the number of other words in the language that differ from the given word by only one letter). My sample can be a couple hundred words from the English language.

What’s the best way to handle the nuisance factors without compromising random sampling? There are programs that can automatically find the most closely-matched subsets of larger databases (if I bin frequency and word length into categories for a factorial experimental design), but what are the consequences of having experimental item essentially be a fixed factor? Would it be preferable to just take a random sample of the English language, then use a hierarchical regression to deal with the nuisance factors first? Are there measures I can use to quantify the extent to which chosen sampling rules (e.g. “nuisance factors must not significantly differ between conditions”) constrain random sampling? How would I know when my constraints start to really become a problem for later generalization?

Another way to ask the same question would be how to handle correlated variables of interest like word length and frequency during sampling. Would it be appropriate to find a sample in which word length and frequency are orthogonal (e.g. if I wrote a script to take a large number of random samples of words and use the one where the two variables of interest are the least correlated)? Or would it be preferable to just take a random sample and try to deal with the collinearity after the fact?
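The first option Delaney-Busch describes, drawing many random samples and keeping the one where the variables of interest are least correlated, can be sketched in a few lines of Python. The lexicon here is simulated, and the sample sizes and correlation structure are invented for illustration; nothing below comes from real word-frequency data.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation, no external libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(1)
# Simulated lexicon: longer words tend to be a bit less frequent,
# so length and log-frequency are correlated in the "population."
lexicon = [(length, 5.0 - 0.1 * length + random.gauss(0, 1.0))
           for length in (random.randint(2, 12) for _ in range(5000))]
pop_r = pearson_r([w[0] for w in lexicon], [w[1] for w in lexicon])

# Draw many random samples of 200 words and keep the one in which
# the two variables of interest are least correlated.
best_sample, best_r = None, 1.0
for _ in range(1000):
    sample = random.sample(lexicon, 200)
    r = pearson_r([w[0] for w in sample], [w[1] for w in sample])
    if abs(r) < abs(best_r):
        best_sample, best_r = sample, r

print(f"population r = {pop_r:.3f}, best sample r = {best_r:.3f}")
```

The selected sample is less correlated than the population, but the selection step itself means the items are no longer a simple random sample, which is exactly the tension the question raises.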

I replied:

I don’t have any great answers except to say that in this case I don’t know that it makes sense to think of word length or word frequency as a “treatment” in the statistical sense of the word. To see this, consider the potential-outcomes formulation (or, as Don Rubin would put it, “the RCM”). Suppose you want the treatment to be “increase word length by one letter.” How do you do this? You need to switch in a new word. But the effect will depend on which word you choose. I guess what I’m saying is, you can see how the speed of discrimination varies by word length and word frequency, and you might find a model that predicts well, in which case maybe the sample of words you use might not matter much. But if you don’t have a model with high predictive power, then I doubt there’s a unique right way to define your sample and your population; it will probably depend on what questions you are asking.

Delaney-Busch then followed up:

For clarification, this isn’t actually an experiment I was planning to run – I thought it would be a simple example that would help illustrate my general dilemma when it comes to psycholinguistics.

Your point on treatments is well-taken, though perhaps hard to avoid in research on language processing. It’s actually one of the reasons I’m concerned with the tension between potential collinearity and random sampling in cases where two or more variables correlate in a population. Theoretically, with a large random sample, I should be able to model the random effects of item in the same way I could model the random effects of subject in a between-subjects experiment. But I feel caught between a rock and a hard place when on the one hand, a random sample of words would almost certainly be collinear in the variables of interest, but on the other hand, sampling rules (such as “generate a large number of potential samples and keep the one that is the least collinear”) undermine the ability to treat item as an actual random effect.

If you’d like, I would find it quite helpful to hear how you address this issue in the sampling of participants for your own research. Let’s say you were interested in teasing apart the effects of two correlated variables – education and median income – on some sort of political attitude. Would you prefer to sample randomly and just deal with the collinearity, or constrain your sample such that recruited participants had orthogonal education and median income factors? How much constraint would you accept on your sample before you start to worry about generalization (i.e. worry that you are simply measuring the fixed effect of different specific individuals), and is there any way to measure what effect your constraints have on your statistical inferences/tests?

Stan NYC Meetup – Thurs, July 31

The next Stan NYC meetup is happening on Thursday, July 31, at 7 pm. If you’re interested, registration is required and closes on Wednesday night.


The third session will focus on using the Stan language. If you’re bringing a laptop, please come with RStan, PyStan, or CmdStan already installed.


We’re going to focus less on the math and more on the usage of the Stan language. We’ll cover:

• Stan language blocks

• Data types

• Sampling statements

• Vectorization


On deck this week

Mon: A linguist has a question about sampling when the goal is causal inference from observational data

Tues: The Ben Geen case: Did a naive interpretation of a cluster of cases send an innocent nurse to prison until 2035?

Wed: Statistics and data science, again

Thurs: The health policy innovation center: how best to move from pilot studies to large-scale practice?

Fri: The “scientific surprise” two-step

Sat, Sun: As Chris Hedges would say: It’s too hot for words

Stan 2.4, New and Improved

We’re happy to announce that all three interfaces (CmdStan, PyStan, and RStan) are up and ready to go for Stan 2.4. As usual, you can find full instructions for installation on the Stan website.

Here are the release notes with a list of what’s new and improved:

New Features
* L-BFGS optimization (now the default)
* completed higher-order autodiff (added all probability functions,
  matrix functions, and matrix operators);  tested up to 3rd order
* enhanced effective range of normal_cdf to prevent underflow/overflow
* added von Mises RNG
* added ability to use scalars in all element-wise operations
* allow matrix division for mixing scalars and matrices 
* vectorization of outcome variates in multivariate normal with efficiency boosts
* generalization of multivariate normal to allow row vectors as means

* move bin/print and bin/stanc to CmdStan;  no longer generating main
  when compiling model from Stan C++

New Developer
* Added Jeffrey Arnold as core Stan developer
* Added Mitzi Morris as core Stan developer

Bug Fixes
* modified error messages so that they're all 1-indexed instead of 0-indexed
* fixed double print out of times in commands
* const added to iterators to allow VS2008 compiles
* fix boundary conditions on ordered tests
* fix for pow as ^ syntax to catch illegal use of vectors (which
  aren't supported)
* allow zero-length inputs to multi_normal and multi_student_t
  with appropriate log prob (i.e., 0)
* fixed bug in inverse-Wishart RNG to match MCMCPack results
  with slightly asymmetric inputs
* fixed problem with compiling user-defined function twice
* fixed problem with int-only parameters for user-defined functions
* fixed NaN init problems for user-defined functions
* added check that user variable doesn't conflict with user function + doc
* disallow void argument types in user-defined functions

Code Cleanup and Efficiency Improvements
* removed main() from models generated from C++ Stan (they are
  now available only in CmdStan); removed no_main command options
* reserve vector sizes for saving for sample recorder
* removing many instances of std::cout from API (more still to go)
* removed non-functional Nesterov optimization option
* optimization code refactoring for testing ease
* better constant handling in von Mises distribution
* removed tabs from all source files
* massive re-org of testing to remove redundant files and allow
  traits-based specializations, plus fixes for 1-indexing

* added tests for log_softmax, multiply_lower_tri_self_transpose, tcrossprod
* break out function signature tests into individual files, add many
* enhanced cholesky factor tests for round trip transforms and
* extensive unit testing added for optimization 
* remove use of std::cout in all tests

Example Models
* lots of cleanup in links and models in ARM examples
* added BUGS litter example with more stable priors than in the
  BUGS version (the original model doesn't fit well in BUGS as is)

* add infix operators to manual
* categorical_logit sampling statement	
* Cholesky factor with unit diagonal transform example
* example of using linear regression for prediction/forecasting with
* clarified some relations of naive Bayes to clustering
  vs. classification and relation to non-identifiability
* new advice on multivariate priors for quad_form_diag
* fix typo in multiply_lower_self_transpose (thanks to Alexey Stukalov)
* fix formatting of reserved names in manual
* fixed typo and clarified effective sample size doc

Stan found using directed search


X and I did some “Sampling Through Adaptive Neighborhoods” ourselves the other day and checked out the nearby grave of Stanislaw Ulam, who is buried with his wife, Françoise Aron, and others of her family.


The above image of Stanislaw and Françoise Ulam comes from this charming mini-biography by Roland Brasseur, which I found here.

P.S. The Cimetière Montparnasse is full of famous people, including the great political scientist Raymond Aron (who is perhaps no close relation of Françoise Aron, given that his grave is listed as being in a different block of the cemetery), but the only other one we saw that day was Henri Poincaré.

NYC workshop 22 Aug on open source machine learning systems

The workshop is organized by John Langford (Microsoft Research NYC), along with Alekh Agarwal and Alina Beygelzimer, and it features Liblinear, Vowpal Wabbit, Torch, Theano, and . . . you guessed it . . . Stan!

Here’s the current program:

8:55am: Introduction
9:00am: Liblinear by CJ Lin.
9:30am: Vowpal Wabbit and Learning to Search (John Langford and Hal Daumé III)
10:00am: Break w/ refreshments
10:30am: Torch
11:00am: Theano and PyLearn (Bart van Merrienboer and Frederic Bastien)
11:30am: Stan (Bob Carpenter)
Noon: Lunch on your own
1:00pm: Summary: The systems as they exist
1:20pm: Panel: Can projects coordinate?
2:00pm: Break w/ refreshments
2:30pm: Discussion time for individual projects
4:00pm: Liblinear future plans
4:10pm: Vowpal Wabbit future plans
4:20pm: Torch future plans
4:30pm: Theano and PyLearn future plans
4:40pm: Stan future plans
4:50pm: Conclusion

It’s open to all (I think) as long as you let them know you’ll be coming; just email the organizers.

“An Experience with a Registered Replication Project”

Anne Pier Salverda writes:

I came across this blog entry, “An Experience with a Registered Replication Project,” and thought that you would find this interesting.

It’s written by Simone Schnall, a social psychologist who is the first author of an oft-cited Psych Science(!) paper (“Cleanliness reduces the severity of moral judgments”) that a group of researchers from Michigan State failed to replicate as part of a big replication project.

Schnall writes about her experience as the subject of a “failed replication.” I like the tone of her blog entry. She discusses some issues she has with how the replication of her work, and the publication of that work in a special issue of a journal, were handled. A lot of what she writes is very reasonable.

Schnall believes that her finding did not replicate because there were ceiling effects in a lot of the items in the (direct) replication of her study. So far so good; people can make mistakes in analyzing data from replication studies too. But then she writes the following:

It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.

She may be talking about the peer-review process here; it’s not entirely clear to me from the context. But what am I to think of her own data-analysis skills and strategies if it takes her “considerable time and effort” to discover the ceiling effects in the data? Looking at the distribution of variables is the very first thing that any researcher should do when they start analyzing their data. Step 1: exploratory data analysis! If she’s right about how this aspect of the data accounts for the failure to replicate her original finding, that of course makes for an interesting story—we need to be critical of failed replications too. But the thought that there may be a substantial number of scientists who don’t look in detail at basic aspects of their data before they start working with aggregated data makes me shake my head in disbelief.
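The kind of item-level first look Salverda has in mind can be sketched in a few lines. The ratings below are simulated on a hypothetical 1–9 scale (nothing here comes from Schnall’s actual items), and the 25% flag threshold is an arbitrary choice for the example.

```python
import random

random.seed(7)
SCALE_MAX = 9  # top of a hypothetical 1-9 judgment scale

# Simulated item-level ratings: some items pile up at the top of the scale.
items = {}
for i in range(10):
    shift = 6 if i < 4 else 0  # first few items sit near the ceiling
    items[f"item_{i}"] = [
        min(SCALE_MAX, max(1, round(random.gauss(3 + shift, 1.5))))
        for _ in range(50)
    ]

# Flag items where a large share of responses sit at the scale maximum.
# Averaging across items first would hide exactly this pattern.
flagged = {name: sum(r == SCALE_MAX for r in ratings) / len(ratings)
           for name, ratings in items.items()}
for name, share in flagged.items():
    if share > 0.25:
        print(f"{name}: {share:.0%} of responses at ceiling")
```

A histogram per item would show the same thing; the point is only that the check happens before aggregation, not after.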

My reply:

On the details of the ceiling effect, all I can say is that I’ve made a lot of mistakes in data analysis myself, so if it really took Schnall longer than it should’ve to discover this aspect of her data, I wouldn’t be so hard on her. Exploratory analysis is always a good idea but it’s still easy to miss things.

But speaking more generally about the issues of scientific communication, I disagree with Schnall’s implication that authors of published papers should have some special privileges regarding the discussion of their work. Think about all the researchers who did studies where they made no dramatic claims, found no statistical significance, and then didn’t get published. Why don’t they get special treatment too? I think a big problem with the current system is that it puts published work on a plateau where it is difficult to dislodge. For example, “That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to.” My response to this is that, in many areas of research, peer reviewers do not seem to be deserving of that trust. Peer review often seems to be a matter of researchers approving other papers in their subfield, and accepting claims of statistical significance despite all the problems of multiple comparisons and implausible effect size estimates.

Schnall quotes Daniel Kahneman, who writes that if replicators make no attempt to work with authors of the original work, “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.” I have mixed feelings about this. Maybe Kahneman is correct about replication per se, where there is a well-defined goal of reproducing the original study, but I don’t think it should apply more generally to criticism.

Regarding the specifics of her blog post, she has a lot of discussion about how the replication is different from her study. This is all fine, but the flip side is: why should her original study be considered representative of the general population? Once you accept that effect sizes vary, this calls into question the paradigm of making quite general claims based on a study of a particular sample. Finally, I was disappointed that nowhere in her blog post does she consider the possibility that maybe, just maybe, her findings were spurious. She writes, “I have worked in this area for almost 20 years and am confident that my results are robust.” The problem is that such confidence can be self-reinforcing: once you think you’ve found something, you can keep finding it. The rules of statistical significance give a researcher enough “outs” that he or she can persist in a dead-end research paradigm for a long time. Just to be clear, I’m not saying that’s what’s happening here—I know nothing about Schnall’s work—but it can happen. Consider, for example, the work of Satoshi Kanazawa.

That said, I understand and sympathize with Schnall’s annoyance at some of the criticisms she’s received. I’ve had similar feelings about criticisms of my own work: sometimes people point out actual and serious errors, sometimes mistakes of over-generalization I’ve made, other times ways I’ve misconstrued the literature. Just recently my colleagues and I made some changes in this paper because someone pointed out some small ways in which it was misleading. Luckily he told us while the paper was still being revised for publication. Other times I learn about problems later and need to issue a correction. But sometimes I do get misinformed criticisms, even published papers where someone misunderstands my work and slams it. And I don’t like it. So I see where she’s coming from.

Again, from a statistical perspective, my biggest problem with what Schnall writes is that she doesn’t seem to consider, at all, the possibility that her original results could be spurious and arise from some combination of measurement error and capitalization on noise which, looked at a certain way, might include some statistically significant comparisons. I worry that she is thinking in a binary way: her earlier paper was a success, the replication was a failure, so she seeks to explain the differences. There are certainly differences between any two experiments on the same general phenomenon—that’s why we do meta-analysis—but, as Kahneman and Tversky demonstrated four decades ago, researchers tend to systematically underestimate the importance of variation in interpreting experimental results.

Following up, Salverda writes:

In the meantime, I came across two papers. One is by Schnall (in press, I believe), a response to the paper that failed to replicate her work.

And the people who failed to replicate her work [David Johnson, Felix Cheung, and Brent Donnellan] wrote a paper that addresses her criticism.

If it was good enough for Martin Luther King and Laurence Tribe . . .

People keep pointing me to this.

P.S. I miss the old days when people would point me to bad graphs.