
Garrison Keillor would be spinning etc.

Under the subject line, “Misleading Graphs of the Week,” Bill Jefferys sends along this:


I agreed with Bill’s colleague Helen Read, who wondered why the 90th percentile should be some magic number. Just change it to 85% or 95% or whatever and all the graphs will look different. Also kinda horrible that they’re presenting percentages to 2 decimal places, but that’s not a graphics issue, it’s just plain old innumeracy.

In Bayesian regression, it’s easy to account for measurement error

Mikhail Balyasin writes:

I have come across this paper by Jacob Westfall and Tal Yarkoni, “Statistically Controlling for Confounding Constructs Is Harder than You Think.” I think it talks about very similar issues you raise on your blog, but in this case they advise to use SEM [structural equation models] to control for confounding constructs. In fact, in relation to Bayesian models, they have this to say: “…Nor is the problem restricted to frequentist approaches, as the same issues would arise for Bayesian models that fail to explicitly account for measurement error.”

So, I would be very interested to hear from you how one would account for measurement error in a Bayesian setting and whether this claim is true. I’ve tried to search your blog for something similar, but couldn’t find anything.

My reply: The funny thing is, I could’ve sworn that someone pointed me to this article already and that I’d blogged on it. But I searched the blog, including forthcoming posts, and found nothing, so here we go again.

1. The paper seems to be all about type 1 error rates, and I have no interest in type 1 error rates (see, for example, this paper with Jennifer and Masanao), so I’ll tune all that out. That said, I respect Westfall and Yarkoni, so I’m guessing their method makes some sense, and maybe it could even be translated into my language if anyone wanted to do so.

2. You can account for measurement error directly in Bayesian inference by just putting the measurement error model directly into the posterior distribution. For example, the Stan manual has a chapter on measurement error models and meta-analysis which begins:

Most quantities used in statistical models arise from measurements. Most of these measurements are taken with some error. When the measurement error is small relative to the quantity being measured, its effect on a model is usually small. When measurement error is large relative to the quantity being measured, or when very precise relations can be estimated between measured quantities, it is useful to introduce an explicit model of measurement error. One kind of measurement error is rounding.

Meta-analysis plays out statistically very much like measurement error models, where the inferences drawn from multiple data sets are combined to do inference over all of them. Inferences for each data set are treated as providing a kind of measurement error with respect to true parameter values.

A Bayesian approach to measurement error can be formulated directly by treating the true quantities being measured as missing data (Clayton, 1992; Richardson and Gilks, 1993). This requires a model of how the measurements are derived from the true values.

And then it continues with some Stan models like this:

[Stan measurement error model, shown in the original post as a screenshot from the manual]
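For readers who don’t have the manual handy, the measurement-error regression from that chapter looks something like the following sketch (variable names follow the manual’s convention, with x_meas the noisy measurement of the true covariate x and tau the known measurement-noise scale; this is my transcription, not a verbatim copy):

```stan
data {
  int<lower=0> N;          // number of observations
  vector[N] x_meas;        // noisy measurements of x
  real<lower=0> tau;       // known measurement noise scale
  vector[N] y;             // outcome
}
parameters {
  vector[N] x;             // unknown true covariate values
  real mu_x;               // prior location for x
  real<lower=0> sigma_x;   // prior scale for x
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  x ~ normal(mu_x, sigma_x);            // prior on the true values
  x_meas ~ normal(x, tau);              // measurement model
  y ~ normal(alpha + beta * x, sigma);  // regression on the true values
}
```

The key move is that the true values x are just more parameters: the measurement model connects them to the data, and everything gets averaged over in the posterior.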

3. To get back to the Westfall and Yarkoni paper, which argues that throwing in a new predictor doesn’t help as much as you might expect: This should pop out directly from the Bayesian model, in that if measurement error is large you will do a lot of partial pooling.

How paracompact is that?


Dominic on stan-users writes:

I was reading through and came across a term with which I was not familiar: “paracompact.” I wrote a short blog post about it: it may be of interest to other folks reading the aforementioned paper. I would have used a partition of unity to justify the corollary myself, but now I understand paracompactness.

And Betancourt replied:

The relevance of paracompactness (and how it follows immediately from local compactness and second-countability) is discussed in Lee’s “Smooth Manifolds.” That is by far the best reference on differential geometry that I have come across.

I don’t know what they’re talking about but I thought it might interest some of you.

Fast CAR: Two weird tricks for fast conditional autoregressive models in Stan


Max Joseph writes:

Conditional autoregressive (CAR) models are popular as prior distributions for spatial random effects with areal spatial data. Historically, MCMC algorithms for CAR models have benefitted from efficient Gibbs sampling via full conditional distributions for the spatial random effects. But, these conditional specifications do not work in Stan, where the joint density needs to be specified (up to a multiplicative constant).

CAR models can still be implemented in Stan by specifying a multivariate normal prior on the spatial random effects, parameterized by a mean vector and a precision matrix. This works, but is slow and hard to scale to large datasets.
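The slow-but-direct formulation Max describes is a zero-mean multivariate normal with precision matrix tau * (D - alpha * W), where W is the binary adjacency matrix and D the diagonal matrix of neighbor counts. A minimal sketch (my own simplified version, not the case study’s exact code):

```stan
data {
  int<lower=1> n;
  matrix<lower=0, upper=1>[n, n] W;  // adjacency matrix (0/1, zero diagonal)
}
transformed data {
  vector[n] zeros = rep_vector(0, n);
  matrix[n, n] D = diag_matrix(W * rep_vector(1, n));  // degree matrix
}
parameters {
  vector[n] phi;                  // spatial random effects
  real<lower=0> tau;              // precision scale
  real<lower=0, upper=1> alpha;   // spatial dependence
}
model {
  // proper CAR prior: dense precision matrix, expensive for large n
  phi ~ multi_normal_prec(zeros, tau * (D - alpha * W));
}
```

Each evaluation of this density involves a dense n-by-n factorization, which is exactly what the sparse tricks below avoid.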

Order(s) of magnitude speedups can be achieved by combining 1) sparse matrix multiplications from Kyle Foreman (outlined on the stan-users mailing list), and 2) a fancy determinant trick from Jin, Xiaoping, Bradley P. Carlin, and Sudipto Banerjee. “Generalized hierarchical multivariate CAR models for areal data.” Biometrics 61.4 (2005): 950-961.

With the oft-used Scotland lip cancer dataset, the sparse CAR implementation with the NUTS (No-U-Turn Sampler) algorithm in Stan gives 120 effective samples/sec compared to 7 effective samples/sec for the precision matrix implementation.

Details for these sparse exact methods can be found here.

Max Joseph is part of the Earth Lab Analytics Hub, University of Colorado – Boulder.

Bob Carpenter adds:

I put Max Joseph’s case study up on our web page. The top-level case studies page is here (with license and links to package dependencies and the case study).

The direct link to the case study is here.

Very cool!

Cool indeed. With this set-up, the implementation of such models has moved from “statistics research” to “statistics practice.”

P.S. I took a look at the document and I have a few questions/comments:

1. The map of Scotland is distorted. No big deal but might as well get that right.

2. I can’t believe that the “tau ~ gamma(0.5, .0005);” prior is a good idea. It just looks weird to me. At the very least I’d suggest a reparameterization to make it scale-free.
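One way to make it scale-free (my sketch of the kind of reparameterization Andrew might have in mind, not his specific recommendation) is to parameterize by the standard deviation of the spatial effects and give that a generic weakly informative prior:

```stan
parameters {
  real<lower=0> sigma_phi;  // sd of the spatial effects
}
transformed parameters {
  real tau = inv_square(sigma_phi);  // precision, if downstream code needs it
}
model {
  sigma_phi ~ normal(0, 1);  // half-normal by the constraint; rescale to the outcome
}
```

The half-normal scale here should be set relative to the scale of the data, which is the sense in which the prior becomes scale-free rather than depending on arbitrary gamma hyperparameters.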

3. Is there a way to package up all that code into a function? I don’t have enough experience with Stan functions to be sure, but the idea would be to have a CAR model with all that code inside the function so that, as a user, I could just say theta ~ CAR(…) with all the parameters, and then I wouldn’t have to worry about all those matrix manipulations.
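Something along these lines should be possible with Stan’s user-defined functions: a function whose name ends in _lpdf can be used with the ~ notation. A hypothetical, untested wrapper for the dense version (the sparse version would carry more arguments) might look like:

```stan
functions {
  // hypothetical wrapper for a proper CAR prior; W is the adjacency
  // matrix, D the diagonal degree matrix. This wraps the slow dense
  // formulation, not the sparse trick from the case study.
  real car_lpdf(vector phi, real tau, real alpha, matrix W, matrix D) {
    int n = rows(phi);
    return multi_normal_prec_lpdf(phi | rep_vector(0, n),
                                  tau * (D - alpha * W));
  }
}
```

With that in place, the model block could just say phi ~ car(tau, alpha, W, D); and the matrix manipulations would be hidden from the user.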

A four-way conversation on weighting and regression for causal inference

It started with a project that Sharad Goel is doing, comparing decisions of judges in an urban court system. Sharad was talking with Avi Feller, Art Owen, and me about estimating the effect of a certain decision option that judges have, controlling for pre-treatment differences between defendants.


I’m interested in what that data shows about the relative skill of the numerous judges.

I expect some are stricter and some are more lenient. At any level of strictness it would be interesting to see whether any are especially good at their decision. This would be one way to estimate whether ‘u’ [the hypothetical unobserved pretreatment covariates] matters. Probably a lot of the judges think they’re pretty good. Based on similar ratings of surgeons, I’d expect a bunch that you can’t tell apart, a few that are quite a bit worse than most, and just maybe, a few that are quite a bit better than most.

The sanity check I would apply to models is to resample the judges.

I have not done much with propensity scoring. I’m intrigued by the thought that it is not properly Bayesian. My first thought is that there should be a way to reconcile these things.


I agree that this is an interesting question. We’ve started looking at this recently, but the complication is that judge assignment isn’t completely random, and it does appear that some judges do indeed see higher risk defendants.

I’m still trying to understand if propensity scores are ever really preferable to the straightforward outcome regression. Andrew, am I correct in thinking that you would say “no, just do the regression”? One thing I find attractive about propensity scores is that then I can look at the balance plots, which gives me some confidence that we can estimate the treatment probabilities reasonably well. And at that point, I feel like it’s natural to use the IPW estimator (or the doubly robust estimator). But perhaps I should just interpret the balance plots as evidence that the outcome regression is also ok?


Yes, if the variables that predict the decision are included in the regression model, then I’d say you’ve got ignorability, so just fit the damn model. The propensity scores are relevant to robustness. And you can make the balance plots even without doing any weighting.


That makes sense. But now I’m wondering why more people don’t look at the balance plots to check robustness of the outcome regression. I feel like I only see balance plots when people are actually using the propensity scores for something, like matching or IPW. Perhaps this is a common thing to do, and I’ve just missed that literature…?

Now Avi weighs in. Of all of us, Avi’s the only expert on causal inference. Here he goes:

A few thoughts on propensity scores and all this.

First, the (now classic) Kang and Schafer paper on “de-mystifying double robustness” is here. The simulation studies from this paper sparked a robust debate (sorry for the pun) in the causal inference literature. But I think that the bottom line is that there’s no magic bullet here—re-weighting estimators do better in some cases and regression-type estimators do better in some cases (of course, you can think of regression as a type of re-weighting estimator). In practice, with large samples and so long as the estimated propensity scores aren’t “too extreme,” then regression, IPW, and AIPW (i.e., double robust) estimates should all be in the same ball park. Thus, it’s reassuring—if not surprising—that you find similar results with these three approaches.

For what it’s worth, Andy’s view of “just fit the damn model” is not the majority view among causal inference researchers. Personally, I prefer matching + regression (or post-stratifying on the propensity score + regression), which is generally in line with Don Rubin. The inimitable Jennifer Hill, for example, usually jumps straight to IPW (though you should confirm that with her). Guido Imbens has tried a bunch of things (he has a paper showing some good properties of double robust estimators in randomized trials, for example).

In general, I find “global” recommendations here misplaced, since it will depend a lot on your context and audience. And trust me, there are a lot of recommendations like that! Some people say you should never do matching, others say you should never do weighting; some say you should always be “doubly robust,” others say you should never be doubly robust; and so on…

As for balance checks: I agree that this is a terrific idea! You can check out the Imbens and Rubin textbook for a fairly in-depth discussion of some of the issues here. In your applied setting (and assuming that you’re still doing all three analyses), I like the idea of doing balance checks for the entire data set, for the re-weighted data set, and separately by stratum (i.e., deciles of the propensity score). You can get much fancier, but that seems like a sensible starting point.

“Double robustness” has never been mystifying to me, perhaps because it came up in our 1990 paper on estimating the incumbency advantage, where Gary and I thought hard about the different assumptions of the model, and about what assumptions were required for our method to work.

And now to get back to the discussion:


I’m curious what you guys think of entropy balancing: reweight the data in order to attain balance on the covariates. The paper is by Zhao and Percival, following up on Hainmueller.

They use entropy. I’d have probably used an empirical likelihood or worked out a variance favorable criterion (possibly allowing negative weights).


To me it seems like a bunch of hype. I’m fine with matching for the purpose of discarding data points that are not in the zone of overlap (as discussed in chapter 10 of my book with Jennifer) and I understand the rationale for poststratifying on propensity score (even though I’m a bit skeptical of that approach), but these fancy weighting schemes just seem bogus to me, I don’t see them doing anything for the real goal of estimating treatment effects.


That seems pretty harsh. Can you parse ‘hype’ and ‘bogus’ for me?

Hype might mean that their method is essentially the same as something older, and you think they’re just stepping in front of somebody else’s parade.

But bogus seems to indicate that the method will lead people to wrong conclusions, either wrong math (unlikely) or wrong connection to reality.


I will defer to Avi on the details, but my general impression of these methods is that they are designed to get really good matching on the weighted means of the pretreatment variables. I really don’t see the point, though, as I see matching as a way to remove some data points that are outside the region of overlap.

To put it another way, I think of these weighting methods as optimizing the wrong thing.

The “hype” comes because I feel like “genetic matching,” “entropy balancing,” etc, are high-tech ways of solving a problem that doesn’t need to be solved. It seems like hype to me not because they’re ripping anyone off, but because it seems unnecessary. Kinda like sneakers with microchips that are supposed to tune the rubber to make you jump higher.

But, sure, that’s too strong a judgment. These methods aren’t useful to _me_, but they can be useful to many people. In particular, if for some reason a researcher has decided ahead of time to simply compare the two groups via weighted averages—that is, he or she has decided _not_ to run a regression controlling for the variables that went into the propensity score—then, sure, it makes sense to weight to get the best possible balance.

Since I’d run the regression anyway, I can’t do anything with the weights. Running weighted regression will just increase my standard errors and add noise to the system. Yes the weights can give some robustness but most of that robustness is coming from excluding points outside the region of overlap.

That said, regression can be a lot of work. Jennifer Hill has that paper where they used BART, and it was a lot of work. I’d typically just do the usual linear or logistic regression with main effects and interactions. So in practice I’m sure there are problems where weighting would outperform whatever I’d do. I’m just skeptical about big machinery going into weighting because, as Avi said, the big thing is the ignorability assumption anyway.


One quick plug for the upcoming Atlantic Causal Inference Conference in NYC: we’ll be hosting a short “debate” between Tyler VanderWeele and Mark van der Laan (the “vanderBate”). Mark argues that we should really focus on the estimation method we use in these settings—double-robustness and machine learning-based approaches (like TMLE), he believes, are strongly preferable to parametric regression models. Tyler, by contrast, argues that what really matters in all of this is the ignorability assumption and that we should be focused much more on questions of sensitivity. As you might imagine, I’m very much on Tyler’s side here.


My takeaway is that in practice it makes sense to just try both approaches (“fit the damn model” + your favorite weighting scheme), and check that the answer doesn’t depend too much on the method. If it does, then I guess you’d have to think a bit more carefully about which method is preferred, but if it doesn’t then it’s just one less thing to worry about….

Is there any way to check if the ignorability assumption is reasonable? For the bail problem, do we just have to assert that it’s unlikely a judge can glean much useful information by staring into a defendant’s eyes, or is there a more compelling argument to make?


Ignorability is an assumption but it can be possible to quantify departures from ignorability. The idea is to make predictions of distributions under the model and have some continuous nonignorability parameter (that’s 0 if ignorable, + if selection bias in one way, – if selection bias in another way). Obv this 1-parameter model can’t capture all aspects of nonignorability but you might be able to have it capture the departures of particular concern. Anyway, once you have this, you can make inferences under different values of this parameter and you can assess whether the inferences make sense. In your example below, the idea would be to model how much information the judge could plausibly learn from the defendant’s eyes, over and above any info in the public record.
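To make that concrete, here’s a toy sketch (entirely mine, not anything from the thread): treat the nonignorability parameter as fixed data, refit over a grid of values, and see how much the treatment-effect estimate moves.

```stan
data {
  int<lower=0> N;
  vector[N] y;       // outcome
  vector[N] z;       // binary treatment indicator
  vector[N] x;       // observed pre-treatment covariate
  real delta;        // fixed sensitivity parameter: 0 = ignorable,
                     // +/- = assumed selection bias in either direction
}
parameters {
  real alpha;
  real theta;        // treatment effect
  real beta;
  real<lower=0> sigma;
}
model {
  // delta enters as an assumed, unestimable confounding offset among
  // the treated; with delta = 0 this is the usual ignorable regression
  y ~ normal(alpha + theta * z + beta * x + delta * z, sigma);
}
```

Sweeping delta over plausible values and examining the posterior for theta at each shows how sensitive the conclusion is to departures from ignorability.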


There’s a massive literature on this sort of thing. Some immediate suggestions:

Seminal paper from Rosenbaum and Rubin here

Guido Imbens’ version here

Paul Rosenbaum’s textbooks (though these are all randomization based), here

Recent work from my long-time collaborator Peng Ding, here

Happy to suggest more. I’m not really doing justice to the biostats side.

Lots of good stuff here. Let me emphasize that the fact that I consider some methods to be “hype” should not be taken to imply that I think they should never be used. I say this for two reasons. First, a method can be hyped and still be good. Bayesian methods could be said to be hyped, after all! Second, I have a lot of respect for methods that people use on real problems. Even if I think a method is not optimal, it might be the best thing standing for various problems.

Participate in this cool experiment about online privacy

Sharad Goel writes:

We just launched an experiment about online privacy, and I was wondering if you could post this on your blog.

In a nutshell, people upload their browsing history, which we then fingerprint and compare to the profiles of hundreds of millions of Twitter users to find a match. Browsing history is something ad networks and others can collect without your explicit consent, so we’re trying to understand how much information leakage there is. The experiment is a little creepy but also kind of fun, I think…!

Sharad and I have collaborated on some projects (though I’m not involved in this one) and just about everything he does is great. He’s full of brilliant ideas.

Take that, Bruno Frey! Pharma company busts through Arrow’s theorem, sets new record!

I will tell a story and then ask a question.

The story: “Thousands of Americans are alive today because they were luckily selected to be in the placebo arm of the study”

Paul Alper writes:

As far as I can tell, you have never written about Tambocor (flecainide) and the so-called CAST study. A locally prominent statistician loaned me the 1995 book by Thomas J. Moore, Deadly Medicine: Why tens of thousands of heart patients died in America’s worst drug disaster. Quite an eyeopener on many fronts, but I found some tangential goodies. From page 61:

Scientific articles routinely list so many coauthors that an unwritten code usually determines the order in which the names appear. The doctor who did the most work and probably wrote the article appears as the first-named author.

The unwritten code also provides that the last-named author is the “senior author.”

From page 62:

The authors of the Tambocor study apparently evaded this problem [refusal of journals to accept duplication] by submitting their manuscript simultaneously to three different journals [JAMA, Circulation, and American Heart Journal ].

From page 63:

In all, 3M succeeded in publishing the same study six times. [Seems like a violation of Arrow’s theorem. — ed.]

As a medical doctor once pointed out, thousands of Americans are alive today because they were luckily selected to be in the placebo arm of the study.

I was curious so I looked up Flecainide on Wikipedia and found that it’s an antiarrhythmic agent. Hey, I’ve had arrhythmia! Also the drug remains in use. The Wikipedia entry didn’t mention any scandal; it just said that the results of the Cardiac Arrhythmia Suppression Trial (CAST) “were so significant that the trial was stopped early and preliminary results were published.” I followed the link which reports that “the study found that the tested drugs increased mortality instead of lowering it as was expected”:

Total mortality was significantly higher with both encainide and flecainide at a mean follow-up period of ten months. Within about two years after enrollment, encainide and flecainide were discontinued because of increased mortality and sudden cardiac death. CAST II compared moracizine to placebo but was also stopped because of early (within two weeks) cardiac death in the moracizine group, and long-term survival seemed highly unlikely. The excess mortality was attributed to proarrhythmic effects of the agents.

Alper adds more info from here:

From page 200, “Status on Sept. 1, 1998” with X and Y not being named (as it turned out, placebo and treatment, respectively)

                X (placebo)   Y (treatment)
Sudden death              3              19
Total patients          576             571

But the drug is still in use, I guess it’s believed to help for some people. An interesting example of a varying treatment effect, indicating problems with the traditional statistical paradigm of estimating a constant or average effect.

The question: How to think about this?

The above story looks pretty bad. On the other hand, thousands of new drugs get tried out, some of them help, it stands to reason that some of them will hurt and even kill people too. So maybe this sort of negative study is an inevitable consequence of a useful process?

If anyone tried to bury the negative data, sure, that’s just evilicious. But if they legitimately thought the drug might work, and then it turned out to kill people, them’s the breaks, right? Nothing unethical at all, prospectively speaking.

And if you publish your negative results 6 times, that shows a real commitment to correcting the record!

Why 2016 is not like 1964 and 1972

Nadia Hassan writes:

I saw your article in Slate. For what it’s worth, this new article, “Ideologically Extreme Candidates in U.S. Presidential Elections, 1948–2012,” by Marty Cohen, Mary McGrath, Peter Aronow, and John Zaller, looks at ideology-based extremism and finds weak effects of ideology. Like the high end is 1980 and the authors estimate Carter got ~1.4 points of vote share from being appreciably closer to the median voter.

This seems consistent with what Rosenstone wrote in his classic 1983 book, Forecasting Presidential Elections, and what I wrote in my Slate article. It’s good to see the argument updated and presented in more detail.

Hassan continues:

Two other factors may have mattered in 1964 and 1972: approval and incumbent-party tenure. LBJ had a sky-high approval rating, perhaps bolstered somewhat by JFK’s death. The other is time in office. Parties seem to do better in their first term in office than in their second or later. LBJ and Nixon got a couple of points from that. In 1956, Eisenhower had economic performance slightly weaker than in 2004, but he still won by double digits; his approval rating was over 60%, compared to Bush’s in the high 40s.

Yes, good points. Also since I’m linking here, let me just say that I don’t like the headline Slate gave to my article, “Trump-Clinton Won’t Be a Landslide.” The subheading, “Conventional wisdom is that fringe candidates get repudiated, à la 1964 and 1972. The story isn’t so simple,” is fine. But I don’t like going on the record with a deterministic prediction—especially a prediction that I never made!

My article’s fine, though, it’s just the headline that bothered me.

Graph too clever by half

Mike Carniello writes:

I wondered what you make of this.

I pay for the NYT online and tablet – but not paper, so I don’t know how they’re representing this content in two dimensions.

I’ve paged through the thing a couple of times, and I’m not sure how useful it is. It seems like a series of figures in a one-column, many-row display might have worked as well (or, perhaps, two columns, differentiating oil producer groups).

I agree, I found it disconcerting that the axes start changing as I scroll down!

Publication bias occurs within as well as between projects

Kent Holsinger points to this post by Kevin Drum entitled, “Publication Bias Is Boring. You Should Care About It Anyway,” and writes:

I am an evolutionary biologist, not a psychologist, but this article describes a disturbing scenario concerning oxytocin research that seems plausible. It is also relevant to the reproducibility/publishing issues you have been discussing recently on your blog.

Drum writes:

You all know about publication bias, don’t you? Sure you do. It’s the tendency to publish research that has bold, affirmative results and ignore research that concludes there’s nothing going on. This can happen two ways. First, it can be the researchers themselves who do it. In some cases that’s fine: the data just doesn’t amount to anything, so there’s nothing to write up. In other cases, it’s less fine: the data contradicts previous results, so you decide not to write it up. . . .

This is just fine but I want to emphasize that publication bias is not just about the “file drawer effect,” it’s not just about positive findings being published and zero or negative findings remaining unpublished. It’s also that, within any project, there are so many different results that researchers can decide what to focus on.

So, yes, sometimes a research team will try an idea and it won’t work and they won’t bother writing it up. Just one more dry hole—but if only the successes are written up and published, we will get a misleading view of reality: we’re seeing a nonrandom sample of results. But it’s more than that. Any study contains within itself so many possibilities that often something can be published that appears to be consistent with some vague theory. Embodied cognition, anyone?

This “garden of forking paths” is important because it shows how publication bias can occur, even if every study is published and there’s nothing in the file drawer.

Evaluating election forecasts

Nadia Hassan writes:

Nate Silver did a review of pre-election predictions from forecasting models in 2012. The overall results were not great, but many scholars noted that some models seemed to do quite well. You mentioned that you were interested in how top-notch models fare.

Nate agreed that some were better, but he raised the question of lucky vs. good with forecasters:
“Some people beat Vegas at roulette on any given evening. Some investors beat the stock market in any given month/quarter/year, and yet there is (relatively) little evidence of persistent stock-picking skill, etc, etc.”

The other thing is you did a paper with Wang on the limits of predictive accuracy. Many election models are linear regressions, but the point seems pertinent.

Election forecasting is seen by some as a valuable opportunity to test social science theories over time. It does seem like one can go wrong by just comparing pre-election forecasts to outcomes. How can one examine predictions sensibly, considering these issues?

My reply: One way to increase N here is to look at state-by-state predictions. Here it makes sense to look at predictions for each state relative to the national average, rather than just looking at the raw prediction. To put it another way: suppose the state-level outcomes are y_1,…,y_50, and the national popular vote outcome is y_usa (a weighted average of the 50 y_j’s). Then you should evaluate the national prediction by comparing to y_usa, and you should evaluate state predictions of y_j – y_usa for each j. Otherwise you’re kinda double counting the national election and you’re not really evaluating different aspects of the prediction. You can also look at predictions of local elections, congressional elections, etc.

And always evaluate predictions on vote proportions, not just win/loss. That’s something I’ve been saying for a long long time (for example see this book review from 1993). To evaluate predictions based on win/loss is to just throw away information.

Birthdays and heat waves

I mentioned the birthdays example in a talk the other day, and Hal Varian pointed me to some research by David Lam and Jeffrey Miron, papers from the 1990s with titles like Seasonality of Births in Human Populations, The Effect of Temperature on Human Fertility, and Modeling Seasonality in Fecundability, Conceptions, and Births.

Aki and I have treated the birthdays problem as purely a problem in statistical modeling and computation and have not looked at all at the work of demographers in this area. So it was good to learn of this work.

Hal also pointed me to a recent paper, Heat Waves at Conception and Later Life Outcomes by Joshua Wilde, Bénédicte Apouey, and Toni Jung, which I looked at and don’t believe at all.

Wilde et al. report that babies born 9 months after hot weather have better educational and health outcomes as adults, and they attribute this to a selection among fetuses, by which the higher temperature conditions make fetal development more difficult so that the weaker fetuses die and it is the stronger, healthier ones that survive. As is typically the case, I’m suspicious of this sort of bank-shot explanation.

Wilde et al. talk about the causal effect of temperature but I’m guessing it can all be explained by selection effects of parents, that different sorts of people get pregnant at different times of the year, with no causal effect of temperature at all. Yes they run some regressions controlling for family characteristics but I get the impression that the purpose of those regressions was just to confirm that their primary findings were OK: As sometimes happens in this sort of robustness analysis, they weren’t looking to find anything there, and then they successfully didn’t find anything. Not what I’d call convincing. The whole thing just seems like massive overreach to me. Also seems odd for them to talk about temperature “shocks”: It’s hardly a shock that it gets warm in the summer and cold in the winter.

I’m not saying that temperature at conception can’t have any effect on fetal health; I just don’t find the particular argument in this paper at all convincing. It’s the learning-through-regression paradigm out of control.

P.S. It’s April, and it just happens that the next available day on the blog is in August. What better time to post something on the effects of heat waves?

P.P.S. See here for further discussion by Joshua Wilde, the first author of the paper I write about above.

Who owns your code and text and who can use it legally? Copyright and licensing basics for open-source

I am not a lawyer (“IANAL” in web-speak); but even if I were, you should take this with a grain of salt (same way you take everything you hear from anyone). If you want the straight dope for U.S. law, see the U.S. government Copyright FAQ; it’s surprisingly clear for government legalese.

What is copyrighted?

Computer code and written material such as books, journals, and web pages are subject to copyright law. Copyright is for the expression of an idea, not the idea itself. If you want to protect your ideas, you’ll need a patent (or to be good at keeping secrets).

Who owns copyrighted material?

In the U.S., copyright is automatically assigned to the author of any text or computer code. But if you want to sue someone for infringing your copyright, the government recommends registering the copyright. And most of the rest of the world respects U.S. copyright law.

Most employers require as part of their employment contract that copyright for works created by their employees be assigned to the employer. Although many people don’t know this, most universities require the assignment of copyright for code written by university research employees (including faculty and research scientists) to the university. Typically, universities allow the author to retain copyright for books, articles, tutorials, and other traditional written material. Web sites (especially with code) and syllabuses for courses are in a grey area.

The copyright holder may assign copyright to others. This is what authors do for non-open-access journals and books—they assign the copyright to the publisher. That means that even they may not be able to legally distribute copies of the work to other people; some journals allow crippled (non-official) versions of the works to be distributed. The National Institutes of Health require all research to be distributed openly, but they don’t require the official version to be so, so you can usually find two versions (pre-publication and official published version) of most work done under the auspices of the NIH.

What protections does copyright give you?

You can dictate who can use your work and for what. There are fair use exceptions, but I don’t understand the line between fair use and infringement (like other legal definitions, it’s all very fuzzy and subject to past and future court decisions).


For others to be able to use copyrighted text or code legally, the copyrighted material must be explicitly licensed for such use by the copyright holder. Just saying “public domain” or “this is trivial” isn’t enough. Just saying “do whatever you want with it” is in a grey area again, because it’s not a recognized license, and presumably that “whatever you want” doesn’t include claiming copyright ownership. The actual copyright holder needs to explicitly license the material.

There is a frightening degree of non-conformance among open-source contributors, largely, I suspect, due to misunderstandings of authors’ employment contracts and copyright law.

Derived works

Most of the complication from software licensing comes from so-called derived works. For example, I download open-source package A, then extend it to produce open-source package B that includes open-source package A. That’s why most licenses explicitly state what happens in these cases. The reason we don’t like the GNU General Public License (GPL) is that it restricts derived works with copyleft (forcing package B to adopt the same license, or at best a compatible one). That’s why I insisted on the BSD license for Stan—it’s maximally open in terms of what it allows others to do with the code, and it’s compatible with the GPL. R is licensed under the GPL, so we released RStan under the GPL so that users don’t have to deal with both the GPL and a second license to use RStan.

Where does Stan stand?

Columbia owns the copyright for all code written by Columbia research staff (research faculty, postdocs, and research scientists). It’s less clear (from our reading of the faculty handbook) who owns works created by Ph.D. students and teaching faculty. For non-Columbia contributions, the author (or their assignee) retains copyright for their contribution. The advantage of this distributed copyright is that ownership isn’t concentrated with one company or person; the disadvantage is that we’ll never be able to contact everyone to change licenses, etc.

The good news is that Columbia’s Tech Ventures office (the controller of software copyrights at Columbia), has given the Stan project a signed waiver that allows us to release all past and future work on Stan under open source licenses. They maintain the copyright, though, under our employment contracts (at least for the research faculty and research scientists).

For other contributors, we now require them to explicitly state who owns the copyrighted contribution and to agree that the copyright holder gives permission to license the material under the relevant license (BSD for most of Stan, GPL or MIT for some of the interfaces).

The other good news is that most universities and companies are coming around and allowing their employees to contribute to open-source projects. The GNU General Public License (GPL) is often an exception for companies, because they are afraid of its copyleft properties.


The Stan project is trying to cover our asses from being sued in the future by a putative copyright holder, though we don’t like having to deal with all this crap (pun intended).

Luckily, most universities these days seem to be opening up to open source (no, that wasn’t intended to continue the metaphor of the previous paragraph).

But what about patents?

Don’t get me started on software patents. Or patent trolls. Like copyrights, patents protect the owner of intellectual property against its illegal use by others. Unlike copyright, which is about the realization of an idea (such as a way of writing a recipe for chocolate chip cookies), patents are more abstract and are about the right to realize ideas (such as making a chocolate chip cookie in any fashion). If you need to remember one thing about patent law, it’s that a patent lets you stop others from using your patented technology—it doesn’t let you use it (your patent B may depend on some other patent A).

Or trademarks?

Like patents, trademarks prevent other people from (legally) using your intellectual property without your permission, such as building a knockoff logo or brand. Trademarks can involve names, font choices, color schemes, etc. But they tend to be limited to particular areas, so we could register a trademark for Stan (which we’re considering doing) without running afoul of the down-under Stan.

There are also unregistered trademarks, but I don’t know all the subtleties about what rights registered trademarks grant you over the unregistered ones. Hopefully, we’ll never be writing that little R in a circle above the Stan name, Stan®; even if you do register a trademark, you don’t have to use that annoying mark—it’s just there to remind people that the item in question is trademarked.

Oooh, it burns me up

If any of you are members of the Marketing Research Association, could you please contact them and ask them to change their position on this issue:

[Screenshot of the Marketing Research Association’s position statement]

I have a feeling they won’t mind if you call them at home. With an autodialer. “Pollsters now must hand-dial cellphones, at great expense,” indeed. It’s that expensive to pay people to push a few buttons, huh?

Those creepy lobbyists are so creepy. Yeah, yeah, I know they’re part of the political process, but I don’t have to like them or their puppets in Congress.

Better to just not see the sausage get made


Mike Carniello writes:

This article in the NYT leads to the full text, in which these statements are buried (no pun intended):

What is the probability that two given texts were written by the same author? This was achieved by posing an alternative null hypothesis H0 (“both texts were written by the same author”) and attempting to reject it by conducting a relevant experiment. If its outcome was unlikely (P ≤ 0.2), we rejected the H0 and concluded that the documents were written by two individuals. Alternatively, if the occurrence of H0 was probable (P > 0.2), we remained agnostic.

See the footnote to this table:

[Screenshot of the table and its footnote]

Ahhh, so horrible. The larger research claims might be correct, I have no idea. But I hate to see such crude statistical ideas being used, it’s like using a pickaxe to dig for ancient pottery.

Letters we never finished reading

I got a book in the mail attached to some publicity material that began:

Over the last several years, a different kind of science book has found a home on consumer bookshelves. Anchored by meticulous research and impeccable credentials, these books bring hard science to bear on the daily lives of the lay reader; their authors—including Malcolm Gladwell . . .

OK, then.

The book might be ok, though. I wouldn’t judge it on its publicity material.

Free workshop on Stan for pharmacometrics (Paris, 22 September 2016); preceded by (non-free) three day course on Stan for pharmacometrics

So much for one post a day…

Workshop: Stan for Pharmacometrics Day

If you are interested in a free day of Stan for pharmacometrics in Paris on 22 September 2016, see the registration page:

Julie Bertrand (statistical pharmacologist from Paris-Diderot and UCL) has finalized the program:

09:00–09:30 Registration
09:30–10:00 Bob Carpenter: Introduction to the Stan Language and Model Fitting Algorithms
10:00–10:30 Michael Betancourt: Using Stan for Bayesian Inference in PK/PD Models
10:30–11:00 Bill Gillespie: Prototype Stan Functions for Bayesian Pharmacometric Modeling
11:00–11:30 Coffee break
11:30–12:00 Sebastian Weber: Bayesian popPK for Pediatrics – bridging from adults to pediatrics
12:00–12:30 Solene Desmee: Using Stan for individual dynamic prediction of the risk of death in nonlinear joint models: Application to PSA kinetics and survival in metastatic prostate cancer
12:30–13:30 Lunch
13:30–14:00 Marc Vandemeulebroecke: A longitudinal Item Response Theory model to characterize cognition over time in elderly subjects
14:00–14:30 William Barcella: Modeling correlated binary variables: an application to lower urinary tract symptoms
14:30–15:00 Marie-Karelle Riviere: Evaluation of the Fisher information matrix without linearization in nonlinear mixed effects models for discrete and continuous outcomes
15:00–15:30 Coffee break
15:30–16:00 Dan Simpson: TBD
16:00–16:30 Frederic Bois: Bayesian hierarchical modeling in pharmacology and toxicology / what we need next
16:30–17:00 Everyone: Discussion


Course: Bayesian Inference with Stan for Pharmacometrics

The three days preceding the workshop (19–21 September 2016), Michael Betancourt, Daniel Lee, and I will be teaching a course on Stan for Pharmacometrics. This, alas, is not free, but if you’re interested, registration details are here:

It’s going to be very hands-on and by the end you should be fitting hierarchical PK/PD models based on compartment differential equations.

P.S. As Andrew keeps pointing out, all proceeds (after overhead) go directly toward Stan development. It turns out to be very difficult to get funding to maintain software that people use, because most funding is directed at “novel” research rather than software development (and research funding means prototypes, not solid code). These courses help immensely to supplement our grant funding and let us continue to maintain Stan and its interfaces.

A day in the life

I like to post approx one item per day on this blog, so when multiple things come up in the same day, I worry about the sustainability of all this. I suppose I could up the posting rate to 2 a day but I think that could be too much of a burden on the readers.

So in this post I’ll just tell you everything I’ve been thinking about today, Thurs 14 Apr 2016.

Actually I’ll start with yesterday, when I posted an update to our Prior Choice Recommendations wiki. There had been a question on the Stan mailing list about priors for cutpoints in ordered logistic regression and this reminded me of a few things I wanted to add, not just on ordered regression but in various places in the wiki. This wiki is great and I’ll devote a full post to it sometime.

Also yesterday I edited a post on this sister blog. Posting there is a service to the political science profession, and it’s good to reach Washington Post readers, which is a different audience than what we have here. But it can also be exhausting, as I need to explain everything, whereas for you regular readers I can just speak directly.

This morning I taught my class on design and analysis of sample surveys. Today’s class was on Mister P. It led into a 20-minute discussion about the history and future of sample surveys. I don’t know much about the history of sample surveys. Why was there no Gallup Poll in 1890? How much random sampling was being done, anywhere, before 1930? I don’t know. After that, the class was all R/Stan demos and discussion. I had some difficulties. I took an old R script I had from last year’s class but it didn’t run. I’d deleted some of the data files—Census PUMS files I needed for the poststratification—so I needed to get them again.

After that I biked downtown to give a talk at Baruch College, where someone had asked me to speak. On the way down I heard this story, which the This American Life producers summarize as follows:

When Jonathan Goldstein was 11, his father gave him a book called Ultra-Psychonics: How to Work Miracles with the Limitless Power of Psycho-Atomic Energy. The book was like a grab bag of every occult, para-psychology, and self-help book popular at the time. It promised to teach you how to get rich, control other people’s minds, and levitate. Jonathan found the book in his apartment recently and decided to look into the magical claims the book made.

It turns out that the guy who wrote the book was just doing it to make money:

At the time, Schaumberger was living in New Jersey and making a decent wage as an editor at a publishing house that specialized in occult self help books with titles like “Secrets From Beyond The Pyramids” and “The Magic Of Chantomatics.” And he was astonished by the amount of money he saw writers making. . . .

Looking at it now, it seems obvious it was a lark. It almost reads like a parody of another famous science fiction slash self-help book with a lot of pseudoscience jargon that, for legal reasons, I will only say rhymes with diuretics.

Take, for instance, the astral spur. You were supposed to use it at the race track to give your horse extra energy, and it involved standing on one foot and projecting a psychic laser at your horse’s hindquarters.

Then there’s the section on ultra vision influence. The road to domination is explained this way– one, sit in front of a mirror and practice staring fixedly into your own eyes. Two, practice the look on animals. Cats are the best. See if you can stare down a cat. Don’t be surprised if the cat seems to win the first few rounds. Three, practice the look on strangers on various forms of public transport. Stare steadily at someone sitting opposite you until you force them to turn their head away or look down. You have just mastered your first human subject.

I’m listening to this and I’m thinking . . . power pose! It’s just like power pose. It could be true, it kinda sounds right, it involves discipline and focus.

One difference is that power pose has a “p less than .05” attached to it. But, as we’ve seen over and over again, “p less than .05” doesn’t mean very much.

The other difference is that, presumably, the power pose researchers are sincere, whereas this guy was just gleefully making it all up. And yet . . . there’s this, from his daughter:

Well, he was very familiar with all these things. The “Egyptian Book of the Dead” was a big one, because there was always this thing of, well, maybe if they had followed the formulas correctly, maybe something . . . He may have wanted to believe. It may be that in his private thoughts, there were some things in there that he believed in.

I think there may be something going on here, the idea that, even if you make it up, if you will it, you can make it true. If you just try hard enough. I wonder if the power-pose researchers and the ovulation-and-clothing researchers and all the rest, I wonder if they have a bit of this attitude, that if they just really really try, it will all become true.

And then there was more. I’ve had my problems with This American Life from time to time, but this one was a great episode. It had this cool story of a woman who was caring for her mother with dementia, and she (the caregiver) and her husband learned about how to “get inside the world” of the mother so that everything worked much more smoothly. I’m thinking I should try this approach when talking with students!

OK, so I got to my talk. It went ok, I guess. I wasn’t really revved up for it. But by the time it was over I was feeling good. I think I’m a good speaker but one thing that continues to bug me is that I rarely elicit many questions. (Search this blog for Brad Paley for more on this.)

After my talk, on the way back, another excellent This American Life episode, including a goofy/chilling story of how the FBI was hassling some US Taliban activist and trying to get him to commit crimes so they could nail him for terrorism. Really creepy: they seemed to want to create crimes where none existed, just so they could take credit for catching another terrorist.

Got home and started typing this up.

What else relevant happened recently? On Monday I spoke at a conference on “Bayesian, Fiducial, and Frequentist Inference.” My title was “Taking Bayesian inference seriously,” and this was my abstract:

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors also can resolve some of the questions of replication and multiple comparisons that have recently shaken the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

I don’t know if my talk quite lived up to this, but I have been thinking a lot about prior distributions, as was indicated at the top of this post. On the train ride to and from the conference (it was in New Jersey) I talked with Deborah Mayo. I don’t really remember anything we said—that’s what happens when I don’t take notes—but Mayo assured me she’d remember the important parts.

I also had an idea for a new paper, to be titled, “Backfire: How methods that attempt to avoid bias can destroy the validity and reliability of inferences.” OK, I guess I need a snappier title, but I think it’s an important point. Part of this material was in my talk, “‘Unbiasedness’: You keep using that word. I do not think it means what you think it means,” which I gave last year at Princeton—that was before Angus Deaton got mad at me, he was really nice during that visit and offered a lot of good comments, both during and after the talk—but I have some new material too. I want to work in the bit about the homeopathic treatments that have been so popular in social psychology.

Oh, also I received emails today from 2 different journals asking me to referee submitted papers, someone emailed me his book manuscript the other day, asking for comments, and a few other people emailed me articles they’d written.

I’m not complaining, nor am I trying to “busy-brag.” I love getting interesting things to read, and if I feel too busy I can just delete these messages. My only point is that there’s a lot going on, which is why it can be a challenge to limit myself to one blog post per day.

Finally, let me emphasize that I’m not saying there’s anything special about me. Or, to put it another way, sure, I’m special, and so are each of you. You too can do a Nicholson Baker and dissect every moment of your lives. That’s what blogging’s all about. God is in every leaf etc.

Hey pollsters! Poststratify on party ID, or we’re all gonna have to do it for you.

Alan Abramowitz writes:

In five days, Clinton’s lead increased from 5 points to 12 points. And Democratic party ID margin increased from 3 points to 10 points.

No, I don’t think millions of voters switched to the Democratic party. I think Democrats were just more likely to respond to that second poll. And, remember, survey response rates are around 10%, whereas presidential election turnout is around 60%, so it makes sense that we’d see big swings in differential nonresponse to polls that we would not expect to map into comparable swings in differential voting turnout.
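To see how much differential nonresponse alone can move a topline, here’s a minimal sketch (all numbers are hypothetical, not from Abramowitz’s polls): hold every voter’s preference fixed and change only the rate at which each party-ID group answers the phone.

```python
# Hypothetical electorate whose opinions never change; only response
# rates by party ID differ between the two "polls."

def poll_margin(dem_share, rep_share, ind_share,
                dem_resp, rep_resp, ind_resp,
                dem_clinton=0.90, rep_clinton=0.08, ind_clinton=0.50):
    """Clinton-minus-Trump margin among respondents, in percentage points."""
    # Effective sample composition after differential nonresponse
    weights = [dem_share * dem_resp, rep_share * rep_resp, ind_share * ind_resp]
    total = sum(weights)
    shares = [w / total for w in weights]
    clinton = (shares[0] * dem_clinton + shares[1] * rep_clinton
               + shares[2] * ind_clinton)
    trump = 1 - clinton  # two-candidate simplification
    return 100 * (clinton - trump)

# Week 1: all party-ID groups respond at the same 10% rate
m1 = poll_margin(0.33, 0.30, 0.37, 0.10, 0.10, 0.10)
# Week 2: Democrats respond at 12%, Republicans at 9%; nobody changed their mind
m2 = poll_margin(0.33, 0.30, 0.37, 0.12, 0.09, 0.10)
print(round(m1, 1), round(m2, 1))  # margin jumps several points with zero opinion change
```

The underlying electorate is identical in both weeks; the whole swing in the reported margin comes from who picks up the phone.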

We’ve been writing about this a lot recently. Remember this post, and this earlier graph from Abramowitz:


and this news article with David Rothschild, and this research article with Rothschild, Doug Rivers, and Sharad Goel, and this research article from 2001 with Cavan Reilly and Jonathan Katz? The cool kids know about this stuff.

I’m telling you this for free cos, hey, it’s part of my job as a university professor. (The job is divided into teaching, research, and service; this is service.) But I know that there are polling and news organizations that make money off this sort of thing. So, my advice to you: start poststratifying on party ID. It’ll give you a leg up on the competition.

That is, assuming your goal is to assess opinion and not just to manufacture news. If what you’re looking for is headlines, then by all means go with the raw poll numbers. They jump around like nobody’s business.

P.S. Two questions came up in discussion:

1. If this is such a good idea, why aren’t pollsters doing it already? Many answers here, including (a) some pollsters are doing it already, (b) other pollsters get benefit from headlines, and you get more headlines with noisy data, (c) survey sampling is a conservative field and many practitioners resist new ideas (just search this blog for “buggy whip” for more on that topic), and, most interestingly, (d) response rates keep going down, so differential nonresponse might be a bigger problem now than it used to be.

2. Suppose I want to poststratify on party ID. What numbers should I use? If you’re poststratifying on party ID, you don’t simply want to adjust to party registration data: party ID is a survey response, and party registration is something different. The simplest approach would be to take some smoothed estimate of the party ID distribution from many surveys: this won’t be perfect but it should be better than taking any particular poll, and much better than not poststratifying at all. To get more sophisticated, you could model the party ID distribution as a slowly varying time series as in our 2001 paper, but I doubt that’s really necessary here.
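As a concrete illustration of the adjustment, here’s a minimal sketch (all numbers hypothetical): take per-party support rates from one poll and reweight them from the poll’s realized party-ID mix to a smoothed target distribution.

```python
# Hypothetical numbers for illustration only.

def poststratify(cell_means, target_props):
    """Weighted average of per-cell estimates under the target cell proportions."""
    assert abs(sum(target_props.values()) - 1) < 1e-9
    return sum(cell_means[p] * target_props[p] for p in cell_means)

# Share supporting Clinton within each party-ID group, from one poll
cell_means = {"dem": 0.92, "rep": 0.07, "ind": 0.48}
# This poll's realized party-ID mix (Democrats over-represented this week)
sample_props = {"dem": 0.40, "rep": 0.26, "ind": 0.34}
# Smoothed party-ID distribution pooled across many recent polls
target_props = {"dem": 0.33, "rep": 0.30, "ind": 0.37}

raw = sum(cell_means[p] * sample_props[p] for p in cell_means)
adj = poststratify(cell_means, target_props)
print(round(raw, 3), round(adj, 3))  # adjusted topline is lower than the raw one
```

The raw number reflects whoever happened to respond; the adjusted number holds the party-ID mix fixed across polls, so week-to-week movement in it reflects opinion change rather than differential response.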

His varying slopes don’t seem to follow a normal distribution

Bruce Doré writes:

I have a question about multilevel modeling I’m hoping you can help with.

What should one do when random effects coefficients are clearly not normally distributed (i.e., coef(lmer(y~x+(x|id))) )? Is this a sign that the model should be changed? Or can you stick with this model and infer that the assumption of normally distributed coefficients is incorrect?

I’m seeing strongly leptokurtic random slopes in a context where I have substantive interest in the shape of this distribution. That is, it would be useful to know if there are more individuals with “extreme” and fewer with “moderate” slopes than you’d expect of a normal distribution.

My reply: You can fit a mixture model, or, even better, you can have a group-level predictor that breaks up your data appropriately. To put it another way: What are your groups? And which are the groups that have low slopes and which have high slopes? Or which have slopes near the middle of the distribution and which have extreme slopes? You could fit a mixture model where the variance varies, but I think you’d be better off with a model using group-level predictors. Also, I recommend using Stan, which is more flexible than lmer and gives you the full posterior distribution.
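As a quick illustration of the symptom (simulated data, not Doré’s): slopes drawn from a heavy-tailed distribution show large positive excess kurtosis, while normal draws sit near zero. In Stan, one natural variant along these lines is to swap the normal varying-slope distribution for a Student-t and see whether the data favor a low degrees-of-freedom value.

```python
# Simulated example: excess kurtosis of normal vs. heavy-tailed "slopes."
import random

random.seed(1)

def excess_kurtosis(xs):
    """Sample excess kurtosis; approximately zero for normal data."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / s2 ** 2 - 3

n = 20000
normal_slopes = [random.gauss(0, 1) for _ in range(n)]

# Student-t(5) draws built as a normal scale mixture: z / sqrt(chi2_df / df)
df = 5
t_slopes = [
    random.gauss(0, 1)
    / (sum(random.gauss(0, 1) ** 2 for _ in range(df)) / df) ** 0.5
    for _ in range(n)
]

print(round(excess_kurtosis(normal_slopes), 2))  # near 0
print(round(excess_kurtosis(t_slopes), 2))       # clearly positive: heavy tails
```

The scale-mixture construction is also how one would think about it in a model: “leptokurtic slopes” is what you see when the slope variance itself varies across people, which is why group-level predictors that explain that variation are the more substantively informative fix.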

Doré then added:

My groups are different people reporting life satisfaction annually surrounding a stressful life event (divorce, bereavement, job loss). I take it that the kurtosis is a clue that there are unobserved person-level factors driving this slope variability? With my current data I don’t have any person-level predictors that could explain this variability, but certainly it would be good to try to find some.