
We’re hiring!

[insert picture of adorable cat entwined with Stan logo]

We’re hiring postdocs to do Bayesian inference.

We’re hiring programmers for Stan.

We’re hiring a project manager.

How many people we hire depends on what gets funded. But we’re hiring a few people for sure.

We want the best people who love to collaborate, who love to program, who love statistical modeling, who love to learn, who care about getting things right and are happy to admit their mistakes.

More details to be posted soon.

Powerpose update


I contacted Anna Dreber, one of the authors of the paper that failed to replicate power pose, and asked her about a particular question that came up regarding their replication study. One of the authors of the original power pose study wrote that the replication “varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable.” As commenter Phil put it, “It does seem kind of ridiculous to have people hold any pose other than ‘lounging on the couch’ for six minutes.”

In response, Dreber wrote:

We discuss this in the paper and this is what we say in the supplementary material:

A referee also pointed out that the prolonged posing time could cause participants to be uncomfortable, and this may counteract the effect of power posing. We therefore reanalyzed our data using responses to a post-experiment questionnaire completed by 159 participants. The questionnaire asked participants to rate the degree of comfort they experienced while holding the positions on a four-point scale from “not at all” (1) to “very” (4) comfortable. The responses tended toward the middle of the scale and did not differ by High- or Low-power condition (average responses were 2.38 for the participants in the Low-power condition and 2.35 for the participants in the High-power condition; mean difference = -0.025, CI(-0.272, 0.221); t(159) = -0.204, p = 0.839; Cohen’s d = -0.032). We reran our main analysis, excluding those participants who were “not at all” comfortable (1) and also excluding those who were “not at all” (1) or “somewhat” comfortable (2). Neither sample restriction changes the results in a substantive way. (Excluding participants who reported a score of 1 gives Risk (Gain): mean difference = -0.033, CI(-0.100, 0.034); t(136) = -0.973, p = 0.333; Cohen’s d = -0.166; Testosterone Change: mean difference = -4.728, CI(-11.229, 1.773); t(134) = -1.438, p = 0.153; Cohen’s d = -0.247; Cortisol: mean difference = -0.024, CI(-0.088, 0.040); t(134) = -0.737, p = 0.463; Cohen’s d = -0.126. Excluding participants who reported a score of 1 or 2 gives Risk (Gain): mean difference = -0.105, CI(-0.332, 0.122); t(68) = -0.922, p = 0.360; Cohen’s d = -0.222; Testosterone Change: mean difference = -5.503, CI(-16.536, 5.530); t(66) = -0.996, p = 0.323; Cohen’s d = -0.243; Cortisol: mean difference = -0.045, CI(-0.144, 0.053); t(66) = -0.921, p = 0.360; Cohen’s d = -0.225.) Thus, including only those participants who report having been “quite comfortable” (3) or “very comfortable” (4) does not change our results.

Also, each of the two positions was held for 3 min each (so not one for 6 min).

So, yes, the two studies differed, but there’s no particular reason to believe that the 1-minute intervention would have a larger effect than the 3-minute intervention. Indeed, we’d typically think a longer treatment would have a larger effect.

Again, remember the time-reversal heuristic: Ranehill et al. did a large controlled study and found no effect of pose on hormones. Carney et al. did a small uncontrolled study and found a statistically significant comparison. This is not evidence in favor of the hypothesis that Carney et al. found something real; rather, it’s evidence consistent with zero effects.

Dreber added:

In our study, we actually wanted to see whether power posing “worked” – we thought that if we find effects, we can figure out some other fun studies related to this, so in that sense we were not out “to get” Carney et al. That is, we did not do any modifications in the setup that we thought would kill the original result.

Indeed, lots of people seem to miss this point, that if you really care about a topic, you’d want to replicate it and remove all doubt. When a researcher expresses the idea that replication, data sharing, etc., is some sort of attack, I think that betrays an attitude or a fear that the underlying effect really isn’t there. If it were there, you’d want to see it replicated over and over. A strong anvil need not fear the hammer. And it’s the insecure researchers who feel the need for bravado such as “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

P.S. I wrote the above post close to a year ago, well before the recent fuss over replication trolls or whatever it was that we were called. In the meantime, Tom Bartlett wrote a long news article about the whole power pose story, so you can go there for background if all this stuff is new to you.

To know the past, one must first know the future: The relevance of decision-based thinking to statistical analysis

We can break up any statistical problem into three steps:

1. Design and data collection.

2. Data analysis.

3. Decision making.

It’s well known that step 1 typically requires some thought of steps 2 and 3: It is only when you have a sense of what you will do with your data, that you can make decisions about where, when, and how accurately to take your measurements. In a survey, the plans for future data analysis influence which background variables to measure in the sample, whether to stratify or cluster; in an experiment, what pre-treatment measurements to take, whether to use blocking or multilevel treatment assignment; and so on.

The relevance of step 3 to step 2 is perhaps not so well understood. It came up in a recent thread following a comment by Nick Menzies. In many statistics textbooks (including my own), the steps of data analysis and decision making are kept separate: we first discuss how to analyze the data, with the general goal being the production of some (probabilistic) inferences that can be piped into any decision analysis.

But your decision plans may very well influence your analysis. Here are two ways this can happen:

– Precision. If you know ahead of time you only need to estimate a parameter to within an uncertainty of 0.1 (on some scale), say, and you have a simple analysis method that will give you this precision, you can just go simple and stop. This sort of thing occurs all the time.
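A minimal sketch of the precision point, with made-up numbers: if the relevant standard deviation is roughly known in advance, the sample size needed for a target standard error falls straight out of se = sd/sqrt(n), so you can check up front whether a simple design will be precise enough.

```python
import math

def n_for_precision(sd, target_se):
    """Smallest n such that sd / sqrt(n) <= target_se.

    Rearranging se = sd / sqrt(n) gives n >= (sd / target_se) ** 2.
    """
    return math.ceil((sd / target_se) ** 2)

# If the outcome's sd is about 1 and we only need the estimate to
# within a standard error of 0.1, a simple design with n = 100 suffices:
print(n_for_precision(1.0, 0.1))  # 100
print(n_for_precision(3.0, 0.5))  # 36
```

If the simple design clears the precision bar, you can stop there; no need for a more elaborate analysis.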

– Relevance. If you know that a particular variable is relevant to your decision making, you should not sweep it aside, even if it is not statistically significant (or, to put it Bayesianly, even if you cannot express much certainty in the sign of its coefficient). For example, the problem that motivated our meta-analysis of effects of survey incentives was a decision of whether to give incentives to respondents in a survey we were conducting, the dollar value of any such incentive, and whether to give the incentive before or after the survey interview. It was important to keep all these variables in the model, even if their coefficients were not statistically significant, because the whole purpose of our study was to estimate these parameters. This is not to say that one should use simple least squares: another impact of the anticipated decision analysis is to suggest parts of the analysis where regularization and prior information will be particularly crucial.

Conversely, a variable that is not relevant to decisions could be excluded from the analysis (possibly for reasons of cost, convenience, or stability), in which case you’d interpret inferences as implicitly averaging over some distribution of that variable.
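A toy sketch of the regularization point above, with simulated data (the numbers are made up for illustration): rather than dropping a decision-relevant predictor because it is not statistically significant, keep it and shrink it. For a single approximately centered predictor, ridge regression has the closed form beta = Σxy / (Σx² + λ), which pulls a noisy estimate toward zero without zeroing it out.

```python
import random

random.seed(0)

def ridge_slope(x, y, lam=0.0):
    # One-predictor ridge estimate for (approximately) centered data:
    # beta = sum(x*y) / (sum(x^2) + lam).  lam = 0 recovers least squares.
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

# Simulated small, noisy effect: true slope 0.2, n = 50.
n, true_slope = 50, 0.2
x = [random.gauss(0, 1) for _ in range(n)]
y = [true_slope * xi + random.gauss(0, 1) for xi in x]

ols = ridge_slope(x, y)               # noisy; likely "not significant" at this n
shrunk = ridge_slope(x, y, lam=25.0)  # shrunken toward zero but retained
```

The shrunken estimate is what you would carry into the decision analysis: a partially pooled answer rather than the all-or-nothing choice between the raw least-squares estimate and zero.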

Frank Harrell statistics blog!

Frank Harrell, author of an influential book on regression modeling and currently both a biostatistics professor and a statistician at the Food and Drug Administration, has started a blog. He sums up “some of his personal philosophy of statistics” here:

Statistics needs to be fully integrated into research; experimental design is all important

Don’t be afraid of using modern methods

Preserve all the information in the data; Avoid categorizing continuous variables and predicted values at all costs

Don’t assume that anything operates linearly

Account for model uncertainty and avoid it when possible by using subject matter knowledge

Use the bootstrap routinely

Make the sample size a random variable when possible

Use Bayesian methods whenever possible

Use excellent graphics, liberally

To be trustworthy research must be reproducible

All data manipulation and statistical analysis must be reproducible (one ramification being that I advise against the use of point and click software in most cases)

Harrell continues:

Statistics has multiple challenges today, which I [Harrell] break down into three major sources:

1. Statistics has been and continues to be taught in a traditional way, leading to statisticians believing that our historical approach to estimation, prediction, and inference was good enough.

2. Statisticians do not receive sufficient training in computer science and computational methods, too often leaving those areas to others who get so good at dealing with vast quantities of data that they assume they can be self-sufficient in statistical analysis and not seek involvement of statisticians. Many persons who analyze data do not have sufficient training in statistics.

3. Subject matter experts (e.g., clinical researchers and epidemiologists) try to avoid statistical complexity by “dumbing down” the problem using dichotomization, and statisticians, always trying to be helpful, fail to argue the case that dichotomization of continuous or ordinal variables is almost never an appropriate way to view or analyze data. Statisticians in general do not sufficiently involve themselves in measurement issues.

No evidence of incumbency disadvantage?


Several years ago I learned that the incumbency advantage in India was negative! There, the politicians are so unpopular that when they run for reelection they’re actually at a disadvantage, on average, compared to fresh candidates.

At least, that’s what I heard.

But Andy Hall and Anthony Fowler just wrote a paper claiming that, no, there’s no evidence for negative incumbency advantages anywhere. Hall writes,

We suspect the existing evidence is the result of journals’ preference for “surprising” results. Since positive incumbency effects have been known for a long time, you can’t publish “just another incumbency advantage” paper anymore, but finding a counterintuitive disadvantage seems more exciting.

And here’s how their paper begins:

Scholars have long studied incumbency advantages in the United States and other advanced democracies, but a recent spate of empirical studies claims to have identified incumbency disadvantages in other, sometimes less developed, democracies including Brazil, Colombia, India, Japan, Mexico, and Romania. . . . we reassess the existing evidence and conclude that there is little compelling evidence of incumbency disadvantage in any context so far studied. Some of the incumbency disadvantage results in the literature arise from unusual specifications and are not statistically robust. Others identify interesting phenomena that are conceptually distinct from what most scholars would think of as incumbency advantage/disadvantage. For example, some incumbency disadvantage results come from settings where incumbents are not allowed to run for reelection. . . .

Interesting. I’ve not looked at their paper in detail but one thing I noticed is that a lot of these cited papers seem to have been estimating the incumbent party advantage, which doesn’t seem to me to be the same as the incumbency advantage as it’s usually understood. This discontinuity thing seems like a classic example of looking for the keys under the lamppost. I discussed the problems with that approach several years ago in this 2005 post, which I never bothered to write up as a formal article. Given that these estimates are still floating around, I kinda wish I had.

Stan JSS paper out: “Stan: A probabilistic programming language”

As a surprise welcome to 2017, our paper on how the Stan language works along with an overview of how the MCMC and optimization algorithms work hit the stands this week.

  • Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software 76(1).

The authors are the developers at the time the first revision was submitted. We now have quite a few more developers. Because of that, we’d still prefer that people cite the manual, which is authored by the development team collectively, rather than this paper, which names only some of our current developers.

The original motivation for writing a paper was that Wikipedia rejected our attempts at posting a Stan Wikipedia page without a proper citation.

I’d like to thank Achim Zeileis at JSS for his patience and help during the final wrap up.


Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.14.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can be called from the command line using the cmdstan package, through R using the rstan package, and through Python using the pystan package. All three interfaces support sampling and optimization-based inference with diagnostics and posterior analysis. rstan and pystan also provide access to log probabilities, gradients, Hessians, parameter transforms, and specialized plotting.


@article{carpenter2017stan,
   author = {Bob Carpenter and Andrew Gelman and Matthew Hoffman
             and Daniel Lee and Ben Goodrich and Michael Betancourt
             and Marcus Brubaker and Jiqiang Guo and Peter Li
             and Allen Riddell},
   title = {Stan: {A} Probabilistic Programming Language},
   journal = {Journal of Statistical Software},
   volume = {76},
   number = {1},
   year = {2017}
}

Further reading

Check out the Papers about Stan section of the Stan Citations web page. There’s more info on our autodiff and on how variational inference works and a link to the original NUTS paper. And of course, don’t miss Michael’s latest if you want to understand HMC and NUTS, A conceptual introduction to HMC.

Problems with “incremental validity” or more generally in interpreting more than one regression coefficient at a time

Kevin Lewis points us to this interesting paper by Jacob Westfall and Tal Yarkoni entitled, “Statistically Controlling for Confounding Constructs Is Harder than You Think.” Westfall and Yarkoni write:

A common goal of statistical analysis in the social sciences is to draw inferences about the relative contributions of different variables to some outcome variable. When regressing academic performance, political affiliation, or vocabulary growth on other variables, researchers often wish to determine which variables matter to the prediction and which do not—typically by considering whether each variable’s contribution remains statistically significant after statistically controlling for other predictors. When a predictor variable in a multiple regression has a coefficient that differs significantly from zero, researchers typically conclude that the variable makes a “unique” contribution to the outcome. . . .

Incremental validity claims pervade the social and biomedical sciences. In some fields, these claims are often explicit. To take the present authors’ own field of psychology as an example, a Google Scholar search for the terms “incremental validity” AND psychology returns (in January 2016) over 18,000 hits—nearly 500 of which contained the phrase “incremental validity” in the title alone. More commonly, however, incremental validity claims are implicit—as when researchers claim that they have statistically “controlled” or “adjusted” for putative confounds—a practice that is exceedingly common in fields ranging from epidemiology to econometrics to behavioral neuroscience (a Google Scholar search for “after controlling for” and “after adjusting for” produces over 300,000 hits in each case). The sheer ubiquity of such appeals might well give one the impression that such claims are unobjectionable, and if anything, represent a foundational tool for drawing meaningful scientific inferences.

Wow—what an excellent start! They’re right. We see this reasoning so often. Yes, it is generally not appropriate to interpret regression coefficients this way—see, for example, “Do not control for post-treatment variables,” section 9.7 of my book with Jennifer—and things get even worse when you throw statistical significance into the mix. But researchers use this fallacious reasoning because it fulfills a need, or a perceived need, which is to disentangle their causal stories.

Westfall and Yarkoni continue:

Unfortunately, incremental validity claims can be deeply problematic. As we demonstrate below, even small amounts of error in measured predictor variables can result in extremely poorly calibrated Type 1 error probabilities.

Ummmm, I don’t like that whole Type 1 error thing. It’s the usual story: I don’t think there are zero effects, so I think it’s just a mistake overall to be saying that some predictors matter and some don’t.

That said, for people who are working in that framework, I think Westfall and Yarkoni have an important message. They say in mathematics, and with several examples, what Jennifer and I alluded to, which is that even if you control for pre-treatment variables, you have to worry about latent variables you haven’t controlled for. As they put it, there can (and will) be “residual confounding.”
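Westfall and Yarkoni’s mechanism is easy to reproduce in simulation (the setup here is made up for illustration): let a single latent construct drive the outcome, observe two noisy measurements of it, and regress the outcome on both. The second measurement picks up a “unique” contribution even though it measures nothing new; that is residual confounding from measurement error.

```python
import math
import random

random.seed(1)

def simulate(n=2000, reliability=0.7):
    # One latent construct u drives y; x and z are two noisy
    # measurements of the *same* construct, each with the given reliability.
    lam = math.sqrt(reliability)
    err = math.sqrt(1.0 - reliability)
    u = [random.gauss(0, 1) for _ in range(n)]
    x = [lam * ui + err * random.gauss(0, 1) for ui in u]
    z = [lam * ui + err * random.gauss(0, 1) for ui in u]
    y = [0.5 * ui + random.gauss(0, 1) for ui in u]
    return x, z, y

def ols_two_predictors(x, z, y):
    # Solve the two-predictor normal equations directly
    # (all variables are mean zero by construction, so no intercept).
    sxx = sum(a * a for a in x)
    szz = sum(a * a for a in z)
    sxz = sum(a * b for a, b in zip(x, z))
    sxy = sum(a * b for a, b in zip(x, y))
    szy = sum(a * b for a, b in zip(z, y))
    det = sxx * szz - sxz * sxz
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

x, z, y = simulate()
b_x, b_z = ols_two_predictors(x, z, y)
# Even "controlling for" x, the coefficient on z stays well away from zero
# (around 0.25 here), though z carries no information beyond the construct.
```

Had x been the noise-free construct itself, the z coefficient would be zero in expectation; it is the measurement error in x that leaves construct variance for z to soak up.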

So I’ll quote them one more time:

The traditional approach of using multiple regression to support incremental validity claims is associated with extremely high false positive rates under realistic parameter regimes.


They also say, “the problem has a principled solution: inferences about the validity of latent constructs should be supported by latent-variable statistical approaches that can explicitly model measurement unreliability,” which seems reasonable enough. That said, I can’t go along with their recommendation that researchers “adopt statistical approaches like SEM”—that seems to often just make things worse! I say Yes to latent variable models but No to approaches which are designed to tease out things that just can’t be teased (as in the “affective priming” example discussed here).

I am sympathetic to Westfall and Yarkoni’s goal of providing solutions, not just criticism—but in this case I think the solutions are further away than they seem to believe, and that part of the solution will be to abandon some of researchers’ traditional goals.

“A Conceptual Introduction to Hamiltonian Monte Carlo”

Michael Betancourt writes:

Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice. Unfortunately, that understanding is confined within the mathematics of differential geometry which has limited its dissemination, especially to the applied communities for which it is particularly important.

In this review I [Betancourt] provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather than any exhaustive rigor. Whether a practitioner or a statistician, the dedicated reader will acquire a solid grasp of how Hamiltonian Monte Carlo works, when it succeeds, and, perhaps most importantly, when it fails.

This is great stuff. He has 38 figures! Read the whole thing.

I wish Mike’s paper had existed 25 years ago, as it contains more sophisticated and useful versions of various intuitions that my colleagues and I had to work so hard to develop when working on .234.

A small, underpowered treasure trove?


Benjamin Kirkup writes:

As you sometimes comment on such things; I’m forwarding you a journal editorial (in a society journal) that presents “lessons learned” from an associated research study.

What caught my attention was the comment on the “notorious” design, the lack of “significant” results, and the “interesting data on nonsignificant associations.” Apparently, the work “does not serve to inform the regulatory decision-making process with respect to antimicrobial compounds” but is “still valuable and can be informative.”

Given the commissioning of a lessons-learned, how do you think the scientific publishing community should handle manuscripts presenting work with problematic designs and naturally uninformative outcomes?

The editorial in question is called Lessons Learned from Probing for Impacts of Triclosan and Triclocarban on Human Microbiomes, it is by Rolf Halden, and it appeared in a journal of the American Society for Microbiology.

I do find the whole story puzzling: Halden describes the study as small and underpowered, while also “presenting a treasure trove of information.” The editorial reads almost like a political effort, designed to make everyone happy. That said, I don’t know jack about the effects of triclosan and triclocarban on human biology, so maybe this all makes sense in context.

The “underpowered treasure trove” thing reminds me a bit of when food researcher and business school professor Brian Wansink told the story of a “failed study which had null results” (in his words) which at the same time was “a cool (rich & unique) data set” that resulted in four completely independent published papers. Failed yet wonderful.

The Prior: Fully comprehended last, put first, checked the least?

Priors are important in Bayesian inference.

Some would even say: “In Bayesian inference you can—OK, you must—assign a prior distribution representing the set of values the coefficient [i.e., any unknown parameter] can be.”

Although priors are put first in most expositions, my sense is that in most applications they are seldom considered first, are checked the least and actually fully comprehended last (or perhaps not fully at all).

It reminds me of the comical response of someone when asked for difficult directions: “If I wanted to go there, I wouldn’t start out from here.”

Perhaps this is less comical: “If I am going to be doing a Bayesian analysis, I do not want to be responsible for getting and checking the prior. Maybe the domain expert should do that, or maybe I should just accept the default priors I find in the examples sections of the software manual.”

In this post, I thought I would recall experiences in building judgment-based predictive indexes, where the prior (or something like it) is perhaps more naturally comprehended first, checked the most, and settled on last. Here there are no distractions from the data model or posterior, as there usually isn’t any data, nor is any data anticipated soon, so it’s just the prior.

Maybe not at the time, but certainly now, I would view this as a very sensible way to generate a credible informative Bayesian prior, one that involved intensive testing of the prior before it was finally accepted. Below, I recount one particular example of this that I was involved in about 25 years ago, as a prelude to investigating in later posts what might be a profitable (to a scientific community) means of specifying priors today.

Continue reading ‘The Prior: Fully comprehended last, put first, checked the least?’ »

StanCon 2017 Schedule

The first Stan Conference is next Saturday, January 21, 2017!

If you haven’t registered, here’s the link:

I wouldn’t wait until the last minute—we might sell out before you’re able to grab a ticket. We’re up to 125 registrants now. If we have any space left, tickets will be $400 at the door.

Schedule. January 21, 2017.

Time What
7:30 AM – 8:45 AM Registration and breakfast
8:45 AM – 9:00 AM Opening statements
9:00 AM – 10:00 AM Dev talk:
Andrew Gelman:
“10 Things I Hate About Stan”
10:00 AM – 10:30 AM Coffee
10:30 AM – 12:00 PM Contributed talks:

  1. Jonathan Auerbach, Rob Trangucci:
    “Twelve Cities: Does lowering speed limits save pedestrian lives?”
  2. Milad Kharratzadeh:
    “Hierarchical Bayesian Modeling of the English Premier League”
  3. Victor Lei, Nathan Sanders, Abigail Dawson:
    “Advertising Attribution Modeling in the Movie Industry”
  4. Woo-Young Ahn, Nate Haines, Lei Zhang:
    “hBayesDM: Hierarchical Bayesian modeling of decision-making tasks”
  5. Charles Margossian, Bill Gillespie:
    “Differential Equation Based Models in Stan”
12:00 PM – 1:15 PM Lunch
1:15 PM – 2:15 PM Dev talk:
Michael Betancourt:
“Everything You Should Have Learned About Markov Chain Monte Carlo”
2:15 PM – 2:30 PM Stretch break
2:30 PM – 3:45 PM Contributed talks:

  1. Teddy Groves:
    “How to Test IRT Models Using Simulated Data”
  2. Bruno Nicenboim, Shravan Vasishth:
    “Models of Retrieval in Sentence Comprehension”
  3. Rob Trangucci:
    “Hierarchical Gaussian Processes in Stan”
  4. Nathan Sanders, Victor Lei:
    “Modeling the Rate of Public Mass Shootings with Gaussian Processes”
3:45 PM – 4:45 PM Mingling and coffee
4:45 PM – 5:45 PM Q&A Panel
5:45 PM – 6:00 PM Closing remarks:
Bob Carpenter:
“Where is Stan Going Next?”

If you can’t tell, it’s going to be a packed day.


We couldn’t have done this without support from our sponsors. Seriously.

Bonus: All of our sponsors are using Stan!

When do stories work, Process tracing, and Connections between qualitative and quantitative research


Jonathan Stray writes:

I read your “when do stories work” paper (with Thomas Basbøll) with interest—as a journalist stories are of course central to my field. I wondered if you had encountered the “process tracing” literature in political science? It attempts to make sense of stories as “case studies” and there’s a nice logic of selection and falsification that has grown up around this.

This article by David Collier is a good overview of process tracing, with a neat typology of story-based theory tests.

Besides being a good paper generally, section 6 of this paper by James Mahoney and Gary Goertz discusses why you want non-random case/story selection in certain types of qualitative research.

This paper by Jack Levy is another typology of the types and uses of case studies/stories.

I had not heard about process tracing, and I’ll have to take a look at these papers. I’m very interested in the connections between quantitative and qualitative research. Indeed, one of my themes when criticizing recent research boondoggles such as power pose and himmicanes has been the weakness of the connections between the qualitative and quantitative aspects of the work. And recently I got a taste of this criticism myself when I was presenting some of our findings regarding social penumbras: a psychologist in the audience pointed out that one reason our results were so weak was because there was only a very weak link between qualitative theories of changes in political attitudes, and the particular quantitative measures we were using. In short, I was doing what I often criticize in others, which was to gather data using a crude measuring instrument and then just hope for some results. We did find some things—I still think the penumbra work has been a successful research project—but we could’ve done much better, I’m sure, had we better tied qualitative to quantitative ideas.

R packages interfacing with Stan: brms

Over on the Stan users mailing list I (Jonah) recently posted about our new document providing guidelines for developing R packages interfacing with Stan. As I say in the post and guidelines, we (the Stan team) are excited to see the emergence of some very cool packages developed by our users. One of these packages is Paul Bürkner’s brms. Paul is currently working on his PhD in statistics at the University of Münster, having previously studied psychology and mathematics at the universities of Münster and Hagen (Germany). Here is Paul writing about brms:

The R package brms implements a wide variety of Bayesian regression models using extended lme4 formula syntax and Stan for the model fitting. It has been on CRAN for about one and a half years now and has grown to be probably one of the most flexible R packages when it comes to regression models.

A wide range of distributions are supported, allowing users to fit — among others — linear, robust linear, count data, response time, survival, ordinal, and zero-inflated models. You can incorporate multilevel structures, smooth terms, autocorrelation, as well as measurement error in predictor variables to mention only a few key features. Furthermore, non-linear predictor terms can be specified similar to how it is done in the nlme package and on top of that all parameters of the response distribution can be predicted at the same time.

After model fitting, you have many post-processing and plotting methods to choose from. For instance, you can investigate and compare model fit using leave-one-out cross-validation and posterior predictive checks or predict responses for new data.

If you are interested and want to learn more about brms, please use the following links:

  • GitHub repository (for source code, bug reports, feature requests)
  • CRAN website (for vignettes with guidance on how to use the package)
  • Wayne Folta’s blog posts (for interesting brms examples)

Also, a paper about brms will be published soon in the Journal of Statistical Software.

My thanks goes to the Stan Development Team for creating Stan, which is probably the most powerful and flexible tool for performing Bayesian inference, and for allowing me to introduce brms here at this blog.

I’ve said it before and I’ll say it again

Ryan Giordano, Tamara Broderick, and Michael Jordan write:

In Bayesian analysis, the posterior follows from the data and a choice of a prior and a likelihood. One hopes that the posterior is robust to reasonable variation in the choice of prior, since this choice is made by the modeler and is often somewhat subjective. A different, equally subjectively plausible choice of prior may result in a substantially different posterior, and so different conclusions drawn from the data. . . .

To which I say:

,s/choice of prior/choice of prior and data model/g

Yes, the choice of data model (from which comes the likelihood) is made by the modeler and is often somewhat subjective. In those cases where the data model is not chosen subjectively by the modeler, it is typically chosen implicitly by convention, and there is even more reason to be concerned about robustness.

Problems with randomized controlled trials (or any bounded statistical analysis) and thinking more seriously about story time

In 2010, I wrote:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.” At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

Randomized controlled trials (RCTs) have well-known problems with realism or validity (a problem that researchers try to fix using field experiments, but it’s not always possible to have a realistic field experiment either), and cost/ethics/feasibility (which pushes researchers toward smaller experiments in more artificial settings, which in turn can lead to statistical problems).

Beyond these, there is the indirect problem that RCTs are often overrated—researchers prize the internal validity of the RCT so much that they forget about problems of external validity and problems with statistical inference. We see that all the time: randomization doesn’t protect you from the garden of forking paths, but researchers, reviewers, publicists and journalists often act as if it does. I still remember a talk by a prominent economist several years ago who was using a crude estimation strategy—but, it was an RCT, so the economist expressed zero interest in using pre-test measures or any other approaches to variance reduction. There was a lack of understanding that there’s more to inference than unbiasedness.

From a different direction, James Heckman has criticized RCTs on the grounds that they can be, and often are, performed in a black-box manner without connection to substantive theory. And, indeed, it was black-box causal inference that was taught to me as a statistics student many years ago, and I think the fields of statistics, economics, political science, psychology, and medicine are still clouded by this idea that causal research is fundamentally unrelated to substantive theory.

In defense, proponents of randomized experiments have argued persuasively that all the problems with randomized experiments—validity, cost, etc.—arise just as much in observational studies. As Alan Gerber and Don Green put it, deciding to unilaterally disable your identification strategy does not magically connect your research to theory or remove selection bias. From this perspective, even when we are not able to perform RCTs, the ideal randomized experiment remains a useful conceptual baseline for observational analysis.

Christopher Hennessy writes in with another criticism of RCTs:

In recent work, I set up parable economies illustrating that in dynamic settings measured treatment responses depart drastically and systematically from theory-implied causal effects (comparative statics) both in terms of magnitudes and signs. However, biases can be remedied and results extrapolated if randomisation advocates were to take the next step and actually estimate the underlying shock processes. That is, old-school time-series estimation is still needed if one is to make economic sense of measured treatment responses. In another line of work, I show that the econometric problems become more pernicious if the results of randomisation will inform future policy setting, as is the goal of many in Cambridge, for example. Even if an economic agent is measure zero, if he views that a randomisation is policy relevant, his behavior under observation will change since he understands the future distribution of the policy variable will also change. Essentially, if one is doing policy-relevant work, there is endogeneity bias after the fact. Or in other words, policy-relevance undermines credibility.

Rather than deal with these problems formally, there has been a tendency amongst a proper subset of empiricists to stifle their impact by lumping them into an amorphous set of “issues.” I think the field will make faster progress if we were to handle these issues with the same degree of formal rigor with which the profession deals with, say, standard errors. We should not let the good be the enemy of the best. A good place to start is to write down simple dynamic economic models that actually speak to the data generating processes being exploited. Absent such a mapping, reported econometric estimates are akin to a corporation reporting the absolute value of profits without reporting the currency or the sign. What does one learn from such a report? And how can such a report be useful in doing cost-benefit analyses on government policies? We have a long way to go. Premature claims of credibility only serve to delay confronting the issues formally and making progress.

Here are the abstracts of two of Hennessy’s papers:

Double-blind RCTs are viewed as the gold standard in eliminating placebo effects and identifying non-placebo physiological effects. Expectancy theory posits that subjects have better present health in response to better expected future health. We show that if subjects Bayesian update about efficacy based upon physiological responses during a single-stage RCT, expected placebo effects are generally unequal across treatment and control groups. Thus, the difference between mean health across treatment and control groups is a biased estimator of the mean non-placebo physiological effect. RCTs featuring low treatment probabilities are robust: Bias approaches zero as the treated group measure approaches zero.

Evidence from randomization is contaminated by ex post endogeneity if it is used to set policy endogenously in the future. Measured effects depend on objective functions into which experimental evidence is fed and prior beliefs over the distribution of parameters to be estimated. Endowed heterogeneous effects generates endogenous belief heterogeneity making it difficult/impossible to recover causal effects. Observer effects arise even if agents are measure zero, having no incentive to change behavior to influence outcomes.

As with the earlier criticisms, the implication is not that observational studies are OK, but rather that real-world complexity (in this case, dynamics of individual beliefs and decision making) should be included in a policy analysis, even if an RCT is part of the story. Don’t expect the (real) virtues of a randomized trial to extend to the interpretation of the results.

To put it another way, Hennessy is arguing that we should be able to think more rigorously, not just about a localized causal inference, but also about what is traditionally part of story time.

Time Inc. stoops to the level of the American Society of Human Genetics and PPNAS?

Does anyone out there know anyone at Time Inc.? If so, I have a question for you. But first the story:

Mark Palko linked to an item from Barry Petchesky pointing out this article at the online site of Sports Illustrated Magazine.

Here’s Petchesky:

Over at Sports Illustrated, you can read an article about Tom Brady’s new line of sleepwear for A Company That Makes Stretchy Workout Stuff. The article contains the following lines:

“The TB12 Sleepwear line includes full-length shirts and pants—and a short-sleeve and shorts version—with bioceramics printed on the inside.”

“The print, sourced from natural minerals, activates the body’s natural heat and reflects it back as far infrared energy…”

“The line, available in both men’s [link to store for purchase] and women’s [link to store for purchase] sizes, costs between $80 to $100 [link to store for purchase].”

“[A Company That Makes Stretchy Workout Stuff]’s bioceramic-printed sleepwear uses far infrared energy to promote recovery…”

(There are quotes in the article, mostly from people with financial stakes in you buying these products. An actual sleep expert is quoted. He does not endorse or even reference the products discussed in this article, nor the science behind said products. His contribution to this article can be summed up as saying sleep is important.)

This is an advertisement, in every aspect save the one where money changed hands in exchange for its publication. (We think. This would honestly be a lot less embarrassing for SI to run if it were sponsored content and they just forgot to label it as such.) These sorts of advertisements, where certain types of reporters eagerly type up press releases because it’s quick and easy, are everywhere.

It seemed clear to me when clicking through to the link that the article was sponsored content. But I could not find any such label.

The stretchy-underwear story is a bit of a joke, but elsewhere the magazine gets into what one might call Dr. Oz territory, as in this article hyping a brand-name “neuroscience” sports headset, with several quotes from the CEO of the company and a satisfied user and no quotes from competitors or skeptics.

This is the kind of one-sided story I’d expect to see coming from the American Society of Human Genetics or PPNAS, but it’s a bit disappointing to see it in a respected publication such as Sports Illustrated.

So here’s my question, which perhaps one of you can forward to a friend at Time Inc:

Is this what Sports Illustrated is all about now? I mean, sure, I’m not expecting crusading journalism every week. Sports is entertainment and as a sports fan I have no problem with the sports media promoting big-time sports. It’s symbiotic and that’s fine: the sports media needs sporting events to cover, and sports organizations want media coverage so that people will care more about the games. And I also understand that there’s no reason to gratuitously offend potential advertisers: no need for SI columnists to go on rants against training headsets or fancy sneakers or whatever.

But if you’re running ads, can’t you just label them as such? How hard would that be?

Don’t go all Dr. Oz on us, dudes!

I’m reminded of what a friend told me once, years ago, that it’s easier to be ethical when you’re rich. 40 years ago, the management of Time Inc. were sitting at the top of the world, bathed in prestige and attention and advertising dollars. They could afford the highest moral standards. Now they’re desperate for sponsorship and are doing the journalistic equivalent of knocking over liquor stores to pay the rent each month.

Or maybe this is all listed as sponsored content, and I just didn’t notice the label.

P.S. If you click on the author link at the above-discussed article on the stretchy underwear, you get a bunch more of the same:

Confirmation bias

Shravan Vasishth is unimpressed by this evidence that was given to support the claim that being bilingual postpones symptoms of dementia:

[screenshot of the evidence in question]

My reaction: Seems like there could be some selection issues, no?

Shravan: Also, low sample size, and confirming what she already believes. I would be more impressed if she found evidence against the bilingual advantage.

Me: Hmmm, that last bit is tricky, as there’s also a motivation for people to find surprising, stunning results.

Shravan: Yes, but you will never find that this surprising, stunning result is something that goes against the author’s own previously published work. It always goes against someone *else*’s. I find this issue to be the most surprising and worrying of all, even more than p-hacking, that we only ever find evidence consistent with our beliefs and theories, never against.

Indeed, Shravan’s example confirms what I already thought about scientists.

The Lure of Luxury


From the sister blog, a response to an article by psychologist Paul Bloom on why people own things they don’t really need:

Paul Bloom argues that humans dig deep, look beyond the surface, and attend to the nonobvious in ways that add to our pleasure and appreciation of the world of objects. I [Susan] wholly agree with that analysis. My objection, however, is that he does not go far enough. There is a dark side to our infatuation by and obsession with the past. Our focus on historical persistence reveals not just appreciation and pleasure, but also bigotry and cruelty. Bloom’s story is incomplete without bringing these cases to light.

All the examples that Bloom discusses involve what we might call positive contagion—an object gains value because of its link to a beloved individual, history, or brand. This positive glow rescues a seemingly offensive behavior: contrary to what we might at first think, spending exorbitant amounts on a watch is not selfish or self-absorbed but rather can be understood as benign and even virtuous. Those who spend on luxuries are not “irrational, wasteful, . . . evil”—rather, they appropriately take pleasure by rationally considering the joy that we all find in a cherished object’s history.

Yet attention to an object’s history does not merely provide joy. History can also be a taint leading to suspicion, segregation, and discrimination. The psychologist Paul Rozin notes that people seem to operate according to a principle of “magical contagion,” where one can be harmed by contact with an object involved with evil or death, leading people to reject wearing Hitler’s sweater, a suit that someone died in, or a house in which a murder was committed. Fair enough. But the troubling point is that this same impulse arises when people come into contact with objects linked to those who are not evil but just different—not part of one’s in-group. In fact, simply thinking about such contact can be disturbing.

Segregation and institutionalized discrimination reflect this impulse to avoid contact across social groups. In parts of India, elaborate behavioral codes ensure that individuals will not come into contact with objects that have been touched by those of a lower caste. Thus, some teashops use a “double-tumbler” system, such that Dalits (“untouchables”) are required to use different cups, plates, or utensils than caste Hindus. Whites-only drinking fountains in the pre–civil rights southern United States can be understood as a means of avoiding negative history—contact with an object that has been touched by members of a marginalized group. In the 1980s, many responded similarly to individuals with AIDS, who were sometimes banned from swimming pools and other public places. Indeed, in one national survey, many respondents reported that they would be less likely to wear a sweater that had been worn once by a person with AIDS, or would feel uncomfortable drinking out of a sterilized glass that had been used a few days earlier by a person with AIDS.

In our own research, Meredith Meyer, Sarah-Jane Leslie, Sarah Stilwell, and I [Susan] found similar negative responses to a homeless person, someone with low IQ, someone with schizophrenia, or someone who has committed a crime. Adults typically report feeling “creeped out” by the idea of receiving an organ transplant or blood transfusion from such individuals for fear they will be contaminated or even become more like the donor. These beliefs hold even when people are assured that the organ or blood is healthy. In this case, a heart’s history is thought to carry with it negative characteristics of a group subject to discrimination.

Attention to object history may indeed be a biological adaptation. It can serve us well and enrich our appreciation of the objects around us, from Rolex watches to discarded baby shoes to a poet’s unused typewriter paper. But it is important that we recognize the terrible costs of this way of thinking.

We fiddle while Rome burns: p-value edition


Raghu Parthasarathy presents a wonderfully clear example of disastrous p-value-based reasoning that he saw in a conference presentation. Here’s Raghu:

Consider, for example, some tumorous cells that we can treat with drugs 1 and 2, either alone or in combination. We can make measurements of growth under our various drug treatment conditions. Suppose our measurements give us the following graph:

[graph: measured growth under control, treatment 1 alone, treatment 2 alone, and treatments 1 and 2 combined]
. . . from which we tell the following story: When administered on their own, drugs 1 and 2 are ineffective — tumor growth isn’t statistically different than the control cells (p > 0.05, 2 sample t-test). However, when the drugs are administered together, they clearly affect the cancer (p < 0.05); in fact, the p-value is very small (0.002!). This indicates a clear synergy between the two drugs: together they have a much stronger effect than each alone does. (And that, of course, is what the speaker claimed.)

I [Raghu] will pause while you ponder why this is nonsense.

He continues:

Another interpretation of this graph is that the “treatments 1 and 2” data are exactly what we’d expect for drugs that don’t interact at all. Treatment 1 and Treatment 2 alone each increase growth by some factor relative to the control, and there’s noise in the measurements. The two drugs together give a larger, simply multiplicative effect, and the signal relative to the noise is higher (and the p-value is lower) simply because the product of 1’s and 2’s effects is larger than each of their effects alone.

And now the background:

I [Raghu] made up the graph above, but it looks just like the “important” graphs in the talk. How did I make it up? The control dataset is random numbers drawn from a normal distribution with mean 1.0 and standard deviation 0.75, with N=10 measurements. Drug 1 and drug 2’s “data” are also from normal distributions with the same N and the same standard deviation, but with a mean of 2.0. (In other words, each drug enhances the growth by a factor of 2.0.) The combined treatment is drawn from a distribution of mean 4.0 (= 2 x 2), again with the same number of measurements and the same noise. In other words, the simplest model of a simple effect. One can simulate this ad nauseam to get a sense of how the measurements might be expected to look.

Did I pick a particular outcome of this simulation to make a dramatic graph? Of course, but it’s not un-representative. In fact, of the cases in which Treatment 1 and Treatment 2 each have p>0.05, over 70% have p<0.05 for Treatment 1 × Treatment 2! Put differently, conditional on each drug having an “insignificant” effect alone, there’s a 70% chance of the two together having a “significant” effect, not because they’re acting together, but just because multiplying two numbers greater than one gives a larger number, and a larger number is more easily distinguished from 1!
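Raghu’s setup is easy to reproduce. Here is a rough Python sketch of the simulation as he describes it (my own reconstruction, not his code; the 2.101 cutoff is the two-sided 5% critical value for a pooled two-sample t-test with df = 18, and the exact conditional percentage will vary with the random seed):

```python
import math
import random

random.seed(1)

def tstat(a, b):
    # pooled two-sample t statistic
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

TCRIT = 2.101   # two-sided 5% critical value for df = 18
N, SD = 10, 0.75

hits = both_null = 0
for _ in range(10000):
    control = [random.gauss(1.0, SD) for _ in range(N)]
    drug1 = [random.gauss(2.0, SD) for _ in range(N)]
    drug2 = [random.gauss(2.0, SD) for _ in range(N)]
    combo = [random.gauss(4.0, SD) for _ in range(N)]  # 2 x 2, no interaction
    # condition on each drug alone being "insignificant" vs. control
    if abs(tstat(drug1, control)) < TCRIT and abs(tstat(drug2, control)) < TCRIT:
        both_null += 1
        if abs(tstat(combo, control)) > TCRIT:
            hits += 1

# fraction of "both insignificant" cases where the combination is "significant"
print(both_null, hits / both_null)
```

Even with zero interaction built into the simulation, the combination comes out “significant” in the large majority of the conditioned cases, simply because its effect is larger.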

As we’ve discussed many times, the problem here is partly with p-values themselves and partly with the null hypothesis significance testing framework:

1. The problem with p-values: the p-value is a strongly nonlinear transformation of data that is interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null. My criticism here is not merely semantic or a clever tongue-twister or a “howler” (as Deborah Mayo would say); it’s real. In settings where the null hypothesis is not a live option, the p-value does not map to anything relevant.

To put it another way: Relative to the null hypothesis, the difference between a p-value of .13 (corresponding to a z-score of 1.5), and a p-value of .003 (corresponding to a z-score of 3), is huge; it’s the difference between a data pattern that could easily have arisen by chance alone, and a data pattern that is highly unlikely to have arisen by chance. But, once you allow nonzero effects (as is appropriate in the sorts of studies that people are interested in doing in the first place), the difference between z-scores of 1.5 and 3 is no big deal at all; it’s easily attributable to random variation. I don’t mind z-scores so much, but the p-value transformation does bad things to them.
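To make the z-to-p mapping concrete, here is a quick Python check using the standard normal two-sided tail probability:

```python
import math

def two_sided_p(z):
    # two-sided p-value for a standard-normal z-score
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

print(round(two_sided_p(1.5), 3))  # ≈ 0.134
print(round(two_sided_p(3.0), 3))  # ≈ 0.003
# The two z-scores differ by 1.5, but the sd of the difference of two
# independent z-scores is sqrt(2) ≈ 1.41, so a gap of 1.5 is about one
# standard error of that difference: easily attributable to chance.
print(round(1.5 / math.sqrt(2), 2))
```

A seemingly dramatic gap on the p-value scale corresponds to an entirely unremarkable gap on the z scale.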

2. The problem with null hypothesis significance testing: As Raghu discusses near the end of his post, this sort of binary thinking makes everything worse in that people inappropriately combine probabilistic statements with Boolean rules. And switching from p-values to confidence intervals doesn’t do much good here, for two reasons: (a) if all you do is check whether the conf intervals excludes 0, you haven’t gone forward at all, and (b) even if you do use them as uncertainty statements, classical intervals have all the biases that arise from not including prior information: classical conf intervals overestimate magnitudes of effect sizes.

Anyway, we know all this, but recognizing the ubiquity of fatally flawed significance-testing reasoning puts a bit more pressure on us to come up with and promote better alternatives that are just as easy to use. I do think this is possible; indeed I’m working on it when not spending my time blogging. . . .

“Which curve fitting model should I use?”


Oswaldo Melo writes:

I have learned many of curve fitting models in the past, including their technical and mathematical details. Now I have been working on real-world problems and I face a great shortcoming: which method to use.

As an example, I have to predict the demand of a product. I have a time series collected over the last 8 years. A simple set of (x,y) data about the relationship between the demand of a product on a certain week. I have this for 9 products. And to continue the study, I must predict the demand of each product for the next years.

Looks easy enough, right? Since I do not have the probability distribution of the data, just use a non-parametric curve fitting algorithm. But which one? Kernel smoothing? B-splines? Wavelets? Symbolic regression? What about Fourier analysis? Neural networks? Random forests?

There are dozens of methods that I could use. But which one has better performance remains a mystery. I tried to read many articles in which the authors make predictions based on a time series, and in most, it looks like the choice was completely arbitrary. They would say: “now we will fit a curve to the data using multivariate adaptive regression splines.” But nowhere is it explained why they used such a method instead of, let’s say, kernel regression or Fourier analysis or a neural network.

I am aware of cross-validation. But am I supposed to try all the dozen methods out there, cross-validate all of them, and see which one performs better? Can cross-validation even be used for all methods? I am not sure. I have mostly seen cross-validation being used within a single method, never between a lot of methods.

I could not find anything on the literature that answers such a simple question. “Which curve fitting model should I use?”

These are good questions. Here are my responses, in no particular order:

1. What is most important about a statistical model is not what it does with the data but, rather, what data it uses. You want to use a model that can take advantage of all the data you have.

2. In your setting with structured time series data, I’d use a multilevel model with coefficients that vary by product and by time. You may well have other structure in your data that you haven’t even mentioned yet, for example demand as broken down by geography or demographic sectors of your consumers; also the time dimension has structure, with different things happening at different times of year. If you want a nonparametric curve fit, you could try a Gaussian process, which plays well with Bayesian multilevel models.
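For the nonparametric part of point 2, here is a minimal Python sketch of Gaussian-process regression on a made-up weekly demand series. Everything here (the squared-exponential kernel, its hyperparameters, the noise level, the synthetic data) is an assumption for illustration; in practice you would estimate the hyperparameters, e.g. in Stan, and embed the GP in the multilevel model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly demand for one product over ~8 years (every 4th week):
# trend + yearly cycle + noise
weeks = np.arange(0, 416, 4.0)
y = (10 + 0.01 * weeks
     + 2 * np.sin(2 * np.pi * weeks / 52)
     + rng.normal(0, 0.5, weeks.size))

def sq_exp(a, b, scale=26.0, amp=2.0):
    # squared-exponential (RBF) covariance between two vectors of weeks
    return amp**2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / scale) ** 2)

sigma = 0.5                              # assumed observation-noise sd
K = sq_exp(weeks, weeks) + sigma**2 * np.eye(weeks.size)
x_new = np.arange(416, 520, 4.0)         # extrapolate the next two years
K_star = sq_exp(x_new, weeks)

# GP posterior mean at the new weeks, conditioning on the demeaned data
mu = y.mean() + K_star @ np.linalg.solve(K, y - y.mean())
print(mu[:3])
```

Note that with this kernel the extrapolation reverts toward the overall mean as you move past the data; capturing trend and seasonality out-of-sample would mean encoding them in the mean function or kernel, which is exactly where the structured-modeling advice above comes in.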

3. Cross-validation is fine but it’s just one more statistical method. To put it another way, if you estimate a parameter or pick a method using cross-validation, it’s still just an estimate. Just cos something performs well in cross-validation, it doesn’t mean it’s the right answer. It doesn’t even mean it will predict well for new data.
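On point 3, here is a toy Python illustration (my own construction) of cross-validation being “just one more statistical method”: even when one predictor is the right answer by construction, leave-one-out doesn’t always pick it:

```python
import random

random.seed(2)

def fit_mean(train):
    # "method A": predict the training mean (optimal for iid normal data)
    m = sum(train) / len(train)
    return lambda: m

def fit_last(train):
    # "method B": predict the last training observation
    last = train[-1]
    return lambda: last

def loo_mse(data, fit):
    # leave-one-out cross-validation for a predictor with no covariates
    errs = []
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]
        errs.append((data[i] - fit(train)()) ** 2)
    return sum(errs) / len(errs)

# Same data-generating process every time; does CV always pick method A?
picks_a = 0
for _ in range(200):
    data = [random.gauss(0, 1) for _ in range(8)]
    if loo_mse(data, fit_mean) < loo_mse(data, fit_last):
        picks_a += 1
print(picks_a)  # most of the 200 datasets, but not all: the CV choice is noisy
```

With small samples, the cross-validation winner is itself a noisy estimate, which is the point: performing well in CV on your data is not the same as being the right model.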

4. There are lots of ways to solve a problem. The choice of method to use will depend on what information you want to include in your model, and also what sorts of extrapolations you’ll want to use it for.