
Stan Weekly Roundup, 23 June 2017

Lots of activity this week, as usual.

* Lots of people got involved in pushing Stan 2.16 and interfaces out the door; Sean Talts got the math library, Stan library (that’s the language, inference algorithms, and interface infrastructure), and CmdStan out, while Allen Riddell got PyStan 2.16 out and Ben Goodrich and Jonah Gabry are tackling RStan 2.16

* Stan 2.16 is the last series of releases that will not require C++11; let the coding fun begin!

* Ari Hartikainen (of Aalto University) joined the Stan dev team—he’s working with Allen Riddell on PyStan, where judging from the pull request traffic, he put in a lot of work on the 2.16 release. Welcome!

* Imad Ali’s working on adding more cool features to RStanArm, including time series and spatial models; yesterday he and Mitzi were scheming to get intrinsic conditional autoregressive models in, and I heard all those time series names flying around (like ARIMA)

* Michael Betancourt rearranged the Stan web site with some input from me and Andrew; Michael added more descriptive text and Sean Talts managed to get the redirects in so all of our links aren’t broken; let us know what you think

* Markus Ojala of Smartly wrote a case study on their blog, Tutorial: How We Productized Bayesian Revenue Estimation with Stan

* Mitzi Morris got in the pull request for adding compound assignment and arithmetic; this adds statements such as n += 1.

* Lots of chatter about characterization tests and a pull request from Daniel Lee to update some of our existing performance tests

* Roger Grosse from U. Toronto visited to tell us about his 2016 NIPS paper with Siddharth Ancha and Daniel Roy on testing MCMC using bidirectional Monte Carlo sampling; we talked about how he modified Stan’s sampler to do annealed importance sampling

* GPU integration continues apace

* I got to listen in on Michael Betancourt and Maggie Lieu of the European Space Institute spend a couple days hashing out astrophysics models; Maggie would really like us to add integrals.

* Speaking of integration, Marco Inacio has updated his pull request; Michael’s worried there may be numerical instabilities, because trying to calculate arbitrary bounded integrals is not so easy in a lot of cases
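The worry is easy to demonstrate with a toy example (this is just a hypothetical fixed-grid rule in Python, not whatever quadrature the pull request actually uses): a perfectly well-behaved but sharply peaked integrand can be invisible to any grid coarser than the peak, so a naive quadrature confidently returns the wrong answer.

```python
import math

def trapezoid(f, a, b, n):
    """Fixed-grid trapezoidal rule on [a, b] with n intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# A sharply peaked integrand: a normal density with scale 1e-4, centered
# between grid points. Its integral over [0, 1] is essentially 1.
sigma = 1e-4
def spike(x):
    z = (x - 0.505) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

coarse = trapezoid(spike, 0.0, 1.0, 100)      # grid spacing 0.01 >> sigma: misses the peak entirely
fine = trapezoid(spike, 0.0, 1.0, 1_000_000)  # grid spacing 1e-6 << sigma: recovers the mass
# coarse is ~0, fine is ~1
```

Adaptive quadrature helps, but for arbitrary user-supplied integrands no fixed strategy is safe, which is presumably the source of Michael’s concern.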

* Andrew continues to lobby for being able to write priors directly into parameter declarations; for example, here’s what a hierarchical prior for beta might look like:

parameters {
  real mu ~ normal(0, 2);
  real sigma ~ student_t(4, 0, 2);
  vector[N] beta ~ normal(mu, sigma);
}

* I got the go-ahead on adding foreach loops; Mitzi Morris will probably be coding them. We’re talking about something like:

real ys[N];
for (y in ys)
  target += log_mix(lambda, normal_lpdf(y | mu[1], sigma[1]),
                            normal_lpdf(y | mu[2], sigma[2]));
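For readers who haven’t run into log_mix: it computes the log of a two-component mixture density entirely on the log scale, which avoids underflow when the component densities are tiny. Here’s a minimal Python sketch of what each iteration of the loop above adds to the target (normal_lpdf and log_mix are reimplemented here purely for illustration):

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_mix(theta, lp1, lp2):
    """log(theta * exp(lp1) + (1 - theta) * exp(lp2)), computed stably."""
    return log_sum_exp(math.log(theta) + lp1, math.log1p(-theta) + lp2)

def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# One observation's contribution to the target for a two-component mixture
lam = 0.3
lp = log_mix(lam, normal_lpdf(1.0, 0.0, 1.0), normal_lpdf(1.0, 5.0, 2.0))
```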

* Kalman filter case study by Jouni Helske was discussed on Discourse

* Rob Trangucci rewrote the Gaussian processes chapter of the Stan manual; I’m to blame for the first version, writing it as I was learning GPs. For some reason, it’s not up on the web page doc yet.

* This is a very ad hoc list. I’m sure I missed lots of good stuff, so feel free to either send updates to me directly for next week’s letter or add things to comments. This project’s now way too big for me to track all the activity!

Best correction ever: “Unfortunately, the correct values are impossible to establish, since the raw data could not be retrieved.”

Commenter Erik Arnesen points to this:

Several errors and omissions occurred in the reporting of research and data in our paper: “How Descriptive Food Names Bias Sensory Perceptions in Restaurants,” Food Quality and Preference (2005) . . .

The dog ate my data. Damn gremlins. I hate when that happens.

As the saying goes, “Each year we publish 20+ new ideas in academic journals, and we appear in media around the world.” In all seriousness, the problem is not that they publish their ideas, the problem is that they are “changing or omitting data or results such that the research is not accurately represented in the research record.” And of course it’s not just a problem with Mr. Pizzagate or Mr. Gremlins or Mr. Evilicious or Mr. Politically Incorrect Sex Ratios: it’s all sorts of researchers who (a) don’t report what they actually did, and (b) refuse to reconsider their flimsy hypotheses in light of new theory or evidence.

Question about the secret weapon

Micah Wright writes:

I first encountered your explanation of secret weapon plots while I was browsing your blog in grad school, and later in your 2007 book with Jennifer Hill. I found them immediately compelling and intuitive, but I have been met with a lot of confusion and some skepticism when I’ve tried to use them. I’m uncertain as to whether it’s me that’s confused, or whether my audience doesn’t get it. I should note that my formal statistical training is somewhat limited—while I was able to take a couple of stats courses during my masters, I’ve had to learn quite a bit on the side, which makes me skeptical as to whether or not I actually understand what I’m doing.

My main question is this: when using the secret weapon, does it make sense to subset the data across any arbitrary variable of interest, as long as you want to see if the effects of other variables vary across its range? My specific case concerns tree growth (ring widths). I’m interested to see how the effect of competition (crowding and other indices) on growth varies at different temperatures, and if these patterns change in different locations (there are two locations). To do this, I subset the growth data in two steps: first by location, then by each degree of temperature, which I rounded to the nearest integer. I then ran the same linear model on each subset. The model had growth as the response, and competition variables as predictors, which were standardized. I’ve attached the resulting figure [see above], which plots the change in effect for each predictor over the range of temperature.

My reply: I like these graphs! In future you might try a 6 x K grid, where K is the number of different things you’re plotting. That is, right now you’re wasting one of your directions because your 2 x 3 grid doesn’t mean anything. These plots are fine, but if you have more information for each of these predictors, you can consider plotting the existing information as six little graphs stacked vertically and then you’ll have room for additional columns. In addition, you should make the tick marks much smaller, put the labels closer to the axes, and reduce the number of axis labels, especially on the vertical axes. For example, (0.0, 0.3, 0.6, 0.9) can be replaced by labels at 0, 0.5, 1.

Regarding the larger issue of, what is the secret weapon, as always I see it as an approximation to a full model that bridges the different analyses. It’s a sort of nonparametric analysis. You should be able to get better estimates by using some modeling, but a lot of that smoothing can be done visually anyway, so the secret weapon gets you most of the way there, and in my view it’s much much better than the usual alternative of fitting a single model to all the data without letting all the coefficients vary.
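To make the recipe concrete, here is a hypothetical sketch in Python (simulated data, not Micah’s tree rings): subset the data by the variable of interest, fit the identical simple model in each subset, and collect the estimates to plot against the subsetting variable.

```python
import random

random.seed(1)

def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Simulated data in which the effect of "competition" on growth
# genuinely varies with temperature
data = []
for temp in range(10, 26):           # temperature bins, degrees
    true_slope = -1.0 + 0.04 * temp  # effect weakens as it warms
    for _ in range(50):
        comp = random.gauss(0, 1)
        growth = true_slope * comp + random.gauss(0, 0.5)
        data.append((temp, comp, growth))

# The "secret weapon": fit the same model separately in each subset,
# then plot the estimates side by side (here we just collect them)
estimates = {}
for temp in range(10, 26):
    sub = [(c, g) for t, c, g in data if t == temp]
    estimates[temp] = ols_slope([c for c, _ in sub], [g for _, g in sub])
```

Plotting estimates (with their standard errors) against temperature is exactly the kind of figure Micah describes; the pattern in the estimates is the visual, nonparametric version of an interaction term.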

“Developers Who Use Spaces Make More Money Than Those Who Use Tabs”

Rudy Malka writes:

I think you’ll enjoy this nice piece of pop regression by David Robinson: developers who use spaces make more money than those who use tabs. I’d like to know your opinion about it.

At the above link, Robinson discusses a survey that allows him to compare salaries of software developers who use tabs to those who use spaces. The key graph is above. Robinson found similar results after breaking down the data by country, job title, or computer language used, and it also showed up in a linear regression controlling in a simple way for a bunch of factors.

As Robinson put it in terms reminiscent of our Why Ask Why? paper:

This is certainly a surprising result, one that I didn’t expect to find when I started exploring the data. . . . I tried controlling for many other confounding factors within the survey data beyond those mentioned here, but it was difficult to make the effect shrink and basically impossible to make it disappear.

Speaking with the benefit of hindsight—that is, seeing Robinson’s results and assuming they are a correct representation of real survey data—it all makes sense to me. Tabs seem so amateurish, I much prefer spaces—2 spaces, not 4, please!!!—so from that perspective it makes sense to me that the kind of programmers who use tabs tend to be programmers with poor taste and thus, on average, of lower quality.

I just want to say one thing. Robinson writes, “Correlation is not causation, and we can never be sure that we’ve controlled for all the confounding factors present in a dataset.” But this isn’t quite the point. Or, to put it another way, I think he has the right instinct here but isn’t quite presenting the issue precisely. To see why, suppose the survey had only 2 questions: How much money do you make? and Do you use spaces or tabs? And suppose we had no other information on the respondents. And, for that matter, suppose there was no nonresponse and that we had a simple random sample of all programmers from some specified set of countries. In that case, we’d know for sure that there are no other confounding factors in the dataset, as the dataset is nothing but those two columns of numbers. But we’d still be able to come up with a zillion potential explanations.

To put it another way, the descriptive comparison is interesting in its own right, and we just should be careful about misusing causal language. Instead of saying, “using spaces instead of tabs leads to an 8.6% higher salary,” we could say, “comparing two otherwise similar programmers, the one who uses spaces has, on average, an 8.6% higher salary than the one who uses tabs.” That’s a bit of a mouthful—but such a mouthful is necessary to accurately describe the comparison that’s being made.

Time-sharing Experiments for the Social Sciences

Jamie Druckman writes:

Time-sharing Experiments for the Social Sciences (TESS) is an NSF-funded initiative. Investigators propose survey experiments to be fielded using a nationally representative Internet platform via NORC’s AmeriSpeak® Panel (see http:/ for more information). In an effort to enable younger scholars to field larger-scale studies than what TESS normally conducts, we are pleased to announce a Special Competition for Young Investigators. While anyone can submit at any time through TESS’s regular proposal mechanism, this Special Competition is limited to graduate students and individuals who are no more than 3 years post-PhD. Winning projects will be allowed to be fielded at up to twice the budget of a regular TESS study. For more specifics on the special competition, see:  We will begin accepting proposals for the Special Competition on August 1, 2017, and the deadline is October 1, 2017.  Full details about the competition are available at   This page includes information about what is required of proposals and how to submit, and should be reviewed by anyone entering the competition.

After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.

Someone pointed me to this post by “Neuroskeptic”:

A new paper in the prestigious journal PNAS contains a rather glaring blooper. . . . right there in the abstract, which states that “three neuropeptides (β-endorphin, oxytocin, and dopamine) play particularly important roles” in human sociality. But dopamine is not a neuropeptide. Neither are serotonin or testosterone, but throughout the paper, Pearce et al. refer to dopamine, serotonin and testosterone as ‘neuropeptides’. That’s just wrong. A neuropeptide is a peptide active in the brain, and a peptide in turn is the term for a molecule composed of a short chain of amino acids. Neuropeptides include oxytocin, vasopressin, and endorphins – which do feature in the paper. But dopamine and serotonin aren’t peptides, they’re monoamines, and testosterone isn’t either, it’s a steroid. This isn’t a matter of opinion, it’s basic chemistry.

The error isn’t just an isolated typo: ‘neuropeptide’ occurs 27 times in the paper, while the correct terms for the non-peptides are never used.

Neuroskeptic speculates on how this error got in:

It’s a simple mistake; presumably whoever wrote the paper saw oxytocin and vasopressin referred to as “neuropeptides” and thought that the term was a generic one meaning “signalling molecule.” That kind of mistake could happen to anyone, so we shouldn’t be too harsh on the authors . . .

The authors of the paper work in a psychology department, so I guess they’re rusty on their organic chemistry.

Fair enough; I haven’t completed a chemistry class since 11th grade, and I didn’t know what a peptide is, either. Then again, I’m not writing articles on peptides for the National Academy of Sciences.

But how did this get through the review process? Let’s take a look at the published article:

Ahhhh, now I understand. The editor is Susan Fiske, notorious as the person who opened the gates of PPNAS for the articles on himmicanes, air rage, and ages ending in 9. I wonder who were the reviewers of this new paper. Nobody who knows what a peptide is, I guess. Or maybe they just read it very quickly, flipped through to the graphs and the conclusions, and didn’t read a lot of the words.

Did you catch that? Neuroskeptic refers to “the prestigious journal PNAS.” That’s PPNAS for short. This is fine, I guess. Maybe the science is ok. Based on a quick scan of the paper, I don’t think we should take a lot of the specific claims seriously, as they seem to be based on the difference between “significant” and “non-significant.”

In particular, I’m not quite sure what their support is for the statement from the abstract that “each neuropeptide is quite specific in its domain of influence.” They’re rejecting various null hypotheses, but I don’t know that this supports their substantive claims in the way that they’re saying.

I might be missing something here—I might be missing a lot—but in any case there seem to be some quality control problems at PPNAS. This should be no surprise: PPNAS is a huge journal, publishing over 3000 papers each year.

On their website they say, “PNAS publishes only the highest quality scientific research,” but this statement is simply false. I can’t really comment on this particular paper—it doesn’t seem like “the highest quality scientific research” to me, but, again, maybe I’m missing something big here. But I can assure you that the papers on himmicanes, air rage, and ages ending in 9 are not “the highest quality scientific research.” They’re not high quality research at all! What they are, is low-quality research that happens to be high-quality clickbait.

OK, let’s be fair. This is not a problem unique to PPNAS. The Lancet publishes crap papers, Psychological Science publishes crap papers, even JASA and APSR have their share of duds. Statistical Science, to its eternal shame, published that Bible Code paper in 1994. That’s fine, it’s how the system operates. Editors are only human.

But, really, do we have to make statements that we know are false? Platitudes are fine but let’s avoid intentional untruths.

So, instead of “PNAS publishes only the highest quality scientific research,” how about this: “PNAS aims to publish only the highest quality scientific research.” That’s fair, no?

P.S. Here’s a fun little graphics project: Redo Figure 1 as a lineplot. You’ll be able to show a lot more comparisons much more directly using lines rather than bars. The current grid of barplots is not the worst thing in the world—it’s much better than a table—but it could be much improved.

P.P.S. Just to be clear: (a) I don’t know anything about peptides so I’m offering no independent judgment of the paper in question; (b) whatever the quality is of this particular paper, does not affect my larger point that PPNAS publishes some really bad papers and so they should change their slogan to something more accurate.

P.P.P.S. The relevant Pubpeer page pointed to the following correction note that was posted on the PPNAS site after I wrote the above post but before it was posted:

The authors wish to note, “We used the term ‘neuropeptide’ in referring to the set of diverse neurochemicals that we examined in this study, some of which are not peptides; dopamine and serotonin are neurotransmitters and should be listed as such, and testosterone should be listed as a steroid. Our usage arose from our primary focus on the neuropeptides endorphin and oxytocin. Notwithstanding the biochemical differences between these neurochemicals, we note that these terminological issues have no implications for the significance of the findings reported in this paper.”

On deck through the rest of the year (and a few to begin 2018)

Here they are. I love seeing all the titles lined up in one place; it’s like a big beautiful poem about statistics:

  • After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.
  • “Developers Who Use Spaces Make More Money Than Those Who Use Tabs”
  • Question about the secret weapon
  • Incentives Matter (Congress and Wall Street edition)
  • Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.
  • Problems with the jargon “statistically significant” and “clinically significant”
  • Capitalist science: The solution to the replication crisis?
  • Bayesian, but not Bayesian enough
  • Let’s stop talking about published research findings being true or false
  • Plan 9 from PPNAS
  • No, I’m not blocking you or deleting your comments!
  • “Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.”
  • “The Null Hypothesis Screening Fallacy”?
  • What is a pull request?
  • Turks need money after expensive weddings
  • Statisticians and economists agree: We should learn from data by “generating and revising models, hypotheses, and data analyzed in response to surprising findings.”
  • My unpublished papers
  • Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories
  • Night Hawk
  • Why they aren’t behavioral economists: Three sociologists give their take on “mental accounting”
  • Further criticism of social scientists and journalists jumping to conclusions based on mortality trends
  • Daryl Bem and Arthur Conan Doyle
  • Classical statisticians as Unitarians
  • Slaying Song
  • What is “overfitting,” exactly?
  • Graphs as comparisons: A case study
  • Should we continue not to trust the Turk? Another reminder of the importance of measurement
  • “The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll
  • His concern is that the authors don’t control for the position of games within a season.
  • How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?
  • “Bayes factor”: where the term came from, and some references to why I generally hate it
  • A stunned Dyson
  • Applying human factors research to statistical graphics
  • Recently in the sister blog
  • Adding a predictor can increase the residual variance!
  • Died in the Wool
  • “Statistics textbooks (including mine) are part of the problem, I think, in that we just set out ‘theta’ as a parameter to be estimated, without much reflection on the meaning of ‘theta’ in the real world.”
  • An improved ending for The Martian
  • Delegate at Large
  • Iceland education gene trend kangaroo
  • Reproducing biological research is harder than you’d think
  • The fractal zealots
  • Giving feedback indirectly by invoking a hypothetical reviewer
  • It’s hard to know what to say about an observational comparison that doesn’t control for key differences between treatment and control groups, chili pepper edition
  • PPNAS again: If it hadn’t been for the jet lag, would Junior have banged out 756 HRs in his career?
  • Look. At. The. Data. (Hollywood action movies example)
  • “This finding did not reach statistical sig­nificance, but it indicates a 94.6% prob­ability that statins were responsible for the symptoms.”
  • Wolfram on Golomb
  • Irwin Shaw, John Updike, and Donald Trump
  • What explains my lack of openness toward this research claim? Maybe my cortex is just too damn thick and wrinkled
  • I love when I get these emails!
  • Consider seniority of authors when criticizing published work?
  • Does declawing cause harm?
  • Bird fight! (Kroodsma vs. Podos)
  • The Westlake Review
  • “Social Media and Fake News in the 2016 Election”
  • Also holding back progress are those who make mistakes and then label correct arguments as “nonsensical.”
  • Just google “Despite limited statistical power”
  • It is somewhat paradoxical that good stories tend to be anomalous, given that when it comes to statistical data, we generally want what is typical, not what is surprising. Our resolution of this paradox is . . .
  • “Babbage was out to show that not only was the system closed, with a small group controlling access to the purse strings and the same individuals being selected over and again for the few scientific honours or paid positions that existed, but also that one of the chief beneficiaries . . . was undeserving.”
  • Irish immigrants in the Civil War
  • Mixture models in Stan: you can use log_mix()
  • Don’t always give ’em what they want: Practicing scientists want certainty, but I don’t want to offer it to them!
  • Cumulative residual plots seem like they could be useful
  • Sucker MC’s keep falling for patterns in noise
  • Nice interface, poor content
  • “From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotter’s Club on the grounds that Jonathan Coe was just making it all up.”
  • Chris Moore, Guy Molyneux, Etan Green, and David Daniels on Bayesian umpires
  • Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice
  • “Mainstream medicine has its own share of unnecessary and unhelpful treatments”
  • What are best practices for observational studies?
  • The Groseclose endgame: Getting from here to there.
  • Causal identification + observational study + multilevel model
  • All cause and breast cancer specific mortality, by assignment to mammography or control
  • Iterative importance sampling
  • Rosenbaum (1999): Choice as an Alternative to Control in Observational Studies
  • Gigo update (“electoral integrity project”)
  • How to design and conduct a subgroup analysis?
  • Local data, centralized data analysis, and local decision making
  • Too much backscratching and happy talk: Junk science gets to share in the reputation of respected universities
  • Selection bias in the reporting of shaky research: An example
  • Self-study resources for Bayes and Stan?
  • Looking for the bottom line
  • “How conditioning on post-treatment variables can ruin your experiment and what to do about it”
  • Trial by combat, law school style
  • Causal inference using data from a non-representative sample
  • Type M errors studied in the wild
  • Type M errors in the wild—really the wild!
  • Where does the discussion go?
  • Maybe this paper is a parody, maybe it’s a semibluff
  • As if the 2010s never happened
  • Using black-box machine learning predictions as inputs to a Bayesian analysis
  • It’s not enough to be a good person and to be conscientious. You also need good measurement. Cargo-cult science done very conscientiously doesn’t become good science, it just falls apart from its own contradictions.
  • Air rage update
  • Getting the right uncertainties when fitting multilevel models
  • Chess records page
  • Weisburd’s paradox in criminology: it can be explained using type M errors
  • “Cheerleading with an agenda: how the press covers science”
  • Automated Inference on Criminality Using High-tech GIGO Analysis
  • Some ideas on using virtual reality for data visualization: I don’t really agree with the details here but it’s all worth discussing
  • Contribute to this pubpeer discussion!
  • For mortality rate junkies
  • The “fish MRI” of international relations studies.
  • “5 minutes? Really?”
  • 2 quick calls
  • Should we worry about rigged priors? A long discussion.
  • I’m not on twitter
  • I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief
  • “Why bioRxiv can’t be the Central Service”
  • Sudden Money
  • The house is stronger than the foundations
  • Please contribute to this list of the top 10 do’s and don’ts for doing better science
  • Partial pooling with informative priors on the hierarchical variance parameters: The next frontier in multilevel modeling
  • Does racquetball save lives?
  • When do we want evidence-based change? Not “after peer review”
  • “I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.”
  • “Bayesian evidence synthesis”
  • Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??”
  • Beyond forking paths: using multilevel modeling to figure out what can be learned from this survey experiment
  • From perpetual motion machines to embodied cognition: The boundaries of pseudoscience are being pushed back into the trivial.
  • Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference
  • “La critique est la vie de la science”: I kinda get annoyed when people set themselves up as the voice of reason but don’t ever get around to explaining what’s the unreasonable thing they dislike.
  • How to discuss your research findings without getting into “hypothesis testing”?
  • Does traffic congestion make men beat up their wives?
  • The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting
  • I think it’s great to have your work criticized by strangers online.
  • In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried.
  • If you want to know about basketball, who ya gonna trust, the Irene Blecker Rosenfeld Professor of Psychology at Cornell University and author of “The Wisest One in the Room: How You Can Benefit from Social Psychology’s Most Powerful Insights,” . . . or that poseur Phil Jackson??
  • Quick Money
  • An alternative to the superplot
  • Where the money from Wiley Interdisciplinary Reviews went . . .
  • Retract or correct, don’t delete or throw into the memory hole
  • Using Mister P to get population estimates from respondent driven sampling
  • “Americans Greatly Overestimate Percent Gay, Lesbian in U.S.”
  • “It all reads like a classic case of faulty reasoning where the reasoner confuses the desirability of an outcome with the likelihood of that outcome.”
  • Pseudoscience and the left/right whiplash
  • The time reversal heuristic (priming and voting edition)
  • The Night Riders
  • Why you can’t simply estimate the hot hand using regression
  • Stan to improve rice yields
  • When people proudly take ridiculous positions
  • “A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state,” and other notes on “Whither Science?” by Danko Antolovic
  • Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger.
  • What should this student do? His bosses want him to p-hack and they don’t even know it!
  • Fitting multilevel models when predictors and group effects correlate
  • I hate that “Iron Law” thing
  • High five: “Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking.”
  • “What is a sandpit?”
  • No no no no no on “The oldest human lived to 122. Why no person will likely break her record.”
  • Tips when conveying your research to policymakers and the news media
  • Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.
  • Spatial models for demographic trends?
  • A pivotal episode in the unfolding of the replication crisis
  • We start by talking reproducible research, then we drift to a discussion of voter turnout
  • Wine + Stan + Climate change = ?
  • Stan is a probabilistic programming language
  • Using output from a fitted machine learning algorithm as a predictor in a statistical model
  • Poisoning the well with a within-person design? What’s the risk?
  • “Dear Professor Gelman, I thought you would be interested in these awful graphs I found in the paper today.”
  • I know less about this topic than I do about Freud.
  • Driving a stake through that ages-ending-in-9 paper
  • What’s the point of a robustness check?
  • Oooh, I hate all talk of false positive, false negative, false discovery, etc.
  • Trouble Ahead
  • A new definition of the nerd?
  • Orphan drugs and forking paths: I’d prefer a multilevel model but to be honest I’ve never fit such a model for this sort of problem
  • Popular expert explains why communists can’t win chess championships!
  • The four missing books of Lawrence Otis Graham
  • “There was this prevalent, incestuous, backslapping research culture. The idea that their work should be criticized at all was anathema to them. Let alone that some punk should do it.”
  • Loss of confidence
  • “How to Assess Internet Cures Without Falling for Dangerous Pseudoscience”
  • Ed Jaynes outta control!
  • A reporter sent me a Jama paper and asked me what I thought . . .
  • Workflow, baby, workflow
  • Two steps forward, one step back
  • Yes, you can do statistical inference from nonrandom samples. Which is a good thing, considering that nonrandom samples are pretty much all we’ve got.
  • The Night Riders
  • The piranha problem in social psychology / behavioral economics: The “take a pill” model of science eats itself
  • Ready Money
  • Stranger than fiction
  • “The Billy Beane of murder”?
  • Red doc, blue doc, rich doc, rich doc
  • Working Class Postdoc
  • “We wanted to reanalyze the dataset of Nelson et al. However, when we asked them for the data, they said they would only share the data if we were willing to include them as coauthors.”
  • UNDER EMBARGO: the world’s most unexciting research finding
  • Setting up a prior distribution in an experimental analysis
  • Walk a Crooked Mile
  • It’s . . . spam-tastic!
  • The failure of null hypothesis significance testing when studying incremental changes, and what to do about it
  • Robust standard errors aren’t for me
  • Stupid-ass statisticians don’t know what a goddam confidence interval is
  • Forking paths plus lack of theory = No reason to believe any of this.
  • Turn your scatterplots into elegant apparel and accessories!
  • Your (Canadian) tax dollars at work

And a few to begin 2018:

  • The Ponzi threshold and the Armstrong principle
  • I’m with Errol: On flypaper, photography, science, and storytelling
  • Politically extreme yet vital to the nation
  • How does probabilistic computation differ in physics and statistics?
  • “Each computer run would last 1,000-2,000 hours, and, because we didn’t really trust a program that ran so long, we ran it twice, and it verified that the results matched. I’m not sure I ever was present when a run finished.”


We’ll also intersperse topical items as appropriate.

Not everyone’s aware of falsificationist Bayes

Stephen Martin writes:

Daniel Lakens recently blogged about philosophies of science and how they relate to statistical philosophies. I thought it may be of interest to you. In particular, this statement:

From a scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories. Belief can be used to decide which questions to examine, but it can not be used to determine the truth-likeness of a theory.

My response, TLDR:
1) frequentism and NP require more subjectivity than they’re given credit for (assumptions, belief in perfectly known sampling distributions, Beta [and thus type-2 error ‘control’] requires a subjective estimate of the alternative effect size)

2) Bayesianism isn’t inherently more subjective, it just acknowledges uncertainty given the data [still data-driven!]

3) Popper probably wouldn’t like the NHST ritual, given that we use p-values to support hypotheses, not to refute an accepted hypothesis [the nil-hypothesis of 0 is not an accepted hypothesis in most cases]

4) Refuting falsifiable hypotheses can be done in Bayes, which is largely what Popper cared about anyway

5) Even in an NP or LRT framework, people don’t generally care about EXACT statistical hypotheses, they care about substantive hypotheses, which map to a range of statistical/estimate hypotheses, and YET people don’t test the /range/, they test point values; Bayes can easily ‘test’ the hypothesized range.

My [Martin’s] full response is here.

I agree with everything that Martin writes above. And, for that matter, I agree with most of what Lakens wrote too. The starting point for all of this is my 2011 article, Induction and deduction in Bayesian data analysis. Also relevant are my 2013 article with Shalizi, Philosophy and the practice of Bayesian statistics and our response to the ensuing discussion, and my recent article with Hennig, Beyond subjective and objective in statistics.

Lakens covers the same Popper-Lakatos ground that we do, although he (Lakens) doesn’t appear to be aware of the falsificationist view of Bayesian data analysis, as expressed in chapter 6 of BDA and the articles listed above. Lakens is stuck in a traditionalist view of Bayesian inference as based on subjectivity and belief, rather than what I consider a more modern approach of conditionality, where Bayesian inference works out the implications of a statistical model or system of assumptions, the better to allow us to reveal problems that motivate improvements and occasional wholesale replacements of our models.
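To make this falsificationist workflow concrete: fit a model, simulate replicated data from the fitted model, and check whether the replications can reproduce salient features of the observed data. Here is a minimal posterior predictive check in Python, a toy example of my own (not from any of the papers above), with data deliberately generated from a heavier-tailed distribution than the fitted normal model assumes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed data secretly come from a heavy-tailed t distribution...
y = rng.standard_t(df=2, size=200)

# ...but we fit a normal model. With a flat prior on the mean and the
# scale plugged in, the posterior for the mean is approximately
# normal(y_bar, s / sqrt(n)).
n, y_bar, s = len(y), y.mean(), y.std(ddof=1)

# Posterior predictive check: simulate replicated datasets and compare
# a test statistic (here, max |y|) to the observed value.
T_obs = np.max(np.abs(y))
T_rep = []
for _ in range(1000):
    mu = rng.normal(y_bar, s / np.sqrt(n))   # draw from the posterior
    y_rep = rng.normal(mu, s, size=n)        # replicate the data
    T_rep.append(np.max(np.abs(y_rep)))

# Posterior predictive p-value: a value near 0 or 1 signals that the
# model fails to reproduce this feature of the data.
ppp = np.mean(np.array(T_rep) >= T_obs)
print(ppp)
```

Because the normal model cannot produce the extreme observations the t distribution generates, the replicated maxima fall short of the observed one and the check flags the model, which is exactly the "reveal problems that motivate improvements" step.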

Overall I’m glad Lakens wrote his post because he’s reminding people of important issues that are not handled well in traditional frequentist or subjective-Bayes approaches, and I’m glad that Martin filled in some of the gaps. The audience for all of this seems to be psychology researchers, so let me re-emphasize a point I’ve made many times, the distinction between statistical models and scientific models. A statistical model is necessarily specific, and we should avoid the all-too-common mistake of rejecting some uninteresting statistical model and taking this as evidence for a preferred scientific model. That way lies madness.

Breaking the dataset into little pieces and putting it back together again

Alex Konkel writes:

I was a little surprised that your blog post with the three smaller studies versus one larger study question received so many comments, and also that so many people seemed to come down on the side of three smaller studies. I understand that Stephen’s framing led to some confusion as well as practical concerns, but I thought the intent of the question was pretty straightforward.

At the risk of beating a dead horse, I wanted to try asking the question a different way: if you conducted a study (or your readers, if you want to put this on the blog), would you ever divide up the data into smaller chunks to see if a particular result appeared in each subset? Ignoring cases where you might want to examine qualitatively different groups, of course; would you ever try to make fundamentally homogeneous/equivalent subsets? Would you ever advise that someone else do so?

For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions. The blog comments, however, seem to come down on the side of this being a good practice. Are you (or your readers) going to start doing this?

My reply:

From a Bayesian standpoint, the result is the same, whether you consider all the data at once, or stir in the data one-third at a time. The problem would come if you make intermediate decisions that involve throwing away information, for example if you take parts of the data and just describe them as statistically significant or not.
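For a conjugate model this equivalence is easy to verify numerically. A sketch in Python for the normal-normal case with known unit variance (the 450-observation setup comes from Konkel's question; the prior here is my own arbitrary choice):

```python
import numpy as np

def update(m, v, y):
    """Conjugate update for y_i ~ normal(theta, 1), theta ~ normal(m, v)."""
    prec = 1 / v + len(y)                 # posterior precision
    mean = (m / v + y.sum()) / prec       # posterior mean
    return mean, 1 / prec

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=450)

# All the data at once:
m_all, v_all = update(0.0, 10.0**2, y)

# One third at a time, feeding each posterior back in as the next prior:
m, v = 0.0, 10.0**2
for chunk in np.split(y, 3):
    m, v = update(m, v, chunk)

print(np.isclose(m, m_all), np.isclose(v, v_all))  # True True
```

The posterior is identical either way, which is the point: chunking the data buys you nothing from a Bayesian standpoint unless you discard information between chunks.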

Don’t say “improper prior.” Say “non-generative model.”

[cat picture]

In Bayesian Data Analysis, we write, “In general, we call a prior density p(θ) proper if it does not depend on data and integrates to 1.” This was a step forward from the usual understanding, which is that a prior density is improper if it has an infinite integral.

But I’m not so thrilled with the term “proper” because it has different meanings for different people.

Then the other day I heard Dan Simpson and Mike Betancourt talking about “non-generative models,” and I thought, Yes! this is the perfect term! First, it’s unambiguous: a non-generative model is a model for which it is not possible to generate data. Second, it makes use of the existing term, “generative model,” hence no need to define a new concept of “proper prior.” Third, it’s a statement about the model as a whole, not just the prior.

I’ll explore the idea of a generative or non-generative model through some examples:

Classical iid model, y_i ~ normal(theta, 1), for i=1,…,n. This is not generative because there’s no rule for generating theta.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with uniform prior density, p(theta) proportional to 1 on the real line. This is not generative because you can’t draw theta from a uniform on the real line.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with data-based prior, theta ~ normal(y_bar, 10), where y_bar is the sample mean of y_1,…,y_n. This model is not generative because to generate theta, you need to know y, but you can’t generate y until you know theta.

In contrast, consider a Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with non-data-based prior, theta ~ normal(0, 10). This is generative: you draw theta from the prior, then draw y given theta.
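In code, “generative” just means you can run the model forward as a simulation. Here is a sketch of that last model in Python (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Forward-simulate the generative model:
    theta ~ normal(0, 10), then y_i ~ normal(theta, 1)."""
    theta = rng.normal(0.0, 10.0)
    y = rng.normal(theta, 1.0, size=n)
    return theta, y

theta, y = simulate(100)
print(abs(y.mean() - theta))  # small: the sample mean tracks theta

# The non-generative variants break at the first line of simulate():
# there is no way to draw theta uniformly from the whole real line, and
# the data-based prior normal(y_bar, 10) needs y before theta exists.
```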

Some subtleties do arise. For example, we’re implicitly conditioning on n. For the model to be fully generative, we’d need a prior distribution for n as well.

Similarly, for a regression model to be fully generative, you need a prior distribution on x.

Non-generative models have their uses; we should just recognize when we’re using them. I think the traditional classification of priors, labeling them as improper if they have an infinite integral, does not capture the key aspects of the problem.

P.S. Also relevant is this comment, regarding some discussion of models for the n:

As in many problems, I think we get some clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis. So if we have a regression with outcomes y, predictors x, and sample size n, we can think of this as one of a larger class of problems, in which case it can make sense to think of n and x as varying across problems.

The issue is not so much whether n is a “random variable” in any particular study (although I will say that, in real studies, n typically is not precisely defined ahead of time, what with difficulties of recruitment, nonresponse, dropout, etc.) but rather that n can vary across the reference class of problems for which a model will be fit.

Where’d the $2500 come from?

Brad Buchsbaum writes:

Sometimes I read the New York Times “Well” articles on science and health. It’s a mixed bag, sometimes it’s quite good and sometimes not. I came across this yesterday:

What’s the Value of Exercise? $2,500

For people still struggling to make time for exercise, a new study offers a strong incentive: You’ll save $2,500 a year.

The savings, a result of reduced medical costs, don’t require much effort to accrue — just 30 minutes of walking five days a week is enough.

The findings come from an analysis of 26,239 men and women, published today in the Journal of the American Heart Association. . . .

I [Buchsbaum] thought: I wonder where the number came from? So I tracked down the paper referred to in the article (which was unhelpfully not linked or properly named).

I was horrified to find that the $2500 figure appears to be nowhere in the paper (see table 2). Moreover, the closest number I could find ($1900) was based on a regression model without covarying age, sex, ethnicity, income, or anything else. Of course older people exercise less and spend more on healthcare!

I sent the following email (see below) to the NYTimes author, but she has not responded.

At any rate, I thought this example of very high-profile science-blogging to be particularly egregious, so I thought I’d bring it to your attention.

The research article is Economic Impact of Moderate-Vigorous Physical Activity Among Those With and Without Established Cardiovascular Disease: 2012 Medical Expenditure Panel Survey, by Javier Valero-Elizondo, Joseph Salami, Chukwuemeka Osondu, Oluseye Ogunmoroti, Alejandro Arrieta, Erica Spatz, Adnan Younus, Jamal Rana, Salim Virani, Ron Blankstein, Michael Blaha, Emir Veledar, and Khurram Nasir.

And here’s Buchsbaum’s letter to Gretchen Reynolds, the author of that news article:

I very much enjoy your health articles for the New York Times. Sometimes I try and find the paper and examine the data, just for my own benefit.

After perusing the paper, I was not quite sure where the $2500 figure came from. In table 2 (see attached paper), the unadjusted expenditures are reported over all subjects.

non-optimal PA: $5397, optimal PA: $3443 for a difference of $1900.

This is close to $2500 but your number is higher.

However, remember, this is an *unadjusted model*. It does not account for age, sex, family income, race/ethnicity, insurance type, geographical location or comorbidity.

In other words, it’s a virtually useless model.

Let’s look at Model 3, which does account for the above factors.

non-optimal PA: $4867, optimal PA: $4153 for a difference of $714

So $714 is closer to the mark.

BUT, this includes ALL subjects, including those with cardiovascular disease (CVD).

If you look at people without CVD then the estimates depend on the cardiovascular risk profile (CRF). If you have an average or optimal profile then the difference is around $430 or $493. If you have a “poor” profile, then the difference is around $1060 (although the 95% confidence intervals overlapped, meaning the effect was not reliable).

What is my conclusion?

I’m afraid the title of your article is misleading since it is larger (by $600) than the $1900 estimate based on the meaningless unadjusted model! Even if the title was “What’s the Value of Exercise? $700”, it would still be misleading, because it implicitly assumes a causal relationship between exercise and expenditure.

Remember also that the adjusted variables are only the measures the authors happened to record. There are dozens of potentially other mediating variables which are related to both physical exercise and health expenditures. Including these other adjusting factors might further reduce the estimates.

Best Regards,

It’s just a news article so some oversimplification is perhaps unavoidable. But I do wonder where the $2500 number came from. I’m guessing it’s from some press release but I don’t know.

Also, I’m surprised the reporter didn’t respond to the email. But maybe New York Times reporters get too many emails to respond to, or even read. I should also emphasize that I did not read that news article or the scientific paper in detail, so I’m not endorsing (or disagreeing with) Buchsbaum’s claim. Here I’m just interested in the general challenge of tracking down numbers like that $2500 that have no apparent source.

Stan Weekly Roundup, 16 June 2017

We’re going to be providing weekly updates for what’s going on behind the scenes with Stan. Of course, it’s not really behind the scenes, because the relevant discussions are at

  • stan-dev GitHub organization: this is the home of all of our source repos; design discussions are on the Stan Wiki

  • Stan Discourse Groups: this is the home of our user and developer lists (they’re all open); feel free to join the discussion—we try to be friendly and helpful in our responses, and there is a lot of statistical and computational expertise in the wings from our users, who are increasingly joining the discussion. By the way, thanks for that—it takes a huge load off us to get great answers from users to other user questions. We’re up to about 15 active discussion threads a day (active topics in the last 24 hours include AR(K) models, web site reorganization, ragged arrays, order statistic priors, new R packages built on top of Stan, docker images for Stan on AWS, and many more!)

OK, let’s get started with the weekly review, though this is a special summer double issue, just like the New Yorker.

Your news here: If you have any Stan news you’d like to share, please let me know at (we’ll probably get a more standardized way to do this in the future).

New web site: Michael Betancourt redesigned the Stan web site; hopefully this will be easier to use. We’re no longer trying to track the literature. If you want to see the Stan literature in progress, do a search for “Stan Development Team” or “” on Google Scholar; we can’t keep up! Do let us know either in an issue on GitHub for the web site or in the user group on Discourse if you have comments or suggestions.

New user and developer lists: We’ve shuttered our Google group and moved to Discourse for both our user and developer lists (they’re consolidated now in categories on one list). It’s easy to sign up with GitHub or Google IDs and much easier to search and use online.
See Stan Discourse Groups and, for the old discussions, Stan’s shuttered Google group for users and Stan’s shuttered Google group for developers. We’re not removing any of the old content, but we are prohibiting new posts.

GPU support: Rok Cesnovar and Steve Bronder have been getting GPU support working for linear algebra operations. They’re starting with Cholesky decomposition because it’s a bottleneck for Gaussian process (GP) models and because it has the pleasant property of being quadratic in data and cubic in computation.
See math pull request 529
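For readers wondering why Cholesky is the target: fitting a GP means repeatedly factoring an n × n covariance matrix, which carries O(n²) data but costs O(n³) flops, exactly the ratio that keeps a GPU busy. A minimal CPU sketch of the operation being offloaded (a NumPy stand-in of my own, not the actual GPU code):

```python
import numpy as np

# Build an n x n squared-exponential GP covariance matrix (O(n^2) data)
# and Cholesky-factor it (O(n^3) flops).
n = 200
x = np.linspace(0.0, 10.0, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)  # RBF kernel
K += 1e-8 * np.eye(n)                            # jitter for stability

L = np.linalg.cholesky(K)                        # the cubic bottleneck

# L is lower-triangular and L @ L.T reconstructs K.
print(np.allclose(L @ L.T, K))
```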

Distributed computing support: Sebastian Weber is leading the charge into distributed computing using the MPI framework (multi-core or multi-machine) by essentially coding up map-reduce for derivatives inside of Stan. Together with GPU support, distributed computing of derivatives will give us a TensorFlow-like flexibility to accelerate computations. Sebastian’s also looking into parallelizing the internals of the Boost and CVODES ordinary differential equation (ODE) solvers using OpenCL.
See math issue 101 and math issue 551.

Logging framework: Daniel Lee added a logging framework to Stan to allow finer-grained control of Stan’s informational and error output.

Operands and partials: Sean Talts finished the refactor of our underlying operands and partials data structure, which makes it much simpler to write custom derivative functions

See pull request 547

Autodiff testing framework: Bob Carpenter finished the first use case for a generalized autodiff tester to test all of our higher-order autodiff thoroughly
See math pull request 562

C++11: We’re all working toward the 2.16 release, which will be our last release before we open the gates of C++11 (and some of C++14). This is going to make our code a whole lot easier to write and maintain, and will open up awesome possibilities like having closures to define lambdas within the Stan language, as well as consolidating many of our uses of Boost into the standard library.

Append arrays: Ben Bales added signatures for append_array, to work like our appends for vectors and matrices.
See pull request 554 and pull request 550

ODE system size checks: Sebastian Weber pushed a bug fix that cleans up ODE system size checks to avoid seg faults at run time.
See pull request 559

RNG consistency in transformed data: A while ago we replaced the generated-quantities-only nature of _rng functions by allowing them in transformed data (so you can fit fake data generated wholly within Stan or represent posterior uncertainty of some other process, allowing “cut”-like models to be formulated as a two-stage process); Mitzi Morris just cleaned these up so we use the same RNG seed for all chains so that we can perform convergence monitoring; multiple replications would then be done by running the whole multi-chain process multiple times.
See Stan pull request 2313

NSF Grant: CI-SUSTAIN: Stan for the Long Run: We (Bob Carpenter, Andrew Gelman, Michael Betancourt) were just awarded an NSF grant for Stan sustainability. This was a follow-on from the first Compute Resource Initiative (CRI) grant we got after building the system. Yea! This adds roughly a year of funding for the team at Columbia University. Our goal is to put in governance processes for sustaining the project as well as shore up all of our unit tests and documentation.

Hiring: We hired two full-time Stan staff at Columbia: Sean Talts joins as a developer and Breck Baldwin as business manager for the project. Sean had already been working as a contractor for us, hence all the pull requests. (Pro tip: The best way to get a foot in the door for an open-source project is to submit a useful pull request.)

SPEED: Parallelizing Stan using the Message Passing Interface (MPI)

Sebastian Weber writes:

Bayesian inference has to overcome tough computational challenges and thanks to Stan we now have a scalable MCMC sampler available. For a Stan model running NUTS, the computational cost is dominated by gradient calculations of the model log-density as a function of the parameters. While NUTS is scalable to huge parameter spaces, this scalability becomes more of a theoretical one as the computational cost explodes. Models which involve ordinary differential equations (ODE) are such an example, where the runtimes can be of the order of days.

The obvious speedup when using Stan is to run multiple chains at the same time on different computer cores. However, this cannot reduce the total runtime per chain, which requires within-chain parallelization.

Hence, a viable approach is to parallelize the gradient calculation within a chain. As many Bayesian models involve hierarchical structure over groupings, we can often calculate contributions to the log-likelihood separately for each of these groups.

Therefore, the concept of an embarrassingly parallel program can be applied in this setting, i.e. one can calculate these independent work chunks on separate CPU cores and then collect the results.
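The shape of the computation is a map-reduce: ship each group's parameters out, get back that group's log-likelihood contribution and its gradient, and sum the pieces. Here is a serial Python sketch of the idea (a toy model of my own; Stan's actual implementation does the map step over MPI processes with autodiff gradients):

```python
import numpy as np

# Toy hierarchical setting: J groups, each contributing an independent
# normal log-likelihood term for a shared parameter theta.
rng = np.random.default_rng(0)
J = 8
groups = [rng.normal(1.5, 1.0, size=100) for _ in range(J)]

def group_loglik_and_grad(y, theta):
    """One group's log-likelihood contribution (up to a constant) and
    its gradient in theta, for y_i ~ normal(theta, 1)."""
    resid = y - theta
    return -0.5 * np.sum(resid**2), np.sum(resid)

theta = 1.0

# "Map" step: each call is independent, so in the MPI setting each
# group could be shipped to a different core; here we map serially.
results = [group_loglik_and_grad(y, theta) for y in groups]

# "Reduce" step: the root process just sums the pieces.
loglik = sum(ll for ll, _ in results)
grad = sum(g for _, g in results)

# Same answer as doing everything on one core:
y_all = np.concatenate(groups)
ll_all, g_all = group_loglik_and_grad(y_all, theta)
print(np.isclose(loglik, ll_all), np.isclose(grad, g_all))  # True True
```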

For reasons implied by Stan’s internals (the gradient calculation must not run in a threaded program) we are restricted in applicable techniques. One possibility is the Message Passing Interface (MPI), which spawns multiple CPU cores by firing off independent processes. A root process sends packets of work (sets of parameters) to the child nodes, which do the work and then send back the results (function return values and the gradients). A first toy example (3 ODEs, 7 parameters) shows dramatic speedups: a single-core runtime of 5.2 hours drops to just 17 minutes on a single 20-core machine (an 18x speedup). MPI also scales across machines, and when throwing 40 cores at the problem we are down to 10 minutes, which is “only” a 31x speedup (see the above plot).

Of course, the MPI approach works best on clusters with many CPU cores. Overall, this is fantastic news for big models, as it opens the door to scaling out large problems onto clusters, which are available nowadays in many research facilities.

The source code for this prototype is in our GitHub repository. This code should be regarded as working research code, and we are currently working on bringing this feature into the main Stan distribution.

Wow. This is a big deal. There are lots of problems where this method will be useful.

P.S. What’s with the weird y-axis labels on that graph? I think it would work better to just go 1, 2, 4, 8, 16, 32 on both axes. I like the wall-time markings on the line, though; that helped me follow what was going on.

Pizzagate gets even more ridiculous: “Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature . . . in the later study they again found the exact opposite, but did not comment on the discrepancy.”


Several months ago, Jordan Anaya, Tim van der Zee, and Nick Brown reported that they’d uncovered 150 errors in 4 papers published by Brian Wansink, a Cornell University business school professor who describes himself as a “world-renowned eating behavior expert for over 25 years.”

150 errors is pretty bad! I make mistakes myself and some of them get published, but one could easily go through an entire career publishing fewer than 150 mistakes. So many in just four papers is kind of amazing.

After the Anaya et al. paper came out, people dug into other papers of Wansink and his collaborators and found lots more errors.

Wansink later released a press release pointing to a website which he said contained data and code from the 4 published papers.

In that press release he described his lab as doing “great work,” which seems kinda weird to me, given that their published papers are of such low quality. Usually we would think that if a lab does great work, this would show up in its publications, but this did not seem to have happened in this case.

In particular, even if the papers in question had no data-reporting errors at all, we would have no reason to believe any of the scientific claims that were made therein, as these claims were based on p-values computed from comparisons selected from uncontrolled and abundant researcher degrees of freedom. These papers are exercises in noise mining, not “great work” at all, not even good work, not even acceptable work.

The new paper

As noted above, Wansink shared a document that he said contained the data from those studies. In a new paper, Anaya, van der Zee, and Brown analyzed this new dataset. They report some mistakes they (Anaya et al.) had made in their earlier paper, and many places where Wansink’s papers misreported his data and data collection protocols.

Some examples:

All four articles claim the study was conducted over a 2-week period, however the senior author’s blog post described the study as taking one month (Wansink, 2016), the senior author told Retraction Watch it was a two-month study (McCook, 2017b), a news article indicated the study was at least 3 weeks long (Lazarz, 2007), and the data release states the study took place from October 18 to December 8, 2007 (Wansink and Payne, 2007). Why the articles claimed the study only took two weeks when all the other reports indicate otherwise is a mystery.

Furthermore, articles 1, 2, and 4 all claim that the study took place in spring. For the Northern Hemisphere spring is defined as the months March, April, and May. However, the news report was dated November 18, 2007, and the data release states the study took place between October and December.

And this:

Article 1 states that the diners were asked to estimate how much they ate, while Article 3 states that the amount of pizza and salad eaten was unobtrusively observed, going so far as to say that appropriate subtractions were made for uneaten pizza and salad. Adding to the confusion Article 2 states:
“Unfortunately, given the field setting, we were not able to accurately measure consumption of non-pizza food items.”

In Article 3 the tables included data for salad consumed, so this statement was clearly inaccurate.

And this:

Perhaps the most important question is why did this study take place? In the blog post the senior author did mention having a “Plan A” (Wansink, 2016), and in a Retraction Watch interview revealed that the original hypothesis was that people would eat more pizza if they paid more (McCook, 2017a). The origin of this “hypothesis” is likely a previous study from this lab, at a different pizza buffet, with nearly identical study design (Just and Wansink, 2011). In that study they found diners who paid more ate significantly more pizza, but the released data set for the present study actually suggests the opposite, that diners who paid less ate more. So was the goal of this study to replicate their earlier findings? And if so, did they find it concerning that not only did they not replicate their earlier result, but found the exact opposite? Did they not think this was worth reporting?
Another similarity between the two pizza studies is the focus on taste of the pizza. Article 1 specifically states:

“Our reading of the literature leads us to hypothesize that one would rate pizza from an $8 pizza buffet as tasting better than the same pizza at a $4 buffet.”

Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature, because in that paper they found ratings for overall taste, taste of first slice, and taste of last slice to all be higher in the lower price group, albeit with different levels of significance (Just and Wansink, 2011). However, in the later study they again found the exact opposite, but did not comment on the discrepancy.

Anaya et al. summarize:

Of course, there is a parsimonious explanation for these contradictory results in two apparently similar studies, namely that one or both sets of results are the consequence of modeling noise. Given the poor quality of the released data from the more recent articles . . . it seems quite likely that this is the correct explanation for the second set of studies, at least.

And this:

No good theory, no good data, no good statistics, no problem. Again, see here for the full story.

Not the worst of it

And, remember, those 4 pizzagate papers are not the worst things Wansink has published. They’re only the first four articles that anyone bothered to examine carefully enough to see all the data problems.

There was this example dug up by Nick Brown:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data; they seem consistent with someone making up numbers, not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.
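The check itself is simple to reproduce on any table of reported statistics: strip the final digit of each printed value and tally. A sketch (with made-up numbers, not Wansink's actual values):

```python
from collections import Counter

def last_digit_counts(reported):
    """Tally the final digit of each statistic, passed as strings
    exactly as printed in the table (e.g. "2.31" -> "1")."""
    return Counter(s[-1] for s in reported if s[-1].isdigit())

# Hypothetical table values for illustration:
table = ["2.31", "4.07", "1.98", "3.12", "2.25", "0.44", "5.01", "3.33"]
counts = last_digit_counts(table)
print(counts)

# In genuinely measured data the trailing digits should be roughly
# uniform over 0-9; a shortage of 0s and 5s is the fingerprint of
# someone avoiding round-looking numbers.
```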

And this discovery by Tim van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.
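One of the simplest tools behind this kind of detective work is the GRIM test of Brown and Heathers: for integer-valued responses such as a Likert scale, a reported mean times n must round-trip to an achievable integer total. A sketch (the numbers below are hypothetical, not taken from the papers above):

```python
def grim_consistent(mean, n, decimals=2):
    """GRIM test: can a mean reported to `decimals` places arise
    from n integer-valued responses (e.g. a Likert scale)?"""
    total = round(mean * n)                    # nearest integer total
    achievable = round(total / n, decimals)    # mean that total implies
    return achievable == round(mean, decimals)

# A mean of 3.48 is possible with n = 25 (87/25 = 3.48)...
print(grim_consistent(3.48, 25))   # True
# ...but not with n = 17: no integer total of 17 responses gives 3.48.
print(grim_consistent(3.48, 17))   # False
```

Reconstructing the full response distribution, as van der Zee did, takes more constraints (standard deviations, scale endpoints), but the flavor is the same: treat the reported summaries as equations and check whether any whole-number data could satisfy them.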


Sığırcı, Ö, Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7,

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 whilst also having experienced repeated heavy combat. Almost no soldiers could have had any other age than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.

There’s lots more at the link.

From the NIH guidelines on research misconduct:

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

Ride a Crooked Mile

Joachim Krueger writes:

As many of us rely (in part) on p values when trying to make sense of the data, I am sending a link to a paper Patrick Heck and I published in Frontiers in Psychology. The goal of this work is not to fan the flames of the already overheated debate, but to provide some estimates of what p can and cannot do. Statistical inference will always require experience and good judgment regardless of which school of thought (Bayesian, frequentist, or other) we are leaning toward.

I have three reactions.

1. I don’t think there’s any “overheated debate” about the p-value; it’s a method that has big problems and is part of the larger problem that is null hypothesis significance testing (see my article, The problems with p-values are not just with p-values); also p-values are widely misunderstood (see also here).

From a Bayesian point of view, p-values are most cleanly interpreted in the context of uniform prior distributions—but the setting of uniform priors, where there’s nothing special about zero, is the scenario where p-values are generally irrelevant.

So I don’t have much use for p-values. They still get used in practice—a lot—so there’s room for lots more articles explaining them to users, but I’m kinda tired of the topic.

2. I disagree with Krueger’s statement that “statistical inference will always require experience and good judgment.” For better or worse, lots of statistical inference is done using default methods by people with poor judgment and little if any relevant experience. Too bad, maybe, but that’s how it is.

Does statistical inference require experience and good judgment? No more than driving a car requires experience and good judgment. All you need is gas in the tank and the key in the ignition and you’re ready to go. The roads have all been paved and anyone can drive on them.

3. In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a p-value of 0.20 corresponds to a z-score of 1.28, and a p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
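The arithmetic in that paragraph is easy to reproduce; here is a sketch using the Python standard library (two-sided p-values assumed throughout):

```python
from statistics import NormalDist

def p_to_z(p):
    """z-score corresponding to a two-sided p-value."""
    return NormalDist().inv_cdf(1 - p / 2)

z20 = p_to_z(0.20)   # the "useless" p-value
z01 = p_to_z(0.01)   # the "strong evidence" p-value
print(round(z20, 2), round(z01, 2))   # 1.28 2.58

# The difference between the z-scores, and that difference divided by
# the standard error of a difference of two independent z-scores, sqrt(2):
diff = z01 - z20
print(round(diff, 2), round(diff / 2**0.5, 2))   # 1.29 0.92
```

So the gap between "useless" and "strong evidence" is itself less than one standard error from zero, which is Gelman and Stern's point.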

More generally I think that all the positive aspects of the p-value they discuss in their paper would be even more positive if researchers were to use the z-score and not ever bother with the misleading transformation into the so-called p-value. I’d much rather see people reporting z-scores of 1.5 or 2 or 2.5 than reporting p-values of 0.13, 0.05, and 0.01.
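The arithmetic above is easy to check. Here is a quick sketch using only Python’s standard library (two-sided tests assumed):

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+

nd = NormalDist()

def z_from_p(p):
    """Convert a two-sided p-value to the corresponding absolute z-score."""
    return nd.inv_cdf(1 - p / 2)

z_weak = z_from_p(0.20)    # about 1.28, the "useless" result
z_strong = z_from_p(0.01)  # about 2.58, the "strong evidence" result

# The difference between two independent z-scores has standard error sqrt(2),
# so the apparently huge gap between p=0.20 and p=0.01 is itself under 1 SE.
diff = z_strong - z_weak
print(diff / math.sqrt(2))  # roughly 0.9, nowhere near significant
```

This is the whole point in three lines: comparing the z-scores directly makes the instability of p-value comparisons obvious.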

Kaiser Fung’s data analysis bootcamp

Kaiser Fung announces a new educational venture he’s created: a bootcamp (a 12-week, full-time, in-person program) built from short courses, with the goal of getting people their first job in an analytics role for a business unit (not engineering or software development, so he is not competing directly with MS Data Science programs or data science bootcamps). The curriculum is deliberately designed to be broad but not deep.

I asked Kaiser if he had anything else he wanted to share, and he wrote:

I think our major differentiation from other bootcamps out there includes:

a. There are lots of jobs in these other business units outside engineering and software development. Hiring managers in marketing, operations, servicing, etc. are looking for the ability to interpret and reason with data, and use data to solve business problems. Our broad-based curriculum caters to this need.

b. I don’t believe that coding is the end-all of data science. Coding schools teach people how to code but knowing what to code is more important. Therefore, our curriculum covers R, Python, and machine learning but also statistical reasoning, survey design, Excel, intro to marketing, intro to finance, etc.

c. We provide quality through small class size, in-person instruction and instructors who are industry practitioners. The average instructor has 10 years of industry experience, and is in a director or higher level position. These instructors know what hiring managers want since they are hiring managers themselves.

d. We are building a diverse class. We take social scientists and designers as well as STEM people. We just require some exposure to programming concepts and data analysis, and a good college degree.

Statistical Challenges of Survey Sampling and Big Data (my remote talk in Bologna this Thurs, 15 June, 4:15pm)

Statistical Challenges of Survey Sampling and Big Data

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Big Data need Big Model. Big Data are typically convenience samples, not random samples; observational comparisons, not controlled experiments; available data, not measurements designed for a particular study. As a result, it is necessary to adjust to extrapolate from sample to population, to match treatment to control group, and to generalize from observations to underlying constructs of interest. Big Data + Big Model = expensive computation, especially given that we do not know the best model ahead of time and thus must typically fit many models to understand what can be learned from any given dataset. We discuss Bayesian methods for constructing, fitting, checking, and improving such models.

It’ll be at the 5th Italian Conference on Survey Methodology, at the Department of Statistical Sciences of the University of Bologna. A low-carbon remote talk.

Criminology corner: Type M error might explain Weisburd’s Paradox

[silly cartoon found by googling *cat burglar*]

Torbjørn Skardhamar, Mikko Aaltonen, and I wrote this article to appear in the Journal of Quantitative Criminology:

Simple calculations seem to show that larger studies should have higher statistical power, but empirical meta-analyses of published work in criminology have found zero or weak correlations between sample size and estimated statistical power. This is “Weisburd’s paradox” and has been attributed by Weisburd, Petrosino, and Mason (1993) to a difficulty in maintaining quality control as studies get larger, and attributed by Nelson, Wooditch, and Dario (2014) to a negative correlation between sample sizes and the underlying sizes of the effects being measured. We argue against the necessity of both these explanations, instead suggesting that the apparent Weisburd paradox might be explainable as an artifact of systematic overestimation inherent in post-hoc power calculations, a bias that is large with small N. Speaking more generally, we recommend abandoning the use of statistical power as a measure of the strength of a study, because implicit in the definition of power is the bad idea of statistical significance as a research goal.

I’d never heard of Weisburd’s paradox before writing this article. What happened is that the journal editors contacted me suggesting the topic; I read some of the literature and wrote my article; then some other journal editors didn’t think it was clear enough, so we found a couple of criminologists to coauthor the paper and add some context, eventually producing the final version linked here. I hope it’s helpful to researchers in that field and more generally. I expect that similar patterns hold with published data in other social science fields and in medical research too.
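The overestimation mechanism in the abstract is easy to demonstrate in simulation. A minimal sketch with my own illustrative numbers (not taken from the paper): assume a true effect of 0.2 standard deviations, condition on statistical significance the way publication does, and compare post-hoc power (computed from the estimated effect) with true power.

```python
import math
import random
from statistics import NormalDist

nd = NormalDist()

def power(effect, n):
    """True power of a two-sided z-test at alpha=0.05, known SD = 1."""
    se = 1 / math.sqrt(n)
    z = effect / se
    return 1 - nd.cdf(1.96 - z) + nd.cdf(-1.96 - z)

def mean_posthoc_power(true_effect, n, sims=20000, seed=1):
    """Average post-hoc power across simulated studies that reached significance.

    Each study estimates the effect with noise; post-hoc power plugs the
    (inflated, significance-filtered) estimate back into the power formula.
    """
    rng = random.Random(seed)
    se = 1 / math.sqrt(n)
    estimates = []
    for _ in range(sims):
        obs = rng.gauss(true_effect, se)
        if abs(obs / se) > 1.96:  # selection on statistical significance
            estimates.append(power(abs(obs), n))
    return sum(estimates) / len(estimates)
```

With these made-up numbers, true power is about 0.17 at n = 25 versus about 0.98 at n = 400, a large difference; but significance-filtered post-hoc power at n = 25 comes out above 0.5, so the apparent power-versus-sample-size relationship is badly flattened, which is the artifact the paper points to.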

PhD student fellowship opportunity! in Belgium! to work with us! on the multiverse and other projects on improving the reproducibility of psychological research!!!

[image of Jip and Janneke dancing with a cat]

Wolf Vanpaemel and Francis Tuerlinckx write:

We at the Quantitative Psychology and Individual Differences group, KU Leuven, Belgium, are looking for a PhD candidate. The goal of the PhD research is to develop and apply novel methodologies to increase the reproducibility of psychological science. More information can be found on the job website or by contacting us. The deadline for application is Monday, June 26, 2017.

One of the themes a successful candidate may work on is the further development of the multiverse. I expect to be an active collaborator in this work.

So please apply to this one. We’d like to get the best possible person to be working on this exciting project.

Why I’m not participating in the Transparent Psi Project

I received the following email from psychology researcher Zoltan Kekecs:

I would like to ask you to participate in the establishment of the expert consensus design of a large-scale, fully transparent replication of Bem’s (2011) ‘Feeling the future’ Experiment 1. Our initiative is called the ‘Transparent Psi Project’. Our aim is to develop a consensus design that is mutually acceptable to both psi-proponent and mainstream researchers, containing clear criteria for credibility.

I replied:

Thanks for the invitation. I am not so interested in this project because I think that all the preregistration in the world won’t solve the problem of small effect sizes and poor measurements. It is my impression from Bem’s work and others that the field of psi is plagued by noisy measurements and poorly specified theories. Sure, preregistration etc. would stop many of the problems–in particular, there’s no way that Bem would’ve seen 9 out of 9 statistically significant p-values, or whatever that was. But I can’t in good conscience recommend the spending of effort in this way. I think any serious work in this area would have to go beyond the phenomenological approach and perform more direct measurements, as for example here. I’ve not actually read the paper linked there so this may be a bad example, but the point is that one could possibly study such things scientifically with a physical model of the process. To just keep taking Bem-style measurements, though, I think that’s hopeless: it’s the kangaroo problem. Better to preregister than not, but better still not to waste time on this or similarly hopeless problems (studying sex ratios in samples of size 3000, estimating correlations of monthly cycle on political attitudes using between-person comparisons, power pose, etc.). I recognize that some of these ideas, ESP included, had some legitimate a priori plausibility, but, at this point, a Bem-style experiment seems like a shot in the dark. And, of course, even with preregistration, there’s a 5% chance you’ll see something statistically significant just by chance, leading to further confusion. In summary, preregistration and consensus help with the incentives, but all the incentives in the world are no substitute for good measurements. (See the discussion of “in many cases we are loath to recommend pre-registered replication” here.)

Kekecs wrote back:

Thank you for your feedback. We fully realize the problem posed by small effect sizes. However, this problem in itself can be solved simply by throwing a larger sample at it. In fact, based on our simulations, we plan to collect 14,000–60,000 data points (700–3,000 participants) using Bayesian analysis and optional stopping, aiming to reach a Bayes factor threshold of 60 or 1/60. Our simulations show that using these parameters we only have a p = 0.0004 false positive chance, so it is highly unlikely that we would accidentally generate more confusion in the field just by conducting the replication. On the contrary, by doing our study we will effectively more than double the total amount of data accumulated so far by Bem’s and other studies using this paradigm, which should help with clarity in the field by introducing good-quality, credible data.

You might be right, though, that the measurement itself is faulty, and that we cannot expect precognition to work in an environmentally invalid situation like this. But in reality we don’t have any information on how precognition should work if it really does exist, so I am not sure what would be a better way of measuring it than seeing how effective people are at predicting future events.

Our main goal here is not really to see whether precognition exists or not. The ultimate aim of our efforts is to do a proof-of-concept study where we will see whether it is possible to come to a consensus on criteria of acceptability and credibility in a field this divided, and to come up with ways in which we can negate all possibilities of questionable research practices. This approach can then be transferred to other fields as well.

I then responded:

I still think it’s hopeless. The problem (which I’ll say using generic units as I’m not familiar with the ESP experiment) is: suppose you have a huge sample size and can detect an effect of 0.003 (on some scale) with standard error 0.001. Statistically significant, preregistered, the whole deal. Fine. But then you could very well see an effect of -0.002 with different people, in a different setting. And -0.003 somewhere else. And 0.001 somewhere else. Etc. You’re talking about effects that are indistinguishable given various sources of leakage in the experiment.

I support your general goal but I recommend you choose a more promising topic than ESP or power pose or various other topics that get talked about so much.

Kekecs replied:

We are already committed to follow through with this particular setting. But I agree with you that our approach can be easily transferred to the research of other effects and we fully intend to do that.

If you put it that way, your question is all about construct validity: whether we can detect the effect that we really want to detect, or whether there are other confounds that bias the measurement. In this particular experimental setting, which is simple as stone (basically people are guessing about the outcomes of future coin flips), the types of bias that we can expect are more related to questionable research practices (QRPs) than anything else. The only way other types of bias, such as personal differences in ability (sampling bias), participant expectancy, demand characteristics, etc., can have an effect is if there is truly an anomalous effect. For example, if we detected an effect of 0.003 with 0.001 SE only because we accidentally sampled people with high psi abilities, our conclusion that there is a psi effect would still be true (although our effect size estimate would be slightly off).

That is why in this project we are focusing mainly on negating all possibilities of QRPs and on full transparency. I am not sure what other types of leakage we could have in this particular experiment if we addressed all possible QRPs. Would you care to elaborate?

I responded:

Just in answer to that last question: I’m not sure what other types of leakage might exist—it’s my impression that Bem’s experiments had various problems, so I guess it depends how exact a replication you’re talking about. My real point, though, is if we think ESP exists at all, then an effect that’s +0.003 on Monday and -0.002 on Tuesday and +0.001 on Wednesday probably isn’t so interesting. This becomes clearer if we move the domain away from possible null phenomena such as ESP or homeopathy, to things like social priming, which presumably has some effect, but which varies so much by person and by context to be generally unpredictable and indistinguishable from noise. I don’t think ESP is such a good model for psychology research because it’s one of the few things people study that really could be zero.

And then Kekecs closed out the discussion:

In response, I find doing this work in the field of ESP interesting exactly because the effect could potentially be zero. Positive findings have an overwhelming dominance in both the psi literature and the social sciences literature in general. In the case of most other social science research, it is a theoretical possibility (but unrealistic) that researchers just get lucky all the time and always ask the right questions, which is why they are so effective in finding positive effects. Obviously this cannot be true for the entirety of the literature, but for each topic studied individually it can be quite probable that there is an effect, if ever so small, which blurs the picture about publication bias and other types of bias in the literature. However, it may be that there is no ESP effect at all. In that case, we would have a field where the effect of bias in research can be studied in its purest form.

From another perspective, precognition in particular is a perfect research topic exactly because these designs by their nature are very well protected from the usual threats to internal validity, at least in the positive direction. It is hard to see what could make a person perform better at predicting the outcome of a state-of-the-art random number generator if there is no psi effect. Bias can always be introduced by different questionable research practices (QRPs), but if we are able to design a study completely immune to QRPs, there is no real possibility of bias toward a type I error. Of course, if the effect really exists, all the usual threats to validity can have an influence (for example, it is possible that people can get “psi fatigue” if they perform a lot of trials, or that events and contextual features, or even expectancy, can have an effect on performance), but we cannot make a type I error in that case, because the effect exists; we can only make errors in estimating the size of the effect, or a type II error.

So understanding what is underlying the dominance of positive effects in ESP research is very important. If there is no effect, psi literature can serve as a case study for bias in its purest form, which can help us understand it in other research fields. On the other hand, if we find an effect when all QRPs are controlled for, we may need to really rethink our current paradigm.

I continue to think that the study of ESP is irrelevant for psychology, both for substantive reasons—there is no serious underlying theory or clear evidence for ESP, it’s all just hope and intuition—and for methodological reasons, in that zero is a real possibility. In contrast, even silly topics such as power pose and embodied cognition seem to me to have some relevance to psychology and also involve the real challenge that there are no zeroes. Standing in an unusual position for two minutes will have some effect on your thinking and behavior; the debate is what are the consistent effects, if any. That’s my take, anyway; but I wanted to share Kekecs’s view too, given all the effort he’s putting into this project.