
What makes a mathematical formula beautiful?


Hiro Minato pointed me to this paper (hyped here) by Semir Zeki, John Romaya, Dionigi Benincasa, and Michael Atiyah, “The experience of mathematical beauty and its neural correlates.” The authors report:

We used functional magnetic resonance imaging (fMRI) to image the activity in the brains of 15 mathematicians when they viewed mathematical formulae which they had individually rated as beautiful, indifferent or ugly. Results showed that the experience of mathematical beauty correlates parametrically with activity in the same part of the emotional brain, namely field A1 of the medial orbito-frontal cortex (mOFC), as the experience of beauty derived from other sources.

I wrote that I looked at the paper and I don’t believe it!

Minato replied:

I think what they did wasn’t good enough to answer or even approach the question (scientifically or otherwise). . . . Meanwhile, someone can probably study sociology or culture of mathematicians to understand why mathematicians want to describe some “good” mathematics beautiful, elegant, etc.

I agree. Mathematical beauty is a fascinating topic; I just don’t think they’re going to learn much via MRI scans. It just seems like too crude a tool, kinda like writing a bunch of formulas on paper, feeding these sheets of paper to lab rats, and then performing a chemical analysis of the poop. The connection between input and output is just too noisy and indirect.

This seems like a problem that could use the collaboration of mathematicians, psychologists, and historians or sociologists. And just think of how much sociologist time you could afford, using the money you saved from not running the MRI machine!

Thanks, eBay!


Our recent Stan short course went very well, and we wanted to thank Julia Neznanova and Paul Burt of eBay NYC for giving us the space where we held the class.

More evidence that even top researchers routinely misinterpret p-values

Blake McShane writes:

I wanted to write to you about something related to your ongoing posts on replication in psychology as well as your recent post on the ASA statement on p-values. In addition to the many problems you and others have documented with the p-value as a measure of evidence (both those computed “honestly” and those computed after fishing, the garden of forking paths, etc.), another problem seems to be that academic researchers across the biomedical and social sciences genuinely interpret them quite poorly.

In a forthcoming paper, my colleague David Gal and I survey top academics across a wide variety of fields including the editorial board of Psychological Science and authors of papers published in the New England Journal of Medicine, the American Economic Review, and other top journals. We show:
[1] Researchers interpret p-values dichotomously (i.e., focus only on whether p is below or above 0.05).
[2] They fixate on them even when they are irrelevant (e.g., when asked about descriptive statistics).
[3] These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data.
We also show they ignore the magnitudes of effect sizes.

In case you have any interest, I am attaching the paper. Unfortunately, our data is presented in tabular format due to an insistent AE; thus, I am also attaching our supplement which presents the data graphically.

And here’s their key finding:

[screenshot of the key result]

Bad news. I’m sure researchers would also misinterpret Bayesian inferences to the extent that they are being used in a null hypothesis significance testing framework. But I think p-values are particularly bad: either they’re being used to make (generally meaningless) claims about hypotheses being true or false, or they’re being used as an indirect estimation tool, in which case they involve that horrible nonlinear transformation that makes them so hard to interpret (the difference between significant and non-significant not being itself statistically significant, and all that). This is part of the bigger problem that these numbers are just about impossible to interpret in the (real-world) scenario in which the null hypothesis is not precisely true.
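To illustrate the “difference between significant and non-significant” point with made-up numbers (a minimal sketch; the estimates and standard errors here are my own illustration, not from McShane and Gal's paper):

```python
from math import erf, sqrt

def two_sided_p(est, se):
    """Two-sided p-value for a normal-theory z-test of est against zero."""
    z = abs(est / se)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# Two estimates with the same standard error: one "significant," one not.
p1 = two_sided_p(25, 10)    # z = 2.5, p around 0.01
p2 = two_sided_p(10, 10)    # z = 1.0, p around 0.32

# But the difference between the two estimates is nowhere near significant:
p_diff = two_sided_p(25 - 10, sqrt(10**2 + 10**2))   # z around 1.06
print(p1, p2, p_diff)
```

The first estimate clears the 0.05 threshold and the second does not, yet the comparison between them is far from significant; a dichotomous reading of the two p-values hides that.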

Bayesian Inference with Stan for Pharmacometrics Class

Bob Carpenter, Daniel Lee, and Michael Betancourt will be teaching the 3-day class starting on 19 September in Paris. Following is the outline for the course:

Day 1

Introduction to Bayesian statistics

  • Likelihood / sampling distributions
  • Priors, Posteriors via Bayes’s rule
  • Posterior expectations and quantiles
  • Events as expectations of indicator functions

Introduction to Stan

  • Basic data types
  • Variable declarations
  • Constrained parameters and transforms to unconstrained
  • Program blocks and execution
  • Derived quantities
  • Built-in functions and operators
  • Statements: sampling, assignment, loops, conditionals, blocks
  • How to use Stan within R with RStan

Hands-on examples

Day 2

ODE and PK/PD Modeling

  • Parameters and data to ODEs
  • Non-stiff ODE solver
  • Stiff ODE solver
  • Control parameters and tolerances
  • Coupled ODE systems for sensitivities
  • Elimination half-lives

Inference with Markov chain Monte Carlo

  • Monte Carlo methods and plug-in inference
  • Markov chain Monte Carlo
  • Convergence diagnostics, R-hat, effective sample size
  • Effective sample size vs. number of iterations
  • Plug-in posterior expectations and quantiles
  • Event probability calculations

Hands-on examples

Day 3

Additional Topics in PK/PD Modeling

  • Bolus and infusion dosing
  • Lag time and absorption models
  • Linear versus Michaelis/Menten elimination
  • Hierarchical models for patient-level effects
  • Transit compartment models and time lags
  • Multi-compartment models and varying time scales
  • Joint PK/PD modeling: Bayes vs. “cut”
  • Meta-analysis
  • Formulating informative priors
  • Clinical trial simulations and power calculations

Stan programming techniques

  • Reproducible research practices
  • Probabilistic programming principles
  • Generated quantities for inference
  • Data simulation and model checking
  • Posterior predictive checks
  • Cross-validation and predictive calibration
  • Variable transforms for sampling efficiency
  • Multiple indexing and range slicing
  • Marginalizing discrete parameters
  • Handling missing data
  • Ragged and sparse data structures
  • Identifiability and problematic posteriors
  • Weakly informative priors

If you are in Europe in September, please come and join us. Thanks to Julie Bertrand and France Mentré from Université Paris Diderot for helping us organize the course.

You can register here.

Killer O


Taggert Brooks points to this excellent news article by George Johnson, who reports:

Epidemiologists have long been puzzled by a strange pattern in their data: People living at higher altitudes appear less likely to get lung cancer. . . . The higher you live, the thinner the air, so maybe oxygen is a cause of lung cancer. . . .

But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.

Back to the epidemiology. Researchers Kamen Simeonov and Daniel Himmelstein adjusted for a bunch of demographic and medical variables, and then:

After an examination of all these numbers for the residents of 260 counties in the Western United States, situated from sea level to nearly 11,400 feet, one pattern stood out: a correlation between the concentration of oxygen in the air and the incidence of lung cancer. For each 1,000-meter rise in elevation, there were 7.23 fewer lung cancer cases per 100,000 people.

“7.23” . . . that’s a bit overprecise, there’s no way you could know it to this level of accuracy. But I get the general idea.

As Brooks notes, this idea is not new. He links to a 1987 paper by Clarice Weinberg, Kenneth Brown, and David Hoel, who discussed “recent evidence implicating reactive forms of oxygen in carcinogenesis and atherosclerosis” and wrote that “reduced oxygen pressure of inspired air may be protective against certain causes of death.”

The idea has also hit the mass media. For example, from a 2012 article by Michael Corvinus in Cracked (yes, Cracked):

One of the disadvantages of living at higher altitudes is that there’s less oxygen in the air, which can suck for those with respiratory problems. One of the advantages of those places, however, is that … there’s less oxygen in the air. A lack of oxygen makes people’s bodies more efficient, which makes them live longer. . . . Dr. Benjamin Honigman at the University of Colorado School of Medicine theorized that the lower levels of oxygen force the body to become more efficient at distributing that oxygen, activating certain genes that enhance heart function and create new blood vessels for bringing blood to and from the heart, greatly lowering the chances of heart disease.

On deck this week

Mon: Killer O

Tues: More evidence that even top researchers routinely misinterpret p-values

Wed: What makes a mathematical formula beautiful?

Thurs: Fish cannot carry p-values

Fri: Does Benadryl make you senile? Challenges in research communication

Sat: What recommendations to give when a medical study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best)

Sun: Powerpose update

“Children seek historical traces of owned objects”

Recently in the sister blog:

An object’s mental representation includes not just visible attributes but also its nonvisible history. The present studies tested whether preschoolers seek subtle indicators of an object’s history, such as a mark acquired during its handling. Five studies with 169 children 3–5 years of age and 97 college students found that children (like adults) searched for concealed traces of object history, invisible traces of object history, and the absence of traces of object history, to successfully identify an owned object. Controls demonstrated that children (like adults) appropriately limit their search for hidden indicators when an owned object is visibly distinct. Altogether, these results demonstrate that concealed and invisible indicators of history are an important component of preschool children’s object concepts.

“The Dark Side of Power Posing”

Shravan points us to this post from Jay Van Bavel a couple years ago. It’s an interesting example because Van Bavel expresses skepticism about the “power pose” hype but he makes the same general mistake as Carney, Cuddy, Yap, and other researchers in this area in that he overreacts to every bit of noise that’s been p-hacked and published.

Here’s Van Bavel:

Some of the new studies used different analysis strategies than the original paper . . . but they did find that the effects of power posing were replicable, if troubling. People who assume high-power poses were more likely to steal money, cheat on a test and commit traffic violations in a driving simulation. In one study, they even took to the streets of New York City and found that automobiles with more expansive driver’s seats were more likely to be illegally parked. . . .

Dr. Brinol [sic] and his colleagues found that power posing increased self-confidence, but only among participants who already had positive self-thoughts. In contrast, power posing had exactly the opposite effect on people who had negative self-thoughts. . . .

In two studies, Joe Cesario and Melissa McDonald found that power poses only increased power when they were made in a context that indicated dominance. Whereas people who held a power pose while they imagined standing at an executive desk overlooking a worksite engaged in powerful behavior, those who held a power pose while they imagined being frisked by the police actually engaged in less powerful behavior. . . .

In a way I like all this because it shows how the capitalize-on-noise strategy which worked so well for the original power pose authors can also be used to dismantle the whole idea. So that’s cool. But from a scientific point of view, I think there’s so much noise here that any of these interactions could well go in the opposite direction. Not to mention all the unstudied interactions and all the interactions that happened not to be statistically significant in these particular small samples.

I’m not trying to slam Van Bavel here. The above-linked post was published in 2013, before we were all fully aware of how easy it was for researchers to get statistical significance from noise, even without having to try. Now we know better: just cos some correlation or interaction appears in a sample, we don’t have to think it represents anything in the larger population.

When do statistical rules affect drug approval?

Someone writes in:

I have MS and take a disease-modifying drug called Copaxone. Sandoz developed a generic version of Copaxone and filed for FDA approval. Teva, the manufacturer of Copaxone, filed a petition opposing that approval (surprise!). FDA rejected Teva’s petitions and approved the generic.

My insurance company encouraged me to switch to the generic. Specifically, they increased the copay for the non-generic from $50 to $950 per month. That got my attention. My neurologist recommended against switching to the generic.

Consequently, I decided to review the FDA decision to see if I could get any insight into the basis for my neurologist’s recommendation.

What appeared on first glance to be a telling criticism of the Teva submission was a reference by the FDA to “non-standard statistical criteria,” together with the FDA’s statement that reanalysis with standard practices found different results than those found by Teva. So I looked back at the Teva filing to identify the non-standard statistical criteria they used. If I found the right part of the Teva filing, they used R packages named ComBat and LIMMA, both empirical Bayes tools.

Now, it is possible that I have made a mistake and have not properly identified the statistical criteria that the FDA found wanting. I was unable to find any specific statement w.r.t. the “non-standard” statistics.

But, if empirical Bayes works better than older methods, then falling back to older methods would result in weaker inferences—and the rejection of the data from Teva.

It seems to me that this case raises interesting questions about the adoption and use of empirical Bayes. How should the FDA have treated the “non-standard statistical criteria”? More generally, is there a problem with getting regulatory agencies to accept Bayesian models? Maybe there is some issue here that would be appropriate for a masters student in public policy.

My correspondent included some relevant documentation:

The FDA docket files are available in docket FDA-2015-P-1050.

The text below is from the April 15, 2015 FDA Denial Letter to Teva (Citizen_Petition_Denial_Letter_From_CDER_to_Teva_Pharmaceuticals.pdf), at pp. 41-42:

Specifically, we concluded that the mouse splenocyte studies were poorly designed, contained a high level of residual batch bias, and used non-standard statistical criteria for assessing the presence of differentially expressed genes. When FDA reanalyzed the microarray data from one Teva study using industry standard practices and criteria, Copaxone and the comparator (Natco) product were found to have very similar effects on the efficacy-related pathways proposed for glatiramer acetate’s mechanism of action.

The image below is from the Teva Petition, July 2, 2014, at p. 60:


And he adds:

My interest in this topic arose only because of my MS treatment—I have had no contact with Teva, Sandoz, or the FDA. And I approve of the insurance company’s action—that is, I think that creating incentives to encourage consumers to switch to generic medicines is usually a good idea.

I have no knowledge of any of this stuff, but the interaction of statistics and policy seems generally relevant so I thought I would share this with all of you.

Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

The celebrated medical-research reformer has a new paper (sent to me by Keith O’Rourke; official published version here), where he writes:

As EBM [evidence-based medicine] became more influential, it was also hijacked to serve agendas different from what it originally aimed for. Influential randomized trials are largely done by and for the benefit of the industry. Meta-analyses and guidelines have become a factory, mostly also serving vested interests. National and federal research funds are funneled almost exclusively to research with little relevance to health outcomes. We have supported the growth of principal investigators who excel primarily as managers absorbing more money.

He continues:

Diagnosis and prognosis research and efforts to individualize treatment have fueled recurrent spurious promises. Risk factor epidemiology has excelled in salami-sliced data-dredged papers with gift authorship and has become adept to dictating policy from spurious evidence. Under market pressure, clinical medicine has been transformed to finance-based medicine. In many places, medicine and health care are wasting societal resources and becoming a threat to human well-being. Science denialism and quacks are also flourishing and leading more people astray in their life choices, including health.

And concludes:

EBM still remains an unmet goal, worthy to be attained.

Read the whole damn thing.

Going beyond confidence intervals

Anders Lamberg writes:

In an article by Tom Sigfried, Science News, July 3 2014, “Scientists’ grasp of confidence intervals doesn’t inspire confidence” you are cited: “Gelman himself makes the point most clearly, though, that a 95 percent probability that a confidence interval contains the mean refers to repeated sampling, not any one individual interval.”

I have some simple questions that I hope you can answer. I am not a statistician but a biologist, with only basic education in statistics. My company is working with surveillance of populations of salmon in Norwegian rivers and we have developed methods for counting all individuals in populations. We have moved from using estimates acquired from samples to actually counting all individuals in the populations. This is possible because the salmon migrate between the ocean and the rivers and often have to pass narrow parts of the rivers, where we use underwater video cameras to cover the whole cross section. In this way we “see” every individual and can categorize size, sex, etc. Another argument for counting all individuals is that our Atlantic salmon populations rarely exceed 3000 individuals (average of approx. 500), in contrast to Pacific salmon populations where numbers are more in the range of 100,000 to more than a million.

In Norway we also have a large salmon farming industry where salmon are held in net pens in the sea. The problem is that these fish, which have been artificially selected for over 10 generations, are a threat to the natural populations if they escape and breed with the wild salmon. There is a concern that the “natural gene pool” will be diluted. That was only background for my questions, although the nature of the statistical problem is general for all sampling.

Here is the statistical problem: In a breeding population of salmon in a river there may be escapees from the fish farms. It is important to know the proportion of farmed escapees. If it exceeds 5% in a given population, measures should be made to reduce the number of farmed salmon in that river. But how can we find the real proportion of farmed salmon in a river? The method used for over 30 years now is to sample approximately 60 salmon from each river and count how many wild and how many farmed salmon you got in that sample. The total population may be 3000 individuals in total.

Only one sample is taken. A point estimate is calculated, along with a confidence interval for that estimate. In one realistic example we may sample 60 salmon and find that 6 of them are farmed fish. That gives a point estimate of 10% farmed fish in the population of 3000 in that specific river. The 95% confidence interval will be from approximately 2% to 18%. Most commonly, only the point estimate is reported.

When I read your comment in the article cited in the start of this mail, I see that something must be wrong with this sampling procedure. Our confidence interval is linked to the sample and does not necessarily reflect the “real value” that we are interested in. As I see it now our point estimate acquired from only one sample does not give us much at all. We should have repeated the sampling procedure many times to get an estimate that is precise enough to say if we have passed the limit of 5% farmed fish in that population.

Can we use the one sample of 60 salmon in the example to say anything at all about the proportion of farmed salmon in that river? Can we use the point estimate 10%?

We have asked this question to the government, but they reply that it is more likely that the real value lies near the 10% point estimate, since the confidence interval has the shape of a normal distribution.

Is this correct?

As I see it the real value does not have to lie within the 95 % confidence interval at all. However, if we increase the sample size close to the population size, we will get a precise estimate. But, what happens when we use small samples and do not repeat?

My reply:

In this case, the confidence intervals seem reasonable enough (under the usual assumption that you are measuring a simple random sample). I suspect the real gains will come from combining estimates from different places and different times. A hierarchical model will allow you to do some smoothing.
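As a quick check on the numbers in the letter, the reported interval can be reproduced with the usual normal approximation (a sketch in Python rather than R; the 1.96 multiplier is the standard 95% normal quantile):

```python
from math import sqrt

n, k = 60, 6                     # fish sampled, farmed fish observed
p_hat = k / n                    # point estimate: 0.10
se = sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
# Prints roughly [2.4%, 17.6%], matching the "2% to 18%" in the letter.
print(f"estimate {p_hat:.0%}, 95% interval [{lo:.1%}, {hi:.1%}]")
```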

Here’s an example. Suppose you sample 60 salmon in the same place each year and the numbers of farmed fish you see are 7, 9, 7, 6, 5, 8, 7, 2, 8, 7, … These data are consistent with there being a constant proportion of 10% farmed fish (indeed, I created these particular numbers using rbinom(10,60,.1) in R). On the other hand, if the numbers you see are 8, 12, 9, 5, 3, 11, 8, 0, 11, 9, … then this is evidence for real fluctuations. And of course if you see a series such as 5, 0, 3, 8, 9, 11, 9, 12, …, this is evidence for a trend. So you’d want to go beyond confidence intervals to make use of all that information. There’s actually a lot of work done using Bayesian methods in fisheries which might be helpful here.
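A rough version of this comparison can be sketched in Python (my own illustration, not from the original post: the overdispersion ratio and the particular varying probabilities are assumptions I added to make the "real fluctuations" idea concrete):

```python
import random

random.seed(1)

def yearly_counts(probs, n=60):
    """Simulated farmed-fish counts: one sample of n fish per year."""
    return [sum(random.random() < p for _ in range(n)) for p in probs]

def dispersion_ratio(counts, n=60):
    """Observed variance of the counts divided by the binomial variance
    implied by the pooled rate; near 1 if the proportion is constant."""
    m = sum(counts) / len(counts)
    var = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
    return var / (n * (m / n) * (1 - m / n))

years = 200
constant = yearly_counts([0.10] * years)              # like rbinom(years, 60, .1)
varying = yearly_counts([0.05, 0.15] * (years // 2))  # real year-to-year swings

print(dispersion_ratio(constant))  # close to 1
print(dispersion_ratio(varying))   # well above 1
```

With only one sample per river you cannot make this comparison at all, which is why repeated sampling (or pooling across rivers in a hierarchical model) buys so much.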

Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

This article by Tanner Sorensen, Sven Hohenstein, and Shravan Vasishth might be of interest to some of you.

No, Google will not “sway the presidential election”

Grrr, this is annoying. A piece of exaggerated science reporting hit PPNAS and was promoted in Politico, then Kaiser Fung and I shot it down (“Could Google Rig the 2016 Election? Don’t Believe the Hype”) in our Daily Beast column last September.

Then it appeared again this week in a news article in the Christian Science Monitor.

I know Christian Scientists believe in a lot of goofy things but I didn’t know that they’d fall for silly psychology studies.

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim—so, in that sense, I can’t hope for much more. What I really think is that Rosen should’ve read what Kaiser and I wrote, realized our criticisms were valid, and then have not wasted time reporting on the silly claim based on a huge, unrealistic manipulation in a highly artificial setting. But that would’ve involved shelving a promising story idea, and what reporter wants to do that?

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim. So I can’t really get upset about the reporting: if the reporter is not an expert on politics, it can be hard for him to judge what to believe.

Nonetheless, even though it’s not really the reporter’s fault, the whole event saddens me, in that it illustrates how ridiculous hype pays off. The original researchers did a little study which has some value but then they hyped it well beyond any reasonable interpretation (as their results came from a huge, unrealistic manipulation in a highly artificial setting), resulting in a ridiculous claim that Google can sway the presidential election. The hypesters got rewarded for their hype with media coverage. Which of course motivates more hype in the future. It’s a moral hazard.

I talked about this general problem a couple years ago, under the heading, Selection bias in the reporting of shaky research. It goes like this. Someone does a silly study and hypes it up. Some reporters realize right away that it’s ridiculous, others ask around and learn that it makes no sense, and they don’t bother reporting on it. Other reporters don’t know any better—that’s just the way it is, nobody can be an expert on everything—and they report on it. Hence the selection bias: The skeptics don’t waste their time writing about a bogus or over-hyped study; the credulous do. The net result is that the hype continues.

P.S. I edited the above post (striking through some material and replacing with two new paragraphs) in response to comments.

Moving statistical theory from a “discovery” framework to a “measurement” framework

Avi Adler points to this post by Felix Schönbrodt on “What’s the probability that a significant p-value indicates a true effect?” I’m sympathetic to the goal of better understanding what’s in a p-value (see for example my paper with John Carlin on type M and type S errors) but I really don’t like the framing in terms of true and false effects, false positives and false negatives, etc. I work in social and environmental science. And in these fields it almost never makes sense to me to think about zero effects. Real-world effects vary, they can be difficult to measure, and statistical theory can be useful in quantifying available information—that I agree with. But I don’t get anything out of statements such as “Prob(effect is real | p-value is significant).”

This is not a particular dispute with Schönbrodt’s work; rather, it’s a more general problem I have with setting up the statistical inference problem in that way. I have a similar problem with “false discovery rate,” in that I don’t see inferences (“discoveries”) as being true or false. Just for example, does the notorious “power pose” paper represent a false discovery? In a way, sure, in that the researchers were way overstating their statistical evidence. But I think the true effect on power pose has to be highly variable, and I don’t see the benefit of trying to categorize it as true or false.

Another way to put it is that I prefer to think of statistics via a “measurement” paradigm rather than a “discovery” paradigm. Discoveries and anomalies do happen—that’s what model checking and exploratory data analysis are all about—but I don’t really get anything out of the whole true/false thing. Hence my preference for looking at type M and type S errors, which avoid having to worry about whether some effect is zero.
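For concreteness, here is a minimal Monte Carlo sketch of type S and type M errors, following the definitions in the Gelman and Carlin paper (the particular effect size and standard error are made-up numbers):

```python
import random

random.seed(0)

def type_s_m(true_effect, se, sims=100_000, z_crit=1.96):
    """Monte Carlo estimates of the type S rate (wrong sign) and the type M
    exaggeration factor among statistically significant estimates."""
    n_sig, sign_errors, exaggeration = 0, 0, 0.0
    for _ in range(sims):
        est = random.gauss(true_effect, se)
        if abs(est) > z_crit * se:              # reaches "significance"
            n_sig += 1
            sign_errors += est * true_effect < 0
            exaggeration += abs(est) / abs(true_effect)
    return sign_errors / n_sig, exaggeration / n_sig

# A small true effect measured noisily: se twice the size of the effect.
type_s, type_m = type_s_m(true_effect=1.0, se=2.0)
print(type_s, type_m)
```

In this regime a significant result has a sign-error rate near 10% and overstates the true effect several-fold, all without any effect being exactly zero.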

That all said, I know that many people like the true/false framework so you can feel free to follow the above link and see what Schönbrodt is doing.

On deck this week

Mon: Moving statistical theory from a “discovery” framework to a “measurement” framework

Tues: Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

Wed: Going beyond confidence intervals

Thurs: Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

Fri: What’s powdery and comes out of a metallic-green cardboard can?

Sat: “The Dark Side of Power Posing”

Sun: “Children seek historical traces of owned objects”

“Pointwise mutual information as test statistics”

Christian Bartels writes:

Most of us will probably agree that making good decisions under uncertainty based on limited data is highly important but remains challenging.

We have decision theory that provides a framework to reduce risks of decisions under uncertainty with typical frequentist test statistics being examples for controlling errors in absence of prior knowledge. This strong theoretical framework is mainly applicable to comparatively simple problems. For non-trivial models and/or if there is only limited data, it is often not clear how to use the decision theory framework.

In practice, careful iterative model building and checking seems to be the best what can be done – be it using Bayesian methods or applying “frequentist” approaches (here, in this particular context, “frequentist” seems often to be used as implying “based on minimization”).

As a hobby, I tried to expand the armory for decision making under uncertainty with complex models, focusing on trying to expand the reach of decision theoretic, frequentist methods. Perhaps at one point in the future, it will become possible to bridge the existing, good pragmatic approaches into the decision theoretical framework.

So far:

– I evaluated an efficient integration method for repeated evaluation of statistical integrals (e.g., p-values) for a set of hypotheses. Key to the method was the use of importance sampling. See here.

– I proposed pointwise mutual information as an efficient test statistic that is optimal under certain considerations. The commonly used alternative is the likelihood ratio test, which, in settings where asymptotics are not valid, is annoyingly inefficient since it requires repeated minimizations of randomly generated data.
Bartels, Christian (2015): Generic and consistent confidence and credible regions.
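As a rough illustration of the importance-sampling point above (my own sketch, not Bartels's actual method), a small tail probability of the kind that appears in p-value computations can be estimated cheaply by sampling from a distribution shifted into the tail:

```python
import random
from math import erf, exp, sqrt

random.seed(0)

def tail_prob_is(threshold, sims=50_000):
    """Importance-sampling estimate of P(X > threshold) for X ~ N(0, 1),
    drawing from N(threshold, 1) so the rare tail is sampled heavily."""
    total = 0.0
    for _ in range(sims):
        x = random.gauss(threshold, 1.0)
        if x > threshold:
            # weight = target density / proposal density (normalizers cancel)
            total += exp(-x**2 / 2) / exp(-(x - threshold)**2 / 2)
    return total / sims

exact = 1 - 0.5 * (1 + erf(4 / sqrt(2)))  # P(X > 4), about 3.2e-5
print(tail_prob_is(4.0), exact)
```

Naive simulation would need tens of millions of draws to see this event even a few hundred times; the shifted proposal gets a tight estimate from 50,000.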

More work is required, in particular:

– Dealing with nuisance parameters

– Including prior information.

Working on these aspects, I would appreciate feedback on what exists so far, in general, and on the proposal of using the pointwise mutual information as a test statistic, in particular.

I have nothing to add here. The topic is important so I thought this was worth sharing.

You can post social science papers on the new SocArxiv

I learned about it from this post by Elizabeth Popp Berman.

The temporary SocArxiv site is here. It is connected to the Open Science Framework, which we’ve heard a lot about in discussions of preregistration.

You can post your papers at SocArxiv right away following these easy steps:

Send an email to the following address(es) from the email account you would like used on the OSF:

For Preprints, email
The format of the email should be as follows:

  • Subject: preprint title
  • Message body: preprint abstract
  • Attachment: your preprint file (e.g., .docx, PDF, etc.)

It’s super-easy, actually much much easier than submitting to Arxiv. I assume that Arxiv has good reasons for its more elaborate submission process, but for now I found SocArxiv’s no-frills approach very pleasant.

I tried it out by sending a few papers, and it worked just fine. I’m already happy because I was able to upload my hilarious satire article with Jonathan Falk. (Here’s the relevant SocArxiv page.) When I tried to post that article on Arxiv last month, they rejected it as follows:

On Jun 16, 2016, at 12:17 PM, arXiv Moderation wrote:

Your submission has been removed. Our volunteer moderators determined that your article does not contain substantive research to merit inclusion within arXiv. Please note that our moderators are not referees and provide no reviews with such decisions. For in-depth reviews of your work you would have to seek feedback from another forum.

Please do not resubmit this paper without contacting arXiv moderation and obtaining a positive response. Resubmission of removed papers may result in the loss of your submission privileges.

For more information on our moderation policies see:

And the followup:

Dear Andrew Gelman,

Our moderators felt that a follow up should be made to point out arXiv only accepts articles that would be refereeable by a conventional publication venue. Submissions that contain inflammatory or fictitious content or that use highly dramatic and mis-representative titles/abstracts/introductions may be removed. Repeated submissions of inflammatory or highly dramatic content may result in the suspension of submission privileges.

This kind of annoyed me because the only reason my article with Falk would not be refereeable by a conventional publication venue is because of all our jokes. Had we played it straight and pretended we were doing real research, we could’ve had a good shot at Psych Science or PPNAS. So we were, in effect, penalized for our honesty in writing a satire rather than a hoax.

As my coauthor put it, the scary thing is how close our silly paper actually is to a publishable article, not how far.

Also, I can’t figure out how Arxiv’s rules were satisfied by this 2015 paper, “It’s a Trap: Emperor Palpatine’s Poison Pill,” which is more fictitious than ours, also includes silly footnotes, etc.

Anyway, I don’t begrudge Arxiv their gatekeeping. Arxiv is great great great, and I’m not at all complaining about their decision not to publish our funny article. Their site, their rules. Indeed, I wonder what will happen if someone decides to bomb SocArxiv with fake papers. At some point, a human will need to enter the loop, no?

For now, though, I think it’s great that there’s a place where everyone can post their social science papers.

Bigmilk strikes again


Paul Alper sends along this news article by Kevin Lomagino, Earle Holland, and Andrew Holtz on the dairy-related corruption in a University of Maryland research study on the benefits of chocolate milk (!).

The good news is that the university did not stand behind its ethically-challenged employee. Instead:

“I did not become aware of this study at all until after it had become a news story,” Patrick O’Shea, UMD’s Vice President and Chief Research Officer, said in a teleconference. He says he took a look at both the chocolate milk and concussions news release and an earlier one comparing the milk to sports recovery drinks. “My reaction was, ‘This just doesn’t seem right. I’m not sure what’s going on here, but this just doesn’t seem right.’”

Back when I was a student there, we called it UM. I wonder when they changed it to UMD?

Also this:

O’Shea said in a letter that the university would immediately take down the release from university websites, return some $200,000 in funds donated by dairy companies to the lab that conducted the study, and begin implementing some 15 recommendations that would bring the university’s procedures in line with accepted norms. . . .

Dr. Shim’s lab was the beneficiary of large donations from Allied Milk Foundation, which is associated with First Quarter Fresh, the company whose chocolate milk was being studied and favorably discussed in the UMD news release.

Also this from a review committee:

There are simply too many uncontrolled variables to produce meaningful scientific results.

Wow, I wonder what Harvard Business School would say about this, if this criterion were used to judge some of its most famous recent research.

And this:

The University of Maryland says it will never again issue a news release on a study that has not been peer reviewed.

That seems a bit much. I think peer review is overrated, and if a researcher has some great findings, sure, why not do the press release? The key is to have clear lines of responsibility. And I agree with the University of Maryland on this:

The report found that while the release was widely circulated prior to distribution, nobody knew for sure who had the final say over what it could claim. “There is no institutional protocol for approval of press releases and lines of authority are poorly defined,” according to the report. It found that Dr. Shim was given default authority over the news release text, and that he disregarded generally accepted standards as to when study results should be disseminated in news releases.

Now we often seem to have the worst of both worlds, with irresponsible researchers making extravagant and ill-founded claims and then egging on press agents to make even more extreme statements. Again, peer review has nothing to do with it. There is a problem with press releases that nobody is taking responsibility for.

One-day workshop on causal inference (NYC, Sat. 16 July)

James Savage is teaching a one-day workshop on causal inference this coming Saturday (16 July) in New York using rstanarm. Here's a link to the details:

Here’s the course outline:

How do prices affect sales? What is the uplift from a marketing decision? By how much will studying for an MBA affect my earnings? How much might an increase in minimum wages affect employment levels?

These are examples of causal questions. Sadly, they are the sorts of questions that data scientists’ run-of-the-mill predictive models can be ill-equipped to answer.

In this one-day course, we will cover methods for answering these questions, using easy-to-use Bayesian data analysis tools. The topics include:

– Why do experiments work? Understanding the Rubin causal model

– Regularized GLMs; bad controls; souping-up linear models to capture nonlinearities

– Using panel data to control for some types of unobserved confounding information

– ITT, natural experiments, and instrumental variables

– If we have time, using machine learning models for causal inference.

All work will be done in R, using the new rstanarm package.

Lunch, coffee, snacks and materials will be provided. Attendees should bring a laptop with R, RStudio and rstanarm already installed. A limited number of scholarships are available. The course is in no way affiliated with Columbia.
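As a taste of the first topic, here's a minimal simulation (my own toy example, not from the course materials) of why experiments work under the Rubin causal model: with potential outcomes fixed in advance, randomized assignment makes the difference in means unbiased for the average treatment effect, while self-selected assignment does not.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Potential outcomes: every unit has both y0 and y1;
# the true average treatment effect is 2 by construction.
ability = rng.normal(0, 1, n)               # unobserved confounder
y0 = 10 + 3 * ability + rng.normal(0, 1, n)
y1 = y0 + 2

# Confounded assignment: high-ability units select into treatment.
t_obs = (ability + rng.normal(0, 1, n) > 0).astype(int)
y_obs = np.where(t_obs == 1, y1, y0)
naive = y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean()

# Randomized assignment: treatment independent of potential outcomes.
t_rct = rng.integers(0, 2, n)
y_rct = np.where(t_rct == 1, y1, y0)
rct = y_rct[t_rct == 1].mean() - y_rct[t_rct == 0].mean()

print(f"naive (confounded): {naive:.2f}")  # biased well above 2
print(f"randomized:         {rct:.2f}")    # close to the true effect of 2
```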

Replin’ ain’t easy: My very first preregistration


I’m doing my first preregistered replication. And it’s a lot of work!

We’ve been discussing this for a while: here’s something I published in 2013 in response to proposals by James Monogan and by Macartan Humphreys, Raul Sanchez de la Sierra, and Peter van der Windt for preregistration in political science, and here’s a blog discussion (“Preregistration: what’s in it for you?”) from 2014.

Several months ago I decided I wanted to perform a preregistered replication of my 2013 AJPS paper with Yair on MRP. We found some interesting patterns of voting and turnout, but I was concerned that perhaps we were overinterpreting patterns from a single dataset. So we decided to re-fit our model to data from a different poll. That paper had analyzed the 2008 election using pre-election polls from Pew Research. The 2008 Annenberg pre-election poll was also available, so why not try that too?

Since we were going to do a replication anyway, why not preregister it? This wasn’t as easy as you might think. The first step was getting our model to fit with the old data; this was not completely trivial given changes in software, and we needed to tweak the model in some places. Having checked that we could successfully duplicate our old study, we then re-fit our model to two surveys from 2004. We then set up everything to run on Annenberg 2008. At this point we paused, wrote everything up, and submitted to a journal. We wanted to time-stamp the analysis, and it seemed worthwhile to do this in a formal journal setting so that others could see all the steps in one place. The paper (that is, the preregistration plan) was rejected by the AJPS. They suggested we send it to Political Analysis, but they ended up rejecting it too. Then we sent it to Statistics, Politics, and Policy, which agreed to publish the full paper: preregistration plan plus analysis.

But, before doing the analysis, I wanted to time-stamp the preregistration plan. I put the paper up on my website, but that’s not really preregistration. So then I tried Arxiv. That took a while too: at first they were thrown off by the paper being incomplete (by necessity, as we want to first publish the article with the plan but without the replication results). But they finally posted it.

The Arxiv post is our official announcement of preregistration. Now that it’s up, we (Rayleigh, Yair, and I) can run the analysis and write it up!

What have we learned?

Even before performing the replication analysis on the 2008 Annenberg data, this preregistration exercise has taught me some things:

1. The old analysis was no longer in runnable condition. Having restored it, we and others are now in a position to fit the model to other data much more directly.

2. There do seem to be some problems with our model in how it fits the data. To see this, compare Figure 1 to Figure 2 of our new paper. Figure 1 shows our model fit to the 2008 Pew data (essentially a duplication of Figure 2 of our 2013 paper), and Figure 2 shows this same model fit to the 2004 Annenberg data.

So, two changes: Pew vs. Annenberg, and 2008 vs. 2004. And the fitted models look qualitatively different. The graphs take up a lot of space, so I’ll just show you the results for a few states.

We’re plotting the probability of supporting the Republican candidate for president (among the supporters of one of the two major parties; that is, we’re plotting the estimates of R/(R+D)) as a function of respondent’s family income (divided into five categories). Within each state, we have two lines: the brown line shows estimated Republican support among white voters, and the black line shows estimated Republican support among all voters in the state. The y-axis goes from 0 to 100%.

From Figure 1:


From Figure 2:


You see that? The fitted lines are smoother in Figure 2 than in Figure 1, and they seem tied more closely to the data points. This appears to come from the raw data, which in Figure 2 are closer to clean monotonic patterns.

My first thought was that this was something to do with sample size. OK, that was my third thought. My first thought was that it was a bug in the code, and my second thought was that there was some problem with the coding of the income variable. But I don’t think it was any of these things. Annenberg 2004 had a larger sample than Pew 2008, so we re-fit to two random subsets of the Annenberg 2004 data, and the resulting graphs (not shown in the paper) look similar to Figure 2 above; they were still a lot smoother than Figure 1, which shows the results from Pew 2008.
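The logic of that subsampling check can be sketched generically (a toy stand-in with per-group means, not our actual MRP code): draw subsets of the large survey matched in size to the small one, re-fit, and see whether the fits stay close to the full-data fit or start looking like the other survey.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the model fit: per-group means instead of MRP.
def fit(y, g, n_groups=5):
    return np.array([y[g == k].mean() for k in range(n_groups)])

n_big, n_small = 50_000, 5_000       # "Annenberg-sized" vs "Pew-sized"
g = rng.integers(0, 5, n_big)
y = rng.normal(0.1 * g, 1.0)

full = fit(y, g)
diffs = []
for _ in range(3):
    idx = rng.choice(n_big, size=n_small, replace=False)
    diffs.append(np.abs(fit(y[idx], g[idx]) - full).max())

# If the subset fits still resemble the full fit (rather than the
# other survey's), sample size alone does not explain the discrepancy.
print([round(d, 3) for d in diffs])
```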

We discuss this at the end of Section 2 of our new paper and don’t come to any firm conclusions. We’ll see what turns up with the replication on Annenberg 2008.

Anyway, the point is:
– Replication is not so easy.
– We can learn even from setting up the replications.
– Published results (even from me!) are always only provisional and it makes sense to replicate on other data.