
“The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research”

Valentin Amrhein, Fränzi Korner-Nievergelt, and Tobias Roth write:

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process. We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. 
Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as ‘we need more stringent decision rules’, ‘sample sizes will decrease’ or ‘we need to get rid of p-values’.
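The "one in six" and "one third" figures in the abstract appear to follow from simple arithmetic on two independent studies of a true effect with equal power. Here is a minimal sketch of that calculation; the function names are illustrative, and treating the two studies as independent with identical power is a simplifying assumption.

```python
def prob_both_significant(power):
    """Chance that two independent studies of a true effect are both significant."""
    return power * power


def prob_conflicting(power):
    """Chance that exactly one of two independent studies is significant."""
    return 2 * power * (1 - power)


for power in (0.4, 0.8):
    print(
        f"power={power:.0%}: "
        f"both significant={prob_both_significant(power):.2f}, "
        f"conflicting={prob_conflicting(power):.2f}"
    )
```

At 40% power the chance that both studies are significant is 0.16, roughly one in six. At 80% power the chance of a conflict is 0.32, roughly one third.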

The general ideas should be familiar to regular readers of this blog.

Blue Cross Blue Shield Health Index

Chris Famighetti points us to this page which links to an interactive visualization.

There are some problems with the mapping software—when I clicked through, it showed a little map of the western part of the U.S., accompanied by huge swathes of Canada and the Pacific Ocean—and I haven’t taken a look at the methodology.

But in a way it doesn’t really matter. This sort of thing is great. If it’s solid work, that’s great already. If it’s flawed, it’s out there and people can make suggestions for improvement. So I hope this Blue Cross team gets lots of feedback and uses this to improve what they’ve got.

Prior choice recommendations wiki !

Here’s the wiki, and here’s the background:

Our statistical models are imperfect compared to the true data generating process and our complete state of knowledge (from an informational-Bayesian perspective) or the set of problems over which we wish to average our inferences (from a population-Bayesian or frequentist perspective).

The practical question here is what model to choose, or what class of models. Advice here is not typically so clear. We have a bunch of existing classes of models such as linear regression, logistic regression, along with newer things such as deep learning, and usual practice is to take a model that has been applied on similar problems and keep using it until it is obviously wrong. That’s not such a bad way to go; I’m just pointing out the informality of this aspect of model choice.

What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior which is supposed to give good answers for any possible parameter values (in general, feasible only in settings where data happen to be strongly informative about all parameters in your model).

We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).

I used to talk about these as weakly informative prior distributions (as in this paper from 2006 on hierarchical variance parameters and this from 2008 on logistic regression coefficients) but now I just call them “prior distributions.” In just about any real problem, there’s no fully informative prior and there’s no noninformative prior; every prior contains some information and does some regularization, while still leaving some information “on the table,” as it were.

That is, all priors are weakly informative; we now must figure out where to go from there. We should be more formal about the benefits of including prior information, and the costs of overconfidence (that is, including information that we don’t really have).

We (Dan Simpson, Michael Betancourt, Aki Vehtari, and various others) have been thinking about the problem in different ways involving theory, applications, and implementation in Stan.

For now, we have this wiki on Prior Choice Recommendations. I encourage you to take a look, and feel free to post your questions, thoughts, and suggested additions in comments below.

A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry?

Following a recent post that mentioned << le Sherlock Holmes de l'alimentation >> ("the Sherlock Holmes of food"), we got this blockbuster comment by Ujjval Vyas, which seemed worth its own post:

I work in an area where social psychology is considered the gold standard of research and thus the whole area is completely full of Wansink stuff (“people recover from surgery faster if they have a view of nature out the window”, “obesity and diabetes are caused by not enough access to nature for the poor”, biomimicry is a particularly egregious idea in this field). No one even knows how to really read any of the footnotes or cares, since it is all about confirmation bias and the primary professional organizations in the area directly encourage such lack of rigor. Obscure as it may sound, the whole area of “research” into architecture and design is full of this kind of thing. But the really odd part is that the field is made up of people who have no idea what a good study is or could be (architects, designers, interior designers, academic “researchers” at architecture schools or inside furniture manufacturers trying to sell more). They even now have groups that pursue “evidence-based healthcare design” which simply means that some study somewhere says what they need it to say. The field is at such a low level that it is not worth mentioning in many ways except that it is deeply embedded in a $1T industry for building and construction as well as codes and regulations based on this junk. Any idea of replication is simply beyond the kenning in this field because, as one of your other commenters put it, the publication is only a precursor to Ted talks and keynote addresses and sitting on officious committees to help change the world (while getting paid well). Sadly, as you and commenters have indicated, no one thinks they are doing anything wrong at all. I only add this comment to suggest that there are whole fields and sub-fields that suffer from the problems outlined here (much of this research would make Wansink look scrupulous).

Here’s the wikipedia page on Evidence-based design, including this chilling bit:

As EBD is supported by research, many healthcare organizations are adopting its principles with the guidance of evidence-based designers. The Center for Health Design and InformeDesign (a not-for-profit clearinghouse for design and human-behaviour research) have developed the Pebble Project, a joint research effort by CHD and selected healthcare providers on the effect of building environments on patients and staff.

The Evidence Based Design Accreditation and Certification (EDAC) program was introduced in 2009 by The Center for Health Design to provide internationally recognized accreditation and promote the use of EBD in healthcare building projects, making EBD an accepted and credible approach to improving healthcare outcomes. The EDAC identifies those experienced in EBD and teaches about the research process: identifying, hypothesizing, implementing, gathering and reporting data associated with a healthcare project.

Later on the page is a list of 10 strategies (1. Start with problems. 2. Use an integrated multidisciplinary approach with consistent senior involvement, ensuring that everyone with problem-solving tools is included. etc.). Each of these steps seems reasonable, but put them together and they do read like a recipe for taking hunches, ambitious ideas, and possible scams and making them look like science. So I’m concerned. Maybe it would make sense to collaborate with someone in the field of architecture and design and try to do something about this.

P.S. It might seem kinda mean for me to pick on these qualitative types for trying their best to achieve something comparable to quantitative rigor. But . . . if there are really billions of dollars at stake, we shouldn’t sit idly by. Also, I feel like Wansink-style pseudoscience can be destructive of qualitative expertise. I’d rather see some solid qualitative work than bogus number crunching.

“Data sleaze: Uber and beyond”

Interesting discussion from Kaiser Fung. I don’t have anything to add here; it’s just a good statistics topic.

Scroll through Kaiser’s blog for more:

Dispute over analysis of school quality and home prices shows social science is hard

My pre-existing United boycott, and some musing on randomness and fairness


Using prior knowledge in frequentist tests

Christian Bartels sent along this paper, which he described as an attempt to use informative priors for frequentist test statistics.

I replied:

I’ve not tried to follow the details but this reminds me of our paper on posterior predictive checks. People think of this as very Bayesian but my original idea when doing this research was to construct frequentist tests using Bayesian averaging in order to get p-values. This was motivated by a degrees-of-freedom-correction problem where the model had nonlinear constraints and so one could not simply do a classical correction based on counting parameters.

To which Bartels wrote:

Your work starts from the same point as mine: existing frequentist tests may be inadequate for the problem of interest. Your work also ends where I would like to end, performing tests via integration over (i.e., sampling of) parameters and future observations using likelihood and prior.

In addition, I try to anchor the approach in decision theory (as referenced in my write up). Perhaps this is too ambitious, we’ll see.

Results so far, using the language of your publication:
– The posterior distribution p(theta|y) is a good choice for the deviance D(y,theta). It gives optimal confidence intervals/sets in the sense proposed by Schafer, C.M. and Stark, P.B., 2009. Constructing confidence regions of optimal expected size. Journal of the American Statistical Association, 104(487), pp.1080-1089.
– Using informative priors for the deviance D(y,theta)=p(theta|y) may improve the quality of decisions, e.g., may improve the power of tests.
– For the marginalization, I find it difficult to strike the balance between proposing something that can be argued/shown to give optimal tests, and something that can be calculated with available computational resources. I hope to end up with something like one of the variants shown in your Figure 1.

I noted that you distinguish test statistics from deviances that do depend or do not depend on the parameter. I’m not aware of anything that prevents you from using deviances with a dependence on parameters for frequentist tests – it is just inconvenient, if you are after generic, closed form solutions for tests. I did not make this differentiation, and refer to tests independent of whether they depend on the parameters or not.

I don’t really have anything more to say here, as I have not thought about these issues for a while. But I thought Bartels’s paper and this discussion might interest some of you.

I hate R, volume 38942

R doesn’t allow block comments. You have to comment out each line, or you can encapsulate the block in if(0){} which is the world’s biggest hack. Grrrrr.

P.S. Just to clarify: I want block commenting not because I want to add long explanatory blocks of text to annotate my scripts. I want block commenting because I alter my scripts, and sometimes I want to comment out a block of code.

The next Lancet retraction? [“Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults”]

Someone who prefers to remain anonymous asks for my thoughts on this post by Michael Corrigan and Robert Whitaker, “Lancet Psychiatry Needs to Retract the ADHD-Enigma Study: Authors’ conclusion that individuals with ADHD have smaller brains is belied by their own data,” which begins:

Lancet Psychiatry, a UK-based medical journal, recently published a study titled Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults: A cross-sectional mega-analysis. According to the paper’s 82 authors, the study provides definitive evidence that individuals with ADHD have altered, smaller brains. But as the following detailed review reveals, the study does not come close to supporting such claims.

Below are tons of detail, so let me lead with my conclusion, which is that the criticisms coming from Corrigan and Whitaker seem reasonable to me. That is, based on my quick read, the 82 authors of that published paper seem to have made a big mistake in what they wrote.

I’d be interested to see if the authors have offered any reply to these criticisms. The article has just recently come out—the journal publication is dated April 2017—and I’d like to see what the authors have to say.

OK, on to the details. Here are Corrigan and Whitaker:

The study is beset by serious methodological shortcomings, missing data issues, and statistical reporting errors and omissions. The conclusion that individuals with ADHD have smaller brains is contradicted by the “effect-size” calculations that show individual brain volumes in the ADHD and control cohorts largely overlapped. . . .

Their results, the authors concluded, contained important messages for clinicians: “The data from our highly powered analysis confirm that patients with ADHD do have altered brains and therefore that ADHD is a disorder of the brain.” . . .

The press releases sent to the media reflected the conclusions in the paper, and the headlines reported by the media, in turn, accurately summed up the press releases. Here is a sampling of headlines:

Given the implications of this study’s claims, it deserves to be closely analyzed. Does the study support the conclusion that children and adults with ADHD have “altered brains,” as evidenced by smaller volumes in different regions of the brain? . . .

Alternative Headline: Large Study Finds Children with ADHD Have Higher IQs!

To discover this finding, you need to spend $31.50 to purchase the article, and then make a special request to Lancet Psychiatry to send you the appendix. Then you will discover, on pages 7 to 9 in the appendix, a “Table 2” that provides IQ scores for both the ADHD cohort and the controls.

Although there were 23 clinical sites in the study, only 20 reported comparative IQ data. In 16 of the 20, the ADHD cohort had higher IQs on average than the control group. In the other four clinics, the ADHD and control groups had the same average IQ (with the mean IQ scores for both groups within two points of each other.) Thus, at all 20 sites, the ADHD group had a mean IQ score that was equal to, or higher than, the mean IQ score for the control group. . . .

And why didn’t the authors discuss the IQ data in their paper, or utilize it in their analyses? . . . Indeed, if the IQ data had been promoted in the study’s abstract and to the media, the public would now be having a new discussion: Is it possible that children diagnosed with ADHD are more intelligent than average? . . .

They Did Not Find That Children Diagnosed with ADHD Have Smaller Brain Volumes . . .

For instance, the authors reported a Cohen’s d effect size of .19 for differences in the mean volume of the accumbens in children under 15. . . in this study, for youth under 15, it was the largest effect size of all the brain volume comparisons that were made. . . . Approximately 58% of the ADHD youth in this convenience sample had an accumbens volume below the average in the control group, while 42% of the ADHD youth had an accumbens volume above the average in the control group. Also, if you knew the accumbens volume of a child picked at random, you would have a 54% chance that you could correctly guess which of the two cohorts—ADHD or healthy control—the child belonged to. . . . The diagnostic value of an MRI brain scan, based on the findings in this study, would be of little more predictive value than the toss of a coin. . . .
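The 58% and 54% figures quoted above can be reproduced from the effect size alone. This is a rough sketch, not the critics' own calculation: it assumes both groups are normally distributed with equal variance and equal size, and that the guess uses a cutoff midway between the two group means. The function names are illustrative.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def share_below_control_mean(d):
    """Share of the lower-mean group that falls below the other group's mean."""
    return normal_cdf(abs(d))


def best_guess_accuracy(d):
    """Accuracy of guessing the group from one measurement (equal group sizes)."""
    return normal_cdf(abs(d) / 2)


d = 0.19  # Cohen's d for accumbens volume in children under 15, as quoted
print(round(share_below_control_mean(d), 2))
print(round(best_guess_accuracy(d), 2))
```

Under these assumptions the two printed values are 0.58 and 0.54, matching the quoted percentages.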

The authors reported that the “volumes of the accumbens, amygdala, caudate, hippocampus, putamen, and intracranial volume were smaller in individuals with ADHD compared with controls in the mega-analysis” (p. 1). If this is true, then smaller brain volumes should show up in the data from most, if not all, of the 21 sites that had a control group. But that was not the case. . . . The problem here is obvious. If authors are claiming that smaller brain regions are a defining “abnormality” of ADHD, then such differences should be consistently found in mean volumes of ADHD cohorts at all sites. The fact that there was such variation in mean volume data is one more reason to see the authors’ conclusions—that smaller brain volumes are a defining characteristic of ADHD—as unsupported by the data. . . .

And now here’s what the original paper said:

We aimed to investigate whether there are structural differences in children and adults with ADHD compared with those without this diagnosis. In this cross-sectional mega-analysis [sic; see P.P.S. below], we used the data from the international ENIGMA Working Group collaboration, which in the present analysis was frozen at Feb 8, 2015. Individual sites analysed structural T1-weighted MRI brain scans with harmonised protocols of individuals with ADHD compared with those who do not have this diagnosis. . . .

Our sample comprised 1713 participants with ADHD and 1529 controls from 23 sites . . . The volumes of the accumbens (Cohen’s d=–0·15), amygdala (d=–0·19), caudate (d=–0·11), hippocampus (d=–0·11), putamen (d=–0·14), and intracranial volume (d=–0·10) were smaller in individuals with ADHD compared with controls in the mega-analysis. There was no difference in volume size in the pallidum (p=0·95) and thalamus (p=0·39) between people with ADHD and controls.

The above demonstrates some forking paths, and there are a bunch more in the published paper, for example:

Exploratory lifespan modelling suggested a delay of maturation and a delay of degeneration, as effect sizes were highest in most subgroups of children (<15 years) versus adults (>21 years): in the accumbens (Cohen’s d=–0·19 vs –0·10), amygdala (d=–0·18 vs –0·14), caudate (d=–0·13 vs –0·07), hippocampus (d=–0·12 vs –0·06), putamen (d=–0·18 vs –0·08), and intracranial volume (d=–0·14 vs 0·01). There was no difference between children and adults for the pallidum (p=0·79) or thalamus (p=0·89). Case-control differences in adults were non-significant (all p>0·03). Psychostimulant medication use (all p>0·15) or symptom scores (all p>0·02) did not influence results, nor did the presence of comorbid psychiatric disorders (all p>0·5). . . .

Outliers were identified at above and below one and a half times the interquartile range per cohort and group (case and control) and were excluded . . . excluding collinearity of age, sex, and intracranial volume (variance inflation factor <1·2) . . . The model included diagnosis (case=1 and control=0) as a factor of interest, age, sex, and intracranial volume as fixed factors, and site as a random factor. In the analysis of intracranial volume, this variable was omitted as a covariate from the model. Handedness was added to the model to correct for possible effects of lateralisation, but was excluded from the model when there was no significant contribution of this factor. . . . stratified by age: in children aged 14 years or younger, adolescents aged 15–21 years, and adults aged 22 years and older. We removed samples that were left with ten patients or fewer because of the stratification. . . .

Forking paths are fine; I have forking paths in every analysis I’ve ever done. But forking paths render published p-values close to meaningless; in particular I have no reason to take seriously a statement such as, “p values were significant at the false discovery rate corrected threshold of p=0·0156,” from the summary of the paper.

So let’s forget about p-values and just look at the data graphs, which appear in the published paper:

Unfortunately these are not raw data or even raw averages for each age; instead they are “moving averages, corrected for age, sex, intracranial volume, and site for the subcortical volumes.” But we’ll take what we’ve got.

From the above graphs, it doesn’t seem like much of anything is going on: the blue and red lines cross all over the place! So now I don’t understand this summary graph from the paper:

I mean, sure, I see it for Accumbens, I guess, if you ignore the older people. But, for the others, the lines in the displayed age curves cross all over the place.

The article in question has the following list of authors: Martine Hoogman, Janita Bralten, Derrek P Hibar, Maarten Mennes, Marcel P Zwiers, Lizanne S J Schweren, Kimm J E van Hulzen, Sarah E Medland, Elena Shumskaya, Neda Jahanshad, Patrick de Zeeuw, Eszter Szekely, Gustavo Sudre, Thomas Wolfers, Alberdingk M H Onnink, Janneke T Dammers, Jeanette C Mostert, Yolanda Vives-Gilabert, Gregor Kohls, Eileen Oberwelland, Jochen Seitz, Martin Schulte-Rüther, Sara Ambrosino, Alysa E Doyle, Marie F Høvik, Margaretha Dramsdahl, Leanne Tamm, Theo G M van Erp, Anders Dale, Andrew Schork, Annette Conzelmann, Kathrin Zierhut, Ramona Baur, Hazel McCarthy, Yuliya N Yoncheva, Ana Cubillo, Kaylita Chantiluke, Mitul A Mehta, Yannis Paloyelis, Sarah Hohmann, Sarah Baumeister, Ivanei Bramati, Paulo Mattos, Fernanda Tovar-Moll, Pamela Douglas, Tobias Banaschewski, Daniel Brandeis, Jonna Kuntsi, Philip Asherson, Katya Rubia, Clare Kelly, Adriana Di Martino, Michael P Milham, Francisco X Castellanos, Thomas Frodl, Mariam Zentis, Klaus-Peter Lesch, Andreas Reif, Paul Pauli, Terry L Jernigan, Jan Haavik, Kerstin J Plessen, Astri J Lundervold, Kenneth Hugdahl, Larry J Seidman, Joseph Biederman, Nanda Rommelse, Dirk J Heslenfeld, Catharina A Hartman, Pieter J Hoekstra, Jaap Oosterlaan, Georg von Polier, Kerstin Konrad, Oscar Vilarroya, Josep Antoni Ramos-Quiroga, Joan Carles Soliva, Sarah Durston, Jan K Buitelaar, Stephen V Faraone, Philip Shaw, Paul M Thompson, Barbara Franke.

I also found a webpage for their research group, featuring this wonderful map:

The number of sites looks particularly impressive when you include each continent twice like that. But they should really do some studies in Antarctica, given how huge it appears to be!

P.S. Following the links, I see that Corrigan and Whitaker come into this with a particular view:

Mad in America’s mission is to serve as a catalyst for rethinking psychiatric care in the United States (and abroad). We believe that the current drug-based paradigm of care has failed our society, and that scientific research, as well as the lived experience of those who have been diagnosed with a psychiatric disorder, calls for profound change.

This does not mean that the critics are wrong (presumably the authors of the original paper came into their research with their own strong views); it can just be helpful to know where they’re coming from.

P.P.S. The paper discussed above uses the term “mega-analysis.” At first I thought this might be some sort of typo, but apparently the expression does exist and has been around for a while. From my quick search, it appears that the term was first used by James Dillon in a 1982 article, “Superanalysis,” in Evaluation News, where he defined mega-analysis as “a method for synthesizing the results of a series of meta-analyses.”

But in the current literature, “mega-analysis” seems to simply refer to a meta-analysis that uses the raw data from the original studies.

If so, I’m unhappy with the term “mega-analysis” because: (a) The “mega” seems a bit hypey, (b) What if the original studies are small? Then even all the data combined might not be so “mega”?, and (c) I don’t like the implication that plain old “meta-analysis” doesn’t use the raw data. I’m pretty sure that the vast majority of meta-analyses use only published summaries, but I’ve always thought of it as the preferred version of meta-analysis to use the original data.

I bring up this mega-analysis thing not as a criticism of the Hoogman et al. paper—they’re just using what appears to be a standard term in their field—but just as an interesting side-note.

P.P.P.S. The above post represents my current impression. As I wrote, I’d be interested to see the original authors’ reply to the criticism. Lancet does have a pretty bad reputation—it’s known for publishing flawed, sensationalist work—but I’m sure they run the occasional good article too. So I wouldn’t want to make any strong judgments in this case before hearing more.

P.P.P.P.S. Regarding the title of this post: No, I don’t think Lancet would ever retract this paper, even if all the above criticisms are correct. It seems that retraction is used only in response to scientific misconduct, not in response to mere error. So when I say “retraction,” I mean what one might call “conceptual retraction.” The real question is: Will this new paper join the list of past Lancet papers which we would not want to take seriously, and which we regret were ever published?

Stan in St. Louis this Friday

This Friday afternoon I (Jonah) will be speaking about Stan at Washington University in St. Louis. The talk is open to the public, so anyone in the St. Louis area who is interested in Stan is welcome to attend. Here are the details:

Title: Stan: A Software Ecosystem for Modern Bayesian Inference
Jonah Sol Gabry, Columbia University

Neuroimaging Informatics and Analysis Center (NIAC) Seminar Series
Friday April 28, 2017, 1:30-2:30pm
NIL Large Conference Room
#2311, 2nd Floor, East Imaging Bldg.
4525 Scott Avenue, St. Louis, MO


Stan without frontiers, Bayes without tears

This recent comment thread reminds me of a question that comes up from time to time, which is how to teach Bayesian statistics to students who aren’t comfortable with calculus. For continuous models, probabilities are integrals. And in just about every example except the one at 47:16 of this video, there are multiple parameters, so probabilities are multiple integrals.

So how to teach this to the vast majority of statistics users who can’t easily do multivariate calculus?

I dunno, but I don’t know that this has anything in particular to do with Bayes. Think about classical statistics, at least the sort that gets used in political science. Linear regression requires multivariate calculus too (or some pretty slick algebra or geometry) to get that least-squares solution. Not to mention the interpretation of the standard error. And then there’s logistic regression. Going further we move to popular machine learning methods which are really gonna seem like nothing more than black boxes. Kidz today all wanna do deep learning or random forests or whatever. And that’s fine. But no way are most of them learning the math behind it.

Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works.

So, in keeping with this attitude, teach Stan. Students set up the model, they push the button, they get the answers. No integrals required. Yes, you have to work with posterior simulations so there is integration implicitly—the conceptual load is not zero—but I think (hope?) that this approach of using simulations to manage uncertainty is easier and more direct than expressing everything in terms of integrals.
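To make the "simulations instead of integrals" point concrete, here is a small sketch in plain Python. It is not Stan code: the draws come from a hypothetical Beta(8, 4) posterior chosen for illustration, standing in for the posterior simulations a fitted model would return.

```python
import random

random.seed(1)

# Hypothetical posterior: Beta(8, 4), e.g. 7 successes and 3 failures
# under a uniform prior. A real workflow would use draws from the fitted model.
draws = [random.betavariate(8, 4) for _ in range(20000)]

# "What is the probability that theta exceeds 0.5?" becomes a counting question.
prob_above_half = sum(theta > 0.5 for theta in draws) / len(draws)

# A 90% interval is read off the sorted draws; no integral is written down.
ordered = sorted(draws)
interval = (ordered[int(0.05 * len(ordered))], ordered[int(0.95 * len(ordered))])

print(round(prob_above_half, 2), tuple(round(x, 2) for x in interval))
```

The student never sees the integral of the density over (0.5, 1). They count the share of draws above 0.5 and read quantiles off a sorted list.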

But it’s not just model fitting, it’s also model building and model checking. Cross validation, graphics, etc. You need less mathematical sophistication to evaluate a method than to construct it.

About ten years ago I wrote an article, “Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .” After briefly talking about a course that uses the BDA book and assumes that students know calculus, I continued:

My applied regression and multilevel modeling class has no derivatives and no integrals—it actually has less math than a standard regression class, since I also avoid matrix algebra as much as possible! What it does have is programming, and this is an area where many of the students need lots of practice. The course is Bayesian in that all inference is implicitly about the posterior distribution. There are no null hypotheses and alternative hypotheses, no Type 1 and Type 2 errors, no rejection regions and confidence coverage.

It’s my impression that most applied statistics classes don’t get into confidence coverage etc., but they can still mislead students by giving the impression that those classical principles are somehow fundamental. My class is different because I don’t pretend in that way. Instead I consider a Bayesian approach as foundational, and I teach students how to work with simulations.

My article continues:

Instead, the course is all about models, understanding the models, estimating parameters in the models, and making predictions. . . . Beyond programming and simulation, probably the Number 1 message I send in my applied statistics class is to focus on the deterministic part of the model rather than the error term. . . .

Even a simple model such as y = a + b*x + error is not so simple if x is not centered near zero. And then there are interaction models—these are incredibly important and so hard to understand until you’ve drawn some lines on paper. We draw lots of these lines, by hand and on the computer. I think of this as Bayesian as well: Bayesian inference is conditional on the model, so you have to understand what the model is saying.
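A quick illustration of why centering matters in y = a + b*x + error, using made-up data and a hand-rolled least-squares fit. The data values and the `fit_line` helper are assumptions for this sketch, not anything from the article.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up data: x is a calendar year, far from zero.
year = [2010, 2011, 2012, 2013, 2014]
y = [3.1, 3.9, 5.2, 5.8, 7.0]

a_raw, b_raw = fit_line(year, y)

# Center the predictor at its mean and refit.
mean_year = sum(year) / len(year)
year_centered = [yr - mean_year for yr in year]
a_centered, b_centered = fit_line(year_centered, y)

print(f"raw:      a={a_raw:.1f}, b={b_raw:.2f}")
print(f"centered: a={a_centered:.1f}, b={b_centered:.2f}")
```

The slope is the same in both fits. The raw intercept is the predicted y in year 0, a large negative number with no useful interpretation, while the centered intercept is the predicted y in the average year, which here equals the mean of y.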

The meta-hype algorithm


Kevin Lewis pointed me to this article:

There are several methods for building hype. The wealth of currently available public relations techniques usually forces the promoter to judge, a priori, what will likely be the best method. Meta-hype is a methodology that facilitates this decision by combining all identified hype algorithms pertinent for a particular promotion problem. Meta-hype generates a final press release that is at least as good as any of the other models considered for hyping the claim. The overarching aim of this work is to introduce meta-hype to analysts and practitioners. This work compares the performance of journal publication, preprints, blogs, twitter, Ted talks, NPR, and meta-hype to predict successful promotion. A nationwide database including 89,013 articles, tweets, and news stories. All algorithms were evaluated using the total publicity value (TPV) in a test sample that was not included in the training sample used to fit the prediction models. TPV for the models ranged between 0.693 and 0.720. Meta-hype was superior to all but one of the algorithms compared. An explanation of meta-hype steps is provided. Meta-hype is the first step in targeted hype, an analytic framework that yields double hyped promotion with fewer assumptions than the usual publicity methods. Different aspects of meta-hype depending on the context, its function within the targeted promotion framework, and the benefits of this methodology in the addiction to extreme claims are discussed.

I can’t seem to find the link right now, but you get the idea.

P.S. The above is a parody of the abstract of a recent article on “super learning” by Acion et al. I did not include a link because the parody was not supposed to be a criticism of the content of the paper in any way; I just thought some of the wording in the abstract was kinda funny. Indeed, I thought I’d disguised the abstract enough that no one would make the connection but I guess Google is more powerful than I’d realized.

But this discussion by Erin in the comment thread revealed that some people were taking this post in a way I had not intended. So I added this comment.

tl;dr: I’m not criticizing the content of the Acion et al. paper in any way, and the above post was not intended to be seen as such a criticism.

Would you prefer three N=300 studies or one N=900 study?

Stephen Martin started off with a question:

I’ve been thinking about this thought experiment:

Imagine you’re given two papers.
Both papers explore the same topic and use the same methodology. Both were preregistered.
Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).
Paper B has a novel study with confirmed hypotheses (n=900).
*Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

I’m reasonably certain the answer is that both papers provide the same amount of evidence, by essentially the likelihood principle, and if anything, one should trust the estimates of paper B more (unless you meta-analyzed paper A, which should give you the same answer as paper B, more or less).

However, my intuition was correct that most people in this group would choose paper A (See for poll results).

My reasoning is that if you are observing data from the same DGP, then where you cut the data off is arbitrary; why would flipping a coin 10x, 10x, 10x, 10x, 10x provide more evidence than flipping the coin 50x? The method in paper A essentially just collected 300, drew a line, collected 300, drew a line, then collected 300 more, and called them three studies; this has no more information in sum (in a fisherian sense, the information would just add together) than if you didn’t arbitrarily cut the data into sections.

If you read the comments of this group (which has researchers predominantly of the NHST world), one sees this fallacy that merely passing a threshold more times means you have more evidence. They use p*p*p to justify it (even though that doesn’t make sense, because one could partition the data into 10 n=90 sets and get ‘more evidence’ by this logic; in fact, you could have 90 p-values of ~.967262 and get a p-value of .05). They use fisher’s method to say the p-value could be low (~.006), even though when combined, the p-value would actually be even lower (~.0007). One employs only Neyman-Pearson logic, and this results in a t1 error probability of .05^3.
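The numbers in Martin's note can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: each of the three studies is taken to have a two-sided p-value of exactly .05, and "combined" is read as pooling three equal-sized studies with the same estimate, which amounts to Stouffer's z method with equal weights. The function names are illustrative.

```python
import math
from statistics import NormalDist


def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k df under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # half of the test statistic
    # Closed-form chi-square survival function, valid for even df = 2k.
    terms = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * terms


def stouffer_combined_p(pvalues):
    """Equal-weight Stouffer z for two-sided p-values with same-sign effects."""
    nd = NormalDist()
    zs = [nd.inv_cdf(1 - p / 2) for p in pvalues]
    z = sum(zs) / math.sqrt(len(zs))
    return 2 * (1 - nd.cdf(z))


three_studies = [0.05, 0.05, 0.05]

naive_product = math.prod(three_studies)        # the "p*p*p" logic
ninety_product = 0.967262 ** 90                 # 90 unimpressive p-values
fisher_p = fisher_combined_p(three_studies)     # Fisher's method
pooled_p = stouffer_combined_p(three_studies)   # pooling the three studies

print("p*p*p:", naive_product)
print("90 p-values of .967262 multiplied:", round(ninety_product, 4))
print("Fisher:", round(fisher_p, 4))
print("Pooled (Stouffer):", round(pooled_p, 5))
```

The product of 90 p-values near .97 lands at about .05, which is why multiplying p-values cannot be a measure of evidence: the product depends on how finely the data are sliced.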

I replied:

What do you mean by “confirmed hypotheses,” and what do you mean by a replication being “successful”? And are you assuming that the data are identical in the two scenarios?

To which Martin answered:

I [Martin], in a sense, left it ambiguous because I suspected that knowing nothing else, people would put paper A, even though asymptotically it should provide the same information as paper B.

I also left ‘confirmed hypothesis’ vague, because I didn’t want to say one must use one given framework. Basically, the hypotheses were supported by whatever method one uses to judge support (whether it be p-values, posteriors, bayes factors, whatever).

Successful replication as in, the hypotheses were supported again in the replication studies.

Finally, my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.

That said, if you are experimenter A gaining three n=300 samples, your data should asymptotically (or, over infinite datasets) equal that of experimenter B gaining one n=900 sample (over infinite datasets), in the sense that the expected total information is equal, and the accumulated evidence should be equal. Therefore, even if any given two papers have different datasets, asymptotically they should provide equal information, and there’s not a good reason to prefer three smaller studies over 1 larger one.

Yet, knowing nothing else, people assumed paper A, I think, because three studies is more intuitively appealing than one large study, even if the two could be interchangeable had you divided the larger sample into three, or combined the smaller samples into 1.
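Martin's claim that slicing the data is arbitrary can be illustrated with his own coin-flip example. The sketch below uses simulated flips (the true probability of 0.6 and the grid of candidate values are arbitrary assumptions) and shows that the log-likelihood from three batches of 300 sums to exactly the log-likelihood of the 900 flips taken together.

```python
import math
import random


def log_lik(theta, flips):
    """Bernoulli log-likelihood of a list of 0/1 flips at heads-probability theta."""
    heads = sum(flips)
    tails = len(flips) - heads
    return heads * math.log(theta) + tails * math.log(1 - theta)


random.seed(0)
# Hypothetical data: 900 flips of a coin with P(heads) = 0.6.
flips = [1 if random.random() < 0.6 else 0 for _ in range(900)]

# "Paper A": the same flips, with lines drawn after 300 and 600.
studies = [flips[0:300], flips[300:600], flips[600:900]]

thetas = [0.3, 0.5, 0.6, 0.7]
split_ll = [sum(log_lik(t, s) for s in studies) for t in thetas]
pooled_ll = [log_lik(t, flips) for t in thetas]

for t, a, b in zip(thetas, split_ll, pooled_ll):
    print(f"theta={t}: three studies {a:.3f}, one study {b:.3f}")
```

This only covers the case where the three batches really are draws from the same process with nothing selected on the results. It says nothing about what "confirmed" or "successful" mean in the two papers, which is the open question in the exchange.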

From my perspective, Martin’s question can’t really be answered because I don’t know what’s in papers A and B, and I don’t know what is meant by a replication being “successful.” I think the answer depends a lot on these pieces of information, and I’m still not quite sure what Martin’s getting at here. But maybe some of you have thoughts on this one.

Drug-funded profs push drugs

Someone who wishes to remain anonymous writes:

I just read a long ProPublica article that I think your blog commenters might be interested in. It’s from February, but was linked to by the Mad Biologist today. Here is a link to the article:

In short, it’s about a group of professors (mainly economists) who founded a consulting firm that works for many big pharma companies. They publish many peer-reviewed articles, op-eds, blogs, etc on the debate about high pharmaceutical prices, always coming to the conclusion that high prices are a net benefit (high prices -> more innovation -> better treatments in the future vs poor people having no access to existing treatment today). They also are at best very inconsistent about disclosing their affiliations and funding.

One minor thing that struck me is the following passage, about their response to a statistical criticism of one of their articles:

The founders of Precision Health Economics defended their use of survival rates in a published response to the Dartmouth study, writing that they “welcome robust scientific debate that moves forward our understanding of the world” but that the research by their critics had “moved the debate backward.”

The debate here appears to be about lead-time bias – increased screening leads to earlier detection which can increase survival rates without actually improving outcomes. So on the face it doesn’t seem like an outrageous criticism. If they have controlled it appropriately, they should have a “robust debate” so they can convince their critics and have more support for increasing drug prices! Of course I doubt they have any interest in actually having this debate. It seems similar to the responses you get from Wansink, Cuddy (or the countless other researchers promoting flawed studies who have been featured on your blog) when they are confronted with valid criticism: sound reasonable, do nothing, and get let off the hook.
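Lead-time bias is easy to see in a toy simulation. In the sketch below every number is hypothetical: patients become symptomatic between ages 60 and 70, survive an exponentially distributed time (mean 4 years) after that, and screening is assumed to move diagnosis 3 years earlier without changing when anyone dies.

```python
import random

random.seed(42)
N = 10_000
LEAD_TIME = 3.0  # hypothetical: screening detects the disease 3 years earlier

# Each patient: (age when symptoms would lead to diagnosis, age at death).
# The age at death does not depend on screening in any way.
patients = []
for _ in range(N):
    symptoms_age = random.uniform(60, 70)
    death_age = symptoms_age + random.expovariate(1 / 4)  # mean 4 years
    patients.append((symptoms_age, death_age))


def five_year_survival(lead_time):
    """Share of patients alive 5 years after diagnosis, given a detection lead time."""
    alive = sum(
        (death - (symptoms - lead_time)) >= 5 for symptoms, death in patients
    )
    return alive / N


surv_without_screening = five_year_survival(0.0)
surv_with_screening = five_year_survival(LEAD_TIME)
mean_death_age = sum(death for _, death in patients) / N

print("5-year survival without screening:", round(surv_without_screening, 3))
print("5-year survival with screening:   ", round(surv_with_screening, 3))
print("mean age at death (both cases):   ", round(mean_death_age, 2))
```

Under these assumptions the five-year survival rate roughly doubles (in theory from exp(-1.25) to exp(-0.5)), even though no patient lives a day longer. That is why a comparison based on survival from diagnosis needs to account for when diagnosis happens.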

This interests me because I consult for pharmaceutical companies. I don’t really have anything to add, but this sort of conflict of interest does seem like something to worry about.

We talk a lot on this blog about bad science that’s driven by some combination of careerism and naiveté. We shouldn’t forget about the possibility of flat-out corruption.

Journals for insignificant results

Tom Daula writes:

I know you’re not a fan of hypothesis testing, but the journals in this blog post are an interesting approach to the file drawer problem. I’ve never heard of them or their like. An alternative take (given academia standard practice) is “Journal for XYZ Discipline papers that p-hacking and forking paths could not save.”

Psychology: Journal of Articles in Support of the Null Hypothesis

Biomedicine: Journal of Negative Results in Biomedicine

Ecology and Evolutionary Biology: Journal of Negative Results

In psychology, this sort of journal isn’t really needed because we already have PPNAS, where they publish articles in support of the null hypothesis all the time, they just don’t realize it!

OK, ok, all jokes aside, the above post recommends:

Is it time for Economics to catch up? . . . a number of prominent Economists have endorsed this idea (even if they are not ready to pioneer the initiative). So, imagine… a call for papers along the following lines:

Series of Unsurprising Results in Economics (SURE)

Is the topic of your paper interesting, your analysis carefully done, but your results are not “sexy”? If so, please consider submitting your paper to SURE. An e-journal of high-quality research with “unsurprising” findings.
How does it work:
— We accept papers from all fields of Economics…
— Which have been rejected at a journal indexed in EconLit…
— With the ONLY important reason being that their results are statistically insignificant or otherwise “unsurprising”.

I can’t imagine this working. Why not just publish everything on SSRN or whatever, and then this SURE can just link to the articles in question (along with the offending referee reports)?

Also, I’m reminded of the magazine McSweeney’s, which someone once told me had been founded based on the principle of publishing stories that had been rejected elsewhere.

Teaching Statistics: A Bag of Tricks (second edition)

Hey! Deb Nolan and I finished the second edition of our book, Teaching Statistics: A Bag of Tricks. You can pre-order it here.

I love love love this book. As William Goldman would say, it’s the “good parts version”: all the fun stuff without the standard boring examples (counting colors of M&M’s, etc.). Great stuff for teaching; also, I’ve been told that it’s a fun read for students of statistics.

Here’s the table of contents. If this doesn’t look like fun to you, don’t buy the book.

Representists versus Propertyists: RabbitDucks – being good for what?


It is not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability, etc.) for every sample, or for some samples, or the same distribution of outputs, or the same expectations of outputs, or just close enough expectations of outputs. Then, I would argue, one has a variation on a DuckRabbit. In the DuckRabbit, the same sign represents different objects with different interpretations (what to make of it), whereas here we have differing signs (models) representing the same object (an inference of interest) with different interpretations (what to make of them). I will imaginatively call this a RabbitDuck ;-)

Does one always choose a Rabbit or a Duck, or sometimes one or another or always both? I would argue the higher road is both – that is to use differing models to collect and consider the different interpretations. Multiple perspectives can always be more informative (if properly processed), increasing our hopes to find out how things actually are by increasing the chances and rate of getting less wrong. Though this getting less wrong is in expectation only – it really is an uncertain world.

Of course, in statistics a good guess for the Rabbit interpretation would be Bayesian and for the Duck, Frequentest (Canadian spelling). Though, as one of Andrew’s colleagues once argued, it is really modellers versus non modellers rather than Bayesians versus Frequentests, and that makes a lot of sense to me. Representists are Rabbits “conjecturing, assessing, and adopting idealized representations of reality, predominantly using probability generating models for both parameters and data” while Propertyists are Ducks “primarily being about discerning procedures with good properties that are uniform over a wide range of possible underlying realities and restricting use, especially in science, to just those procedures” from here. Given that “idealized representations of reality” can only be indirectly checked (i.e. always remain possibly wrong) and “good properties” always beg the question “good for what?” (as well as only hold over a range of possible but largely unrepresented realities) – it should be a no-brainer(?) that it would be more profitable than not to thoroughly think through both perspectives (and more, actually).

An alternative view might be Leo Breiman’s “two cultures” paper.

This issue of multiple perspectives also came up in Bob’s recent post where the possibility arose that some might think it taboo to mix Bayes and Frequentist perspectives.

Some case studies would be:  Continue reading ‘Representists versus Propertyists: RabbitDucks – being good for what?’ »

My proposal for JASA: “Journal” = review reports + editors’ recommendations + links to the original paper and updates + post-publication comments

Whenever they’ve asked me to edit a statistics journal, I say no thank you because I think I can make more of a contribution through this blog. I’ve said no enough times that they’ve stopped asking me. But I’ve had an idea for a while and now I want to do it.

I think that journals should get out of the publication business and recognize that their goal is curation. My preferred model is that everything gets published on some sort of super-Arxiv, and then the role of an organization such as the Journal of the American Statistical Association is to pick papers to review and to recommend. The “journal” is then the review reports plus the editors’ recommendations plus links to the original paper and any updates plus post-publication comments.

If JASA is interested in going this route, I’m in.

My talk this Friday in the Machine Learning in Finance workshop

This is kinda weird because I don’t know anything about machine learning in finance. I guess the assumption is that statistical ideas are not domain specific. Anyway, here it is:

What can we learn from data?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

The standard framework for statistical inference leads to estimates that are horribly biased and noisy for many important examples. And these problems are all even worse as we study subtle and interesting new questions. Methods such as significance testing are intended to protect us from hasty conclusions, but they have backfired: over and over again, people think they have learned from data but they have not. How can we have any confidence in what we think we’ve learned from data? One appealing strategy is replication and external validation but this can be difficult in the real world of social science. We discuss statistical methods for actually learning from data without getting fooled.

Reputational incentives and post-publication review: two (partial) solutions to the misinformation problem

So. There are erroneous analyses published in scientific journals and in the news. Here I’m talking not about outright propaganda, but about mistakes that happen to coincide with the preconceptions of their authors.

We’ve seen lots of examples. Here are just a few:

– Political scientist Larry Bartels is committed to a model of politics in which voters make decisions based on irrelevant information. He’s published claims about shark attacks deciding elections and subliminal smiley faces determining attitudes about immigration. In both cases, second looks by others showed that the evidence wasn’t really there. I think Bartels was sincere; he just did sloppy analyses—statistics is hard!—and jumped to conclusions that supported his existing views.

– New York Times columnist David Brooks has a habit of citing statistics that fall apart under closer inspection. I think Brooks believes these things when he writes them—OK, I guess he never really believed that Red Lobster thing, he must really have been lying exercising poetic license on that one—but what’s important is that these stories work to make his political points, and he doesn’t care when they’re proved wrong.

– David’s namesake and fellow NYT op-ed columnist Arthur Brooks stepped in it once or twice when reporting survey data. He wrote that Tea Party supporters were happier than other voters, but a careful look at the data suggested the opposite. A. Brooks’s conclusions were counterintuitive and supported his political views; they just didn’t happen to line up with reality.

– The familiar menagerie from the published literature in social and behavioral sciences: himmicanes, air rage, ESP, ages ending in 9, power pose, pizzagate, ovulation and voting, ovulation and clothing, beauty and sex ratio, fat arms and voting, etc etc.

– Gregg Easterbrook writing about politics.

And . . . we have a new one. A colleague emailed me expressing annoyance at a recent NYT op-ed by historian Stephanie Coontz entitled, “Do Millennial Men Want Stay-at-Home Wives?”

Emily Beam does the garbage collection. The short answer is that, no, there’s no evidence that millennial men want stay-at-home wives. Here’s Beam:

You can’t say a lot about millennials based on talking to 66 men.

The GSS surveys are pretty small – about 2,000-3,000 per wave – so once you split by sample, and then split by age, and then exclude the older millennials (age 26-34) who don’t show any negative trend in gender equality, you’re left with cells of about 60-100 men ages 18-25 per wave. . . .

Suppose you want to know whether there is a downward trend in young male disagreement with the women-in-the-kitchen statement. Using all available GSS data, there is a positive, not statistically significant trend in men’s attitudes (more disagreement). Starting in 1988 only, there is a very, very small negative, not statistically significant effect.

Only if we pick 1994 as a starting point, as Coontz does, ignoring the dip just a few years prior, do we see a negative less-than half-percentage point drop in disagreement per year, significant at the 10-percent level.
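To get a feel for how little 60 to 100 respondents per cell can pin down, here is a rough back-of-the-envelope sketch. It assumes simple random sampling (the actual GSS design is more complicated) and a worst-case proportion of 0.5; the function names are illustrative.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n people."""
    return z * math.sqrt(p * (1 - p) / n)


def margin_of_difference(p1, n1, p2, n2, z=1.96):
    """Approximate 95% margin of error for the difference of two independent proportions."""
    return z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


moe_60 = margin_of_error(0.5, 60)
moe_66 = margin_of_error(0.5, 66)
moe_100 = margin_of_error(0.5, 100)
moe_diff_66 = margin_of_difference(0.5, 66, 0.5, 66)

for label, value in [
    ("n = 60", moe_60),
    ("n = 66", moe_66),
    ("n = 100", moe_100),
    ("difference between two waves of n = 66", moe_diff_66),
]:
    print(f"{label}: +/- {100 * value:.1f} percentage points")
```

With cells this small, a single wave is uncertain by roughly 10 to 13 percentage points, and a wave-to-wave comparison by more than that, so a shift of a fraction of a percentage point per year is hard to distinguish from noise.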

To Coontz’s (or the NYT’s) credit, they followed up with a correction, but it’s half-assed:

The trend still confirms a rise in traditionalism among high school seniors and 18-to-25-year-olds, but the new data shows that this rise is no longer driven mainly by young men, as it was in the General Social Survey results from 1994 through 2014.

And at this point I have no reason to believe anything that Coontz says on this topic, any more than I’d trust what David Brooks has to say about high school test scores or the price of dinner at Red Lobster, or Arthur Brooks on happiness measurements, or Susan Fiske on himmicanes, power pose, and air rage. All these people made natural mistakes but then were overcommitted, in part I suspect because the mistaken analyses support what they’d like to think is true.

But it’s good enough for the New York Times, or PPNAS, right?

The question is, what to do about it. Peer review can’t be the solution: for scientific journals, the problem with peer review is the peers, and when it comes to articles in the newspaper, there’s no way to do systematic review. The NYT can’t very well send all their demography op-eds to Emily Beam and Jay Livingston, after all. Actually, maybe they could—it’s not like they publish so many op-eds on the topic—but I don’t think this is going to happen.

So here are two solutions:

1. Reputational incentives. Make people own their errors. It’s sometimes considered rude to do this, to remind people that Satoshi Kanazawa Satoshi Kanazawa Satoshi Kanazawa published a series of papers that were dead on arrival because the random variation in his data was so much larger than any possible signal. Or to remind people that Amy Cuddy Amy Cuddy Amy Cuddy still goes around promoting power pose even though the first author on that paper had disowned the entire thing. Or that John Bargh John Bargh John Bargh made a career out of a mistake and now refuses to admit his findings didn’t replicate. Or that David Brooks David Brooks David Brooks reports false numbers and then refuses to correct them. Or that Stephanie Coontz Stephanie Coontz Stephanie Coontz jumped to conclusions based on a sloppy reading of trends from a survey.

But . . . maybe we need these negative incentives. If there’s a positive incentive for getting your name out there, there should be a negative incentive for getting it wrong. I’m not saying the positive and negative incentives should be equal, just that there should be more of a motivation for people to check what they’re doing.

And, yes, don’t keep it a secret that I published a false theorem once, and, another time, had to retract the entire empirical section of a published paper because we’d reverse-coded a key variable in our analysis.

2. Post-publication review.

I’ve talked about this one before. Do it for real, in scientific journals and also the newspapers. Correct your errors. And, when you do so, link to the people who did the better analyses.

Incentives and post-publication review go together. To the extent that David Brooks is known as the guy who reports made-up statistics and then doesn’t correct them—if this is his reputation—this gives the incentives for future Brookses (if not David himself) to prominently correct his mistakes. If Stephanie Coontz and the New York Times don’t want to be mocked on twitter, they’re motivated to follow up with serious corrections, not minimalist damage control.

Some perspective here

Look, I’m not talking about tarring and feathering here. The point is that incentives are real; they already exist. You really do (I assume) get a career bump from publishing in Psychological Science and PPNAS, and your work gets more noticed if you publish an op-ed in the NYT or if you’re featured on NPR or Ted or wherever. If all incentives are positive, that creates problems. It creates a motivation for sloppy work. It’s not that anyone is trying to do sloppy work.

Econ gets it (pretty much) right

Say what you want about economists, but they’ve got this down. First off, they understand the importance of incentives. Second, they’re harsh, harsh critics of each other. There’s not much of an econ equivalent to quickie papers in Psychological Science or PPNAS. Serious econ papers go through tons of review. Duds still get through, of course (even some duds in PPNAS). But, overall, it seems to me that economists avoid what might be called the “happy talk” problem. When an economist publishes something, he or she tries to get it right (politically-motivated work aside), in awareness that lots of people are on the lookout for errors, and this will rebound back to the author’s reputation.

Donald Trump’s nomination as an unintended consequence of Citizens United

The biggest surprise of the 2016 election campaign was Donald Trump winning the Republican nomination for president.

A key part of the story is that so many of the non-Trump candidates stayed in the race so long because everyone thought Trump was doomed, so they were all trying to grab Trump’s support when he crashed. Instead, Trump didn’t crash, and he benefited from the anti-Trump forces not coordinating on an alternative.

David Banks shares a theory of how it was that these candidates all stayed in so long:

I [Banks] see it as an unintended consequence of Citizens United. Before that [Supreme Court] decision, the $2000 cap on what individuals/corporations could contribute largely meant that if a candidate did not do well in one of the first three primaries, they pretty much had to drop out and their supporters would shift to their next favorite choice. But after Citizens United, as long as a candidate has one billionaire friend, they can stay in the race through the 30th primary if they want. And this is largely what happened. Trump regularly got the 20% of the straight-up crazy Republican vote, and the other 15 candidates fragmented the rest of the Republicans for whom Trump was the least liked candidate. So instead of Rubio dropping out after South Carolina and his votes shifting over to Bush, and Fiorina dropping out and her votes shifting to Bush, so that Bush would jump from 5% to 10% to 15% to 20% to 25%, etc., we wound up very late in the primaries with Trump looking like the most dominant candidate to field.

Of course, things are much more complex than this facile theory suggests, and lots of other things were going on in parallel. But it still seems to me that this partly explains how Trump threaded the needle to get the Republican nomination.

Interesting. I’d not seen this explanation before so I thought I’d share it with you.