Skip to content

“This is why FDA doesn’t like Bayes—strong prior and few data points and you can get anything”

[cat picture]

In the context of a statistical application, someone wrote:

Since data is retrospective I had to use informative prior. The fit of urine improved significantly (very good) without really affecting concentration. This is why FDA doesn’t like Bayes—strong prior and few data points and you can get anything. Hopefully in this case I can justify the prior that 5% error in urine measurements is reasonable.

I responded to “This is why FDA doesn’t like Bayes—strong prior and few data points and you can get anything” as follows:

That’s ok for me. The point is that FDA should require that the prior be science-based. For example, consider the normal(0, infinity) priors that are implicitly used by the fat-arms-and-political-attitudes crowd. Those are really strong priors, not at all supported by science!

My correspondent replied:

I think if you read clinical studies for medical devices carefully enough the allowance for informative priors as long as they are scientifically based is there (we need to doublecheck). But prevailing recommendation is using non informative priors. Therefore my guess the conception is that FDA is not supporting strong priors. We can’t deny that FDA has prominent Bayesists. The recent comment about using informative prior by FDA statistician was “how we defend the approach with industry” so there are some other aspects that are not so visible. In my case the strong prior is justified since 5% error in urine measurements is tight but still in the realm of reasonable. Since we had only few data points it was the only way to reconstruct the parameters and show that PBPK models fit clinical data well.

Just to clarify: I’m not knocking the FDA in this post. This should be clear from the text but I could see how the title (which is a quote from my correspondent) could give the wrong impression.

This one came in the email from July 2015

Sent to all the American Politics faculty at Columbia, including me:

RE: Donald Trump presidential candidacy


Firstly, apologies for the group email but I wasn’t sure who would be best prized to answer this query as we’ve not had much luck so far.

I am a Dubai-based reporter for **.
Donald Trump recently announced his intension to run for the US presidency in 2016.
He currently has a lot of high profile commercial and business deals in Dubai and is actively in talks for more in the wider region.

We have been trying to determine:
If a candidate succeeds in winning a nomination and goes on to win the election and reside in the White House do they have to give up their business interests as these would be seen as a conflict of interest? Can a US president serve in office and still have massive commercial business interests abroad?

Basically, would Trump have to relinquish these relationships if he was successfully elected? Are there are existing rules specifically governing this? Is there any previous case studies to go on?

Lastly, what are his chances of winning a nomination or being elected? So far, from what we have read it seems highly unlikely?


Executive Editor

At the time I just thought this was funny, first the idea that Trump might become president; second the idea, if he did, that there would be conflicts of interest relating to his foreign businesses.

Check out the comments where one person naively says that Trump would probably put his assets into a blind trust, and I naively reply that Trump won’t be elected president. A failure of imagination in so many ways.

And here we are.

The statistical crisis in science: How is it relevant to clinical neuropsychology?

[cat picture]

Hilde Geurts and I write:

There is currently increased attention to the statistical (and replication) crisis in science. Biomedicine and social psychology have been at the heart of this crisis, but similar problems are evident in a wide range of fields. We discuss three examples of replication challenges from the field of social psychology and some proposed solutions, and then consider the applicability of these ideas to clinical neuropsychology. In addition to procedural developments such as preregistration and open data and criticism, we recommend that data be collected and analyzed with more recognition that each new study is a part of a learning process. The goal of improving neuropsychological assessment, care, and cure is too important to not take good scientific practice seriously.

Prior information, not prior belief

From a couple years ago:

The prior distribution p(theta) in a Bayesian analysis is often presented as a researcher’s beliefs about theta. I prefer to think of p(theta) as an expression of information about theta.

Consider this sort of question that a classically-trained statistician asked me the other day:

If two Bayesians are given the same data, they will come to two conclusions. What do you think about that? Does it bother you?

My response is that the statistician has nothing to do with it. I’d prefer to say that if two different analyses are done using different information, they will come to different conclusions. This different information can come in the prior distribution p(theta), it could come in the data model p(y|theta), it could come in the choice of how to set up the model and what data to include in the first place. I’ve listed these in roughly increasing order of importance.

Sure, we could refer to all statistical models as “beliefs”: we have a belief that certain measurements are statistically independent with a common mean, we have a belief that a response function is additive and linear, we have a belief that our measurements are unbiased, etc. Fine. But I don’t think this adds anything beyond just calling this a “model.” Indeed, referring to “belief” can be misleading. When I fit a regression model, I don’t typically believe in additivity or linearity at all, I’m just fitting a model, using available information and making assumptions, compromising the goal of including all available information because of the practical difficulties of fitting and understanding a huge model.

Same with the prior distribution. When putting together any part of a statistical model, we use some information without wanting to claim that this represents our beliefs about the world.

The Bolt from the Blue

Lionel Hertzog writes:

In the method section of a recent Nature article in my field of research (diversity-ecosystem function) one can read the following:

The inclusion of many predictors in statistical models increases the chance of type I error (false positives). To account for this we used a Bernoulli process to detect false discovery rates, where the probability (P) of finding a given number of significant predictors (K) just by chance is a proportion of the total number of predictors tested (N = 16 in our case: the abundance and richness of 7 and 9 trophic groups, respectively) and the P value considered significant (α = 0.05 in our case). The probability of finding three significant predictors on average, as we did, is therefore, P = [16!/(16 − 3)!3!] × 0.053(1 − 0.05)(16 − 3) = 0.0359, indicating that the effects we found are very unlikely to be spurious. The probability of false discovery rates when considering all models and predictors fit (14 ecosystem services × 16 richness and abundance metrics) and the ones that were significant amongst them (52: 25 significant abundance predictors and 27 significant richness predictors) was even lower (P < 0.0001).

I am no statistician but all my stats-sense are turning red when reading this. What do you think?

My reply: It might be that all the science is solid here; I have not read the paper nor do I have any particular expertise in this area. But I too am disturbed by the passage above, for a couple of reasons, even beyond the usual error of reversing the probability (“the effects we found are very unlikely to be spurious”). My big problems with the the above sort of analysis are:
(a) The whole null-hypothesis rejection thing. I don’t think anyone should care about the null hypothesis. In real life, there’s always variation.
(b) The idea of picking out “three significant predictors” and the whole “false discovery” thing. Again, everything’s happening and there is variation. The whole true-false thing doesn’t make sense to me in this sort of situation.

P.S. The article is called “Biodiversity at multiple trophic levels is needed for ecosystem multifunctionality,” and its authors are
Santiago Soliveres, Fons van der Plas, Peter Manning, Daniel Prati, Martin M. Gossner, Swen C. Renner, Fabian Alt, Hartmut Arndt, Vanessa Baumgartner, Julia Binkenstein, Klaus Birkhofer, Stefan Blaser, Nico Blüthgen, Steffen Boch, Stefan Böhm, Carmen Börschig, Francois Buscot, Tim Diekötter, Johannes Heinze, Norbert Hölzel, Kirsten Jung, Valentin H. Klaus, Till Kleinebecker, Sandra Klemmer, Jochen Krauss, Markus Lange, E. Kathryn Morris, Jörg Müller, Yvonne Oelmann, Jörg Overmann, Esther Pašalić, Matthias C. Rillig, H. Martin Schaefer, Michael Schloter, Barbara Schmitt, Ingo Schöning, Marion Schrumpf, Johannes Sikorski, Stephanie A. Socher, Emily F. Solly, Ilja Sonnemann, Elisabeth Sorkau, Juliane Steckel, Ingolf Steffan-Dewenter, Barbara Stempfhuber, Marco Tschapka, Manfred Türke, Paul C. Venter, Christiane N. Weiner, Wolfgang W. Weisser, Michael Werner, Catrin Westphal, Wolfgang Wilcke, Volkmar Wolters, Tesfaye Wubet, Susanne Wurst, Markus Fischer and Eric Allan.

Update rstanarm to version 2.15.3

Ben Goodrich writes:

We just released rstanarm 2.15.3, which fixed a major bug that was introduced back in January with the 2.14.1 release where models of the form

stan_glmer(y ~ ... + (1 | group1) + (1 | group2), family = binomial())

would produce WRONG RESULTS. This only applies to Bernoulli models with multiple group-specific intercept shifts and no group-specific deviations in the coefficients. If you have estimated any such models in 2017 with rstanarm, you should re-estimate them. The other changes to rstanarm were minor.

If you are using OSX, it may be necessary to install rstanarm and / or rstan from source to work around the “c++ exception (unknown reason)” error that there are several inconclusive threads about.

The latest version of rstan is 2.15.1, released a couple weeks ago.

Danger Sign

Melvyn Weeks writes:

I [Weeks] have a question related to comparability and departures thereof in regression models.

I am familiar with these issues, namely the problems of a lack of complete overlap and imbalance as applied to treatment models where there exists a binary treatment.

However, it strikes me that these issues apply more generally to regression models where there are multiple levels of treatments, and in the limit where the treatment variable is continuous. In this instance a treatment effect is some form of average effect when there is a change of one unit applied to a particular regressor.

Related, just by being able to measure the potential confounding variable and include as a regressor doesnt mean that there are not problems in conducting causal inference from differences in the distribution of regressors differ across many groups. These issues are more apparent in the context of matching when the comparisons are explicit and separate from estimation.

I would welcome your observations on whether the concepts of overlap and imbalance apply more generally to regression models, and problems of undertaking causal inference when there are differences in the distribution of regressors across groups, where the groups may also be individuals in a panel dataset.

My response: Yes, I agree that this can happen with continuous treatments as well. I’m not aware of discussion of this issue in the literature.

Maybe some of you have other thoughts or references?

Another perspective on peer review

Andrew Preston writes:

I’m the co-founder and CEO of Our mission is to modernise peer review. We’ve developed a way to give researchers formal recognition for their peer review (i.e., evidence that can go on your CV) and have partnerships with over 1k top journals.

I’ve just finished reading your rebuttal of Susan Fiske’s editorial critique of social media. While this is directly relevant to the discussion that happens on Publons (great example here), your description of what has happened in the last 10 years brings to mind a Kuhnian paradigm shift in the field.

I think that’s important.

We’re working on a course to train peer reviewers, which we’re calling the Publons Academy. An awareness of statistical issues is an important part of that.

I have no idea what this is all about, and as you know I’d like to get rid of the current reviewing system entirely, but I thought I’d share this with you.

“The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research”

Valentin Amrhein​, Fränzi Korner-Nievergelt, and Tobias Roth write:

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process. We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a realistic statistical power of 40%, given that there is a true effect, only one in six studies will significantly replicate the significant result of another study. Even at a good power of 80%, results from two studies will be conflicting, in terms of significance, in one third of the cases if there is a true effect. This means that a replication cannot be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgement based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to publication bias against nonsignificant findings. Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the observed effect size, e.g., from a sample average, and from a measure of uncertainty, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, such as ‘we need more stringent decision rules’, ‘sample sizes will decrease’ or ‘we need to get rid of p-values’.

The general ideas should be familiar to regular readers of this blog.

Blue Cross Blue Shield Health Index

Chris Famighetti points us to this page which links to an interactive visualization.

There are some problems with the mapping software—when I clicked through, it showed a little map of the western part of the U.S., accompanied by huge swathes of Canada and the Pacific Ocean—and I haven’t taken a look at the methodology.

But in a way it doesn’t really matter. This sort of thing is great. If it’s solid work, that’s great already. If it’s flawed, it’s out there and people can make suggestions for improvement. So I hope this Blue Cross team gets lots of feedback uses this to improve what they’ve got.

Prior choice recommendations wiki !

Here’s the wiki, and here’s the background:

Our statistical models are imperfect compared to the true data generating process and our complete state of knowledge (from an informational-Bayesian perspective) or the set of problems over which we wish to average our inferences (from a population-Bayesian or frequentist perspective).

The practical question here is what model to choose, or what class of models. Advice here is not typically so clear. We have a bunch of existing classes of models such as linear regression, logistic regression, along with newer things such as deep learning, and usual practice is to take a model that has been applied on similar problems and keep using it until it is obviously wrong. That’s not such a bad way to go; I’m just pointing out the informality of this aspect of model choice.

What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior which is supposed to give good answers for any possible parameter valuers (in general, feasible only in settings where data happen to be strongly informative about all parameters in your model).

We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).

I used to talk about these as weakly informative prior distributions (as in this paper from 2006 on hierarchical variance parameters and this from 2008 on logistic regression coefficients) but now I just call them “prior distributions.” In just about any real problem, there’s no fully informative prior and there’s no noninformative prior; every prior contains some information and does some regularization, while still leaving some information “on the table,” as it were.

That is, all priors are weakly informative; we now must figure out where to go from there. We should be more formal about the benefits of including prior information, and the costs of overconfidence (that is, including information that we don’t really have).

We (Dan Simpson, Michael Betancourt, Aki Vehtari, and various others) have been thinking about the problem in different ways involving theory, applications, and implementation in Stan.

For now, we have this wiki on Prior Choice Recommendations. I encourage you to take a look, and feel free to post your questions, thoughts, and suggested additions in comments below.

A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry?

Following a recent post that mentioned << le Sherlock Holmes de l'alimentation >>, we go this blockbuster comment which seemed worth its own post by Ujjval Vyas:

I work in an area where social psychology is considered the gold standard of research and thus the whole area is completely full of Wansink stuff (“people recover from surgery faster if they have a view of nature out the window”, “obesity and diabetes are caused by not enough access to nature for the poor”, biomimicry is a particularly egregious idea in this field). No one even knows how to really read any of the footnotes or cares, since it is all about confirmation bias and the primary professional organizations in the area directly encourage such lack of rigor. Obscure as it may sound, the whole area of “research” into architecture and design is full of this kind of thing. But the really odd part is that the field is made up of people who have no idea what a good study is or could be (architects, designers, interior designers, academic “researchers” at architecture schools or inside furniture manufacturers trying to sell more). They even now have groups that pursue “evidence-based healthcare design” which simply means that some study somewhere says what they need it to say. The field is at such a low level that it is not worth mentioning in many ways except that it is deeply embedded in a $1T industry for building and construction as well as codes and regulations based on this junk. Any idea of replication is simply beyond the kenning in this field because, as one of your other commenters put it, the publication is only a precursor to Ted talks and keynote addresses and sitting on officious committees to help change the world (while getting paid well). Sadly, as you and commenters have indicated, no one thinks they are doing anything wrong at all. I only add this comment to suggest that there are whole fields and sub-fields that suffer from the problems outlined here (much of this research would make Wansink look scrupulous).

Here’s the wikipedia page on Evidence-based design, including this chilling bit:

As EBD is supported by research, many healthcare organizations are adopting its principles with the guidance of evidence-based designers. The Center for Health Design and InformeDesign (a not-for-profit clearinghouse for design and human-behaviour research) have developed the Pebble Project, a joint research effort by CHD and selected healthcare providers on the effect of building environments on patients and staff.

The Evidence Based Design Accreditation and Certification (EDAC) program was introduced in 2009 by The Center for Health Design to provide internationally recognized accreditation and promote the use of EBD in healthcare building projects, making EBD an accepted and credible approach to improving healthcare outcomes. The EDAC identifies those experienced in EBD and teaches about the research process: identifying, hypothesizing, implementing, gathering and reporting data associated with a healthcare project.

Later on the page is a list of 10 strategies (1. Start with problems. 2. Use an integrated multidisciplinary approach with consistent senior involvement, ensuring that everyone with problem-solving tools is included. etc.). Each of these steps seems reasonable, but put them together and they do read like a recipe for taking hunches, ambitious ideas, and possible scams and making them look like science. So I’m concerned. Maybe it would make sense to collaborate with someone in the field of architecture and design and try to do something about this.

P.S. It might seem kinda mean for me to pick on these qualitative types for trying their best to achieve something comparable to quantitative rigor. But . . . if there are really billions of dollars at stake, we shouldn’t sit idly by. Also, I feel like Wansink-style pseudoscience can be destructive of qualitative expertise. I’d rather see some solid qualitative work than bogus number crunching.

“Data sleaze: Uber and beyond”

Interesting discussion from Kaiser Fung. I don’t have anything to add here; it’s just a good statistics topic.

Scroll through Kaiser’s blog for more:

Dispute over analysis of school quality and home prices shows social science is hard

My pre-existing United boycott, and some musing on randomness and fairness


Using prior knowledge in frequentist tests

Christian Bartels send along this paper, which he described as an attempt to use informative priors for frequentist test statistics.

I replied:

I’ve not tried to follow the details but this reminds me of our paper on posterior predictive checks. People think of this as very Bayesian but my original idea when doing this research was to construct frequentist tests using Bayesian averaging in order to get p-values. This was motivated by a degrees-of-freedom-correction problem where the model had nonlinear constraints and so one could not simply do a classical correction based on counting parameters.

To which Bartels wrote:

Your work starts from the same point as mine, existing frequentist tests may be inadequate for the problem of interest. Your work ends also, where I would like to end, performing tests via integration over (i.e.,sampling of) paremeters and future observation using likelihood and prior.

In addition, I try to anchor the approach in decision theory (as referenced in my write up). Perhaps this is too ambitious, we’ll see.

Results so far, using the language of your publication:
– The posterior distribution p(theta|y) is a good choice for the deviance D(y,theta). It gives optimal confidence intervals/sets in the sense proposed by Schafer, C.M. and Stark, P.B., 2009. Constructing confidence regions of optimal expected size. Journal of the American Statistical Association, 104(487), pp.1080-1089.
– Using informative priors for the deviance D(y,theta)=p(theta|y) may improve the quality of decisions, e.g., may improve thenpower of tests.
– For the marginalization, I find it difficult to strike the balance between proposing something that can be argued/shown to give optimal tests, and something that can be calculate with availabe computational resources. I hope to end up with something like one of the variants shown in your Figure 1.

I noted that you distinguish test statistics from deviances that do depend or do not depend on the parameter. I’m not aware of anything that prevents you from using deviances with a dependence on parameters for frequentist tests – it is just inconvenient, if you are after generic, closed form solutions for tests. I did not make this differentiation, and refer to tests independent on whether they depend on the parameters or not.

I don’t really have anything more to say here, as I have not thought about these issues for awhile. But I thought Bartels’s paper and this discussion might interest some of you.

I hate R, volume 38942


R doesn’t allow block comments. You have to comment out each line, or you can encapsulate the block in if(0){} which is the world’s biggest hack. Grrrrr.

P.S. Just to clarify: I want block commenting not because I want to add long explanatory blocks of text to annotate my scripts. I want block commenting because I alter my scripts, and sometimes I want to comment out a block of code.

The next Lancet retraction? [“Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults”]

[cat picture]

Someone who prefers to remain anonymous asks for my thoughts on this post by Michael Corrigan and Robert Whitaker, “Lancet Psychiatry Needs to Retract the ADHD-Enigma Study: Authors’ conclusion that individuals with ADHD have smaller brains is belied by their own data,” which begins:

Lancet Psychiatry, a UK-based medical journal, recently published a study titled Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults: A cross-sectional mega-analysis. According to the paper’s 82 authors, the study provides definitive evidence that individuals with ADHD have altered, smaller brains. But as the following detailed review reveals, the study does not come close to supporting such claims.

Below are tons of detail, so let me lead with my conclusion, which is that the criticisms coming from Corrigan and Whitaker seem reasonable to me. That is, based on my quick read, the 82 authors of that published paper seem to have made a big mistake in what they wrote.

I’d be interested to see if the authors have offered any reply to these criticisms. The article has just recently come out—the journal publication is dated April 2017—and I’d like to see what the authors have to say.

OK, on to the details. Here are Corrigan and Whitaker:

The study is beset by serious methodological shortcomings, missing data issues, and statistical reporting errors and omissions. The conclusion that individuals with ADHD have smaller brains is contradicted by the “effect-size” calculations that show individual brain volumes in the ADHD and control cohorts largely overlapped. . . .

Their results, the authors concluded, contained important messages for clinicians: “The data from our highly powered analysis confirm that patients with ADHD do have altered brains and therefore that ADHD is a disorder of the brain.” . . .

The press releases sent to the media reflected the conclusions in the paper, and the headlines reported by the media, in turn, accurately summed up the press releases. Here is a sampling of headlines:

Given the implications of this study’s claims, it deserves to be closely analyzed. Does the study support the conclusion that children and adults with ADHD have “altered brains,” as evidenced by smaller volumes in different regions of the brain? . . .

Alternative Headline: Large Study Finds Children with ADHD Have Higher IQs!

To discover this finding, you need to spend $31.50 to purchase the article, and then make a special request to Lancet Psychiatry to send you the appendix. Then you will discover, on pages 7 to 9 in the appendix, a “Table 2” that provides IQ scores for both the ADHD cohort and the controls.

Although there were 23 clinical sites in the study, only 20 reported comparative IQ data. In 16 of the 20, the ADHD cohort had higher IQs on average than the control group. In the other four clinics, the ADHD and control groups had the same average IQ (with the mean IQ scores for both groups within two points of each other.) Thus, at all 20 sites, the ADHD group had a mean IQ score that was equal to, or higher than, the mean IQ score for the control group. . . .

And why didn’t the authors discuss the IQ data in their paper, or utilize it in their analyses? . . . Indeed, if the IQ data had been promoted in the study’s abstract and to the media, the public would now be having a new discussion: Is it possible that children diagnosed with ADHD are more intelligent than average? . . .

They Did Not Find That Children Diagnosed with ADHD Have Smaller Brain Volumes . . .

For instance, the authors reported a Cohen’s d effect size of .19 for differences in the mean volume of the accumbens in children under 15. . . in this study, for youth under 15, it was the largest effect size of all the brain volume comparisons that were made. . . . Approximately 58% of the ADHD youth in this convenience sample had an accumbens volume below the average in the control group, while 42% of the ADHD youth had an accumbens volume above the average in the control group. Also, if you knew the accumbens volume of a child picked at random, you would have a 54% chance that you could correctly guess which of the two cohorts—ADHD or healthy control—the child belonged to. . . . The diagnostic value of an MRI brain scan, based on the findings in this study, would be of little more predictive value than the toss of a coin. . . .

The authors reported that the “volumes of the accumbens, amygdala, caudate, hippocampus, putamen, and intracranial volume were smaller in individuals with ADHD compared with controls in the mega-analysis” (p. 1). If this is true, then smaller brain volumes should show up in the data from most, if not all, of the 21 sites that had a control group. But that was not the case. . . . The problem here is obvious. If authors are claiming that smaller brain regions are a defining “abnormality” of ADHD, then such differences should be consistently found in mean volumes of ADHD cohorts at all sites. The fact that there was such variation in mean volume data is one more reason to see the authors’ conclusions—that smaller brain volumes are a defining characteristic of ADHD—as unsupported by the data. . . .

And now here’s what the original paper said:

We aimed to investigate whether there are structural differences in children and adults with ADHD compared with those without this diagnosis. In this cross-sectional mega-analysis [sic; see P.P.S. below], we used the data from the international ENIGMA Working Group collaboration, which in the present analysis was frozen at Feb 8, 2015. Individual sites analysed structural T1-weighted MRI brain scans with harmonised protocols of individuals with ADHD compared with those who do not have this diagnosis. . . .

Our sample comprised 1713 participants with ADHD and 1529 controls from 23 sites . . . The volumes of the accumbens (Cohen’s d=–0·15), amygdala (d=–0·19), caudate (d=–0·11), hippocampus (d=–0·11), putamen (d=–0·14), and intracranial volume (d=–0·10) were smaller in individuals with ADHD compared with controls in the mega-analysis. There was no difference in volume size in the pallidum (p=0·95) and thalamus (p=0·39) between people with ADHD and controls.

The above demonstrates some forking paths, and there are a bunch more in the published paper, for example:

Exploratory lifespan modelling suggested a delay of maturation and a delay of degeneration, as e ect sizes were highest in most subgroups of children (<15 years) versus adults (>21 years): in the accumbens (Cohen’s d=–0·19 vs –0·10), amygdala (d=–0·18 vs –0·14), caudate (d=–0·13 vs –0·07), hippocampus (d=–0·12 vs –0·06), putamen (d=–0·18 vs –0·08), and intracranial volume (d=–0·14 vs 0·01). There was no di erence between children and adults for the pallidum (p=0·79) or thalamus (p=0·89). Case-control differences in adults were non-signi cant (all p>0·03). Psychostimulant medication use (all p>0·15) or symptom scores (all p>0·02) did not in uence results, nor did the presence of comorbid psychiatric disorders (all p>0·5). . . .

Outliers were identified at above and below one and a half times the interquartile range per cohort and group (case and control) and were excluded . . . excluding collinearity of age, sex, and intracranial volume (variance in ation factor <1·2) . . . The model included diagnosis (case=1 and control=0) as a factor of interest, age, sex, and intracranial volume as fixed factors, and site as a random factor. In the analysis of intracranial volume, this variable was omitted as a covariate from the model. Handedness was added to the model to correct for possible effects of lateralisation, but was excluded from the model when there was no significant contribution of this factor. . . . stratified by age: in children aged 14 years or younger, adolescents aged 15–21 years, and adults aged 22 years and older. We removed samples that were left with ten patients or fewer because of the stratification. . . .

Forking paths are fine; I have forking paths in every analysis I’ve ever done. But forking paths render published p-values close to meaningless; in particular I have no reason to take seriously a statement such as, “p values were significant at the false discovery rate corrected threshold of p=0·0156,” from the summary of the paper.

So let’s forget about p-values and just look at the data graphs, which appear in the published paper:

Unfortunately these are not raw data or even raw averages for each age; instead they are “moving averages, corrected for age, sex, intracranial volume, and site for the subcortical volumes.” But we’ll take what we’ve got.

From the above graphs, it doesn’t seem like much of anything is going on: the blue and red lines cross all over the place! So now I don’t understand this summary graph from the paper:

I mean, sure, I see it for Accumbens, I guess, if you ignore the older people. But, for the others, the lines in the displayed age curves cross all over the place.

The article in question has the following list of authors: Martine Hoogman, Janita Bralten, Derrek P Hibar, Maarten Mennes, Marcel P Zwiers, Lizanne S J Schweren, Kimm J E van Hulzen, Sarah E Medland, Elena Shumskaya, Neda Jahanshad, Patrick de Zeeuw, Eszter Szekely, Gustavo Sudre, Thomas Wolfers, Alberdingk M H Onnink, Janneke T Dammers, Jeanette C Mostert, Yolanda Vives-Gilabert, Gregor Kohls, Eileen Oberwelland, Jochen Seitz, Martin Schulte-Rüther, Sara Ambrosino, Alysa E Doyle, Marie F Høvik, Margaretha Dramsdahl, Leanne Tamm, Theo G M van Erp, Anders Dale, Andrew Schork, Annette Conzelmann, Kathrin Zierhut, Ramona Baur, Hazel McCarthy, Yuliya N Yoncheva, Ana Cubillo, Kaylita Chantiluke, Mitul A Mehta, Yannis Paloyelis, Sarah Hohmann, Sarah Baumeister, Ivanei Bramati, Paulo Mattos, Fernanda Tovar-Moll, Pamela Douglas, Tobias Banaschewski, Daniel Brandeis, Jonna Kuntsi, Philip Asherson, Katya Rubia, Clare Kelly, Adriana Di Martino, Michael P Milham, Francisco X Castellanos, Thomas Frodl, Mariam Zentis, Klaus-Peter Lesch, Andreas Reif, Paul Pauli, Terry L Jernigan, Jan Haavik, Kerstin J Plessen, Astri J Lundervold, Kenneth Hugdahl, Larry J Seidman, Joseph Biederman, Nanda Rommelse, Dirk J Heslenfeld, Catharina A Hartman, Pieter J Hoekstra, Jaap Oosterlaan, Georg von Polier, Kerstin Konrad, Oscar Vilarroya, Josep Antoni Ramos-Quiroga, Joan Carles Soliva, Sarah Durston, Jan K Buitelaar, Stephen V Faraone, Philip Shaw, Paul M Thompson, Barbara Franke.

I also found a webpage for their research group, featuring this wonderful map:

The number of sites looks particularly impressive when you include each continent twice like that. But they should really do some studies in Antarctica, given how huge it appears to be!

P.S. Following the links, I see that Corrigan and Whitaker come into this with a particular view:

Mad in America’s mission is to serve as a catalyst for rethinking psychiatric care in the United States (and abroad). We believe that the current drug-based paradigm of care has failed our society, and that scientific research, as well as the lived experience of those who have been diagnosed with a psychiatric disorder, calls for profound change.

This does not mean that the critics are wrong—presumably the authors of the original paper came into their research with their own strong views—; it can just be helpful to know where they’re coming from.

P.P.S. The paper discussed above uses the term “mega-analysis.” At first I thought this might be some sort of typo, but apparently the expression does exist and has been around for awhile. From my quick search, it appears that the term was first used by James Dillon in a 1982 article, “Superanalysis,” in Evaluation News, where he defined mega-analysis as “a method for synthesizing the results of a series of meta-analyses.”

But in the current literature, “mega-analysis” seems to simply refer to a meta-anlaysis that uses the raw data from the original studies.

If so, I’m unhappy with the term “mega-analysis” because: (a) The “mega” seems a bit hypey, (b) What if the original studies are small? Then even all the data combined might not be so “mega”?, and (c) I don’t like the implication that plain old “meta-analysis” doesn’t use the raw data. I’m pretty sure that the vast majority of meta-analyses use only published summaries, but I’ve always thought of it as the preferred version of meta-anlaysis to use the original data.

I bring up this mega-analysis thing not as a criticism of the Hoogman et al. paper—they’re just using what appears to be a standard term in their field—but just as an interesting side-note.

P.P.P.S. The above post represents my current impression. As I wrote, I’d be interested to see the original authors’ reply to the criticism. Lancet does have a pretty bad reputation—it’s known for publishing flawed, sensationalist work—but I’m sure they run the occasional good article too. So I wouldn’t want to make any strong judgments in this case before hearing more.

P.P.P.P.S. Regarding the title of this post: No, I don’t think Lancet would ever retract this paper, even if all the above criticisms are correct. It seems that retraction is used only in response to scientific misconduct, not in response to mere error. So when I say “retraction,” I mean what one might call “conceptual retraction.” The real question is: Will this new paper join the list of past Lancet papers which we would not want to take seriously, and which we regret were ever published?

Stan in St. Louis this Friday

This Friday afternoon I (Jonah) will be speaking about Stan at Washington University in St. Louis. The talk is open to the public, so anyone in the St. Louis area who is interested in Stan is welcome to attend. Here are the details:

Title: Stan: A Software Ecosystem for Modern Bayesian Inference
Jonah Sol Gabry, Columbia University

Neuroimaging Informatics and Analysis Center (NIAC) Seminar Series
Friday April 28, 2017, 1:30-2:30pm
NIL Large Conference Room
#2311, 2nd Floor, East Imaging Bldg.
4525 Scott Avenue, St. Louis, MO


Stan without frontiers, Bayes without tears

[cat picture]

This recent comment thread reminds me of a question that comes up from time to time, which is how to teach Bayesian statistics to students who aren’t comfortable with calculus. For continuous models, probabilities are integrals. And in just about every example except the one at 47:16 of this video, there are multiple parameters, so probabilities are multiple integrals.

So how to teach this to the vast majority of statistics users who can’t easily do multivariate calculus?

I dunno, but I don’t know that this has anything in particular to do with Bayes. Think about classical statistics, at least the sort that gets used in political science. Linear regression requires multivariate calculus too (or some pretty slick algebra or geometry) to get that least-squares solution. Not to mention the interpretation of the standard error. And then there’s logistic regression. Going further we move to popular machine learning methods which are really gonna seem like nothing more than black boxes. Kidz today all wanna do deep learning or random forests or whatever. And that’s fine. But no way are most of them learning the math behind it.

Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works.

So, in keeping with this attitude, teach Stan. Students set up the model, they push the button, they get the answers. No integrals required. Yes, you have to work with posterior simulations so there is integration implicitly—the conceptual load is not zero—but I think (hope?) that this approach of using simulations to manage uncertainty is easier and more direct than expressing everything in terms of integrals.

But it’s not just model fitting, it’s also model building and model checking. Cross validation, graphics, etc. You need less mathematical sophistication to evaluate a method than to construct it.

About ten years ago I wrote an article, “Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .” After briefly talking about a course that uses the BDA book and assumes that students know calculus, I continued:

My applied regression and multilevel modeling class has no derivatives and no integrals—it actually has less math than a standard regression class, since I also avoid matrix algebra as much as possible! What it does have is programming, and this is an area where many of the students need lots of practice. The course is Bayesian in that all inference is implicitly about the posterior distribution. There are no null hypotheses and alternative hypotheses, no Type 1 and Type 2 errors, no rejection regions and confidence coverage.

It’s my impression that most applied statistics classes don’t get into confidence coverage etc., but they can still mislead students by giving the impression that those classical principles are somehow fundamental. My class is different because I don’t pretend in that way. Instead I consider a Bayesian approach as foundational, and I teach students how to work with simulations.

My article continues:

Instead, the course is all about models, understanding the models, estimating parameters in the models, and making predictions. . . . Beyond programming and simulation, probably the Number 1 message I send in my applied statistics class is to focus on the deterministic part of the model rather than the error term. . . .

Even a simple model such as y = a + b*x + error is not so simple if x is not centered near zero. And then there are interaction models—these are incredibly important and so hard to understand until you’ve drawn some lines on paper. We draw lots of these lines, by hand and on the computer. I think of this as Bayesian as well: Bayesian inference is conditional on the model, so you have to understand what the model is saying.

The meta-hype algorithm

[cat picture]

Kevin Lewis pointed me to this article:

There are several methods for building hype. The wealth of currently available public relations techniques usually forces the promoter to judge, a priori, what will likely be the best method. Meta-hype is a methodology that facilitates this decision by combining all identified hype algorithms pertinent for a particular promotion problem. Meta-hype generates a final press release that is at least as good as any of the other models considered for hyping the claim. The overarching aim of this work is to introduce meta-hype to analysts and practitioners. This work compares the performance of journal publication, preprints, blogs, twitter, Ted talks, NPR, and meta-hype to predict successful promotion. A nationwide database including 89,013 articles, tweets, and news stories. All algorithms were evaluated using the total publicity value (TPV) in a test sample that was not included in the training sample used to fit the prediction models. TPV for the models ranged between 0.693 and 0.720. Meta-hype was superior to all but one of the algorithms compared. An explanation of meta-hype steps is provided. Meta-hype is the first step in targeted hype, an analytic framework that yields double hyped promotion with fewer assumptions than the usual publicity methods. Different aspects of meta-hype depending on the context, its function within the targeted promotion framework, and the benefits of this methodology in the addiction to extreme claims are discussed.

I can’t seem to find the link right now, but you get the idea.

P.S. The above is a parody of the abstract of a recent article on “super learning” by Acion et al. I did not include a link because the parody was not supposed to be a criticism of the content of the paper in any way; I just thought some of the wording in the abstract was kinda funny. Indeed, I thought I’d disguised the abstract enough that no one would make the connection but I guess Google is more powerful than I’d realized.

But this discussion by Erin in the comment thread revealed that some people were taking this post in a way I had not intended. So I added this comment.

tl;dr: I’m not criticizing the content of the Acion et al. paper in any way, and the above post was not intended to be seen as such a criticism.

Would you prefer three N=300 studies or one N=900 study?

Stephen Martin started off with a question:

I’ve been thinking about this thought experiment:

Imagine you’re given two papers.
Both papers explore the same topic and use the same methodology. Both were preregistered.
Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).
Paper B has a novel study with confirmed hypotheses (n=900).
*Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

I’m reasonably certain the answer is that both papers provide the same amount of evidence, by essentially the likelihood principle, and if anything, one should trust the estimates of paper B more (unless you meta-analyzed paper A, which should give you the same answer as paper B, more or less).

However, my intuition was correct that most people in this group would choose paper A (See for poll results).

My reasoning is that if you are observing data from the same DGP, then where you cut the data off is arbitrary; why would flipping a coin 10x, 10x, 10x, 10x, 10x provide more evidence than flipping the coin 50x? The method in paper A essentially just collected 300, drew a line, collected 300, drew a line, then collected 300 more, and called them three studies; this has no more information in sum (in a fisherian sense, the information would just add together) than if you didn’t arbitrarily cut the data into sections.

If you read in the comments of this group (which has researchers predominantly of the NHST world), one sees this fallacy that merely by passing a threshold more times means you have more evidence. They use p*p*p to justify it (even though that doesn’t make sense, because one could partition the data into 10 n=90 sets and get ‘more evidence’ by this logic; in fact, you could have 90 p-values of ~.967262 and get a p-value of .05). They use fisher’s method to say the p-value could be low (~.006), even though when combined, the p-value would actually be even lower (~.0007). One employs only Neyman-Pearson logic, and this results in a t1 error probability of .05^3.

I replied:

What do you mean by “confirmed hypotheses,” and what do you mean by a replication being “successful”? And are you assuming that the data are identical in the two scenarios?

To which Martin answered:

I [Martin], in a sense, left it ambiguous because I suspected that knowing nothing else, people would put paper A, even though asymptotically it should provide the same information as paper B.

I also left ‘confirmed hypothesis’ vague, because I didn’t want to say one must use one given framework. Basically, the hypotheses were supported by whatever method one uses to judge support (whether it be p-values, posteriors, bayes factors, whatever).

Successful replication as in, the hypotheses were supported again in the replication studies.

Finally, my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.

That said, if you are experimenter A gaining three n=300 samples, your data should asymptotically (or, over infinite datasets) equal that of experimenter B gaining one n=900 sample (over infinite datasets), in the sense that the expected total information is equal, and the accumulated evidence should be equal. Therefore, even if any given two papers have different datasets, asymptotically they should provide equal information, and there’s not a good reason to prefer three smaller studies over 1 larger one.

Yet, knowing nothing else, people assumed paper A, I think, because three studies is more intuitively appealing than one large study, even if the two could be interchangeable had you divided the larger sample into three, or combined the smaller samples into 1.

From my perspective, Martin’s question can’t really be answered because I don’t know what’s in papers A and B, and I don’t know what is meant by a replication being “successful.” I think the answer depends a lot on these pieces of information, and I’m still not quite sure what Martin’s getting at here. But maybe some of you have thoughts on this one.