
Fish cannot carry p-values

Following up on our discussion from last week on inference for fisheries, Anders Lamberg writes:

Since I first sent you the question, there has been a debate here too.

In the discussion you sent, there is a debate both about the actual sampling (the mathematics) and about the more practical/biological issues: How accurately can farmed fish be separated from wild fish, is the 5 % farmed fish limit correct, etc. New data are constantly being acquired on this first type of question. I am not worried about that, because there is an actual process going on that makes the methods better.

However, it is the second question, the use of statistics and models, that until recently has not been discussed properly. Here a lot of biologists have used the concept “confidence interval” without really understanding what it means. I gave you an example of sampling 60 salmon in a population of 3000. There are a lot of examples where the sample size has been as low as 10 individuals. The problem has been how to interpret the uncertainty. Here is a constructed (but not far from realistic) example:

Population size is 3000. From different samples you could hypothetically get these three results:

1) You sample 10, get 1 farmed fish. This gives 10 % farmed fish
2) You sample 30, get 3 farmed fish. This gives 10 % farmed fish
3) You sample 60, get 6 farmed fish. This gives 10 % farmed fish

All surveys show the same result, but they are dramatically different when you have to draw a conclusion.

When reporting the sampling (current practice), it is the point estimate of 10 % that is the main reported result. Sometimes the confidence interval with upper and lower limits is also reported, but not discussed. Since only one sample is drawn from each population, not discussing the uncertainty with such small samples can lead to wrong conclusions. In most projects a typical biologist reports on, the results are part of a hypothetico-deductive research process. The new thing with the farmed salmon surveys is that the results are measured against a defined limit: 5 %. If the point estimate is above 5 %, it means millions in costs (actually billions) for the industry. On the other hand, if the observed point estimate is below 5 %, the uncertainty could affect the wild salmon populations. This could result in a long-term disaster for the wild salmon.

At the risk of being viciously tabloid: The biologists (and I am one of them) have suddenly come into a situation where their reports have direct consequences. The question about the farmed salmon frequencies in the wild populations has become a political question in Norway – at the highest level. Suddenly we have to really discuss uncertainty in our data. I do not say that all biologists have been ignorant, but I suspect that a lot of publications have not addressed, and do not address, uncertainty with respect.

In the last months, more mathematical expertise here in Norway has become involved in the “farmed salmon question.” The conclusion so far is that you cannot use the point estimate. You have to view the confidence interval as a test of a hypothesis:

H: The level of farmed salmon is over 5 %

If the 95 % confidence interval has an upper limit of 5 % or higher, you have to start measures. If, for example, the point estimate is 1 % but the upper limit of the 95 % confidence interval is 6 %, we must start the job of removing farmed salmon from that population. The problem is that the confidence interval from almost all the surveys will contain the critical value of 5 % (even when the point estimate is much lower), so in most populations you cannot reject the hypothesis. The reason all the intervals contain the critical value is the small sample sizes.

To use this kind of sampling procedure, your sample size should exceed about 200 salmon to give a result that will give the fish farming industry fair treatment. On the other hand, small sample sizes and wide confidence intervals will always be a benefit for the wild salmon. I would like that on behalf of nature, but we biologists will then not be as relevant as experts who give advice to society as a whole.

Then there are a lot of practical implications linked to the minimum sample size of 200. Since the sample is done by rod catch, some salmon will die due to the sampling procedure. But the most serious problem with the sampling is that several new reports now show that the farmed fish will more frequently take the bait. It is shown that the catchability of farmed salmon is from 3 to 10 times higher than that of wild salmon. This will vary so you cannot put in a constant factor in the calculations.

The solution so far seems to be to use other methods to acquire the samples. Snorkeling surveys in the rivers, performed by trained persons, show that over 85 % of the farmed fish are correctly classified. Since a snorkeling survey covers from 80 to 100 % of the population, the only significant error is misclassification, which is a small error compared to the uncertainty of small-sample procedures.

Thanks again for showing interest in this question. The research institutions in Norway have not been very willing even to discuss the theme. I suspect that has to do with money. Fish farmers focus on growth and money, but sadly, so far I guess the researchers involved in monitoring environmental impacts see that a crisis gives more money for research. Therefore it is important to have the discussion free of all questions about money. Here in Norway I miss the kind of approach you have to the topic. Isn’t the discussion, development, and testing of new hypotheses the reason why we became biologists? It is the closest you come to being a criminal investigator. We did not want to become politicians.

My general comment is to remove the whole “hypothesis” thing. It’s an estimation problem. You’re trying to estimate the level of farmed fish, which varies over time and across locations. And you have some decisions to make. I see zero benefit, and much harm, to framing this as a hypothesis testing problem.

Wald and those other guys from the 1940s were brilliant, doing statistics and operations research in real time during a real-life war. But the framework they were using was improvised, it was rickety, and in the many decades since, people keep trying to adapt it in inappropriate settings. Time to attack inference and decision problems directly, instead of tying yourself into knots with hypotheses and confidence intervals and upper limits and all the rest.
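To make the estimation framing concrete, here is a minimal sketch in Python of what direct inference could look like for the three constructed samples in the email (1 of 10, 3 of 30, 6 of 60). It is not the analysis anyone in this exchange actually ran. The flat prior, the simple-random-sample assumption, and the function names `posterior_summary` and `adjust_for_catchability` are all assumptions introduced here for illustration. The second function only does the arithmetic implied by the catchability remark: if farmed fish are some fixed ratio more likely to be caught, the observed share overstates the true share.

```python
# Sketch only: a flat Beta(1, 1) prior and a simple random sample are
# assumptions made for illustration, not part of the original discussion.


def posterior_summary(farmed, sampled, limit=0.05, grid_size=20001):
    """Grid approximation to the posterior of the farmed-fish proportion.

    Returns (P(proportion > limit), lower, upper), where lower and upper
    bound a central 95% posterior interval.
    """
    step = 1.0 / (grid_size - 1)
    grid = [i * step for i in range(grid_size)]
    # Binomial likelihood times a flat prior, evaluated on the grid.
    weights = [t ** farmed * (1.0 - t) ** (sampled - farmed) for t in grid]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Posterior probability that the proportion exceeds the limit.
    p_above = sum(p for t, p in zip(grid, probs) if t > limit)

    # Walk the cumulative distribution to find the 2.5% and 97.5% points.
    lower = upper = None
    cumulative = 0.0
    for t, p in zip(grid, probs):
        cumulative += p
        if lower is None and cumulative >= 0.025:
            lower = t
        if cumulative >= 0.975:
            upper = t
            break
    return p_above, lower, upper


def adjust_for_catchability(observed_share, ratio):
    """True farmed share if farmed fish are `ratio` times as catchable.

    Solves observed = ratio * p / (ratio * p + 1 - p) for p.
    """
    return observed_share / (observed_share + ratio * (1.0 - observed_share))


if __name__ == "__main__":
    for sampled, farmed in [(10, 1), (30, 3), (60, 6)]:
        p_above, lower, upper = posterior_summary(farmed, sampled)
        print(f"n={sampled:2d}: P(share > 5%) = {p_above:.2f}, "
              f"95% interval = ({lower:.3f}, {upper:.3f})")
    for ratio in (1, 3, 10):
        print(f"catchability ratio {ratio:2d}: "
              f"adjusted share = {adjust_for_catchability(0.10, ratio):.3f}")
```

The same 10 % point estimate gives a different posterior spread at each sample size, and the output is a probability statement about the proportion itself rather than a reject/don't-reject verdict. A constant catchability ratio is itself a simplification, since the email says the ratio varies.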

Call for research on California water resources

Patrick Atwater writes:

I serve as a project manager of the California Data Collaborative, a coalition of water utilities working together to share data and ensure water reliability.

We’ve put together a quick call for ideas on studies into the demand effects of water rates leveraging this unique database. California’s water world is highly fragmented across 411 retailers so this centralized repository greatly simplifies the life of prospective researchers.

Your audience is the perfect crowd to leverage this dataset and if you haven’t noticed, we’ve got a big drought out here in California so could use all the help we can get!

I have no idea what this is about but you can click on the link to find out for yourself.

What makes a mathematical formula beautiful?


Hiro Minato pointed me to this paper (hyped here) by Semir Zeki, John Romaya, Dionigi Benincasa, and Michael Atiyah on “The experience of mathematical beauty and its neural correlates,” who report:

We used functional magnetic resonance imaging (fMRI) to image the activity in the brains of 15 mathematicians when they viewed mathematical formulae which they had individually rated as beautiful, indifferent or ugly. Results showed that the experience of mathematical beauty correlates parametrically with activity in the same part of the emotional brain, namely field A1 of the medial orbito-frontal cortex (mOFC), as the experience of beauty derived from other sources.

I wrote that I looked at the paper and I don’t believe it!

Minato replied:

I think what they did wasn’t good enough to answer or even approach the question (scientifically or otherwise). . . . Meanwhile, someone can probably study sociology or culture of mathematicians to understand why mathematicians want to describe some “good” mathematics beautiful, elegant, etc.

I agree. Mathematical beauty is a fascinating topic; I just don’t think they’re going to learn much via MRI scans. It just seems like too crude a tool, kinda like writing a bunch of formulas on paper, feeding these sheets of paper to lab rats, and then performing a chemical analysis of the poop. The connection between input and output is just too noisy and indirect.

This seems like a problem that could use the collaboration of mathematicians, psychologists, and historians or sociologists. And just think of how much sociologist time you could afford, using the money you saved from not running the MRI machine!

Thanks, eBay!


Our recent Stan short course went very well, and we wanted to thank Julia Neznanova and Paul Burt of eBay NYC for giving us the space where we held the class.

More evidence that even top researchers routinely misinterpret p-values

Blake McShane writes:

I wanted to write to you about something related to your ongoing posts on replication in psychology as well as your recent post on the ASA statement on p-values. In addition to the many problems you and others have documented with the p-value as a measure of evidence (both those computed “honestly” and those computed after fishing, the garden of forking paths, etc.), another problem seems to be that academic researchers across the biomedical and social sciences genuinely interpret them quite poorly.

In a forthcoming paper, my colleague David Gal and I survey top academics across a wide variety of fields including the editorial board of Psychological Science and authors of papers published in the New England Journal of Medicine, the American Economic Review, and other top journals. We show:
[1] Researchers interpret p-values dichotomously (i.e., focus only on whether p is below or above 0.05).
[2] They fixate on them even when they are irrelevant (e.g., when asked about descriptive statistics).
[3] These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data.
We also show they ignore the magnitudes of effect sizes.

In case you have any interest, I am attaching the paper. Unfortunately, our data is presented in tabular format due to an insistent AE; thus, I am also attaching our supplement which presents the data graphically.

And here’s their key finding:

[Image: Screen Shot 2016-04-06 at 3.03.29 PM]

Bad news. I’m sure researchers would also misinterpret Bayesian inferences to the extent that they are being used in a null hypothesis significance testing framework. I think p-values are particularly bad because either they’re being used to make (generally meaningless) claims about hypotheses being true or false, or they’re being used as an indirect estimation tool, in which case they have that horrible nonlinear transformation that makes them so hard to interpret (the difference between significant and non-significant not being itself statistically significant and all that), which is part of the bigger problem that these numbers are just about impossible to interpret in the (real-world) scenario in which the null hypothesis is not precisely true.
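The point about the difference between significant and non-significant can be shown with a little arithmetic. The sketch below uses made-up numbers (two independent estimates of 25 and 10, each with standard error 10) and a normal approximation. None of these values come from the McShane and Gal paper, and `two_sided_p` is just a helper name chosen here.

```python
import math


def two_sided_p(estimate, std_error):
    """Two-sided p-value for a normal estimate against a null of zero."""
    z = abs(estimate) / std_error
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) for the standard normal.
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical, independent estimates (assumed values for illustration).
ESTIMATE_A, SE_A = 25.0, 10.0
ESTIMATE_B, SE_B = 10.0, 10.0

p_a = two_sided_p(ESTIMATE_A, SE_A)
p_b = two_sided_p(ESTIMATE_B, SE_B)

# The comparison that matters: the difference and its standard error.
difference = ESTIMATE_A - ESTIMATE_B
se_difference = math.hypot(SE_A, SE_B)
p_difference = two_sided_p(difference, se_difference)

if __name__ == "__main__":
    print(f"A: p = {p_a:.3f}")
    print(f"B: p = {p_b:.3f}")
    print(f"A - B: p = {p_difference:.3f}")
```

Read dichotomously, A "works" and B "doesn't," yet the data do not clearly distinguish the two; the estimates and standard errors say that directly, while the p-values hide it.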

Bayesian Inference with Stan for Pharmacometrics Class

Bob Carpenter, Daniel Lee, and Michael Betancourt will be teaching the 3-day class starting on 19 September in Paris. Following is the outline for the course:

Day 1

Introduction to Bayesian statistics

  • Likelihood / sampling distributions
  • Priors, Posteriors via Bayes’s rule
  • Posterior expectations and quantiles
  • Events as expectations of indicator functions

Introduction to Stan

  • Basic data types
  • Variable declarations
  • Constrained parameters and transforms to unconstrained
  • Program blocks and execution
  • Derived quantities
  • Built-in functions and operators
  • Statements: sampling, assignment, loops, conditionals, blocks
  • How to use Stan within R with RStan

Hands-on examples

Day 2

ODE and PK/PD Modeling

  • Parameters and data to ODEs
  • Non-stiff ODE solver
  • Stiff ODE solver
  • Control parameters and tolerances
  • Coupled ODE systems for sensitivities
  • Elimination half-lives

Inference with Markov chain Monte Carlo

  • Monte Carlo methods and plug-in inference
  • Markov chain Monte Carlo
  • Convergence diagnostics, R-hat, effective sample size
  • Effective sample size vs. number of iterations
  • Plug-in posterior expectations and quantiles
  • Event probability calculations

Hands-on examples

Day 3

Additional Topics in PK/PD Modeling

  • Bolus and infusion dosing
  • Lag time and absorption models
  • Linear versus Michaelis/Menten elimination
  • Hierarchical models for patient-level effects
  • Transit compartment models and time lags
  • Multi-compartment models and varying time scales
  • Joint PK/PD modeling: Bayes vs. “cut”
  • Meta-analysis
  • Formulating informative priors
  • Clinical trial simulations and power calculations

Stan programming techniques

  • Reproducible research practices
  • Probabilistic programming principles
  • Generated quantities for inference
  • Data simulation and model checking
  • Posterior predictive checks
  • Cross-validation and predictive calibration
  • Variable transforms for sampling efficiency
  • Multiple indexing and range slicing
  • Marginalizing discrete parameters
  • Handling missing data
  • Ragged and sparse data structures
  • Identifiability and problematic posteriors
  • Weakly informative priors

If you are in Europe in September, please come and join us. Thanks to Julie Bertrand and France Mentré from Université Paris Diderot for helping us organize the course.

You can register here.

Killer O


Taggert Brooks points to this excellent news article by George Johnson, who reports:

Epidemiologists have long been puzzled by a strange pattern in their data: People living at higher altitudes appear less likely to get lung cancer. . . . The higher you live, the thinner the air, so maybe oxygen is a cause of lung cancer. . . .

But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.

Back to the epidemiology. Researchers Kamen Simeonov and Daniel Himmelstein adjusted for a bunch of demographic and medical variables, and then:

After an examination of all these numbers for the residents of 260 counties in the Western United States, situated from sea level to nearly 11,400 feet, one pattern stood out: a correlation between the concentration of oxygen in the air and the incidence of lung cancer. For each 1,000-meter rise in elevation, there were 7.23 fewer lung cancer cases per 100,000 people.

“7.23” . . . that’s a bit overprecise, there’s no way you could know it to this level of accuracy. But I get the general idea.

As Brooks notes, this idea is not new. He links to a 1987 paper by Clarice Weinberg, Kenneth Brown, and David Hoel, who discussed “recent evidence implicating reactive forms of oxygen in carcinogenesis and atherosclerosis” and wrote that “reduced oxygen pressure of inspired air may be protective against certain causes of death.”

The idea has also hit the mass media. For example, from a 2012 article by Michael Corvinus in Cracked (yes, Cracked):

One of the disadvantages of living at higher altitudes is that there’s less oxygen in the air, which can suck for those with respiratory problems. One of the advantages of those places, however, is that … there’s less oxygen in the air. A lack of oxygen makes people’s bodies more efficient, which makes them live longer. . . . Dr. Benjamin Honigman at the University of Colorado School of Medicine theorized that the lower levels of oxygen force the body to become more efficient at distributing that oxygen, activating certain genes that enhance heart function and create new blood vessels for bringing blood to and from the heart, greatly lowering the chances of heart disease.

On deck this week

Mon: Killer O

Tues: More evidence that even top researchers routinely misinterpret p-values

Wed: What makes a mathematical formula beautiful?

Thurs: Fish cannot carry p-values

Fri: Does Benadryl make you senile? Challenges in research communication

Sat: What recommendations to give when a medical study is not definitive (which of course will happen all the time, especially considering that new treatments should be compared to best available alternatives, which implies that most improvements will be incremental at best)

Sun: Powerpose update

“Children seek historical traces of owned objects”

Recently in the sister blog:

An object’s mental representation includes not just visible attributes but also its nonvisible history. The present studies tested whether preschoolers seek subtle indicators of an object’s history, such as a mark acquired during its handling. Five studies with 169 children 3–5 years of age and 97 college students found that children (like adults) searched for concealed traces of object history, invisible traces of object history, and the absence of traces of object history, to successfully identify an owned object. Controls demonstrated that children (like adults) appropriately limit their search for hidden indicators when an owned object is visibly distinct. Altogether, these results demonstrate that concealed and invisible indicators of history are an important component of preschool children’s object concepts.

“The Dark Side of Power Posing”

Shravan points us to this post from Jay Van Bavel a couple years ago. It’s an interesting example because Bavel expresses skepticism about the “power pose” hype but he makes the same general mistake of Carney, Cuddy, Yap, and other researchers in this area in that he overreacts to every bit of noise that’s been p-hacked and published.

Here’s Bavel:

Some of the new studies used different analysis strategies than the original paper . . . but they did find that the effects of power posing were replicable, if troubling. People who assume high-power poses were more likely to steal money, cheat on a test and commit traffic violations in a driving simulation. In one study, they even took to the streets of New York City and found that automobiles with more expansive driver’s seats were more likely to be illegally parked. . . .

Dr. Brinol [sic] and his colleagues found that power posing increased self-confidence, but only among participants who already had positive self-thoughts. In contrast, power posing had exactly the opposite effect on people who had negative self-thoughts. . . .

In two studies, Joe Cesario and Melissa McDonald found that power poses only increased power when they were made in a context that indicated dominance. Whereas people who held a power pose while they imagined standing at an executive desk overlooking a worksite engaged in powerful behavior, those who held a power pose while they imagined being frisked by the police actually engaged in less powerful behavior. . . .

In a way I like all this because it shows how the capitalize-on-noise strategy which worked so well for the original power pose authors can also be used to dismantle the whole idea. So that’s cool. But from a scientific point of view, I think there’s so much noise here that any of these interactions could well go in the opposite direction. Not to mention all the unstudied interactions and all the interactions that happened not to be statistically significant in these particular small samples.

I’m not trying to slam Bavel here. The above-linked post was published in 2013, before we were all fully aware of how easy it was for researchers to get statistical significance from noise, even without having to try. Now we know better: just cos some correlation or interaction appears in a sample, we don’t have to think it represents anything in the larger population.

When do statistical rules affect drug approval?

Someone writes in:

I have MS and take a disease-modifying drug called Copaxone. Sandoz developed a generic version of Copaxone and filed for FDA approval. Teva, the manufacturer of Copaxone, filed a petition opposing that approval (surprise!). FDA rejected Teva’s petitions and approved the generic.

My insurance company encouraged me to switch to the generic. Specifically, they increased the copay for the non-generic from $50 to $950 per month. That got my attention. My neurologist recommended against switching to the generic.

Consequently, I decided to try to review the FDA decision to see if I could get any insight into the basis for my neurologist’s recommendation.

What appeared on first glance to be a telling criticism of the Teva submission was a reference by the FDA to “non-standard statistical criteria” together with the FDA’s statement that reanalysis with standard practices found different results than those found by Teva. So, I looked back at the Teva filing to identify the non-standard statistical criteria they used. If I found the right part of the Teva filing, they used R packages named ComBat and LIMMA, both empirical Bayes tools.

​Now, it is possible that I have made a mistake and have not properly identified the statistical criteria that the FDA found wanting. I was unable to find any specific statement w.r.t. the “non-standard” statistics.

But, if empirical Bayes works better than older methods, then falling back to older methods would result in weaker inferences—and the rejection of the data from Teva.

It seems to me that this case raises interesting questions about the adoption and use of empirical Bayes. How should the FDA have treated the “non-standard statistical criteria”? More generally, is there a problem with getting regulatory agencies to accept Bayesian models? Maybe there is some issue here that would be appropriate for a masters student in public policy.

My correspondent included some relevant documentation:

The FDA docket files are available at!docketBrowser;rpp=25;po=0;dct=SR;D=FDA-2015-P-1050​

The text below is from the April 15, 2015 FDA Denial Letter to Teva at pp. 41-42

​Specifically, we concluded that the mouse splenocyte studies were poorly designed, contained a high level of residual batch bias, and used non-standard statistical criteria for assessing the presence of differentially expressed genes. When FDA reanalyzed the microarray data from one Teva study using industry standard practices and criteria, Copaxone and the comparator (Natco) product were found to have very similar effects on the efficacy-related pathways proposed for glatiramer acetate’s mechanism of action.

​The image below is from the ​Teva Petition, July 2, 2014 at p. 60


And he adds:

My interest in this topic arose only because of my MS treatment—I have had no contact with Teva, Sandoz, or the FDA. And I approve of the insurance company’s action—that is, I think that creating incentives to encourage consumers to switch to generic medicines is usually a good idea.

I have no knowledge of any of this stuff, but the interaction of statistics and policy seems generally relevant so I thought I would share this with all of you.

Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

The celebrated medical-research reformer has a new paper (sent to me by Keith O’Rourke; official published version here), where he writes:

As EBM [evidence-based medicine] became more influential, it was also hijacked to serve agendas different from what it originally aimed for. Influential randomized trials are largely done by and for the benefit of the industry. Meta-analyses and guidelines have become a factory, mostly also serving vested interests. National and federal research funds are funneled almost exclusively to research with little relevance to health outcomes. We have supported the growth of principal investigators who excel primarily as managers absorbing more money.

He continues:

Diagnosis and prognosis research and efforts to individualize treatment have fueled recurrent spurious promises. Risk factor epidemiology has excelled in salami-sliced data-dredged papers with gift authorship and has become adept to dictating policy from spurious evidence. Under market pressure, clinical medicine has been transformed to finance-based medicine. In many places, medicine and health care are wasting societal resources and becoming a threat to human well-being. Science denialism and quacks are also flourishing and leading more people astray in their life choices, including health.

And concludes:

EBM still remains an unmet goal, worthy to be attained.

Read the whole damn thing.

Going beyond confidence intervals

Anders Lamberg writes:

In an article by Tom Siegfried, Science News, July 3 2014, “Scientists’ grasp of confidence intervals doesn’t inspire confidence” you are cited: “Gelman himself makes the point most clearly, though, that a 95 percent probability that a confidence interval contains the mean refers to repeated sampling, not any one individual interval.”

I have some simple questions that I hope you can answer. I am not a statistician but a biologist with only a basic education in statistics. My company works on surveillance of salmon populations in Norwegian rivers, and we have developed methods for counting all individuals in populations. We have moved from using estimates acquired from samples to actually counting all individuals in the populations. This is possible because the salmon migrate between the ocean and the rivers and often have to pass narrow parts of the rivers, where we use underwater video cameras to cover the whole cross section. In this way we “see” every individual and can categorize size, sex, etc. Another argument for counting all individuals is that our Atlantic salmon populations rarely exceed 3000 individuals (average of approx. 500), in contrast to Pacific salmon populations, where numbers are more in the range of 100 000 to more than a million.

In Norway we also have a large salmon farming industry where salmon are held in net pens in the sea. The problem is that these fish, which have been artificially selected for over 10 generations, are a threat to the natural populations if they escape and breed with the wild salmon. There is a concern that the “natural gene pool” will be diluted. That was only background for my questions, although the nature of the statistical problem is general for all sampling.

Here is the statistical problem: In a breeding population of salmon in a river there may be escapees from the fish farms. It is important to know the proportion of farmed escapees. If it exceeds 5 % in a given population, measures should be taken to reduce the number of farmed salmon in that river. But how can we find the real proportion of farmed salmon in a river? The method used for over 30 years now is to sample approximately 60 salmon from each river and count how many wild and how many farmed salmon you got in that sample. The total population may be 3000 individuals.

Only one sample is taken. A point estimate is calculated, along with a confidence interval for that estimate. In one realistic example we may sample 60 salmon and find that 6 of them are farmed fish. That gives a point estimate of 10 % farmed fish in the population of 3000 in that specific river. The 95% confidence interval will be from approximately 2% to 18%. Most commonly it is only the point estimate that is reported.

When I read your comment in the article cited in the start of this mail, I see that something must be wrong with this sampling procedure. Our confidence interval is linked to the sample and does not necessarily reflect the “real value” that we are interested in. As I see it now our point estimate acquired from only one sample does not give us much at all. We should have repeated the sampling procedure many times to get an estimate that is precise enough to say if we have passed the limit of 5% farmed fish in that population.

Can we use the one sample of 60 salmon in the example to say anything at all about the proportion of farmed salmon in that river? Can we use the point estimate 10%?

We have asked this question to the government, but they reply that it is more likely the real value lies near the 10% point estimate since the confidence interval has the shape of a normal distribution.

Is this correct?

As I see it the real value does not have to lie within the 95 % confidence interval at all. However, if we increase the sample size close to the population size, we will get a precise estimate. But, what happens when we use small samples and do not repeat?

My reply:

In this case, the confidence intervals seem reasonable enough (under the usual assumption that you are measuring a simple random sample). I suspect the real gains will come from combining estimates from different places and different times. A hierarchical model will allow you to do some smoothing.

Here’s an example. Suppose you sample 60 salmon in the same place each year and the numbers of farmed fish you see are 7, 9, 7, 6, 5, 8, 7, 2, 8, 7, … These data are consistent with there being a constant proportion of 10% farmed fish (indeed, I created these particular numbers using rbinom(10,60,.1) in R). On the other hand, if the numbers you see are 8, 12, 9, 5, 3, 11, 8, 0, 11, 9, … then this is evidence for real fluctuations. And of course if you see a series such as 5, 0, 3, 8, 9, 11, 9, 12, …, this is evidence for a trend. So you’d want to go beyond confidence intervals to make use of all that information. There’s actually a lot of work done using Bayesian methods in fisheries which might be helpful here.
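For readers who want to poke at these numbers, here is a rough sketch in Python. It is not the hierarchical model suggested above, just three quick summaries: the usual normal-approximation interval for 6 farmed fish out of 60, a Pearson chi-square statistic for whether yearly counts look like draws from one constant binomial proportion, and a least-squares slope as a crude trend check. It uses only the values actually listed in each series (the originals trail off with "…"), and the function names are ones made up for this sketch.

```python
import math
from statistics import NormalDist

SAMPLE_SIZE = 60  # salmon sampled per year, as in the example

# Only the values listed in the text; each series is truncated there.
CONSTANT = [7, 9, 7, 6, 5, 8, 7, 2, 8, 7]
FLUCTUATING = [8, 12, 9, 5, 3, 11, 8, 0, 11, 9]
TRENDING = [5, 0, 3, 8, 9, 11, 9, 12]


def wald_interval(farmed, sampled, level=0.95):
    """Normal-approximation interval for a proportion (simple random sample)."""
    p_hat = farmed / sampled
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / sampled)
    return p_hat - half_width, p_hat + half_width


def dispersion_statistic(counts, sampled):
    """Pearson chi-square statistic for a constant binomial proportion.

    Under a constant proportion it is roughly chi-square distributed
    with len(counts) - 1 degrees of freedom.
    """
    p_hat = sum(counts) / (len(counts) * sampled)
    expected = sampled * p_hat
    variance = sampled * p_hat * (1.0 - p_hat)
    return sum((c - expected) ** 2 for c in counts) / variance


def yearly_slope(counts):
    """Least-squares slope of the count on the year index 0, 1, 2, ..."""
    n = len(counts)
    mean_t = (n - 1) / 2.0
    mean_y = sum(counts) / n
    numerator = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(counts))
    denominator = sum((t - mean_t) ** 2 for t in range(n))
    return numerator / denominator


if __name__ == "__main__":
    low, high = wald_interval(6, 60)
    print(f"6 of 60: 95% interval = ({low:.3f}, {high:.3f})")
    for name, series in [("constant", CONSTANT), ("fluctuating", FLUCTUATING)]:
        stat = dispersion_statistic(series, SAMPLE_SIZE)
        print(f"{name}: chi-square = {stat:.1f} on {len(series) - 1} df")
    print(f"trending: slope = {yearly_slope(TRENDING):.2f} fish per year")
```

The interval reproduces the roughly 2% to 18% quoted in the email. With ten years of data the dispersion statistic has 9 degrees of freedom, so values far above 9 (the 95th percentile is about 16.9) point to more year-to-year variation than a constant proportion would produce. The first series falls well below that and the second above it.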

Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

This article by Tanner Sorensen, Sven Hohenstein, and Shravan Vasishth might be of interest to some of you.

No, Google will not “sway the presidential election”

Grrr, this is annoying. A piece of exaggerated science reporting hit PPNAS and was promoted in Politico, then Kaiser Fung and I shot it down (“Could Google Rig the 2016 Election? Don’t Believe the Hype”) in our Daily Beast column last September.

Then it appeared again this week in a news article in the Christian Science Monitor.

I know Christian Scientists believe in a lot of goofy things but I didn’t know that they’d fall for silly psychology studies.

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim—so, in that sense, I can’t hope for much more. What I really think is that Rosen should’ve read what Kaiser and I wrote, realized our criticisms were valid, and then have not wasted time reporting on the silly claim based on a huge, unrealistic manipulation in a highly artificial setting. But that would’ve involved shelving a promising story idea, and what reporter wants to do that?

The Christian Science Monitor reporter did link to our column and did note that we don’t buy the Google-can-sway-the-election claim. So I can’t really get upset about the reporting: if the reporter is not an expert on politics, it can be hard for him to judge what to believe.

Nonetheless, even though it’s not really the reporter’s fault, the whole event saddens me, in that it illustrates how ridiculous hype pays off. The original researchers did a little study which has some value but then they hyped it well beyond any reasonable interpretation (as their results came from a huge, unrealistic manipulation in a highly artificial setting), resulting in a ridiculous claim that Google can sway the presidential election. The hypesters got rewarded for their hype with media coverage. Which of course motivates more hype in the future. It’s a moral hazard.

I talked about this general problem a couple years ago, under the heading, Selection bias in the reporting of shaky research. It goes like this. Someone does a silly study and hypes it up. Some reporters realize right away that it’s ridiculous, others ask around and learn that it makes no sense, and they don’t bother reporting on it. Other reporters don’t know any better—that’s just the way it is, nobody can be an expert on everything—and they report on it. Hence the selection bias: The skeptics don’t waste their time writing about a bogus or over-hyped study; the credulous do. The net result is that the hype continues.

P.S. I edited the above post (striking through some material and replacing with two new paragraphs) in response to comments.

Moving statistical theory from a “discovery” framework to a “measurement” framework

Avi Adler points to this post by Felix Schönbrodt on “What’s the probability that a significant p-value indicates a true effect?” I’m sympathetic to the goal of better understanding what’s in a p-value (see for example my paper with John Carlin on type M and type S errors) but I really don’t like the framing in terms of true and false effects, false positives and false negatives, etc. I work in social and environmental science. And in these fields it almost never makes sense to me to think about zero effects. Real-world effects vary, they can be difficult to measure, and statistical theory can be useful in quantifying available information—that I agree with. But I don’t get anything out of statements such as “Prob(effect is real | p-value is significant).”

This is not a particular dispute with Schönbrodt’s work; rather, it’s a more general problem I have with setting up the statistical inference problem in that way. I have a similar problem with “false discovery rate,” in that I don’t see inferences (“discoveries”) as being true or false. Just for example, does the notorious “power pose” paper represent a false discovery? In a way, sure, in that the researchers were way overstating their statistical evidence. But I think the true effect of power pose has to be highly variable, and I don’t see the benefit of trying to categorize it as true or false.

Another way to put it is that I prefer to think of statistics via a “measurement” paradigm rather than a “discovery” paradigm. Discoveries and anomalies do happen—that’s what model checking and exploratory data analysis are all about—but I don’t really get anything out of the whole true/false thing. Hence my preference for looking at type M and type S errors, which avoid having to worry about whether some effect is zero.

That all said, I know that many people like the true/false framework so you can feel free to follow the above link and see what Schönbrodt is doing.
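To make the type M and type S idea concrete, here is a minimal sketch in Python. It assumes a positive true effect and a normally distributed estimate with known standard error; the function name `retrodesign` and its arguments are illustrative choices, not code taken from the post or the linked paper. The type S rate is the probability that a statistically significant estimate has the wrong sign, and the type M ratio is the expected factor by which a significant estimate exaggerates the true effect. Neither quantity requires asking whether the effect is exactly zero.

```python
import random
from statistics import NormalDist, fmean


def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
    """Return (power, type S rate, type M ratio) under a normal approximation.

    Assumes true_effect > 0 (hypothetical value supplied by the user)
    and that the estimate is distributed Normal(true_effect, se).
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value

    # Probability of a significant estimate with the correct (positive) sign
    p_hi = 1 - std_normal.cdf(z - true_effect / se)
    # Probability of a significant estimate with the wrong (negative) sign
    p_lo = std_normal.cdf(-z - true_effect / se)

    power = p_hi + p_lo
    type_s = p_lo / power

    # Type M: simulate estimates and average |estimate| over the significant ones
    rng = random.Random(seed)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [abs(e) for e in estimates if abs(e) > z * se]
    type_m = fmean(significant) / true_effect

    return power, type_s, type_m


if __name__ == "__main__":
    # A small effect measured noisily: low power, frequent sign errors,
    # and large exaggeration among the significant results.
    power, type_s, type_m = retrodesign(true_effect=0.1, se=1.0)
    print(f"power={power:.3f}  type S={type_s:.3f}  type M={type_m:.1f}")
```

With a small effect relative to the standard error, the sketch shows that "significant" results are both likely to have the wrong sign and certain to be large overestimates, which is the measurement-oriented concern the post describes.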

On deck this week

Mon: Moving statistical theory from a “discovery” framework to a “measurement” framework

Tues: Bayesian Linear Mixed Models using Stan: A tutorial for psychologists, linguists, and cognitive scientists

Wed: Going beyond confidence intervals

Thurs: Ioannidis: “Evidence-Based Medicine Has Been Hijacked”

Fri: What’s powdery and comes out of a metallic-green cardboard can?

Sat: “The Dark Side of Power Posing”

Sun: “Children seek historical traces of owned objects”

“Pointwise mutual information as test statistics”

Christian Bartels writes:

Most of us will probably agree that making good decisions under uncertainty based on limited data is highly important but remains challenging.

We have decision theory that provides a framework to reduce risks of decisions under uncertainty with typical frequentist test statistics being examples for controlling errors in absence of prior knowledge. This strong theoretical framework is mainly applicable to comparatively simple problems. For non-trivial models and/or if there is only limited data, it is often not clear how to use the decision theory framework.

In practice, careful iterative model building and checking seems to be the best that can be done – be it using Bayesian methods or applying “frequentist” approaches (here, in this particular context, “frequentist” seems often to be used as implying “based on minimization”).

As a hobby, I tried to expand the armory for decision making under uncertainty with complex models, focusing on trying to expand the reach of decision theoretic, frequentist methods. Perhaps at one point in the future, it will become possible to bridge the existing, good pragmatic approaches into the decision theoretical framework.

So far:

– I evaluated an efficient integration method for repeated evaluation of statistical integrals (e.g., p-values) for a set of hypotheses. Key to the method was the use of importance sampling. See here.

– I proposed pointwise mutual information as an efficient test statistic that is optimal under certain considerations. The commonly used alternative is the likelihood ratio test, which, in the limit where asymptotics are not valid, is annoyingly inefficient since it requires repeated minimizations of randomly generated data.
Bartels, Christian (2015): Generic and consistent confidence and credible regions.

More work is required, in particular:

– Dealing with nuisance parameters

– Including prior information.

Working on these aspects, I would appreciate feedback on what exists so far, in general, and on the proposal of using the pointwise mutual information as test statistics, in particular.

I have nothing to add here. The topic is important so I thought this was worth sharing.
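For readers unfamiliar with the importance-sampling idea mentioned in the email, here is a generic sketch in Python. It is not Bartels’s method; it only illustrates the basic technique on a toy problem chosen for this example: estimating a small tail probability (a one-sided p-value) for a standard normal test statistic. The function name `tail_prob_importance` and the choice of a proposal distribution centered at the threshold are assumptions made for illustration.

```python
import random
from statistics import NormalDist


def tail_prob_importance(threshold, n_draws=100_000, seed=0):
    """Estimate P(Z > threshold) for Z ~ Normal(0, 1) by importance sampling.

    Draws come from a proposal Normal(threshold, 1), so roughly half of them
    land in the tail of interest. Each draw in the tail is weighted by
    target density / proposal density to correct for the shifted sampling.
    """
    rng = random.Random(seed)
    target = NormalDist(0.0, 1.0)
    proposal = NormalDist(threshold, 1.0)

    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            total += target.pdf(x) / proposal.pdf(x)
    return total / n_draws


if __name__ == "__main__":
    estimate = tail_prob_importance(4.0)
    exact = 1 - NormalDist().cdf(4.0)
    print(f"importance sampling: {estimate:.3e}   exact: {exact:.3e}")
```

Naive simulation from the standard normal would need millions of draws to see even a handful of values above 4, whereas the shifted proposal concentrates the draws where the integral has its mass. That efficiency is what makes repeated evaluation of such integrals over many hypotheses feasible.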

You can post social science papers on the new SocArxiv

I learned about it from this post by Elizabeth Popp Berman.

The temporary SocArxiv site is here. It is connected to the Open Science Framework, which we’ve heard a lot about in discussions of preregistration.

You can post your papers at SocArxiv right away following these easy steps:

Send an email to the following address(es) from the email account you would like used on the OSF:

For Preprints, email
The format of the email should be as follows:

Subject: Preprint title
Message body: Preprint abstract
Attachment: Your preprint file (e.g., .docx, PDF, etc.)

It’s super-easy, actually much much easier than submitting to Arxiv. I assume that Arxiv has good reasons for its more elaborate submission process, but for now I found SocArxiv’s no-frills approach very pleasant.

I tried it out by sending a few papers, and it worked just fine. I’m already happy because I was able to upload my hilarious satire article with Jonathan Falk. (Here’s the relevant SocArxiv page.) When I tried to post that article on Arxiv last month, they rejected it as follows:

On Jun 16, 2016, at 12:17 PM, arXiv Moderation wrote:

Your submission has been removed. Our volunteer moderators determined that your article does not contain substantive research to merit inclusion within arXiv. Please note that our moderators are not referees and provide no reviews with such decisions. For in-depth reviews of your work you would have to seek feedback from another forum.

Please do not resubmit this paper without contacting arXiv moderation and obtaining a positive response. Resubmission of removed papers may result in the loss of your submission privileges.

For more information on our moderation policies see:

And the followup:

Dear Andrew Gelman,

Our moderators felt that a follow up should be made to point out arXiv only accepts articles that would be refereeable by a conventional publication venue. Submissions that contain inflammatory or fictitious content or that use highly dramatic and mis-representative titles/abstracts/introductions may be removed. Repeated submissions of inflammatory or highly dramatic content may result in the suspension of submission privileges.

This kind of annoyed me because the only reason my article with Falk would not be refereeable by a conventional publication venue is because of all our jokes. Had we played it straight and pretended we were doing real research, we could’ve had a good shot at Psych Science or PPNAS. So we were, in effect, penalized for our honesty in writing a satire rather than a hoax.

As my coauthor put it, the scary thing is how close our silly paper actually is to a publishable article, not how far.

Also, I can’t figure out how Arxiv’s rules were satisfied by this 2015 paper, “It’s a Trap: Emperor Palpatine’s Poison Pill,” which is more fictitious than ours, also includes silly footnotes, etc.

Anyway, I don’t begrudge Arxiv their gatekeeping. Arxiv is great great great, and I’m not at all complaining about their decision not to publish our funny article. Their site, their rules. Indeed, I wonder what will happen if someone decides to bomb SocArxiv with fake papers. At some point, a human will need to enter the loop, no?

For now, though, I think it’s great that there’s a place where everyone can post their social science papers.

Bigmilk strikes again

Paul Alper sends along this news article by Kevin Lomagino, Earle Holland, and Andrew Holtz on the dairy-related corruption in a University of Maryland research study on the benefits of chocolate milk (!).

The good news is that the university did not stand behind its ethically-challenged employee. Instead:

“I did not become aware of this study at all until after it had become a news story,” Patrick O’Shea, UMD’s Vice President and Chief Research Officer, said in a teleconference. He says he took a look at both the chocolate milk and concussions news release and an earlier one comparing the milk to sports recovery drinks. “My reaction was, ‘This just doesn’t seem right. I’m not sure what’s going on here, but this just doesn’t seem right.’”

Back when I was a student there, we called it UM. I wonder when they changed it to UMD?

Also this:

O’Shea said in a letter that the university would immediately take down the release from university websites, return some $200,000 in funds donated by dairy companies to the lab that conducted the study, and begin implementing some 15 recommendations that would bring the university’s procedures in line with accepted norms. . . .

Dr. Shim’s lab was the beneficiary of large donations from Allied Milk Foundation, which is associated with First Quarter Fresh, the company whose chocolate milk was being studied and favorably discussed in the UMD news release.

Also this from a review committee:

There are simply too many uncontrolled variables to produce meaningful scientific results.

Wow—I wonder what Harvard Business School would say about this, if this criterion were used to judge some of its most famous recent research?

And this:

The University of Maryland says it will never again issue a news release on a study that has not been peer reviewed.

That seems a bit much. I think peer review is overrated, and if a researcher has some great findings, sure, why not do the press release? The key is to have clear lines of responsibility. And I agree with the University of Maryland on this:

The report found that while the release was widely circulated prior to distribution, nobody knew for sure who had the final say over what it could claim. “There is no institutional protocol for approval of press releases and lines of authority are poorly defined,” according to the report. It found that Dr. Shim was given default authority over the news release text, and that he disregarded generally accepted standards as to when study results should be disseminated in news releases.

Now we often seem to have the worst of both worlds, with irresponsible researchers making extravagant and ill-founded claims and then egging on press agents to make even more extreme statements. Again, peer review has nothing to do with it. There is a problem with press releases that nobody is taking responsibility for.