Skip to content

The meta-hype algorithm


Kevin Lewis pointed me to this article:

There are several methods for building hype. The wealth of currently available public relations techniques usually forces the promoter to judge, a priori, what will likely be the best method. Meta-hype is a methodology that facilitates this decision by combining all identified hype algorithms pertinent for a particular promotion problem. Meta-hype generates a final press release that is at least as good as any of the other models considered for hyping the claim. The overarching aim of this work is to introduce meta-hype to analysts and practitioners. This work compares the performance of journal publication, preprints, blogs, twitter, Ted talks, NPR, and meta-hype to predict successful promotion. A nationwide database including 89,013 articles, tweets, and news stories. All algorithms were evaluated using the total publicity value (TPV) in a test sample that was not included in the training sample used to fit the prediction models. TPV for the models ranged between 0.693 and 0.720. Meta-hype was superior to all but one of the algorithms compared. An explanation of meta-hype steps is provided. Meta-hype is the first step in targeted hype, an analytic framework that yields double hyped promotion with fewer assumptions than the usual publicity methods. Different aspects of meta-hype depending on the context, its function within the targeted promotion framework, and the benefits of this methodology in the addiction to extreme claims are discussed.

I can’t seem to find the link right now, but you get the idea.

Would you prefer three N=300 studies or one N=900 study?

Stephen Martin started off with a question:

I’ve been thinking about this thought experiment:

Imagine you’re given two papers.
Both papers explore the same topic and use the same methodology. Both were preregistered.
Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).
Paper B has a novel study with confirmed hypotheses (n=900).
*Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

I’m reasonably certain the answer is that both papers provide the same amount of evidence, by essentially the likelihood principle, and if anything, one should trust the estimates of paper B more (unless you meta-analyzed paper A, which should give you the same answer as paper B, more or less).

However, my intuition was correct that most people in this group would choose paper A (See for poll results).

My reasoning is that if you are observing data from the same DGP, then where you cut the data off is arbitrary; why would flipping a coin 10x, 10x, 10x, 10x, 10x provide more evidence than flipping the coin 50x? The method in paper A essentially just collected 300, drew a line, collected 300, drew a line, then collected 300 more, and called them three studies; this has no more information in sum (in a fisherian sense, the information would just add together) than if you didn’t arbitrarily cut the data into sections.

If you read in the comments of this group (which has researchers predominantly of the NHST world), one sees this fallacy that merely by passing a threshold more times means you have more evidence. They use p*p*p to justify it (even though that doesn’t make sense, because one could partition the data into 10 n=90 sets and get ‘more evidence’ by this logic; in fact, you could have 90 p-values of ~.967262 and get a p-value of .05). They use fisher’s method to say the p-value could be low (~.006), even though when combined, the p-value would actually be even lower (~.0007). One employs only Neyman-Pearson logic, and this results in a t1 error probability of .05^3.

I replied:

What do you mean by “confirmed hypotheses,” and what do you mean by a replication being “successful”? And are you assuming that the data are identical in the two scenarios?

To which Martin answered:

I [Martin], in a sense, left it ambiguous because I suspected that knowing nothing else, people would put paper A, even though asymptotically it should provide the same information as paper B.

I also left ‘confirmed hypothesis’ vague, because I didn’t want to say one must use one given framework. Basically, the hypotheses were supported by whatever method one uses to judge support (whether it be p-values, posteriors, bayes factors, whatever).

Successful replication as in, the hypotheses were supported again in the replication studies.

Finally, my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.

That said, if you are experimenter A gaining three n=300 samples, your data should asymptotically (or, over infinite datasets) equal that of experimenter B gaining one n=900 sample (over infinite datasets), in the sense that the expected total information is equal, and the accumulated evidence should be equal. Therefore, even if any given two papers have different datasets, asymptotically they should provide equal information, and there’s not a good reason to prefer three smaller studies over 1 larger one.

Yet, knowing nothing else, people assumed paper A, I think, because three studies is more intuitively appealing than one large study, even if the two could be interchangeable had you divided the larger sample into three, or combined the smaller samples into 1.

From my perspective, Martin’s question can’t really be answered because I don’t know what’s in papers A and B, and I don’t know what is meant by a replication being “successful.” I think the answer depends a lot on these pieces of information, and I’m still not quite sure what Martin’s getting at here. But maybe some of you have thoughts on this one.

Drug-funded profs push drugs

Someone who wishes to remain anonymous writes:

I just read a long ProPublica article that I think your blog commenters might be interested in. It’s from February, but was linked to by the Mad Biologist today ( Here is a link to the article:

In short, it’s about a group of professors (mainly economists) who founded a consulting firm that works for many big pharma companies. They publish many peer-reviewed articles, op-eds, blogs, etc on the debate about high pharmaceutical prices, always coming to the conclusion that high prices are a net benefit (high prices -> more innovation -> better treatments in the future vs poor people having no access to existing treatment today). They also are at best very inconsistent about disclosing their affiliations and funding.

One minor thing that struck me is the following passage, about their response to a statistical criticism of one of their articles:

The founders of Precision Health Economics defended their use of survival rates in a published response to the Dartmouth study, writing that they “welcome robust scientific debate that moves forward our understanding of the world” but that the research by their critics had “moved the debate backward.”

The debate here appears to be about lead-time bias – increased screening leads to earlier detection which can increase survival rates without actually improving outcomes. So on the face it doesn’t seem like an outrageous criticism. If they have controlled it appropriately, they should have a “robust debate” so they can convince their critics and have more support for increasing drug prices! Of course I doubt they have any interest in actually having this debate. It seems similar to the responses you get from Wansink, Cuddy (or the countless other researchers promoting flawed studies who have been featured on your blog) when they are confronted with valid criticism: sound reasonable, do nothing, and get let off the hook.

This interests me because I consult for pharmaceutical companies. I don’t really have anything to add, but this sort of conflict of interest does seem like something to worry about.

We talk a lot on this blog about bad science that’s driven by some combination of careerism and naivite. We shouldn’t forget about the possibility of flat-out corruption.

Journals for insignificant results

Tom Daula writes:

I know you’re not a fan of hypothesis testing, but the journals in this blog post are an interesting approach to the file drawer problem. I’ve never heard of them or their like. An alternative take (given academia standard practice) is “Journal for XYZ Discipline papers that p-hacking and forking paths could not save.”

Psychology: Journal of Articles in Support of the Null Hypothesis

Biomedicine: Journal of Negative Results in Biomedicine

Ecology and Evolutionary Biology: Journal of Negative Results

In psychology, this sort of journal isn’t really needed because we already have PPNAS, where they publish articles in support of the null hypothesis all the time, they just don’t realize it!

OK, ok, all jokes aside, the above post recommends:

Is it time for Economics to catch up? . . . a number of prominent Economists have endorsed this idea (even if they are not ready to pioneer the initiative). So, imagine… a call for papers along the following lines:

Series of Unsurprising Results in Economics (SURE)

Is the topic of your paper interesting, your analysis carefully done, but your results are not “sexy”? If so, please consider submitting your paper to SURE. An e-journal of high-quality research with “unsurprising” findings.
How does it work:
— We accept papers from all fields of Economics…
— Which have been rejected at a journal indexed in EconLit…
— With the ONLY important reason being that their results are statistically insignificant or otherwise “unsurprising”.

I can’t imagine this working. Why not just publish everything on SSRN or whatever, and then this SURE can just link to the articles in question (along with the offending referee reports)?

Also, I’m reminded of the magazine McSweeney’s, which someone once told me had been founded based on the principle of publishing stories that had been rejected elsewhere.

Teaching Statistics: A Bag of Tricks (second edition)

Hey! Deb Nolan and I finished the second edition of our book, Teaching Statistics: A Bag of Tricks. You can pre-order it here.

I love love love this book. As William Goldman would say, it’s the “good parts version”: all the fun stuff without the standard boring examples (counting colors of M&M’s, etc.). Great stuff for teaching, also I’ve been told that’s a fun read for students of statistics.

Here’s the table of contents. If this doesn’t look like fun to you, don’t buy the book.

Representists versus Propertyists: RabbitDucks – being good for what?


It is not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability,etc.) for every sample, or some samples or the same distribution of outputs or the same expectations of outputs or just close enough expectations of outputs. Then, I would argue one has a variation on a DuckRabbit. In the DuckRabbit, the same sign represents different objects with different interpretations (what to make of it) whereas here we have differing signs (models) representing the same object (an inference of interest) with different interpretations (what to make of them). I will imaginatively call this a RabbitDuck ;-)

Does one always choose a Rabbit or a Duck, or sometimes one or another or always both? I would argue the higher road is both – that is to use differing models to collect and consider the  different interpretations. Multiple perspectives can always be more informative (if properly processed), increasing our hopes to find out how things actually are by increasing the chances and rate of getting less wrong. Though this getting less wrong is in expectation only – it really is an uncertain world.

Of course, in statistics a good guess for the Rabbit interpretation would be Bayesian and for the Duck, Frequentest (Canadian spelling). Though, as one of Andrew’s colleagues once argued it is really modellers versus non modellers rather than Bayesians versus Frequentests and that makes a lot of sense to me. Representists are Rabbits “conjecturing, assessing, and adopting idealized representations of reality, predominantly using probability generating models for both parameters and data” while  Propertyists are Ducks “primarily being about discerning procedures with good properties that are uniform over a wide range of possible underlying realities and restricting use, especially in science, to just those procedures” from here.  Given that “idealized representations of reality” can only be indirectly checked (i.e. always remain possibly wrong) and “good properties” always beg the question “good for what?” (as well as only hold over a range of possible but largely unrepresented realities) – it should be a no brainer? that would it be more profitable than not to thoroughly think through both perspectives (and more actually).

An alternative view might be Leo Breiman’s “two cultures” paper.

This issue of multiple perspectives also came up in Bob’s recent post where the possibility arose that some might think it taboo to mix Bayes and Frequentist perspectives.

Some case studies would be:  Continue reading ‘Representists versus Propertyists: RabbitDucks – being good for what?’ »

My proposal for JASA: “Journal” = review reports + editors’ recommendations + links to the original paper and updates + post-publication comments

Whenever they’ve asked me to edit a statistics journal, I say no thank you because I think I can make more of a contribution through this blog. I’ve said no enough times that they’ve stopped asking me. But I’ve had an idea for awhile and now I want to do it.

I think that journals should get out of the publication business and recognize that their goal is curation. My preferred model is that everything gets published on some sort of super-Arxiv, and then the role of an organization such as the Journal of the American Statistical Association is to pick papers to review and to recommend. The “journal” is then the review reports plus the editors’ recommendations plus links to the original paper and any updates plus post-publication comments.

If JASA is interested in going this route, I’m in.

My talk this Friday in the Machine Learning in Finance workshop

This is kinda weird because I don’t know anything about machine learning in finance. I guess the assumption is that statistical ideas are not domain specific. Anyway, here it is:

What can we learn from data?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

The standard framework for statistical inference leads to estimates that are horribly biased and noisy for many important examples. And these problems all even worse as we study subtle and interesting new questions. Methods such as significance testing are intended to protect us from hasty conclusions, but they have backfired: over and over again, people think they have learned from data but they have not. How can we have any confidence in what we think we’ve learned from data? One appealing strategy is replication and external validation but this can be difficult in the real world of social science. We discuss statistical methods for actually learning from data without getting fooled.

Reputational incentives and post-publication review: two (partial) solutions to the misinformation problem

So. There are erroneous analyses published in scientific journals and in the news. Here I’m not talking not about outright propaganda, but about mistakes that happen to coincide with the preconceptions of their authors.

We’ve seen lots of examples. Here are just a few:

– Political scientist Larry Bartels is committed to a model of politics in which voters make decisions based on irrelevant information. He’s published claims about shark attacks deciding elections and subliminal smiley faces determining attitudes about immigration. In both cases, second looks by others showed that the evidence wasn’t really there. I think Bartels was sincere; he just did sloppy analyses—statistics is hard!—and jumped to conclusions that supported his existing views.

– New York Times columnist David Brooks has a habit of citing statistics that fall apart under closer inspection. I think Brooks believes these things when he writes them—OK, I guess he never really believed that Red Lobster thing, he must really have been lying exercising poetic license on that one—but what’s important is that these stories work to make his political points, and he doesn’t care when they’re proved wrong.

– David’s namesake and fellow NYT op-ed columnist Arthur Brooks stepped in it one or twice when reporting survey data. He wrote that Tea Party supporters were happier than other voters, but a careful look at the data suggested the opposite. A. Brooks’s conclusions were counterintuitive and supported his political views; they just didn’t happen to line up with reality.

– The familiar menagerie from the published literature in social and behavioral sciences: himmicanes, air rage, ESP, ages ending in 9, power pose, pizzagate, ovulation and voting, ovulation and clothing, beauty and sex ratio, fat arms and voting, etc etc.

– Gregg Easterbrook writing about politics.

And . . . we have a new one. A colleague emailed me expressing annoyance at a recent NYT op-ed by historian Stephanie Coontz entitled, “Do Millennial Men Want Stay-at-Home Wives?”

Emily Beam does the garbage collection. The short answer is that, no, there’s no evidence that millennial men want stay-at-home wives. Here’s Beam:

You can’t say a lot about millennials based on talking to 66 men.

The GSS surveys are pretty small – about 2,000-3,000 per wave – so once you split by sample, and then split by age, and then exclude the older millennials (age 26-34) who don’t show any negative trend in gender equality, you’re left with cells of about 60-100 men ages 18-25 per wave. . . .

Suppose you want to know whether there is a downward trend in young male disagreement with the women-in-the-kitchen statement. Using all available GSS data, there is a positive, not statistically significant trend in men’s attitudes (more disagreement). Starting in 1988 only, there is very, very small negative, not statistically significant effect.

Only if we pick 1994 as a starting point, as Coontz does, ignoring the dip just a few years prior, do we see a negative less-than half-percentage point drop in disagreement per year, significant at the 10-percent level.

To Coontz’s (or the NYT’s) credit, they followed up with a correction, but it’s half-assed:

The trend still confirms a rise in traditionalism among high school seniors and 18-to-25-year-olds, but the new data shows that this rise is no longer driven mainly by young men, as it was in the General Social Survey results from 1994 through 2014.

And at this point I have no reason to believe anything that Coontz says on this topic, any more than I’d trust what David Brooks has to say about high school test scores or the price of dinner at Red Lobster, or Arthur Brooks on happiness measurements, or Susan Fiske on himmicanes, power pose, and air rage. All these people made natural mistakes but then were overcommitted, in part I suspect because the mistaken analyses what they’d like to think is true.

But it’s good enough for the New York Times, or PPNAS, right?

The question is, what to do about it. Peer review can’t be the solution: for scientific journals, the problem with peer review is the peers, and when it comes to articles in the newspaper, there’s no way to do systematic review. The NYT can’t very well send all their demography op-eds to Emily Beam and Jay Livingston, after all. Actually, maybe they could—it’s not like they publish so many op-eds on the topic—but I don’t think this is going to happen.

So here are two solutions:

1. Reputational incentives. Make people own their errors. It’s sometimes considered rude to do this, to remind people that Satoshi Kanazawa Satoshi Kanazawa Satoshi Kanazawa published a series of papers that were dead on arrival because the random variation in his data was so much larger than any possible signal. Or to remind people that Amy Cuddy Amy Cuddy Amy Cuddy still goes around promoting power pose even thought the first author on that paper had disowned the entire thing. Or that John Bargh John Bargh John Bargh made a career out of a mistake and now refuses to admit his findings didn’t replicate. Or that David Brooks David Brooks David Brooks reports false numbers and then refused to correct them. Or that Stephanie Coontz Stephanie Coontz Stephanie Coontz jumped to conclusions based on a sloppy reading of trends from a survey.

But . . . maybe we need these negative incentives. If there’s a positive incentive for getting your name out there, there should be a negative incentive for getting it wrong. I’m not saying the positive and negative incentives should be equal, just that there more of a motivation for people to check what they’re doing.

And, yes, don’t keep it a secret that I published a false theorem once, and, another time, had to retract the entire empirical section of a published paper because we’d reverse-coded a key variable in our analysis.

2. Post-publication review.

I’ve talked about this one before. Do it for real, in scientific journals and also the newspapers. Correct your errors. And, when you do so, link to the people who did the better analyses.

Incentives and post-publication review go together. To the extent that David Brooks is known as the guy who reports made-up statistics and then doesn’t correct them—if this is his reputation—this gives the incentives for future Brookses (if not David himself) to prominently correct his mistakes. If Stephanie Coontz and the New York Times don’t want to be mocked on twitter, they’re motivated to follow up with serious corrections, not minimalist damage control.

Some perspective here

Look, I’m not talking about tarring and feathering here. The point is that incentives are real; they already exist. You really do (I assume) get a career bump from publishing in Psychological Science and PPNAS, and your work gets more noticed if you publish an op-ed in the NYT or if you’re featured on NPR or Ted or wherever. If all incentives are positive, that creates problems. It creates a motivation for sloppy work. It’s not that anyone is trying to do sloppy work.

Econ gets it (pretty much) right

Say what you want about economists, but they’ve got this down. First off, they understand the importance of incentives. Second, they’re harsh, harsh critics of each other. There’s not much of an econ equivalent to quickie papers in Psychological Science or PPNAS. Serious econ papers go through tons of review. Duds still get through, of course (even some duds in PPNAS). But, overall, it seems to me that economists avoid what might be called the “happy talk” problem. When an economist publishes something, he or she tries to get it right (politically-motivated work aside), in awareness that lots of people are on the lookout for errors, and this will rebound back to the author’s reputation.

Donald Trump’s nomination as an unintended consequence of Citizens United

The biggest surprise of the 2016 election campaign was Donald Trump winning the Republican nomination for president.

A key part of the story is that so many of the non-Trump candidates stayed in the race so long because everyone thought Trump was doomed, so they were all trying to grab Trump’s support when he crashed. Instead, Trump didn’t crash, and he benefited from the anti-Trump forces not coordinating on an alternative.

David Banks shares a theory of how it was that these candidates all stayed in so long:

I [Banks] see it as an unintended consequence of Citizens United. Before that [Supreme Court] decision, the $2000 cap on what individuals/corporations could contribute largely meant that if a candidate did not do well in one of the first three primaries, they pretty much had to drop out and their supporters would shift to their next favorite choice. But after Citizens United, as long as a candidate has one billionaire friend, they can stay in the race through the 30th primary if they want. And this is largely what happened. Trump regularly got the 20% of the straight-up crazy Republican vote, and the other 15 candidates fragmented the rest of the Republicans for whom Trump was the least liked candidate. So instead of Rubio dropping out after South Carolina and his votes shifting over to Bush, and Fiorino dropping out and her votes shifting to Bush, so that Bush would jump from 5% to 10% to 15% to 20% to 25%, etc., we wound up very late in the primaries with Trump looking like the most dominant candidate to field.

Of course, things are much more complex than this facile theory suggests, and lots of other things were going on in parallel. But it still seems to me that this partly explains how Trump threaded the needle to get the Republican nomination.

Interesting. I’d not seen this explanation before so I thought I’d share it with you.

Fitting hierarchical GLMs in package X is like driving car Y

Given that Andrew started the Gremlin theme (the car in the image at the right), I thought it would only be fitting to link to the following amusing blog post:

It’s exactly what it says on the tin. I won’t spoil the punchline, but will tell you the packages considered are: lme4, JAGS, RStan(Arm), and INLA.

What do you think?

Anyway, don’t take my word for it—read the original post. I’m curious about others’ take on systems for fitting GLMs and how they compare (to cars or otherwise).

You might also like…

My favorite automative analogy was made in the following essay, from way back in the first dot-com boom:

Although it’s about operating systems, the satirical take on closed- vs. open-source is universal.

(Some of) what I thought

Chris Brown reports in the post,

I simulated a simple hierarchical data-set to test each of the models. The script is available here. The data-set has 100 binary measurements. There is one fixed covariate (continuous) and one random effect with five groups. The linear predictor was transformed to binomial probabilities using the logit function. For the Bayesian approaches, slightly different priors were used for each package, depending on what was available. See the script for more details on priors.

Apples and oranges. This doesn’t make a whole lot of sense, given that lme4 is giving you max marginal likelihood, whereas JAGS and Stan give you full Bayes. And if you use different priors in Stan and JAGS, you’re not even fitting the same posterior. I’m afraid I’ve never understood INLA (lucky for me Dan Simpson’s visiting us this week, so there’s no time like the present to learn it). You’ll also find that relative performance of Stan and JAGS will vary dramatically based on the shape of the posterior and scale of the data.

It’s all about effective sample size. The author doesn’ tmention the subtlety of choosing a way to estimate effective sample size (RStan’s is more conservative than the Coda package, using a variance approach like that of the split R-hat we use to detect convergence problems in RStan).

Random processes are hard to compare. You’ll find a lot of variation across runs with different random inits. You really want to start JAGS and Stan at the same initial points and run to the same effective sample size over multiple runs and compare averages and variation.

RStanArm, not RStan. I looked at the script, and it turns out the post is comparing RStanArm, not coding a model in Stan itself and running it in RStan. Here’s the code.

t_prior <- student_t(df = 4, location = 0, scale = 2.5)
mb.stanarm <- microbenchmark(mod.stanarm <- stan_glmer(y ~ x + (1|grps),
                                                       data = dat,
                                                       family = binomial(link = 'logit'),
                                                       prior = t_prior,
                                                       prior_intercept = t_prior,
                                                       chains = 3, cores = 1, seed = 10),
                             times = 1L)

Parallelization reduces wall time. This script runs RStanArm three Markov chains on a single core, meaning they have to run one after the other. This can obviously be sped up by the up to the number of cores you have and letting them all run at the same time. Presuambly JAGS could be sped up the same way. The multiple chains are embarassingly parallelizable, after all.

It's hard to be fair! There's a reason we don't do a lot of these comparisons ourselves!

“Do you think the research is sound or is it gimmicky pop science?”

David Nguyen writes:

I wanted to get your opinion on Do you think the research is sound or is it gimmicky pop science?

My reply: I have no idea. But since I see no evidence on the website, I’ll assume it’s pseudoscience until I hear otherwise. I won’t believe it until it has the endorsement of Susan T. Fiske.

P.S. Oooh, that Fiske slam was so unnecessary, you say. But she still hasn’t apologized for falling asleep on the job and greenlighting himmicanes, air rage, and ages ending in 9.

Organizations that defend junk science are pitiful suckers get conned and conned again

So. Cornell stands behind Wansink, and Ohio State stands behind Croce. George Mason University bestows honors on Weggy. Penn State trustee disses “so-called victims.” Local religious leaders aggressively defend child abusers in their communities. And we all remember how long it took for Duke University to close the door on Dr. Anil Potti.

OK, I understand all these situations. It’s the sunk cost fallacy: you’ve invested a lot of your reputation in somebody; you don’t want to admit that, all this time, they’ve been using you.

Still, it makes me sad.

These organizations—Cornell, Ohio State, etc.—are victims as much as perpetrators. Wansink, Croce, etc., couldn’t have done it on their own: in their quest to illegitimately extract millions of corporate and government dollars, they made use of their prestigious university affiliations. A press release from a “Cornell professor” sounds so much more credible than a press release from some fast-talking guy with a P.O. box.

Cornell, Ohio State, etc., they’ve been played, and they still don’t realize it.

Remember, a key part of the long con is misdirection: make the mark think you’re his friend.

Causal inference conference in North Carolina

Michael Hudgens announces:

Registration for the 2017 Atlantic Causal Inference Conference is now open. The registration site is here. More information about the conference, including the poster session and the Second Annual Causal Inference Data Analysis Challenge can be found on the conference website here.

We held the very first Atlantic Causal Inference Conference here at Columbia twelve years ago, and it’s great to see that it has been continuing so successfully.

The Efron transition? And the wit and wisdom of our statistical elders

Stephen Martin writes:

Brad Efron seems to have transitioned from “Bayes just isn’t as practical” to “Bayes can be useful, but EB is easier” to “Yes, Bayes should be used in the modern day” pretty continuously across three decades.

Also, Lindley’s comment in the first article is just GOLD:
“The last example with [lambda = theta_1theta_2] is typical of a sampling theorist’s impractical discussions. It is full of Greek letters, as if this unusual alphabet was a repository of truth.” To which Efron responded “Finally, I must warn Professor Lindley that his brutal, and largely unprovoked, attack has been reported to FOGA (Friends of the Greek Alphabet). He will be in for a very nasty time indeed if he wishes to use as much as an epsilon or an iota in any future manuscript.”

“Perhaps the author has been falling over all those bootstraps lying around.”

“What most statisticians have is a parody of the Bayesian argument, a simplistic view that just adds a woolly prior to the sampling-theory paraphernalia. They look at the parody, see how absurd it is, and thus dismiss the coherent approach as well.”

I pointed Stephen to this post and this article (in particular the bottom of page 295). Also this, I suppose.

Causal inference conference at Columbia University on Sat 6 May: Varying Treatment Effects

Hey! We’re throwing a conference:

Varying Treatment Effects

The literature on causal inference focuses on estimating average effects, but the very notion of an “average effect” acknowledges variation. Relevant buzzwords are treatment interactions, situational effects, and personalized medicine. In this one-day conference we shall focus on varying effects in social science and policy research, with particular emphasis on Bayesian modeling and computation.

The focus will be on applied problems in social science.

The organizers are Jim Savage, Jennifer Hill, Beth Tipton, Rachael Meager, Andrew Gelman, Michael Sobel, and Jose Zubizarreta.

And here’s the schedule:

9:30 AM
1. Heterogeneity across studies in meta-analyses of impact evaluations.
– Michael Kremer, Harvard
– Greg Fischer, LSE
– Rachael Meager, MIT
– Beth Tipton, Columbia
10-45 – 11 coffee break

2. Heterogeneity across sites in multi-site trials.
– David Yeager, UT Austin
– Avi Feller, Berkeley
– Luke Miratrix, Harvard
– Ben Goodrich, Columbia
– Michael Weiss, MDRC

12:30-1:30 Lunch

3. Heterogeneity in experiments versus quasi-experiments.
– Vivian Wong, University of Virginia
– Michael Gechter, Penn State
– Peter Steiner, U Wisconsin
– Bryan Keller, Columbia

3:00 – 3:30 afternoon break

4. Heterogeneous effects at the structural/atomic level.
– Jennifer Hill, NYU
– Peter Rossi, UCLA
– Shoshana Vasserman, Harvard
– Jim Savage, Lendable Inc.
– Uri Shalit, NYU

Closing remarks: Andrew Gelman

Please register for the conference here. Admission is free but we would prefer if you register so we have a sense of how many people will show up.

We’re expecting lots of lively discussion.

P.S. Signup for outsiders seems to have filled up. Columbia University affiliates who are interested in attending should contact me directly.

I wanna be ablated

Mark Dooris writes:

I am senior staff cardiologist from Australia. I attach a paper that was presented at our journal club some time ago. It concerned me at the time. I send it as I suspect you collect similar papers. You may indeed already be aware of this paper. I raised my concerns about the “too good to be true” and plethora of “p-values” all in support of desired hypothesis. I was decried as a naysayer and some individuals wanted to set their own clinics on the basis of the study (which may have been ok if it was structured as a replication prospective randomized clinical trial).

I would value your views on the statistical methods and the results…it is somewhat pleasing: fat bad…lose fat good and may even be true in some specific sense but please look at the number of comparisons, which exceed the number of patients and how they are almost perfectly consistent with an amazing dose response esp structural changes.

I am not at all asserting there is fraud I am just pointing out how anomalous this is. Perhaps it is most likely that many of these tests were inevitably unable to be blinded…losing 20 kg would be an obvious finding in imaging. Many of the claimed detected differences in echocardiography seem to exceed the precision of the test (a test which has greater uncertainty in measurements in the obese patients). Certainly the blood parameters may be real but there has been accounting for multiple comparisons.

PS: I do not know, work with or have any relationship with the authors. I am an interventional cardiologist (please don’t hold that against me) and not an electrophysiologist.

The paper that he sent is called “Long-Term Effect of Goal-Directed Weight Management in an Atrial Fibrillation Cohort: A Long-Term Follow-Up Study (LEGACY),” it’s by Rajeev K. Pathak, Melissa E. Middeldorp, Megan Meredith, Abhinav B. Mehta, Rajiv Mahajan, Christopher X. Wong, Darragh Twomey, Adrian D. Elliott, Jonathan M. Kalman, Walter P. Abhayaratna, Dennis H. Lau, and Prashanthan Sanders, and it appeared in 2015 in the Journal of the American College of Cardiology.

The topic of atrial fibrillation concerns me personally! But my body mass index is less than 27 so I don’t seem to be in the target population for this study.

Anyway, I did take a look. The study in question was observational: they divided the patients into three groups, not based on treatments that had been applied, but based on weight loss (>=10%, 3-9%, <3%; all patients had been counseled to try to lose weight). As Dooris writes, the results seem almost too good to be true: For all five of their outcomes (atrial fibrillation frequency, duration, episode severity, symptom subscale, and global well-being), there is a clean monotonic stepping down from group 1 to group 2 to group 3. I guess maybe the symptom subscale and the global well-being measure are combinations of the first three outcomes? So maybe it’s just three measures, not five, that are showing such clean trends. All the measures show huge improvements from baseline to follow-up in all groups, which I guess just demonstrates that the patients were improving in any case. Anyway, I don’t really know what to make of all this but I thought I’d share it with you.

P.S. Dooris adds:

I must admit to feeling embarrassed for my, perhaps, premature and excessive skepticism. I read the comments with interest.

I am sorry to read that you have some personal connection to atrial fibrillation but hope that you have made (a no doubt informed) choice with respect to management. It is an “exciting” time with respect to management options. I am not giving unsolicited advice (and as I have expressed I am just a “plumber” not an “electrician”).
I remain skeptical about the effect size and the complete uniformity of the findings consistent with the hypothesis that weight loss is associated with reduced symptoms of AF, reduced burden of AF, detectable structural changes on echocardiography and uniformly positive effects on lipid profile.
I want to be clear:
  • I find the hypothesis plausible
  • I find the implications consistent with my pre-conceptions and my current advice (this does not mean they are true or based on compelling evidence)
  • The plausibility (for me) arises from
    • there are relatively small studies and meta-analyses that suggest weight loss is associated with “beneficial” effects on blood pressure and lipids. However, the effects are variable. There seems to be differences between genders and differences between methods of weight loss. The effect size is generally smaller than in the LEGACY trial
    • there is evidence of cardiac structural changes: increase chamber size, wall thickness and abnormal diastolic function and some studies suggest that the changes are reversible, perhaps the most change in patients with diastolic dysfunction. I note perhaps the largest change detected with weight loss is reduction in epicardial fat. Some cardiac MRI studies (which have better resolution) have supported this
    • there is electrophysiological data  in suggesting differences in electrophysiological properties in patients with atrial fibrillation related to obesity
  • What concerned me about the paper was the apparent homogeneity of this particular population that seemed to allow the detection of such a strong and consistent relationship.  This seemed “too good to be true”.  I think it does not show the variability I would have expected:
    • gender
    • degree of diastolic dysfunction
    • smoking
    • what other changes during the period were measured?: medication, alcohol etc
    • treatment interaction: I find it difficult to work out who got ablated, how many attempts. Are the differences more related to successful ablations or other factors
    • “blinding”: although the operator may have been blinded to patient category patients with smaller BMI are easier to image and may have less “noisy measurements”. Are the real differences, therefore, smaller than suggested
  • I accept that the authors used repeated measures ANOVA to account for paired/correlated nature of the testing.  However, I do not see the details of the model used.
  • I would have liked to see the differences rather than the means and SD as well as some graphical presentation of the data to see the variability as well as modeling of the relationship between weight loss and effect.
I guess I have not seen a paper where everything works out like you want.  I admit that I should have probably suppressed my disbelief (and waited for replication). What’s the down side? “We got the answer we all want”. “It fits with the general results of other work.” I still feel uneasy not at least asking some questions.
I think as a profession, we medical practitioners  have been guilty of “p-hacking” and over-reacting to small studies with large effect sizes. We have spent too much time in “the garden of forking paths” and believe where have got too after picking throw the noise every apparent signal that suits our preconceptions.  We have wonderful large scale randomized clinical trials that seem to answer narrow but important questions and that is great. However, we still publish a lot of lower quality stuff and promulgate “p-hacking” and related methods to our trainees. I found the Smaldino and McElreath paper timely and instructive (I appreciate you have already seen it).
So, I sent you the email because I felt uneasy (perhaps guilty about my “p-hacking” sins of commission of the past and acceptance of such work of others).


Air rage rage

Commenter David alerts us that Consumer Reports fell for the notorious air rage story.

Background on air rage here and here. Or, if you want to read something by someone other than me, here. This last piece is particularly devastating as it addresses flaws in the underlying research article, hype in the news reporting, and the participation of the academic researcher in that hype.

From the author’s note by Allen St. John at the bottom of that Consumer Reports story:

For me, there’s no better way to spend a day than talking to a bunch of experts about an important subject and then writing a story that’ll help others be smarter and better informed.

Shoulda talked with just one more expert. Maybe Susan T. Fiske at Princeton University—I’ve heard she knows a lot about social psychology. She’s also super-quotable!

P.S. A commenter notifies us that Wired fell for this one too. Too bad. I guess that air rage study is just too good to check.

Bayesian Posteriors are Calibrated by Definition

Time to get positive. I was asking Andrew whether it’s true that I have the right coverage in Bayesian posterior intervals if I generate the parameters from the prior and the data from the parameters. He replied that yes indeed that is true, and directed me to:

  • Cook, S.R., Gelman, A. and Rubin, D.B. 2006. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics 15(3):675–692.

The argument is really simple, but I don’t remember seeing it in BDA.

How to simulate data from a model

Suppose we have a Bayesian model composed of a prior with probability function p(\theta) and sampling distribution with probability function p(y \mid \theta). We then simulate parameters and data as follows.

Step 1. Generate parameters \theta^{(0)} according to the prior p(\theta).

Step 2. Generate data y^{(0)} according to the sampling distribution p(y \mid \theta^{(0)}).

There are three important things to note about the draws:

Consequence 1. The chain rule shows (\theta^{(0)}, y^{(0)}) is generated from the joint p(\theta, y).

Consequence 2. Bayes’s rule shows \theta^{(0)} is a random draw from the posterior p(\theta \mid y^{(0)}).

Consequence 3. The law of total probabiltiy shows that y^{(0)} is a draw from the prior predictive p(y).

Why does this matter?

Because it means the posterior is properly calibrated in the sense that if we repeat this process, the true parameter value \theta^{(0)}_k for any component k of the parameter vector will fall in a 50% posterior interval for that component, p(\theta_k \mid y^{(0)}), exactly 50% of the time under repeated simulations. Same thing for any other interval. And of course it holds up for pairs or more of variables and produces the correct dependencies among variables.

What’s the catch?

The calibration guarantee only holds if the data y^{(0)} actually was geneated from the model, i.e., from the data marginal p(y).

Consequence 3 assures us that this is so. The marginal distribution of the data in the model is also known as the prior predictive distribution; it is defined in terms of the prior and sampling distribution as

p(y) = \int_{\Theta} p(y \mid \theta) \, p(\theta) \, \mathrm{d}\theta.

It’s what we predict about potential data from the model itself. The generating process in Step 1 and Step 2 above follow the integral. Because (\theta^{(0)}, y^{(0)}) is drawn from the joint p(\theta, y), we know that y^{(0)} is drawn from the prior predictive p(y). That is, we know y^{(0)} was generated from the model in the simulations.

Cook et al.’s application to testing

Cook et al. outline the above steps as a means to test MCMC software for Bayesian models. The idea is to generate a bunch of data sets (\theta^{(0)}, y^{(0)}), then for each one make a sequence of draws \theta^{(1)}, \ldots, \theta^{(L)} according to the posterior p(\theta \mid y^{(0)}). If the software is functioning correctly, everything is calibrated and the quantile in which \theta^{(0)}_k falls in \theta^{(1)}_k, \ldots, \theta^{(L)}_k is uniform. The reason it’s uniform is that it’s equally likely to be ranked anywhere because all of the \theta^{(\ell)} including \theta^{(0)} are just random draws from the posterior.

What about overfitting?

What overfitting? Overfitting occurs when the model fits the data too closely and fails to generalize well to new observations. In theory, if we have the right model, we get calibration, and there can’t be overfitting of this type—predictions for new observations are automatically calibrated, because predictions are just another parameter in the model. In other words, we don’t overfit by construction.

In practice, we almost never believe we have the true generative process for our data. We often make do with convenient approximations of some underlying data generating process. For example, we might choose a logistic regression because we’re dealing with binary trial data, not because we believe the log odds are truly a linear function of the predictors.

We even more rarely believe we have the “true” priors. It’s not even clear what “true” would mean when the priors are about our knowledge of the world. The posteriors suffer the same philosophical fate, being more about our knowledge than about the world. But the posteriors have the redeeming quality that we can test them predictively on new data.

In the end, the theory gives us only cold comfort.

But not all is lost!

To give ourselves some peace of mind that our inferences are not going astray, we try to calibrate against real data using cross-validation or actual hold-out testing. For an example, see my Stan case study on repeated binary trial data, which Ben and Jonah conveniently translated into RStanArm.

Basically, we treat Bayesian models like meteorologists—they are making probabilistic predictions after all. To try to assess the competence of a meteorologist, one asks on how many of the N days about which we said it was 20% likely to rain did it actually rain? If the predictions are independent and calibrated, we’d you expect the distribution of rainy days to be \mathrm{Binom}(N, 0.2). To try to assess the competence of a predictive model, we can do exactly the same thing.

Pizzagate update! Response from the Cornell University Media Relations Office

Hey! A few days ago I received an email from the Cornell University Media Relations Office. As I reported in this space, I responded as follows:

Dear Cornell University Media Relations Office:

Thank you for pointing me to these two statements. Unfortunately I fear that you are minimizing the problem.

You write, “while numerous instances of inappropriate data handling and statistical analysis in four published papers were alleged, such errors did not constitute scientific misconduct ( However, given the number of errors cited and their repeated nature, we established a process in which Professor Wansink would engage external statistical experts to validate his review and reanalysis of the papers and attendant published errata. . . . Since the original critique of Professor Wansink’s articles, additional instances of self-duplication have come to light. Professor Wansink has acknowledged the repeated use of identical language and in some cases dual publication of materials.”

But there are many, many more problems in Wansink’s published work, beyond those 4 initially-noticed papers and beyond self-duplication.

Your NIH link above defines research misconduct as “fabrication, falsification and plagiarism, and does not include honest error or differences of opinion. . .” and defines falsification as “Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.”

This phrase, “changing or omitting data or results such that the research is not accurately represented in the research record,” is an apt description of much of Wansink’s work, going far beyond those four particular papers that got the ball rolling, and far beyond duplication of materials. For a thorough review, see this recent post by Tim van der Zee, who points to 37 papers by Wansink, many of which have serious data problems:

And all this doesn’t even get to criticism of Wansink having openly employed a hypotheses-after-results-are-known methodology which leaves his statistics meaningless, even setting aside data errors.

There’s also Wansink’s statement which refers to “the great work of the Food and Brand Lab,” which is an odd phrase to use to describe a group that has published papers with hundreds of errors and major massive data inconsistencies that represent, at worst, fraud, and, at best, some of the sloppiest empirical work—published or unpublished—that I have ever seen. In either case, I consider this pattern of errors to represent research misconduct.

I understand that it’s natural to think that nothing can every be proven, Rashomon and all that. But in this case the evidence for research misconduct is all out in the open, in dozens of published papers.

I have no personal stake in this matter and I have no plans to file any sort of formal complaint. But as a scientist, this bothers me: Wansink’s misconduct, his continuing attempt to minimize it, and this occurring at a major university.

Andrew Gelman

Let me emphasize at this point that the Cornell University Media Relations Office has no obligation to respond to me. They’re already pretty busy, what with all the Fox News crews coming on campus, not to mention the various career-capping studies that happen to come through. Just cos the Cornell University Media Relations Office sent me an email, this implies no obligation on their part to reply to my response.

Anyway, that all said, I thought you might be interested in what the Cornell University Media Relations Office had to say.

So, below, here is their response, in its entirety: