
“Do you think the research is sound or is it gimmicky pop science?”

David Nguyen writes:

I wanted to get your opinion on http://www.scienceofpeople.com/. Do you think the research is sound or is it gimmicky pop science?

My reply: I have no idea. But since I see no evidence on the website, I’ll assume it’s pseudoscience until I hear otherwise. I won’t believe it until it has the endorsement of Susan T. Fiske.

P.S. Oooh, that Fiske slam was so unnecessary, you say. But she still hasn’t apologized for falling asleep on the job and greenlighting himmicanes, air rage, and ages ending in 9.

Organizations that defend junk science are pitiful suckers who get conned and conned again

So. Cornell stands behind Wansink, and Ohio State stands behind Croce. George Mason University bestows honors on Weggy. Penn State trustee disses “so-called victims.” Local religious leaders aggressively defend child abusers in their communities. And we all remember how long it took for Duke University to close the door on Dr. Anil Potti.

OK, I understand all these situations. It’s the sunk cost fallacy: you’ve invested a lot of your reputation in somebody; you don’t want to admit that, all this time, they’ve been using you.

Still, it makes me sad.

These organizations—Cornell, Ohio State, etc.—are victims as much as perpetrators. Wansink, Croce, etc., couldn’t have done it on their own: in their quest to illegitimately extract millions of corporate and government dollars, they made use of their prestigious university affiliations. A press release from a “Cornell professor” sounds so much more credible than a press release from some fast-talking guy with a P.O. box.

Cornell, Ohio State, etc., they’ve been played, and they still don’t realize it.

Remember, a key part of the long con is misdirection: make the mark think you’re his friend.

Causal inference conference in North Carolina

Michael Hudgens announces:

Registration for the 2017 Atlantic Causal Inference Conference is now open. The registration site is here. More information about the conference, including the poster session and the Second Annual Causal Inference Data Analysis Challenge can be found on the conference website here.

We held the very first Atlantic Causal Inference Conference here at Columbia twelve years ago, and it’s great to see that it has been continuing so successfully.

The Efron transition? And the wit and wisdom of our statistical elders

Stephen Martin writes:

Brad Efron seems to have transitioned from “Bayes just isn’t as practical” to “Bayes can be useful, but EB is easier” to “Yes, Bayes should be used in the modern day” pretty continuously across three decades.

http://www2.stat.duke.edu/courses/Spring10/sta122/Handouts/EfronWhyEveryone.pdf
http://projecteuclid.org/download/pdf_1/euclid.ss/1028905930
http://statweb.stanford.edu/~ckirby/brad/other/2009Future.pdf

Also, Lindley’s comment in the first article is just GOLD:
“The last example with [lambda = theta_1theta_2] is typical of a sampling theorist’s impractical discussions. It is full of Greek letters, as if this unusual alphabet was a repository of truth.” To which Efron responded “Finally, I must warn Professor Lindley that his brutal, and largely unprovoked, attack has been reported to FOGA (Friends of the Greek Alphabet). He will be in for a very nasty time indeed if he wishes to use as much as an epsilon or an iota in any future manuscript.”

“Perhaps the author has been falling over all those bootstraps lying around.”

“What most statisticians have is a parody of the Bayesian argument, a simplistic view that just adds a woolly prior to the sampling-theory paraphernalia. They look at the parody, see how absurd it is, and thus dismiss the coherent approach as well.”

I pointed Stephen to this post and this article (in particular the bottom of page 295). Also this, I suppose.

Causal inference conference at Columbia University on Sat 6 May: Varying Treatment Effects

Hey! We’re throwing a conference:

Varying Treatment Effects

The literature on causal inference focuses on estimating average effects, but the very notion of an “average effect” acknowledges variation. Relevant buzzwords are treatment interactions, situational effects, and personalized medicine. In this one-day conference we shall focus on varying effects in social science and policy research, with particular emphasis on Bayesian modeling and computation.

The focus will be on applied problems in social science.

The organizers are Jim Savage, Jennifer Hill, Beth Tipton, Rachael Meager, Andrew Gelman, Michael Sobel, and Jose Zubizarreta.

And here’s the schedule:

9:30 AM
1. Heterogeneity across studies in meta-analyses of impact evaluations.
– Michael Kremer, Harvard
– Greg Fischer, LSE
– Rachael Meager, MIT
– Beth Tipton, Columbia
10:45 – 11:00 coffee break

11:00
2. Heterogeneity across sites in multi-site trials.
– David Yeager, UT Austin
– Avi Feller, Berkeley
– Luke Miratrix, Harvard
– Ben Goodrich, Columbia
– Michael Weiss, MDRC

12:30-1:30 Lunch

1:30
3. Heterogeneity in experiments versus quasi-experiments.
– Vivian Wong, University of Virginia
– Michael Gechter, Penn State
– Peter Steiner, U Wisconsin
– Bryan Keller, Columbia

3:00 – 3:30 afternoon break

3:30
4. Heterogeneous effects at the structural/atomic level.
– Jennifer Hill, NYU
– Peter Rossi, UCLA
– Shoshana Vasserman, Harvard
– Jim Savage, Lendable Inc.
– Uri Shalit, NYU

5pm
Closing remarks: Andrew Gelman

Please register for the conference here. Admission is free but we would prefer if you register so we have a sense of how many people will show up.

We’re expecting lots of lively discussion.

P.S. Signup for outsiders seems to have filled up. Columbia University affiliates who are interested in attending should contact me directly.

I wanna be ablated

Mark Dooris writes:

I am a senior staff cardiologist from Australia. I attach a paper that was presented at our journal club some time ago. It concerned me at the time. I send it as I suspect you collect similar papers. You may indeed already be aware of this paper. I raised my concerns about the “too good to be true” results and the plethora of “p-values” all in support of the desired hypothesis. I was decried as a naysayer, and some individuals wanted to set up their own clinics on the basis of the study (which may have been OK if it was structured as a prospective randomized clinical trial for replication).

I would value your views on the statistical methods and the results…it is somewhat pleasing: fat bad…lose fat good, and it may even be true in some specific sense, but please look at the number of comparisons, which exceeds the number of patients, and at how the results are almost perfectly consistent with an amazing dose response, especially the structural changes.

I am not at all asserting there is fraud; I am just pointing out how anomalous this is. Perhaps it is most likely that many of these tests were inevitably unable to be blinded…losing 20 kg would be an obvious finding in imaging. Many of the claimed detected differences in echocardiography seem to exceed the precision of the test (a test which has greater uncertainty in measurements in the obese patients). Certainly the blood parameters may be real, but there has been no accounting for multiple comparisons.

PS: I do not know, work with or have any relationship with the authors. I am an interventional cardiologist (please don’t hold that against me) and not an electrophysiologist.

The paper that he sent is called “Long-Term Effect of Goal-Directed Weight Management in an Atrial Fibrillation Cohort: A Long-Term Follow-Up Study (LEGACY),” it’s by Rajeev K. Pathak, Melissa E. Middeldorp, Megan Meredith, Abhinav B. Mehta, Rajiv Mahajan, Christopher X. Wong, Darragh Twomey, Adrian D. Elliott, Jonathan M. Kalman, Walter P. Abhayaratna, Dennis H. Lau, and Prashanthan Sanders, and it appeared in 2015 in the Journal of the American College of Cardiology.

The topic of atrial fibrillation concerns me personally! But my body mass index is less than 27 so I don’t seem to be in the target population for this study.

Anyway, I did take a look. The study in question was observational: they divided the patients into three groups, not based on treatments that had been applied, but based on weight loss (>=10%, 3-9%, <3%; all patients had been counseled to try to lose weight). As Dooris writes, the results seem almost too good to be true: For all five of their outcomes (atrial fibrillation frequency, duration, episode severity, symptom subscale, and global well-being), there is a clean monotonic stepping down from group 1 to group 2 to group 3. I guess maybe the symptom subscale and the global well-being measure are combinations of the first three outcomes? So maybe it’s just three measures, not five, that are showing such clean trends. All the measures show huge improvements from baseline to follow-up in all groups, which I guess just demonstrates that the patients were improving in any case. Anyway, I don’t really know what to make of all this but I thought I’d share it with you.

P.S. Dooris adds:

I must admit to feeling embarrassed for my, perhaps, premature and excessive skepticism. I read the comments with interest.

I am sorry to read that you have some personal connection to atrial fibrillation but hope that you have made (a no doubt informed) choice with respect to management. It is an “exciting” time with respect to management options. I am not giving unsolicited advice (and as I have expressed I am just a “plumber” not an “electrician”).
I remain skeptical about the effect size and the complete uniformity of the findings consistent with the hypothesis that weight loss is associated with reduced symptoms of AF, reduced burden of AF, detectable structural changes on echocardiography and uniformly positive effects on lipid profile.
I want to be clear:
  • I find the hypothesis plausible
  • I find the implications consistent with my pre-conceptions and my current advice (this does not mean they are true or based on compelling evidence)
  • The plausibility (for me) arises from
    • there are relatively small studies and meta-analyses that suggest weight loss is associated with “beneficial” effects on blood pressure and lipids. However, the effects are variable. There seem to be differences between genders and differences between methods of weight loss. The effect size is generally smaller than in the LEGACY trial
    • there is evidence of cardiac structural changes: increased chamber size, wall thickness, and abnormal diastolic function, and some studies suggest that the changes are reversible, with perhaps the most change in patients with diastolic dysfunction. I note that perhaps the largest change detected with weight loss is a reduction in epicardial fat. Some cardiac MRI studies (which have better resolution) have supported this
    • there are electrophysiological data suggesting differences in electrophysiological properties related to obesity in patients with atrial fibrillation
  • What concerned me about the paper was the apparent homogeneity of this particular population that seemed to allow the detection of such a strong and consistent relationship.  This seemed “too good to be true”.  I think it does not show the variability I would have expected:
    • gender
    • degree of diastolic dysfunction
    • smoking
    • what other changes during the period were measured?: medication, alcohol etc
    • treatment interaction: I find it difficult to work out who got ablated, and how many attempts. Are the differences more related to successful ablations or to other factors?
    • “blinding”: although the operator may have been blinded to patient category, patients with smaller BMI are easier to image and may have less “noisy” measurements. Are the real differences, therefore, smaller than suggested?
  • I accept that the authors used repeated measures ANOVA to account for the paired/correlated nature of the testing. However, I do not see the details of the model used.
  • I would have liked to see the differences rather than the means and SD as well as some graphical presentation of the data to see the variability as well as modeling of the relationship between weight loss and effect.
I guess I have not seen a paper where everything works out like you want.  I admit that I should have probably suppressed my disbelief (and waited for replication). What’s the down side? “We got the answer we all want”. “It fits with the general results of other work.” I still feel uneasy not at least asking some questions.
I think that as a profession, we medical practitioners have been guilty of “p-hacking” and over-reacting to small studies with large effect sizes. We have spent too much time in “the garden of forking paths” and have come to believe, after picking through the noise, every apparent signal that suits our preconceptions. We have wonderful large scale randomized clinical trials that seem to answer narrow but important questions, and that is great. However, we still publish a lot of lower quality stuff and promulgate “p-hacking” and related methods to our trainees. I found the Smaldino and McElreath paper timely and instructive (I appreciate you have already seen it).
So, I sent you the email because I felt uneasy (perhaps guilty about my “p-hacking” sins of commission of the past and acceptance of such work of others).


Air rage rage

Commenter David alerts us that Consumer Reports fell for the notorious air rage story.

Background on air rage here and here. Or, if you want to read something by someone other than me, here. This last piece is particularly devastating as it addresses flaws in the underlying research article, hype in the news reporting, and the participation of the academic researcher in that hype.

From the author’s note by Allen St. John at the bottom of that Consumer Reports story:

For me, there’s no better way to spend a day than talking to a bunch of experts about an important subject and then writing a story that’ll help others be smarter and better informed.

Shoulda talked with just one more expert. Maybe Susan T. Fiske at Princeton University—I’ve heard she knows a lot about social psychology. She’s also super-quotable!

P.S. A commenter notifies us that Wired fell for this one too. Too bad. I guess that air rage study is just too good to check.

Bayesian Posteriors are Calibrated by Definition

Time to get positive. I was asking Andrew whether it’s true that I have the right coverage in Bayesian posterior intervals if I generate the parameters from the prior and the data from the parameters. He replied that yes indeed that is true, and directed me to:

  • Cook, S.R., Gelman, A. and Rubin, D.B. 2006. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics 15(3):675–692.

The argument is really simple, but I don’t remember seeing it in BDA.

How to simulate data from a model

Suppose we have a Bayesian model composed of a prior with probability function p(\theta) and sampling distribution with probability function p(y \mid \theta). We then simulate parameters and data as follows.

Step 1. Generate parameters \theta^{(0)} according to the prior p(\theta).

Step 2. Generate data y^{(0)} according to the sampling distribution p(y \mid \theta^{(0)}).

There are three important things to note about the draws:

Consequence 1. The chain rule shows (\theta^{(0)}, y^{(0)}) is generated from the joint p(\theta, y).

Consequence 2. Bayes’s rule shows \theta^{(0)} is a random draw from the posterior p(\theta \mid y^{(0)}).

Consequence 3. The law of total probability shows that y^{(0)} is a draw from the prior predictive p(y).

Why does this matter?

Because it means the posterior is properly calibrated in the sense that if we repeat this process, the true parameter value \theta^{(0)}_k for any component k of the parameter vector will fall in a 50% posterior interval for that component, p(\theta_k \mid y^{(0)}), exactly 50% of the time under repeated simulations. Same thing for any other interval. And of course it holds up for pairs or more of variables and produces the correct dependencies among variables.
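To make this concrete, here is a minimal simulation sketch in Python using a conjugate normal-normal model of my own choosing (not from the post or from Cook et al.), so that the posterior is available in closed form. It repeats Steps 1 and 2 many times and checks how often the central 50% posterior interval contains the parameter that generated the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model (my choice, for illustration only):
#   prior:     theta ~ Normal(mu0, tau0)
#   sampling:  y_1, ..., y_n | theta ~ Normal(theta, sigma), sigma known
mu0, tau0, sigma, n = 0.0, 1.0, 2.0, 10
n_sims, covered = 10_000, 0

for _ in range(n_sims):
    theta0 = rng.normal(mu0, tau0)            # Step 1: parameter from the prior
    y0 = rng.normal(theta0, sigma, size=n)    # Step 2: data given that parameter
    # Exact conjugate posterior p(theta | y0)
    post_prec = 1 / tau0**2 + n / sigma**2
    post_mean = (mu0 / tau0**2 + y0.sum() / sigma**2) / post_prec
    post_sd = post_prec ** -0.5
    # Central 50% posterior interval (0.674 = 75th percentile of standard normal)
    lower, upper = post_mean - 0.674 * post_sd, post_mean + 0.674 * post_sd
    covered += lower < theta0 < upper

print(covered / n_sims)  # should be close to 0.50
```

The printed proportion should hover around 0.50, and the same check works for any other interval, or jointly for several components at once.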

What’s the catch?

The calibration guarantee only holds if the data y^{(0)} actually was generated from the model, i.e., from the data marginal p(y).

Consequence 3 assures us that this is so. The marginal distribution of the data in the model is also known as the prior predictive distribution; it is defined in terms of the prior and sampling distribution as

p(y) = \int_{\Theta} p(y \mid \theta) \, p(\theta) \, \mathrm{d}\theta.

It’s what we predict about potential data from the model itself. The generating process in Steps 1 and 2 above follows this integral. Because (\theta^{(0)}, y^{(0)}) is drawn from the joint p(\theta, y), we know that y^{(0)} is drawn from the prior predictive p(y). That is, we know y^{(0)} was generated from the model in the simulations.

Cook et al.’s application to testing

Cook et al. outline the above steps as a means to test MCMC software for Bayesian models. The idea is to generate a bunch of data sets (\theta^{(0)}, y^{(0)}), then for each one make a sequence of draws \theta^{(1)}, \ldots, \theta^{(L)} according to the posterior p(\theta \mid y^{(0)}). If the software is functioning correctly, everything is calibrated and the quantile in which \theta^{(0)}_k falls in \theta^{(1)}_k, \ldots, \theta^{(L)}_k is uniform. The reason it’s uniform is that it’s equally likely to be ranked anywhere because all of the \theta^{(\ell)} including \theta^{(0)} are just random draws from the posterior.
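Here is a hedged sketch of that software check, again with the same toy conjugate normal model so that exact posterior draws can stand in for the output of an MCMC sampler; with real software you would replace those exact draws with the sampler's draws and then inspect the histogram of ranks for uniformity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same assumed toy model as before (illustration only)
mu0, tau0, sigma, n, L = 0.0, 1.0, 2.0, 10, 100
ranks = []

for _ in range(2_000):
    theta0 = rng.normal(mu0, tau0)            # "true" parameter drawn from the prior
    y0 = rng.normal(theta0, sigma, size=n)    # data generated from that parameter
    post_prec = 1 / tau0**2 + n / sigma**2
    post_mean = (mu0 / tau0**2 + y0.sum() / sigma**2) / post_prec
    post_sd = post_prec ** -0.5
    draws = rng.normal(post_mean, post_sd, size=L)  # stand-in for L MCMC draws
    ranks.append(int((draws < theta0).sum()))       # rank of theta0 among the draws

# If the "sampler" is correct, the ranks are uniform on 0, ..., L
print(np.histogram(ranks, bins=10, range=(0, L))[0])
```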

What about overfitting?

What overfitting? Overfitting occurs when the model fits the data too closely and fails to generalize well to new observations. In theory, if we have the right model, we get calibration, and there can’t be overfitting of this type—predictions for new observations are automatically calibrated, because predictions are just another parameter in the model. In other words, we don’t overfit by construction.

In practice, we almost never believe we have the true generative process for our data. We often make do with convenient approximations of some underlying data generating process. For example, we might choose a logistic regression because we’re dealing with binary trial data, not because we believe the log odds are truly a linear function of the predictors.

We even more rarely believe we have the “true” priors. It’s not even clear what “true” would mean when the priors are about our knowledge of the world. The posteriors suffer the same philosophical fate, being more about our knowledge than about the world. But the posteriors have the redeeming quality that we can test them predictively on new data.

In the end, the theory gives us only cold comfort.

But not all is lost!

To give ourselves some peace of mind that our inferences are not going astray, we try to calibrate against real data using cross-validation or actual hold-out testing. For an example, see my Stan case study on repeated binary trial data, which Ben and Jonah conveniently translated into RStanArm.

Basically, we treat Bayesian models like meteorologists—they are making probabilistic predictions, after all. To assess the competence of a meteorologist, one asks: on how many of the N days on which we said it was 20% likely to rain did it actually rain? If the predictions are independent and calibrated, we’d expect the number of rainy days among those N to be distributed \mathrm{Binom}(N, 0.2). To assess the competence of a predictive model, we can do exactly the same thing.
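As a small illustration of that meteorologist-style check (on made-up numbers, not any real model's output), we can bin held-out binary outcomes by their predicted probabilities and compare the predicted rate with the observed frequency in each bin:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up forecasts and outcomes; here the outcomes are simulated to be calibrated,
# so predicted and observed rates should roughly agree in every bin.
p_hat = rng.uniform(0, 1, size=5_000)   # predicted P(rain) for each held-out day
y = rng.binomial(1, p_hat)              # whether it actually "rained"

bins = np.linspace(0, 1, 11)
bin_index = np.digitize(p_hat, bins) - 1
for b in range(10):
    mask = bin_index == b
    print(f"predicted ~{bins[b] + 0.05:.2f}   observed {y[mask].mean():.2f}   n={mask.sum()}")
```

With a miscalibrated model the observed column drifts away from the predicted column, which is exactly the kind of failure this hold-out check is meant to catch.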

Pizzagate update! Response from the Cornell University Media Relations Office

Hey! A few days ago I received an email from the Cornell University Media Relations Office. As I reported in this space, I responded as follows:

Dear Cornell University Media Relations Office:

Thank you for pointing me to these two statements. Unfortunately I fear that you are minimizing the problem.

You write, “while numerous instances of inappropriate data handling and statistical analysis in four published papers were alleged, such errors did not constitute scientific misconduct (https://grants.nih.gov/grants/research_integrity/research_misconduct.htm). However, given the number of errors cited and their repeated nature, we established a process in which Professor Wansink would engage external statistical experts to validate his review and reanalysis of the papers and attendant published errata. . . . Since the original critique of Professor Wansink’s articles, additional instances of self-duplication have come to light. Professor Wansink has acknowledged the repeated use of identical language and in some cases dual publication of materials.”

But there are many, many more problems in Wansink’s published work, beyond those 4 initially-noticed papers and beyond self-duplication.

Your NIH link above defines research misconduct as “fabrication, falsification and plagiarism, and does not include honest error or differences of opinion. . .” and defines falsification as “Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.”

This phrase, “changing or omitting data or results such that the research is not accurately represented in the research record,” is an apt description of much of Wansink’s work, going far beyond those four particular papers that got the ball rolling, and far beyond duplication of materials. For a thorough review, see this recent post by Tim van der Zee, who points to 37 papers by Wansink, many of which have serious data problems: http://www.timvanderzee.com/the-wansink-dossier-an-overview/

And all this doesn’t even get to criticism of Wansink having openly employed a hypotheses-after-results-are-known methodology which leaves his statistics meaningless, even setting aside data errors.

There’s also Wansink’s statement which refers to “the great work of the Food and Brand Lab,” which is an odd phrase to use to describe a group that has published papers with hundreds of errors and major massive data inconsistencies that represent, at worst, fraud, and, at best, some of the sloppiest empirical work—published or unpublished—that I have ever seen. In either case, I consider this pattern of errors to represent research misconduct.

I understand that it’s natural to think that nothing can ever be proven, Rashomon and all that. But in this case the evidence for research misconduct is all out in the open, in dozens of published papers.

I have no personal stake in this matter and I have no plans to file any sort of formal complaint. But as a scientist, this bothers me: Wansink’s misconduct, his continuing attempt to minimize it, and this occurring at a major university.

Yours,
Andrew Gelman

Let me emphasize at this point that the Cornell University Media Relations Office has no obligation to respond to me. They’re already pretty busy, what with all the Fox News crews coming on campus, not to mention the various career-capping studies that happen to come through. Just cos the Cornell University Media Relations Office sent me an email, this implies no obligation on their part to reply to my response.

Anyway, that all said, I thought you might be interested in what the Cornell University Media Relations Office had to say.

So, below, here is their response, in its entirety:

 

 

 

Stacking, pseudo-BMA, and AIC type weights for combining Bayesian predictive distributions

This post is by Aki.

We have often been asked in the Stan user forum how to do model combination for Stan models. Bayesian model averaging (BMA) by computing marginal likelihoods is challenging in theory and even more challenging in practice using only the MCMC samples obtained from the full model posteriors.

Some users have suggested using Akaike type weighting by exponentiating WAIC or LOO to compute weights to combine models.

We had doubts about this approach and started investigating it further so that we could give some recommendations.

Investigation led to the paper we (Yuling Yao, Aki Vehtari, Daniel Simpson and Andrew Gelman) have finally finished and which contains our recommendations:

  • Ideally, we prefer to attack the Bayesian model combination problem via continuous model expansion, forming a larger bridging model that includes the separate models as special cases.
  • In practice constructing such an expansion can require conceptual and computational effort, and so it sometimes makes sense to consider simpler tools that work with existing inferences from separately-fit models.
  • Bayesian model averaging based on marginal likelihoods can fail badly in the M-open setting in which the true data-generating process is not one of the candidate models being fit.
  • We propose and recommend a new log-score stacking for combining predictive distributions.
  • Akaike-type weights computed using Bayesian cross-validation are closely related to the pseudo Bayes factor of Geisser and Eddy (1979), and thus we label model combination using such weights as pseudo-BMA.
  • We propose an improved variant, which we call pseudo-BMA+, that is stabilized using the Bayesian bootstrap, to properly take into account the uncertainty of the future data distribution.
  • Based on our theory, simulations, and examples, we recommend stacking (of predictive distributions) for the task of combining separately-fit Bayesian posterior predictive distributions. As an alternative, Pseudo-BMA+ is computationally cheaper and can serve as an initial guess for stacking.

The paper is also available in arXiv:1704.02030 (with minor typo correction appearing there tonight in the regular arXiv update), and the code is part of the loo package in R (currently in github https://github.com/stan-dev/loo/ and later in CRAN).
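As a rough illustration of the simplest version of the idea (plain pseudo-BMA, without the Bayesian-bootstrap stabilization that gives pseudo-BMA+), the weights are just a softmax of the models' LOO expected log predictive densities. The elpd values below are made up, standing in for output you would get from the loo package:

```python
import numpy as np

# Made-up LOO expected log predictive densities for three candidate models
elpd_loo = np.array([-512.3, -509.8, -530.1])

# Plain pseudo-BMA weights: proportional to exp(elpd_loo).
# Subtracting the max first keeps the exponentials numerically stable.
z = elpd_loo - elpd_loo.max()
weights = np.exp(z) / np.exp(z).sum()
print(weights.round(3))   # nearly all weight on the second model here
```

Stacking, the method the paper recommends, instead chooses the weights by optimizing the log score of the weighted combination of predictive distributions rather than weighting each model by its own fit.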

P.S. from Andrew: I really like this idea. It resolves a problem that’s been bugging me for many years.

Beyond subjective and objective in statistics: my talk with Christian Hennig tomorrow (Wed) 5pm in London

Christian Hennig and I write:

Decisions in statistical data analysis are often justified, criticized, or avoided using concepts of objectivity and subjectivity. We argue that the words “objective” and “subjective” in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality, and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. Together with stability, these make up a collection of virtues that we think is helpful in discussions of statistical foundations and practice. The advantage of these reformulations is that the replacement terms do not oppose each other and that they give more specific guidance about what statistical science strives to achieve. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgment of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling, and socioeconomic stratification. The aim of this paper is to push users and developers of statistical methods toward more effective use of diverse sources of information and more open acknowledgement of assumptions and goals.

The paper will be discussed tomorrow (Wed 12 Apr) 5pm at the Royal Statistical Society in London. Christian and I will speak for 20 minutes each, then various people will present their discussions. Kinda like a blog comment thread but with nothing on the hot hand, pizzagate, or power pose.

You can also see here for a link to a youtube where Christian and I discuss the paper with each other.

Mmore from Ppnas

Kevin Lewis asks for my take on two new papers:

Study 1:
Honesty plays a key role in social and economic interactions and is crucial for societal functioning. However, breaches of honesty are pervasive and cause significant societal and economic problems that can affect entire nations. Despite its importance, remarkably little is known about the neurobiological mechanisms supporting honest behavior. We demonstrate that honesty can be increased in humans with transcranial direct current stimulation (tDCS) over the right dorsolateral prefrontal cortex. Participants (n = 145) completed a die-rolling task where they could misreport their outcomes to increase their earnings, thereby pitting honest behavior against personal financial gain. Cheating was substantial in a control condition but decreased dramatically when neural excitability was enhanced with tDCS. This increase in honesty could not be explained by changes in material self-interest or moral beliefs and was dissociated from participants’ impulsivity, willingness to take risks, and mood. A follow-up experiment (n = 156) showed that tDCS only reduced cheating when dishonest behavior benefited the participants themselves rather than another person, suggesting that the stimulated neural process specifically resolves conflicts between honesty and material self-interest. Our results demonstrate that honesty can be strengthened by noninvasive interventions and concur with theories proposing that the human brain has evolved mechanisms dedicated to control complex social behaviors.

Study 2:
Academic credentials open up a wealth of opportunities. However, many people drop out of educational programs, such as community college and online courses. Prior research found that a brief self-regulation strategy can improve self-discipline and academic outcomes. Could this strategy support learners at large scale? Mental contrasting with implementation intentions (MCII) involves writing about positive outcomes associated with a goal, the obstacles to achieving it, and concrete if-then plans to overcome them. The strategy was developed in Western countries (United States, Germany) and appeals to individualist tendencies, which may reduce its efficacy in collectivist cultures such as India or China. We tested this hypothesis in two randomized controlled experiments in online courses (n = 17,963). Learners in individualist cultures were 32% (first experiment) and 15% (second experiment) more likely to complete the course following the MCII intervention than a control activity. In contrast, learners in collectivist cultures were unaffected by MCII. Natural language processing of written responses revealed that MCII was effective when a learner’s primary obstacle was predictable and surmountable, such as everyday work or family obligations but not a practical constraint (e.g., Internet access) or a lack of time. By revealing heterogeneity in MCII’s effectiveness, this research advances theory on self-regulation and illuminates how even highly efficacious interventions may be culturally bounded in their effects.

He only sent me the abstract which is kind of a nice thing to do cos then I feel under no obligation to read the papers (which he tells me will appear in PPNAS and are embargoed until this very moment).

Anyway, here was my reply:

#1 looks like a forking-paths disaster but, hey, who knows? I guess it’s a candidate for a preregistered replication study.

#2 looks interesting as a main effect—if a simple trick helps people focus, that’s good—but I’m suspicious of the interaction for the usual reasons of confounders and forking paths.

Molyneux expresses skepticism on hot hand


Guy Molyneux writes:

I saw your latest post on the hot hand too late to contribute to the discussion there. While I don’t disagree with your critique of Gilovich and his reluctance to acknowledge past errors, I do think you underestimate the power of the evidence against a meaningful hot hand effect in sports. I believe the balance of evidence should create a strong presumption that the hot hand is at most a small factor in competitive sports, and therefore that people’s belief in the hot hand is reasonably considered a kind of cognitive error. Let me try to explain my thinking in a couple of steps.

I think everyone agrees that evidence of a hot hand (or “momentum”) is extremely hard to find in actual game data. Across a wide range of sports, players’ outcomes are just slightly more “streaky” than we’d expect from random chance, implying that momentum is at most a weak effect (and even some of that streakiness is accounted for by player health, which is not a true “hot hand”). This body of work I think fairly places the burden of proof on the believers in a strong hot hand, or those still open to the idea (like you), to show why this evidence shouldn’t end the debate. Broadly speaking, two serious objections have been raised to accepting the empirical evidence from actual games.

First, you argue that “Whether you made or missed the last couple of shots is itself a very noisy measure of your ‘hotness,’ so estimates of the hot hand based on these correlations are themselves strongly attenuated toward zero.” If most of the allegedly hot players we study were really just lucky, and thus quickly regress to their true mean, the elevated performance by the subset of truly ‘hot’ players will be masked in the data. I take your point. Nonetheless, given the absence of observed momentum in games, one of two things must still be true: A) the hot hand effect is large but rare (your hypothesis), or B) the hot hand effect is small but perhaps frequent. This may be an important distinction for some analytic purpose, but from my perspective the two possibilities are effectively the same thing: the game impact (I won’t say ‘effect’) of the hot hand is quite small. By “small” I mean both that the hot hand likely has a negligible impact on game outcomes, and that teams and athletes should largely ignore the hot hand in making strategic decisions.

And since the actual impact on games is quite small *even if* your hypothesis is correct (because true hotness is rare), it follows that belief in a strong hot hand by players or fans still represents a kind of cognitive failure. The hot hand fallacy held by most fans, at least in my experience, is not that a very few (and unknowable) players sometimes get very hot, but rather that nearly all athletes sometimes get hot, and we can see this from their performance on the field/court.

(An important caveat: IF it proved possible to identify “true” hot hands in real time, or even to identify specific athletes who consistently exhibit true hot hand behavior, then my argument fails and the hot hand might have legitimate strategic implications. But I have not seen evidence that anyone knows how to do this in any sport.)

The second major objection made to the empirical studies is that the hot hand is disguised as a result of player’s very knowledge of it. As Miller and Sanjurjo suggest, “the myriad confounds present in games actually make it impossible to identify or rule out the existence of a hot hand effect with in-game shooting data, despite the rich nature of modern data sets.” Two main confounds are usually cited: hot players will take more difficult shots, and opposing athletes will deploy additional resources to combat the hot player. Some have argued (including Miller and Sanjurjo) that these factors are so strong that we must ignore real game data in favor of experimental data. But I think it is a mistake to dismiss the game data, for three reasons:

  • The theoretical possibility that players’ shot selections and defensive responses could perfectly – and with astonishing consistency – mask the true hot hand effect is only a possibility. Before we dismiss a large body of inconvenient studies, I’d argue that hot hand believers need to demonstrate that these confounds regularly operate at the necessary scale, not just assume it.
  • A sophisticated effort to control for shot selection and defensive attention to hot basketball shooters concludes that the remaining hot hand effect is quite modest. Conversely, as far as I know no one has shown empirically that the enhanced confidence of hot players and/or opponents’ defensive responses can account for the lack of observed momentum in a sport.
  • Efforts to detect a hot hand effect in baseball have invariably failed. And that’s important, because in baseball the players cannot choose to take on more arduous tasks when they feel “hot,” and opposing players have virtually no ability to redistribute defensive resources in a way that disadvantages players perceived to be hot. So even if you reject the Sloan study and think confounds explain the lack of momentum in basketball, they cannot explain what we observe in baseball.

I would also note that this “confounds” objection is in fact a strong argument *in favor* of the notion that the hot hand is a cognitive failure, given your argument that in-game streaks are a very poor marker of true hotness. If the latter is true, then it would still be a cognitive error for a player or his opponents to act on this usually-false indicator of enhanced short-term talent. If players on a streak take more difficult shots, they are wrong to do so, and teams that change defensive alignments in response are also making a mistake.

So, these are the reasons I remain unpersuaded that I should believe in a hot hand in the wild, or even consider it an open question. That leaves us, finally, with the experimental data that some feel should be privileged as evidence. I haven’t read enough of the experimental research to form any view on its quality or validity. But for answering the question of whether belief in the hot hand is a fallacy, I don’t see how the results of these experiments much matter. Fans and athletes believe they see the hot hand in real games. If a pitcher has retired the last nine batters he faced, many fans (and managers!) believe he is more likely than usual to get the next batter out. If a batter has 10 hits in his last 20 at bats, fans believe he is “on a tear” and more likely to be successful in his 21st at bat (and his manager is more likely to keep him in the lineup). But we know these beliefs are wrong.

Even if experiments do demonstrate real momentum for some repetitive athletic tasks in controlled settings, this would not challenge either of my contentions: that the hot hand has a negligible impact on competitive sports outcomes, and fans’ belief in the hot hand (in real games) is a cognitive error. Personally, I find it easy to believe that humans may get into (and out of) a rhythm for some extremely repetitive tasks – like shooting a large number of 3-point baskets. Perhaps this kind of “muscle memory” momentum exists, and is revealed in controlled experiments. But it seems to me that those conducting such studies have ranged far from the original topic of a hot hand in competitive sports — indeed, I’m not sure it is even in sight.

I don’t know that I have anything new to say after a few zillion exchanges in blog comments, but I wanted to put Guy’s reasoning out there, because (a) he expresses it well, and (b) he’s arguing that I’m a bit off in my interpretation of the data, and that’s something I should share with you.

The only thing I will comment on in Guy’s above post is that I do think baseball is different, because a hitter can face different pitches every time he comes to the plate.  So it’s not quite like basketball where the task is the same every time.

P.S. Yeah, yeah, I know, it seems at times that this blog is on an endless loop of power pose, pizzagate, and the hot hand. Really, though, we do talk about other things! See here, for example. Or here. Or here, here, here.

P.P.S. Josh (coauthor with Sanjurjo of those hot hand papers) responds in the comments. Lots of good discussion here.

Combining independent evidence using a Bayesian approach but without standard Bayesian updating?

Nic Lewis writes:

I have made some progress with my work on combining independent evidence using a Bayesian approach but eschewing standard Bayesian updating. I found a neat analytical way of doing this, to a very good approximation, in cases where each estimate of a parameter corresponds to the ratio of two variables each determined with normal error, the fractional uncertainty in the numerator and denominator variables differing between the types of evidence. This seems a not uncommon situation in science, and it is a good approximation to that which exists when estimating climate sensitivity. I have had a manuscript in which I develop and test this method accepted by the Journal of Statistical Planning and Inference (for a special issue on Confidence Distributions edited by Tore Schweder and Nils Hjort). Frequentist coverage is almost exact using my analytical solution, based on combining Jeffreys’ priors in quadrature, whereas Bayesian updating produces far poorer probability matching. I also show that a simple likelihood ratio method gives almost identical inference to my Bayesian combination method. A copy of the manuscript is available here: https://niclewis.files.wordpress.com/2016/12/lewis_combining-independent-bayesian-posteriors-for-climate-sensitivity_jspiaccepted2016_cclicense.pdf .

I’ve since teamed up with Peter Grunwald, a statistics professor in Amsterdam whom you may know – you cite two of his works in your 2013 paper ‘Philosophy and the practice of Bayesian statistics’. It turns out that my proposed method for combining evidence agrees with that implied by the Minimum Description Length principle, which he has been closely involved in developing. We have a joint paper under review by a leading climate science journal, which applies the method developed in my JSPI paper.

I think the reason Bayesian updating can give poor probability matching, even when the original posterior used as the prior gives exact probability matching in relation to the original data, is that conditionality is not always applicable to posterior distributions. Conditional probability was developed rigorously by Kolmogorov in the context of random variables. Don Fraser has stated that the conditional probability lemma requires two probabilistic inputs and is not satisfied where there is no prior knowledge of a parameter’s value. I extend this argument and suggest that, unless the parameter is generated by a random process rather than being fixed, conditional probability does not apply to updating a posterior corresponding to existing knowledge, used as a prior, since such a prior distribution does not provide the required type of probability. As Tore Schweder has written (in his and Nils Hjort’s 2016 book Confidence, Likelihood, Probability) it is necessary to keep epistemic and aleatory probability apart. Bayes himself, of course, developed his theory in the context of a random parameter generated with a known probability distribution.

I don’t really have time to look at this but I thought it might interest you, so feel free to share your impressions. I assume Nic Lewis will be reading the comments.

This also seemed worth posting, given that Yuling, Aki, Dan, and I will soon be releasing our own paper on combining posterior inferences from different models fit to a single dataset. Not quite the same problem but it’s in the same general class of questions.

Cross Purposes

A correspondent writes:

I thought you might enjoy this…

I’m refereeing a paper which basically looks at whether survey responses on a particular topic vary when the question is asked in two different ways. In the main results table they split the sample along several relevant dimensions (education; marital status; religion; etc). I give them credit for showing all the results, but only one differential is statistically significant at 5%, and of course they focus the interpretation on that one. In my initial report, I asked if they either had tried or would try correcting for multiple hypothesis testing. I just received their response:

“We agree with the referee, but we do not think it is possible given that we really do not have enough power.”

So they left it as is and don’t discuss the issue at all in the revision!

As is often the case, this is an example where I suspect we’d be better off had p-values never been invented.

PPPPPPPPPPNAS!

Jochen Weber writes:

As I follow your blog (albeit loosely), I figured I’d point out an “early release” paper from PNAS I consider to be “garbage” (at least by title, and probably by content).

The short version is, the authors claim to have found the neural correlate of a person being “cognizant of” the outcome of an action (either doing something on purpose or by accident), but frame it as a correlate of “knowledge vs. recklessness” in the context of crime and assessing culpability in the legal domain.

Without having read the whole thing yet, it would seem important that once you tell someone that some negative consequence occurred (or could occur), the neural correlate would likely change.

This may not be so much bad based on the methods, but rather by its completely over-hyped framing (and the fact that it spent less than 3 months in review; received 11/23/2016, accepted 2/9/2017, which is about 10 weeks). Presumably the reviewers were so enthralled by the work, they couldn’t think of anything to improve…

And just reading the methods now, I would say they tested something different altogether; the amount of “knowledge” that was manipulated between groups had nothing to do with the extent to which what they were doing was criminal, but rather with how likely they were to be caught… It is maddening to think that popular news media will pick this up as “finally the answer to judging culpability based on fMRI”–what utter nonsense…

Weber continues:

Finally, just a couple issues I would point out…

– showing people a bowl of poo in the scanner and asking them whether they would eat it may elicit a lot of brain activation; to claim that that is the neural basis of what the brain looks like when eating poo is a fallacy

– besides the point that this was a game (and neither risk nor knowledge had any real-world consequences), this doesn’t seem to be about crime so much as about representing fear of being caught—a criminal who is certain they won’t be caught may not show this pattern

– the first author didn’t design the research (see contributions section), and the senior author never first-authored an fMRI paper (so he doesn’t know the modality), a toxic combination it seems…

A bowl of poo, huh? I’m liking this experiment already! Seriously, I do find this paper a bit creepy as well as hyped.

Dear Cornell University Public Relations Office

I received the following email, which was not addressed to me personally:

From: ** <**@cornell.edu>
Date: Wednesday, April 5, 2017 at 9:42 AM
To: “gelman@stat.columbia.edu”
Cc: ** <**@cornell.edu>
Subject: Information regarding research by Professor Brian Wansink

I know you have been following this issue, and I thought you might be interested in new information posted today on Cornell University’s Media Relations Office website and the Food and Brand Lab website.

**
**@cornell.edu
@**
office: 607-***-****
mobile: 607-***-****

The message included two links: Media Relations Office website and Food and Brand Lab website.

You can click through. Wansink’s statement is hardly worth reading; it’s the usual mixture of cluelessness, pseudo-contrition, and bluster, referring to “the great work of the Food and Brand Lab” and minimizing the long list of problems with his work. He writes, “all of this data analysis was independently reviewed and verified under contract by the outside firm Mathematica Policy Research, and none of the findings altered the core conclusions of any of the studies in question,” but I clicked through, and the Mathematica Policy Research person specifically writes, “The researcher did not review the quality of the data, hypotheses, research methods, interpretations, or conclusions.” So if it’s really true that “none of the findings altered the core conclusions of any of the studies in question,” then it’s on Wansink to demonstrate this. Which he did not.

And, as someone pointed out to me, in any case this doesn’t address the core criticism of Wansink having openly employed a hypotheses-after-results-are-known methodology which leaves those statistics meaningless, even after correcting all the naive errors.

Wansink also writes:

I, of course, take accuracy and replication of our research results very seriously.

That statement is ludicrous. Wansink has published hundreds of errors in many many papers. Many of his reported numbers make no sense at all. To say that Wansink takes accuracy very seriously is like saying that, umm, Dave Kingman took fielding very seriously. Get real, dude.

And he writes of “possible duplicate use of data or republication of portions of text from my earlier works.” It’s not just “possible,” it actually happened. This is like the coach of the 76ers talking about possibly losing a few games during the 2015-2016 season.

OK, now for the official statement from Cornell Media Relations. Their big problem is that they minimize the problem. Here’s what they write:

Shortly after we were made aware of questions being raised about research conducted by Professor Brian Wansink by fellow researchers at other institutions, Cornell conducted an internal review to determine the extent to which a formal investigation of research integrity was appropriate. That review indicated that, while numerous instances of inappropriate data handling and statistical analysis in four published papers were alleged, such errors did not constitute scientific misconduct (https://grants.nih.gov/grants/research_integrity/research_misconduct.htm). However, given the number of errors cited and their repeated nature, we established a process in which Professor Wansink would engage external statistical experts to validate his review and reanalysis of the papers and attendant published errata. . . . Since the original critique of Professor Wansink’s articles, additional instances of self-duplication have come to light. Professor Wansink has acknowledged the repeated use of identical language and in some cases dual publication of materials. Cornell will evaluate these cases to determine whether or not additional actions are warranted.

Here’s the problem. It’s not just those 4 papers, and it’s not just those 4 papers plus the repeated use of identical language and in some cases dual publication of materials.

There’s more. A lot more. And it looks to me like serious research misconduct: either outright fraud by people in the lab, or such monumental sloppiness that data are entirely disconnected from context, with zero attempts to fix things when problems have been pointed out.

If Wansink did all this on his own and never published anything and never got any government grants, I guess I wouldn’t call it research misconduct; I’d just call it a monumental waste of time. But to repeatedly publish papers where the numbers don’t add up, where the data are not as described: sure, that seems to me like research misconduct.

From the NIH link above:

Research misconduct is defined as fabrication, falsification and plagiarism, and does not include honest error or differences of opinion. . .

Fabrication: Making up data or results and recording or reporting them.

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

Plagiarism: The appropriation of another person’s ideas, processes, results, or words without giving appropriate credit.

I have no idea about fabrication or plagiarism, but Wansink’s work definitely seems to have a bit of falsification as defined above.

Let’s talk falsification.

From my very first post on Wansink, from 15 Dec 2016, it seems that he published four papers from the same experiment, not citing each other, with different sample sizes and incoherent data-exclusion rules. This falls into the category, “omitting data or results such that the research is not accurately represented in the research record.”

Next, Tim van der Zee, Jordan Anaya, and Nicholas Brown found over 150 errors in just four papers. If this is not falsification, it’s massive incompetence.

How incompetent do you have to be to make over 150 errors in four published papers?

And, then, remember this quote from Wansink:

Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.

As I wrote at the time, “how do you say ‘we realized we asked . . .’? What’s to realize? If you asked the question that way, wouldn’t you already know this?” To publish results under your name without realizing what’s gone into it . . . that’s research misconduct. OK, if it just happens once or twice, sure, we’ve all been sloppy. But with Wansink it happens over and over again.

And this is not new. Here’s a story from 2011. An outside researcher found problems with a published paper of Wansink. Someone from Wansink’s lab fielded the comments, responded politely . . . and did nothing.

And then there’s the carrots story from Jordan Anaya, who shares this table from a Wansink paper from 2012:

As Anaya points out, these numbers don’t add up. None of them add up! The numbers violate the Law of Conservation of Carrots. I guess they don’t teach physics so well up there at Cornell . . .

Anaya reports that, as part of a ridiculous attempt to defend his substandard research practices, Wansink said the values were, “based on the well-cited quarter plate data collection method referenced in that paper.” However, Anaya continues, “The quarter plate method Wansink refers to was published by his group in 2013. Although half of the references in the 2012 paper were self-citations, the quarter plate method was not referenced as he claims.” In addition, according to that 2012 paper from Wansink, “the weight of any remaining carrots was subtracted from their starting weight to determine the actual amount eaten.”

What’s going on? Assuming it’s not out-and-out fabrication, it’s “Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.” In short: what it says in the published paper is not what happened, nor do Wansink’s later replies add up.

But wait, there’s more. Here’s a post from 21 Mar 2017 from Tim van der Zee, who summarizes:

Starting with our pre-print describing over 150 errors in 4 papers, there is a now ever-increasing list of research articles (co-)authored by BW which have been criticized for containing serious errors, reporting inconsistencies, impossibilities, plagiarism, and data duplications.

To the best of my [van der Zee’s] knowledge, there are currently:

37 publications from Wansink which are alleged to contain minor to very serious issues,
which have been cited over 3300 times,
are published in over 20 different journals, and in 8 books,
spanning over 19 years of research.

Here’s just one example:

Wansink, B., & Seed, S. (2001). Making brand loyalty programs succeed. Brand Management, 8, 211–222.

Citations: 34

Serious issue of self-plagiarism which goes beyond the Method section . . . Furthermore, both papers mention substantially different sample sizes (153 vs. 643) but both have a table with results which are basically entirely identical.

That’s a big deal. OK, sure, self-plagiarism, no need for us to be Freybabies about that. But, what about that other thing?

Furthermore, both papers mention substantially different sample sizes (153 vs. 643) but both have a table with results which are basically entirely identical.

Whoa! That sounds to me like . . . “changing or omitting data or results such that the research is not accurately represented in the research record.”

And here’s another, “Relation of soy consumption to nutritional knowledge,” which Nick Brown discusses in detail. This is one of three papers that report three different surveys, each with exactly 770 respondents, but one based on surveys sent to 1002 people, one sent to 1600 people, and one sent to 2000 people. All things are possible but it’s hard for me to believe there are three different surveys. It seems more like “changing or omitting data or results such that the research is not accurately represented in the research record.” As van der Zee puts it, “Both studies use (mostly) the same questionnaire, allowing the results to be compared. Surprisingly, while the two papers supposedly reflect two entirely different studies in two different samples, there is a near perfect correlation of 0.97 between the mean values in the two studies.”

Brown then gets granular:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data, but they do seem consistent with someone making up numbers and not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.
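
For readers who want to try this sort of digit forensics themselves, here is a minimal sketch (my own, not Brown's code, and with made-up values): tally the terminal digits and compare them to the roughly uniform distribution you'd expect from real measurements reported to this precision. With tables this small the chi-square approximation is crude, but the eyeball test, too few 0's and 5's, is often enough.

```python
# Terminal-digit check in the spirit of Brown's analysis.  The reported values
# below are hypothetical placeholders, not numbers from the soy papers.
from collections import Counter
from scipy.stats import chisquare

reported = ["3.42", "2.87", "4.13", "3.96", "2.71", "3.58", "4.29", "3.64",
            "2.93", "3.81", "4.07", "3.36", "2.54", "3.19", "4.48", "3.72"]

counts = Counter(int(x[-1]) for x in reported)
observed = [counts.get(d, 0) for d in range(10)]

# Under the (rough) null hypothesis that last digits are uniform on 0-9:
stat, p = chisquare(observed)
print("last-digit counts (0-9):", observed)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```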

Again, all things are possible. So let me just say that, given all the other misrepresentations of data and data collection in Wansink’s papers, I have no good reason to think the numbers from those tables are real.

Oh, wait, here’s another from van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.
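
To make the “most extreme possible version” idea concrete, here is a small sketch of the logic (my own reconstruction with placeholder numbers, not van der Zee's calculation): on a 1-to-9 Likert item, the distribution that is as spread out as possible for a given reported mean puts every response at the two endpoints, and that already bounds what the reported standard deviation can be.

```python
# Most extreme (maximum-spread) response pattern on a 1..9 Likert item that is
# consistent with a reported sample size and mean.  N, MEAN, SD are placeholder
# values for illustration, not numbers from the comfort-food paper.
N, MEAN, SD = 411, 6.4, 3.1
LO, HI = 1, 9

# Endpoint-only solution of  n_lo + n_hi = N  and  (n_lo*LO + n_hi*HI)/N = MEAN:
n_hi = round(N * (MEAN - LO) / (HI - LO))
n_lo = N - n_hi

values = [LO] * n_lo + [HI] * n_hi
m = sum(values) / N
max_sd = (sum((v - m) ** 2 for v in values) / (N - 1)) ** 0.5

print(f"most extreme split: {n_lo} responses at {LO}, {n_hi} at {HI}")
print(f"maximum possible SD given this mean: {max_sd:.2f} (reported {SD})")
# If even this maximally spread-out pattern cannot reproduce the reported
# summary statistics, the published numbers are mutually inconsistent.
```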

Looks to me like “changing or omitting data or results such that the research is not accurately represented in the research record.”

Or this:

Wansink, B., Van Ittersum, K., & Painter, J. E. (2006). Ice cream illusions: bowls, spoons, and self-served portion sizes. American journal of preventive medicine, 31(3), 240-243.

Citations: 262

A large amount of inconsistent/impossible means and standard deviations, as well as inconsistent ANOVA results, as can be seen in the picture.
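
For what it’s worth, “impossible means” of this kind can be screened for mechanically. Here’s a minimal sketch of a granularity check in the spirit of the GRIM test (the mean/n pairs are made-up examples, not values from the ice cream paper): if a quantity is recorded in whole units, the mean of n observations must be a multiple of 1/n, so many reported means simply cannot occur. A similar check can be applied to standard deviations.

```python
# Granularity check on reported means (GRIM-style).  The (mean, n) pairs are
# hypothetical examples, not values from the ice cream paper.
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """True if some integer-valued total, divided by n, rounds to the mean."""
    nearest_total = round(reported_mean * n)
    return round(nearest_total / n, decimals) == round(reported_mean, decimals)

for mean, n in [(3.44, 25), (3.45, 25), (2.67, 45)]:
    verdict = "possible" if grim_consistent(mean, n) else "impossible"
    print(f"mean={mean}, n={n}: {verdict}")
```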

If this is not “changing or omitting data or results such that the research is not accurately represented in the research record,” then what is it? Perhaps “telling your subordinates that they are expected to get publishable results, then never looking at the data or the analysis?” Or maybe something else? If the numbers don’t add up, this means that some data or results have been changed so that the research is not accurately represented in the research record, no?

What is this plausible deniability crap? Do we now need a RICO for research labs??

OK, just one more from van der Zee’s list and then I’ll quit:

Wansink, B., Just, D. R., & Payne, C. R. (2012). Can branding improve school lunches?. Archives of pediatrics & adolescent medicine, 166(10), 967-968.

Citations: 38

There are various things wrong with this article. The authors repeatedly interpret non significant p values as being significant, as well as miscalculating p values. Their choice of statistical analysis is very questionable. The data are visualized in a very questionable manner which is easily misinterpreted (such that the effects are overestimated). In addition, the visualization is radically different from an earlier version of the same paper, which gave a much more modest impression of the effects. Furthermore, the authors seem to be confused about the participants, as they are school students aged 8-11 but are also called “preliterate children”; in later publication Wansink mentions these are “daycare kids”, and further exaggerates and misreports the size of the effects.

This does not sound to me like mere “honest error or differences of opinion.” From my perspective it sounds more like “changing or omitting data or results such that the research is not accurately represented in the research record.”

The only possible defense I can see here is that Wansink didn’t actually collect the data, analyze the data, or write the report. He’s just a figurehead. But, again, RICO. If you’re in charge of a lab which repeatedly, over and over and over and over again, changes or omits data or results such that the research is not accurately represented in the research record, then, yes, from my perspective I think it’s fair to say that you have been changing or omitting data or results such that the research is not accurately represented in the research record, and you have been doing scientific misconduct.

I said I’d stop but I have to share this additional example. Again, here’s van der Zee’s summary:

Sığırcı, Ö., Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7.

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 whilst also having experienced repeated heavy combat. Almost no soldiers could have had any other age than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.
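
The arithmetic behind that claim is straightforward. If every respondent was at least 18 at the end of the war, a reported mean age only slightly above 18 leaves almost no room for older soldiers. Here’s a sketch with hypothetical numbers (I don’t have the paper’s exact figures in front of me):

```python
# How many WW2 veterans could have been older than 18, given a reported mean
# age at the end of the war?  N and MEAN_AGE are hypothetical placeholders.
N = 355
MEAN_AGE = 18.05   # hypothetical reported mean age at the end of WW2
MIN_AGE = 18       # youngest plausible age for someone who saw heavy combat

# Total "excess age" above the minimum, summed over the whole sample:
excess_years = (MEAN_AGE - MIN_AGE) * N
# Anyone aged 19 or older contributes at least one full excess year, so:
max_older = int(excess_years)
print(f"at most {max_older} of {N} respondents ({max_older / N:.0%}) "
      f"could have been 19 or older at the end of the war")
```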

What does that sound like? Oh yeah, “changing or omitting data or results such that the research is not accurately represented in the research record.”

So . . . I decided to reply.

Dear Cornell University Media Relations Office:

Thank you for pointing me to these two statements. Unfortunately I fear that you are minimizing the problem.

You write, “while numerous instances of inappropriate data handling and statistical analysis in four published papers were alleged, such errors did not constitute scientific misconduct (https://grants.nih.gov/grants/research_integrity/research_misconduct.htm). However, given the number of errors cited and their repeated nature, we established a process in which Professor Wansink would engage external statistical experts to validate his review and reanalysis of the papers and attendant published errata. . . . Since the original critique of Professor Wansink’s articles, additional instances of self-duplication have come to light. Professor Wansink has acknowledged the repeated use of identical language and in some cases dual publication of materials.”

But there are many, many more problems in Wansink’s published work, beyond those 4 initially-noticed papers and beyond self-duplication.

Your NIH link above defines research misconduct as “fabrication, falsification and plagiarism, and does not include honest error or differences of opinion. . .” and defines falsification as “Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.”

This phrase, “changing or omitting data or results such that the research is not accurately represented in the research record,” is an apt description of much of Wansink’s work, going far beyond those four particular papers that got the ball rolling, and far beyond duplication of materials. For a thorough review, see this recent post by Tim van der Zee, who points to 37 papers by Wansink, many of which have serious data problems: http://www.timvanderzee.com/the-wansink-dossier-an-overview/

And all this doesn’t even get to criticism of Wansink having openly employed a hypotheses-after-results-are-known methodology which leaves his statistics meaningless, even setting aside data errors.

There’s also Wansink’s statement which refers to “the great work of the Food and Brand Lab,” which is an odd phrase to use to describe a group that has published papers with hundreds of errors and massive data inconsistencies that represent, at worst, fraud, and, at best, some of the sloppiest empirical work—published or unpublished—that I have ever seen. In either case, I consider this pattern of errors to represent research misconduct.

I understand that it’s natural to think that nothing can ever be proven, Rashomon and all that. But in this case the evidence for research misconduct is all out in the open, in dozens of published papers.

I have no personal stake in this matter and I have no plans to file any sort of formal complaint. But as a scientist, this bothers me: Wansink’s misconduct, his continuing attempts to minimize it, and the fact that all this is happening at a major university.

Yours,
Andrew Gelman

P.S. I have no interest in seeing Wansink prosecuted for any misdeeds, I don’t think there’s any realistic chance he’ll be asked to return his government grants, and I’m not trying to get Cornell to fire him or sanction him or whatever. Any such campaign would take a huge amount of effort which I’m sure could be better spent elsewhere. And who am I to throw the first stone? I’ve messed up in data analysis myself.

But it irritates me to see Wansink continue to misrepresent what is happening, and it irritates me to see Cornell University minimize the story. If they don’t want to do anything about Wansink, fine; I completely understand. But the evidence is what it is; don’t understate it.

Remember the Ed Wegman story? Weggy repeatedly published articles under his own name that copied material written by others without attribution; people notified his employer, who buried the story. George Mason University didn’t seem to want to know about it.

The Wansink case is a bit more complicated, in that there do seem to be people at Cornell who care, but there also seems to be a desire to minimize the problem and make it go away. Don’t do that. To minimize scientific misconduct is an insult to all of us who work so hard to present our data accurately.

Tech company wants to hire Stan programmers!

Ittai Kan writes:

I started life as an academic mathematician (chaos theory) but have long since moved into industry. I am currently Chief Scientist at Afiniti, a contact center routing technology company that connects agents and callers on the basis of various factors in order to globally optimize contact center performance. We have 17 data scientists/algorithm engineers in Washington, DC and another 40 (not as well trained) in Pakistan.

The problem mathematically divides into two components: A data science / statistics component where we estimate the likelihood of success for any agent/caller pairing (this is difficult because we have to estimate comparative advantage which involves differences of differences of estimates); and an operations research component where we perform an optimal allocation in a stochastic setting with a variety of real-world constraints.

We are finding Stan to be very useful (and something of a miracle) and are interested in hiring students who are experts with Stan and can help us improve Stan as we try to make it easier to use and efficiently investigate multiple models. We expect to significantly contribute to the development of Stan in the next two years.
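
For readers wondering what “differences of differences of estimates” means in this setting, here is a toy sketch (my own construction with invented numbers, not Afiniti’s model): given an estimated matrix of success probabilities for agent-by-caller-type pairings, the comparative advantage of a pairing is a double difference, so four noisy estimates enter each number, which is exactly where hierarchical shrinkage in Stan earns its keep.

```python
# Toy illustration of comparative advantage as a difference of differences.
# The success-rate matrix is invented; rows = agents, columns = caller types.
import numpy as np

p = np.array([[0.30, 0.22],    # agent A: estimated success rates by caller type
              [0.25, 0.24]])   # agent B

# How much better is agent A than agent B on caller type 0, over and above
# how much better A is on caller type 1?
comp_adv = (p[0, 0] - p[1, 0]) - (p[0, 1] - p[1, 1])
print(f"comparative advantage of A for type-0 callers: {comp_adv:+.3f}")
# Because four noisy cell estimates enter this single quantity, its standard
# error is much larger than any one cell's -- hence the appeal of partial
# pooling (e.g., a hierarchical model in Stan) before optimizing the routing.
```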

It’s not so hard to move away from hypothesis testing and toward a Bayesian approach of “embracing variation and accepting uncertainty.”

There’s been a lot of discussion, here and elsewhere, of the problems with null hypothesis significance testing, p-values, deterministic decisions, type 1 error rates, and all the rest. And I’ve recommended that people switch to a Bayesian approach, “embracing variation and accepting uncertainty,” as demonstrated (I hope) in my published applied work.

But we recently had a blog discussion that made me realize there was some confusion on this point.

Emmanuel Charpentier wrote:

It seems that, if we want, following the conclusion of Andrew’s paper, to abandon binary conclusions, we are bound to give:

* a discussion of possible models of the data at hand (including prior probabilities and priors for their parameters),

* a posterior distribution of parameters of the relevant model(s), and

* a discussion of the posterior probabilities of these models

as the sole logically defensible result of a statistical analysis.

It seems also that there is no way to take a decision (pursue or not a given line of research, embark or not in a given planned action, etc…) short of a real decision analysis.

We have hard times before us selling *that* to our “clients”: after > 60 years hard-selling them the NHST theory, we have to tell them that this particular theory was (more or less) snake-oil aimed at *avoiding* decision analysis…

We also have hard work to do in order to learn how to build the necessary discussions, that can hardly avoid involving specialists of the subject matter: I can easily imagine myself discussing a clinical subject; possibly a biological one; I won’t touch an economic or political problem with a ten-foot pole…

Wow—that sounds like a lot of work! It might seem that a Bayesian approach is fine in theory but is too impractical for real work.

But I don’t think so. Here’s my reply to Charpentier:

I think you’re making things sound a bit too hard: I’ve worked on dozens of problems in social science and public health, and the statistical analysis that I’ve done doesn’t look so different from classical analyses. The main difference is that I don’t set up the problem in terms of discrete “hypotheses”; instead, I just model things directly.

And Stephen Martin followed up with more thoughts.

In my experience, a Bayesian approach is typically less effort and easier to explain, compared to a classical approach which involves all these weird hypotheses.

It’s harder to do the wrong thing right than to do the right thing right.
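
As a toy illustration of what “just model things directly” can look like in practice (my own made-up example, not one from the post): rather than testing a null hypothesis about a treatment effect, estimate the effect and report its posterior uncertainty.

```python
# Toy example of "embracing variation and accepting uncertainty": report the
# posterior for a difference in means rather than a reject/don't-reject verdict.
# The group summaries are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

y_treat, se_treat = 4.8, 0.9   # hypothetical estimate and standard error
y_ctrl,  se_ctrl  = 3.6, 0.8

# With flat priors and approximately normal sampling errors, each group mean's
# posterior is roughly normal around its estimate; simulate the difference.
draws = (rng.normal(y_treat, se_treat, 10_000)
         - rng.normal(y_ctrl,  se_ctrl,  10_000))

lo, mid, hi = np.percentile(draws, [2.5, 50, 97.5])
print(f"posterior median difference: {mid:.2f}")
print(f"95% interval: [{lo:.2f}, {hi:.2f}]")
print(f"Pr(difference > 0) = {np.mean(draws > 0):.2f}")
# The output is a full range of plausible effect sizes, not a binary decision.
```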

Annals of Spam

I’ve recently been getting a ton of spam—something like 200 messages a day in my inbox! The Columbia people tell me that soon we’ll be switching to a new mail server that will catch most of these, but for now I have to spend a couple minutes every day just going thru and flagging them as spam. Which does basically nothing.

Anyway, most of these are just boring: home improvement ads, quack medicine, search engine optimization, sub-NKVD political fake news, invitations to fake conferences around the world, Wolfram Research employees who claim to have read my papers, etc. But today I got this which had an amusing statistical twist:

Understand how to do a coverage analysis at the Clinical Trial Billing Compliance Boot Camp

Become Clinical Trial Billing Proficient at the Only Hands-On Workshop to Guide You through All of Your Billing Compliance Challenges

Do you want to know the ins and outs of performing a coverage analysis? This year’s Clinical Trial Billing Compliance Boot Camp series will walk you through the process from “qualifying” a trial — to putting actual codes on the billing grid, so translation to the coders can occur. . . . Register today to learn how to do a coverage analysis from soup to nuts! We will help you start a coverage analysis grid that you can take home with you that will help you with your process improvement. . . .

Something about the relentless positivity of their message reminded me of Brian Wansink. Or Amy Cuddy.

I mean, really, why does everyone have to be so negative all the time? So critical? I say, let’s stop trying to check whether the numbers on published papers add up. Let’s just agree that any paper that’s published is always true. Let’s believe everything they tell us on NPR and Ted talks. Let’s just say that everything published in PPNAS is actually science. Let’s accept every submitted paper (as long as it has “p less than .05” somewhere). Let’s tenure everybody! No more party poopers, that’s what I say!