The post “Data sleaze: Uber and beyond” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Scroll through Kaiser’s blog for more:

Dispute over analysis of school quality and home prices shows social science is hard

My pre-existing United boycott, and some musing on randomness and fairness

etc.


The post Using prior knowledge in frequentist tests appeared first on Statistical Modeling, Causal Inference, and Social Science.

I replied:

I’ve not tried to follow the details but this reminds me of our paper on posterior predictive checks. People think of this as very Bayesian but my original idea when doing this research was to construct frequentist tests using Bayesian averaging in order to get p-values. This was motivated by a degrees-of-freedom-correction problem where the model had nonlinear constraints and so one could not simply do a classical correction based on counting parameters.

To which Bartels wrote:

Your work starts from the same point as mine: existing frequentist tests may be inadequate for the problem of interest. Your work also ends where I would like to end: performing tests via integration over (i.e., sampling of) parameters and future observations using likelihood and prior.

In addition, I try to anchor the approach in decision theory (as referenced in my write-up). Perhaps this is too ambitious; we’ll see.

Results so far, using the language of your publication:

– The posterior distribution p(theta|y) is a good choice for the deviance D(y,theta). It gives optimal confidence intervals/sets in the sense proposed by Schafer, C.M. and Stark, P.B., 2009. Constructing confidence regions of optimal expected size. Journal of the American Statistical Association, 104(487), pp.1080-1089.

– Using informative priors for the deviance D(y,theta)=p(theta|y) may improve the quality of decisions, e.g., may improve the power of tests.

– For the marginalization, I find it difficult to strike the balance between proposing something that can be argued/shown to give optimal tests, and something that can be calculated with available computational resources. I hope to end up with something like one of the variants shown in your Figure 1.

I noted that you distinguish test statistics, which do not depend on the parameters, from deviances, which do. I’m not aware of anything that prevents you from using deviances that depend on parameters for frequentist tests – it is just inconvenient if you are after generic, closed-form solutions for tests. I did not make this distinction, and refer to tests regardless of whether they depend on the parameters or not.

I don’t really have anything more to say here, as I have not thought about these issues for a while. But I thought Bartels’s paper and this discussion might interest some of you.


The post I hate R, volume 38942 appeared first on Statistical Modeling, Causal Inference, and Social Science.

R doesn’t allow block comments. You have to comment out each line, or you can encapsulate the block in if(0){} which is the world’s biggest hack. Grrrrr.
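For the record, here is what the two workarounds look like (sketched with a made-up model fit; `fit`, `y`, `x`, and `d` are hypothetical names). Note that the if(0){} trick only works because the block is parsed but never evaluated, so the code inside must still be syntactically valid R:

```r
# Option 1: comment out each line (what an editor's "toggle comment" does)
# fit <- lm(y ~ x, data = d)
# plot(fit)

# Option 2: the if(0){} hack. The block is parsed but never evaluated,
# so it must still be syntactically valid R.
if (0) {
  fit <- lm(y ~ x, data = d)  # hypothetical model fit, never run
  plot(fit)
}
```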

**P.S.** Just to clarify: I want block commenting *not* because I want to add long explanatory blocks of text to annotate my scripts. I want block commenting because I alter my scripts, and sometimes I want to comment out a block of code.


The post The next Lancet retraction? [“Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults”] appeared first on Statistical Modeling, Causal Inference, and Social Science.

Someone who prefers to remain anonymous asks for my thoughts on this post by Michael Corrigan and Robert Whitaker, “Lancet Psychiatry Needs to Retract the ADHD-Enigma Study: Authors’ conclusion that individuals with ADHD have smaller brains is belied by their own data,” which begins:

Lancet Psychiatry, a UK-based medical journal, recently published a study titled Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults: A cross-sectional mega-analysis. According to the paper’s 82 authors, the study provides definitive evidence that individuals with ADHD have altered, smaller brains. But as the following detailed review reveals, the study does not come close to supporting such claims.

Below are tons of detail, so let me lead with my conclusion, which is that the criticisms coming from Corrigan and Whitaker seem reasonable to me. That is, based on my quick read, the 82 authors of that published paper seem to have made a big mistake in what they wrote.

I’d be interested to see if the authors have offered any reply to these criticisms. The article has just recently come out—the journal publication is dated April 2017—and I’d like to see what the authors have to say.

OK, on to the details. Here are Corrigan and Whitaker:

The study is beset by serious methodological shortcomings, missing data issues, and statistical reporting errors and omissions. The conclusion that individuals with ADHD have smaller brains is contradicted by the “effect-size” calculations that show individual brain volumes in the ADHD and control cohorts largely overlapped. . . .

Their results, the authors concluded, contained important messages for clinicians: “The data from our highly powered analysis confirm that patients with ADHD do have altered brains and therefore that ADHD is a disorder of the brain.” . . .

The press releases sent to the media reflected the conclusions in the paper, and the headlines reported by the media, in turn, accurately summed up the press releases. Here is a sampling of headlines:

Given the implications of this study’s claims, it deserves to be closely analyzed. Does the study support the conclusion that children and adults with ADHD have “altered brains,” as evidenced by smaller volumes in different regions of the brain? . . .

Alternative Headline: Large Study Finds Children with ADHD Have Higher IQs!

To discover this finding, you need to spend $31.50 to purchase the article, and then make a special request to Lancet Psychiatry to send you the appendix. Then you will discover, on pages 7 to 9 in the appendix, a “Table 2” that provides IQ scores for both the ADHD cohort and the controls.

Although there were 23 clinical sites in the study, only 20 reported comparative IQ data. In 16 of the 20, the ADHD cohort had higher IQs on average than the control group. In the other four clinics, the ADHD and control groups had the same average IQ (with the mean IQ scores for both groups within two points of each other). Thus, at all 20 sites, the ADHD group had a mean IQ score that was equal to, or higher than, the mean IQ score for the control group. . . .

And why didn’t the authors discuss the IQ data in their paper, or utilize it in their analyses? . . . Indeed, if the IQ data had been promoted in the study’s abstract and to the media, the public would now be having a new discussion: Is it possible that children diagnosed with ADHD are more intelligent than average? . . .

They Did Not Find That Children Diagnosed with ADHD Have Smaller Brain Volumes . . .

For instance, the authors reported a Cohen’s d effect size of .19 for differences in the mean volume of the accumbens in children under 15. . . in this study, for youth under 15, it was the largest effect size of all the brain volume comparisons that were made. . . . Approximately 58% of the ADHD youth in this convenience sample had an accumbens volume below the average in the control group, while 42% of the ADHD youth had an accumbens volume above the average in the control group. Also, if you knew the accumbens volume of a child picked at random, you would have a 54% chance that you could correctly guess which of the two cohorts—ADHD or healthy control—the child belonged to. . . . The diagnostic value of an MRI brain scan, based on the findings in this study, would be of little more predictive value than the toss of a coin. . . .

The authors reported that the “volumes of the accumbens, amygdala, caudate, hippocampus, putamen, and intracranial volume were smaller in individuals with ADHD compared with controls in the mega-analysis” (p. 1). If this is true, then smaller brain volumes should show up in the data from most, if not all, of the 21 sites that had a control group. But that was not the case. . . . The problem here is obvious. If authors are claiming that smaller brain regions are a defining “abnormality” of ADHD, then such differences should be consistently found in mean volumes of ADHD cohorts at all sites. The fact that there was such variation in mean volume data is one more reason to see the authors’ conclusions—that smaller brain volumes are a defining characteristic of ADHD—as unsupported by the data. . . .
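As a sanity check on the numbers quoted above: under a normal model with equal variances, Cohen’s d maps directly to the overlap statistics Corrigan and Whitaker cite. This is my own back-of-the-envelope sketch, not the authors’ calculation, but it reproduces the quoted 58% figure and comes close to the 54% one:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

d = 0.19  # Cohen's d for accumbens volume, youth under 15

# Fraction of the ADHD group falling below the control-group mean,
# assuming two normal distributions shifted by d standard deviations
frac_below = phi(d)                       # ~0.575, i.e. about 58%

# Probability of superiority (common-language effect size): the chance
# that a randomly chosen control exceeds a randomly chosen case
prob_superiority = phi(d / math.sqrt(2))  # ~0.553, close to the quoted 54%

print(round(frac_below, 3), round(prob_superiority, 3))
```

Either way, the point stands: an effect of this size is of little more predictive value than a coin toss.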

And now here’s what the original paper said:

We aimed to investigate whether there are structural differences in children and adults with ADHD compared with those without this diagnosis. In this cross-sectional mega-analysis [sic; see P.P.S. below], we used the data from the international ENIGMA Working Group collaboration, which in the present analysis was frozen at Feb 8, 2015. Individual sites analysed structural T1-weighted MRI brain scans with harmonised protocols of individuals with ADHD compared with those who do not have this diagnosis. . . .

Our sample comprised 1713 participants with ADHD and 1529 controls from 23 sites . . . The volumes of the accumbens (Cohen’s d=–0·15), amygdala (d=–0·19), caudate (d=–0·11), hippocampus (d=–0·11), putamen (d=–0·14), and intracranial volume (d=–0·10) were smaller in individuals with ADHD compared with controls in the mega-analysis. There was no difference in volume size in the pallidum (p=0·95) and thalamus (p=0·39) between people with ADHD and controls.

The above demonstrates some forking paths, and there are a bunch more in the published paper, for example:

Exploratory lifespan modelling suggested a delay of maturation and a delay of degeneration, as effect sizes were highest in most subgroups of children (<15 years) versus adults (>21 years): in the accumbens (Cohen’s d=–0·19 vs –0·10), amygdala (d=–0·18 vs –0·14), caudate (d=–0·13 vs –0·07), hippocampus (d=–0·12 vs –0·06), putamen (d=–0·18 vs –0·08), and intracranial volume (d=–0·14 vs 0·01). There was no difference between children and adults for the pallidum (p=0·79) or thalamus (p=0·89). Case-control differences in adults were non-significant (all p>0·03). Psychostimulant medication use (all p>0·15) or symptom scores (all p>0·02) did not influence results, nor did the presence of comorbid psychiatric disorders (all p>0·5). . . .

Outliers were identified at above and below one and a half times the interquartile range per cohort and group (case and control) and were excluded . . . excluding collinearity of age, sex, and intracranial volume (variance inflation factor <1·2) . . . The model included diagnosis (case=1 and control=0) as a factor of interest, age, sex, and intracranial volume as fixed factors, and site as a random factor. In the analysis of intracranial volume, this variable was omitted as a covariate from the model. Handedness was added to the model to correct for possible effects of lateralisation, but was excluded from the model when there was no significant contribution of this factor. . . . stratified by age: in children aged 14 years or younger, adolescents aged 15–21 years, and adults aged 22 years and older. We removed samples that were left with ten patients or fewer because of the stratification. . . .

Forking paths are fine; I have forking paths in every analysis I’ve ever done. But forking paths render published p-values close to meaningless; in particular I have no reason to take seriously a statement such as, “p values were significant at the false discovery rate corrected threshold of p=0·0156,” from the summary of the paper.

So let’s forget about p-values and just look at the data graphs, which appear in the published paper:

Unfortunately these are not raw data or even raw averages for each age; instead they are “moving averages, corrected for age, sex, intracranial volume, and site for the subcortical volumes.” But we’ll take what we’ve got.

From the above graphs, it doesn’t seem like much of anything is going on: the blue and red lines cross all over the place! So now I don’t understand this summary graph from the paper:

I mean, sure, I see it for Accumbens, I guess, if you ignore the older people. But, for the others, the lines in the displayed age curves cross all over the place.

The article in question has the following list of authors: Martine Hoogman, Janita Bralten, Derrek P Hibar, Maarten Mennes, Marcel P Zwiers, Lizanne S J Schweren, Kimm J E van Hulzen, Sarah E Medland, Elena Shumskaya, Neda Jahanshad, Patrick de Zeeuw, Eszter Szekely, Gustavo Sudre, Thomas Wolfers, Alberdingk M H Onnink, Janneke T Dammers, Jeanette C Mostert, Yolanda Vives-Gilabert, Gregor Kohls, Eileen Oberwelland, Jochen Seitz, Martin Schulte-Rüther, Sara Ambrosino, Alysa E Doyle, Marie F Høvik, Margaretha Dramsdahl, Leanne Tamm, Theo G M van Erp, Anders Dale, Andrew Schork, Annette Conzelmann, Kathrin Zierhut, Ramona Baur, Hazel McCarthy, Yuliya N Yoncheva, Ana Cubillo, Kaylita Chantiluke, Mitul A Mehta, Yannis Paloyelis, Sarah Hohmann, Sarah Baumeister, Ivanei Bramati, Paulo Mattos, Fernanda Tovar-Moll, Pamela Douglas, Tobias Banaschewski, Daniel Brandeis, Jonna Kuntsi, Philip Asherson, Katya Rubia, Clare Kelly, Adriana Di Martino, Michael P Milham, Francisco X Castellanos, Thomas Frodl, Mariam Zentis, Klaus-Peter Lesch, Andreas Reif, Paul Pauli, Terry L Jernigan, Jan Haavik, Kerstin J Plessen, Astri J Lundervold, Kenneth Hugdahl, Larry J Seidman, Joseph Biederman, Nanda Rommelse, Dirk J Heslenfeld, Catharina A Hartman, Pieter J Hoekstra, Jaap Oosterlaan, Georg von Polier, Kerstin Konrad, Oscar Vilarroya, Josep Antoni Ramos-Quiroga, Joan Carles Soliva, Sarah Durston, Jan K Buitelaar, Stephen V Faraone, Philip Shaw, Paul M Thompson, Barbara Franke.

I also found a webpage for their research group, featuring this wonderful map:

The number of sites looks particularly impressive when you include each continent twice like that. But they should really do some studies in Antarctica, given how huge it appears to be!

**P.S.** Following the links, I see that Corrigan and Whitaker come into this with a particular view:

Mad in America’s mission is to serve as a catalyst for rethinking psychiatric care in the United States (and abroad). We believe that the current drug-based paradigm of care has failed our society, and that scientific research, as well as the lived experience of those who have been diagnosed with a psychiatric disorder, calls for profound change.

This does not mean that the critics are wrong—presumably the authors of the original paper came into their research with their own strong views—but it can be helpful to know where they’re coming from.

**P.P.S.** The paper discussed above uses the term “mega-analysis.” At first I thought this might be some sort of typo, but apparently the expression does exist and has been around for a while. From my quick search, it appears that the term was first used by James Dillon in a 1982 article, “Superanalysis,” in Evaluation News, where he defined mega-analysis as “a method for synthesizing the results of a series of meta-analyses.”

But in the current literature, “mega-analysis” seems to simply refer to a meta-analysis that uses the raw data from the original studies.

If so, I’m unhappy with the term “mega-analysis” because: (a) the “mega” seems a bit hypey; (b) what if the original studies are small? Then even all the data combined might not be so “mega”; and (c) I don’t like the implication that plain old “meta-analysis” *doesn’t* use the raw data. I’m pretty sure that the vast majority of meta-analyses use only published summaries, but I’ve always thought of the version that uses the original data as the preferred form of meta-analysis.

I bring up this mega-analysis thing not as a criticism of the Hoogman et al. paper—they’re just using what appears to be a standard term in their field—but just as an interesting side-note.

**P.P.P.S.** The above post represents my current impression. As I wrote, I’d be interested to see the original authors’ reply to the criticism. Lancet does have a pretty bad reputation—it’s known for publishing flawed, sensationalist work—but I’m sure they run the occasional good article too. So I wouldn’t want to make any strong judgments in this case before hearing more.

**P.P.P.P.S.** Regarding the title of this post: No, I don’t think Lancet would ever retract this paper, even if all the above criticisms are correct. It seems that retraction is used only in response to scientific misconduct, not in response to mere error. So when I say “retraction,” I mean what one might call “conceptual retraction.” The real question is: Will this new paper join the list of past Lancet papers which we would not want to take seriously, and which we regret were ever published?


The post Stan in St. Louis this Friday appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Title:** Stan: A Software Ecosystem for Modern Bayesian Inference

Jonah Sol Gabry, Columbia University

**Neuroimaging Informatics and Analysis Center (NIAC) Seminar Series**

Friday April 28, 2017, 1:30-2:30pm

NIL Large Conference Room

#2311, 2nd Floor, East Imaging Bldg.

4525 Scott Avenue, St. Louis, MO


The post Stan without frontiers, Bayes without tears appeared first on Statistical Modeling, Causal Inference, and Social Science.

This recent comment thread reminds me of a question that comes up from time to time, which is how to teach Bayesian statistics to students who aren’t comfortable with calculus. For continuous models, probabilities are integrals. And in just about every example except the one at 47:16 of this video, there are multiple parameters, so probabilities are multiple integrals.

So how to teach this to the vast majority of statistics users who can’t easily do multivariate calculus?

I dunno, but I don’t know that this has anything in particular to do with Bayes. Think about classical statistics, at least the sort that gets used in political science. Linear regression requires multivariate calculus too (or some pretty slick algebra or geometry) to get that least-squares solution. Not to mention the interpretation of the standard error. And then there’s logistic regression. Going further we move to popular machine learning methods which are *really* gonna seem like nothing more than black boxes. Kidz today all wanna do deep learning or random forests or whatever. And that’s fine. But no way are most of them learning the math behind it.

Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works.

So, in keeping with this attitude, teach Stan. Students set up the model, they push the button, they get the answers. No integrals required. Yes, you have to work with posterior simulations so there is integration implicitly—the conceptual load is not zero—but I think (hope?) that this approach of using simulations to manage uncertainty is easier and more direct than expressing everything in terms of integrals.
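In that spirit, here is a minimal sketch of the kind of Stan program a student might write (a simple linear regression; variable names are generic placeholders): the model block reads almost like the statistical notation, and the sampler handles the integration.

```stan
// Minimal linear regression: the kind of model a first-semester
// student can write and fit without ever seeing an integral.
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  y ~ normal(a + b * x, sigma);
}
```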

But it’s not just model fitting, it’s also model building and model checking. Cross validation, graphics, etc. You need less mathematical sophistication to *evaluate* a method than to construct it.

About ten years ago I wrote an article, “Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .” After briefly talking about a course that uses the BDA book and assumes that students know calculus, I continued:

My applied regression and multilevel modeling class has no derivatives and no integrals—it actually has less math than a standard regression class, since I also avoid matrix algebra as much as possible! What it does have is programming, and this is an area where many of the students need lots of practice. The course is Bayesian in that all inference is implicitly about the posterior distribution. There are no null hypotheses and alternative hypotheses, no Type 1 and Type 2 errors, no rejection regions and confidence coverage.

It’s my impression that most applied statistics classes don’t get into confidence coverage etc., but they can still mislead students by giving the impression that those classical principles are somehow fundamental. My class is different because I don’t pretend in that way. Instead I consider a Bayesian approach as foundational, and I teach students how to work with simulations.

My article continues:

Instead, the course is all about models, understanding the models, estimating parameters in the models, and making predictions. . . . Beyond programming and simulation, probably the Number 1 message I send in my applied statistics class is to focus on the deterministic part of the model rather than the error term. . . .

Even a simple model such as y = a + b*x + error is not so simple if x is not centered near zero. And then there are interaction models—these are incredibly important and so hard to understand until you’ve drawn some lines on paper. We draw lots of these lines, by hand and on the computer. I think of this as Bayesian as well: Bayesian inference is conditional on the model, so you have to understand what the model is saying.
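To see why centering matters, here is a tiny stdlib sketch with made-up numbers: fitting least squares by hand, the intercept of the uncentered fit is the predicted outcome at x = 0, which is often meaningless, while after centering it is just the mean outcome.

```python
# Least-squares fit of y = a + b*x by hand (stdlib only).
def fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )
    return ybar - b * xbar, b  # intercept, slope

# Hypothetical data: an outcome measured against calendar year
years = [2000, 2004, 2008, 2012, 2016]
y = [51.2, 48.7, 52.9, 51.1, 46.1]

a_raw, b_raw = fit(years, y)                  # intercept = prediction at year 0
a_c, b_c = fit([t - 2008 for t in years], y)  # center x near zero

# The slope is unchanged, but the centered intercept is just mean(y):
# the predicted outcome for a typical year.
print(round(a_c, 2) == round(sum(y) / len(y), 2))  # True
```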


The post The meta-hype algorithm appeared first on Statistical Modeling, Causal Inference, and Social Science.

Kevin Lewis pointed me to this article:

There are several methods for building hype. The wealth of currently available public relations techniques usually forces the promoter to judge, a priori, what will likely be the best method.

Meta-hype is a methodology that facilitates this decision by combining all identified hype algorithms pertinent for a particular promotion problem. Meta-hype generates a final press release that is at least as good as any of the other models considered for hyping the claim. The overarching aim of this work is to introduce meta-hype to analysts and practitioners. This work compares the performance of journal publication, preprints, blogs, Twitter, TED talks, NPR, and meta-hype to predict successful promotion. A nationwide database including 89,013 articles, tweets, and news stories was used. All algorithms were evaluated using the total publicity value (TPV) in a test sample that was not included in the training sample used to fit the prediction models. TPV for the models ranged between 0.693 and 0.720. Meta-hype was superior to all but one of the algorithms compared. An explanation of meta-hype steps is provided. Meta-hype is the first step in targeted hype, an analytic framework that yields double hyped promotion with fewer assumptions than the usual publicity methods. Different aspects of meta-hype depending on the context, its function within the targeted promotion framework, and the benefits of this methodology in the addiction to extreme claims are discussed.

I can’t seem to find the link right now, but you get the idea.

**P.S.** The above is a parody of the abstract of a recent article on “super learning” by Acion et al. I did not include a link because the parody was not supposed to be a criticism of the content of the paper in any way; I just thought some of the wording in the abstract was kinda funny. Indeed, I thought I’d disguised the abstract enough that no one would make the connection but I guess Google is more powerful than I’d realized.

But this discussion by Erin in the comment thread revealed that some people were taking this post in a way I had not intended. So I added this comment.

tl;dr: I’m not criticizing the content of the Acion et al. paper in any way, and the above post was not intended to be seen as such a criticism.


The post Would you prefer three N=300 studies or one N=900 study? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’ve been thinking about this thought experiment:

—

Imagine you’re given two papers.

Both papers explore the same topic and use the same methodology. Both were preregistered.

Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).

Paper B has a novel study with confirmed hypotheses (n=900).

*Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

—

I’m reasonably certain the answer is that both papers provide the same amount of evidence, by essentially the likelihood principle, and if anything, one should trust the estimates of paper B more (unless you meta-analyzed paper A, which should give you the same answer as paper B, more or less).

However, my intuition was correct that most people in this group would choose paper A (See https://www.facebook.com/groups/853552931365745/permalink/1343285629059137/ for poll results).

My reasoning is that if you are observing data from the same DGP, then where you cut the data off is arbitrary; why would flipping a coin 10x, 10x, 10x, 10x, 10x provide more evidence than flipping the coin 50x? The method in paper A essentially just collected 300, drew a line, collected 300, drew a line, then collected 300 more, and called them three studies; this has no more information in sum (in a Fisherian sense, the information would just add together) than if you didn’t arbitrarily cut the data into sections.

If you read the comments in this group (which has researchers predominantly of the NHST world), one sees the fallacy that merely passing a threshold more times means you have more evidence. They use p*p*p to justify it (even though that doesn’t make sense, because one could partition the data into 10 n=90 sets and get ‘more evidence’ by this logic; in fact, you could have 90 p-values of ~.967262 and get a p-value of .05). They use Fisher’s method to say the combined p-value could be low (~.006), even though, had the data been pooled, the p-value would actually be even lower (~.0007). One employs only Neyman-Pearson logic, and this results in a Type 1 error probability of .05^3.

I replied:

What do you mean by “confirmed hypotheses,” and what do you mean by a replication being “successful”? And are you assuming that the data are identical in the two scenarios?

To which Martin answered:

I [Martin], in a sense, left it ambiguous because I suspected that knowing nothing else, people would put paper A, even though asymptotically it should provide the same information as paper B.

I also left ‘confirmed hypothesis’ vague, because I didn’t want to say one must use one given framework. Basically, the hypotheses were supported by whatever method one uses to judge support (whether it be p-values, posteriors, bayes factors, whatever).

Successful replication as in, the hypotheses were supported again in the replication studies.

Finally, my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.

That said, if you are experimenter A gaining three n=300 samples, your data should asymptotically (or, over infinite datasets) equal that of experimenter B gaining one n=900 sample (over infinite datasets), in the sense that the expected total information is equal, and the accumulated evidence should be equal. Therefore, even if any given two papers have different datasets, asymptotically they should provide equal information, and there’s not a good reason to prefer three smaller studies over 1 larger one.

Yet, knowing nothing else, people assumed paper A, I think, because three studies is more intuitively appealing than one large study, even if the two could be interchangeable had you divided the larger sample into three, or combined the smaller samples into 1.

From my perspective, Martin’s question can’t really be answered because I don’t know what’s in papers A and B, and I don’t know what is meant by a replication being “successful.” I think the answer depends a lot on these pieces of information, and I’m still not quite sure what Martin’s getting at here. But maybe some of you have thoughts on this one.
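For what it’s worth, Martin’s arithmetic above can be checked. A quick stdlib script (my own sketch, not part of the original exchange) verifies the two numerical claims: that the product of 90 p-values of ~.967262 is about .05, and that Fisher’s method applied to three studies each with p = .05 gives roughly .006:

```python
import math

# Claim 1: multiplying p-values is not a sensible way to combine evidence;
# ninety p-values of ~0.967262 have a product of about 0.05.
p_each = 0.967262
product = p_each ** 90
print(round(product, 4))  # ~0.05

# Claim 2: Fisher's method for k = 3 studies, each with p = 0.05.
# Test statistic: -2 * sum(log p_i), chi-squared with 2k degrees of freedom.
k = 3
x = -2 * k * math.log(0.05)  # ~17.97, df = 6

# Chi-squared survival function for even df has a closed form:
# P(X > x) = exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
def chi2_sf_even_df(x, df):
    h = x / 2.0
    return math.exp(-h) * sum(h**j / math.factorial(j) for j in range(df // 2))

p_fisher = chi2_sf_even_df(x, 2 * k)
print(round(p_fisher, 4))  # ~0.0063
```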


The post Drug-funded profs push drugs appeared first on Statistical Modeling, Causal Inference, and Social Science.

I just read a long ProPublica article that I think your blog commenters might be interested in. It’s from February, but was linked to by the Mad Biologist today (https://mikethemadbiologist.com/). Here is a link to the article: https://www.propublica.org/article/big-pharma-quietly-enlists-leading-professors-to-justify-1000-per-day-drugs

In short, it’s about a group of professors (mainly economists) who founded a consulting firm that works for many big pharma companies. They publish many peer-reviewed articles, op-eds, blogs, etc., on the debate about high pharmaceutical prices, always coming to the conclusion that high prices are a net benefit (high prices -> more innovation -> better treatments in the future vs. poor people having no access to existing treatments today). They are also at best very inconsistent about disclosing their affiliations and funding.

One minor thing that struck me is the following passage, about their response to a statistical criticism of one of their articles:

The founders of Precision Health Economics defended their use of survival rates in a published response to the Dartmouth study, writing that they “welcome robust scientific debate that moves forward our understanding of the world” but that the research by their critics had “moved the debate backward.”

The debate here appears to be about lead-time bias – increased screening leads to earlier detection, which can increase survival rates without actually improving outcomes. So on the face of it, this doesn’t seem like an outrageous criticism. If they have controlled for it appropriately, they should welcome a “robust debate” so they can convince their critics and gain more support for increasing drug prices! Of course, I doubt they have any interest in actually having this debate. It seems similar to the responses you get from Wansink, Cuddy, or the countless other researchers promoting flawed studies who have been featured on your blog when they are confronted with valid criticism: sound reasonable, do nothing, and get let off the hook.

This interests me because I consult for pharmaceutical companies. I don’t really have anything to add, but this sort of conflict of interest does seem like something to worry about.

We talk a lot on this blog about bad science that’s driven by some combination of careerism and naivete. We shouldn’t forget about the possibility of flat-out corruption.


The post Journals for insignificant results appeared first on Statistical Modeling, Causal Inference, and Social Science.

I know you’re not a fan of hypothesis testing, but the journals in this blog post are an interesting approach to the file-drawer problem. I’ve never heard of them or their like. An alternative take (given standard academic practice) is “Journal for XYZ Discipline papers that p-hacking and forking paths could not save.”

Psychology: Journal of Articles in Support of the Null Hypothesis

Biomedicine: Journal of Negative Results in Biomedicine

Ecology and Evolutionary Biology: Journal of Negative Results

In psychology, this sort of journal isn’t really needed because we already have PPNAS, where they publish articles in support of the null hypothesis all the time; they just don’t realize it!

OK, ok, all jokes aside, the above post recommends:

Is it time for Economics to catch up? . . . a number of prominent Economists have endorsed this idea (even if they are not ready to pioneer the initiative). So, imagine… a call for papers along the following lines:

Series of Unsurprising Results in Economics (SURE)

Is the topic of your paper interesting, your analysis carefully done, but your results are not “sexy”? If so, please consider submitting your paper to SURE. An e-journal of high-quality research with “unsurprising” findings.

How does it work:

— We accept papers from all fields of Economics…

— Which have been rejected at a journal indexed in EconLit…

— With the ONLY important reason being that their results are statistically insignificant or otherwise “unsurprising”.

I can’t imagine this working. Why not just publish everything on SSRN or whatever, and then this SURE can just link to the articles in question (along with the offending referee reports)?

Also, I’m reminded of the magazine McSweeney’s, which someone once told me had been founded based on the principle of publishing stories that had been rejected elsewhere.

The post Teaching Statistics: A Bag of Tricks (second edition) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Hey! Deb Nolan and I finished the second edition of our book, Teaching Statistics: A Bag of Tricks. You can pre-order it here.

I love love love this book. As William Goldman would say, it’s the “good parts version”: all the fun stuff without the standard boring examples (counting colors of M&M’s, etc.). Great stuff for teaching, and I’ve also been told it’s a fun read for students of statistics.

Here’s the table of contents. If this doesn’t look like fun to you, don’t buy the book.

The post Representists versus Propertyists: RabbitDucks – being good for what? appeared first on Statistical Modeling, Causal Inference, and Social Science.

It is not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability, etc.) for every sample, or for some samples, or the same distribution of outputs, or the same expectations of outputs, or just close-enough expectations of outputs. Then, I would argue, one has a variation on a DuckRabbit. In the DuckRabbit, the same sign represents different objects with different interpretations (what to make of it), whereas here we have differing signs (models) representing the same object (an inference of interest) with different interpretations (what to make of them). I will imaginatively call this a RabbitDuck ;-)

Does one always choose a Rabbit or a Duck, or sometimes one or another or always both? I would argue the higher road is both – that is to use differing models to collect and consider the different interpretations. Multiple perspectives can always be more informative (if properly processed), increasing our hopes to find out how things actually are by increasing the chances and rate of getting less wrong. Though this getting less wrong is in expectation only – it really is an uncertain world.

Of course, in statistics a good guess for the Rabbit interpretation would be Bayesian and for the Duck, Frequentest (Canadian spelling). Though, as one of Andrew’s colleagues once argued, it is really modellers versus non-modellers rather than Bayesians versus Frequentests, and that makes a lot of sense to me. Representists are Rabbits “conjecturing, assessing, and adopting idealized representations of reality, predominantly using probability generating models for both parameters and data” while Propertyists are Ducks “primarily being about discerning procedures with good properties that are uniform over a wide range of possible underlying realities and restricting use, especially in science, to just those procedures” from here. Given that “idealized representations of reality” can only be indirectly checked (i.e., always remain possibly wrong) and “good properties” always beg the question “good for what?” (as well as only hold over a range of possible but largely unrepresented realities), it should be a no-brainer that it would be more profitable than not to think through both perspectives thoroughly (and more, actually).

An alternative view might be Leo Breiman’s “two cultures” paper.

This issue of multiple perspectives also came up in Bob’s recent post where the possibility arose that some might think it taboo to mix Bayes and Frequentist perspectives.

Some case studies would be:

Case study 1: The Bayesian inference completely solves the multiple comparisons problem post.

In this blog post, Andrew implements and contrasts both the Rabbit route and the Duck route to get uncertainty intervals (using simulation for ease of wide understanding). It turns out that the intervals will not be different under a flat prior, while increasingly different under increasingly informative priors. Now, the Duck route guarantees a property that is considered important and good by many – “uniform confidence coverage” – and by some, even mandatory (e.g. see here). The Rabbit route with a flat prior also happens to have this property (as it gives the same intervals). Perhaps to inform the “good for what?” question, Andrew additionally evaluates another property: making “claims with confidence” (type S and M error rates).

With respect to this property of “claims with confidence,” the Duck route (and the Rabbit route with flat prior) does not do so well – horribly, actually. Now, informed with these two perspectives, it seems almost obvious that if a prior centred at zero and not too wide (implying large and very large effects are unlikely) is a reasonable “idealized representation of reality” for the area one is working in, the Rabbit route will have good properties, while the Duck route’s guaranteed “good property” ain’t so good for you. On the other hand, if effects of any size are all just as likely (which would be a strange universe to live in, perhaps not even possible) and you always keep in mind all the intervals you encounter, the Duck route will be fine.
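
To make the contrast concrete, here is a minimal simulation sketch of the two routes. All numbers are hypothetical (true effects drawn from N(0, 0.5²), standard errors of 1); it illustrates the general idea, not the specific example in Andrew’s post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers: true effects are small, theta ~ N(0, 0.5^2),
# and each study reports an estimate with standard error 1.
tau, sigma, n = 0.5, 1.0, 200_000
theta = rng.normal(0, tau, n)          # true effects
y = theta + rng.normal(0, sigma, n)    # noisy estimates

# Duck route (flat prior): claim an effect when the classical 95%
# interval y +/- 1.96*sigma excludes zero.
claim_flat = np.abs(y) > 1.96 * sigma

# Rabbit route: shrink toward zero using the normal-normal posterior.
shrink = tau**2 / (tau**2 + sigma**2)
post_mean = shrink * y
post_sd = np.sqrt(shrink) * sigma
claim_bayes = np.abs(post_mean) > 1.96 * post_sd

def type_s(claim):
    # Among confident claims, how often does the estimate have the
    # wrong sign relative to the true effect?
    return np.mean(np.sign(y[claim]) != np.sign(theta[claim]))

print(claim_flat.mean(), type_s(claim_flat))
print(claim_bayes.mean(), type_s(claim_bayes))
```

The flat-prior route makes many more confident claims, and a disturbing fraction of them get the sign of the effect wrong; the shrinkage route claims far less often and with far fewer sign errors.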

Case study 2: The Bayesian Bootstrap

In the paper, Rubin outlines a Bayesian bootstrap that provides close-enough expectations of outputs to the simple or vanilla bootstrap, and argues that the implicit prior involved is _silly_ for some or many empirical research applications, and hence shows that the vanilla bootstrap is not an “analytic panacea that allows us to pull ourselves up by the bootstraps”. The bootstrap simply cannot avoid sensitivity to model assumptions. And in this post I am emphasising that _any_ model assumptions that give rise to a procedure with similar enough properties, whether or not they were considered, used, or even believed relevant, should be thought through. Not sure where this “case study” sits today – at one point Brad Efron was advancing ideas based on the bootstrap “as an automatic device for constructing Welch and Peers’ (1963) ‘probability matching priors’”.

An aside I find interesting in this paper of Rubin’s is the italicized phrase “with replacement”. It might be common knowledge today that the vanilla bootstrap simply samples from all possible sample paths of length n with replacement, but certainly in 1981 few seemed to realise that. I know because when Peter McCullagh presented work that was later published as Re-sampling and exchangeable arrays (2000) at the University of Toronto, I pointed this out to him and his response indicated he was not aware of this.
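
A small sketch of the connection, viewing both bootstraps as random weights on the observations (the data and settings below are made up for illustration): the vanilla bootstrap gives observation i a weight Multinomial(n, 1/n)_i / n, while Rubin’s Bayesian bootstrap draws the weight vector from Dirichlet(1, …, 1). The two induce nearly the same distribution for a weighted statistic such as the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=50)   # toy data, purely illustrative
n, reps = len(x), 20_000

# Vanilla bootstrap: resampling n points with replacement is the same
# as giving observation i the weight Multinomial(n, 1/n)_i / n.
w_vanilla = rng.multinomial(n, np.full(n, 1 / n), size=reps) / n

# Bayesian bootstrap (Rubin 1981): weights drawn from Dirichlet(1,...,1).
w_bayes = rng.dirichlet(np.ones(n), size=reps)

# Both induce nearly the same distribution for the weighted mean.
means_vanilla = w_vanilla @ x
means_bayes = w_bayes @ x
print(means_vanilla.std(), means_bayes.std())
```

The two standard deviations come out nearly identical, which is the “close enough expectations of outputs” point: two quite different model assumptions, one procedure in practice.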

Case study 3: Bayarri et al.’s Rejection Odds and Rejection Ratios.

This is a suggested Bayesian/Frequentest compromise for replacing the dreaded p-value/NHST. It is not being put forward as the best replacement but rather as one that can be easily adopted widely – Bayes with training wheels, or a Frequentest approach with better-balanced errors. Essentially a Bayesian inference that matches a frequentest expectation, with the argument that “Any curve that has the right frequentist expectation is a valid frequentist report.”
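
The arithmetic of the proposal is simple. As a sketch (my paraphrase of the paper’s pre-experimental calculation; the function name is mine, not theirs):

```python
def rejection_ratio(prior_odds, power, alpha):
    # Pre-experimental "rejection ratio": the odds that H1 rather than
    # H0 is true, given that the test rejects at level alpha, are
    # (prior odds of H1) * (power / alpha).
    return prior_odds * power / alpha

# With even prior odds, 80% power, and alpha = 0.05, a rejection
# favors H1 by about 16 to 1.
print(rejection_ratio(1.0, 0.80, 0.05))
```

The appeal is that a low-powered study rejecting at alpha = 0.05 carries much weaker evidence than the nominal p-value suggests, and this one-line calculation makes that visible.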

I am not expecting most readers to read even one of these case studies, but rather hoping that readers who do, or already have, might share their views in the comments.

The post My proposal for JASA: “Journal” = review reports + editors’ recommendations + links to the original paper and updates + post-publication comments appeared first on Statistical Modeling, Causal Inference, and Social Science.

Whenever they’ve asked me to edit a statistics journal, I say no thank you, because I think I can make more of a contribution through this blog. I’ve said no enough times that they’ve stopped asking me. But I’ve had an idea for a while and now I want to do it.

I think that journals should get out of the publication business and recognize that their goal is curation. My preferred model is that everything gets published on some sort of super-Arxiv, and then the role of an organization such as the Journal of the American Statistical Association is to pick papers to review and to recommend. The “journal” is then the review reports plus the editors’ recommendations plus links to the original paper and any updates plus post-publication comments.

If JASA is interested in going this route, I’m in.

The post My talk this Friday in the Machine Learning in Finance workshop appeared first on Statistical Modeling, Causal Inference, and Social Science.

This is kinda weird because I don’t know anything about machine learning in finance. I guess the assumption is that statistical ideas are not domain specific. Anyway, here it is:

What can we learn from data?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

The standard framework for statistical inference leads to estimates that are horribly biased and noisy for many important examples. And these problems become even worse as we study subtle and interesting new questions. Methods such as significance testing are intended to protect us from hasty conclusions, but they have backfired: over and over again, people think they have learned from data but they have not. How can we have any confidence in what we think we’ve learned from data? One appealing strategy is replication and external validation, but this can be difficult in the real world of social science. We discuss statistical methods for actually learning from data without getting fooled.

The post Reputational incentives and post-publication review: two (partial) solutions to the misinformation problem appeared first on Statistical Modeling, Causal Inference, and Social Science.

We’ve seen lots of examples. Here are just a few:

– Political scientist Larry Bartels is committed to a model of politics in which voters make decisions based on irrelevant information. He’s published claims about shark attacks deciding elections and subliminal smiley faces determining attitudes about immigration. In both cases, second looks by others showed that the evidence wasn’t really there. I think Bartels was sincere; he just did sloppy analyses—statistics is hard!—and jumped to conclusions that supported his existing views.

– New York Times columnist David Brooks has a habit of citing statistics that fall apart under closer inspection. I think Brooks believes these things when he writes them—OK, I guess he never really believed that Red Lobster thing, he must really have been ~~lying~~ exercising poetic license on that one—but what’s important is that these stories work to make his political points, and he doesn’t care when they’re proved wrong.

– David’s namesake and fellow NYT op-ed columnist Arthur Brooks stepped in it once or twice when reporting survey data. He wrote that Tea Party supporters were happier than other voters, but a careful look at the data suggested the opposite. A. Brooks’s conclusions were counterintuitive and supported his political views; they just didn’t happen to line up with reality.

– The familiar menagerie from the published literature in social and behavioral sciences: himmicanes, air rage, ESP, ages ending in 9, power pose, pizzagate, ovulation and voting, ovulation and clothing, beauty and sex ratio, fat arms and voting, etc etc.

– Gregg Easterbrook writing about politics.

And . . . we have a new one. A colleague emailed me expressing annoyance at a recent NYT op-ed by historian Stephanie Coontz entitled, “Do Millennial Men Want Stay-at-Home Wives?”

Emily Beam does the garbage collection. The short answer is that, no, there’s no evidence that millennial men want stay-at-home wives. Here’s Beam:

You can’t say a lot about millennials based on talking to 66 men.

The GSS surveys are pretty small – about 2,000-3,000 per wave – so once you split by sample, and then split by age, and then exclude the older millennials (age 26-34) who don’t show any negative trend in gender equality, you’re left with cells of about 60-100 men ages 18-25 per wave. . . .

Suppose you want to know whether there is a downward trend in young male disagreement with the women-in-the-kitchen statement. Using all available GSS data, there is a positive, not statistically significant trend in men’s attitudes (more disagreement). Starting in 1988 only, there is very, very small negative, not statistically significant effect.

Only if we pick 1994 as a starting point, as Coontz does, ignoring the dip just a few years prior, do we see a negative less-than half-percentage point drop in disagreement per year, significant at the 10-percent level.
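
Beam’s point about start-year sensitivity is easy to demonstrate. With small per-wave samples, the fitted trend can move around depending on which year you start from, even when the true rate is flat. A toy simulation (all numbers hypothetical, loosely GSS-sized; not Beam’s actual data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical GSS-like series: a flat true rate of 0.80 ("disagree"),
# measured every two years on only ~80 young men per wave.
years = np.arange(1977, 2015, 2)
n_per_wave = 80
phat = rng.binomial(n_per_wave, 0.80, size=len(years)) / n_per_wave

# Fit a least-squares trend using each possible start year in turn.
slopes = np.array([
    np.polyfit(years[s:], phat[s:], 1)[0]
    for s in range(len(years) - 5)
])

# Even though the true trend is exactly zero, the fitted slope moves
# around depending on where you start the series.
print(slopes.min(), slopes.max())
```

Pick the start year after a noise-driven dip and you manufacture a “rise”; pick it before and the trend vanishes. That is the forking path in miniature.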

To Coontz’s (or the NYT’s) credit, they followed up with a correction, but it’s half-assed:

The trend still confirms a rise in traditionalism among high school seniors and 18-to-25-year-olds, but the new data shows that this rise is no longer driven mainly by young men, as it was in the General Social Survey results from 1994 through 2014.

And at this point I have no reason to believe anything that Coontz says on this topic, any more than I’d trust what David Brooks has to say about high school test scores or the price of dinner at Red Lobster, or Arthur Brooks on happiness measurements, or Susan Fiske on himmicanes, power pose, and air rage. All these people made natural mistakes but then were overcommitted, in part, I suspect, because the mistaken analyses supported what they’d like to think is true.

But it’s good enough for the New York Times, or PPNAS, right?

The question is, what to do about it. Peer review can’t be the solution: for scientific journals, the problem with peer review is the peers, and when it comes to articles in the newspaper, there’s no way to do systematic review. The NYT can’t very well send all their demography op-eds to Emily Beam and Jay Livingston, after all. Actually, maybe they could—it’s not like they publish so many op-eds on the topic—but I don’t think this is going to happen.

So here are two solutions:

**1. Reputational incentives.** Make people own their errors. It’s sometimes considered rude to do this, to remind people that Satoshi Kanazawa Satoshi Kanazawa Satoshi Kanazawa published a series of papers that were dead on arrival because the random variation in his data was so much larger than any possible signal. Or to remind people that Amy Cuddy Amy Cuddy Amy Cuddy still goes around promoting power pose even though the first author on that paper has disowned the entire thing. Or that John Bargh John Bargh John Bargh made a career out of a mistake and now refuses to admit his findings didn’t replicate. Or that David Brooks David Brooks David Brooks reported false numbers and then refused to correct them. Or that Stephanie Coontz Stephanie Coontz Stephanie Coontz jumped to conclusions based on a sloppy reading of trends from a survey.

But . . . maybe we need these negative incentives. If there’s a positive incentive for getting your name out there, there should be a negative incentive for getting it wrong. I’m not saying the positive and negative incentives should be equal, just that there should be more of a motivation for people to check what they’re doing.

And, yes, don’t keep it a secret that I published a false theorem once, and, another time, had to retract the entire empirical section of a published paper because we’d reverse-coded a key variable in our analysis.

**2. Post-publication review.**

I’ve talked about this one before. Do it for real, in scientific journals and also the newspapers. Correct your errors. And, when you do so, link to the people who did the better analyses.

Incentives and post-publication review go together. To the extent that David Brooks is known as the guy who reports made-up statistics and then doesn’t correct them—if this is his reputation—this gives the incentives for future Brookses (if not David himself) to prominently correct his mistakes. If Stephanie Coontz and the New York Times don’t want to be mocked on twitter, they’re motivated to follow up with serious corrections, not minimalist damage control.

**Some perspective here**

Look, I’m not talking about tarring and feathering here. The point is that incentives are real; they already exist. You really do (I assume) get a career bump from publishing in Psychological Science and PPNAS, and your work gets more noticed if you publish an op-ed in the NYT or if you’re featured on NPR or Ted or wherever. If all incentives are positive, that creates problems. It creates a motivation for sloppy work. It’s not that anyone is *trying* to do sloppy work.

**Econ gets it (pretty much) right**

Say what you want about economists, but they’ve got this down. First off, they understand the importance of incentives. Second, they’re harsh, harsh critics of each other. There’s not much of an econ equivalent to quickie papers in Psychological Science or PPNAS. Serious econ papers go through tons of review. Duds still get through, of course (even some duds in PPNAS). But, overall, it seems to me that economists avoid what might be called the “happy talk” problem. When an economist publishes something, he or she tries to get it right (politically-motivated work aside), in awareness that lots of people are on the lookout for errors, and this will rebound back to the author’s reputation.

The post Donald Trump’s nomination as an unintended consequence of Citizens United appeared first on Statistical Modeling, Causal Inference, and Social Science.

A key part of the story is that so many of the non-Trump candidates stayed in the race so long because everyone thought Trump was doomed, so they were all trying to grab Trump’s support when he crashed. Instead, Trump didn’t crash, and he benefited from the anti-Trump forces not coordinating on an alternative.

David Banks shares a theory of how it was that these candidates all stayed in so long:

I [Banks] see it as an unintended consequence of Citizens United. Before that [Supreme Court] decision, the $2000 cap on what individuals/corporations could contribute largely meant that if a candidate did not do well in one of the first three primaries, they pretty much had to drop out and their supporters would shift to their next favorite choice. But after Citizens United, as long as a candidate has one billionaire friend, they can stay in the race through the 30th primary if they want. And this is largely what happened. Trump regularly got the 20% of the straight-up crazy Republican vote, and the other 15 candidates fragmented the rest of the Republicans for whom Trump was the least liked candidate. So instead of Rubio dropping out after South Carolina and his votes shifting over to Bush, and Fiorina dropping out and her votes shifting to Bush, so that Bush would jump from 5% to 10% to 15% to 20% to 25%, etc., we wound up very late in the primaries with Trump looking like the most dominant candidate in the field.

Of course, things are much more complex than this facile theory suggests, and lots of other things were going on in parallel. But it still seems to me that this partly explains how Trump threaded the needle to get the Republican nomination.

Interesting. I’d not seen this explanation before so I thought I’d share it with you.

The post Fitting hierarchical GLMs in package *X* is like driving car *Y* appeared first on Statistical Modeling, Causal Inference, and Social Science.

Given that Andrew started the Gremlin theme (the car in the image at the right), I thought it would only be fitting to link to the following amusing blog post:

- Chris Brown: Choosing R packages for mixed effects modelling based on the car you drive (on the seascape models blog)

It’s exactly what it says on the tin. I won’t spoil the punchline, but will tell you the packages considered are: lme4, JAGS, RStan(Arm), and INLA.

**What do you think?**

Anyway, don’t take my word for it—read the original post. I’m curious about others’ take on systems for fitting GLMs and how they compare (to cars or otherwise).

**You might also like…**

My favorite automotive analogy was made in the following essay, from way back in the first dot-com boom:

- Neal Stephenson. 1999. In the beginning was the command line.

Although it’s about operating systems, the satirical take on closed- vs. open-source is universal.

**(Some of) what I thought**

Chris Brown reports in the post,

I simulated a simple hierarchical data-set to test each of the models. The script is available here. The data-set has 100 binary measurements. There is one fixed covariate (continuous) and one random effect with five groups. The linear predictor was transformed to binomial probabilities using the logit function. For the Bayesian approaches, slightly different priors were used for each package, depending on what was available. See the script for more details on priors.

*Apples and oranges*. This doesn’t make a whole lot of sense, given that lme4 is giving you max marginal likelihood, whereas JAGS and Stan give you full Bayes. And if you use different priors in Stan and JAGS, you’re not even fitting the same posterior. I’m afraid I’ve never understood INLA (lucky for me Dan Simpson’s visiting us this week, so there’s no time like the present to learn it). You’ll also find that relative performance of Stan and JAGS will vary dramatically based on the shape of the posterior and scale of the data.

*It’s all about effective sample size*. The author doesn’t mention the subtlety of choosing a way to estimate effective sample size (RStan’s is more conservative than the Coda package’s, using a variance approach like that of the split R-hat we use to detect convergence problems in RStan).
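
For readers unfamiliar with the quantity under debate, here is a crude sketch of one simple effective-sample-size estimator (an initial-positive-sequence truncation of the autocorrelation sum; deliberately not RStan’s split-R-hat-based estimator, and written in Python rather than R for self-containedness):

```python
import numpy as np

def ess(chain):
    # Crude effective sample size: n / (1 + 2 * sum of leading positive
    # autocorrelations), truncated at the first negative autocorrelation.
    # A simplified sketch, not RStan's more conservative estimator.
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    s = 0.0
    for r in rho[1:]:
        if r < 0:
            break
        s += r
    return n / (1 + 2 * s)

rng = np.random.default_rng(3)
iid = rng.normal(size=5000)

# A sticky AR(1) chain (autocorrelation 0.9) of the same length.
ar = np.empty(5000)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

print(ess(iid), ess(ar))  # the AR(1) chain has far fewer effective draws
```

Two chains of identical length can differ by an order of magnitude in effective draws, which is why comparing samplers by wall time per iteration, rather than per effective sample, is misleading.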

*Random processes are hard to compare*. You’ll find a lot of variation across runs with different random inits. You really want to start JAGS and Stan at the same initial points and run to the same effective sample size over multiple runs and compare averages and variation.

*RStanArm, not RStan*. I looked at the script, and it turns out the post is comparing RStanArm, not coding a model in Stan itself and running it in RStan. Here’s the code.

library(rstanarm)
t_prior <- student_t(df = 4, location = 0, scale = 2.5)
mb.stanarm <- microbenchmark(
  mod.stanarm <- stan_glmer(y ~ x + (1 | grps), data = dat,
                            family = binomial(link = 'logit'),
                            prior = t_prior, prior_intercept = t_prior,
                            chains = 3, cores = 1, seed = 10),
  times = 1L)

*Parallelization reduces wall time.* This script runs RStanArm’s three Markov chains on a single core, meaning they have to run one after the other. This can obviously be sped up by a factor of up to the number of cores you have, by letting the chains all run at the same time. Presumably JAGS could be sped up the same way. The multiple chains are embarrassingly parallelizable, after all.

*It’s hard to be fair!* There’s a reason we don’t do a lot of these comparisons ourselves!

The post “Do you think the research is sound or is it gimmicky pop science?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

David Nguyen writes:

I wanted to get your opinion on http://www.scienceofpeople.com/. Do you think the research is sound or is it gimmicky pop science?

My reply: I have no idea. But since I see no evidence on the website, I’ll assume it’s pseudoscience until I hear otherwise. I won’t believe it until it has the endorsement of Susan T. Fiske.

**P.S.** Oooh, that Fiske slam was so unnecessary, you say. But she still hasn’t apologized for falling asleep on the job and greenlighting himmicanes, air rage, and ages ending in 9.

The post Organizations that defend junk science are ~~pitiful suckers~~ get conned and conned again appeared first on Statistical Modeling, Causal Inference, and Social Science.

So. Cornell stands behind Wansink, and Ohio State stands behind Croce. George Mason University bestows honors on Weggy. Penn State trustee disses “so-called victims.” Local religious leaders aggressively defend child abusers in their communities. And we all remember how long it took for Duke University to close the door on Dr. Anil Potti.

OK, I understand all these situations. It’s the sunk cost fallacy: you’ve invested a lot of your reputation in somebody; you don’t want to admit that, all this time, they’ve been using you.

Still, it makes me sad.

These organizations—Cornell, Ohio State, etc.—are victims as much as perpetrators. Wansink, Croce, etc., couldn’t have done it on their own: in their quest to illegitimately extract millions of corporate and government dollars, they made use of their prestigious university affiliations. A press release from a “Cornell professor” sounds so much more credible than a press release from some fast-talking guy with a P.O. box.

Cornell, Ohio State, etc., they’ve been played, and they still don’t realize it.

Remember, a key part of the long con is misdirection: make the mark think you’re his friend.

The post Causal inference conference in North Carolina appeared first on Statistical Modeling, Causal Inference, and Social Science.

Michael Hudgens announces:

Registration for the 2017 Atlantic Causal Inference Conference is now open. The registration site is here. More information about the conference, including the poster session and the Second Annual Causal Inference Data Analysis Challenge can be found on the conference website here.

We held the very first Atlantic Causal Inference Conference here at Columbia twelve years ago, and it’s great to see that it has been continuing so successfully.

The post The Efron transition? And the wit and wisdom of our statistical elders appeared first on Statistical Modeling, Causal Inference, and Social Science.

Stephen Martin writes:

Brad Efron seems to have transitioned from “Bayes just isn’t as practical” to “Bayes can be useful, but EB is easier” to “Yes, Bayes should be used in the modern day” pretty continuously across three decades.

http://www2.stat.duke.edu/courses/Spring10/sta122/Handouts/EfronWhyEveryone.pdf

http://projecteuclid.org/download/pdf_1/euclid.ss/1028905930

http://statweb.stanford.edu/~ckirby/brad/other/2009Future.pdf

Also, Lindley’s comment in the first article is just GOLD:

“The last example with [lambda = theta_1 theta_2] is typical of a sampling theorist’s impractical discussions. It is full of Greek letters, as if this unusual alphabet was a repository of truth.” To which Efron responded: “Finally, I must warn Professor Lindley that his brutal, and largely unprovoked, attack has been reported to FOGA (Friends of the Greek Alphabet). He will be in for a very nasty time indeed if he wishes to use as much as an epsilon or an iota in any future manuscript.”

“Perhaps the author has been falling over all those bootstraps lying around.”

“What most statisticians have is a parody of the Bayesian argument, a simplistic view that just adds a woolly prior to the sampling-theory paraphernalia. They look at the parody, see how absurd it is, and thus dismiss the coherent approach as well.”

I pointed Stephen to this post and this article (in particular the bottom of page 295). Also this, I suppose.

The post Causal inference conference at Columbia University on Sat 6 May: Varying Treatment Effects appeared first on Statistical Modeling, Causal Inference, and Social Science.

Hey! We’re throwing a conference:

Varying Treatment Effects

The literature on causal inference focuses on estimating average effects, but the very notion of an “average effect” acknowledges variation. Relevant buzzwords are treatment interactions, situational effects, and personalized medicine. In this one-day conference we shall focus on varying effects in social science and policy research, with particular emphasis on Bayesian modeling and computation.

The focus will be on applied problems in social science.

The organizers are Jim Savage, Jennifer Hill, Beth Tipton, Rachael Meager, Andrew Gelman, Michael Sobel, and Jose Zubizarreta.

And here’s the schedule:

9:30 AM

1. Heterogeneity across studies in meta-analyses of impact evaluations.

– Michael Kremer, Harvard

– Greg Fischer, LSE

– Rachael Meager, MIT

– Beth Tipton, Columbia

10:45 – 11:00 coffee break

11:00

2. Heterogeneity across sites in multi-site trials.

– David Yeager, UT Austin

– Avi Feller, Berkeley

– Luke Miratrix, Harvard

– Ben Goodrich, Columbia

– Michael Weiss, MDRC

12:30 – 1:30 Lunch

1:30

3. Heterogeneity in experiments versus quasi-experiments.

– Vivian Wong, University of Virginia

– Michael Gechter, Penn State

– Peter Steiner, U Wisconsin

– Bryan Keller, Columbia

3:00 – 3:30 afternoon break

3:30

4. Heterogeneous effects at the structural/atomic level.

– Jennifer Hill, NYU

– Peter Rossi, UCLA

– Shoshana Vasserman, Harvard

– Jim Savage, Lendable Inc.

– Uri Shalit, NYU

5:00 PM

Closing remarks: Andrew Gelman

Please register for the conference here. Admission is free but we would prefer if you register so we have a sense of how many people will show up.

We’re expecting lots of lively discussion.

**P.S.** Signup for outsiders seems to have filled up. Columbia University affiliates who are interested in attending should contact me directly.


The post I wanna be ablated appeared first on Statistical Modeling, Causal Inference, and Social Science.

Mark Dooris writes:

I am a senior staff cardiologist from Australia. I attach a paper that was presented at our journal club some time ago. It concerned me at the time. I send it as I suspect you collect similar papers. You may indeed already be aware of this paper. I raised my concerns about the “too good to be true” results and the plethora of “p-values” all in support of the desired hypothesis. I was decried as a naysayer, and some individuals wanted to set up their own clinics on the basis of the study (which might have been OK if it had been structured as a prospective randomized replication trial).

I would value your views on the statistical methods and the results… it is somewhat pleasing: fat bad… lose fat good, and it may even be true in some specific sense, but please look at the number of comparisons, which exceeds the number of patients, and how the results are almost perfectly consistent with an amazing dose response, especially the structural changes.

I am not at all asserting there is fraud; I am just pointing out how anomalous this is. Perhaps it is most likely that many of these tests were inevitably unable to be blinded… losing 20 kg would be an obvious finding in imaging. Many of the claimed differences detected in echocardiography seem to exceed the precision of the test (a test which has greater uncertainty in measurements in obese patients). Certainly the blood parameter effects may be real, but there has been no accounting for multiple comparisons.

PS: I do not know, work with or have any relationship with the authors. I am an interventional cardiologist (please don’t hold that against me) and not an electrophysiologist.

The paper that he sent is called “Long-Term Effect of Goal-Directed Weight Management in an Atrial Fibrillation Cohort: A Long-Term Follow-Up Study (LEGACY),” it’s by Rajeev K. Pathak, Melissa E. Middeldorp, Megan Meredith, Abhinav B. Mehta, Rajiv Mahajan, Christopher X. Wong, Darragh Twomey, Adrian D. Elliott, Jonathan M. Kalman, Walter P. Abhayaratna, Dennis H. Lau, and Prashanthan Sanders, and it appeared in 2015 in the Journal of the American College of Cardiology.

The topic of atrial fibrillation concerns me personally! But my body mass index is less than 27 so I don’t seem to be in the target population for this study.

Anyway, I did take a look. The study in question was observational: they divided the patients into three groups, not based on treatments that had been applied, but based on weight loss (>=10%, 3-9%, <3%; all patients had been counseled to try to lose weight). As Dooris writes, the results seem almost too good to be true: For all five of their outcomes (atrial fibrillation frequency, duration, episode severity, symptom subscale, and global well-being), there is a clean monotonic stepping down from group 1 to group 2 to group 3. I guess maybe the symptom subscale and the global well-being measure are combinations of the first three outcomes? So maybe it’s just three measures, not five, that are showing such clean trends. All the measures show huge improvements from baseline to follow-up in all groups, which I guess just demonstrates that the patients were improving in any case. Anyway, I don’t really know what to make of all this but I thought I’d share it with you.

**P.S.** Dooris adds:

I must admit to feeling embarrassed for my, perhaps, premature and excessive skepticism. I read the comments with interest.

I am sorry to read that you have some personal connection to atrial fibrillation but hope that you have made (a no doubt informed) choice with respect to management. It is an “exciting” time with respect to management options. I am not giving unsolicited advice (and as I have expressed I am just a “plumber,” not an “electrician”). I remain skeptical about the effect size and the *complete* uniformity of the findings consistent with the hypothesis that weight loss is associated with reduced symptoms of AF, reduced burden of AF, detectable structural changes on echocardiography, and uniformly positive effects on lipid profile. I want to be clear:

- I find the hypothesis plausible
- I find the implications consistent with my pre-conceptions and my current advice (this does not mean they are true or based on compelling evidence)
- The plausibility (for me) arises from

- there are relatively small studies and meta-analyses that suggest weight loss is associated with “beneficial” effects on blood pressure and lipids. However, the effects are variable. There seem to be differences between genders and between methods of weight loss. The effect size is generally smaller than in the LEGACY trial
- there is evidence of cardiac structural changes: increased chamber size, wall thickness, and abnormal diastolic function. Some studies suggest that the changes are reversible, perhaps with the most change in patients with diastolic dysfunction. I note that perhaps the largest change detected with weight loss is a reduction in epicardial fat. Some cardiac MRI studies (which have better resolution) have supported this
- there is electrophysiological data suggesting differences in electrophysiological properties in patients with atrial fibrillation related to obesity
- What concerned me about the paper was the apparent homogeneity of this particular population that seemed to allow the detection of such a strong and consistent relationship. This seemed “too good to be true.” I think it does not show the variability I would have expected:

- gender
- degree of diastolic dysfunction
- smoking
- what other changes during the period were measured?: medication, alcohol, etc.
- treatment interaction: I find it difficult to work out who got ablated, and how many attempts were made. Are the differences more related to successful ablations or to other factors?
- “blinding”: although the operator may have been blinded to patient category, patients with smaller BMI are easier to image and may have less “noisy” measurements. Are the real differences, therefore, smaller than suggested?
- I accept that the authors used repeated-measures ANOVA to account for the paired/correlated nature of the testing. However, I do not see the details of the model used.
- I would have liked to see the differences rather than just the means and SDs, as well as some graphical presentation of the data to show the variability, and some modeling of the relationship between weight loss and effect.

I guess I have not seen a paper where *everything* works out like you want. I admit that I should probably have suppressed my disbelief (and waited for replication). What’s the down side? “We got the answer we all want.” “It fits with the general results of other work.” I still feel uneasy not at least asking some questions.

I think as a profession, we medical practitioners have been guilty of “p-hacking” and of over-reacting to small studies with large effect sizes. We have spent too much time in “the garden of forking paths” and believe in where we have got to after picking through the noise for every apparent signal that suits our preconceptions. We have wonderful large-scale randomized clinical trials that seem to answer narrow but important questions, and that is great. However, we still publish a lot of lower-quality stuff and promulgate “p-hacking” and related methods to our trainees. I found the Smaldino and McElreath paper timely and instructive (I appreciate you have already seen it).

So, I sent you the email because I felt uneasy (perhaps guilty about my “p-hacking” sins of commission in the past and acceptance of such work from others).


The post Air rage rage appeared first on Statistical Modeling, Causal Inference, and Social Science.

Commenter David alerts us that Consumer Reports fell for the notorious air rage story.

Background on air rage here and here. Or, if you want to read something by someone other than me, here. This last piece is particularly devastating as it addresses flaws in the underlying research article, hype in the news reporting, and the participation of the academic researcher in that hype.

From the author’s note by Allen St. John at the bottom of that Consumer Reports story:

For me, there’s no better way to spend a day than talking to a bunch of experts about an important subject and then writing a story that’ll help others be smarter and better informed.

Shoulda talked with just one more expert. Maybe Susan T. Fiske at Princeton University—I’ve heard she knows a lot about social psychology. She’s also super-quotable!

**P.S.** A commenter notifies us that Wired fell for this one too. Too bad. I guess that air rage study is just too good to check.


The post Bayesian Posteriors are Calibrated by Definition appeared first on Statistical Modeling, Causal Inference, and Social Science.

- Cook, S.R., Gelman, A., and Rubin, D.B. 2006. Validation of software for Bayesian models using posterior quantiles. *Journal of Computational and Graphical Statistics* 15(3):675–692.

The argument is really simple, but I don’t remember seeing it in *BDA*.

**How to simulate data from a model**

Suppose we have a Bayesian model composed of a prior with probability function p(theta) and a sampling distribution with probability function p(y | theta). We then simulate parameters theta and data y as follows.

*Step 1*. Generate parameters theta according to the prior p(theta).

*Step 2*. Generate data y according to the sampling distribution p(y | theta).

There are three important things to note about the draws (theta, y):

*Consequence 1*. The chain rule shows that (theta, y) is generated from the joint p(theta, y).

*Consequence 2*. Bayes’s rule shows that theta is a random draw from the posterior p(theta | y).

*Consequence 3*. The law of total probability shows that y is a draw from the prior predictive p(y).

**Why does this matter?**

Because it means the posterior is properly calibrated, in the sense that if we repeat this process, the true parameter value for any component of the parameter vector will fall in the central 50% posterior interval for that component exactly 50% of the time under repeated simulations. Same thing for any other interval. And of course it holds up for pairs or more of variables and produces the correct dependencies among variables.
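This calibration property is easy to check by simulation in a conjugate model where the posterior has a closed form. Here is a minimal sketch (mine, not from the original post; the Beta(2, 3) prior and 20-trial binomial sampling distribution are arbitrary choices for illustration) that runs the two steps above many times and records how often the central 50% posterior interval covers the simulating parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n = 2.0, 3.0, 20   # Beta(a, b) prior; binomial sampling with n trials
n_sims = 20_000
covered = 0
for _ in range(n_sims):
    theta = rng.beta(a, b)        # Step 1: parameter from the prior
    y = rng.binomial(n, theta)    # Step 2: data from the sampling distribution
    # By conjugacy the posterior is Beta(a + y, b + n - y); draw from it
    # and form the central 50% posterior interval.
    post = rng.beta(a + y, b + n - y, size=2_000)
    lo, hi = np.quantile(post, [0.25, 0.75])
    covered += (lo <= theta <= hi)
print(f"coverage of 50% intervals: {covered / n_sims:.3f}")
```

Widening the interval (say, to 90%) moves the empirical coverage to match it, which is the sense in which the posterior is calibrated by construction.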

**What’s the catch?**

The calibration guarantee only holds if the data actually was generated from the model, i.e., from the data marginal p(y).

Consequence 3 assures us that this is so. The marginal distribution of the data in the model is also known as the prior predictive distribution; it is defined in terms of the prior and sampling distribution as

p(y) = ∫ p(y | theta) p(theta) dtheta.

It’s what we predict about potential data from the model itself. The generating process in Step 1 and Step 2 above follows the integral: because (theta, y) is drawn from the joint p(theta, y), we know that y is drawn from the prior predictive p(y). That is, we know y was generated from the model in the simulations.

**Cook et al.’s application to testing**

Cook et al. outline the above steps as a means to test MCMC software for Bayesian models. The idea is to generate a bunch of simulated parameters and data sets, then for each simulated data set y make a sequence of draws theta_1, …, theta_L according to the posterior p(theta | y). If the software is functioning correctly, everything is calibrated, and the quantile in which the simulating parameter theta falls among the draws theta_1, …, theta_L is uniformly distributed. The reason it’s uniform is that theta is equally likely to be ranked anywhere, because all of the draws, including theta itself, are just random draws from the same posterior.
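The same toy conjugate model can sketch the Cook et al. check (again my illustration, with arbitrary constants): exact Beta posterior draws play the role of “correctly functioning software,” so the rank of the simulating parameter among the posterior draws should be uniform:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 2.0, 3.0, 20           # same toy Beta-binomial conjugate model
n_sims, n_draws = 5_000, 99
ranks = np.empty(n_sims, dtype=int)
for i in range(n_sims):
    theta = rng.beta(a, b)                  # simulate the "true" parameter
    y = rng.binomial(n, theta)              # simulate data given the parameter
    post = rng.beta(a + y, b + n - y, size=n_draws)  # exact posterior draws
    ranks[i] = np.sum(post < theta)         # rank of theta among the draws
# With correct "software," ranks are uniform on {0, 1, ..., n_draws}.
freqs = np.bincount(ranks, minlength=n_draws + 1) / n_sims
print(freqs.min(), freqs.max())  # all 100 bin frequencies close to 0.01
```

If the posterior sampler were buggy (wrong scale, biased mean), the rank histogram would show a telltale U-shape or peak instead of staying flat.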

**What about overfitting?**

What overfitting? Overfitting occurs when the model fits the data too closely and fails to generalize well to new observations. In theory, if we have the right model, we get calibration, and there can’t be overfitting of this type—predictions for new observations are automatically calibrated, because predictions are just another parameter in the model. In other words, we don’t overfit by construction.

In practice, we almost never believe we have the true generative process for our data. We often make do with convenient approximations of some underlying data generating process. For example, we might choose a logistic regression because we’re dealing with binary trial data, not because we believe the log odds are truly a linear function of the predictors.

We even more rarely believe we have the “true” priors. It’s not even clear what “true” would mean when the priors are about our knowledge of the world. The posteriors suffer the same philosophical fate, being more about our knowledge than about the world. But the posteriors have the redeeming quality that we can test them predictively on new data.

In the end, the theory gives us only cold comfort.

**But not all is lost!**

To give ourselves some peace of mind that our inferences are not going astray, we try to calibrate against real data using cross-validation or actual hold-out testing. For an example, see my Stan case study on repeated binary trial data, which Ben and Jonah conveniently translated into RStanArm.

Basically, we treat Bayesian models like meteorologists—they are making probabilistic predictions, after all. To assess the competence of a meteorologist, one asks: on how many of the days about which we said it was 20% likely to rain did it actually rain? If the predictions are independent and calibrated, we’d expect the number of rainy days among N such forecasts to be distributed binomial(N, 0.2). To assess the competence of a predictive model, we can do exactly the same thing.


The post Pizzagate update! Response from the Cornell University Media Relations Office appeared first on Statistical Modeling, Causal Inference, and Social Science.

Hey! A few days ago I received an email from the Cornell University Media Relations Office. As I reported in this space, I responded as follows:

Dear Cornell University Media Relations Office:

Thank you for pointing me to these two statements. Unfortunately I fear that you are minimizing the problem.

You write, “while numerous instances of inappropriate data handling and statistical analysis in four published papers were alleged, such errors did not constitute scientific misconduct (https://grants.nih.gov/grants/research_integrity/research_misconduct.htm). However, given the number of errors cited and their repeated nature, we established a process in which Professor Wansink would engage external statistical experts to validate his review and reanalysis of the papers and attendant published errata. . . . Since the original critique of Professor Wansink’s articles, additional instances of self-duplication have come to light. Professor Wansink has acknowledged the repeated use of identical language and in some cases dual publication of materials.”

But there are many, many more problems in Wansink’s published work, beyond those 4 initially-noticed papers and beyond self-duplication.

Your NIH link above defines research misconduct as “fabrication, falsification and plagiarism, and does not include honest error or differences of opinion. . .” and defines falsification as “Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.”

This phrase, “changing or omitting data or results such that the research is not accurately represented in the research record,” is an apt description of much of Wansink’s work, going far beyond those four particular papers that got the ball rolling, and far beyond duplication of materials. For a thorough review, see this recent post by Tim van der Zee, who points to 37 papers by Wansink, many of which have serious data problems: http://www.timvanderzee.com/the-wansink-dossier-an-overview/

And all this doesn’t even get to criticism of Wansink having openly employed a hypotheses-after-results-are-known methodology which leaves his statistics meaningless, even setting aside data errors.

There’s also Wansink’s statement which refers to “the great work of the Food and Brand Lab,” which is an odd phrase to use to describe a group that has published papers with hundreds of errors and major massive data inconsistencies that represent, at worst, fraud, and, at best, some of the sloppiest empirical work—published or unpublished—that I have ever seen. In either case, I consider this pattern of errors to represent research misconduct.

I understand that it’s natural to think that nothing can ever be proven, Rashomon and all that. But in this case the evidence for research misconduct is all out in the open, in dozens of published papers.

I have no personal stake in this matter and I have no plans to file any sort of formal complaint. But as a scientist, this bothers me: Wansink’s misconduct, his continuing attempt to minimize it, and this occurring at a major university.

Yours,

Andrew Gelman

Let me emphasize at this point that the Cornell University Media Relations Office has no obligation to respond to me. They’re already pretty busy, what with all the Fox News crews coming on campus, not to mention the various career-capping studies that happen to come through. Just cos the Cornell University Media Relations Office sent me an email, this implies no obligation on their part to reply to my response.

Anyway, that all said, I thought you might be interested in what the Cornell University Media Relations Office had to say.

So, below, here is their response, in its entirety:


The post Stacking, pseudo-BMA, and AIC type weights for combining Bayesian predictive distributions appeared first on Statistical Modeling, Causal Inference, and Social Science.

*This post is by Aki.*

We have often been asked in the Stan user forum how to do model combination for Stan models. Bayesian model averaging (BMA) by computing marginal likelihoods is challenging in theory and even more challenging in practice using only the MCMC samples obtained from the full model posteriors.

Some users have suggested using Akaike type weighting by exponentiating WAIC or LOO to compute weights to combine models.

We had doubts on this approach and started investigating this more so that we could give some recommendations.

Investigation led to the paper we (Yuling Yao, Aki Vehtari, Daniel Simpson and Andrew Gelman) have finally finished and which contains our recommendations:

- Ideally, we prefer to attack the Bayesian model combination problem via continuous model expansion, forming a larger bridging model that includes the separate models as special cases.
- In practice constructing such an expansion can require conceptual and computational effort, and so it sometimes makes sense to consider simpler tools that work with existing inferences from separately-fit models.
- Bayesian model averaging based on marginal likelihoods can fail badly in the M-open setting in which the true data-generating process is not one of the candidate models being fit.
- We propose and recommend a new log-score stacking for combining predictive distributions.
- Akaike-type weights computed using Bayesian cross-validation are closely related to the pseudo Bayes factor of Geisser and Eddy (1979), and thus we label model combination using such weights as pseudo-BMA.
- We propose an improved variant, which we call pseudo-BMA+, that is stabilized using the Bayesian bootstrap, to properly take into account the uncertainty of the future data distribution.
- Based on our theory, simulations, and examples, we recommend stacking (of predictive distributions) for the task of combining separately-fit Bayesian posterior predictive distributions. As an alternative, Pseudo-BMA+ is computationally cheaper and can serve as an initial guess for stacking.
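To make the Akaike-type weighting discussed above concrete, here is a hedged sketch (the elpd values are made up for illustration; in practice they come from cross-validation, e.g., via the loo package): plain pseudo-BMA weights are obtained by exponentiating each model’s estimated expected log predictive density and normalizing. This is the unstabilized version; pseudo-BMA+ additionally accounts for uncertainty in the elpd estimates via the Bayesian bootstrap.

```python
import numpy as np

# Hypothetical elpd_loo estimates for three candidate models (made-up numbers).
elpd = np.array([-132.4, -130.1, -135.8])

# Pseudo-BMA ("Akaike-type") weights: w_k proportional to exp(elpd_k).
# Subtract the max before exponentiating to avoid numerical underflow.
w = np.exp(elpd - elpd.max())
w /= w.sum()
print(w.round(3))  # the model with the highest elpd receives most of the weight
```

Stacking instead chooses the weights to maximize the log score of the pooled predictive distribution itself, which is why it behaves better in the M-open setting.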

The paper is also available in arXiv:1704.02030 (with minor typo correction appearing there tonight in the regular arXiv update), and the code is part of the loo package in R (currently in github https://github.com/stan-dev/loo/ and later in CRAN).

**P.S. from Andrew:** I really like this idea. It resolves a problem that’s been bugging me for many years.


The post Beyond subjective and objective in statistics: my talk with Christian Hennig tomorrow (Wed) 5pm in London appeared first on Statistical Modeling, Causal Inference, and Social Science.

Decisions in statistical data analysis are often justified, criticized, or avoided using concepts of objectivity and subjectivity. We argue that the words “objective” and “subjective” in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality, and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. Together with stability, these make up a collection of virtues that we think is helpful in discussions of statistical foundations and practice. The advantage of these reformulations is that the replacement terms do not oppose each other and that they give more specific guidance about what statistical science strives to achieve. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgment of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling, and socioeconomic stratification. The aim of this paper is to push users and developers of statistical methods toward more effective use of diverse sources of information and more open acknowledgement of assumptions and goals.

The paper will be discussed tomorrow (Wed 12 Apr) 5pm at the Royal Statistical Society in London. Christian and I will speak for 20 minutes each, then various people will present their discussions. Kinda like a blog comment thread but with nothing on the hot hand, pizzagate, or power pose.

You can also see here for a link to a YouTube video where Christian and I discuss the paper with each other.


The post Mmore from Ppnas appeared first on Statistical Modeling, Causal Inference, and Social Science.

Study 1:

Honesty plays a key role in social and economic interactions and is crucial for societal functioning. However, breaches of honesty are pervasive and cause significant societal and economic problems that can affect entire nations. Despite its importance, remarkably little is known about the neurobiological mechanisms supporting honest behavior. We demonstrate that honesty can be increased in humans with transcranial direct current stimulation (tDCS) over the right dorsolateral prefrontal cortex. Participants (n = 145) completed a die-rolling task where they could misreport their outcomes to increase their earnings, thereby pitting honest behavior against personal financial gain. Cheating was substantial in a control condition but decreased dramatically when neural excitability was enhanced with tDCS. This increase in honesty could not be explained by changes in material self-interest or moral beliefs and was dissociated from participants’ impulsivity, willingness to take risks, and mood. A follow-up experiment (n = 156) showed that tDCS only reduced cheating when dishonest behavior benefited the participants themselves rather than another person, suggesting that the stimulated neural process specifically resolves conflicts between honesty and material self-interest. Our results demonstrate that honesty can be strengthened by noninvasive interventions and concur with theories proposing that the human brain has evolved mechanisms dedicated to control complex social behaviors.

Study 2:

Academic credentials open up a wealth of opportunities. However, many people drop out of educational programs, such as community college and online courses. Prior research found that a brief self-regulation strategy can improve self-discipline and academic outcomes. Could this strategy support learners at large scale? Mental contrasting with implementation intentions (MCII) involves writing about positive outcomes associated with a goal, the obstacles to achieving it, and concrete if-then plans to overcome them. The strategy was developed in Western countries (United States, Germany) and appeals to individualist tendencies, which may reduce its efficacy in collectivist cultures such as India or China. We tested this hypothesis in two randomized controlled experiments in online courses (n = 17,963). Learners in individualist cultures were 32% (first experiment) and 15% (second experiment) more likely to complete the course following the MCII intervention than a control activity. In contrast, learners in collectivist cultures were unaffected by MCII. Natural language processing of written responses revealed that MCII was effective when a learner’s primary obstacle was predictable and surmountable, such as everyday work or family obligations but not a practical constraint (e.g., Internet access) or a lack of time. By revealing heterogeneity in MCII’s effectiveness, this research advances theory on self-regulation and illuminates how even highly efficacious interventions may be culturally bounded in their effects.

He only sent me the abstracts, which is kind of a nice thing to do cos then I feel under no obligation to read the papers (which he tells me will appear in PPNAS and are embargoed until this very moment).

Anyway, here was my reply:

#1 looks like a forking-paths disaster but, hey, who knows? I guess it’s a candidate for a preregistered replication study.

#2 looks interesting as a main effect—if a simple trick helps people focus, that’s good—but I’m suspicious of the interaction for the usual reasons of confounders and forking paths.


The post Molyneux expresses skepticism on hot hand appeared first on Statistical Modeling, Causal Inference, and Social Science.

Guy Molyneux writes:

I saw your latest post on the hot hand too late to contribute to the discussion there. While I don’t disagree with your critique of Gilovich and his reluctance to acknowledge past errors, I do think you underestimate the power of the evidence against a meaningful hot hand effect in sports. I believe the balance of evidence should create a strong presumption that the hot hand is at most a small factor in competitive sports, and therefore that people’s belief in the hot hand is reasonably considered a kind of cognitive error. Let me try to explain my thinking in a couple of steps.

I think everyone agrees that evidence of a hot hand (or “momentum”) is extremely hard to find in actual game data. Across a wide range of sports, players’ outcomes are just slightly more “streaky” than we’d expect from random chance, implying that momentum is at most a weak effect (and even some of that streakiness is accounted for by player health, which is not a true “hot hand”). This body of work I think fairly places the burden of proof on the believers in a strong hot hand, or those still open to the idea (like you), to show why this evidence shouldn’t end the debate. Broadly speaking, two serious objections have been raised to accepting the empirical evidence from actual games.

First, you argue that “Whether you made or missed the last couple of shots is itself a very noisy measure of your ‘hotness,’ so estimates of the hot hand based on these correlations are themselves strongly attenuated toward zero.” If most of the allegedly hot players we study were really just lucky, and thus quickly regress to their true mean, the elevated performance by the subset of truly ‘hot’ players will be masked in the data. I take your point. Nonetheless, given the absence of observed momentum in games, one of two things must still be true: A) the hot hand effect is large but rare (your hypothesis), or B) the hot hand effect is small but perhaps frequent. This may be an important distinction for some analytic purpose, but from my perspective the two possibilities are effectively the same thing: the game impact (I won’t say ‘effect’) of the hot hand is quite small. By “small” I mean both that the hot hand likely has a negligible impact on game outcomes, and that teams and athletes should largely ignore the hot hand in making strategic decisions.

And since the actual impact on games is quite small *even if* your hypothesis is correct (because true hotness is rare), it follows that belief in a strong hot hand by players or fans still represents a kind of cognitive failure. The hot hand fallacy held by most fans, at least in my experience, is not that a very few (and unknowable) players sometimes get very hot, but rather that nearly all athletes sometimes get hot, and we can see this from their performance on the field/court.

(An important caveat: IF it proved possible to identify “true” hot hands in real time, or even to identify specific athletes who consistently exhibit true hot hand behavior, then my argument fails and the hot hand might have legitimate strategic implications. But I have not seen evidence that anyone knows how to do this in any sport.)

The second major objection made to the empirical studies is that the hot hand is disguised as a result of players’ very knowledge of it. As Miller and Sanjurjo suggest, “the myriad confounds present in games actually make it impossible to identify or rule out the existence of a hot hand effect with in-game shooting data, despite the rich nature of modern data sets.” Two main confounds are usually cited: hot players will take more difficult shots, and opposing athletes will deploy additional resources to combat the hot player. Some have argued (including Miller and Sanjurjo) that these factors are so strong that we must ignore real game data in favor of experimental data. But I think it is a mistake to dismiss the game data, for three reasons:

- The theoretical possibility that players’ shot selections and defensive responses could perfectly – and with astonishing consistency – mask the true hot hand effect is only a possibility. Before we dismiss a large body of inconvenient studies, I’d argue that hot hand believers need to demonstrate that these confounds regularly operate at the necessary scale, not just assume it.
- A sophisticated effort to control for shot selection and defensive attention to hot basketball shooters concludes that the remaining hot hand effect is quite modest. Conversely, as far as I know no one has shown empirically that the enhanced confidence of hot players and/or opponents’ defensive responses can account for the lack of observed momentum in a sport.
- Efforts to detect a hot hand effect in baseball have invariably failed. And that’s important, because in baseball the players cannot choose to take on more arduous tasks when they feel “hot,” and opposing players have virtually no ability to redistribute defensive resources in a way that disadvantages players perceived to be hot. So even if you reject the Sloan study and think confounds explain the lack of momentum in basketball, they cannot explain what we observe in baseball.
I would also note that this “confounds” objection is in fact a strong argument *in favor* of the notion that the hot hand is a cognitive failure, given your argument that in-game streaks are a very poor marker of true hotness. If the latter is true, then it would still be a cognitive error for a player or his opponents to act on this usually-false indicator of enhanced short-term talent. If players on a streak take more difficult shots, they are wrong to do so, and teams that change defensive alignments in response are also making a mistake.

So, these are the reasons I remain unpersuaded that I should believe in a hot hand in the wild, or even consider it an open question. That leaves us, finally, with the experimental data that some feel should be privileged as evidence. I haven’t read enough of the experimental research to form any view on its quality or validity. But for answering the question of whether belief in the hot hand is a fallacy, I don’t see how the results of these experiments much matter. Fans and athletes believe they see the hot hand in real games. If a pitcher has retired the last nine batters he faced, many fans (and managers!) believe he is more likely than usual to get the next batter out. If a batter has 10 hits in his last 20 at bats, fans believe he is “on a tear” and more likely to be successful in his 21st at bat (and his manager is more likely to keep him in the lineup). But we know these beliefs are wrong.

Even if experiments do demonstrate real momentum for some repetitive athletic tasks in controlled settings, this would not challenge either of my contentions: that the hot hand has a negligible impact on competitive sports outcomes, and that fans’ belief in the hot hand (in real games) is a cognitive error. Personally, I find it easy to believe that humans may get into (and out of) a rhythm for some extremely repetitive tasks – like shooting a large number of 3-point baskets. Perhaps this kind of “muscle memory” momentum exists, and is revealed in controlled experiments. But it seems to me that those conducting such studies have ranged far from the original topic of a hot hand in competitive sports — indeed, I’m not sure it is even in sight.
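The attenuation argument that Guy discusses above can be illustrated with a toy simulation (all parameters here are made up for illustration, not taken from any study): give a player a persistent latent “hot” state that raises his shooting percentage, then compare the true hot-vs-cold gap to the naive streak-based estimate that conditions on whether the previous shot went in.

```python
import random

random.seed(1)

# Made-up parameters: the player is hot about 1/6 of the time
# (Markov persistence 0.9, entry rate 0.02); being hot raises
# P(make) from 0.45 to 0.55, a 10-point true effect.
STAY_HOT, ENTER_HOT = 0.90, 0.02
P_COLD, P_HOT = 0.45, 0.55

def simulate(n_shots=300_000):
    makes, states = [], []
    hot = False
    for _ in range(n_shots):
        hot = random.random() < (STAY_HOT if hot else ENTER_HOT)
        makes.append(random.random() < (P_HOT if hot else P_COLD))
        states.append(hot)
    return makes, states

def rate(makes, mask):
    """Hit rate among the shots where mask is True."""
    sel = [m for m, keep in zip(makes, mask) if keep]
    return sum(sel) / len(sel)

makes, states = simulate()

# True hot-hand effect, using the (unobservable) latent state:
true_gap = rate(makes, states) - rate(makes, [not s for s in states])

# Naive streak estimate: P(make | made last shot) - P(make | missed it).
prev = makes[:-1]
observed_gap = (rate(makes[1:], prev)
                - rate(makes[1:], [not p for p in prev]))
```

With these invented numbers the latent gap is about ten percentage points, while the streak-conditioned estimate is a small fraction of that: the previous shot is a very noisy signal of the hot state, so the correlation-based estimate is strongly attenuated toward zero, which is exactly the point at issue.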

I don’t know that I have anything new to say after a few zillion exchanges in blog comments, but I wanted to put Guy’s reasoning out there, because (a) he expresses it well, and (b) he’s arguing that I’m a bit off in my interpretation of the data, and that’s something I should share with you.

The only thing I will comment on in Guy’s above post is that I do think baseball is different, because a hitter can face different pitches every time he comes to the plate. So it’s not quite like basketball where the task is the same every time.

**P.S.** Yeah, yeah, I know, it seems at times that this blog is on an endless loop of power pose, pizzagate, and the hot hand. Really, though, we do talk about other things! See here, for example. Or here. Or here, here, here.

**P.P.S.** Josh (coauthor with Sanjurjo of those hot hand papers) responds in the comments. Lots of good discussion here.

The post Molyneux expresses skepticism on hot hand appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Combining independent evidence using a Bayesian approach but without standard Bayesian updating? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I have made some progress with my work on combining independent evidence using a Bayesian approach but eschewing standard Bayesian updating. I found a neat analytical way of doing this, to a very good approximation, in cases where each estimate of a parameter corresponds to the ratio of two variables each determined with normal error, the fractional uncertainty in the numerator and denominator variables differing between the types of evidence. This seems a not uncommon situation in science, and it is a good approximation to that which exists when estimating climate sensitivity. I have had a manuscript in which I develop and test this method accepted by the Journal of Statistical Planning and Inference (for a special issue on Confidence Distributions edited by Tore Schweder and Nils Hjort). Frequentist coverage is almost exact using my analytical solution, based on combining Jeffreys’ priors in quadrature, whereas Bayesian updating produces far poorer probability matching. I also show that a simple likelihood ratio method gives almost identical inference to my Bayesian combination method. A copy of the manuscript is available here: https://niclewis.files.wordpress.com/2016/12/lewis_combining-independent-bayesian-posteriors-for-climate-sensitivity_jspiaccepted2016_cclicense.pdf .
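For readers unfamiliar with the “probability matching” that Lewis invokes: a prior is probability matching if posterior credible intervals attain their nominal frequentist coverage. A minimal simulation sketch in the textbook case (normal mean with known sigma under a flat/Jeffreys prior, which is far simpler than Lewis’s ratio-of-normals setting, and where matching is exact):

```python
import random
from statistics import NormalDist

random.seed(0)

z90 = NormalDist().inv_cdf(0.95)  # critical value for a two-sided 90% interval

def coverage(n_sims=20_000, theta=3.0, sigma=1.0):
    # Under a flat (Jeffreys) prior for a normal mean with known sigma,
    # the posterior given a single observation y is N(y, sigma^2), so the
    # 90% credible interval y +/- z90*sigma coincides with the classical
    # confidence interval; hence its frequentist coverage is exactly 90%.
    hits = 0
    for _ in range(n_sims):
        y = random.gauss(theta, sigma)
        if y - z90 * sigma <= theta <= y + z90 * sigma:
            hits += 1
    return hits / n_sims

cov = coverage()
```

The simulated coverage comes out very close to the nominal 90%. Lewis’s claim is that in his harder two-source problem, his combined-Jeffreys construction keeps this property while sequential Bayesian updating does not.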

I’ve since teamed up with Peter Grunwald, a statistics professor in Amsterdam whom you may know – you cite two of his works in your 2013 paper ‘Philosophy and the practice of Bayesian statistics’. It turns out that my proposed method for combining evidence agrees with that implied by the Minimum Description Length principle, which he has been closely involved in developing. We have a joint paper under review by a leading climate science journal, which applies the method developed in my JSPI paper.

I think the reason Bayesian updating can give poor probability matching, even when the original posterior used as the prior gives exact probability matching in relation to the original data, is that conditionality is not always applicable to posterior distributions. Conditional probability was developed rigorously by Kolmogorov in the context of random variables. Don Fraser has stated that the conditional probability lemma requires two probabilistic inputs and is not satisfied where there is no prior knowledge of a parameter’s value. I extend this argument and suggest that, unless the parameter is generated by a random process rather than being fixed, conditional probability does not apply to updating a posterior corresponding to existing knowledge, used as a prior, since such a prior distribution does not provide the required type of probability. As Tore Schweder has written (in his and Nils Hjort’s 2016 book Confidence, Likelihood, Probability) it is necessary to keep epistemic and aleatory probability apart. Bayes himself, of course, developed his theory in the context of a random parameter generated with a known probability distribution.

I don’t really have time to look at this but I thought it might interest you, so feel free to share your impressions. I assume Nic Lewis will be reading the comments.

This also seemed worth posting, given that Yuling, Aki, Dan, and I will soon be releasing our own paper on combining posterior inferences from different models fit to a single dataset. Not quite the same problem but it’s in the same general class of questions.


The post Cross Purposes appeared first on Statistical Modeling, Causal Inference, and Social Science.

I thought you might enjoy this…

I’m refereeing a paper which basically looks at whether survey responses on a particular topic vary when the question is asked in two different ways. In the main results table they split the sample along several relevant dimensions (education; marital status; religion; etc). I give them credit for showing all the results, but only one differential is statistically significant at 5%, and of course they focus the interpretation on that one. In my initial report, I asked if they either had tried or would try correcting for multiple hypothesis testing. I just received their response:

“We agree with the referee, but we do not think it is possible given that we really do not have enough power.”

So they left it as is and don’t discuss the issue at all in the revision!

As is often the case, this is an example where I suspect we’d be better off had p-values never been invented.
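To see why the referee asked about a correction in the first place: with several independent subgroup tests and no true effects, the chance of at least one nominally significant result far exceeds 5%. A back-of-the-envelope sketch (the number of subgroup splits here is hypothetical):

```python
# With k independent tests and no true effects anywhere,
# P(at least one p < 0.05) = 1 - 0.95^k.
# A Bonferroni correction instead tests each hypothesis at 0.05/k.
k = 8  # hypothetical number of subgroup splits (education, marital status, ...)

p_any_false_positive = 1 - 0.95 ** k
bonferroni_threshold = 0.05 / k

print(f"P(>=1 significant by chance): {p_any_false_positive:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold:.4f}")
```

With eight splits the chance of a spurious 5% hit is roughly one in three, which is why a single significant differential, presented without any correction, carries so little evidential weight.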


The post PPPPPPPPPPNAS! appeared first on Statistical Modeling, Causal Inference, and Social Science.

As I follow your blog (albeit loosely), I figured I’d point out an “early release” paper from PNAS I consider to be “garbage” (at least by title, and probably by content).

The short version is, the authors claim to have found the neural correlate of a person being “cognizant of” the outcome of an action (either doing something on purpose or by accident), but frame it as a correlate of “knowledge vs. recklessness” in the context of crime and assessing culpability in the legal domain.

Without having read the whole thing yet, it would seem important that once you tell someone that some negative consequence occurred (or could occur), the neural correlate would likely change.

This may not be so much bad based on the methods, but rather by its completely over-hyped framing (and the fact that it spent less than 3 months in review; received 11/23/2016, accepted 2/9/2017, which is about 10 weeks). Presumably the reviewers were so enthralled by the work, they couldn’t think of anything to improve…

And just reading the methods now, I would say they tested something different altogether; the amount of “knowledge” that was manipulated between groups had nothing to do with the extent to which what they were doing was criminal, but rather with how likely they were to be caught… It is maddening to think that popular news media will pick this up as “finally the answer to judging culpability based on fMRI”–what utter nonsense…

Weber continues:

Finally, just a couple issues I would point out…

– showing people a bowl of poo in the scanner and asking them whether they would eat it may elicit a lot of brain activation; to claim that that is the neural basis of how the brain looks like when eating poo is a fallacy

– besides the point that this was a game (and neither risk nor knowledge had any real-world consequences), this doesn’t seem to be about crime so much as about representing fear of being caught—a criminal who is certain they won’t be caught may not show this pattern

– the first author didn’t design the research (see contributions section), and the senior author never first-authored an fMRI paper (so he doesn’t know the modality), a toxic combination it seems…

A bowl of poo, huh? I’m liking this experiment already! Seriously, I do find this paper a bit creepy as well as hyped.


The post Dear Cornell University Public Relations Office appeared first on Statistical Modeling, Causal Inference, and Social Science.

I received the following email, which was not addressed to me personally:

From: ** <**@cornell.edu>

Date: Wednesday, April 5, 2017 at 9:42 AM

To: “gelman@stat.columbia.edu”

Cc: ** <**@cornell.edu>

Subject: Information regarding research by Professor Brian Wansink

I know you have been following this issue, and I thought you might be interested in new information posted today on Cornell University’s Media Relations Office website and the Food and Brand Lab website.

**

**@cornell.edu

@**

office: 607-***-****

mobile: 607-***-****

The message included two links: Media Relations Office website and Food and Brand Lab website.

You can click through. Wansink’s statement is hardly worth reading; it’s the usual mixture of cluelessness, pseudo-contrition, and bluster, referring to “the great work of the Food and Brand Lab” and minimizing the long list of problems with his work. He writes, “all of this data analysis was independently reviewed and verified under contract by the outside firm Mathematica Policy Research, and none of the findings altered the core conclusions of any of the studies in question,” but I clicked through, and the Mathematica Policy Research person specifically writes, “The researcher did not review the quality of the data, hypotheses, research methods, interpretations, or conclusions.” So if it’s really true that “none of the findings altered the core conclusions of any of the studies in question,” then it’s on Wansink to demonstrate this. Which he did not.

And, as someone pointed out to me, in any case this doesn’t address the core criticism of Wansink having openly employed a hypotheses-after-results-are-known methodology which leaves those statistics meaningless, even after correcting all the naive errors.

Wansink also writes:

I, of course, take accuracy and replication of our research results very seriously.

That statement is ludicrous. Wansink has published hundreds of errors in many many papers. Many of his reported numbers make no sense at all. To say that Wansink takes accuracy very seriously is like saying that, umm, Dave Kingman took fielding very seriously. Get real, dude.

And he writes of “*possible* duplicate use of data or republication of portions of text from my earlier works.” It’s not just “possible,” it actually happened. This is like the coach of the 76ers talking about possibly losing a few games during the 2015-2016 season.

OK, now for the official statement from Cornell Media Relations. Their big problem is that they minimize the problem. Here’s what they write:

Shortly after we were made aware of questions being raised about research conducted by Professor Brian Wansink by fellow researchers at other institutions, Cornell conducted an internal review to determine the extent to which a formal investigation of research integrity was appropriate. That review indicated that, while numerous instances of inappropriate data handling and statistical analysis in four published papers were alleged, such errors did not constitute scientific misconduct (https://grants.nih.gov/grants/research_integrity/research_misconduct.htm). However, given the number of errors cited and their repeated nature, we established a process in which Professor Wansink would engage external statistical experts to validate his review and reanalysis of the papers and attendant published errata. . . . Since the original critique of Professor Wansink’s articles, additional instances of self-duplication have come to light. Professor Wansink has acknowledged the repeated use of identical language and in some cases dual publication of materials. Cornell will evaluate these cases to determine whether or not additional actions are warranted.

Here’s the problem. It’s not just those 4 papers, and it’s not just those 4 papers plus the repeated use of identical language and in some cases dual publication of materials.

There’s more. A lot more. And it looks to me like serious research misconduct: either outright fraud by people in the lab, or such monumental sloppiness that data are entirely disconnected from context, with zero attempts to fix things when problems have been pointed out.

If Wansink did all this on his own and never published anything and never got any government grants, I guess I wouldn’t call it research misconduct; I’d just call it a monumental waste of time. But to repeatedly publish papers where the numbers don’t add up, where the data are not as described: sure, that seems to me like research misconduct.

From the NIH link above:

Research misconduct is defined as fabrication, falsification and plagiarism, and does not include honest error or differences of opinion. . .

Fabrication: Making up data or results and recording or reporting them.

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

Plagiarism: The appropriation of another person’s ideas, processes, results, or words without giving appropriate credit.

I have no idea about fabrication or plagiarism, but Wansink’s work definitely seems to have a bit of falsification as defined above.

Let’s talk falsification.

From my very first post on Wansink, from 15 Dec 2016, it seems that he published four papers from the same experiment, not citing each other, with different sample sizes and incoherent data-exclusion rules. This falls into the category, “omitting data or results such that the research is not accurately represented in the research record.”

Next, Tim van der Zee, Jordan Anaya, and Nicholas Brown found over 150 errors in just four papers. If this is not falsification, it’s massive incompetence.

How incompetent do you have to be to make over 150 errors in four published papers?

And, then, remember this quote from Wansink:

Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.

As I wrote at the time, “how do you say ‘we realized we asked . . .’? What’s to realize? If you asked the question that way, wouldn’t you already know this?” To publish results under your name without realizing what’s gone into it . . . that’s research misconduct. OK, if it just happens once or twice, sure, we’ve all been sloppy. But with Wansink it happens over and over again.

And this is not new. Here’s a story from 2011. An outside researcher found problems with a published paper of Wansink. Someone from Wansink’s lab fielded the comments, responded politely . . . and did nothing.

And then there’s the carrots story from Jordan Anaya, who shares this table from a Wansink paper from 2012:

As Anaya points out, these numbers don’t add up. None of them add up! The numbers violate the Law of Conservation of Carrots. I guess they don’t teach physics so well up there at Cornell . . .

Anaya reports that, as part of a ridiculous attempt to defend his substandard research practices, Wansink said the values were, “based on the well-cited quarter plate data collection method referenced in that paper.” However, Anaya continues, “The quarter plate method Wansink refers to was published by his group in 2013. Although half of the references in the 2012 paper were self-citations, the quarter plate method was not referenced as he claims.” In addition, according to that 2012 paper from Wansink, “the weight of any remaining carrots was subtracted from their starting weight to determine the actual amount eaten.”

What’s going on? Assuming it’s not out-and-out fabrication, it’s “Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.” In short: what it says in the published paper is not what happened, nor do Wansink’s later replies add up.

But wait, there’s more. Here’s a post from 21 Mar 2017 from Tim van der Zee, who summarizes:

Starting with our pre-print describing over 150 errors in 4 papers, there is a now ever-increasing list of research articles (co-)authored by BW which have been criticized for containing serious errors, reporting inconsistencies, impossibilities, plagiarism, and data duplications.

To the best of my [van der Zee’s] knowledge, there are currently:

37 publications from Wansink which are alleged to contain minor to very serious issues,

which have been cited over 3300 times,

are published in over 20 different journals, and in 8 books,

spanning over 19 years of research.

Here’s just one example:

Wansink, B., & Seed, S. (2001). Making brand loyalty programs succeed. Brand Management, 8, 211–222.

Citations: 34

Serious issue of self-plagiarism which goes beyond the Method section . . . Furthermore, both papers mention substantially different sample sizes (153 vs. 643) but both have a table with results which are basically entirely identical.

That’s a big deal. OK, sure, self-plagiarism, no need for us to be Freybabies about that. But, what about that other thing?

Furthermore, both papers mention substantially different sample sizes (153 vs. 643) but both have a table with results which are basically entirely identical.

Whoa! That sounds to me like . . . “changing or omitting data or results such that the research is not accurately represented in the research record.”

And here’s another, “Relation of soy consumption to nutritional knowledge,” which Nick Brown discusses in detail. This is one of three papers that reports three different surveys, each with exactly 770 respondents, but one based on surveys sent to 1002 people, one sent to 1600 people, and another sent to 2000 people. All things are possible but it’s hard for me to believe there are three different surveys. It seems more like “changing or omitting data or results such that the research is not accurately represented in the research record.” As van der Zee puts it, “Both studies use (mostly) the same questionnaire, allowing the results to be compared. Surprisingly, while the two papers supposedly reflect two entirely different studies in two different samples, there is a near perfect correlation of 0.97 between the mean values in the two studies.”

Brown then gets granular:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data, but they *do* seem consistent with someone making up numbers and not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.

Again, all things are possible. So let me just say, that given all the other misrepresentations of data and data collection in Wansink’s papers, I have no good reason to think the numbers from those tables are real.
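The kind of last-digit check Brown describes can be sketched as follows (the numbers below are made up for illustration, not taken from the actual tables): tally terminal digits of the reported statistics and compare them to the roughly uniform distribution expected of genuinely measured data.

```python
from collections import Counter

def last_digit_counts(values):
    """Count the final decimal digit of each reported statistic.
    For genuine measured data, terminal digits should be roughly
    uniform over 0-9; fabricated numbers often underuse 0 and 5."""
    return Counter(str(v).replace(".", "")[-1] for v in values)

def chi_square_uniform(counts, n_categories=10):
    """Pearson chi-square statistic against a uniform digit distribution."""
    total = sum(counts.values())
    expected = total / n_categories
    return sum((counts.get(str(d), 0) - expected) ** 2 / expected
               for d in range(10))

# Hypothetical reported means (not from the papers in question):
reported = [3.41, 2.87, 4.13, 3.92, 2.76, 3.28, 4.61, 3.17, 2.94, 3.83]
counts = last_digit_counts(reported)
stat = chi_square_uniform(counts)
# Compare stat to a chi-square(9) reference distribution. With only a
# handful of values the test has little power, which is why this kind
# of forensic check pools digits across all of a paper's tables.
```

A significantly non-uniform digit count is not proof of fabrication on its own, but, combined with the impossible means and mismatched sample sizes described above, it is one more way the published numbers fail to behave like real data.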

Oh, wait, here’s another from van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.

Looks to me like “changing or omitting data or results such that the research is not accurately represented in the research record.”

Or this:

Wansink, B., Van Ittersum, K., & Painter, J. E. (2006). Ice cream illusions: bowls, spoons, and self-served portion sizes. American Journal of Preventive Medicine, 31(3), 240-243.

Citations: 262

A large amount of inconsistent/impossible means and standard deviations, as well as inconsistent ANOVA results, as can be seen in the picture.

If this is not “changing or omitting data or results such that the research is not accurately represented in the research record,” then what is it? Perhaps “telling your subordinates that they are expected to get publishable results, then never looking at the data or the analysis?” Or maybe something else? If the numbers don’t add up, this means that some data or results have been changed so that the research is not accurately represented in the research record, no?

What is this plausible deniability crap? Do we now need a RICO for research labs??

OK, just one more from van der Zee’s list and then I’ll quit:

Wansink, B., Just, D. R., & Payne, C. R. (2012). Can branding improve school lunches?. Archives of pediatrics & adolescent medicine, 166(10), 967-968.

Citations: 38

There are various things wrong with this article. The authors repeatedly interpret non significant p values as being significant, as well as miscalculating p values. Their choice of statistical analysis is very questionable. The data are visualized in a very questionable manner which is easily misinterpreted (such that the effects are overestimated). In addition, the visualization is radically different from an earlier version of the same paper, which gave a much more modest impression of the effects. Furthermore, the authors seem to be confused about the participants, as they are school students aged 8-11 but are also called “preliterate children”; in a later publication Wansink mentions these are “daycare kids”, and further exaggerates and misreports the size of the effects.

This does not sound to me like mere “honest error or differences of opinion.” From my perspective it sounds more like “changing or omitting data or results such that the research is not accurately represented in the research record.”

The only *possible* defense I can see here is that Wansink didn’t actually collect the data, analyze the data, or write the report. He’s just a figurehead. But, again, RICO. If you’re in charge of a lab which repeatedly, over and over and over and over again, changes or omits data or results such that the research is not accurately represented in the research record, then, yes, from my perspective I think it’s fair to say that *you* have been changing or omitting data or results such that the research is not accurately represented in the research record, and *you* have been doing scientific misconduct.

I said I’d stop but I have to share this additional example. Again, here’s van der Zee’s summary:

Sığırcı, Ö, Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7,

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 whilst also having experienced repeated heavy combat. Almost no soldiers could have been any age other than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.

What does that sound like? Oh yeah, “changing or omitting data or results such that the research is not accurately represented in the research record.”

So . . . I decided to reply.

Dear Cornell University Media Relations Office:

Thank you for pointing me to these two statements. Unfortunately I fear that you are minimizing the problem.

You write, “while numerous instances of inappropriate data handling and statistical analysis in four published papers were alleged, such errors did not constitute scientific misconduct (https://grants.nih.gov/grants/research_integrity/research_misconduct.htm). However, given the number of errors cited and their repeated nature, we established a process in which Professor Wansink would engage external statistical experts to validate his review and reanalysis of the papers and attendant published errata. . . . Since the original critique of Professor Wansink’s articles, additional instances of self-duplication have come to light. Professor Wansink has acknowledged the repeated use of identical language and in some cases dual publication of materials.”

But there are many, many more problems in Wansink’s published work, beyond those 4 initially-noticed papers and beyond self-duplication.

Your NIH link above defines research misconduct as “fabrication, falsification and plagiarism, and does not include honest error or differences of opinion. . .” and defines falsification as “Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.”

This phrase, “changing or omitting data or results such that the research is not accurately represented in the research record,” is an apt description of much of Wansink’s work, going far beyond those four particular papers that got the ball rolling, and far beyond duplication of materials. For a thorough review, see this recent post by Tim van der Zee, who points to 37 papers by Wansink, many of which have serious data problems: http://www.timvanderzee.com/the-wansink-dossier-an-overview/

And all this doesn’t even get to criticism of Wansink having openly employed a hypotheses-after-results-are-known methodology which leaves his statistics meaningless, even setting aside data errors.

There’s also Wansink’s statement which refers to “the great work of the Food and Brand Lab,” which is an odd phrase to use to describe a group that has published papers with hundreds of errors and major massive data inconsistencies that represent, at worst, fraud, and, at best, some of the sloppiest empirical work—published or unpublished—that I have ever seen. In either case, I consider this pattern of errors to represent research misconduct.

I understand that it’s natural to think that nothing can ever be proven, Rashomon and all that. But in this case the evidence for research misconduct is all out in the open, in dozens of published papers.

I have no personal stake in this matter and I have no plans to file any sort of formal complaint. But as a scientist, this bothers me: Wansink’s misconduct, his continuing attempt to minimize it, and this occurring at a major university.

Yours,

Andrew Gelman

**P.S.** I have no interest in Wansink being prosecuted for any misdeeds, I don’t think there’s any realistic chance he’ll be asked to return his government grants, nor am I trying to get Cornell to fire him or sanction him or whatever. Any such efforts would take huge amounts of effort which I’m sure could be better spent elsewhere. And who am I to throw the first stone? I’ve messed up in data analysis myself.

But it irritates me to see Wansink continue to misrepresent what is happening, and it irritates me to see Cornell University minimize the story. If they don’t want to do anything about Wansink, fine; I completely understand. But the evidence is what it is; don’t understate it.

Remember the Ed Wegman story? Weggy repeatedly published articles with his name on it that ~~plagiarized~~ copied material written by others without attribution; people notified his employer, who buried the story. George Mason University didn’t seem to want to know about it.

The Wansink case is a bit more complicated, in that there do seem to be people at Cornell who care, but there also seems to be a desire to minimize the problem and make it go away. Don’t do that. To minimize scientific misconduct is an insult to all of us who work so hard to present our data accurately.


The post Tech company wants to hire Stan programmers! appeared first on Statistical Modeling, Causal Inference, and Social Science.

I started life as an academic mathematician (chaos theory) but have long since moved into industry. I am currently Chief Scientist at Afiniti, a contact center routing technology company that connects agents and callers on the basis of various factors in order to globally optimize the contact center performance. We have 17 data scientists/algorithm engineers in Washington, DC and another 40 (not as well trained) in Pakistan.

The problem mathematically divides into two components: A data science / statistics component where we estimate the likelihood of success for any agent/caller pairing (this is difficult because we have to estimate comparative advantage which involves differences of differences of estimates); and an operations research component where we perform an optimal allocation in a stochastic setting with a variety of real-world constraints.

We are finding Stan to be very useful (and something of a miracle) and are interested in hiring students who are experts with Stan and can help us improve Stan as we try to make it easier to use and efficiently investigate multiple models. We expect to significantly contribute to the development of Stan in the next two years.

The post Tech company wants to hire Stan programmers! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post It’s not so hard to move away from hypothesis testing and toward a Bayesian approach of “embracing variation and accepting uncertainty.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>But we recently had a blog discussion that made me realize there was some confusion on this point.

Emmanuel Charpentier wrote:

It seems that, if we want, following the conclusion of Andrew’s paper, to abandon binary conclusions, we are bound to give :

* a discussion of possible models of the data at hand (including prior probabilities and priors for their parameters),

* a posterior distribution of parameters of the relevant model(s), and

* a discussion of the posterior probabilities of these models

as the sole logically defensible result of a statistical analysis.

It seems also that there is no way to take a decision (pursue or not a given line of research, embark or not in a given planned action, etc…) short of a real decision analysis.

We have hard times before us selling *that* to our “clients” : after > 60 years hard-selling them the NHST theory, we have to tell them that this particular theory was (more or less) snake-oil aimed at *avoiding* decision analysis…

We also have hard work to do in order to learn how to build the necessary discussions, that can hardly avoid involving specialists of the subject matter : I can easily imagine myself discussing a clinical subject ; possibly a biological one ; I won’t touch an economic or political problem with a ten-foot pole…

Wow—that sounds like a lot of work! It might seem that a Bayesian approach is fine in theory but is too impractical for real work.

But I don’t think so. Here’s my reply to Charpentier:

I think you’re making things sound a bit too hard: I’ve worked on dozens of problems in social science and public health, and the statistical analysis that I’ve done doesn’t look so different from classical analyses. The main difference is that I don’t set up the problem in terms of discrete “hypotheses”; instead, I just model things directly.

And Stephen Martin followed up with more thoughts.

In my experience, a Bayesian approach is typically *less* effort and *easier* to explain, compared to a classical approach which involves all these weird hypotheses.

It’s harder to do the wrong thing right than to do the right thing right.

The post It’s not so hard to move away from hypothesis testing and toward a Bayesian approach of “embracing variation and accepting uncertainty.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Annals of Spam appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’ve recently been getting a ton of spam—something like 200 messages a day in my inbox! The Columbia people tell me that soon we’ll be switching to a new mail server that will catch most of these, but for now I have to spend a couple minutes every day just going thru and flagging them as spam. Which does basically nothing.

Anyway, most of these are just boring: home improvement ads, quack medicine, search engine optimization, sub-NKVD political fake news, invitations to fake conferences around the world, Wolfram Research employees who claim to have read my papers, etc. But today I got this which had an amusing statistical twist:

Understand how to do a coverage analysis at the Clinical Trial Billing Compliance Boot Camp

Become Clinical Trial Billing Proficient at the Only Hands-On Workshop to Guide You through All of Your Billing Compliance Challenges

Do you want to know the ins and outs of performing a coverage analysis? This year’s Clinical Trial Billing Compliance Boot Camp series will walk you through the process from “qualifying” a trial — to putting actual codes on the billing grid, so translation to the coders can occur. . . . Register today to learn how to do a coverage analysis from soup to nuts! We will help you start a coverage analysis grid that you can take home with you that will help you with your process improvement. . . .

Something about the relentless positivity of their message reminded me of Brian Wansink. Or Amy Cuddy.

I mean, really, why does everyone have to be so negative all the time? So critical? I say, let’s stop trying to check whether the numbers on published papers add up. Let’s just agree that any paper that’s published is always true. Let’s believe everything they tell us on NPR and Ted talks. Let’s just say that everything published in PPNAS is actually science. Let’s accept every submitted paper (as long as it has “p less than .05” somewhere). Let’s tenure everybody! No more party poopers, that’s what I say!

The post Annals of Spam appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Scalable Bayesian Inference with Hamiltonian Monte Carlo” (Michael Betancourt’s talk this Thurs at Columbia) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Despite the promise of big data, inferences are often limited not by sample size but rather by systematic effects. Only by carefully modeling these effects can we take full advantage of the data—big data must be complemented with big models and the algorithms that can fit them. One such algorithm is Hamiltonian Monte Carlo, which exploits the inherent geometry of the posterior distribution to admit full Bayesian inference that scales to the complex models of practical interest. In this talk I will discuss the theoretical foundations of Hamiltonian Monte Carlo, elucidating the geometric nature of its scalable performance and stressing the properties critical to a robust implementation.

The talk is this Thurs, 6 Apr, 1:10-2:20pm in 303 Mudd Building at Columbia.

You shouldn’t miss this one. These ideas are fundamental to Stan present and future.

The post “Scalable Bayesian Inference with Hamiltonian Monte Carlo” (Michael Betancourt’s talk this Thurs at Columbia) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Study showing that humans have some psychic powers caps Daryl Bem’s career” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>On the webpage of Russ Roberts’s interview with me, I happened to come across this article from the Cornell Chronicle, Dec. 6, 2010:

Study showing that humans have some psychic powers caps Daryl Bem’s career

By George Lowery

It took eight years and nine experiments with more than 1,000 participants, but the results offer evidence that humans have some ability to anticipate the future.

“Of the various forms of ESP or psi, as we call it, precognition has always most intrigued me because it’s the most magical,” said Daryl Bem, professor of psychology emeritus, whose study will be published in the American Psychological Association’s Journal of Personality and Social Psychology sometime next year.

“It most violates our notion of how the physical world works. The phenomena of modern quantum physics are just as mind-boggling, but they are so technical that most non-physicists don’t know about them,” said Bem, who studied physics before becoming a psychologist.

Really?

Lowery continues:

Publishing on this topic has gladdened the hearts of psi researchers but stumped doubting social psychologists, who cannot fault Bem’s mainstream and widely accepted methodology.

Whoops! That hasn’t aged so well.

**P.S.** I was curious so I searched Cornell Chronicle for articles about Brian “Pizzagate” Wansink. They had lots of articles on this guy, and many of them were written by Katie Baildon, “a communications specialist for the Cornell Food and Brand Lab.”

There’s nothing wrong with hiring a public relations writer. But it’s interesting to see the diffusion of responsibility:

– Cornell is a well-respected university.

– Cornell, like other such organizations, hires people whose sole job is to write positive things about the institution.

– But sometimes the Cornell public relations department delegates its job and runs articles written by public relations writers hired by individual Cornell professors.

If I was the president of Cornell, I’d be kind of annoyed at all the stupid things being written by people working at my institution.

But I guess none of these items are quite as stupid as the claim that “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” That one came from two Harvard professors who might have benefited by checking with the statistics department before mouthing off like they did.

Also from the Cornell Chronicle, this beauty:

Are your attitudes toward certain foods shaped by peer pressure rather than science? Recent research conducted by Cornell suggests that’s the case.

While some ingredient food fears are justified by objective evidence, others have demonized ingredients and damaged industries. . . .

“High fructose corn syrup avoiders expressed a stronger belief that the ingredient gives you headaches, is dangerous for children, cannot be digested, is bad for skin, makes one sluggish and changes one’s palate,” the researchers reported. . . .

Educating consumers might also do the trick. The researchers noted that participants’ views toward ingredients “became more positive when they were either informed about the history and functions of the ingredient, or informed of the wide range of familiar products that currently contain the ingredient – all factors that contribute to familiarity with the product.” . . .

“To overcome food ingredient fears, learn the science, history and the process of how the ingredient is made, and you’ll be a smarter, savvier consumer,” said Food and Brand Lab Director Brian Wansink, lead author on the report.

Oh, and one other thing:

The study was funded in part by the Corn Refiners Association and the Dyson School of Applied Economics and Management.

All right, then.

The post “Study showing that humans have some psychic powers caps Daryl Bem’s career” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Life imitates art appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I hate to interrupt our discussion of traffic deaths, but this is important. . . .

Someone pointed me to this news article, “A Retiree Discovers an Elusive Math Proof,” and I noticed this sentence:

Not knowing LaTeX, the word processor of choice in mathematics, he typed up his calculations in Microsoft Word . . . Richards notified a few colleagues and even helped Royen retype his paper in LaTeX to make it appear more professional.

Which reminded me of this article, in particular the Technical Note on page 4:

We originally wrote this article in Word, but then we converted it to Latex to make it look more like science.

The post Life imitates art appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post No evidence that providing driver’s licenses to unauthorized immigrants in California decreases traffic safety appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>So. A reporter asked me what I thought of this article, “Providing driver’s licenses to unauthorized immigrants in California improves traffic safety,” by Hans Lueders, Jens Hainmueller, and Duncan Lawrence. It’s embargoed! so I’m not supposed to post anything on it until now.

From the abstract:

We examine the short-term effects of . . . California’s Assembly Bill 60 (AB60), under which more than 600,000 [driver’s] licenses were issued in the first year of implementation in 2015 . . . We find that, contrary to concerns voiced by opponents of the law, AB60 has had no discernible short-term effect on the number of accidents. The law primarily allowed existing unlicensed drivers to legalize their driving. We also find that, although AB60 had no effect on the rate of fatal accidents, it did decrease the rate of hit and run accidents, suggesting that the policy reduced fears of deportation and vehicle impoundment. . . .

The paper seems reasonable to me.

But I’d like to see what would happen if they performed the same analysis but using 2014 as their break point, or 2013, or 2011, or 2010, etc. They’re looking at changes in 2015 that are correlated with %AB60 licenses, but (a) other things can be happening in 2015, and (b) there are lots of things correlated with this county predictor. Also, the %AB60 licenses is low (always less than 6%, so it seems from the graphs). It would seem to me that lots of things can be driving the rate of collisions and the rate of hit and runs. So the causal story does not seem so clear to me. There are too many alternative explanations.

So, I’m with them when they say that the data show no evidence of negative consequence from the law allowing driver’s licenses for unauthorized immigrants. But I’d need more evidence to be convinced of their causal claims, in particular on the hit and runs. That seems like more of a reach.

To put it another way, the paper is called, “Providing driver’s licenses to unauthorized immigrants in California improves traffic safety,” but I’d be happier with a title such as “No evidence that providing driver’s licenses to unauthorized immigrants in California decreases traffic safety.”

**P.S.** The paper is in PPNAS but the editor is not Susan Fiske so I guess I should really just call it PNAS this time.

The post No evidence that providing driver’s licenses to unauthorized immigrants in California decreases traffic safety appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post My interview on EconTalk, and some other podcasts and videos appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Russ Roberts recently interviewed me for his EconTalk podcast. We talked about social science and the garden of forking paths. Roberts was also going to talk with me about Case and Deaton, but we ran out of time.

Whenever I announce a talk, people ask in comments if it will be streamed or recorded. Most of my talks are not streamed or recorded, but some are; also, sometimes I get interviewed on podcasts. I can’t vouch for the podcasts because I hate the sound of my own voice, but here are some things you can check out:

Podcast with Chauncey DeVega on the 2016 Election.

Podcast with Julia Galef on Why do Americans vote the way they do.

My talk at New York R conference 2016 on the political impact of social penumbras. Also starts off with some discussion of Bayes, church, the folk theorem, and some Stan.

My talk at New York R conference 2015, But When You Call Me Bayesian, I Know I’m Not the Only One.

My talk in 2016 at a Harvard conference on big data.

My talk at Bath on crimes against data.

My talk at Oxford on teaching quantitative methods to social science students.

Bloggingheads with Eliezer Yudkowsky on probability and statistics.

Bloggingheads with Will Wilkinson on Red State, Blue State, Rich State, Poor State.

You can find more on Google. The talks come out ok on video but you don’t get a sense of the audience participation. In real life people are actually laughing at the jokes.

The post My interview on EconTalk, and some other podcasts and videos appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Gilovich doubles down on hot hand denial appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A correspondent pointed me to this Freakonomics radio interview with Thomas Gilovich, one of the authors of that famous “hot hand” paper from 1985, “Misperception of Chance Processes in Basketball.” Here’s the key bit from the Freakonomics interview:

DUBNER: Right. The “hot-hand notion” or maybe the “hot-hand fallacy.”

GILOVICH: Well, everyone who’s ever played the game of basketball knows you get this feeling where the game seems to slow down. It becomes easier, or you almost don’t even have to aim that carefully. The ball’s going to go in. It’s one of the most compelling feelings that you can have. And it turns out if you statistically analyze people’s shots — whether it’s professional games, college basketball players shooting in a gym — although the feeling exists when you make several shots in a row, you will feel hot. That feeling very surprisingly doesn’t predict how you’re going to do in the next shot or the next several shots — the distribution of hits and misses in the game of basketball looks just like the distribution of heads and tails when you’re flipping a coin. Although of course, not every player shoots 50%. Very few of them do.

That’s wrong. The distribution of hits and misses in the game of basketball does *not* look just like the distribution of heads and tails when you’re flipping a coin.

In their 1985 paper, Gilovich et al. *thought* they found that the distribution of hits and misses in the game of basketball looked just like the distribution of heads and tails when you’re flipping a coin. But they’d made a couple mistakes. Subtle mistakes, but mistakes nonetheless. Miller and Sanjurjo explained the problems:

1. The simple estimate of correlation, computing for each player the empirical frequency of hits right after a sequence of hits, minus the frequency of hits right after a sequence of misses, is biased. And the bias is large enough to make a difference in data sequences of realistic length. (Hence Gilovich was making a mathematical error when he said, “Because our samples were fairly large, I don’t believe this changes the original conclusions about the hot hand.”)

2. Whether you made or missed the last couple of shots is itself a very noisy measure of your “hotness,” so estimates of the hot hand based on these correlations are themselves strongly attenuated toward zero.

Combining 1 and 2, even large hot-hand effects will show up as zero in Gilovich-like studies.

None of this is new, and Gilovich is aware of these criticisms. Yet in this Freakonomics interview, he chose to do a full Cuddy, as it were, and not even acknowledge the demonstrated problems with his work.

That’s too bad, especially given that his 1985 paper has an important place in the history of our understanding of cognitive illusions, and its errors are subtle. There’s no shame in making a mistake.

What’s with the refusal to acknowledge error? Is it something in the water at Cornell University? Is there some bar in Ithaca where Daryl Bem, Brian Wansink, and Thomas Gilovich go to complain about the fickle nature of the science press?

This might be unfair to Gilovich, though. All I have to work with is the transcript of the interview, and for all I know he went on like this:

GILOVICH: Actually, though, we were wrong! Miller and Sanjurjo showed that our data were consistent with a hot hand all along. The problem was that our correlation-based estimate, which seemed so reasonable, was actually a biased and noisy estimate of the hot hand. My bad for not noticing this back in 1985. But, hey, both problems—the bias and the variance—were subtle. In my defense, the brilliant Amos Tversky didn’t catch these problems either.

DUBNER: Yup, almost everyone missed it. As late as 2014, Gelman was pooh-poohing the idea that the hot hand could amount to much. Anyway, it’s great to have the opportunity to correct the record now, here on Freakonomics radio!

And then maybe that last exchange got cut, for lack of space.

**What’s going on here?**

Seriously, though, I see two problems here.

First, Gilovich has seen that serious scholars have criticized his hot-hand claims. The criticisms are real, and they’re spectacular, and it’s not like Gilovich has any refutation, he’s just bobbing and weaving. Then a reporter comes to him for a feature story and he presents it completely straight, as if it’s 1985 again, Run DMC on the beatbox, the Cosby Show on TV, Oliver North sending weapons to the Ayatollah, and New Coke in every 7-11 in the country. What kind of attitude is that?

Second, the Freakonomics team didn’t think to even check.

OK, this thing jumped out at me because I’m a statistician and I’ve spent a lot of time thinking about the hot hand. But you don’t need to be an expert to know about this.

Suppose you’re running a radio show and you’re going to interview someone about a particular topic. (And this one didn’t come as a surprise: if you check the transcript, you’ll see that it’s Dubner, not Gilovich, who first brings up the “hot hand.”)

What do you do? You quickly research the topic? How? First step is Google, right?

Here’s what happens when you google *hot hand basketball*:

(I ran the search in anonymous mode so I don’t think it’s using my own search history.)

Whaddya get?

– First item is the Wikipedia entry which right away describes the hot hand as an “allegedly fallacious belief” and includes an entire section on recent research in support of the hot hand (i.e., disagreeing with Gilovich’s claims).

– The second item is a news article saying that the hot hand is real, and featuring the work of Miller and Sanjurjo.

– The third item is the Gilovich et al. paper from 1985.

– The fourth item is another news article saying the hot hand is real.

So, it would be hard to google the topic and *not* come to the conclusion that Gilovich’s claims are, at best, controversial.

Yet this did not come up in the interview. The Freakonomics team was not doing their job. Either they showed up for an interview without doing the simplest Google search, or they did know about the controversy but let their interviewee get away with misrepresenting the state of knowledge.

Too bad. It would’ve been easy for the interviewer to follow up with something like,

DUBNER: That all sounds good, but in preparation for this interview, I looked up the hot hand and I saw that your claims are controversial. Nowadays a lot of people are saying that the hot hand is real, and there seems to be general agreement that the conclusions from your 1985 paper were a mistake, turning on a subtle probability error.

Then Gilovich could reply, maybe something like this:

GILOVICH: Yeah, I’ve heard about this. I’m not a stats expert myself so, sure, I could believe that there were some subtleties in the estimation and the power analysis that we missed. Still, when you look at our data, players and spectators seem sooo convinced that the hot hand is huge, and even if it’s real, it’s hard for me to believe that it’s big as people think. So I’ll hold to my larger point that people overinterpret random events. That’s the real message of our project.

DUBNER: OK, now let’s talk for just a minute about your work in happiness or hedonic studies. . . .

OK, having written this, I can see how Dubner might not have wanted to include such an exchange in the interview, as it’s a bit of a distraction from the main point. But, really, I think you have to do it. The exchange doesn’t make Gilovich look so good, as it forces him to backpedal, but ultimately *that’s Gilovich’s fault* as he was the one to overstate his conclusions in the interview.

I guess this happens in political and celebrity interviews all the time: the interviewee says something false, and then the interviewer has to choose between letting a false statement go by unchallenged, or else blowing the whistle and losing the trust of the interviewee.

But when you’re interviewing a scientist, it should be different.

Look. When I talk about my own research, I try to be complete and open but I’m sure that I do some hype, I don’t dwell on my failures, and there must be times that I don’t get around to mentioning some serious criticisms. I should do better. If I’m being interviewed, I’d appreciate the interviewer calling me on these things. Not that I want to be hassled—I’m not planning to go on Troll TV anytime soon—but if I say something false or incomplete, that’s on me. I’d like to have a chance to explain, in the way that the (hypothetical) Gilovich did above.

The post Gilovich doubles down on hot hand denial appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Move along, nothing to see here appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I don’t really want to go into details on this one as our paper is under review at a journal, but the short story is that my colleagues and I have conducted replications of a high-profile psychology study. Not all our replications had results that made sense, and so from a perfectly reasonable Bayesian perspective we chose the more reasonable outcomes to include in our meta-analyses.

Anyway, this is no big deal and our paper will be peer reviewed in any case. But I felt the need to bring it up here because some methodological terrorists have been going on social media bugging me to post all my data. I’m sorry but it doesn’t work that way. We have IRB rules, subject confidentiality to consider, and also I don’t like the precedent. What would happen if every reputable lab had to share data with bloggers, the rabble on twitter, etc? It would end up like that PACE trial, where all sorts of unqualified outsiders started second-guessing serious researchers. This is just bad news. We have serious science to conduct here. I wish the snipers would spend a little more time doing science themselves and a little less time criticizing. As the great Satoshi Kanazawa put it—or was it Stewart Lee?—those who can, do research. Those who can’t, blog.

We’ll now return to our regularly scheduled programming.

The post Move along, nothing to see here appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Running Stan with external C++ code appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Starting with the 2.13 release, it is much easier to use external C++ code in a Stan program. This vignette briefly illustrates how to do so.

He continues:

Suppose that you have (part of) a Stan program that involves Fibonacci numbers, such as

```stan
functions {
  int fib(int n);
  int fib(int n) {
    if (n <= 0) reject("n must be positive");
    return n <= 2 ? 1 : fib(n - 1) + fib(n - 2);
  }
}
model {} // use the fib() function somehow
```

On the second line, we have *declared* the `fib` function before it is *defined* in order to call it recursively. For functions that are not recursive, it is not necessary to declare them before defining them but it may be advantageous. For example, I often like to hide the definitions of complicated utility functions that are just a distraction using the `#include "file"` mechanism:

```stan
functions {
  real complicated(real a, real b, real c, real d, real e, real f, real g);
#include "complicated.stan" // defines the above function
}
model {} // use the complicated() function somehow
```

This Stan program would have to be parsed using the `stanc_builder` function in the *rstan* package rather than the default `stanc` function (which is called by `sampling` and `stan` internally).

Returning to the Fibonacci example, it is not necessary to define the `fib` function using the Stan language because Stan programs with functions that are *declared* but not *defined* can use the standard capabilities of the C++ toolchain to provide the function definitions in C++. For example, this program produces a parser error by default:

```r
mc <- '
functions { int fib(int n); }
model {} // use the fib() function somehow
'
try(stan_model(model_code = mc, model_name = "parser_error"), silent = TRUE)
```

```
## SYNTAX ERROR, MESSAGE(S) FROM PARSER:
##
## Function declared, but not defined. Function name=fib
##
## ERROR at line 2
##
## 1:
## 2: functions { int fib(int n); }
##    ^
## 3: model {} // use the fib() function somehow
##
```

However, if we specify the `allow_undefined` and `includes` arguments to the `stan_model` function, and define a `fib` function in the named C++ header file, then it will parse and compile:

```r
stan_model(model_code = mc, model_name = "external", allow_undefined = TRUE,
           includes = paste0('\n#include "',
                             file.path(getwd(), 'fib.hpp'), '"\n'))
```

Specifying the `includes` argument is a bit awkward because the C++ representation of a Stan program is written and compiled in a temporary directory. Thus, the `includes` argument must specify a *full* path to the fib.hpp file, which in this case is in the working directory. Also, the path must be enclosed in double-quotes, which is why single quotes are used in the separate arguments to the `paste0` function so that double-quotes are interpreted literally. Finally, the `includes` argument should include newline characters (`"\n"`) at the start and end. It is possible to specify multiple paths using additional newline characters or include a “meta-header” file that contains `#include` directives to other C++ header files.

The result of the `includes` argument is inserted into the C++ file directly after the following lines (as opposed to CmdStan, where it is inserted directly *before* the following lines):

```cpp
#include <stan/model/model_header.hpp>

namespace some_namespace {

using std::istream;
using std::string;
using std::stringstream;
using std::vector;
using stan::io::dump;
using stan::math::lgamma;
using stan::model::prob_grad;
using namespace stan::math;

typedef Eigen::Matrix<double,Eigen::Dynamic,1> vector_d;
typedef Eigen::Matrix<double,1,Eigen::Dynamic> row_vector_d;
typedef Eigen::Matrix<double,Eigen::Dynamic,Eigen::Dynamic> matrix_d;

static int current_statement_begin__;

// various function declarations and / or definitions
```

Thus, the definition of the `fib` function in the fib.hpp file need not be enclosed in any particular namespace (which is a random string by default). The “meta-include” stan/model/model_header.hpp file reads as:

```cpp
#ifndef STAN_MODEL_MODEL_HEADER_HPP
#define STAN_MODEL_MODEL_HEADER_HPP

#include <stan/math.hpp>
#include <stan/io/cmd_line.hpp>
#include <stan/io/dump.hpp>
#include <stan/io/reader.hpp>
#include <stan/io/writer.hpp>
#include <stan/lang/rethrow_located.hpp>
#include <stan/model/prob_grad.hpp>
#include <stan/model/indexing.hpp>
#include <boost/exception/all.hpp>
#include <boost/random/additive_combine.hpp>
#include <boost/random/linear_congruential.hpp>
#include <cmath>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <utility>
#include <vector>

#endif
```

so the definition of the `fib` function in the fib.hpp file could utilize any function in the Stan Math Library (without having to prefix function calls with `stan::math::`), some typedefs to classes in the Eigen matrix algebra library, plus streams, exceptions, etc., without having to worry about the corresponding header files. Nevertheless, an external C++ file *may* contain additional include directives that bring in class definitions, for example.

Now let’s examine the fib.hpp file, which contains the C++ lines:

```cpp
int fib(const int& n, std::ostream* pstream__) {
  if (n <= 0) {
    stringstream errmsg;
    errmsg << "n must be positive";
    throw std::domain_error(errmsg.str());
  }
  return n <= 2 ? 1 : fib(n - 1, 0) + fib(n - 2, 0);
}
```

This C++ function is essentially what the preceding user-defined function in the Stan language

```stan
int fib(int n) {
  if (n <= 0) reject("n must be positive");
  return n <= 2 ? 1 : fib(n - 1) + fib(n - 2);
}
```

parses to. Thus, there is no *speed* advantage to defining the `fib` function in the external fib.hpp file. However, it is possible to use an external C++ file to handle the gradient of a function analytically, as opposed to using Stan’s autodifferentiation capabilities, which are slower and more memory-intensive but fully general. In this case, the `fib` function only deals with integers, so there is nothing to take a derivative of. The primary advantage of using an external C++ file is the flexibility to do things that cannot be done directly in the Stan language. It is also useful for R packages like **rstanarm** that may want to define some C++ functions in the package’s src directory and rely on the linker to make them available in its Stan programs, which are compiled at (or before) installation time.

In the C++ version, we check whether `n` is non-positive, in which case we throw an exception. It is unnecessary to prefix `stringstream` with `std::` because of the `using std::stringstream;` line in the *generated* C++ file. However, there is no corresponding `using std::domain_error;` line, so that exception type has to be qualified appropriately when it is thrown.

The only confusing part of the C++ version of the `fib` function is that it has an additional argument (with no default value) named `pstream__` that is added to the *declaration* of the `fib` function by Stan. Thus, your *definition* of the `fib` function needs to match this signature. This additional argument is a pointer to a `std::ostream` and is only used if your function prints something to the screen, which is rare. Thus, when we call the `fib` function recursively in the last line, we specify `fib(n - 1, 0) + fib(n - 2, 0);` so that the output (if any, and in this case there is none) is directed to the null pointer.

This vignette has employed a toy example with the Fibonacci function, which has little apparent use in a Stan program and, if it were useful, would more easily be implemented as a user-defined function in the `functions` block, as illustrated at the outset. The ability to use external C++ code only becomes useful with more complicated C++ functions. It goes without saying that this mechanism ordinarily cannot call functions in C, Fortran, R, or other languages, because Stan needs the derivatives with respect to unknown parameters in order to perform estimation. These derivatives are handled with custom C++ types that cannot be processed by functions in other languages that only handle primitive types such as `double`, `float`, etc.

That said, it is possible to accomplish a great deal in C++, particularly when utilizing the Stan Math Library. For more details, see *The Stan Math Library: Reverse-Mode Automatic Differentiation in C++* and its GitHub repository. The functions that you *declare* in the `functions` block of a Stan program will typically involve templating and type promotion in their signatures when parsed to C++ (the only exceptions are functions whose only arguments are integers, as in the `fib` function above). Suppose you wanted to define a function whose arguments are real numbers (or at least one of them is). For example,

```r
mc <- '
functions { real besselK(real v, real z); }
model {} // use the besselK() function somehow
'
stan_model(model_code = mc, model_name = "external", allow_undefined = TRUE,
           includes = paste0('\n#include "', file.path(getwd(), 'besselK.hpp'), '"\n'))
```

Although the Stan Math Library (via Boost) has an implementation of the Modified Bessel Function of the Second Kind, it only supports the case where the order (`v`) is an integer. The besselK.hpp file reads as

```cpp
template <typename T0__, typename T1__>
typename boost::math::tools::promote_args<T0__, T1__>::type
besselK(const T0__& v, const T1__& z, std::ostream* pstream__) {
  return boost::math::cyl_bessel_k(v, z);
}
```

because, in general, its first two arguments could be integers, known real numbers, or unknown real parameters. In the case of unknown real parameters, Stan will need to rely on its autodifferentiation mechanism to keep track of the derivatives with respect to those arguments during estimation. But if either of the first two arguments is an integer or a known real number, then Stan avoids taking derivatives with respect to it. Thus, it is useful to write C++ templates that can generate all (four, in this case) versions of this Bessel function with only a few lines of C++ source code. The first line of besselK.hpp states that the first two arguments are templated with typenames `T0__` and `T1__` respectively. The second line is convoluted but merely states that the return type of the `besselK` function depends on `T0__` and `T1__`: in short, if either is an unknown real parameter, then the result will also be an unknown real parameter. The third line contains the generated signature of the function in C++, which involves the typenames `T0__` and `T1__`, as well as the pointer to a `std::ostream` (which is again not used in the body of the function). The body of the `besselK` function is simply a call to the corresponding function in the Boost Math Library, whose headers are pulled in by the Stan Math Library, but the `boost::math::` prefix is necessary due to the absence of a `using namespace boost::math;` statement.

An easy way to see what the generated function signature will be is to call the `stanc` function in the **rstan** package with `allow_undefined = TRUE` and inspect the resulting C++ code. In this case, I first did

```r
try(readLines(stanc(model_code = mc, allow_undefined = TRUE)$cppcode))
```

```
## Warning in file(con, "r"): cannot open file '// Code generated by Stan version 2.14
##
## #include <stan/model/model_header.hpp>
##
## namespace model79d3b505e97_mc_namespace {
##
## using std::istream;
## using std::string;
## using std::stringstream;
## using std::vector;
## using stan::io::dump;
## using stan::math::lgamma;
## using stan::model::prob_grad;
## using namespace stan::math;
##
## typedef Eigen::Matrix<double,Eigen::Dynamic,1> vector_d;
## typedef Eigen::Matrix<double,1,Eigen::Dynamic> row_vector_d;
## typedef Eigen::Matrix<double,Eigen::Dynamic,Eigen::Dynamic> matrix_d;
##
## static int current_statement_begin__;
##
## template <typename T0__, typename T1__>
## typename boost::math::tools::promote_args<T0__, T1__>::type
## besselK(const T0__& v,
##         const T1__& z, std::ostream* pstream__);
##
## class model79d3b505e97_mc : public prob_grad {
## private:
## public:
##   model79d3b505e97_mc(stan::io::var_context& context__,
##                       std::ostream* pstream__ = 0)
##     : prob_grad(0) {
##     typedef boost::ecuyer1988 rng_t;
##     rng_t base_rng(0); // 0 seed d [... truncated]
```

to see what function signature needed to be written for besselK.hpp.

When using external C++ code, care must be taken to ensure that the function is numerically stable over a wide range of floating-point numbers. Indeed, in the case of the `besselK` function, the derivative with respect to the order argument (`v`) may not be sufficiently stable numerically. In general, it is best to strip the underlying double-precision numbers out of Stan’s custom scalar types, evaluate the desired function, and then calculate the necessary derivatives analytically in an object whose class inherits from the `vari` class in Stan. The details of doing so are beyond the scope of this vignette but are discussed in the links above. Once you go to the trouble of writing such a C++ function, we would welcome a pull request on GitHub to include it in the Stan Math Library for everyone to benefit from, provided that it can be licensed under the 3-clause BSD license and its use is not overly specific to your particular Stan program.

The Stan Math Library is compliant with the C++11 standard, but it currently neither utilizes any features introduced by the C++11 standard nor requires a C++11-compliant compiler. However, almost any modern C++ compiler is compliant, so you can use (at least some subset of) the features introduced by the C++11 standard in external C++ code, and your Stan program should compile (perhaps with some warnings). In particular, you may want to use the `auto` keyword to avoid having to learn a lot of the messy type-promotion syntax used in the Stan Math Library and the rules for what kind of object is returned by various mathematical operations. For example, under the C++11 standard, the besselK.hpp file could be written as

```cpp
template <typename T0__, typename T1__>
auto besselK(const T0__& v, const T1__& z, std::ostream* pstream__)
    -> decltype(v + z) {
  return boost::math::cyl_bessel_k(v, z);
}
```

where the `auto` keyword combined with `-> decltype(v + z)` results in the same code as the Boost metaprogram `typename boost::math::tools::promote_args<T0__, T1__>::type`.

The post Running Stan with external C++ code appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Prediction model for fleet management appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I am working on a fleet management system these days: basically, I am trying to predict the usage ‘y’ of our fleet in a zip code in the future. We have some factors ‘X’, such as number of active users, number of active merchants etc.

If I can fix the time horizon, the problem will become relatively easy:

y = beta * X

However, there is no ‘golden’ time horizon. At any moment, I need to make a call whether I need to send more (or fewer) cars to that zip code.

Even worse, not all my factors are the same. Some factors are strong in short-term prediction, some are strong in long-term prediction. If I force them all into one single regression model, I am afraid that I will be hurting the overall regression performance.

I was thinking about multivariate regression, but it does not really solve the problem. Multivariate regression might just give me an ‘average’ (not in a statistical sense) model that tries to predict multiple time horizons, but due to each factor’s unique predictive power, it might not predict well at any time horizon.

I’d recommend fitting a multilevel model in Stan—hey, I’d even do it myself if you paid me enough! Maybe commenters have some more specific ideas.


]]>The post Aggregate age-adjusted trends in death rates for non-Hispanic whites and minorities in the U.S. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Earlier we’d graphed the trends within each state but there was so much going on there, it was hard to see the big picture.

All our graphs are age adjusted.

Also, each graph is on a different scale. Graphs are scaled just to include the data. So look at the y-axes carefully if you want to compare different plots. As always, there’s a tradeoff between the unambiguousness of a common scale and the higher resolution of different scales. In these graphs we went for resolution. Soon we’ll post all our data and R code, and then you can easily play with the code and make your own graphs. Or you can go to CDC Wonder, download the data, and make your own graphs right now.

All trends are from 1999-2014.

**Summary**

Minorities used to have higher death rates than whites among almost all age groups. But now, death rates for whites and minorities are close to equal in most age categories.

During the past decade and a half, death rates for the different minority groups have steadily declined in most age categories, while death rates for whites have declined more slowly and actually increased in a few age categories (25-55-year-old women and 25-34-year-old men).

As demonstrated in our other document, these patterns vary *a lot* by state and region of the country.

The graphs below compare non-Hispanic whites to others for the U.S. as a whole.

**Non-Hispanic whites vs. all minorities**

**Non-Hispanic whites vs. blacks vs. Hispanics vs. others**

**The controversy**

Economists Anne Case and Angus Deaton wrote two papers on mortality rate trends. These two papers did the service of getting these trends discussed in the news media and in the scholarly community more broadly. Unfortunately most of the discussion of this in the news media, back in 2015 and again now, began and ended with Case and Deaton, not giving other perspectives.

I think the best recent news article on the topic is this piece by Malcolm Harris, who went to the trouble to read the literature and interview some demographers who work in this area. Harris goes into detail on the problems with Case and Deaton’s comparisons of trends in different education categories. It’s not that such comparisons shouldn’t be done, but you have to be careful about the interpretation, because of selection bias.

The rest of the news media (NYT, NPR, etc) pretty much punted on this one and just did straight Case-Deaton with no alternative perspectives. Kinda frustrating, but this is the tradition in science reporting, I guess. As I wrote earlier, I’m not saying these journalists had to talk with me; the appropriate people to interview are various actuaries, demographers, and public health researchers who understand these numbers inside and out.

**Obligatory criticism of the graphical display**

I’d prefer a different color scheme—it’s kinda weird that the color for white people changes from the first to the second set of graphs. Also, I’d take advantage of the scrolling feature of html and display the graphs in a 10 x 4 grid: that’s 10 age categories x 4 batches of graphs (women whites vs. minorities, men whites vs. minorities, women whites vs. blacks vs. hispanics vs. others, men whites vs. blacks vs. hispanics vs. others).

Or maybe, how’s this for an idea: 10 x 5, each graph has one line for men and one for women, and the 5 columns are whites, blacks, hispanics, others, and all minorities. That could work! At least, it’s worth a try. I don’t like the current graphs with 4 different colored lines; it’s too hard for me to keep these clear in my head. I keep finding myself going back and forth between the plots and the color code, and that’s not good.


]]>The post Hideout appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I got this email from a journalist:

This seems . . . irresponsible to me.

Particularly:

For the first 100 years that meteorologists kept weather records at Central Park, from 1869 through 1996, they recorded just two snowstorms that dumped 20 inches or more. But since 1996, counting this week’s storm, there have been six. (You’ll find similar stats for other major East Coast cities.)

Basically, we’ve become accustomed to something that used to be very rare.

The link points to a post by Eric Holthaus on grist.org, and I agree with the person who sent this to me, that the argument is pretty bad, at least as presented.

First, let’s compute the simple statistical comparison. The previous rate was 2 out of 128, and the new rate is 6 out of 22. So:

```r
y <- c(2, 6)
n <- c(128, 22)
p_hat <- y/n
diff <- p_hat[2] - p_hat[1]
se_diff <- sqrt(sum(p_hat*(1-p_hat)/n))
```

The difference between the two probabilities is 0.26 and the standard error is 0.10. So, sure, it’s more than 2 standard errors from zero: good enough for grist.org, PPNAS, NPR, and your friendly neighborhood Ted talk.

But not good enough for the rest of us. The researcher degrees of freedom are obvious: the choice of 1996 as a cutpoint, the choice of 20 inches as a cutpoint, and the decision to pick out snowstorms as the outcome of interest. Also there can be changes at various time scales, so it's not quite right to treat each year as an independent data point. In summary, just by chance alone, we'd expect to be able to see lots of apparently statistically significant patterns by sifting through weather data in this way.

The story here is the usual: To the extent that this evidence is presented in support of a clear theory, it could be meaningful. By itself, it's noise.


]]>The post Easier-to-download graphs of age-adjusted mortality trends by sex, ethnicity, and age group appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Jonathan Auerbach and I recently created graphs of smoothed age-adjusted mortality trends from 1999-2014 for:

– 50 states

– men and women

– non-hispanic whites, blacks, and hispanics

– age categories 0-1, 1-4, 5-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84.

We posted about this on the blog and also wrote an article for Slate expressing our frustration with scientific papers and news reports that oversimplified the trends.

Anyway, when Jonathan and I put all our graphs into a file it was really really huge, and I fear this could’ve dissuaded people from downloading it.

Fortunately, Yair found a way to compress the pdf file. It’s now still pretty readable and only takes up 9 megabytes. So go download it and enjoy all the stunning plots. (See above for an example.) Thanks, Yair!

**P.S.** If you want the images in full resolution, the original document remains here


]]>The post Crack Shot appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Raghu Parthasarathy writes:

You might find this interesting: an article (and related essay) on the steadily declining percentage of NIH awards going to mid-career scientists and the steadily increasing percentage going to older researchers. The key figure is below. The part that may be of particular interest to you, since you’ve written about age adjustment in demographic work: does an analysis like this have to account for the changing demographics of the US population (which wasn’t done), or is that irrelevant, since there’s no necessary link between the age distribution of scientists and that of the society they’re drawn from? I have no idea, but I figured you might.

Most of the article is about the National Heart Lung and Blood Institute, one of the NIH institutes, but I would bet that its findings are quite general.

Jeez, what an ugly pair of graphs! Actually, I’ve seen a lot worse. These graphs are actually pretty functional. But uuuuuugly. And what’s with those R-squareds? Anyway, the news seems good to me: the crossover point seems to be happening just about when I turn 55. And I’d sure like some of that National Heart Lung and Blood Institute money for Stan.


]]>