“Dear Major Textbook Publisher”: A Rant


Dear Major Academic Publisher,

You just sent me, unsolicited, an introductory statistics textbook that is 800 pages and weighs about 5 pounds. It’s the 3rd edition of a book by someone I’ve never heard of. That’s fine—a newcomer can write a good book. The real problem is that the book is crap. It’s just the usual conventional intro stat stuff. The book even has a table of the normal distribution on the inside cover! How retro is that?

The book is bad in so many many ways, I don’t really feel like going into it. There’s nothing interesting here at all, the examples are uniformly fake, and I really can’t imagine this is a good way to teach this material to anybody. None of it makes sense, and a lot of the advice is out-and-out bad (for example, a table saying that a p-value between 0.05 and 0.10 is “moderate evidence” and that a p-value between 0.10 and 0.15 is “slight evidence”). This is not at all the worst thing I saw; I’m just mentioning it here to give a sense of the book’s horrible mixture of ignorance and sloppiness.

I could go on and on. But, again, I don’t want to do so.

I can’t blame the author, who, I’m sure, has no idea what he is doing in any case. It would be as if someone hired me to write a book about, ummm, I dunno, football. Or maybe rugby would be an even better analogy, since I don’t even know the rules to that one.

Who do I blame, then? I blame you, the publisher.

You bastards.

Out of some goal of making a buck, you inflict this pile of crap on students, charging them $200—that’s right, the list price is just about two hundred dollars—for the privilege of ingesting some material that is both boring and false.

And, the worst thing is, this isn’t even your only introductory statistics book! You publish others that are better than this one. I guess you figure there’s a market for anything. It’s free money, right?

And then you go the extra environment-destroying step of printing a copy just for me and mailing it over here, just so that I can throw it out.

Please do me a favor. Shut your business down and go into something more productive to the world. For example, you could run a three-card monte game on the street somewhere. Three-card monte, that’s still a thing, right?

Hey, I forgot to include a cat picture in my previous post!

Josh Miller fixes it for me:


Hot hand 1, WSJ 0


In a generally good book review on “uncertainty and the limits of human reason,” William Easterly writes:

Failing to process uncertainty correctly, we attach too much importance to too small a number of observations. Basketball teams believe that players suddenly have a “hot hand” after they have made a string of baskets, so you should pass them the ball. Tversky showed that the hot hand was a myth—among many small samples of shooting attempts, there will randomly be some streaks. Instead of a hot hand, there was “regression to the mean”—players fall back down to their average shooting prowess after a streak. Likewise a “cold” player will move back up to his own average.

No no no. The funny thing is:

1. As Miller and Sanjurjo explain, the mistaken belief that there is no hot hand is itself a result of people “attaching too much importance to too small a number of observations.”

2. This is not news to the Wall Street Journal! Ben Cohen reported on the hot hand over a year ago!

On the plus side, Easterly’s review did not mention himmicanes, power pose, the gay gene, the contagion of obesity, or the well-known non-finding of an increase in the death rate among middle-aged white men.

In all seriousness, the article is fine; it’s just interesting how misconceptions such as the hot hand fallacy fallacy can persist and persist and persist.

Data 1, NPR 0


Jay “should replace the Brooks brothers on the NYT op-ed page” Livingston writes:

There it was again, the panic about the narcissism of millennials as evidenced by selfies. This time it was NPR’s podcast Hidden Brain. The show’s host Shankar Vedantam chose to speak with only one researcher on the topic – psychologist Jean Twenge, whose even-handed and calm approach is clear from the titles of her books, Generation Me and The Narcissism Epidemic. . . .

What’s the evidence that so impressed National Public Radio? Livingston explains:

There are serious problems with the narcissism trope. One is that people use the word in many different ways. For the most part, we are not talking about what the DSM-IV calls Narcissistic Personality Disorder. That diagnosis fits only a relatively few (a lifetime prevalence of about 6%). For the rest, the hand-wringers use a variety of terms. Twenge, in the Hidden Brain episode, uses individualism and narcissism as though they were interchangeable. She refers to her data on the increase in “individualistic” pronouns and language, even though linguists have shown this idea to be wrong (see Mark Liberman at Language Log here and here). . . .

Then there’s the generational question. Are millennials more narcissistic than were their parents or grandparents? . . . if you’re old enough, when you read the title The Narcissism Epidemic, you heard a faint echo of a book by Christopher Lasch published thirty years earlier.

And now on to the data:

We have better evidence than book titles. Since 1975, Monitoring the Future (here) has surveyed large samples of US youth. It wasn’t designed to measure narcissism, but it does include two relevant questions:
Compared with others your age around the country, how do you rate yourself on school ability?
How intelligent do you think you are compared with others your age?
It also has self-esteem items including
I take a positive attitude towards myself
On the whole, I am satisfied with myself
I feel I do not have much to be proud of (reverse scored)
A 2008 study compared 5-year age groupings and found absolutely no increase in “egotism” (those two “compared with others” questions). The millennials surveyed in 2001-2006 were almost identical to those surveyed twenty-five years earlier. The self-esteem questions too showed little change.

Another study by Brent Roberts, et al., tracked two sources for narcissism: data from Twenge’s own studies; and data from a meta-analysis that included other research, often with larger samples. The test of narcissism in all cases was the Narcissistic Personality Inventory – 40 questions designed to tap narcissistic ideas.

Their results look like this:

[Graph: Narc graph 2]

Twenge’s sources justify her conclusion that narcissism is on the rise. But include the other data and you wonder if all the fuss about kids today is a bit overblown. You might not like participation trophies or selfie sticks or Instagram, but it does not seem likely that these have created an epidemic of narcissism.

Oooh—ugly ugly ugly Excel graph. Still, Livingston has a point.

Ahhhh, NPR!

best algorithm EVER !!!!!!!!


Someone writes:

On the website you find a lot of material for Optimal (or “optimizing”) Data Analysis (ODA) which is described as:

In the Optimal (or “optimizing”) Data Analysis (ODA) statistical paradigm, an optimization algorithm is first utilized to identify the model that explicitly maximizes predictive accuracy for the sample, and then the resulting optimal performance is evaluated in the context of an application-specific exact statistical architecture. Discovered in 1990, the first and most basic ODA model was a distribution-free machine learning algorithm used to make maximum accuracy classifications of observations into one of two categories (pass or fail) on the basis of their score on an ordered attribute (test score). When the first book on ODA was written in 2004 a cornucopia of indisputable evidence had already amassed demonstrating that statistical models identified by ODA were more flexible, transparent, intuitive, accurate, parsimonious, and generalizable than competing models instead identified using an unintegrated menagerie of legacy statistical methods. Understanding of ODA methodology skyrocketed over the next decade, and 2014 produced the development of novometric theory – the conceptual analogue of quantum mechanics for the statistical analysis of classical data. Maximizing Predictive Accuracy was written as a means of organizing and making sense of all that has so-far been learned about ODA, through November of 2015.

I found a paper in which a comparison of several machine learning algorithms reveals that a classification tree analysis based on the ODA approach delivers the best classification results (compared to binary regression, random forest, SVM, etc.)

So far, based on the given information, it sounds pretty appealing – do you see any pitfalls? – would you recommend using it in data analysis when I want to achieve accurate predictions?

My reply: I have no idea. It seems like a lot of hype to me: “discovered . . . cornucopia . . . menagerie . . . skyrocketed . . . novometric theory . . . conceptual analogue of quantum mechanics.”

But, hey, something can be hyped and still be useful, so who knows? I’ll leave it for others to make their judgments on this one.

Using Stan in an agent-based model: Simulation suggests that a market could be useful for building public consensus on climate change


Jonathan Gilligan writes:

I’m writing to let you know about a preprint that uses Stan in what I think is a novel manner: Two graduate students and I developed an agent-based simulation of a prediction market for climate, in which traders buy and sell securities that are essentially bets on what the global average temperature will be at some future time. We use Stan as part of the model: at every time step, simulated traders acquire new information and use this information to update their statistical models of climate processes and generate predictions about the future.

J.J. Nay, M. Van der Linden, and J.M. Gilligan, Betting and Belief: Prediction Markets and Attribution of Climate Change, (code here).

ABSTRACT: Despite much scientific evidence, a large fraction of the American public doubts that greenhouse gases are causing global warming. We present a simulation model as a computational test-bed for climate prediction markets. Traders adapt their beliefs about future temperatures based on the profits of other traders in their social network. We simulate two alternative climate futures, in which global temperatures are primarily driven either by carbon dioxide or by solar irradiance. These represent, respectively, the scientific consensus and a hypothesis advanced by prominent skeptics. We conduct sensitivity analyses to determine how a variety of factors describing both the market and the physical climate may affect traders’ beliefs about the cause of global climate change. Market participation causes most traders to converge quickly toward believing the “true” climate model, suggesting that a climate market could be useful for building public consensus.

Our simulated traders treat the global temperature as a linear function of a forcing term (either the logarithm of the atmospheric carbon dioxide concentration or the total solar irradiance) plus an auto-correlated noise process. Each trader has an individual belief about the cause of climate change, and uses the corresponding forcing term. At each time step, the simulated traders use past temperatures to fit parameters for their time-series models, use these models to extrapolate probability distributions for future temperatures, and use these probability distributions to place bets (buy and sell securities).
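
For concreteness, here is one way to write down the kind of model described above. The notation is mine, not the preprint's: the ARMA orders p and q and the Gaussian innovations are assumptions for illustration, and F_t stands for whichever forcing term a given trader believes in (log CO2 concentration or total solar irradiance).

```latex
T_t = \alpha + \beta F_t + \varepsilon_t, \qquad
\varepsilon_t = \sum_{i=1}^{p} \phi_i \, \varepsilon_{t-i} + \eta_t + \sum_{j=1}^{q} \theta_j \, \eta_{t-j}, \qquad
\eta_t \sim \mathrm{N}(0, \sigma^2)
```

Here T_t is global temperature at time step t, and each trader refits the parameters (alpha, beta, the phi's and theta's, and sigma) as new temperatures arrive.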

Gilligan continues:

We developed our agent-based model in R. At first, we used the well-known nlme package to fit generalized least-squares models of global temperature with ARMA noise, but this was both very slow and unstable: many model runs failed with cryptic and poorly documented error messages from nlme.

Then we tried coding the time series model in Stan. The excellent manual and helpful advice from the Stan users mailing list allowed us to quickly write and debug a time-series model. To our great surprise, the full Bayesian analysis with Stan was much faster than nlme. Moreover, the generated quantities block in a Stan program makes it easy for our agents to generate predicted probability distributions for future temperatures by sampling model parameters from the joint posterior distribution and then simulating a stochastic ARMA noise process.

Fitting the time-series models at each time step is the big bottleneck in our simulation, so the speedup we achieved in moving to Stan helped a lot. This made it much easier to debug and test the model and also to perform a sensitivity analysis that required 5000 simulation runs, each of which called Stan more than 160 times, sampling 4 chains for 800 iterations each. Stan’s design—one slow compilation step that produces a very fast sampler, which can be called over and over—is ideally suited to this project.
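
To get a feel for the scale of that sensitivity analysis, here is a back-of-the-envelope count in Python. It assumes exactly 160 Stan calls per run (the email says "more than 160," so this is a lower bound), and the variable names are just for illustration. The email does not say how many of the 800 iterations were warmup.

```python
# Rough count of the sampling work in the sensitivity analysis described above.
runs = 5000                  # simulation runs
stan_calls_per_run = 160     # "more than 160", so treat this as a lower bound
chains = 4                   # chains per Stan call
iterations_per_chain = 800   # iterations per chain (warmup share not stated)

stan_calls = runs * stan_calls_per_run
total_iterations = stan_calls * chains * iterations_per_chain

print(f"Stan calls: {stan_calls:,}")
print(f"Total sampler iterations: {total_iterations:,}")
```

That is at least 800,000 model fits and about 2.56 billion sampler iterations, which is why compiling once and reusing a fast sampler matters so much here.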


Gilligan concludes:

We would like to thank you and the Stan team, not just for writing such a powerful tool, but also for supporting it so well with superb documentation, examples, and the Stan-users email list.

You’re welcome!

Mighty oaks from little acorns grow


Eric Loken writes:

Do you by any chance remember the bogus survey that Augusta National carried out in 2002 to deflect criticism about not having any female members? I even remember this survey being ridiculed by ESPN who said their polls showed much more support for a boycott and sympathy with Martha Burk.

Anyway, sure that’s a long time ago. But I’ve often mentioned this survey in my measurement classes over the years. Guess who was the architect of that survey?

Boy oh boy . . . I didn’t know how long she’d been at it.

I’ve been searching everywhere for the text of the survey. In one news story she said, “If I thought the survey was slanted why would I have insisted that the sponsor release the entire list of questions?” At one point I had it somewhere . . . but maybe not electronic. After all it was 2002!

There was a piece in the Guardian that listed even more of the questions and some very severe British criticism. I had found that article this afternoon but now I can’t find it again.

Anyway, the Tribune piece gives the general idea.

This somehow reminds me of President Bloomberg’s pollster, Doug Schoen.

Frustration with published results that can’t be reproduced, and journals that don’t seem to care


Thomas Heister writes:

Your recent post about Per Pettersson-Lidbom’s frustrations in reproducing study results reminded me of our own recent experience replicating a paper in PLOSone. We found numerous substantial errors but eventually gave up as, frustratingly, the time and effort didn’t seem to change anything and the journal’s editors quite obviously regarded our concerns as a mere annoyance.

We initially stumbled across this study by Collignon et al (2015) that explains antibiotic resistance rates by country-level corruption levels as it raised red flags for an omitted variable bias (it’s at least not immediately intuitive to us how corruption causes resistance in bacteria). It wasn’t exactly a high-impact sort of study which a whole lot of people will read/cite but we thought we’d look at it anyway as it seemed relevant for our field. As the authors provided their data we tried to reproduce their findings and actually found a whole lot of simple but substantial errors in their statistical analysis and data coding that led to false findings. We wrote a detailed analysis of the errors and informed the editorial office, as PLOSone only has an online comment tool but doesn’t accept letters. The apparent neglect of the concerns raised (see email correspondence below) led us to finally publish our letter as an online comment at PLOSone. The authors’ responses are quite lengthy but in essence touch on only some of the things we criticize and entirely neglect some of our most important points. Frustratingly, we finally got an answer from PLOSone (see below) that the editors were happy with the authors’ reply and didn’t consider further action. This is remarkable considering that the main explanatory variable is completely useless as can be very easily seen in our re-analysis of the dataset (see table 1).

Maybe our experience is just an example of the issues with Open-Access journals, maybe of the problem of journals generally not accepting letters, or maybe just that a lot of journals still see replications and criticism of published studies as an attack on the journal’s scientific standing. Sure, this paper will probably not have a huge impact, but false findings like these might easily slip into the “what has been shown on this topic” citation loop in the introduction parts.

I would be very interested to hear your opinion on this topic with respect to PLOS journals, its “we’re not looking at the contribution of a paper, only whether it’s methodologically sound” policy and open access.

My reply: We have to think of the responsibility as being the authors’, not the journals’. Journals just don’t have the resources to adjudicate this sort of dispute.

So little information to evaluate effects of dietary choices


Paul Alper points to this excellent news article by Aaron Carroll, who tells us how little information is available in studies of diet and public health. Here’s Carroll:

Just a few weeks ago, a study was published in the Journal of Nutrition that many reports in the news media said proved that honey was no better than sugar as a sweetener, and that high-fructose corn syrup was no worse. . . .

Not so fast. A more careful reading of this research would note its methods. The study involved only 55 people, and they were followed for only two weeks on each of the three sweeteners. . . . The truth is that research like this is the norm, not the exception. . . .

Readers often ask me how myths about nutrition get perpetuated and why it’s not possible to do conclusive studies to answer questions about the benefits and harms of what we eat and drink.

Good question. Why is it that supposedly evidence-based health recommendations keep changing?

Carroll continues:

Almost everything we “know” is based on small, flawed studies. . . . This is true not only of the newer work that we see, but also the older research that forms the basis for much of what we already believe to be true. . . .

The honey study is a good example of how research can become misinterpreted. . . . A 2011 systematic review of studies looking at the effects of artificial sweeteners on clinical outcomes identified 53 randomized controlled trials. That sounds like a lot. Unfortunately, only 13 of them lasted for more than a week and involved at least 10 participants. Ten of those 13 trials had a Jadad score — which is a scale from 0 (minimum) to 5 (maximum) to rate the quality of randomized control trials — of 1. This means they were of rather low quality. None of the trials adequately concealed which sweetener participants were receiving. The longest trial was 10 weeks in length.

According to Carroll, that’s it:

This is the sum total of evidence available to us. These are the trials that allow articles, books, television programs and magazines to declare that “honey is healthy” or that “high fructose corn syrup is harmful.” This review didn’t even find the latter to be the case. . . .

My point is not to criticize research on sweeteners. This is the state of nutrition research in general. . . .

I just have one criticism. Carroll writes:

The outcomes people care about most — death and major disease — are actually pretty rare.

Death isn’t so rare. Everyone dies! Something like 1/80 of the population dies every year. The challenge is connecting the death to a possible cause such as diet.

Carroll also talks about the expense and difficulty of doing large controlled studies. Which suggests to me that we should be able to do better in our observational research. I don’t know exactly how to do it, but there should be some useful bridge between available data, on one hand, and experiments with N=55, on the other.

P.S. I followed a link to another post by Carroll which includes this crisp graph:

[Graph: Screen Shot 2016-04-06 at 10.44.03 AM]

Some U.S. demographic data at zipcode level conveniently in R

Ari Lamstein writes:

I chuckled when I read your recent “R Sucks” post. Some of the comments were a bit … heated … so I thought to send you an email instead.

I agree with your point that some of the datasets in R are not particularly relevant. The way that I’ve addressed that is by adding more interesting datasets to my packages. For an example of this you can see my blog post choroplethr v3.1.0: Better Summary Demographic Data. By typing just a few characters you can now view eight demographic statistics (race, income, etc.) of each state, county and zip code in the US. Additionally, mapping the data is trivial.

I haven’t tried this myself, but assuming it works . . . that’s great to be able to make maps of American Community Survey data at the zipcode level!

Survey weighting and that 2% swing


Nate Silver agrees with me that much of that shocking 2% swing can be explained by systematic differences between sample and population: survey respondents included too many Clinton supporters, even after corrections from existing survey adjustments.

In Nate’s words, “Pollsters Probably Didn’t Talk To Enough White Voters Without College Degrees.” Last time we looked carefully at this, my colleagues and I found that pollsters weighted for sex x ethnicity and age x education, but not by ethnicity x education.

I could see that this could be an issue. It goes like this: Surveys typically undersample less-educated people, I think even relative to their proportion of voters. So you need to upweight the less-educated respondents. But less-educated respondents are more likely to be African Americans and Latinos, so this will cause you to upweight these minority groups. Once you’re through with the weighting (whether you do it via Mister P or classical raking or Bayesian Mister P), you’ll end up matching your target population on ethnicity and education, but not on their interaction, so you could end up with too few low-income white voters.
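
Here's a minimal sketch of that last point, using classical raking (iterative proportional fitting) on a 2 x 2 table. All the numbers are made up for illustration, the `rake` function is my own toy implementation rather than anything a polling firm uses, and I use education (not income) as the second variable to match the ethnicity x education story above.

```python
def rake(sample, row_targets, col_targets, iterations=200):
    """Iterative proportional fitting (raking) of a 2-D table of shares."""
    table = [row[:] for row in sample]
    for _ in range(iterations):
        # Rescale each row (ethnicity) to hit its target margin.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            table[i] = [cell * target / total for cell in table[i]]
        # Rescale each column (education) to hit its target margin.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / total
    return table


# Hypothetical shares. Rows: white, nonwhite. Columns: no college, college.
population = [[0.45, 0.25],
              [0.22, 0.08]]
sample = [[0.20, 0.50],
          [0.18, 0.12]]

row_targets = [sum(row) for row in population]        # ethnicity margin
col_targets = [sum(col) for col in zip(*population)]  # education margin

raked = rake(sample, row_targets, col_targets)

print("ethnicity margin after raking:", [round(sum(row), 3) for row in raked])
print("education margin after raking:", [round(sum(col), 3) for col in zip(*raked)])
print("white, no college: population %.3f, raked sample %.3f"
      % (population[0][0], raked[0][0]))
```

With these invented numbers, the raked sample matches the population on both margins, but the white, no-college cell comes out around 0.42 instead of 0.45. Raking preserves the sample's own association between ethnicity and education, so if the sample gets that interaction wrong, the weights don't fix it.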

There’s also the gender gap: you want the right number of low-income white male and female voters in each category. In particular, we found that in 2016 the gender gap increased with education, so if your sample gets some of these interactions wrong, you could be biased.

Also a minor thing: Back in the 1990s the ethnicity categories were just white / other and there were 4 education categories: no HS / HS / some college / college grad. Now we use 4 ethnicity categories (white / black / hisp / other) and 5 education categories (splitting college grad into college grad / postgraduate degree). Still just 2 sexes though. For age, I think the standard is 18-29, 30-44, 45-64, and 65+. But given how strongly nonresponse rates vary by age, it could make sense to use more age categories in your adjustment.

Anyway, Nate’s headline makes sense to me. One thing surprises me, though. He writes, “most pollsters apply demographic weighting by race, age and gender to try to compensate for this problem. It’s less common (although by no means unheard of) to weight by education, however.” Back when we looked at this, a bit over 20 years ago, we found that some pollsters didn’t weight at all, some weighted only on sex, and some weighted on sex x ethnicity and age x education. The surveys that did very little weighting relied on the design to get a more representative sample, either using quota sampling or using tricks such as asking for the youngest male adult in the household.

Also, Nate writes, “the polls may not have reached enough non-college voters. It’s a bit less clear whether this is a longstanding problem or something particular to the 2016 campaign.” All the surveys I’ve seen (except for our Xbox poll!) have massively underrepresented young people, and this has gone back for decades. So no way it’s just 2016! That’s why survey organizations adjust for age. There’s always a challenge, though, in knowing what distribution to adjust to, as we don’t know turnout until after the election—and not even then, given all the problems with exit polls.

P.S. The funny thing is, back in September, Sam Corbett-Davies, David Rothschild, and I analyzed some data from a Florida poll and came up with the estimate that Trump was up by 1 in that state. This was a poll where the other groups analyzing the data estimated Clinton up by 1, 3, or 4 points. So, back then, our estimate was that a proper adjustment (in this case, using party registration, which we were able to do because this poll sampled from voter registration lists) would shift the polls by something like 2% (that is, 4% in the differential between the two candidates). But we didn’t really do anything with this. I can’t speak for Sam or David, but I just figured this was just one poll and I didn’t take it so seriously.

In retrospect maybe I should’ve thought more about the idea that mainstream pollsters weren’t adjusting their numbers enough. And in retrospect Nate should’ve thought of that too! Our analysis was no secret; it appeared in the New York Times. So Nate and I were both guilty of taking the easy way out and looking at poll aggregates and not doing the work to get inside the polls. We’re doing that now, in December, but we should’ve been doing it in October. Instead of obsessing about details of poll aggregation, we should’ve been working more closely with the raw data.

P.P.S. Could someone please forward this email to Nate? I don’t think he’s getting my emails any more!

How can you evaluate a research paper?


Shea Levy writes:

You ended a post from last month [i.e., Feb.] with the injunction to not take the fact of a paper’s publication or citation status as meaning anything, and instead that we should “read each paper on its own.” Unfortunately, while I can usually follow e.g. the criticisms of a paper you might post, I’m not confident in my ability to independently assess arbitrary new papers I find. Assuming, say, a semester of a biological sciences-focused undergrad stats course and a general willingness and ability to pick up any additional stats theory or practice, what should someone in the relevant fields do to get to the point where they can meaningfully evaluate each paper they come across?

My reply: That’s a tough one. My own view of research papers has become much more skeptical over the years. For example, I devoted several posts to the Dennis-the-Dentist paper without expressing any skepticism at all—and then Uri Simonsohn comes along and shoots it down. So it’s hard to know what to say. I mean, even as of 2007, I think I had a pretty good understanding of statistics and social science. And look at all the savvy people who got sucked into that Bem ESP thing—not that they thought Bem had proved ESP, but many people didn’t realize how bad that paper was, just on statistical grounds.

So what to do to independently assess new papers?

I think you have to go Bayesian. And by that I don’t mean you should be assessing your prior probability that the null hypothesis is true. I mean that you have to think about effect sizes, on one side, and about measurement, on the other.

It’s not always easy. For example, I found the claimed effect sizes for the Dennis/Dentist paper to be reasonable (indeed, I posted specifically on the topic). For that paper, the problem was in the measurement, or one might say the likelihood: the mapping from underlying quantity of interest to data.

Other times we get external information, such as the failed replications in ovulation-and-clothing, or power pose, or embodied cognition. But we should be able to do better, as all these papers had major problems which were apparent, even before the failed reps.

One cue which we’ve discussed a lot: if a paper’s claim relies on p-values, and they have lots of forking paths, you might just have to set the whole paper aside.

Medical research: I’ve heard there’s lots of cheating, lots of excluding patients who are doing well under the control condition, lots of ways to get people out of the study, lots of playing around with endpoints.

The trouble is, this is all just a guide to skepticism. But I’m not skeptical about everything.

And the solution can’t be to ask Gelman. There’s only one of me to go around! (Or two, if you count my sister.) And I make mistakes too!

So I’m not sure. I’ll throw the question to the commentariat. What do you say?

An exciting new entry in the “clueless graphs from clueless rich guys” competition


Jeff Lax points to this post from Matt Novak linking to a post by Matt Taibbi that shares the above graph from newspaper columnist / rich guy Thomas Friedman.

I’m not one to spend precious blog space mocking bad graphs, so I’ll refer you to Novak and Taibbi for the details.

One thing I do want to point out, though, is that this is not necessarily the worst graph promulgated recently by a zillionaire. Let’s never forget this beauty which was being spread on social media by wealthy human Peter Diamandis:


Interesting epi paper using Stan

Jon Zelner writes:

Just thought I’d send along this paper by Justin Lessler et al. Thought it was both clever & useful and a nice ad for using Stan for epidemiological work.

Basically, what this paper is about is estimating the true prevalence and case fatality ratio of MERS-CoV [Middle East Respiratory Syndrome Coronavirus Infection] using data collected via a mix of passive and active surveillance, which if treated naively will result in an overestimate of case fatality and underestimate of burden b/c only the most severe cases are caught via passive surveillance. All of the interesting modeling details are in the supplementary information.
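
To see why the naive estimate goes wrong, here's a toy calculation. This is not the model in the paper (that's a Stan model, with details in the supplement); it's simple arithmetic with invented numbers, and it assumes that within each severity group the chance of being detected is unrelated to whether the patient dies.

```python
def case_fatality(groups, detected_only):
    """Return (case fatality ratio, number of cases) over all severity groups."""
    cases = 0.0
    deaths = 0.0
    for g in groups:
        # Either count every infection, or only those surveillance picks up.
        n = g["infections"] * (g["detection"] if detected_only else 1.0)
        cases += n
        deaths += n * g["fatality"]
    return deaths / cases, cases


# Hypothetical numbers: passive surveillance catches most severe cases
# and very few mild ones.
groups = [
    {"name": "severe", "infections": 100, "fatality": 0.30, "detection": 0.90},
    {"name": "mild", "infections": 900, "fatality": 0.01, "detection": 0.05},
]

true_cfr, true_cases = case_fatality(groups, detected_only=False)
naive_cfr, seen_cases = case_fatality(groups, detected_only=True)

print("true:  CFR %.3f among %d infections" % (true_cfr, true_cases))
print("naive: CFR %.3f among %d detected cases" % (naive_cfr, seen_cases))
```

In this made-up example the naive case fatality ratio is roughly five times the true one (about 0.20 versus 0.04), while the detected case count is a small fraction of the actual infections. Those are the two biases Zelner describes.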

“A bug in fMRI software could invalidate 15 years of brain research”


About 50 people pointed me to this press release or the underlying PPNAS research article, “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates,” by Anders Eklund, Thomas Nichols, and Hans Knutsson, who write:

Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.

This is all fine (I got various emails with lines such as, “Finally, a PPNAS paper you’ll appreciate”), and I’m guessing it won’t surprise Vul, Harris, Winkielman, and Pashler one bit.

I continue to think that the false-positive, false-negative thing is a horrible way to look at something like brain activity, which is happening all over the place all the time. The paper discussed above looks like a valuable contribution and I hope people follow up by studying the consequences of these fMRI issues using continuous models.

OK, sometimes the concept of “false positive” makes sense.


Paul Alper writes:

I know by searching your blog that you hold the position, “I’m negative on the expression ‘false positives.’”

Nevertheless, I came across this. In the medical/police/judicial world, false positive is a very serious issue:


$2: Cost of a typical roadside drug test kit used by police departments. Namely, is that white powder you’re packing baking soda or blow? Well, it turns out that these cheap drug tests have some pretty significant problems with false positives. One study found 33 percent of cocaine field tests in Las Vegas between 2010 and 2013 were false positives. According to Florida Department of Law Enforcement data, 21 percent of substances identified by the police as methamphetamine were not methamphetamine. [ProPublica]

The ProPublica article is lengthy:

Tens of thousands of people every year are sent to jail based on the results of a $2 roadside drug test. Widespread evidence shows that these tests routinely produce false positives. Why are police departments and prosecutors still using them? . . .

The Harris County district attorney’s office is responsible for half of all exonerations by conviction-integrity units nationwide in the past three years — not because law enforcement is different there but because the Houston lab committed to testing evidence after defendants had already pleaded guilty, a position that is increasingly unpopular in forensic science. . . .

The Texas Criminal Court of Appeals overturned Albritton’s conviction in late June, but before her record can be cleared, that reversal must be finalized by the trial court in Houston. Felony records are digitally disseminated far and wide, and can haunt the wrongly convicted for years after they are exonerated. Until the court makes its final move, Amy Albritton — for the purposes of employment, for the purposes of housing, for the purposes of her own peace of mind — remains a felon, one among unknown tens of thousands of Americans whose lives have been torn apart by a very flawed test.

Yes, I agree. There are cases where “false positive” and “false negative” make sense. Just not in general for scientific hypotheses. I think the statistical framework of hypothesis testing (Bayesian or otherwise) is generally a mistake. But in settings in which individuals are in one of some number of discrete states, it can make a lot of sense to think about false positives and negatives.
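To make the discrete-state case concrete, here is a minimal sketch of how the share of wrong positives depends on how often the tested substance really is the drug. The function name and all three input numbers are hypothetical, chosen only for illustration; none of them come from the ProPublica reporting.

```python
def false_positive_share(base_rate, sensitivity, false_positive_rate):
    """Share of positive test results that are wrong (1 minus the positive predictive value)."""
    true_pos = base_rate * sensitivity                    # drug present, test says yes
    false_pos = (1.0 - base_rate) * false_positive_rate   # drug absent, test says yes
    return false_pos / (true_pos + false_pos)


# Hypothetical inputs: a test that catches 95% of real samples and
# wrongly flags 10% of innocent ones, applied at three different base rates.
shares = {}
for base_rate in (0.9, 0.5, 0.1):
    shares[base_rate] = false_positive_share(
        base_rate, sensitivity=0.95, false_positive_rate=0.10
    )
    print(f"base rate {base_rate:.0%}: {shares[base_rate]:.1%} of positives are false")
```

The same test looks much worse as the base rate falls, which is why "what fraction of positives were false" and "what is the test's false-positive rate" are different questions.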

The funny thing is, someone once told me that he had success teaching the concepts of type 1 and 2 errors by framing the problem in terms of criminal defendants. My reaction was that he was leading the students exactly in the wrong direction!

I haven’t commented on the politics of the above story but of course I agree that it’s horrible. Imagine being sent to prison based on some crappy low-quality lab test. There’s a real moral hazard here: The people who do these tests and who promote them based on bad data, they aren’t at risk of going to prison themselves here, even though they’re putting others in jeopardy.

An election just happened and I can’t stop talking about it


Some things I’ve posted elsewhere:

The Electoral College magnifies the power of white voters (with Pierre-Antoine Kremp)

I’m not impressed by this claim of vote rigging

And, in case you missed it:

Explanations for that shocking 2% shift

Coming soon:

What theories in political science got supported or shot down by the 2016 election? (with Bob Erikson)

A bunch of maps and graphs (with Rob Trangucci, Imad Ali, and Doug Rivers)

Reminder: Instead of “confidence interval,” let’s say “uncertainty interval”


We had a vigorous discussion the other day on confusions involving the term “confidence interval,” what does it mean to have “95% confidence,” etc. This is as good a time as any for me to remind you that I prefer the term “uncertainty interval”. The uncertainty interval tells you how much uncertainty you have. That works pretty well, I think. Also, I prefer 50% intervals. More generally, I think confidence intervals are overrated for reasons discussed here and here.
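For what it's worth, here is a minimal sketch of reading 50% and 95% uncertainty intervals off a set of simulation draws. The draws are fake (standard normal noise standing in for posterior simulations), and `central_interval` is a hypothetical helper using a crude order-statistic rule, not any particular package's method.

```python
import random


def central_interval(draws, level):
    """Central interval holding roughly `level` of the draws (simple order statistics)."""
    ordered = sorted(draws)
    n = len(ordered)
    lower = ordered[int((1 - level) / 2 * (n - 1))]
    upper = ordered[int((1 + level) / 2 * (n - 1))]
    return lower, upper


rng = random.Random(1)
# Stand-in for posterior simulations of some parameter (an assumption for illustration).
draws = [rng.gauss(0.0, 1.0) for _ in range(10000)]

interval_50 = central_interval(draws, 0.50)
interval_95 = central_interval(draws, 0.95)
print("50% interval:", interval_50)
print("95% interval:", interval_95)
```

One thing the 50% interval has going for it: it is easy to check, since the true value should fall inside about half the time.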

Happiness formulas


Jazi Zilber writes:

Have you heard of “the happiness formula”?

Lyubomirsky et al. 2005. Happiness = 0.5 genetic, 0.1 circumstances, 0.4 “intentional activity”

They took the 0.4 unexplained variance and argued it is “intentional activity”

Cited hundreds of times by everybody.

The absurd is, to you even explaining it is unneeded. For others, I do not know how to explain it.

No, I hadn’t heard of it. So I googled *happiness formula*. And what turned up was a silly-looking formula (but not the formula of Lyubomirsky et al. 2005), and some reasonable advice. For example, this from Alexandra Sifferlin in Time magazine in 2014:

Researchers at University College London were able to create an equation that could accurately predict the happiness of over 18,000 people, according to a new study.

First, the researchers had 26 participants complete decisionmaking tasks in which their choices either led to monetary gains or losses. The researchers used fMRI imaging to measure their brain activity, and asked them repeatedly, “How happy are you now?” Based on the data the researchers gathered from the first experiment, they created a model that linked self-reported happiness to recent rewards and expectations.

Here’s what the equation looks like:


Yeah, yeah, I know what you’re thinking . . . it looks like B.S., right? But, as I said, the ultimate advice seemed innocuous enough:

The researchers were not surprised by how much rewards influenced happiness, but they were surprised by how much expectations could. The researchers say their findings do support the theory that if you have low expectations, you can never be disappointed, but they also found that the positive expectations you have for something—like going to your favorite restaurant with a friend—is a large part of what develops your happiness.

Nothing as ridiculous as that formula quoted by Zilber above.

So I next googled *Lyubomirsky et al. 2005* and I found the paper Zilber was talking about, and . . . yeah, it has it all! An exploding pie chart, a couple of 3-d bar charts that would make Ed Tufte spin in his, ummm, Tufte’s still alive so I guess it would make him spin in his chair, they’re so bad. Oh, yeah, also a “longitudinal path model” with asterisks indicating low p-values. What more could you possibly desire? The whole paper made me happy, in a perverse way. By which I mean, it made me sad.

The good news, though, is that this 2005 paper does not seem so influential anymore. At least, when you google *happiness formula* it does not come up on the first page of listings. So that’s one thing we can be happy about.

Discussion on overfitting in cluster analysis


Ben Bolker wrote:

It would be fantastic if you could suggest one or two starting points for the idea that/explanation why BIC should naturally fail to identify the number of clusters correctly in the cluster-analysis context.

Bob Carpenter elaborated:

Ben is finding that using BIC to select number of mixture components is selecting too many components given the biological knowledge of what’s going on. These seem to be reasonable mixture models like HMMs for bison movement with states corresponding to transiting and foraging and resting, with the data (distance moved and turning angle) being clearly multimodal.

First (this is more to Christian): Is this to be expected if the models are misspecified and the data’s relatively small?

Second (this one more to Andrew): What do you recommend doing in terms of modeling? The ecologists are already on your page w.r.t. adding predictors (climate, time of day or year) and general hierarchical models over individuals in a population.

Number of components isn’t something we can put a prior on in Stan other than by having something like many mixture components with asymmetric priors or by faking up a Dirichlet process a la some of the BUGS examples. I’ve seen some work on mixtures of mixtures which looks cool, and gets to Andrew’s model expansion inclinations, but it’d be highly compute intensive.

X replied:

Gilles Celeux has been working for many years on the comparison between AIC, BIC and other-ICs for mixtures and other latent class models. Here is one talk he gave on the topic. With the message that BIC works reasonably well for density estimation but not for estimating the number of clusters. Here is also his most popular paper on such information criteria, including ICL.

I am rather agnostic on the use of such information criteria as they fail to account for prior information or prior opinion on what’s making two components distinct rather than identical. In that sense I feel like the problem is non-identifiable if components are not distinguishable in some respect. And as a density estimation problem, the main drawback in having many components is an increased variability. This is not a Bayesian/frequentist debate, unless prior inputs can make components make sense. And prior modelling fights against over-fitting by picking priors on the weights near zero (in the Rousseau-Mengersen 2012 sense).

And then I wrote:

I think BIC is fundamentally different from AIC, WAIC, LOO, etc, in that all those other things are estimates of out-of-sample prediction error, while BIC is some weird thing that under certain ridiculous special cases corresponds to an approximation to the log marginal probability.
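For reference, the standard textbook definitions make the contrast visible. The notation is an assumption added here, not from the original exchange: k is the number of free parameters, n the number of data points, and \hat\theta the maximum likelihood estimate. The last line is the large-sample approximation, which holds only under regularity conditions and drops terms that do not grow with n.

```latex
\begin{align*}
\mathrm{AIC} &= -2 \log p(y \mid \hat\theta) + 2k, \\
\mathrm{BIC} &= -2 \log p(y \mid \hat\theta) + k \log n, \\
\log p(y) &\approx \log p(y \mid \hat\theta) - \tfrac{k}{2} \log n = -\tfrac{1}{2}\,\mathrm{BIC}.
\end{align*}
```

The AIC penalty comes from a correction for out-of-sample predictive fit, while the k log n term in BIC comes from approximating the marginal probability of the data, so the two are answering different questions even though the formulas look alike.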

Just to continue along these lines: I think it makes more sense to speak of “choosing” the number of clusters or “setting” the number of clusters, not “estimating” the number of clusters, because the number of clusters is not in general a Platonic parameter that it would make sense to speak of estimating. I think this comment is similar to what X is saying, just in slightly different language (although both in English, pas en français).

To put it another way, what does it mean to say “too many components given the biological knowledge of what’s going on”? This depends on how “component” is defined. I don’t mean this as a picky comment: I think this is fundamental to the question. To move the discussion to an area I know more about: suppose we want to characterize voters. We could have 4 categories: Dem, Rep, Ind, Other. We could break this down more, place voters on multiple dimensions, maybe identify 12 or 15 different sorts of voters. Ultimately, though, we’re each individuals, so we could define 300 million clusters, one for each American. It seems to me that the statement “too many components” has to be defined with respect to what you will be doing with the components. To put it another way: what’s the cost to including “too many” components? Is the cost that estimates will be too noisy? If so, there is some interaction between #components and the prior being used on the parameters: one might have a prior that works well for 4 or 5 components but not so well when there are 20 or 25 components.

Actually, I can see some merit to the argument that there can just about never be more than 4 or 5 clusters, ever. My argument goes like this: if you’re talking “clusters” you’re talking about a fundamentally discrete process. But once you have more than 4 or 5, you can’t really have real discreteness; instead things slide into a continuous model.

OK, back to the practical question. Here I like the idea of using LOO (or WAIC) in that I understand what it’s doing: it’s an estimate of out-of-sample prediction error, and I can take that for what it is.

To get to the modeling question: if Ben is comfortable with a model with, say, between 3 and 6 clusters, then I think he should just fit a model with 6 clusters. Just include all 6 and let some of them be superfluous if that’s what the model and data want. One way to keep the fitting under control is to regularize a bit by putting strong priors on the mixture weights, so that mixture components 1, 2, etc, are large in expectation, and later components are smaller. You can do this with an informative Dirichlet prior on the vector of lambda parameters. I’ve never tried this but it seems to me like it could work.
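A minimal sketch of what such a prior looks like, under stated assumptions: the concentration values in `alpha` are hypothetical, picked only so that they decrease across the 6 components, and `sample_dirichlet` is a small helper written for this illustration. It simulates from the prior alone; it does not fit any mixture model.

```python
import random


def sample_dirichlet(alpha, rng):
    """One draw from Dirichlet(alpha), by normalizing independent Gamma(alpha_k, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


# Hypothetical concentrations for 6 components: decreasing, so that
# earlier components get larger weights in expectation.
alpha = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25]
prior_mean = [a / sum(alpha) for a in alpha]  # exact prior mean of each weight

rng = random.Random(0)
draws = [sample_dirichlet(alpha, rng) for _ in range(5000)]
simulated_mean = [
    sum(d[k] for d in draws) / len(draws) for k in range(len(alpha))
]

for k, (exact, sim) in enumerate(zip(prior_mean, simulated_mean), start=1):
    print(f"component {k}: prior mean {exact:.3f}, simulated {sim:.3f}")
```

Small concentrations on the later components put most of their prior mass near zero weight, so those components can sit nearly empty unless the data push otherwise.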

Also, and I assume this is being done already but I’ll mention it just in case, don’t forget to put informative priors on the parameters in each mixture component. I don’t know the details of this particular model, but, just for example, if we are fitting a mixture of normals, it’s important to constrain the variances of the normals because the model will blow up with infinite likelihood at points where any variance equals zero. The constraint can be “soft,” for example lognormal priors on scale parameters, or a hierarchical prior on the scale parameters with a proper prior on how much they vary. The same principle applies to other sorts of mixture models.
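The blow-up is easy to see numerically. Here is a minimal sketch with made-up data and a two-component mixture of normals: one component is centered exactly on a single data point and its scale is shrunk toward zero. The data values, weights, and function names are all assumptions for illustration.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and scale sigma."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def mixture_loglik(data, weights, mus, sigmas):
    """Log-likelihood of the data under a finite mixture of normals."""
    total = 0.0
    for x in data:
        terms = [
            math.log(w) + normal_logpdf(x, m, s)
            for w, m, s in zip(weights, mus, sigmas)
        ]
        top = max(terms)
        # log-sum-exp, so that tiny component densities do not underflow the sum
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


data = [-1.3, -0.4, 0.2, 0.9, 2.1]  # made-up data
scales = [1.0, 1e-2, 1e-4, 1e-8]
logliks = []
for sigma2 in scales:
    # Component 1 is a fixed unit normal; component 2 sits on data[0] with a shrinking scale.
    ll = mixture_loglik(data, [0.5, 0.5], [0.0, data[0]], [1.0, sigma2])
    logliks.append(ll)
    print(f"scale of component 2 = {sigma2:g}: log-likelihood = {ll:.2f}")
```

The log-likelihood keeps climbing as the scale shrinks, with no finite maximum, which is the degenerate behavior a proper prior on the scale parameters rules out.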

And Aki added:

If the whole distribution is multimodal it is easier to identify the number of modes and say that these correspond to clusters. Even if we have “true” clusters, but they are overlapping so that there are no separate modes, the number of clusters is not well identified unless we have a lot of information about the shape of each cluster. Example: using a mixture of Gaussians to fit Student t data -> as n goes to infinity, the number of components (clusters) goes to infinity. Depending on the amount of model misspecification and separability of clusters we may not be able to identify the number of clusters no matter which criteria we use. In simulated examples with a true small number of clusters, use of criteria that favor a small number of clusters is likely to perform well (LOO (or WAIC) is likely to favor more clusters than marginal likelihood, BIC or WBIC). In Andrew’s voters example, and in many medical examples I’ve seen, there are no clear clusters as the variation between individuals is mostly continuous or discrete in high dimensions.