
fMRI clusterf******


Several pointed me to this paper by Anders Eklund, Thomas Nichols, and Hans Knutsson, which begins:

Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.

I’m not a big fan of the whole false-positive, false-negative thing. In this particular case it makes sense because they’re actually working with null data, but ultimately what you’ll want to know is what’s happening to the estimates in the more realistic case that there are nonzero differences amidst the noise. The general message is clear, though: don’t trust FMRI p-values. And let me also point out that this is yet another case of a classical (non-Bayesian) method that is fatally assumption-based.
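The paper's specific complaint is about the spatial-autocorrelation assumptions behind cluster inference, but the basic mechanism behind inflated false-positive rates is easy to sketch: many tests per study, each at a 5% threshold, with "significance" declared if anything crosses it. Here's a minimal simulation (the test count and seed are made up for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
n_studies = 2000   # simulated null "studies" (no true signal anywhere)
n_tests = 100      # hypothetical number of independent tests per study
z_crit = 1.96      # two-sided 5% threshold for a single test

# Call a study a "false positive" if ANY uncorrected test crosses the threshold.
z = rng.standard_normal((n_studies, n_tests))
any_hit = (np.abs(z) > z_crit).any(axis=1)
print(f"per-test level: 5%, per-study false-positive rate: {any_hit.mean():.0%}")
# theory: 1 - 0.95**100, i.e. about 99% of null studies report "significance"
```

The correction methods the paper evaluates are supposed to bring that per-study rate back down to 5%; the finding is that, on real null data, they don't.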

Perhaps the most disturbing thing about this study is how unsurprising it all is. In one sense, it’s big big news: FMRI is a big part of science nowadays, and if it’s all being done wrong, that’s a problem. But, from another perspective, it’s no surprise at all: we’ve been hearing about “voodoo correlations” in FMRI for nearly a decade now, and I didn’t get much sense that the practitioners of this sort of study were doing much of anything to clean up their act. I pretty much don’t believe FMRI studies on the first try, any more than I believe “gay gene” studies or various other headline-of-the-week auto-science results.

What to do? Short-term, one can handle the problem of bad statistics by insisting on preregistered replication, thus treating traditional p-value-based studies as screening exercises. But that’s a seriously inefficient way to go: if you don’t watch out, your screening exercises are mostly noise, and then you’re wasting your effort with the first study, then again with the replication.

On the other hand, if preregistered replication becomes a requirement for an FMRI study to be taken seriously (I’m looking at you, PPNAS; I’m looking at you, Science and Nature and Cell; I’m looking at you, TED and NIH and NPR), then it won’t take long before researchers themselves realize they’re wasting their time.

The next step, once researchers learn to stop bashing their heads against the wall, will be better data collection and statistical analysis. When the motivation for spurious statistical significance goes away, there will be more motivation for serious science.

Something needs to be done, though. Right now the incentives are all wrong. Why not do a big-budget FMRI study? In many fields, this is necessary for you to be taken seriously. And it’s not like you’re spending your own money. Actually, it’s the opposite: at least within the university, when you raise money for a big-budget experiment, you’re loved, because the university makes money on the overhead. And as long as you close your eyes to the statistical problems and move so fast that you never have to see the failed replications, you can feel like a successful scientist.

The other thing that’s interesting is how this paper reflects divisions within PPNAS. On one hand you have editors such as Susan Fiske or Richard Nisbett who are deeply invested in the science-as-routine-discovery-through-p-values paradigm; on the other, you have editors such as Emery Brown (editor of this particular paper; full disclosure, I know Emery from grad school) who as a statistician has a more skeptical take and who has nothing to lose by pulling the house down.

Those guys at Harvard (but not in the statistics department!) will say, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” But they’re innumerate, and they’re wrong. Time for us to move on, time for the scientists to do more science and for the careerists to find new ways to play the game.

5 more things I learned from the 2016 election

After posting the 19 Things We Learned from the 2016 Election, I received a bunch of helpful feedback in comments and email. Here are some of the key points that I missed or presented unclearly:

Non-presidential elections

Nadia Hassan points out that my article is “so focused on the Presidential race that it misses some key pertinent downballot stuff. Straight ticket voting soared in this election in the Senate races, though not the governor’s races,” which supports explanations based on fundamentals and polarization rather than candidate-specific stories.

The Latino vote

In the “Demography is not destiny” category, I cited exit polls that showed the Latino vote dividing 66%-28% in favor of Clinton. But exit polls have a lot of problems, as Justin Gross noted in comments and as others pointed out to me by email. Gary Segura and Matt Barreto suggest that “the national exit polls interviewed few if any Latino voters in areas where many Latinos actually live.” Trump winning based on the white vote is consistent with what Yair and I found earlier this year about the electorate being whiter than observers had thought based on exit polls, as reported in a news article, “There Are More White Voters Than People Think. That’s Good News for Trump.”

Siloed news

Andy Guess writes that conventional wisdom says news is “siloed,” but the best evidence (from passive metering data), including evidence on social media, doesn’t support the idea. We have more discussion of fake news in the comments.

Shark attacks

I ragged on Chris Achen and Larry Bartels’s claim that shark attacks swing elections. But as commenter WB points out, we shouldn’t let that distract us from Achen and Bartels’s larger point that many voters are massively uninformed about politics, policy, and governing, which is relevant even if it’s not true, as they claimed, that voters are easily swung by irrelevant stimuli.

The Clinton campaign’s “ground game”

Someone who had led Obama’s ground game in a rural area of a midwestern state sent me this note:

I [my correspondent] returned there to informally assist Senator Clinton after it became apparent that she was having difficulty in that state (September 2016). It is from this background that I respectfully think you’re wrong about ground games being overrated (point 10). That is the wrong lesson.

You are correct that Democrats were supposed to have an amazing ground game. More hires. More offices. A field guy as campaign manager experienced in tight field wins (DCCC 2012; McAuliffe 2013). The problem is that Clinton never ran a ground game.

When I arrived in September/October, I was astounded to discover that the field staff had spent all their time on volunteer recruitment. This meant that they were only calling people who were already friendly to Clinton and asking those same people to come into the office to call more people friendly to Clinton. At no point during the campaign did the field staff ever ID voters or do persuasion (e.g. talk to a potentially non-friendly voter). That is a call center, it is not a ground game.

Part of the reason for this is that Brooklyn read an academic piece suggesting that voter contact more than 10 days out is worthless—a direct repudiation of the organizing model used by Obama in 2008 and 2012 when field contacted each voter 4 times between July and November. The result is that the Clinton campaign started asking people to turn out for Clinton only in the final week of the election when they began GOTV work. There was no preexisting relationship. Those calls for turning out might as well have come from a Hyderabad call center for all the good they did.

I hate to see people taking the wrong lesson from this campaign. Ground games are critical for Democrats to win. But non organizing-based ground games are worse than useless as they artificially inflate your expectations, demoralize volunteers (they want to talk to voters, not recruit more volunteers), and fail to turn out your base.

Thanks to everyone for your comments. One excellent thing about blogging is that we can revise what we write, in contrast to the David Brookses of the world who can never admit error.

“The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling”

Here’s Michael Betancourt writing in 2015:

Leveraging the coherent exploration of Hamiltonian flow, Hamiltonian Monte Carlo produces computationally efficient Monte Carlo estimators, even with respect to complex and high-dimensional target distributions. When confronted with data-intensive applications, however, the algorithm may be too expensive to implement, leaving us to consider the utility of approximations such as data subsampling. In this paper I demonstrate how data subsampling fundamentally compromises the scalability of Hamiltonian Monte Carlo.

But then here’s Jost Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter in 2016:

Despite its successes, the prototypical Bayesian optimization approach – using Gaussian process models – does not scale well to either many hyperparameters or many function evaluations. Attacking this lack of scalability and flexibility is thus one of the key challenges of the field. . . . We obtain scalability through stochastic gradient Hamiltonian Monte Carlo, whose robustness we improve via a scale adaptation. Experiments including multi-task Bayesian optimization with 21 tasks, parallel optimization of deep neural networks and deep reinforcement learning show the power and flexibility of this approach.

So now I’m not sure what to think! I guess a method can be useful even if it doesn’t quite optimize the function it’s supposed to optimize? Another twist here is that these deep network models are multimodal so you can’t really do full Bayes for them even in problems of moderate size, even before worrying about scalability. Which suggests that we should think of algorithms such as that of Springenberg et al. as approximations, and we should be doing more work on evaluating these approximations. To put it another way, when they run stochastic gradient Hamiltonian Monte Carlo, we should perhaps think of this not as a way of tracing through the posterior distribution but as a way of exploring the distribution, or some parts of it.
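Betancourt's point can be seen in miniature with a toy sketch (my own, not from either paper): run HMC without a Metropolis correction on a standard normal target, and mimic the extra noise from subsampled gradients by adding Gaussian noise to each gradient evaluation. The injected noise pumps energy into the trajectories, and the sampler's stationary variance drifts away from the target's:

```python
import numpy as np

rng = np.random.default_rng(1)

def hmc_no_accept(n_iter, grad_noise_sd, step=0.2, n_leap=20):
    """Leapfrog-only HMC on a standard normal target, no accept/reject step.
    grad_noise_sd mimics the gradient noise introduced by data subsampling."""
    def grad(x):  # gradient of -log N(0,1), plus subsampling-style noise
        return x + grad_noise_sd * rng.standard_normal()
    x, samples = 0.0, []
    for _ in range(n_iter):
        p = rng.standard_normal()        # fresh momentum each iteration
        p -= 0.5 * step * grad(x)        # leapfrog: half step, full steps, half step
        for _ in range(n_leap - 1):
            x += step * p
            p -= step * grad(x)
        x += step * p
        p -= 0.5 * step * grad(x)
        samples.append(x)
    return np.array(samples)

exact = hmc_no_accept(20000, grad_noise_sd=0.0)
noisy = hmc_no_accept(20000, grad_noise_sd=1.0)
print(f"target variance 1.00 | exact gradients: {exact.var():.2f} "
      f"| noisy gradients: {noisy.var():.2f}")
```

With exact gradients the sampled variance stays near 1; with noisy gradients it is visibly inflated. Stochastic gradient HMC methods add friction terms to compensate for exactly this effect, which is consistent with thinking of them as approximations to be evaluated rather than exact posterior samplers.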

Temple Grandin

She also belongs in the “objects of class Pauline Kael” category. Most autistic people are male, but Temple Grandin is the most famous and accomplished autistic person ever.

19 Things We Learned from the 2016 Election

OK, we can all agree that the November election result was a shocker. According to news reports, even the Trump campaign team was stunned to come up a winner.

So now seemed like a good time to go over various theories floating around in political science and political reporting and see where they stand, now that this turbulent political year is drawing to a close. By the time I was done writing it up for Slate, I came up with 19 lessons learned. I thank my colleague Bob Erikson for help on some of these.

1. The party doesn’t decide.

We can start with the primaries, which destroyed the Party Decides theory of Marty Cohen, David Karol, Hans Noel, and John Zaller, who wrote in 2008 that “unelected insiders in both major parties have effectively selected candidates long before citizens reached the ballot box.” You can’t blame authors of a book on political history–its subtitle is “Presidential Nominations Before and After Reform”–for failing to predict the future. But it does seem that the prestige of the Party Decides model was one reason that Nate Silver, Nate Cohn, Jonathan Chait, and a bunch of other pundits not named Nate or Jonathan were so quick to dismiss Donald Trump’s chances of winning in the Republican primaries.

Indeed, I myself was tempted to dismiss Trump’s chances during primary season, but then I read an article I’d written in 2011 explaining why primary elections are so difficult to predict (multiple candidates, no party cues or major ideological distinctions between them, unequal resources, unique contests, and rapidly changing circumstances), and I decided to be careful with any predictions.

2. That trick of forecasting elections using voter predictions rather than voter intentions? Doesn’t work.

Economists David Rothschild and Justin Wolfers have argued that the best way to predict the election is not to ask people whom they’ll vote for, but rather whom they think will win. Their claim was that when you ask people whom they think will win, survey respondents will be informally tallying their social networks, hence their responses will contain valuable information for forecasting. When this idea was hyped back in 2012, I was skeptical, taking the position that respondents would be doing little more than processing what they’d seen in the news media, and I remain skeptical, following a 2016 election that was a surprise to most.

3. Survey nonresponse is a thing.

It’s harder and harder to reach a representative sample of voters, and it’s been argued that much of the swing in the polls is attributable not to people changing their vote intention but to changes in who responds or doesn’t respond. In short, when there is good news about a candidate, his or her supporters are more likely to respond to polls. Doug Rivers, David Rothschild, Sharad Goel, and I floated this theory following some analysis of opinion polls from 2012, and it seems to have held up well during the recent campaign season.

The only hitch here is that the differential nonresponse story explains variation in the polls but not the level or average shift. The final polls were off by about 2 percentage points, suggesting that, even at the end, Trump supporters were responding at a lower rate than Clinton supporters.
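The mechanism is simple enough to simulate. In this sketch (all numbers hypothetical), nobody ever changes their vote intention; only the relative willingness of each side's supporters to answer the pollster changes, and yet the observed poll numbers swing by about ten points:

```python
import numpy as np

rng = np.random.default_rng(7)
true_clinton = 0.50    # fixed true support: nobody changes their mind
n_sample = 1500        # poll size

def observed_share(clinton_rr, trump_rr):
    """Clinton's share among respondents when response rates (rr) differ."""
    w_c = true_clinton * clinton_rr
    w_t = (1 - true_clinton) * trump_rr
    p = w_c / (w_c + w_t)                       # expected share in the poll
    return rng.binomial(n_sample, p) / n_sample  # one simulated poll

good_week = observed_share(clinton_rr=0.11, trump_rr=0.09)  # her side answers
bad_week = observed_share(clinton_rr=0.09, trump_rr=0.11)
print(f"good news week: {good_week:.1%}, bad news week: {bad_week:.1%}")
```

A modest gap in response rates (11% vs. 9%) is enough to move the expected poll reading from 55% to 45% with zero underlying opinion change.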

4. The election outcome was consistent with “the fundamentals.”

Various models predict the election outcome not using the polls but instead using the national economy (as measured, for example, by inflation-adjusted personal income growth during the year or two preceding the election) and various political factors. In 2016 the economy was growing slowly but not booming (a mixed signal for the voters), the incumbent party was going for a third term in office (traditionally a minus, as voters tend to support alternation), the Republicans controlled both houses of Congress (a slight benefit for the Democrats in presidential voting, for that minority of voters who prefer party balancing), and, on the left-right scale, both candidates were political centrists relative to other candidates from their parties. This information can be combined in different ways: Running a version of the model constructed by the political scientist Doug Hibbs, I gave Hillary Clinton a forecast of 52 percent of the two-party vote. Fitting a similar model but with slightly different parameters, political scientist Drew Linzer gave Clinton 49 percent. In October the political science journal PS published several articles on forecasting the election, including one from Bob Erikson and Chris Wlezien, who concluded, “the possibility of greater campaign effects than we typically observe should constrain our confidence in the predictions presented here.”

All these fundamentals-based models have uncertainties on the order of 3 percentage points, so what they really predicted is that the election would not be a landslide. The actual outcome was consistent with these predictions. That said, a wide range of outcomes–anything from 55-45 to 45-55–would’ve jibed with some of these forecasts. And the non-blowout can also be explained by countervailing factors: Perhaps Trump was so unpopular that anyone but Clinton would’ve destroyed him in the general election, and vice versa. That seems doubtful. But who knows.
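To make the "not a landslide" point concrete, here is what a normal forecast with a 52 percent mean and a 3-point standard deviation (roughly the Hibbs-style numbers above) actually implies:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sd):
    """Cumulative probability of a normal(mu, sd) distribution at x."""
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

mu, sd = 52.0, 3.0   # point forecast and uncertainty, percent of two-party vote

p_win = 1 - normal_cdf(50, mu, sd)
p_no_landslide = normal_cdf(55, mu, sd) - normal_cdf(45, mu, sd)
print(f"P(Clinton wins popular vote): {p_win:.0%}")      # about 75%
print(f"P(result between 45 and 55):  {p_no_landslide:.0%}")  # about 83%
```

So the model's confident prediction was the absence of a blowout; the winner was left genuinely uncertain.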

5. Polarization is real.

Democrats vote for Democrats, Republicans vote for Republicans. It’s always been thus—what would the party labels mean, otherwise?—but cross-party voting keeps declining, and members of the out-party hold the president in lower and lower esteem. Consider, for example, Donald Trump’s criticism of Barack Obama during the presidential debates. Obama is popular, so standing against him might seem to have been a mistake—but Obama is deeply unpopular among Republicans, especially those Republicans who are likely to vote.

A corollary of polarization is that, if there aren’t many people in the middle to be persuaded, it makes sense for candidates to focus on firing up their base, and this is a key part of the story of the success of the Trump campaign. You can bet that activists of both parties will have learned this lesson when 2020 comes along.

6. Demography is not destiny.

We’d been hearing a lot about how the Republican party, tied to a declining base of elderly white supporters, needs to reassess. For example, here’s Jamelle Bouie in Slate, under the heading, “It Lost Black Voters. Now It’s Losing Latinos. What’s Left Is a Broken, White GOP”: “The latest tracking poll from Latino Decisions shows Republican nominee Donald Trump with 16 percent support, versus 74 percent for Hillary Clinton. Looking ahead to November, the group expects that electorate to cast the vast majority of its votes for Clinton, 82 percent to 15 percent for Trump, which would be the most lopsided total in history.” According to exit polls, the Latino vote ended up dividing 66%-28%, a clear Clinton lead but nothing like the forecast from Latino Decisions—a forecast that should’ve been suspect, given that it contradicted the organization’s own polls! Longer term, it may well be that the Republican party needs to change with the times, but destiny hasn’t happened yet.

7. Public opinion does not follow elite opinion.

Perhaps the most disturbing theoretical failure of political science is the general idea that voters simply follow elite opinion. This worked in 1964 to destroy Goldwater, for instance. Or so the story goes. The implication is that voters had to be told Goldwater was scary. They could not figure it out for themselves.

In 2016, Trump was opposed vigorously as dangerous, incompetent, xenophobic, tyrannical, and unhinged, by almost everybody in elite circles: most of his Republican primary opponents at one time or another, a large number of conservative intellectuals, former Republican candidates Romney and McCain, the various Bushes, the media, almost all newspaper editorialists including those that were reliable Republican supporters, all Democrats, about 10 Republican senators, and even some pundits on Fox News. Further, Trump’s breaking of all the standard niceties of politics was there for all to see for themselves. But half the voters said, we go with this guy anyway. “The falcon no longer hears the falconer,” as W. B. Yeats put it.

8. There is an authoritarian dimension of politics.

Political scientists used to worry about authoritarianism within the electorate. Mainstream politicians, ranging from Republicans on the far right to lefties such as Sanders, tend not to go there. Trump did. In doing so he broke the rules of politics with extreme comments about his opponents, etc., that are hard to forget. But a significant segment of the electorate, maybe 20 percent, has always been waiting for its authoritarian champion on what we now call the alt-right dimension. There had not been one in the modern era. Trump’s absolute dominance of the political news for over a year signifies this uniqueness. There had been others with this sort of appeal—Joe McCarthy, George Wallace—but they never came close to becoming our national leader.

9. Swings are national.

When you look at changes from one election to the next, the country moves together. If you plot vote swings by county, or by state, you see much more uniformity in the swing in recent years than in previous decades. The swing from 2012 to 2016 was also close to uniform. There’s been lots of talk of Pennsylvania, Michigan, and Wisconsin, and these three states did make the difference in the electoral college, but similar swings happened all over the country. To put it another way, nonuniform swings were essential to Trump’s win, but looking at public opinion more broadly, the departures from a national swing were small, and consistent with the increasing nationalization of elections in recent decades.

10. The ground game was overrated.

The Democrats were supposed to be able to win a close election using their ability to target individual voters and get them out to the polls. But it didn’t happen this way. The consensus after 2016, which should’ve been the consensus earlier: Some ground game is necessary, but it’s hard to get people to turn out and vote, if they weren’t already planning to.

11. News is siloed.

For years we’ve been hearing that liberals hear one set of news, conservatives hear another, and moderates are being exposed to an incoherent mix, so that it’s difficult for anyone to make sense of what everyone else is hearing. There have always been dramatic differences of opinion (consider, for example, attitudes toward civil rights in the 1950s and the Vietnam war in the 1970s) but research on public opinion has shown an increase in partisan polarization in recent decades. The 2016 election, with its sharp divide between traditional news organizations on one side and fake news spread by Twitter and Facebook on the other, seems like the next step in this polarization.

It’s the political version of Moore’s Law, which says that every time the semiconductor manufacturers have run out of ways to squeeze more computing power on a chip, they come up with something new. Whenever it starts to seem like there’s no more room for Americans to polarize, something new comes up—in this case saturation of social media by fake news, along with a decline of the traditional TV networks and continuing distrust of the press.

12. The election wasn’t decided by shark attacks.

Political scientists Chris Achen and Larry Bartels have argued that voters are emotional and that elections can be swayed by events such as shark attacks that should logically be irrelevant to voting decisions. Others have analyzed data and claimed to find that close elections can be decided by the outcomes of college football games (with happy voters being more likely to pull the lever for the incumbent party’s candidate). Others have reanalyzed these data and found no such effect. What does 2016 say about all this? Not much.

You can’t prove a negative so it’s possible that irrelevant stimuli could have made all the difference. But the big stories about this election were that (a) lots of bad information about Donald Trump did not sway much of the electorate, and (b) Clinton’s narrow Electoral College loss may well be attributed to FBI leaks, which were relevant to the voting decision in reminding voters (perhaps inappropriately) of concerns about her governing style. The 2016 election was not about shark attacks or football games but rather about big stories that didn’t matter much, or canceled each other out.

13. Overconfident pundits get attention.

From one direction, neuroscientist Sam Wang gave Hillary Clinton a 99 percent chance of winning the election; from the other, cartoonist and jar opener Scott Adams gave 98 percent odds in favor of Trump. Looking at it one way, both Wang and Adams were correct: Clinton indisputably won the popular vote while Trump was the uncontested electoral vote winner. After the election, Wang blamed the polls, which was wrong. The polls were off by 2 percent, which from a statistical standpoint wasn’t bad; indeed, this magnitude of error was expected from a historical perspective, even if it did happen to be consequential this time. The mistake was not in the polls but in Wang’s naive interpretation of the polls, which did not account for the possibility of systematic nonsampling errors shared by the mass of pollsters, even though evidence for such errors was in the historical record. Meanwhile, Adams explains Trump’s victory as being the result of powers of persuasion, which might be so but doesn’t explain why Trump received less than half the vote, rather than the landslide that Adams had predicted.

I continue to think that polling uncertainty could best be expressed not by speculative win probabilities but rather by using the traditional estimate and margin of error. Much confusion could’ve been avoided during the campaign had Clinton’s share in the polls simply been reported as 52 percent of the two-party vote, plus or minus 2 percentage points.
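The gap between "52 plus or minus 2" and Wang's 99 percent comes down entirely to the assumed standard deviation. Treating the forecast as normal (the numbers here are hypothetical but in the right range), an uncertainty wide enough to allow for shared nonsampling error gives a modest win probability, while a sampling-error-only uncertainty manufactures near-certainty:

```python
from math import erf, sqrt

def p_win(mean, sd):
    """P(candidate > 50% of the two-party vote) under a normal forecast."""
    return 0.5 * (1 + erf((mean - 50) / (sd * sqrt(2))))

print(f"52 +/- 2 (allows systematic poll error): {p_win(52, 2.0):.0%}")  # ~84%
print(f"52 +/- 0.8 (sampling error only):        {p_win(52, 0.8):.0%}")  # ~99%
```

Same point estimate, wildly different headline probabilities, which is one reason the estimate-and-margin-of-error framing is the safer thing to report.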

There’s a theory that academics such as myself are petrified of making a mistake, hence we are overcautious in our predictions; in contrast, the media (traditional news media and modern social media) reward boldness and are forgiving of failure. This theory is supported by the experiences of Sam Wang (who showed up in the New York Times explaining the polls after the election he’d so completely biffed) and Scott Adams (who triumphantly reported that his Twitter following had reached 100,000).

14. Red state blue state is over.

Republicans have done better among rich voters than among poor voters in every election since the dawn of polling, with the only exceptions being 1952, 1956, and 1960, which featured moderate Republican Dwight Eisenhower and then moderate Democrat John Kennedy. Typically the upper third of income votes 10 to 20 percentage points more Republican than the lower third. This was such a big deal that my colleagues and I wrote a book about it! But 2016 was different. For example, here are the exit polls: Clinton won 53 percent of the under-$30,000 vote and 47 percent of those making over $100,000, a difference of only 6 percentage points, much less than the usual income gap. And we found similar minimal income-voting gradients when looking at other surveys. Will the partisan income divide return in future years? Will it disappear? It depends on where the two parties go. Next move is yours, Paul Ryan.

15. Third parties are still treading water.

The conventional wisdom is that minor parties are doomed in the U.S. electoral system. The paradox is that the only way for a minor party to have real success is to start local, but all the press comes from presidential runs. Anyway, 2016 seems to have confirmed conventional wisdom. Both major parties were highly unpopular, but all the minor parties combined got only 5.6 percent of the vote. On the other hand, 5.6 percent is a lot better than 1.7 percent (2012), 1.4 percent (2008), 1.0 percent (2004), or 3.7 percent (2000).

Glass half full is that minor parties are starting to get serious; glass half empty is that not much bloomed even in such fertile soil.

16. A working-class pundit is something to be.

Filmmaker and political activist Michael Moore gets lots of credit for writing, over a month before the election, an article entitled “5 Reasons Why Trump Will Win,” specifically pointing to the Rust Belt, angry white men, voter turnout, and other factors that everybody else was writing about after the election was over. Moore even mentioned the Electoral College. And unlike the overconfident pundits mentioned above, Moore clearly stated this as a scenario (“As of today, as things stand now, I believe this is going to happen …”) without slapping a 98 or 99 percent onto it.

What if Hillary Clinton had won 52 percent of the two-party vote and a solid Electoral College victory? Would we now be hearing from pundits with a special insight into white suburban moms? Maybe so. Or maybe we’d still be hearing about the angry white male, since 48 percent of the two-party vote would still be a lot more Trump support than most were expecting when the campaign began.

17. Beware of stories that explain too much.

After the election, which shocked the news media, the pollsters, and even the Clinton and Trump campaigns, my colleague Thomas Basboll wrote that “social science and democracy are incompatible. The social sciences conduct an undemocratic inquiry into society. Democracy is an unscientific way of governing it.”

Maybe so. But Basboll could’ve written this a few days before the election. Had the election gone as predicted, with Clinton getting the expected 52 percent of the two-party vote rather than the awkwardly distributed 51 percent that was not enough for her to win in the Electoral College, it still would’ve been true that half of American voters had refused to vote for her. So there’s something off about these sweeping election reviews: even when you agree with the sentiments, it’s not clear why it makes sense to tie them to any particular election outcome.

The Republicans have done well in political strategy, tactics, timing, and have had a bit of luck too. One party right now controls the presidency, both houses of Congress, most of the governorships, and soon the Supreme Court. But when it comes to opinions and votes, we’re a 50/50 nation. So we have to be wary of explanations of Trump’s tactical victory that explain too much.

18. Goldman Sachs rules the world.

This theory appears to still hold up. Goldman Sachs candidate Hillary Clinton managed to lose the electoral vote, but Goldman Sachs Senator Chuck Schumer may now be the most powerful Democrat in Washington, while former Goldman Sachs executive Steve Bannon will be deciding strategy inside the White House. So it looks like the banksters are doing just fine. They had things wired, no matter which way the election went.

19. The Electoral College was a ticking time bomb.


P.S. More here.

“So such markets were, and perhaps are, subject to bias from deep pocketed people who may be expressing preference more than actual expectation”


Geoff Buchan writes in with another theory about how prediction markets can go wrong:

I did want to mention one fascinating datum on Brexit: one UK bookmaker said they received about twice as many bets on leave as on remain, but the average bet on remain was *five* times what was bet on leave, meaning more than twice as much money was bet on remain.

Clearly wealthier people, most likely pro-remain, would be able to bet more, and I strongly suspect a similar bias exists in prediction markets, which, the last time I dabbled in them, had quite small open interest (think total money at stake). So such markets were, and perhaps are, subject to bias from deep-pocketed people who may be expressing preference more than actual expectation.

Here we are in December (I wrote this in July but, y’know, bloglag) so youall have probably forgotten what Brexit even is. The general point still holds, though.
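The arithmetic in Buchan's example is worth spelling out, since the two tallies point in opposite directions. With made-up totals consistent with the bookmaker's description (twice as many leave bets, remain stakes five times larger on average), two-thirds of the bettors backed leave while over 70 percent of the money backed remain:

```python
# Hypothetical stakes consistent with the bookmaker's description:
leave_bets, remain_bets = 2000, 1000   # twice as many bets on leave
avg_leave, avg_remain = 50.0, 250.0    # average remain stake is 5x the leave stake

leave_money = leave_bets * avg_leave       # 100,000 staked on leave
remain_money = remain_bets * avg_remain    # 250,000 staked on remain

print(f"bettors backing leave: {leave_bets / (leave_bets + remain_bets):.0%}")
print(f"money backing remain:  {remain_money / (leave_money + remain_money):.0%}")
```

A market price weights by money, so it tracks the second number; a head count of bettors tracks the first. If wealth correlates with preference, the two can disagree badly.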

“Dear Major Textbook Publisher”: A Rant


Dear Major Academic Publisher,

You just sent me, unsolicited, an introductory statistics textbook that is 800 pages and weighs about 5 pounds. It’s the 3rd edition of a book by someone I’ve never heard of. That’s fine—a newcomer can write a good book. The real problem is that the book is crap. It’s just the usual conventional intro stat stuff. The book even has a table of the normal distribution on the inside cover! How retro is that?

The book is bad in so many many ways, I don’t really feel like going into it. There’s nothing interesting here at all, the examples are uniformly fake, and I really can’t imagine this is a good way to teach this material to anybody. None of it makes sense, and a lot of the advice is out-and-out bad (for example, a table saying that a p-value between 0.05 and 0.10 is “moderate evidence” and that a p-value between 0.10 and 0.15 is “slight evidence”). This is not at all the worst thing I saw; I’m just mentioning it here to give a sense of the book’s horrible mixture of ignorance and sloppiness.

I could go on and on. But, again, I don’t want to do so.

I can’t blame the author, who, I’m sure, has no idea what he is doing in any case. It would be as if someone hired me to write a book about, ummm, I dunno, football. Or maybe rugby would be an even better analogy, since I don’t even know the rules to that one.

Who do I blame, then? I blame you, the publisher.

You bastards.

Out of some goal of making a buck, you inflict this pile of crap on students, charging them $200—that’s right, the list price is just about two hundred dollars—for the privilege of ingesting some material that is both boring and false.

And, the worst thing is, this isn’t even your only introductory statistics book! You publish others that are better than this one. I guess you figure there’s a market for anything. It’s free money, right?

And then you go the extra environment-destroying step of printing a copy just for me and mailing it over here, just so that I can throw it out.

Please do me a favor. Shut your business down and go into something more productive to the world. For example, you could run a three-card monte game on the street somewhere. Three-card monte, that’s still a thing, right?

Hey, I forgot to include a cat picture in my previous post!

Josh Miller fixes it for me:


Hot hand 1, WSJ 0


In a generally good book review on “uncertainty and the limits of human reason,” William Easterly writes:

Failing to process uncertainty correctly, we attach too much importance to too small a number of observations. Basketball teams believe that players suddenly have a “hot hand” after they have made a string of baskets, so you should pass them the ball. Tversky showed that the hot hand was a myth—among many small samples of shooting attempts, there will randomly be some streaks. Instead of a hot hand, there was “regression to the mean”—players fall back down to their average shooting prowess after a streak. Likewise a “cold” player will move back up to his own average.

No no no. The funny thing is:

1. As Miller and Sanjurjo explain, the mistaken belief that there is no hot hand is itself a result of people “attaching too much importance to too small a number of observations.”

2. This is not news to the Wall Street Journal! Ben Cohen reported on the hot hand over a year ago!

On the plus side, Easterly’s review did not mention himmicanes, power pose, the gay gene, the contagion of obesity, or the well-known non-finding of an increase in the death rate among middle-aged white men.

In all seriousness, the article is fine; it’s just interesting how misconceptions such as the hot hand fallacy fallacy can persist and persist and persist.

Data 1, NPR 0


Jay “should replace the Brooks brothers on the NYT op-ed page” Livingston writes:

There it was again, the panic about the narcissism of millennials as evidenced by selfies. This time it was NPR’s podcast Hidden Brain. The show’s host Shankar Vedantam chose to speak with only one researcher on the topic – psychologist Jean Twenge, whose even-handed and calm approach is clear from the titles of her books, Generation Me and The Narcissism Epidemic. . . .

What’s the evidence that so impressed National Public Radio? Livingston explains:

There are serious problems with the narcissism trope. One is that people use the word in many different ways. For the most part, we are not talking about what the DSM-IV calls Narcissistic Personality Disorder. That diagnosis fits only a relatively few (a lifetime prevalence of about 6%). For the rest, the hand-wringers use a variety of terms. Twenge, in the Hidden Brain episode, uses individualism and narcissism as though they were interchangeable. She refers to her data on the increase in “individualistic” pronouns and language, even though linguists have shown this idea to be wrong (see Mark Liberman at Language Log here and here). . . .

Then there’s the generational question. Are millennials more narcissistic than were their parents or grandparents? . . . if you’re old enough, when you read the title The Narcissism Epidemic, you heard a faint echo of a book by Christopher Lasch published thirty years earlier.

And now on to the data:

We have better evidence than book titles. Since 1975, Monitoring the Future (here) has surveyed large samples of US youth. It wasn’t designed to measure narcissism, but it does include two relevant questions:
Compared with others your age around the country, how do you rate yourself on school ability?
How intelligent do you think you are compared with others your age?
It also has self-esteem items including
I take a positive attitude towards myself
On the whole, I am satisfied with myself
I feel I do not have much to be proud of (reverse scored)
A 2008 study compared 5-year age groupings and found absolutely no increase in “egotism” (those two “compared with others” questions). The millennials surveyed in 2001-2006 were almost identical to those surveyed twenty-five years earlier. The self-esteem questions too showed little change.

Another study by Brent Roberts, et al., tracked two sources for narcissism: data from Twenge’s own studies; and data from a meta-analysis that included other research, often with larger samples. The test of narcissism in all cases was the Narcissistic Personality Inventory – 40 questions designed to tap narcissistic ideas.

Their results look like this:

[Graph: narcissism scores over time, Twenge’s sources vs. the fuller meta-analysis]

Twenge’s sources justify her conclusion that narcissism is on the rise. But include the other data and you wonder if all the fuss about kids today is a bit overblown. You might not like participation trophies or selfie sticks or Instagram, but it does not seem likely that these have created an epidemic of narcissism.

Oooh—ugly ugly ugly Excel graph. Still, Livingston has a point.

Ahhhh, NPR!

best algorithm EVER !!!!!!!!


Someone writes:

On the website you find a lot of material for Optimal (or “optimizing”) Data Analysis (ODA) which is described as:

In the Optimal (or “optimizing”) Data Analysis (ODA) statistical paradigm, an optimization algorithm is first utilized to identify the model that explicitly maximizes predictive accuracy for the sample, and then the resulting optimal performance is evaluated in the context of an application-specific exact statistical architecture. Discovered in 1990, the first and most basic ODA model was a distribution-free machine learning algorithm used to make maximum accuracy classifications of observations into one of two categories (pass or fail) on the basis of their score on an ordered attribute (test score). When the first book on ODA was written in 2004 a cornucopia of indisputable evidence had already amassed demonstrating that statistical models identified by ODA were more flexible, transparent, intuitive, accurate, parsimonious, and generalizable than competing models instead identified using an unintegrated menagerie of legacy statistical methods. Understanding of ODA methodology skyrocketed over the next decade, and 2014 produced the development of novometric theory – the conceptual analogue of quantum mechanics for the statistical analysis of classical data. Maximizing Predictive Accuracy was written as a means of organizing and making sense of all that has so-far been learned about ODA, through November of 2015.

I found a paper in which a comparison of several machine learning algorithms reveals that a classification tree analysis based on the ODA approach delivers the best classification results (compared to binary regression, random forest, SVM, etc.).

So far, based on the given information, it sounds pretty appealing. Do you see any pitfalls? Would you recommend it for use in data analysis when I want to achieve accurate predictions?

My reply: I have no idea. It seems like a lot of hype to me: “discovered . . . cornucopia . . . menagerie . . . skyrocketed . . . novometric theory . . . conceptual analogue of quantum mechanics.”

But, hey, something can be hyped and still be useful, so who knows? I’ll leave it for others to make their judgments on this one.

Using Stan in an agent-based model: Simulation suggests that a market could be useful for building public consensus on climate change


Jonathan Gilligan writes:

I’m writing to let you know about a preprint that uses Stan in what I think is a novel manner: Two graduate students and I developed an agent-based simulation of a prediction market for climate, in which traders buy and sell securities that are essentially bets on what the global average temperature will be at some future time. We use Stan as part of the model: at every time step, simulated traders acquire new information and use this information to update their statistical models of climate processes and generate predictions about the future.

J.J. Nay, M. Van der Linden, and J.M. Gilligan, Betting and Belief: Prediction Markets and Attribution of Climate Change (code here).

ABSTRACT: Despite much scientific evidence, a large fraction of the American public doubts that greenhouse gases are causing global warming. We present a simulation model as a computational test-bed for climate prediction markets. Traders adapt their beliefs about future temperatures based on the profits of other traders in their social network. We simulate two alternative climate futures, in which global temperatures are primarily driven either by carbon dioxide or by solar irradiance. These represent, respectively, the scientific consensus and a hypothesis advanced by prominent skeptics. We conduct sensitivity analyses to determine how a variety of factors describing both the market and the physical climate may affect traders’ beliefs about the cause of global climate change. Market participation causes most traders to converge quickly toward believing the “true” climate model, suggesting that a climate market could be useful for building public consensus.

Our simulated traders treat the global temperature as a linear function of a forcing term (either the logarithm of the atmospheric carbon dioxide concentration or the total solar irradiance) plus an auto-correlated noise process. Each trader has an individual belief about the cause of climate change, and uses the corresponding forcing term. At each time step, the simulated traders use past temperatures to fit parameters for their time-series models, use these models to extrapolate probability distributions for future temperatures, and use these probability distributions to place bets (buy and sell securities).
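A trader’s update step, as described above, can be sketched in a few lines. This is not the authors’ code (they fit the noise process too, in Stan, and draw parameters from the joint posterior); it’s a minimal illustration in Python in which the regression is ordinary least squares, the AR(1) coefficient is fixed, and the forcing extrapolation is a made-up linear continuation:

```python
import numpy as np

def trader_predictive_draws(temps, forcing, horizon=10, n_draws=1000,
                            rho=0.5, rng=None):
    """Sketch of one trader's step: fit temperature as a linear function
    of a forcing term, then simulate future paths with AR(1) noise.
    (In the paper the noise parameters are estimated via Stan; here rho
    is fixed for illustration.)"""
    rng = np.random.default_rng(rng)
    # Fit temp = beta0 + beta1 * forcing by least squares.
    X = np.column_stack([np.ones_like(forcing), forcing])
    beta, *_ = np.linalg.lstsq(X, temps, rcond=None)
    resid = temps - X @ beta
    sigma = resid.std(ddof=2)
    # Extrapolate the forcing linearly (a stand-in for a real scenario).
    step = forcing[-1] - forcing[-2]
    future_forcing = forcing[-1] + step * np.arange(1, horizon + 1)
    draws = np.empty((n_draws, horizon))
    for i in range(n_draws):
        eps = resid[-1]
        for t in range(horizon):
            # Innovation scaled so the stationary sd matches the residual sd.
            eps = rho * eps + rng.normal(0.0, sigma * np.sqrt(1 - rho**2))
            draws[i, t] = beta[0] + beta[1] * future_forcing[t] + eps
    return draws  # each row is one simulated future temperature path
```

Quantiles of these draws are what a trader would use to price the securities (bets on future temperature falling in a given range).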

Gilligan continues:

We developed our agent-based model in R. At first, we used the well-known nlme package to fit generalized least-squares models of global temperature with ARMA noise, but this was both very slow and unstable: many model runs failed with cryptic and poorly documented error messages from nlme.

Then we tried coding the time series model in Stan. The excellent manual and helpful advice from the Stan users mailing list allowed us to quickly write and debug a time-series model. To our great surprise, the full Bayesian analysis with Stan was much faster than nlme. Moreover, the generated quantities block in a Stan program makes it easy for our agents to generate predicted probability distributions for future temperatures by sampling model parameters from the joint posterior distribution and then simulating a stochastic ARMA noise process.

Fitting the time-series models at each time step is the big bottleneck in our simulation, so the speedup we achieved in moving to Stan helped a lot. This made it much easier to debug and test the model and also to perform a sensitivity analysis that required 5000 simulation runs, each of which called Stan more than 160 times, sampling 4 chains for 800 iterations each. Stan’s design—one slow compilation step that produces a very fast sampler, which can be called over and over—is ideally suited to this project.


Gilligan concludes:

We would like to thank you and the Stan team, not just for writing such a powerful tool, but also for supporting it so well with superb documentation, examples, and the Stan-users email list.

You’re welcome!

Mighty oaks from little acorns grow


Eric Loken writes:

Do you by any chance remember the bogus survey that Augusta National carried out in 2002 to deflect criticism about not having any female members? I even remember this survey being ridiculed by ESPN who said their polls showed much more support for a boycott and sympathy with Martha Burke.

Anyway, sure that’s a long time ago. But I’ve often mentioned this survey in my measurement classes over the years. Guess who was the architect of that survey?

Boy oh boy . . . I didn’t know how long she’d been at it.

I’ve been searching everywhere for the text of the survey. In one news story she said, “If I thought the survey was slanted why would I have insisted that the sponsor release the entire list of questions?” At one point I had it somewhere . . . but maybe not electronic. After all it was 2002!

There was a piece in the Guardian that listed even more of the questions and some very severe British criticism. I had found that article this afternoon but now I can’t find it again.

Anyway, the Tribune piece gives the general idea.

This somehow reminds me of President Bloomberg‘s pollster, Doug Schoen.

Frustration with published results that can’t be reproduced, and journals that don’t seem to care


Thomas Heister writes:

Your recent post about Per Pettersson-Lidbom’s frustrations in reproducing study results reminded me of our own recent experience replicating a paper in PLOSone. We found numerous substantial errors but eventually gave up as, frustratingly, the time and effort didn’t seem to change anything and the journal’s editors quite obviously regarded our concerns as a mere annoyance.

We initially stumbled across this study by Collignon et al (2015) that explains antibiotic resistance rates by country level corruption levels as it raised red flags for an omitted variable bias (it’s at least not immediately intuitive to us how corruption causes resistance in bacteria). It wasn’t exactly a high-impact sort of study which a whole lot of people will read/cite but we thought we’d look at it anyway as it seemed relevant for our field. As the authors provided their data we tried to reproduce their findings and actually found a whole lot of simple but substantial errors in their statistical analysis and data coding that led to false findings. We wrote a detailed analysis of the errors and informed the editorial office, as PLOSone only has an online comment tool but doesn’t accept letters. The apparent neglect of the concerns raised (see email correspondence below) led us to finally publish our letter as an online comment at PLOSone. The authors’ responses are quite lengthy but do in essence only touch on some of the things we criticize and entirely neglect some of our most important points. Frustratingly, we finally got an answer from PLOSone (see below) that the editors were happy with the authors’ reply and didn’t consider further action. This is remarkable considering that the main explanatory variable is completely useless as can be very easily seen in our re-analysis of the dataset (see table 1).

Maybe our experience is just an example of the issues with Open-Access journals, maybe of the problem of journals generally not accepting letters, or maybe just that a lot of journals still see replications and criticism of published studies as an attack on the journal’s scientific standing. Sure, this paper will probably not have a huge impact, but false findings like these might easily slip into the “what has been shown on this topic” citation loop in the introduction parts.

I would be very interested to hear your opinion on this topic with respect to PLOS journals, its “we’re not looking at the contribution of a paper, only whether it’s methodologically sound” policy and open access.

My reply: We have to think of the responsibility as being the authors’, not the journals’. Journals just don’t have the resources to adjudicate this sort of dispute.

So little information to evaluate effects of dietary choices


Paul Alper points to this excellent news article by Aaron Carroll, who tells us how little information is available in studies of diet and public health. Here’s Carroll:

Just a few weeks ago, a study was published in the Journal of Nutrition that many reports in the news media said proved that honey was no better than sugar as a sweetener, and that high-fructose corn syrup was no worse. . . .

Not so fast. A more careful reading of this research would note its methods. The study involved only 55 people, and they were followed for only two weeks on each of the three sweeteners. . . . The truth is that research like this is the norm, not the exception. . . .

Readers often ask me how myths about nutrition get perpetuated and why it’s not possible to do conclusive studies to answer questions about the benefits and harms of what we eat and drink.

Good question. Why is it that supposedly evidence-based health recommendations keep changing?

Carroll continues:

Almost everything we “know” is based on small, flawed studies. . . . This is true not only of the newer work that we see, but also the older research that forms the basis for much of what we already believe to be true. . . .

The honey study is a good example of how research can become misinterpreted. . . . A 2011 systematic review of studies looking at the effects of artificial sweeteners on clinical outcomes identified 53 randomized controlled trials. That sounds like a lot. Unfortunately, only 13 of them lasted for more than a week and involved at least 10 participants. Ten of those 13 trials had a Jadad score — which is a scale from 0 (minimum) to 5 (maximum) to rate the quality of randomized control trials — of 1. This means they were of rather low quality. None of the trials adequately concealed which sweetener participants were receiving. The longest trial was 10 weeks in length.

According to Carroll, that’s it:

This is the sum total of evidence available to us. These are the trials that allow articles, books, television programs and magazines to declare that “honey is healthy” or that “high fructose corn syrup is harmful.” This review didn’t even find the latter to be the case. . . .

My point is not to criticize research on sweeteners. This is the state of nutrition research in general. . . .

I just have one criticism. Carroll writes:

The outcomes people care about most — death and major disease — are actually pretty rare.

Death isn’t so rare. Everyone dies! Something like 1/80 of the population dies every year. The challenge is connecting the death to a possible cause such as diet.

Carroll also talks about the expense and difficulty of doing large controlled studies. Which suggests to me that we should be able to do better in our observational research. I don’t know exactly how to do it, but there should be some useful bridge between available data, on one hand, and experiments with N=55, on the other.

P.S. I followed a link to another post by Carroll which includes this crisp graph:

[Graph from Carroll’s post]

Some U.S. demographic data at zipcode level conveniently in R

Ari Lamstein writes:

I chuckled when I read your recent “R Sucks” post. Some of the comments were a bit … heated … so I thought to send you an email instead.

I agree with your point that some of the datasets in R are not particularly relevant. The way that I’ve addressed that is by adding more interesting datasets to my packages. For an example of this you can see my blog post choroplethr v3.1.0: Better Summary Demographic Data. By typing just a few characters you can now view eight demographic statistics (race, income, etc.) of each state, county and zip code in the US. Additionally, mapping the data is trivial.

I haven’t tried this myself, but assuming it works . . . that’s great to be able to make maps of American Community Survey data at the zipcode level!

Survey weighting and that 2% swing


Nate Silver agrees with me that much of that shocking 2% swing can be explained by systematic differences between sample and population: survey respondents included too many Clinton supporters, even after corrections from existing survey adjustments.

In Nate’s words, “Pollsters Probably Didn’t Talk To Enough White Voters Without College Degrees.” Last time we looked carefully at this, my colleagues and I found that pollsters weighted for sex x ethnicity and age x education, but not by ethnicity x education.

I could see that this could be an issue. It goes like this: Surveys typically undersample less-educated people, I think even relative to their proportion of voters. So you need to upweight the less-educated respondents. But less-educated respondents are more likely to be African Americans and Latinos, so this will cause you to upweight these minority groups. Once you’re through with the weighting (whether you do it via Mister P or classical raking or Bayesian Mister P), you’ll end up matching your target population on ethnicity and education, but not on their interaction, so you could end up with too few less-educated white voters.
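The mechanism described above can be seen in a toy example. Here is a minimal Python sketch of classical raking (iterative proportional fitting) on a made-up 2×2 cross-tab of ethnicity by education; every number is invented for illustration. The raked weights match both population margins exactly, yet the white-without-degree cell stays too small, because raking preserves the sample’s association between the two variables:

```python
import numpy as np

def rake(sample_counts, row_targets, col_targets, n_iter=100):
    """Iterative proportional fitting: reweight a sample cross-tab
    (rows = ethnicity, cols = education) to match the population's
    row and column margins -- but not the joint distribution."""
    w = sample_counts.astype(float)
    for _ in range(n_iter):
        w *= (row_targets / w.sum(axis=1))[:, None]  # match row margin
        w *= col_targets / w.sum(axis=0)             # match column margin
    return w

# Toy numbers: rows = white / nonwhite, cols = no degree / degree.
sample = np.array([[200., 400.],   # white respondents
                   [200., 200.]])  # nonwhite respondents
pop_rows = np.array([700., 300.])  # population ethnicity margin
pop_cols = np.array([600., 400.])  # population education margin
# Suppose the true population cross-tab is:
pop = np.array([[480., 220.],
                [120., 180.]])

w = rake(sample, pop_rows, pop_cols)
# Margins now match the population...
assert np.allclose(w.sum(axis=1), pop_rows)
assert np.allclose(w.sum(axis=0), pop_cols)
# ...but the white-no-degree cell is still well below its population count.
print(w[0, 0], "vs population", pop[0, 0])
```

Adding the interaction (i.e., raking on the full ethnicity × education cross-tab, or poststratifying on the joint cells) is what closes this gap, which is exactly the fix being discussed.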

There’s also the gender gap: you want the right number of less-educated white male and female voters in each category. In particular, we found that in 2016 the gender gap increased with education, so if your sample gets some of these interactions wrong, you could be biased.

Also a minor thing: Back in the 1990s the ethnicity categories were just white / other and there were 4 education categories: no HS / HS / some college / college grad. Now we use 4 ethnicity categories (white / black / hisp / other) and 5 education categories (splitting college grad into college grad / postgraduate degree). Still just 2 sexes though. For age, I think the standard is 18-29, 30-44, 45-64, and 65+. But given how strongly nonresponse rates vary by age, it could make sense to use more age categories in your adjustment.
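Taken together, the categories above define a poststratification grid; a quick count shows how many adjustment cells they imply (this is just arithmetic on the category lists named in the paragraph, not anyone’s actual adjustment scheme):

```python
# Category lists as given above (current standard, not the 1990s version).
ethnicity = ["white", "black", "hisp", "other"]
education = ["no HS", "HS", "some college", "college grad", "postgraduate"]
sex = ["male", "female"]
age = ["18-29", "30-44", "45-64", "65+"]

cells = len(ethnicity) * len(education) * len(sex) * len(age)
print(cells)  # 160 cells whose weights must be estimated or smoothed
```

With this many cells (and more if you split age more finely), raw cell weights get noisy, which is one argument for model-based smoothing such as Mister P over pure cell-by-cell raking.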

Anyway, Nate’s headline makes sense to me. One thing surprises me, though. He writes, “most pollsters apply demographic weighting by race, age and gender to try to compensate for this problem. It’s less common (although by no means unheard of) to weight by education, however.” Back when we looked at this, a bit over 20 years ago, we found that some pollsters didn’t weight at all, some weighted only on sex, and some weighted on sex x ethnicity and age x education. The surveys that did very little weighting relied on the design to get a more representative sample, either using quota sampling or using tricks such as asking for the youngest male adult in the household.

Also, Nate writes, “the polls may not have reached enough non-college voters. It’s a bit less clear whether this is a longstanding problem or something particular to the 2016 campaign.” All the surveys I’ve seen (except for our Xbox poll!) have massively underrepresented young people, and this has gone back for decades. So no way it’s just 2016! That’s why survey organizations adjust for age. There’s always a challenge, though, in knowing what distribution to adjust to, as we don’t know turnout until after the election—and not even then, given all the problems with exit polls.

P.S. The funny thing is, back in September, Sam Corbett-Davies, David Rothschild, and I analyzed some data from a Florida poll and came up with the estimate that Trump was up by 1 in that state. This was a poll where the other groups analyzing the data estimated Clinton up by 1, 3, or 4 points. So, back then, our estimate was that a proper adjustment (in this case, using party registration, which we were able to do because this poll sampled from voter registration lists) would shift the polls by something like 2% (that is, 4% in the differential between the two candidates). But we didn’t really do anything with this. I can’t speak for Sam or David, but I just figured this was just one poll and I didn’t take it so seriously.

In retrospect maybe I should’ve thought more about the idea that mainstream pollsters weren’t adjusting their numbers enough. And in retrospect Nate should’ve thought of that too! Our analysis was no secret; it appeared in the New York Times. So Nate and I were both guilty of taking the easy way out and looking at poll aggregates and not doing the work to get inside the polls. We’re doing that now, in December, but we should’ve been doing it in October. Instead of obsessing about details of poll aggregation, we should’ve been working more closely with the raw data.

P.P.S. Could someone please forward this email to Nate? I don’t think he’s getting my emails any more!

How can you evaluate a research paper?


Shea Levy writes:

You ended a post from last month [i.e., Feb.] with the injunction to not take the fact of a paper’s publication or citation status as meaning anything, and instead that we should “read each paper on its own.” Unfortunately, while I can usually follow e.g. the criticisms of a paper you might post, I’m not confident in my ability to independently assess arbitrary new papers I find. Assuming, say, a semester of a biological sciences-focused undergrad stats course and a general willingness and ability to pick up any additional stats theory or practice, what should someone in the relevant fields do to get to the point where they can meaningfully evaluate each paper they come across?

My reply: That’s a tough one. My own view of research papers has become much more skeptical over the years. For example, I devoted several posts to the Dennis-the-Dentist paper without expressing any skepticism at all—and then Uri Simonsohn comes along and shoots it down. So it’s hard to know what to say. I mean, even as of 2007, I think I had a pretty good understanding of statistics and social science. And look at all the savvy people who got sucked into that Bem ESP thing—not that they thought Bem had proved ESP, but many people didn’t realize how bad that paper was, just on statistical grounds.

So what to do to independently assess new papers?

I think you have to go Bayesian. And by that I don’t mean you should be assessing your prior probability that the null hypothesis is true. I mean that you have to think about effect sizes, on one side, and about measurement, on the other.

It’s not always easy. For example, I found the claimed effect sizes for the Dennis/Dentist paper to be reasonable (indeed, I posted specifically on the topic). For that paper, the problem was in the measurement, or one might say the likelihood: the mapping from underlying quantity of interest to data.

Other times we get external information, such as the failed replications in ovulation-and-clothing, or power pose, or embodied cognition. But we should be able to do better, as all these papers had major problems which were apparent, even before the failed reps.

One cue which we’ve discussed a lot: if a paper’s claim relies on p-values, and they have lots of forking paths, you might just have to set the whole paper aside.

Medical research: I’ve heard there’s lots of cheating, lots of excluding patients who are doing well under the control condition, lots of ways to get people out of the study, lots of playing around with endpoints.

The trouble is, this is all just a guide to skepticism. But I’m not skeptical about everything.

And the solution can’t be to ask Gelman. There’s only one of me to go around! (Or two, if you count my sister.) And I make mistakes too!

So I’m not sure. I’ll throw the question to the commentariat. What do you say?

An exciting new entry in the “clueless graphs from clueless rich guys” competition


Jeff Lax points to this post from Matt Novak linking to a post by Matt Taibbi that shares the above graph from newspaper columnist / rich guy Thomas Friedman.

I’m not one to spend precious blog space mocking bad graphs, so I’ll refer you to Novak and Taibbi for the details.

One thing I do want to point out, though, is that this is not necessarily the worst graph promulgated recently by a zillionaire. Let’s never forget this beauty which was being spread on social media by wealthy human Peter Diamandis:


Interesting epi paper using Stan

Jon Zelner writes:

Just thought I’d send along this paper by Justin Lessler et al. Thought it was both clever & useful and a nice ad for using Stan for epidemiological work.

Basically, what this paper is about is estimating the true prevalence and case fatality ratio of MERS-CoV [Middle East Respiratory Syndrome Coronavirus Infection] using data collected via a mix of passive and active surveillance, which if treated naively will result in an overestimate of case fatality and underestimate of burden b/c only the most severe cases are caught via passive surveillance. All of the interesting modeling details are in the supplementary information.