Euro 2016 update

Big news out of Europe, everyone’s talking about soccer.

Leo Egidi updated his model and now has predictions for the Round of 16:

[Screenshot: predicted probabilities for the Round of 16 matches]

Here’s Leo’s report, and here’s his zipfile with data and Stan code.

The report contains some ugly histograms showing the predictive distributions of goals to be scored in each game. The R histogram function FAILS with discrete data because it puts the bin boundaries at 0, 1, 2, etc. Or, in this case, 0, .5, 1, 1.5, etc., which is even worse because now the y-axis is hard to interpret as the frequencies all got multiplied by 2. When data are integers, you want the boundaries at -.5, .5, 1.5, 2.5, etc. Or use barplot(). Really, though, you want scatterplots because the teams are playing against each other. You’ll want heatmaps, actually: scatterplots don’t work so well with discrete data.
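To illustrate the bin-boundary point, here's a sketch in Python/NumPy rather than R, with made-up goal counts (the data are hypothetical, just to show the two binning choices):

```python
import numpy as np

goals = np.array([0, 0, 1, 1, 1, 2, 2, 3, 5])  # toy goal counts for one team

# Bad: boundaries on the integers themselves. Values sitting exactly on a
# bin edge get assigned by an arbitrary open/closed convention, and the
# last bin silently merges two integer values.
bad_counts, bad_edges = np.histogram(goals, bins=[0, 1, 2, 3, 4, 5])

# Good: boundaries at the half-integers, so each integer owns one bin.
edges = np.arange(-0.5, goals.max() + 1.5, 1.0)  # -0.5, 0.5, ..., 5.5
good_counts, _ = np.histogram(goals, bins=edges)

print(good_counts)  # one count per integer value 0..5: [2 3 2 1 0 1]
```

With the half-integer boundaries the bar heights are simply the frequencies of 0, 1, 2, ... goals, which is exactly what a barplot would give you.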

What they’re saying about “blended learning”: “Perhaps the most reasonable explanation is that no one watched the video or did the textbook reading . . .”


Someone writes in:

I was wondering if you had a chance to see the commentary by the Stockwells on blended learning strategies that was recently published in Cell and which also received quite a nice write up by Columbia. It’s also currently featured on Columbia’s webpage.

In fact, I was a student in Prof. Stockwell’s Biochemistry class last year, and a participant in this study, which was why I was so surprised that it ended up in Cell and received the attention that it did.

I was part of the textbook group, for which he assigned over 30 pages of dense textbook reading (which would probably have taken multiple hours to fully digest, and was more than 2-3 times more than what he’d assign for a typical class), so I’m sure the video was much more tailored to the material he covered in class and ultimately quizzed everyone on. Moreover, in his interview Stockwell claims that he’ll “use video lectures and assign them in advance,” rather than relying exclusively on a textbook, yet it was surprising that in their commentary they write:

We also compared the exam scores of students in the textbook versus video preparation groups but found no statistically significant difference in this relatively modest sample size, despite the trend toward higher scores in the group that received the video assignment.

Perhaps the most reasonable explanation is that no one watched the video or did the textbook reading for a class that wasn’t going to be covered on any of the exams? What’s even more confusing to me is that they admit the sample sizes of the textbook/video groups were “modest,” but are readily able to draw conclusions about which of the 4 arms provides the most effective model for learning, when each arm had half as many participants as these two larger groups! I’m not sure if that’s just confirmation bias, or if the results are truly significant? I’m also not sure if the figure in the paper is mislabeled, since Groups 2 and 3 in panel A are different from what’s used in panel D (see above).

Do you have any thoughts on the statistical power of such a study?

I know the above seems a little bit like I have an axe to grind, but it seemed to me like the conclusions of this experiment were quite reaching, especially for such a short study with so few participants, and I was wondering what someone else with more expertise on experimental design than I have thought.

I had not heard about this study and don’t really have the time to look at it, but I’m posting it here in case any of you have any comments.

As to why Cell chose to publish it: This seems clear enough. Everybody knows that teaching is important and it’s hard to get students to learn, we try lots of teaching strategies but there are not a lot of controlled trials of teaching methods, so when there is such a study, and when it gives positive results with that magic “p less than .05,” then, yeah, I’m not surprised it gets published in a top journal.

Brexit polling: What went wrong?

Commenter numeric writes:

Since you were shilling for yougov the other day you might want to talk about their big miss on Brexit (off by 6% from their eve-of-election poll—remain up 2 on their last poll and leave up by 4 as of this posting).

Fair enough: Had Yougov done well, I could use them as an example of the success of MRP, and political polling more generally, so I should take the hit when they fail. It looks like Yougov was off by about 4 percentage points (or 8 percentage points if you want to measure things by vote differential). It will be interesting to see how much this difference was nonuniform across demographic groups.

The difference between survey and election outcome can be broken down into five terms:

1. Survey respondents not being a representative sample of potential voters (for whatever reason, Remain voters being more reachable or more likely to respond to the poll, compared to Leave voters);

2. Survey responses being a poor measure of voting intentions (people saying Remain or Undecided even though it was likely they’d vote to leave);

3. Shift in attitudes during the last day;

4. Unpredicted patterns of voter turnout, with more voting than expected in areas and groups that were supporting Leave, and lower-than-expected turnout among Remain supporters.

5. And, of course, sampling variability. Here’s Yougov’s rolling average estimate from a couple days before the election:


Added in response to comments: And here’s their final result, “YouGov on the day poll: Remain 52%, Leave 48%”:


We’ll take this final 52-48 poll as Yougov’s estimate.

Each one of the above five explanations seems reasonable to consider as part of the story. Remember, we’re not trying to determine which of 1, 2, 3, 4, or 5 is “the” explanation; rather, we’re assuming that all five of these are happening. (Indeed, some of these could be happening but in the opposite direction; for example, it’s possible that the polls oversampled Remain voters (a minus sign on item 1 above) but that this non-representativeness was more than overbalanced by a big shift in attitudes during the last day (a big plus sign on item 3).)

The other thing is that item 5, sampling variability, does not stand on its own. Given the amount of polling on this issue (even within Yougov itself, as indicated by the graph above), sampling variability is an issue to the extent that items 1-4 above are problems. If there were no problems with representativeness, measurement, changes in attitudes, and turnout predictions, then the total sample size of all these polls would be enough that they’d predict the election outcome almost perfectly. But given all these other sources of uncertainty and variation, you need to worry about sampling variability too, to the extent that you’re using the latest poll to estimate the latest trends.

OK, with that as background, what does Yougov say? I went to their website and found this article posted a few hours ago:

Unexpectedly high turnout in Leave areas pushed the campaign to victory

Unfortunately YouGov was four points out in its final poll last night, but we should not be surprised that the referendum was close – we have shown it close all along. In over half our polls since the start of the year we showed Brexit in the lead or tied. . . .

As we wrote in the Times newspaper three days ago: “This campaign is not a “done deal”. The way the financial and betting markets have reacted you would think Remain had already won – yesterday’s one day rally in the pound was the biggest for seven years, and the odds of Brexit on Betfair hit 5-1. But it’s hard to justify those odds using the actual data…. The evidence suggests that we are in the final stages of a genuinely close and dynamic race.”

Just to check, what did Yougov say about this all before the election? Here’s their post from the other day, which I got by following the links from my post linked above:

Our current headline estimate of the result of the referendum is that Leave will win 51 per cent of the vote. This is close enough that we cannot be very confident of the election result: the model puts a 95% chance of a result between 48 and 53, although this only captures some forms of uncertainty.

The following three paragraphs are new, in response to comments, and replace one paragraph I had before:

OK, let’s do a quick calculation. Take their final estimate that Remain will win with 52% of the vote and give it a 95% interval with width 6 percentage points (a bit wider than the 5-percentage-point width reported above, but given that big swing, presumably we should increase the uncertainty a bit). So the interval is [49%, 55%], and if we want to call this a normal distribution with mean 52% and standard deviation 1.5%, then the probability of Remain under this model would be pnorm(52, 50, 1.5) = .91, that is, 10-1 odds in favor. So, when Yougov said the other day that “it’s hard to justify those [Betfair] odds” of 5-1, it appears that they (Yougov) would’ve been happy to give 10-1 odds.

But these odds are very sensitive to the point estimate (for example, pnorm(51.5, 50, 1.5) = .84, which gives you those 5-1 odds), to the forecast uncertainty (for example, pnorm(52, 50, 2.5) = .79), and to any smoothing you might do (for example, take a moving average of the final few days and you get something not far from 50/50).
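Those pnorm calls are R; the same arithmetic can be checked with nothing but the error function. A quick sketch in Python (standard library only):

```python
from math import erf, sqrt

def pnorm(x, mean, sd):
    """Normal CDF, the analogue of R's pnorm(x, mean, sd)."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

# Probability that Remain wins (vote share > 50%) under three nearby models:
print(round(pnorm(52.0, 50, 1.5), 2))  # 0.91 -> roughly 10-1 odds for Remain
print(round(pnorm(51.5, 50, 1.5), 2))  # 0.84 -> roughly 5-1 odds
print(round(pnorm(52.0, 50, 2.5), 2))  # 0.79 -> roughly 4-1 odds
```

Half a point on the point estimate, or one point on the forecast standard deviation, moves the implied odds by a factor of two: that is the sensitivity being described.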

In short, betting odds in this setting are highly sensitive to small changes in the model, and when the betting odds stay stable (as I think they were during the final period of Brexit), this suggests they contain a large element of convention or arbitrary mutual agreement.

The “out” here seems to be that last part of Yougov’s statement from the other day: “although this only captures some forms of uncertainty.”

It’s hard to know how to think about other forms of uncertainty, and I think that one way that people handle this in practice is to present 95% intervals and treat them as something more like 50% intervals.

Think about it. If you want to take the 95% interval as a Bayesian predictive interval—and Yougov does use Bayesian inference—then you’d be concluding that the odds were 40-1 against the outcome falling below the lower endpoint of the interval, that is, 40-1 that Remain would get more than 48% of the vote. That’s pretty strong. But that would not be an appropriate conclusion to draw, not if you remember that this interval “only captures some forms of uncertainty.” So you can mentally adjust the interval, either by making it wider to account for these other sources of uncertainty, or by mentally lowering its probability coverage. I argue that in practice people do the latter: they take 95% intervals as statements of uncertainty, without really believing the 95% part.
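To make the "treat a 95% interval like a 50% interval" idea concrete, here's a quick sketch using Python's standard library and the [49%, 55%] interval from the calculation above:

```python
from statistics import NormalDist

half_width = 3.0  # the [49%, 55%] interval around the 52% point estimate

# Implied standard deviation if that +/-3 interval has 95% vs. 50% coverage:
sd_if_95 = half_width / NormalDist().inv_cdf(0.975)  # ~1.53
sd_if_50 = half_width / NormalDist().inv_cdf(0.75)   # ~4.45

# Tail probability of falling below the lower endpoint under the 95% reading:
p_below = NormalDist(52, sd_if_95).cdf(49)  # 0.025 -> about 40-1 against

print(round(sd_if_95, 2), round(sd_if_50, 2), round(1 / p_below - 1))
```

Reading the same interval as a 50% interval nearly triples the implied standard deviation, which is one way to fold in the "forms of uncertainty" the model doesn't capture.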

OK, fine, but if that’s right, then did the betting markets appear to be taking Yougov’s uncertainties literally with those 5-1 odds? There I’m guessing the problem was . . . other polls. Yougov was saying 51% for Leave, or maybe 52% for Remain, but other polls were showing large leads for Remain. If all the polls had looked like Yougov, and had bettors been rational about accounting for nonsampling error, we might have seen something like 3-1 or 2-1 odds in favor, which would’ve been more reasonable (in a prospective sense, given Yougov’s pre-election polling results and our general knowledge that nonsampling error can be a big deal).

Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, and I recently wrote a paper estimating the level of nonsampling error in U.S. election polls, and here’s what we found:

It is well known among both researchers and practitioners that election polls suffer from a variety of sampling and non-sampling errors, often collectively referred to as total survey error. However, reported margins of error typically only capture sampling variability, and in particular, generally ignore errors in defining the target population (e.g., errors due to uncertainty in who will vote). Here we empirically analyze 4,221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing to the actual election outcomes, we find that average survey error as measured by root mean squared error (RMSE) is approximately 3.5%, corresponding to a 95% confidence interval of ±7%—twice the width of most reported intervals.

Got it? Take that Yougov pre-election 95% interval of [.48,.53] and double its width and you get something like [.46,.56] which more appropriately captures your uncertainty.
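The widening is just arithmetic; spelled out in Python (the "double the width" rule comes straight from the RMSE figure in the abstract above):

```python
# Total survey error: an empirical RMSE of ~3.5 points implies a 95%
# interval of about +/- 2 * 3.5 = 7 points, twice the usual reported width.
rmse = 3.5
print(2 * rmse)  # 7.0

# Applying "double the width" to YouGov's [48, 53] pre-election interval:
lo, hi = 48.0, 53.0
center, half = (lo + hi) / 2, (hi - lo) / 2
print(center - 2 * half, center + 2 * half)  # 45.5 55.5, roughly [46, 56]
```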

That all sounds just fine. But . . . I didn’t say this before the vote? So now the question is not, “Yougov: what went wrong?” or “UK bettors: what went wrong?” but, rather, “Gelman: what went wrong?”

That’s a question I should be able to answer! I think the most accurate response is that, like everyone else, I was focusing on the point estimate rather than the uncertainty. And, to the extent I was focusing on the uncertainty I was implicitly taking reported 95% intervals and treating them like 50% intervals. And, finally, I was probably showing too much deference to the betting line.

But I didn’t put this all together and note the inconsistency between the wide uncertainty intervals from the polls (after doing the right thing and widening the intervals to account for nonsampling errors) and the betting odds. In writing about the pre-election polls, I focused on the point estimate and didn’t focus in on the anomaly.

I should get some credit for attempting to untangle these threads now, but not as much as I’d deserve if I’d written this all two days ago. Credit to Yougov, then, for publicly questioning the 5-1 betting odds, before the voting began.

OK, now back to Yougov’s retrospective:

YouGov, like most other online pollsters, has said consistently it was a closer race than many others believed and so it has proved. While the betting markets assumed that Remain would prevail, throughout the campaign our research showed significantly larger levels of Euroscepticism than many other polling organisations. . . .

Early in the campaign, an analysis of the “true” state of public opinion claimed support for Leave was somewhere between phone and online methodologies but a little closer to phone. We disputed this at the time as we were sure our online samples were getting a much more representative sample of public opinion.

Fair enough. They’re gonna take the hit for being wrong, so they might as well grab what credit they can for being less wrong than many other pollsters. Remember, there still are people out there saying that you can’t trust online polls.

And now Yougov gets to the meat of the question:

We do not hide from the fact that YouGov’s final poll miscalculated the result by four points. This seems in a large part due to turnout – something that we have said all along would be crucial to the outcome of such a finely balanced race. Our turnout model was based, in part, on whether respondents had voted at the last general election and a turnout level above that of general elections upset the model, particularly in the North.

So they go with explanation 4 above: unexpected patterns of turnout.

They frame this as a North/South divide—which I guess is what you can learn from the data—but I’m wondering if it’s more of a simple Leave/Remain divide, with Leave voters being, on balance, more enthusiastic, hence turning out to vote at a higher-than-expected rate.

Related to this is explanation 3, changes in opinion. After all, that Yougov report also says, “three of YouGov’s final six polls of the campaign showing ‘Leave’ with the edge ranging from a 4% Remain lead to an 8% Leave lead.” And if you look at the graph reproduced above, and take a simple average, you’ll see a win for Leave. So the only way to call the polls as a lead for Remain (as Yougov did, in advance of the election) was to weight the more recent polls higher, that is to account for trends in opinion. It makes sense to account for trends, but once you do that, you have to accept the possibility of additional changes after the polling is done.
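As a toy illustration of how recency weighting can flip the call, here's a sketch with hypothetical poll numbers (the Leave shares, the timing, and the half-life are all made up, not YouGov's actual series):

```python
import numpy as np

def recency_weighted_mean(values, days_ago, half_life=3.0):
    """Average poll results, downweighting older polls exponentially."""
    w = 0.5 ** (np.asarray(days_ago, dtype=float) / half_life)
    return float(np.average(values, weights=w))

# Hypothetical Leave shares from six polls, oldest to newest:
leave = [52, 51, 53, 50, 49, 48]
days = [10, 8, 6, 4, 2, 0]

print(round(float(np.mean(leave)), 1))           # 50.5 -> simple average: Leave ahead
print(round(recency_weighted_mean(leave, days), 1))  # ~49.4 -> weighted: Remain ahead
```

The simple average calls it for Leave while the recency-weighted average calls it for Remain, which is the trade-off described above: weighting recent polls captures trends, but then you own the risk of further movement after polling stops.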

And, just to be clear: Yougov’s estimates using MRP were not bad at all. But this did not stop Yougov from reporting, as a final result, that mistaken 52-48 pro-Remain poll on the eve of the vote.

To get another perspective on what went wrong with the polling, I went to the webpage of Nikos Askitas, whose work I’d “shilled” on the sister blog the other day. Askitas had used a tally based on Google search queries—a method that he reported had worked for recent referenda in Ireland and Greece—and reported just before the election a slight lead for Remain, very close to the Yougov poll, as a matter of fact. Really kind of amazing it was so close, but I don’t know what adjustments he did to the data to get there; it might well be that he was to some extent anchoring his estimates to the polls. (He did not preregister his data-processing rules before the campaign began.)

Anyway, Askitas was another pundit to get things wrong. Here’s what he wrote in the aftermath:

Two days ago observing the rate at which the brexit side was recovering from the murder of Jo Cox I was writing that “as of 16:15 hrs on Tuesday afternoon the leave searches caught up by half a percentage point going from 47% to 47.5%. If trend continues they will be at 53% or Thursday morning”. This was simply regressing the leave searches on each hours passed. When I then saw the first slow down I had thought that it might become 51% or 52% but recovering most of the pre-murder momentum was still possible with only one obstacle in its way: time. When the rate of recovery of the leave searches slowed down in the evening of the 22nd of June and did not move upwards in the early morning of the 23rd I had to call the presumed trend as complete: if your instrument does not pick up measurement variation then you declare the process you are observing for finished. Leave was at 48%.

What explains the difference? Maybe the trend I was seeing early on was indeed still mostly there and there was simply no time to be recorded in search? Maybe the rain damaged the remaineers as it is widely believed? Maybe the poor turnout in Wales? Maybe our tool does not have the resolution it needs for such a close call? Or maybe, as I was saying elsewhere, “I am confident to mostly have identified the referendum relevant searches and I can see that many -but not all- of the top searches are indeed related to voting intent”.

Askitas seems to be focusing more on items 2 and 3 (measurement issues and opinion changes) and not so much on item 1 (non-representativeness of searchers) and item 4 (turnout). Again, let me emphasize that all four items interact.

Askitas also gives his take on the political outcome:

The principle of parliamentary sovereignty implies that referendum results are not legally binding and that action occurs at the discretion of the parliament alone. Consequently a leave vote is not identical with leaving. As I was writing elsewhere voting leave is hence cheap talk and hence the rational thing to do: you can air any and all grievances with the status quo and it is your vote if you have any kind of ax to grind (and most people do). Why wouldn’t you want to do so? The politicians can still sort it out afterwards. These politicians are now going to have to change their and our ways. Pro European forces in the UK, in Brussels and other European capitals must realize that scaremongering is not enough to stir people towards Europe. We saw that more than half of the Britons prefer a highly uncertain path than the certainty of staying, a sad evaluation of the European path. Pro Europeans need to paint a positive picture of staying instead of ugly pictures of leaving and most importantly they need to sculpt it in 3D reality one European citizen at a time.

P.S. I could’ve just as well titled this, “Brexit prediction markets: What went wrong?” But it seems pretty clear that the prediction markets were following the polls.

P.P.S. Full disclosure: YouGov gives some financial support to the Stan project. (I’d put this in my previous post on Yougov but I suppose the commenter is right that I should add this disclaimer to every post that mentions the pollster. But does this mean I also need to disclose our Google support every time I mention googling something? And must I disclose my consulting for Microsoft every time I mention Clippy? I think I’ll put together a single page listing outside support and then I can use a generic disclaimer for all my posts.)

P.P.P.S. Ben Lauderdale sent me a note arguing that Yougov didn’t do so bad at all:

I worked with Doug Rivers on the MRP estimates you discussed in your post today. I want to make an important point of clarification: none of the YouGov UK polling releases *except* the one you linked to a few days back used the MRP model. All the others were 1 or 2 day samples adjusted with raking and techniques like that. The MRP estimates never showed Remain ahead, although they got down to Leave 50.1 the day before the referendum (which I tweeted). The last run I did the morning of the referendum with the final overnight data had Leave at 50.6, versus a result of Leave 51.9.

Doug and I are going to post a more detailed post-mortem on the estimates when we recover from being up all night, but fundamentally they were a success: both in terms of getting close to the right result in a very close vote, and also in predicting the local authority level results very well. Whether our communications were successful is another matter, but it was a very busy week in the run up to the referendum, and we did try very hard to be clear about the ways we could be wrong in that article!

P.P.P.P.S. And Yair writes:

I like the discussion about turnout and Leave voters being more enthusiastic. My experience has been that it’s very difficult to separate turnout from support changes. I bet if you look at nearly any stable subgroup (defined by geography and/or demographics), you’ll tend to see the two moving together.

Another piece here, which I might have missed in the discussion, is differential non-response due to the Jo Cox murder. Admittedly I didn’t follow too closely, but it seems like all the news coverage in recent days was about that. Certainly plausible that this led to some level of Leave non-response, contributing to the polling trend line dipping towards Remain in recent days. I don’t think the original post mentioned fitting the MRP with Party ID (or is it called something else in the UK?), but I’m remembering the main graphs from the swing voter Xbox paper being pretty compelling on this point.

Last — even if the topline is off, in my view there’s still a lot of value in getting the subgroups right. I know I was informally looking at the YouGov map compared to the results map last night. Maybe would be good to see a scatterplot or something. I know everyone cares about the topline more than anything, but to me (and others, I hope) the subgroups are important, both for understanding the election and for understanding where the polls were off.

My talk tomorrow (Thurs) 10:30am at ICML in NYC

I’ll be speaking at the workshop on Data-Efficient Machine Learning. And here’s the schedule.

I’ll be speaking on the following topic:

Toward Routine Use of Informative Priors

Bayesian statistics is typically performed using noninformative priors but the resulting inferences commonly make no sense and also can lead to computational problems as algorithms have to waste time in irrelevant regions of parameter space. Certain informative priors that have been suggested don’t make much sense either. We consider some aspects of the open problem of using informative priors for routine data analysis.

I’ll also be speaking on one other, related, thing. And I’ll be part of the panel discussion at 3:30pm.

The workshop will be at the Marriott Marquis (Astor Room), New York.

P.S. The slides are here.

It comes down to reality and it’s fine with me cause I’ve let it slide

E. J. Wagenmakers pointed me to this recent article by Roy Baumeister, who writes:

Patience and diligence may be rewarded, but competence may matter less than in the past. Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure. Flair, intuition, and related skills matter much less with n = 50.

In fact, one effect of the replication crisis can even be seen as rewarding incompetence. These days, many journals make a point of publishing replication studies, especially failures to replicate. The intent is no doubt a valuable corrective, so as to expose conclusions that were published but have not held up.

But in that process, we have created a career niche for bad experimenters. This is an underappreciated fact about the current push for publishing failed replications. I submit that some experimenters are incompetent. In the past their careers would have stalled and failed. But today, a broadly incompetent experimenter can amass a series of impressive publications simply by failing to replicate other work and thereby publishing a series of papers that will achieve little beyond undermining our field’s ability to claim that it has accomplished anything.

I [Baumeister] mentioned the rise in rigor corresponding to the decline in interest value and influence of personality psychology. Crudely put, shifting the dominant conceptual paradigm from Freudian psychoanalytic theory to Big Five research has reduced the chances of being wrong but palpably increased the fact of being boring. In making that transition, personality psychology became more accurate but less broadly interesting.

Poe’s Law, as I’m sure you’re aware, “is an Internet adage which states that, without a clear indicator of the author’s intent, parodies of extreme views will be mistaken by some readers or viewers for sincere expressions of the parodied views.”

Baumeister’s article is what might be called a reverse-Poe, in that it’s evidently sincere, yet its contents are parodic.

Just to explain briefly:

1. The goal of science is not to reward “flair, intuition, and related skills”; it is to learn about reality.

2. I’m skeptical of the claim that “today, a broadly incompetent experimenter can amass a series of impressive publications simply by failing to replicate other work.” I’d be interested in who this experimenter is who had this impressive career.

In fact, the incentives go in the other direction. Let’s take an example. Carney et al. do a little experiment on power pose and, with the help of some sloppy data analysis, get “p less than .05,” statistical significance, publication, NPR, and a Ted Talk. Ranehill et al. do a larger, careful replication study and find the claims of Carney et al. to be unsupported by the data. A look back at the original paper of Carney et al. reveals serious problems with the original study, to the extent that, as Uri Simonsohn put it, that study never had a chance.

So, who’s the “broadly incompetent experimenter”? The people who did things wrong and claimed success by finding patterns in noise? Or the people who did things carefully and found nothing? I say the former. And they’re the ones who were “amassing a series of impressive publications.”

Baumeister’s problem, I think, is the same one as the problem with the “statistical power” literature, which is that he sees “p less than .05,” statistical significance, publication, NPR, Ted Talk, Gladwell, Freakonomics, etc., as a win. Whereas, to me, all of that is a win if there’s really a discovery there, but it’s a loss if it’s just a tale spun out of noise.

Here’s another example: When Weakliem and I showed why Kanazawa’s conclusions regarding beauty and sex ratio were essentially unsupported by data, this indeed “undermined psychology’s ability to claim that it has accomplished anything” in that particular area—but it was a scientific plus to undermine this, just as it was a scientific plus when chemists abandoned alchemy, when geographers abandoned the search for Atlantis, when biologists abandoned the search for the Loch Ness monster, and when mathematicians abandoned the search for solutions to the equation x^n + y^n = z^n for positive integers x, y, z and integers n greater than 2.

3. And then there’s that delicious phrase, “more accurate but less broadly interesting.”

I guess the question is, interesting to whom? Daryl Bem claimed that Cornell students had ESP abilities. If true, this would indeed be interesting, given that it would cause us to overturn so much of what we thought we understood about the world. On the other hand, if false, it’s pretty damn boring, just one more case of a foolish person believing something he wants to believe.

Same with himmicanes, power pose, ovulation and voting, alchemy, Atlantis, and all the rest.


The unimaginative hack might find it “less broadly interesting” to have to abandon beliefs in ghosts, unicorns, ESP, and the correlation between beauty and sex ratio. For the scientists among us, on the other hand, reality is what’s interesting and the bullshit breakthroughs-of-the-week are what’s boring.

Anyway, I read through that article when E. J. sent it to me, and I started to blog it, but then I thought, why give any attention to the ignorant ramblings of some obscure professor in some journal I’d never heard of.

But then someone else pointed me to this post by John Sakaluk who described Baumeister as “a HUGE Somebody.” It’s funny how someone can be HUGE in one field and unheard-of outside of it. Anyway, now I’ve heard of him!

P.S. In comments, Ulrich Schimmack points to this discussion. One thing I find particularly irritating about Baumeister, as well as with some other people in the anti-replication camp, is their superficially humanistic stance, the idea that they care about creativity! and discovery!, not like those heartless bean-counting statisticians.

As I wrote above, to the extent these phenomena such as power pose, embodied cognition, ego depletion, ESP, ovulation and clothing, beauty and sex ratio, Bigfoot, Atlantis, unicorns, etc., are real, then sure, they’re exciting discoveries! A horse-like creature with a big horn coming out of its head—cool, right? But, to the extent that these are errors, nothing more than the spurious discovery of patterns from random noise . . . then they’re just really “boring” (in the words of Baumeister) stories, low-grade fiction. The true humanist, I think, would want to learn truths about humanity. That’s a lot more interesting than playing games with random numbers.

Time-reversal heuristic as randomization, and p < .05 as conflict of interest declaration

Alex Gamma writes:

Reading your blog recently has inspired two ideas which have in common that they analogize statistical concepts with non-statistical ones related to science:

The time-reversal heuristic as randomization: Pushing your idea further leads to the notion of randomization of the sequence of study “reporting”. Studies are produced sequentially, but consumers of science could be exposed to them in any permutation of ordering to study the effects of temporal priority on belief formation. Maybe there could be a useful application of this, but I don’t see it (yet).

Declarations of conflicts of interest as p < .05: Conflicts of interests are largely treated like significant p-values: they indicate the absolute presence or absence of an effect (or, in this case, a bias), without nuance or context. The truth is certainly dimensional and not categorical here, so one could argue this practice should be refined. But then the utility function of such decisions might be different from those related to accepting scientific findings based on p-values, so it may well be better to err on the side of many more false positives. But at least the question should be raised.

I guess he’s referring to this post and this one.

YouGov uses Mister P for Brexit poll


Ben Lauderdale and Doug Rivers give the story:

There has been a lot of noise in polling on the upcoming EU referendum. Unlike the polls before the 2015 General Election, which were in almost perfect agreement (though, of course, not particularly close to the actual outcome), this time the polls are in serious disagreement. Telephone polls have generally shown more support for remain than online polls. Polls from different polling organisations (and sometimes even from the same organisation) have given widely varying estimates of support for Brexit. The polls do not even seem to agree on whether support for Brexit is increasing or decreasing.

This lack of agreement partly reflects the fact that polling a referendum is more difficult than a general election, because general elections occur regularly and many patterns can be expected to stay the same from election to election, enabling us to learn from past mistakes. Referendums address unique questions that may turn out different voters, and may scramble the political loyalties that tend to persist from general election to general election. Still, as the Scottish independence referendum showed, it is possible to get pretty close to the right answer with careful polling and analysis. In this article, we describe a strategy we are using to synthesise the evidence from the many polls that YouGov runs, the resulting estimates of the overall referendum results and how it breaks down by geography and some other variables – as well as what might still go wrong with our approach. . . .


More discussion and graphs at the link.
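For readers unfamiliar with the “P” in Mister P: poststratification reweights cell-level estimates to known population shares, which corrects for a sample that over- or under-represents some groups. Here is a minimal sketch with made-up numbers (two hypothetical age groups and support rates), not YouGov’s actual model:

```python
# Toy poststratification: the sample over-represents "old" respondents,
# so the raw mean is biased; reweighting cells to known population
# shares corrects it. All numbers here are hypothetical.
cells = {
    "young": {"n": 200, "mean": 0.40},  # 200 respondents, 40% support
    "old":   {"n": 800, "mean": 0.60},  # 800 respondents, 60% support
}
pop_share = {"young": 0.5, "old": 0.5}  # true population composition

raw = (sum(c["n"] * c["mean"] for c in cells.values())
       / sum(c["n"] for c in cells.values()))
post = sum(pop_share[g] * cells[g]["mean"] for g in cells)
print(f"raw estimate: {raw:.2f}, poststratified: {post:.2f}")
# prints: raw estimate: 0.56, poststratified: 0.50
```

The “MR” part of MRP fits a multilevel regression to get stable estimates for sparse cells before reweighting; the sketch above just takes raw cell means.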

Full disclosure: YouGov gives some financial support to the Stan project.

Reduced-dimensionality parameterizations for linear models with interactions

After seeing this post by Matthew Wilson on a class of regression models called “factorization machines,” Aki writes:

In a typical machine-learning way, this is called a “machine,” but it would also be a useful model structure in Stan for making linear models with interactions, but with a reduced number of parameters. With a fixed k, the rest of the parameters are continuous, and it would be a first step toward the models Dunson talked about in his presentation. Getting this fast may require a built-in function (to get a small expression tree), but for moderately big datasets it’s not going to need all the algorithms mentioned in that post.

Interesting, perhaps something for someone to work on. It requires some programming but not within core Stan code.
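For the curious, here is a minimal sketch of what a factorization machine computes (an illustration of the general idea, not Stan code): each predictor i gets a k-vector v_i, and the coefficient on every pairwise interaction x_i x_j is the dot product of v_i and v_j, so n predictors need only n*k interaction parameters instead of n(n-1)/2 free coefficients. The pairwise sum can be computed in O(n*k) time:

```python
import random

def fm_predict(x, w0, w, V):
    """Factorization-machine prediction: intercept + linear terms plus
    low-rank pairwise interactions, computed in O(n*k) time via the
    identity sum_{i<j} <v_i,v_j> x_i x_j
      = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    n, k = len(x), len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s2 = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s2)
    return w0 + linear + pairwise

random.seed(0)
n, k = 10, 2  # rank-2 interactions: 10*2 = 20 interaction parameters
              # instead of 10*9/2 = 45 free pairwise coefficients
x = [random.gauss(0, 1) for _ in range(n)]
w0 = 0.1
w = [random.gauss(0, 1) for _ in range(n)]
V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
print(fm_predict(x, w0, w, V))
```

This is the “reduced number of parameters” point: k controls the dimensionality of the interaction structure, and all the remaining parameters are continuous, so in principle the model fits naturally into Stan.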

Why I don’t believe the claim that Fox News can get Trump elected

Full story in the sister blog. Short story is that some economists did some out-of-control extrapolations.

More of my recent sister blog entries here.

Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud


The originals:

Clarke’s first law: When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.

Clarke’s second law: The only way of discovering the limits of the possible is to venture a little way past them into the impossible.

Clarke’s third law: Any sufficiently advanced technology is indistinguishable from magic.

My updates:

1. When a distinguished but elderly scientist states that “You have no choice but to accept that the major conclusions of these studies are true,” don’t believe him.

2. The only way of discovering the limits of the reasonable is to venture a little way past them into the unreasonable.

3. Any sufficiently crappy research is indistinguishable from fraud.

It’s the third law that particularly interests me.

On this blog and elsewhere, we sometimes get into disputes about untrustworthy research, things like power pose or embodied cognition which didn’t successfully replicate, or things like ovulation and voting or beauty and sex ratio which never had a chance to be uncovered with the experiments in question (that kangaroo thing), or things like the Daryl Bem ESP studies that just about nobody believed in the first place, and then when people looked carefully it turned out the analysis was a ganglion of forking paths. Things like himmicanes and hurricanes, where we’re like, Who knows? Anything could happen? But the evidence presented to us is pretty much empty. Or that air pollution in China study where everyone’s like, Sure, we believe it, but, again, if you believe it, it can’t be from the non-evidence in that paper.

What all these papers have in common is that they make serious statistical errors. That’s ok, statistics is hard. As I recently wrote, there’s no reason we should trust a statistical analysis, just because it appears in a peer-reviewed journal. Remember, the problem with peer review is with the “peers”: lots of them don’t know statistics either, and lots of them are motivated to believe certain sorts of claims (such as “embodied cognition” or whatever) and don’t care so much about the quality of the evidence.

And now to Clarke’s updated third law. Some of the work in these sorts of papers is so bad that I’m starting to think the authors are making certain mistakes on purpose. That is, they’re not just ignorantly walking down forking paths, picking up shiny statistically significant comparisons, and running with them. No, they’re actively torturing the data, going through hypothesis after hypothesis until they find the magic “p less than .05,” they’re strategically keeping quiet about alternative analyses that didn’t work, they’re selecting out inconvenient data points on purpose, knowingly categorizing variables to keep good cases on one side of the line and bad ones on the other, screwing around with rounding in order to get p-values from just over .05 to just under . . . all these things. In short, they’re cheating.
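The arithmetic of forking paths is easy to check by simulation. If a study of pure noise gets to try, say, 20 comparisons and report whichever one hits p < .05, then roughly 1 − .95^20, or about 64%, of null studies will find “something.” A sketch, with a hypothetical setup of two groups of 50 per comparison:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a z-statistic (known-variance normal test)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
n_studies, n_tests, n_per_group = 2000, 20, 50
hits = 0
for _ in range(n_studies):
    p_values = []
    for _ in range(n_tests):
        # Both groups are pure noise: no real effect anywhere.
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(a) - sum(b)) / n_per_group / math.sqrt(2 / n_per_group)
        p_values.append(two_sided_p(z))
    if min(p_values) < 0.05:  # report whichever comparison "worked"
        hits += 1
frac = hits / n_studies
print(f"share of pure-noise studies with at least one p < .05: {frac:.2f}")
```

The point is that this level of “success” requires no fraud at all, only flexibility in which comparison gets reported.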

Even when they’re cheating, I have no doubt that they are doing so for what they perceive to be a higher cause.

Are they cheating or are they just really really incompetent? When Satoshi Kanazawa publishes yet another paper on sex ratios with sample sizes too small to possibly learn anything, even after Weakliem and I published our long article explaining how his analytical strategy could never work, was he cheating—knowingly using a bad method that would allow him to get statistical significance, and thus another publication, from noise? Or was he softly cheating, by purposely not looking into the possibility that his method might be wrong, just looking away so he could continue to use the method? Or was he just incompetent, trying to do the scientific equivalent of repairing a watch using garden shears? Same with Daryl Bem and all the rest. I’m not accusing any of them of fraud! Who knows what was going through their mind when they were doing what they were doing.

Anyway, my point is . . . it doesn’t matter. Clarke’s Law! Any sufficiently crappy research is indistinguishable from fraud.

On deck this week

Mon: Clarke’s Law: Any sufficiently crappy research is indistinguishable from fraud

Tues: Reduced-dimensionality parameterizations for linear models with interactions

Wed: Time-reversal heuristic as randomization, and p < .05 as conflict of interest declaration

Thurs: It comes down to reality and it’s fine with me cause I’ve let it slide

Fri: Can a census-tract-level regression analysis untangle correlation between lead and crime?

Sat: What they’re saying about “blended learning”: “Perhaps the most reasonable explanation is that no one watched the video or did the textbook reading . . .”

Sun: When are people gonna realize their studies are dead on arrival?

How an academic urban legend can spread because of the difficulty of clear citation


Allan Dafoe writes:

I just came across this article about academic urban legends spreading because of sloppy citation practices. I found it fascinating and relevant to the conversations on your blog.

The article is by Ole Bjørn Rekdal and it is indeed fascinating. It begins as follows:

Many of the messages presented in respectable scientific publications are, in fact, based on various forms of rumors. Some of these rumors appear so frequently, and in such complex, colorful, and entertaining ways that we can think of them as academic urban legends. . . .

To illustrate this phenomenon, I draw upon a remarkable case in which a decimal point error appears to have misled millions into believing that spinach is a good nutritional source of iron. Through this example, I demonstrate how an academic urban legend can be conceived and born, and can continue to grow and reproduce within academia and beyond.

The story begins:

The following quote, including the reference, is taken from an article published by K. Sune Larsson in the Journal of Internal Medicine:

The myth from the 1930s that spinach is a rich source of iron was due to misleading information in the original publication: a malpositioned decimal point gave a 10-fold overestimate of iron content [Hamblin, 1981]. (Larsson, 1995: 448–449)

The quote caught my [Rekdal’s] attention for two reasons. First, it falsified an idea that I had carried with me since I was a child, that spinach is an excellent source of iron. The most striking thing, however, was that a single decimal point, misplaced 80 years ago, had affected not just myself and my now deceased parents, but also a large number of others in what we place on our table.

But you have to read Rekdal’s article for the full story. There are no bad guys here—no Weggy or Weick-style plagiarism. Rather, it’s a twisty tale illustrating the challenges of conveying information that we hear about second- or third-hand.

Difficulty of communication in our supersaturated media environment

Gregory Gelembiuk writes:

I was wondering if you might take a look at this and, if so inclined, do some public shredding.

Claims of electoral fraud have become increasingly popular among political progressives in the last several years and, unfortunately, appear to be gaining critical mass (especially with Sanders’ loss). The “study” above, now being widely circulated in social media, is one example. Even though I normally wouldn’t waste your time with a junk item like this, I thought it might warrant some attention, given the apparent ongoing erosion in faith in democratic institutions.

Sure, no prob . . . It’s a bad, bad paper. By comparison, it makes that himmicanes paper look like Stephen Hawking, it makes power pose look like Jean Piaget, it makes that ovulation-and-clothing paper look like, ummmm, I dunno, the Stroop effect?

But I just posted on it 3 days ago. That should be enough, no?

Kinda scary that someone’s emailing me without noticing such a recent post. Maybe the problem is that it was not easily found by a search. So maybe this will help: the paper in question is called, “Are we witnessing a dishonest election? A between state comparison based on the used voting procedures of the 2016 Democratic Party Primary for the Presidency of the United States of America,” and it’s by Axel Geijsel and Rodolfo Cortes Barragan.

The NYT inadvertently demonstrates how not to make a graph

Andrew Hacker writes:

I have the class prepare a report on how many households in the United States have telephones, land and cell. After studying census data, they focus on two: Connecticut and Arkansas, with respective ownerships of 98.9 percent and 94.6 percent. They are told they have to choose one of the following charts to represent the numbers, and defend their choice.


The first chart suggests a much bigger difference, but is misleading because the bars are arbitrarily scaled to exaggerate that difference.

I hate to see this sort of thing in the New York Times. Millions of people read the Times, it’s an authoritative news source, and this is not the graphics advice they should be given.

Let me break this down. The first thing is that it’s a bit ridiculous to make this big graph for just 2 data points. Why not map all 50 states, why just graph two of them?

The second thing is . . . hey, be numerate here! 98.9% and 94.6% look really close. Let’s ask what percentage of households in each state don’t have phones. When X is close to 1, look at 1 – X. Then you get 1.1% and 5.4%, which indeed are very different.
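The trick in numbers, using the two rates from Hacker’s example:

```python
# When rates are near 1, the complement is where the action is.
own = {"Connecticut": 0.989, "Arkansas": 0.946}
for state, p in own.items():
    print(f"{state}: {1 - p:.1%} of households without a phone")
# Connecticut: 1.1% vs Arkansas: 5.4% -- roughly a 5x difference
```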

P.S. Hacker also writes this:

In the real world, we constantly settle for estimates, whereas mathematics — see the SAT — demands that you get the answer precisely right.

Ummm, no. The SAT is a multiple-choice test so of course you have to get the answer precisely right. That’s true of the reading questions on the SAT too, but nobody would say that reading demands that you get the answer precisely right. He’s confusing the underlying subject with the measuring instrument!

Speaking more generally of mathematics: of course there are lots of mathematical results on estimation and approximation. I mean, sure, yeah, I think I see what Hacker’s getting at, but I’d prefer if he were to say that there is a mathematics of estimation and approximation, and that this is an important part of studying the real world.

Hey—here’s a tip from the biology literature: If your correlation is .02, try binning your data to get a correlation of .8 or .9!

Josh Cherry writes:

This isn’t in the social sciences, but it’s an egregious example of statistical malpractice:

Below the abstract you can find my [Cherry’s] comment on the problem, which was submitted as a letter to the journal, but rejected on the grounds that the issue does not affect the main conclusions of the article (sheesh!). These folks had variables with Spearman correlations ranging from 0.02 to 0.07, but they report “strong correlations” (0.74-0.96) that they obtain by binning and averaging, essentially averaging away unexplained variance. This sort of thing has been done in other fields as well.

The paper in question, by A. Diament, R. Y. Pinter, and T. Tuller, is called, “Three-dimensional eukaryotic genomic organization is strongly correlated with codon usage expression and function.” I don’t know from eukaryotic genomic organization, nor have I ever heard of “codon”—maybe I’m using the stuff all the time without realizing it!—but I have heard of “strongly correlated.” Actually, in the abstract of the paper it gets upgraded to “very strongly correlated.”

In the months since Cherry sent this to me, more comments have appeared at the above Pubmed commons link, including this by Joshua Plotkin which shares his original referee report with Premal Shah from 2014 recommending rejection of the paper. Key comment:

Every single correlation reported in the paper is based on binned data. Although it is sometimes appropriate to bin the data for visualization purposes, it is entirely without merit to report correlation coefficients (and associated p-values) on binned data . . . Based on their own figures 3D and S2A, it seems clear that their results either have very small effect or do not hold at all when analyzing the actual raw data.

And this:

Moreover, the correlation coefficients reported in most of their plots make no sense whatsoever. For instance, in Fig1B, the best-fit regression line of CUBS vs PPI barely passes through the bulk of the data, and yet the authors report a perfect correlation of R=1.

A follow-up comment by Plotkin has some numbers:

In the paper by Diament et al 2014, the authors never reported the actual correlation (r = 0.022) between two genomic measurements; instead they reported correlations on binned data (r = 0.86).

I think we can all agree that .022 is a low correlation and .86 is a high correlation.
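The binning trick is easy to reproduce. Here is a small simulation (made-up data, not the measurements from the paper): the underlying correlation is about 0.05, but averaging within 20 equal-count bins averages away the unexplained variance, and the correlation of the bin means comes out above 0.9:

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in xs) *
                           sum((b - my) ** 2 for b in ys))

random.seed(1)
n, n_bins = 100_000, 20
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.05 * xi + random.gauss(0, 1) for xi in x]  # true correlation ~0.05

raw_r = pearson(x, y)

# Sort by x, split into equal-count bins, and average within each bin.
pairs = sorted(zip(x, y))
size = n // n_bins
x_means = [sum(xi for xi, _ in pairs[b*size:(b+1)*size]) / size
           for b in range(n_bins)]
y_means = [sum(yi for _, yi in pairs[b*size:(b+1)*size]) / size
           for b in range(n_bins)]
binned_r = pearson(x_means, y_means)

print(f"raw r = {raw_r:.3f}, binned r = {binned_r:.3f}")
```

Nothing about the relationship changed; the averaging simply threw away the within-bin variance, which is most of the variance.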

But then there’s this from Tuller:

In Shah P, 2013 Plotkin & Shah report in the abstract a correlation which is in fact very weak (according to their definitions here), r = 0.12, without controlling for relevant additional fundamental variables, and include a figure of binned values related to this correlation. This correlation (0.12) is reported in their study as “a strong positive correlation”.

So now I’m thinking that everyone in this field should just stop calling correlations high or low or strong or weak. Better just to report the damn number.

Tuller also writes:

If the number of points in a typical systems biology study is ~300, the number of points analyzed in our study is 1,230,000-fold higher (!); a priori, a researcher with some minimal experience in the field should not expect to see similar levels of correlations in the two cases. Everyone also knows that increasing the number of points, specifically when dealing with non trivial NGS data, also tends to very significantly decrease the correlation.

Huh? I have no idea what they’re talking about here.

But, in all seriousness, it sounds to me like all these researchers should stop talking about correlation. If you have a measure that gets weaker and weaker as your sample size increases, that doesn’t seem like good science to me! I’m glad that Cherry put in the effort to fight this one.

Comment on network analysis of online ISIS activity

Two different journalists asked me about this paper, “New online ecology of adversarial aggregates: ISIS and beyond,” N. F. Johnson, M. Zheng, Y. Vorobyeva, A. Gabriel, H. Qi, N. Velasquez, P. Manrique, D. Johnson, E. Restrepo, C. Song, S. Wuchty, a paper that begins:

Support for an extremist entity such as Islamic State (ISIS) somehow manages to survive globally online despite considerable external pressure and may ultimately inspire acts by individuals having no history of extremism, membership in a terrorist faction, or direct links to leadership. Examining longitudinal records of online activity, we uncovered an ecology evolving on a daily time scale that drives online support, and we provide a mathematical theory that describes it. The ecology features self-organized aggregates (ad hoc groups formed via linkage to a Facebook page or analog) that proliferate preceding the onset of recent real-world campaigns and adopt novel adaptive mechanisms to enhance their survival. One of the predictions is that development of large, potentially potent pro-ISIS aggregates can be thwarted by targeting smaller ones.

I sent my response to the journalists, but then I thought that some of you might be interested too, so here it is:

The paper seems kinda weird. Figure 1 has 10 groups and but you have to contact the authors to find out the names of the groups? They say, “Because the focus in this paper is on the ecosystem rather than the behavior of any individual aggregate, the names are not being released.” But (a) there’s room on the graph for the names, and (b) it would be easy to post the names online. It creeps me out: maybe the FBI is tracking who emails them for the names? I have no idea but it seems strange to withhold data and make readers ask them for it. If the data were actually secret for national security reasons, that I’d understand. But if you’re going to release the data to anyone who asks, why not just post online?

Anyway, that all put me in a bad mood, also the little image inserted in figure 1 adds zero information except to show that the authors had access to a computer program that makes these umbrella-like plots.

Beyond this, they talk about a model for the shark-fin shape, but this just seems a natural consequence of the networks being shut down as they get larger and more noticeable.

On the plus side, the topic is obviously important, the idea of looking at aggregates seems like a good one, and I’m sure much can be learned from these data. I think it would be more useful for them to have produced a longer, open-ended report full of findings. The problem with the “Science magazine” style of publishing is that it encourages researchers to write these very short papers that are essentially self-advertisements. I guess in that sense this paper might be useful in that it could attract media attention and maybe the authors have a longer report with more data explorations. Or it might be that there’s useful stuff in this Science paper and I’m just missing it because I’m getting lost in their model. My guess is that the most valuable things here are the descriptive statistics. If so, that would be fine. There’s a bit of a conflict here in that for Science magazine you’re supposed to have discoveries but for fighting ISIS there’s more of a goal of understanding what is happening out there. In theory there is some benefit from modeling (as the authors note, one can do simulation of various potential anti-ISIS strategies) but I don’t think they’re really there yet.

I’m guessing Cosma Shalizi and Peter Dodds would have something to say here, as they are more expert than I am in this sort of physics-inspired network analysis.

P.S. Here’s one of the news articles. It’s by Catherine Caruso and entitled, “Can a Social-Media Algorithm Predict a Terror Attack?”, which confused me because I didn’t notice anything about prediction in that research article. I mean, sure, they said, “Our theoretical model generates various mathematically rigorous yet operationally relevant predictions.” But I don’t think they were actually predicting terror attacks. But maybe I’m missing something.

P.P.S. The New York Times also ran a story on this one. They didn’t interview me; instead they interviewed a couple of actual experts, both of whom expressed skepticism. Even so, though, the writer of the article, Pam Belluck, managed to spin the research project in a positive way. I think that’s just the way things go with science reporting. If it’s not a scandal of some sort, the press likes the idea of scientist as hero.



Paul Alper pointed me to this news article with the delightful url, “superstar-doctor-fired-from-swedish-institute-over-research-lies-allegations-windpipe-surgery.” Also here. It reminded me of this discussion from last year.

Damn, those windpipe surgeons are the worst. Never should trust them. The pope should never have agreed to officiate at this guy’s wedding.

You’ll never guess what I’ll say about this paper claiming election fraud! (OK, actually probably you can guess.)

Glenn Chisholm writes:

As a frequent visitor of your blog (a bit of a long time listener first time caller comment I know) I saw this particular controversy:


Very superficial analysis:

and was interested if I could get you to blog on its actual statistical foundations, this particular paper has at least the appearance of respectability due to the institutions involved. As a permanent resident not a citizen I tend to be a little more abstracted from American politics, although I do enjoy the vigor in which it is played in the US. Based on some of the vitriol I have seen from all sides you may not want to touch this with a ten foot pole, but I thought it would be interesting to get a political scientist with credibility to put analysis out there publicly. There does seem to be this undercurrent in this cycle where a group genuinely feel that somehow they were disenfranchised, my take is Occam’s Razor applies here and no manipulation occurred, but my opinion is worthless.

My reply:

I don’t find this paper at all convincing. There can be all sorts of differences between different states, and that pie chart is a joke in that it hides the between-state variation within each group of states. You never know, fraud could always happen, but this is essentially zero evidence. Not that you’d need an explanation as to why a 74-year-old socialist fails to win a major-party nomination in the United States.

Also, regarding your comment about the institutions involved, I wouldn’t take the credentials so seriously; this Stanford guy is not a political scientist. Not that this means it’s necessarily wrong, but he’s not coming from a position of expertise. He’s just a guy, the Stanford affiliation gives no special credibility.

Objects of the class “Pauline Kael”

A woman who’s arguably the top person ever in a male-dominated field.

Steve Sailer introduced the category and entered Pauline Kael (top film critic) as its inaugural member. I followed up with Alice Waters (top chef/restaurateur), Mata Hari (top spy), Agatha Christie (top mystery writer), and Helen Keller (top person who overcame a disability; sorry, Stevie Wonder, you didn’t make the cut).

We can distinguish this from objects of the class “Amelia Earhart”: a woman who’s famous for being a woman in a male-dominated field. There are lots of examples of this category, for example Danica Patrick.

Objects of the class “Objects of the class”

P.S. Some other good ones from that thread:

Queen Elizabeth I (tops in the “English monarchs” category)

Margaret Mead (anthropologists)

We also discussed some other candidates, including Marie Curie, Margaret Thatcher, Mary Baker Eddy, Ellen Willis, and my personal favorite on this list, Caitlyn Jenner (best decathlete of all time).

“Smaller Share of Women Ages 65 and Older Are Living Alone,” before and after age adjustment

After noticing this from a recent Pew Research report:


Ben Hanowell wrote:

This made me [Hanowell] think of your critique of Case and Deaton’s finding about non-Hispanic mortality.

I wonder how much these results are driven by the fact that the population of adults aged 65 and older has gotten older with increasing lifespans, etc etc.

My collaborator Jonathan Auerbach looked into this and found that, in this case, age adjustment doesn’t make a qualitative difference. Here are some graphs:

Percent of adults over age 65 who live alone, over time, for women and men, with solid lines showing raw data and dotted lines showing populations adjusted to the age x sex composition from 2014:


The adjustment doesn’t change much. To get more understanding, let’s break things up by individual ages. Here are the raw data, with each subgraph showing the curves for the 5 years in its age category:


Some interesting patterns here. At the beginning of the last century, a bit less than 10% of elders were living alone, with that percentage not varying by sex or age. Then big changes in recent years.

We learn a lot more from these individual-age curves than we did from the simple aggregate.

In this case, age adjustment did not do much, but age disaggregation was useful.
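The adjustment here is essentially direct standardization: apply each year’s age-specific rates to a fixed reference age distribution, so that any remaining difference between years cannot be due to the 65+ group itself getting older or younger. A sketch with hypothetical rates and shares (the real analysis used the age x sex composition from 2014):

```python
# Direct age standardization with hypothetical numbers.
ref_share = {"65-74": 0.55, "75-84": 0.30, "85+": 0.15}  # e.g., 2014 mix

living_alone_rate = {
    1990: {"65-74": 0.28, "75-84": 0.38, "85+": 0.50},
    2014: {"65-74": 0.24, "75-84": 0.36, "85+": 0.48},
}

def standardize(rates, shares):
    """Weighted average of age-specific rates under a fixed age mix."""
    return sum(shares[a] * rates[a] for a in shares)

for year, rates in living_alone_rate.items():
    print(year, round(standardize(rates, ref_share), 3))
```

When, as in this case, the standardized and raw series tell the same story, the age composition wasn’t driving the trend.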

Jonathan then broke down the data by age and ethnicity (non-Hispanic Black, Hispanic, and non-Hispanic White; I guess there weren’t enough data on Other, going back to 1900):


To see people blogging about it in real time — that’s not the way science really gets done. . . .

Seriously, though, it’s cool to see how much can be learned from very basic techniques of statistical adjustment and graphics.

And maybe we made some mistakes. If so, putting our results out there in clear graphical form should make it easier for others to find problems with what we’ve done. And that’s what it’s all about. We only spent a few hours on this—we didn’t spend a year, sweating out every number, sweating over what we were doing—but we’d still welcome criticism and feedback, from any source. That’s a way to move science forward, to move the research forward.

P.S. I sent this to demographer Philip Cohen who wrote:

There are two issues with 65+: 1 is they are getting older, 2 is they are getting younger since the baby boomers suddenly hit 65.

Here’s a post by Cohen on this from a couple months ago.