
Confirmation bias

Shravan Vasishth is unimpressed by this evidence that was given to support the claim that being bilingual postpones symptoms of dementia:

[Screenshot of the cited evidence]

My reaction: Seems like there could be some selection issues, no?

Shravan: Also, low sample size, and confirming what she already believes. I would be more impressed if she found evidence against the bilingual advantage.

Me: Hmmm, that last bit is tricky, as there’s also a motivation for people to find surprising, stunning results.

Shravan: Yes, but you will never find that this surprising, stunning result is something that goes against the author’s own previously published work. It always goes against someone *else*’s. I find this issue to be the most surprising and worrying of all, even more than p-hacking, that we only ever find evidence consistent with our beliefs and theories, never against.

Indeed, Shravan’s example confirms what I already thought about scientists.

The Lure of Luxury


From the sister blog, a response to an article by psychologist Paul Bloom on why people own things they don’t really need:

Paul Bloom argues that humans dig deep, look beyond the surface, and attend to the nonobvious in ways that add to our pleasure and appreciation of the world of objects. I [Susan] wholly agree with that analysis. My objection, however, is that he does not go far enough. There is a dark side to our infatuation with and obsession with the past. Our focus on historical persistence reveals not just appreciation and pleasure, but also bigotry and cruelty. Bloom’s story is incomplete without bringing these cases to light.

All the examples that Bloom discusses involve what we might call positive contagion—an object gains value because of its link to a beloved individual, history, or brand. This positive glow rescues a seemingly offensive behavior: contrary to what we might at first think, spending exorbitant amounts on a watch is not selfish or self-absorbed but rather can be understood as benign and even virtuous. Those who spend on luxuries are not “irrational, wasteful, . . . evil”—rather, they appropriately take pleasure by rationally considering the joy that we all find in a cherished object’s history.

Yet attention to an object’s history does not merely provide joy. History can also be a taint leading to suspicion, segregation, and discrimination. The psychologist Paul Rozin notes that people seem to operate according to a principle of “magical contagion,” where one can be harmed by contact with an object involved with evil or death, leading people to reject wearing Hitler’s sweater or a suit that someone died in, or living in a house in which a murder was committed. Fair enough. But the troubling point is that this same impulse arises when people come into contact with objects linked to those who are not evil but just different—not part of one’s in-group. In fact, simply thinking about such contact can be disturbing.

Segregation and institutionalized discrimination reflect this impulse to avoid contact across social groups. In parts of India, elaborate behavioral codes ensure that individuals will not come into contact with objects that have been touched by those of a lower caste. Thus, some teashops use a “double-tumbler” system, such that Dalits (“untouchables”) are required to use different cups, plates, or utensils than caste Hindus. Whites-only drinking fountains in the pre–civil rights southern United States can be understood as a means of avoiding negative history—contact with an object that has been touched by members of a marginalized group. In the 1980s, many responded similarly to individuals with AIDS, who were sometimes banned from swimming pools and other public places. Indeed, in one national survey, many respondents reported that they would be less likely to wear a sweater that had been worn once by a person with AIDS, or would feel uncomfortable drinking out of a sterilized glass that had been used a few days earlier by a person with AIDS.

In our own research, Meredith Meyer, Sarah-Jane Leslie, Sarah Stilwell, and I [Susan] found similar negative responses to a homeless person, someone with low IQ, someone with schizophrenia, or someone who has committed a crime. Adults typically report feeling “creeped out” by the idea of receiving an organ transplant or blood transfusion from such individuals for fear they will be contaminated or even become more like the donor. These beliefs hold even when people are assured that the organ or blood is healthy. In this case, a heart’s history is thought to carry with it negative characteristics of a group subject to discrimination.

Attention to object history may indeed be a biological adaptation. It can serve us well and enrich our appreciation of the objects around us, from Rolex watches to discarded baby shoes to a poet’s unused typewriter paper. But it is important that we recognize the terrible costs of this way of thinking.

We fiddle while Rome burns: p-value edition


Raghu Parthasarathy presents a wonderfully clear example of disastrous p-value-based reasoning that he saw in a conference presentation. Here’s Raghu:

Consider, for example, some tumorous cells that we can treat with drugs 1 and 2, either alone or in combination. We can make measurements of growth under our various drug treatment conditions. Suppose our measurements give us the following graph:

[Bar graph: tumor growth under the control, drug 1 alone, drug 2 alone, and both drugs combined, with p-values from two-sample t-tests]
. . . from which we tell the following story: When administered on their own, drugs 1 and 2 are ineffective — tumor growth isn’t statistically different than the control cells (p > 0.05, 2 sample t-test). However, when the drugs are administered together, they clearly affect the cancer (p < 0.05); in fact, the p-value is very small (0.002!). This indicates a clear synergy between the two drugs: together they have a much stronger effect than each alone does. (And that, of course, is what the speaker claimed.)

I [Raghu] will pause while you ponder why this is nonsense.

He continues:

Another interpretation of this graph is that the “treatments 1 and 2” data are exactly what we’d expect for drugs that don’t interact at all. Treatment 1 and Treatment 2 alone each increase growth by some factor relative to the control, and there’s noise in the measurements. The two drugs together give a larger, simply multiplicative effect, and the signal relative to the noise is higher (and the p-value is lower) simply because the product of 1’s and 2’s effects is larger than each of their effects alone.

And now the background:

I [Raghu] made up the graph above, but it looks just like the “important” graphs in the talk. How did I make it up? The control dataset is random numbers drawn from a normal distribution with mean 1.0 and standard deviation 0.75, with N=10 measurements. Drug 1 and drug 2’s “data” are also from normal distributions with the same N and the same standard deviation, but with a mean of 2.0. (In other words, each drug enhances the growth by a factor of 2.0.) The combined treatment is drawn from a distribution of mean 4.0 (= 2 x 2), again with the same number of measurements and the same noise. In other words, the simplest model of a simple effect. One can simulate this ad nauseam to get a sense of how the measurements might be expected to look.

Did I pick a particular outcome of this simulation to make a dramatic graph? Of course, but it’s not unrepresentative. In fact, of the cases in which Treatment 1 and Treatment 2 each have p > 0.05, over 70% have p < 0.05 for Treatment 1 x Treatment 2! Put differently, conditional on each drug having an “insignificant” effect alone, there’s a 70% chance of the two together having a “significant” effect not because they’re acting together, but just because multiplying two numbers greater than one gives a larger number, and a larger number is more easily distinguished from 1!
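Raghu’s setup is easy to reproduce. Here is a minimal sketch (my code, not his; the exact fraction depends on the random seed, but it reliably comes out well above one half):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, sd, reps = 10, 0.75, 5000
n_cond = n_sig = 0
for _ in range(reps):
    control = rng.normal(1.0, sd, n)  # baseline growth
    drug1 = rng.normal(2.0, sd, n)    # each drug doubles growth on average
    drug2 = rng.normal(2.0, sd, n)
    both = rng.normal(4.0, sd, n)     # multiplicative, non-interacting effect
    # two-sample t-tests against the control, as in the talk
    if (ttest_ind(control, drug1).pvalue > 0.05
            and ttest_ind(control, drug2).pvalue > 0.05):
        n_cond += 1
        if ttest_ind(control, both).pvalue < 0.05:
            n_sig += 1
# fraction "significant" for the combination, given each drug "insignificant"
frac = n_sig / n_cond
```

No synergy is built into the simulation; the combined condition wins its t-test simply because an effect of 4 is easier to distinguish from 1 than an effect of 2 is.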

As we’ve discussed many times, the problem here is partly with p-values themselves and partly with the null hypothesis significance testing framework:

1. The problem with p-values: the p-value is a strongly nonlinear transformation of data that is interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null. My criticism here is not merely semantic or a clever tongue-twister or a “howler” (as Deborah Mayo would say); it’s real. In settings where the null hypothesis is not a live option, the p-value does not map to anything relevant.

To put it another way: Relative to the null hypothesis, the difference between a p-value of .13 (corresponding to a z-score of 1.5) and a p-value of .003 (corresponding to a z-score of 3) is huge; it’s the difference between a data pattern that could easily have arisen by chance alone and a data pattern that is highly unlikely to have arisen by chance. But, once you allow nonzero effects (as is appropriate in the sorts of studies that people are interested in doing in the first place), the difference between z-scores of 1.5 and 3 is no big deal at all; it’s easily attributable to random variation. I don’t mind z-scores so much, but the p-value transformation does bad things to them.
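To see the nonlinearity concretely, here is a quick check (a sketch, using the usual two-sided normal-tail p-values): the two p-values differ by a factor of almost 50, yet the difference between two independent z-scores of 1.5 and 3 is itself only about one standard error.

```python
import math
from scipy.stats import norm

p_13 = 2 * norm.sf(1.5)   # two-sided p for z = 1.5: about 0.13
p_003 = 2 * norm.sf(3.0)  # two-sided p for z = 3: about 0.003
# comparing the two z-scores directly: their difference has sd sqrt(2)
z_diff = (3.0 - 1.5) / math.sqrt(2)
p_diff = 2 * norm.sf(z_diff)  # about 0.29: easily chance variation
```

The gap between a “significant” and a “non-significant” z-score is itself nowhere near significant.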

2. The problem with null hypothesis significance testing: As Raghu discusses near the end of his post, this sort of binary thinking makes everything worse in that people inappropriately combine probabilistic statements with Boolean rules. And switching from p-values to confidence intervals doesn’t do much good here, for two reasons: (a) if all you do is check whether the confidence interval excludes 0, you haven’t gone forward at all, and (b) even if you do use them as uncertainty statements, classical intervals have all the biases that arise from not including prior information: classical confidence intervals overestimate magnitudes of effect sizes.
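Point (b) is easy to demonstrate by simulation (a sketch with hypothetical numbers: a true effect of 0.2 measured with standard error 1). Conditional on statistical significance, the classical estimate, which sits at the center of the classical interval, overstates the true effect many times over (the exaggeration ratio, or type M error):

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, se = 0.2, 1.0                  # hypothetical: small effect, noisy study
est = rng.normal(true_effect, se, 100_000)  # unbiased classical estimates
sig = np.abs(est / se) > 1.96               # keep only the "significant" ones
exaggeration = np.abs(est[sig]).mean() / true_effect
```

With these numbers, published-if-significant estimates exceed the true effect by roughly an order of magnitude; a prior that shrinks toward zero is what fixes this.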

Anyway, we know all this, but recognizing the ubiquity of fatally flawed significance-testing reasoning puts a bit more pressure on us to come up with and promote better alternatives that are just as easy to use. I do think this is possible; indeed I’m working on it when not spending my time blogging. . . .

“Which curve fitting model should I use?”


Oswaldo Melo writes:

I have learned many curve-fitting models in the past, including their technical and mathematical details. Now I have been working on real-world problems and I face a great shortcoming: which method to use.

As an example, I have to predict the demand of a product. I have a time series collected over the last 8 years: a simple set of (x, y) data giving the demand for a product in a certain week. I have this for 9 products. And to continue the study, I must predict the demand of each product for the next years.

Looks easy enough, right? Since I do not have the probability distribution of the data, just use a non-parametric curve fitting algorithm. But which one? Kernel smoothing? B-splines? Wavelets? Symbolic regression? What about Fourier analysis? Neural networks? Random forests?

There are dozens of methods that I could use. But which one has better performance remains a mystery. I tried to read many articles in which the authors make predictions based on a time series, and in most, it looks like the choice was completely arbitrary. They would say: “now we will fit a curve to the data using multivariate adaptive regression splines.” But nowhere is it explained why they used such a method instead of, let’s say, kernel regression or Fourier analysis or a neural network.

I am aware of cross-validation. But am I supposed to try all the dozen methods out there, cross-validate all of them, and see which one performs better? Can cross-validation even be used for all methods? I am not sure. I have mostly seen cross-validation being used within a single method, never between a lot of methods.

I could not find anything on the literature that answers such a simple question. “Which curve fitting model should I use?”

These are good questions. Here are my responses, in no particular order:

1. What is most important about a statistical model is not what it does with the data but, rather, what data it uses. You want to use a model that can take advantage of all the data you have.

2. In your setting with structured time series data, I’d use a multilevel model with coefficients that vary by product and by time. You may well have other structure in your data that you haven’t even mentioned yet, for example demand as broken down by geography or demographic sectors of your consumers; also the time dimension has structure, with different things happening at different times of year. If you want a nonparametric curve fit, you could try a Gaussian process, which plays well with Bayesian multilevel models.

3. Cross-validation is fine but it’s just one more statistical method. To put it another way, if you estimate a parameter or pick a method using cross-validation, it’s still just an estimate. Just cos something performs well in cross-validation, it doesn’t mean it’s the right answer. It doesn’t even mean it will predict well for new data.

4. There are lots of ways to solve a problem. The choice of method to use will depend on what information you want to include in your model, and also what sorts of extrapolations you’ll want to use it for.
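To make point 2 concrete, a Gaussian process regression can be sketched in a few lines of plain numpy (hypothetical data and kernel parameters; a real analysis would estimate the length scale and noise level rather than fixing them, and would sit inside the multilevel model across products):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=3.0, amplitude=1.0):
    """Squared-exponential covariance between two sets of 1-d inputs."""
    d = a[:, None] - b[None, :]
    return amplitude * np.exp(-0.5 * (d / length_scale) ** 2)

# hypothetical weekly demand for one product
rng = np.random.default_rng(0)
x = np.arange(52.0)                       # weeks of observed data
y = np.sin(x / 8.0) + 0.2 * rng.normal(size=x.size)

noise_var = 0.2 ** 2
K = rbf_kernel(x, x) + noise_var * np.eye(x.size)
x_new = np.linspace(0.0, 60.0, 61)        # includes 8 weeks of extrapolation
K_s = rbf_kernel(x_new, x)
post_mean = K_s @ np.linalg.solve(K, y)   # GP posterior mean
post_cov = rbf_kernel(x_new, x_new) - K_s @ np.linalg.solve(K, K_s.T)
post_sd = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
```

Note how post_sd grows in the extrapolation region: the fit is honest about forecasts beyond the data being more uncertain, which is much of the point of using a probabilistic model for this problem.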

Nooooooo, just make it stop, please!


Dan Kahan wrote:

You should do a blog on this.

I replied: I don’t like this article but I don’t really see the point in blogging on it. Why bother?


Kahan: BECAUSE YOU REALLY NEVER HAVE EXPLAINED WHY. The Gelman-Rubin critique of BIC is *not* responsive; you have something in mind—tell us what, pls! Inquiring minds want to know.

Me: Wait, are you saying it’s not clear to you why I should hate that paper??



Kahan: Certainly what you say about the “model selection” aspects of BIC in Gelman-Rubin doesn’t apply.

Me: OK, OK. . . . The paper is called, Bayesian Benefits for the Pragmatic Researcher, and it’s by some authors whom I like and respect, but I don’t like what they’re doing. Here’s their abstract:

The practical advantages of Bayesian inference are demonstrated here through two concrete examples. In the first example, we wish to learn about a criminal’s IQ: a problem of parameter estimation. In the second example, we wish to quantify and track support in favor of the null hypothesis that Adam Sandler movies are profitable regardless of their quality: a problem of hypothesis testing. The Bayesian approach unifies both problems within a coherent predictive framework, in which parameters and models that predict the data successfully receive a boost in plausibility, whereas parameters and models that predict poorly suffer a decline. Our examples demonstrate how Bayesian analyses can be more informative, more elegant, and more flexible than the orthodox methodology that remains dominant within the field of psychology.

And here’s what I don’t like:

Their first example is fine, it’s straightforward Bayesian inference with a linear model, it’s almost ok except that they include a bizarre uniform distribution as part of their prior. But here’s the part I really don’t like. After listing seven properties of the Bayesian posterior distribution, they write, “none of the statements above—not a single one—can be arrived at within the framework of orthodox methods.” That’s just wrong. In classical statistics, this sort of Bayesian inference falls into the category of “prediction.” We discuss this briefly in a footnote somewhere in BDA. Classical “predictive inference” is Bayesian inference conditional on hyperparameters, which is what’s being done in that example. A classical predictive interval is not the same thing as a classical confidence interval, and a classical unbiased prediction is not the same thing as a classical unbiased estimate. The key difference: when a classical statistician talks about “prediction,” this means that the true value of the unknown quantity (the “prediction”) is not being conditioned on. Don’t get me wrong, I think Bayesian inference is great; I just think it’s silly to say that these methods don’t exist with orthodox methods.

Their second example, I hate. It’s that horrible hypothesis testing thing. They write, “The South Park hypothesis (H0) posits that there is no correlation (ρ) between box-office success and “fresh” ratings—H0: ρ = 0.” OK, it’s a joke, I get that. But, within the context of the example, no. No. Nononononono. It makes no sense. The correlation is not zero. None of this makes any sense. It’s a misguided attempt to cram a problem into an inappropriate hypothesis testing framework.

I have a lot of respect for the authors of this paper. They’re smart and thoughtful people. In this case, though, I think they’re in a hopeless position.

I do agree with Kahan that the problem of adjudicating between scientific hypotheses is important. I just don’t think this is the right way to do it. If you want to adjudicate between scientific hypotheses, I prefer the approach of continuous model expansion: building a larger model that includes the separate models as separate cases. Forget Wald, Savage, etc., and start from scratch.

When you add a predictor the model changes so it makes sense that the coefficients change too.

Shane Littrell writes:

I’ve recently graduated with my Masters in Science in Research Psych but I’m currently trying to get better at my stats knowledge (in psychology, we tend to learn a dumbed-down, “Stats for Dummies” version of things). I’ve been reading about “suppressor effects” in regression recently and it got me curious about some puzzling results from my thesis data.

I ran a multiple regression analysis on several predictors of academic procrastination and I noticed that two of my predictors showed some odd behavior (to me). One of them (“entitlement”) was very nonsignificant (β = -.05, p = .339) until I added “boredom” as a predictor, and it changed to (β = -.10, p = .04).

The boredom predictor also had an effect on another variable, but in the opposite way. Before boredom was added, Mastery Approach Orientation (MAP) was significant (β = -.17, p = .003) but after boredom was added it changed to (β = -.05, p = .335).

It’s almost as if Entitlement and MAP switched Beta values and significance levels once Boredom was added.

What is the explanation for this? Is this a type of suppressor effect or something else I haven’t learned about yet?

My reply: Yes, this sort of thing can happen. It is discussed in some textbooks on regression but we don’t really go into it in our book. Except we do have examples where we run a regression and then throw in another predictor and the original coefficients change. When you add a predictor the model changes so it makes sense that the coefficients change too.
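A small simulation shows the pattern (a sketch with made-up coefficients, not an analysis of Littrell’s data): a predictor whose marginal coefficient is near zero can have a clearly nonzero coefficient once the suppressor is included, because the suppressor soaks up variance that was masking the relationship.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
boredom = rng.normal(size=n)                      # hypothetical suppressor
entitlement = 0.7 * boredom + rng.normal(size=n)  # correlated with boredom
# outcome: entitlement matters, but boredom pushes the other way
y = 0.3 * entitlement - 0.5 * boredom + rng.normal(size=n)

def ols_coefs(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

b_alone = ols_coefs(entitlement[:, None], y)[1]  # near zero: effect masked
b_adjusted = ols_coefs(np.column_stack([entitlement, boredom]), y)[1]  # near 0.3
```

The coefficient on entitlement is an answer to a different question in each regression: unadjusted versus adjusted for boredom. Different question, different answer.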

Field Experiments and Their Critics

Seven years ago I was contacted by Dawn Teele, who was then a graduate student and is now a professor of political science, and asked for my comments on an edited book she was preparing on social science experiments and their critics.

I responded as follows:

This is a great idea for a project. My impression was that Angus Deaton is in favor of observational rather than experimental analysis; is this not so? If you want someone technical, you could ask Ed Vytlacil; he’s at Yale, isn’t he? I think the strongest arguments in favor of observational rather than experimental data are:

(a) Realism in causal inference. Experiments–even natural experiments–are necessarily artificial, and there are problems in generalizing beyond them to the real world. This is a point that James Heckman has made.

(b) Realism in research practice. Experimental data are relatively rare, and in the meantime we have to learn with what data we have, which are typically observational. This is the point made by Paul Rosenbaum, Don Rubin, and others who love experiments, see experiments as the gold standard, but want to make the most of their observational data. You could perhaps get Paul Rosenbaum or Rajeev Dehejia to write a good paper making this point–not saying that obs data are better than experimental data, but saying that much that is useful can be learned from obs data.

(c) The “our brains can do causal inference, so why can’t social scientists?” argument. Sort of an analogy to the argument that the traveling salesman problem can’t be so hard as all that, given that thousands of traveling salesmen solve the problem every day. The idea is that humans do (model-based) everyday causal inference all the time (every day, as it were), and we rarely use experimental data, certainly not double-blind trials. I have some sympathy for, but also some skepticism about, this argument (see attached article), but if you wanted someone who could make that argument, you could ask Niall Bolger or David Kenny or some other social psychologist or sociologist who is familiar with path analysis. Again, I doubt they’d say that observational data are better than the equivalent experiment, but they might point out that, realistically, “the equivalent experiment” isn’t always out there, and the observational data are.

(d) This issue also arises in evidence-based medicine. As far as I can tell, there are three main strands of evidence-based medicine: (i) using randomized controlled trials to compare treatments, (ii) data-based cost-benefit analyses (Qalys and the like), (iii) systematic collection and analysis of what’s actually done (i.e., observational data), thus moving medicine into a total quality control environment. You could perhaps get someone like Chris Schmid (a statistician at New England Medical Center who’s a big name in this field) to write an article about this (giving him my sentence above to give you a sense of what you’re looking for).

(e) An argument from a completely different direction is that _experimentation_ is great, but formal _randomized trials_ are overrated. The idea is that these formal experiments (in the style of NIH or, more recently, the MIT poverty lab) would be fine in and of themselves except that they (i) suck up resources and, even more importantly, (ii) dissuade people from doing everyday experimentation that they might learn from. The #1 proponent of this view is Seth Roberts, an experimental psychologist who’s written on self-experimentation.

I’d be happy to write something expanding (briefly) on the above points. I don’t feel so competent in the area to actually take any strong positions but I’d be glad to lay out what I consider are some important issues that often get lost in the debate.

A few months later I sent in my chapter, which begins:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data.

In the present article, I’ll address the following questions:

1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

2. Given point 1 above, why does almost all my research use observational data?

In confronting these issues, we must consider some general issues in the strategy of social science research. We also take from the psychology methods literature a more nuanced perspective that considers several different aspects of research design and goes beyond the simple division into randomized experiments, observational studies, and formal theory.

A few years later the book came out.

I’ve blogged on this all before, but just recently the journal Perspectives on Politics published a symposium with several reviews of the book (from Henry Brady, Yanna Krupnikov, Jessica Robinson Preece, Peregrine Schwartz-Shea, and Betsy Sinclair), and I thought it might interest some of you.

In her review, Sinclair writes, “The arguments in the book are slightly dated . . . Seven years later, there is more consensus within the experimental community about the role experiments play in addressing a research question.” I don’t quite agree with that; I think the issues under discussion remain highly relevant. I hope that soon we shall reach a point of consensus, but we’re not there yet.

I certainly would not want to join in any consensus that includes some of the more controversial Ted-talk-style experimental claims involving all the supposedly irrational influences on voting, for example. The key role of experimentation in such work is, I think, not scientific so much as meta-scientific: when a study is encased in an experimental or quasi-experimental framework, it can seem more like science, and then the people at PPNAS, NPR, etc., can take it more seriously. My recommendation is for experimentation, quasi-experimentation, and identification strategies more generally to be subsumed within larger issues of statistical measurement.

About that claim in the Monkey Cage that North Korea had “moderate” electoral integrity . . .

Yesterday I wrote about problems with the Electoral Integrity Project, a set of expert surveys that are intended to “evaluate the state of the world’s elections” but have some problems, notably rating more than half of the U.S. states in 2016 as having lower integrity than Cuba (!) and North Korea (!!!) in 2014.

I was distressed to learn that these shaky claims regarding electoral integrity have been promoted multiple times on the Monkey Cage, a blog with which I am associated. Here, for example, is that notorious map showing North Korea as having “moderate” electoral integrity in 2014.


Fragility index is too fragile

Simon Gates writes:

Here is an issue that has had a lot of publicity and Twittering in the clinical trials world recently. Many people are promoting the use of the “fragility index” (paper attached) to help interpretation of “significant” results from clinical trials. The idea is that it gives a measure of how robust the results are – how many patients would have to have had a different outcome to render the result “non-significant”.

Lots of well-known people seem to be recommending this at the moment; there’s a website too, which calculates p-values to 15 decimal places! I’m less enthusiastic. It’s good that problems of “statistical significance” are being more widely appreciated, but the fragility index is still all about “significance”, and we really need to be getting away from p-values and “significance” entirely, not trying to find better ways to use them (or shore them up).

Thought you might be interested/have some thoughts as it’s relevant to many of the issues frequently discussed on your blog.

My response: I agree, it seems like a clever idea but built on a foundation of sand.
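For concreteness, here is roughly what the index computes (my sketch, not the paper’s code: flip outcomes in the lower-event arm of a 2x2 events-by-arm table until Fisher’s exact test crosses 0.05, and count the flips):

```python
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of outcome flips in arm A (assumed the lower-event arm)
    needed to push Fisher's exact two-sided p-value above alpha.
    Returns 0 if the result is already non-significant."""
    flips = 0
    while events_a <= n_a:
        table = [[events_a, n_a - events_a], [events_b, n_b - events_b]]
        if fisher_exact(table)[1] >= alpha:
            return flips
        events_a += 1
        flips += 1
    return flips

fi = fragility_index(1, 100, 10, 100)  # a "significant" trial: FI of just a few
```

The output is a count of hypothetical patients, which sounds intuitive, but it inherits every problem of the 0.05 threshold it is built on: a tiny perturbation of the data moves a result across an arbitrary line.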

“Constructing expert indices measuring electoral integrity” — reply from Pippa Norris

This morning I posted a criticism of the Electoral Integrity Project, a survey organized by Pippa Norris and others to assess elections around the world.

Norris sent me a long response which I am posting below as is. I also invited Andrew Reynolds, the author of the controversial op-ed, to contribute to the discussion.


About that bogus claim that North Carolina is no longer a democracy . . .

Nick Stevenson directed me to a recent op-ed in the Raleigh News & Observer, where political science professor Andrew Reynolds wrote:

In 2005, in the midst of a career of traveling around the world to help set up elections in some of the most challenging places on earth . . . my Danish colleague, Jorgen Elklit, and I designed the first comprehensive method for evaluating the quality of elections around the world. . . . In 2012 Elklit and I worked with Pippa Norris of Harvard University, who used the system as the cornerstone of the Electoral Integrity Project. Since then the EIP has measured 213 elections in 153 countries and is widely agreed to be the most accurate method for evaluating how free and fair and democratic elections are across time and place. . . .

So far so good. But then comes the punchline:

In the just released EIP report, North Carolina’s overall electoral integrity score of 58/100 for the 2016 election places us alongside authoritarian states and pseudo-democracies like Cuba, Indonesia and Sierra Leone. If it were a nation state, North Carolina would rank right in the middle of the global league table – a deeply flawed, partly free democracy that is only slightly ahead of the failed democracies that constitute much of the developing world.

I searched on the web and could not find a copy of the just released EIP report but I did come across this page which lists all 50 states plus DC.

North Carolina is not even the lowest-ranked state! Alabama, Michigan, Ohio, Georgia, Rhode Island, Pennsylvania, South Carolina, Mississippi, Oklahoma, Tennessee, Wisconsin, and Arizona are lower.

Hmmm. Whassup with that?

Here’s the international map from The Year in Elections, 2014, by Pippa Norris, Ferran Martinez i Coma, and Max Grömping:

There’s North Korea in yellow, one of the countries with “moderate” electoral integrity. Indeed, go to the chart and they list North Korea as #65 out of 127 countries. The poor saps in Bulgaria and Romania are ranked #90 and 92, respectively. Clearly what they need is a dose of Kim Jong-il.

Let’s see what this measure actually is. From the report:

The survey asks experts to evaluate elections using 49 indicators, grouped into eleven categories reflecting the whole electoral cycle. Using a comprehensive instrument, listed at the end of the report, experts assess whether each national parliamentary and presidential contest meets international standards during the pre-election period, the campaign, polling day and its aftermath. The overall PEI index is constructed by summing the 49 separate indicators for each election and for each country. . . .

Around forty domestic and international experts were consulted about each election, with requests to participate sent to a total of 4,970 experts, producing an overall mean response rate of 29%. The rolling survey results presented in this report are drawn from the views of 1,429 election experts.

OK, let’s check what the experts said about North Korea; it’s on page 9 of the report:
Electoral laws 53
Electoral procedures 73
District boundaries 73
Voter registration 83
Party and candidate registration 54
Media coverage 78
Campaign finance 84
Voting process 53
Vote count 74
Results 80
Electoral authorities 60

Each of these is on a 0-100 scale with 100 being good. So, you got it, North Korea is above 50 in every category on the scale.

Who did they get to fill out this survey? Walter Duranty?

OK, let’s look more carefully. In this table, the response rate for North Korea is given as 6%. And the report said they consulted about 40 “domestic and international experts” for each election. Hmmm . . . 6% of 40 is 2.4, so maybe they got 3 respondents for North Korea, 2 of whom were Stalinists.

That 2014 report mentioned above gave North Korea a rating of 65.3 out of 100 and Cuba a rating of 65.6. Both these numbers are higher than at least 27 of the 50 U.S. states in 2016, according to the savants at the Electoral Integrity Project.

Political science, indeed.

How’s North Korea been doing lately? Stevenson writes:

North Korea is in The Year in Elections 2014 but was quietly removed from The Year in Elections 2015. It’s not a matter of the 2014 elections not being in the 2015 timeframe either – diagram 5 of The Year in Elections 2015 says ‘PEI Index 2012-2015’ and North Korea was in Diagram 1 of The Year in Elections 2014, PEI Index 2012-2014. They have North Korea in gray in the later world map as ‘Not yet covered’. On p. 73 of The Year in Elections 2015 they list their criteria for inclusion in the survey (no microstates, no Taiwan, etc) but don’t explain why PRK_09032014_L1 has suddenly gone missing.

Perhaps North Korea was too embarrassing for them?

In his email to me, Stevenson wrote:

This is terrible research that I [Stevenson] think has the potential to do real damage in the real world with their absurdly high scores for fake elections in places like Oman, Kuwait, Rwanda, and Cuba. Suppose Oman’s government arrests an opposition politician or cracks down on a peaceful demonstration and the EU and US ambassadors protest. What if the Omani government argues that according to Harvard University’s measure which is “widely agreed to be the most accurate method for evaluating how free and fair and democratic elections are across time and place”, Oman is in much better shape than many EU countries and US states and that they should get their own houses in order before criticizing others? The EIP is just as likely to serve as a freebie to repressive governments that somehow fluke a high score as it is to spur the repeal of Wisconsin’s ID law.

If Reynolds, Norris, etc., don’t like what the North Carolina legislature has been doing, fine. It could even be unconstitutional; I have no sense of such things. And I agree with the general point that there are degrees of electoral integrity or democracy or whatever. Vote suppression is not the same thing as a one-party state, and any number-juggling that suggests it is, is just silly. But, sure, put together enough restrictions and gerrymandering and ex post facto laws and so on, and that can add up.

Electoral integrity is an important issue, and it’s worth studying. In a sensible way.

What went wrong here? It all seems like an unstable combination of political ideology, academic self-promotion, credulous journalism, and plain old incompetence. Kinda like this similar thing from a few years ago with the so-called Human Development Index.

P.S. I googled *reynolds north carolina democracy* to see how much exposure this story got, and I found links to Democracy Now, Vox, Slate, Daily Caller, Common Dreams, American Thinker, MSNBC, Huffington Post, Think Progress, The Week, . . . basically a lot of obscure outlets. I write for Slate and Vox, so I was sorry to see them pick this one up.

But the good news is that the usual suspects such as ABC, NBC, CBS, CNN, NPR, BBC, NYT didn’t fall for it. I give these core media outlets such a hard time when they screw up, and they deserve our respect when they don’t take the bait on this sort of juicy, but bogus, story.

P.P.S. See here for more from Pippa Norris.

Migration explaining observed changes in mortality rate in different geographic areas?

We know that the much-discussed increase in mortality among middle-aged U.S. whites is mostly happening among women in the south.

In response to some of that discussion, Tim Worstall wrote:

I [Worstall] have a speculative answer. It is absolutely speculative: but it is also checkable to some extent.

Really, I’m channelling my usual critique of Michael Marmot’s work on health inequality in the UK. Death stats don’t measure lifespans of people from places, they measure life spans of people who die in places. So, if there’s migration, and selectivity in who migrates where, then it’s not the inequality between places that might explain differential lifespans but that selection in migration.

Similarly, here in the American case. We know that Appalachia, the Ozarks, and the smaller towns of the Midwest are emptying out. But it’s those who graduate high school, or who go off to college, who are leaving.

It’s possible, but obviously not certain, that the rising death *rates* are simply a reflection of this selectivity in migration.

I replied: This could be true, I’m not sure. I haven’t tried to crunch the numbers to see if mobility is enough to cause these changes, but on first glance it seems possible. One thing also to remember is that when comparing a particular age category over several years, we’re not comparing the same people. Today’s 50-yr-olds are not the same as next year’s 50-yr-olds. So the usual challenge is separating age, period, and cohort effects. But I agree with you that mobility is an issue too. On a related point, I questioned Case and Deaton’s comparisons by education category, because the proportion of people with college degrees etc. in different age groups has been changing over time too.

Worstall replied, “Not sure when the switch took place in the US but in my age cohort in the UK some 12% or so went to university, now it’s near 50%.” And then he followed up:

Further to the point that migration might be explaining something about these changes in average lifespans. Interesting new research from Glasgow. Seems that it’s at least part of the story there.

I guess the point is that death rates below age 65 are low enough that it doesn’t take much migration of at-risk people to move the numbers around.
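A toy composition example makes the point concrete (hypothetical rates and population counts, purely illustrative): the area-level death rate can rise with no change in anyone’s individual risk, just because lower-risk people leave.

```python
# Hypothetical mortality rates (deaths per 1000) by risk group
deaths_per_1000 = {"low_risk": 2.0, "high_risk": 8.0}

def area_rate(pop):
    """Population-weighted death rate for an area."""
    total = sum(pop.values())
    return sum(pop[g] * deaths_per_1000[g] for g in pop) / total

before = area_rate({"low_risk": 900, "high_risk": 100})  # 2.6 per 1000

# Selective out-migration: 200 low-risk people leave, nobody's risk changes
after = area_rate({"low_risk": 700, "high_risk": 100})   # 2.75 per 1000

print(before, after)  # the area's rate rises purely through composition
```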

P.S. As an aside, it’s kind of amazing that the big discussion of mortality trends was over a year ago. It seems so recent! There’s so much going on in statistics and social science, room for 400 or so posts a year, sometimes it’s hard to see how we can possibly keep it all in our heads at once.

Stan 2.14 released for R and Python; fixes bug with sampler

Stan 2.14 is out and it fixes the sampler bug in Stan versions 2.10 through 2.13.

Critical update

It’s critical to update to Stan 2.14.

The other interfaces will update when you update CmdStan.

The process

After Michael Betancourt diagnosed the bug, it didn’t take long for him to generate a test statistic so we can test this going forward, then submit a pull request for the patch and new test. I code reviewed that and made sure a clean checkout did the right thing, and then we merged. We had a few other fixes in, including one from Mitzi Morris that completed the compound declare-define feature. Then Mitzi and Daniel built the releases for the Stan math library, the core Stan C++ library, and then Daniel built the release for CmdStan. After that, Ben Goodrich and Jiqiang Guo worked on updating RStan and Allen Riddell worked through a pile of issues for PyStan, and both were released.

Stan Con coming soon!

Over 100 people have registered for the first StanCon.

It’s at Columbia University in New York on 21 January 2017.

Andrew Gelman and Michael Betancourt will be speaking, along with nine submitted talks and a closing Q&A panel. Most of the rest of us from Columbia will be there and I believe other dev team members are coming in for the event. There will be courses the two days before.

Hope to see you in New York!

Comment of the year

In our discussion of research on the possible health benefits of a low-oxygen environment, Raghu wrote:

This whole idea (low oxygen -> lower cancer risk) seems like a very straightforward thing to test in animals, which one can move to high and low oxygen environments . . .

And then Llewelyn came in for the kill:

Why do the animals always get first pick at new treatments? Seems unfair.

Transformative treatments

Kieran Healy and Laurie Paul wrote a new article, “Transformative Treatments,” (see also here) which reminds me a bit of my article with Guido, “Why ask why? Forward causal inference and reverse causal questions.” Healy and Paul’s article begins:

Contemporary social-scientific research seeks to identify specific causal mechanisms for outcomes of theoretical interest. Experiments that randomize populations to treatment and control conditions are the “gold standard” for causal inference. We identify, describe, and analyze the problem posed by transformative treatments. Such treatments radically change treated individuals in a way that creates a mismatch in populations, but this mismatch is not empirically detectable at the level of counterfactual dependence. In such cases, the identification of causal pathways is underdetermined in a previously unrecognized way. Moreover, if the treatment is indeed transformative it breaks the inferential structure of the experimental design. . . .

I’m not sure exactly where my paper with Guido fits in here, except that the idea of the “treatment” is so central to much of causal inference, that sometimes researchers seem to act as if randomization (or, more generally, “identification”) automatically gives validity to a study, as if randomization plus statistical significance equals scientific discovery. The notion of a transformative treatment is interesting because it points to a fundamental contradiction in how we typically think of causality, in that on one hand “the treatment” is supposed to be transformative and have some clearly-defined “effect,” while on the other hand the “treatment” and “control” are typically considered symmetrically in statistical models. I pick at this a bit in this 2004 article on general models for varying treatment effects.

P.S. Hey, I just remembered—I discussed this a couple of other times on this blog:

– 2013: Yes, the decision to try (or not) to have a child can be made rationally

– 2015: Transformative experiences: a discussion with L. A. Paul and Paul Bloom

“Kevin Lewis and Paul Alper send me so much material, I think they need their own blogs.”

In my previous post, I wrote:

Kevin Lewis and Paul Alper send me so much material, I think they need their own blogs.

It turns out that Lewis does have his own blog. His latest entry contains a bunch of links, starting with this one:

Populism and the Return of the “Paranoid Style”: Some Evidence and a Simple Model of Demand for Incompetence as Insurance against Elite Betrayal

Rafael Di Tella & Julio Rotemberg

NBER Working Paper, December 2016

We present a simple model of populism as the rejection of “disloyal” leaders. We show that adding the assumption that people are worse off when they experience low income as a result of leader betrayal (than when it is the result of bad luck) to a simple voter choice model yields a preference for incompetent leaders. These deliver worse material outcomes in general, but they reduce the feelings of betrayal during bad times. We find some evidence consistent with our model in a survey carried out on the eve of the recent U.S. presidential election. Priming survey participants with questions about the importance of competence in policymaking usually reduced their support for the candidate who was perceived as less competent; this effect was reversed for rural, and less educated white, survey participants.

I clicked through, and, ugh! What a forking-paths disaster! It already looks iffy from the abstract, but when you get into the details . . . ummm, let’s just say that these guys could teach Daryl Bem a thing or two.

Not Kevin Lewis’s fault; he’s just linking . . .

On the plus side, he also links to this:

Turnout and weather disruptions: Survey evidence from the 2012 presidential elections in the aftermath of Hurricane Sandy

Narayani Lasala-Blanco, Robert Shapiro & Viviana Rivera-Burgos

Electoral Studies, forthcoming

This paper examines the rational choice reasoning that is used to explain the correlation between low voter turnout and the disruptions caused by weather related phenomena in the United States. Using in-person as well as phone survey data collected in New York City where the damage and disruption caused by Hurricane Sandy varied by district and even by city blocks, we explore, more directly than one can with aggregate data, whether individuals who were more affected by the disruptions caused by Hurricane Sandy were more or less likely to vote in the 2012 Presidential Election that took place while voters still struggled with the devastation of the hurricane and unusually low temperatures. Contrary to the findings of other scholars who use aggregate data to examine similar questions, we find that there is no difference in the likelihood to vote between citizens who experienced greater discomfort and those who experienced no discomfort even in non-competitive districts. We theorize that this is in part due to the resilience to costs and higher levels of political engagement that vulnerable groups develop under certain institutional conditions.

I like this paper, but then again I know Narayani and Bob personally, so you can make of this what you will.

P.S. Although I think the “Populism and the Return of the Paranoid Style” paper is really bad, I recognize the importance of the topic, and I assume the researchers on this project were doing their best. It is worth another post or article explaining how better to address such questions and analyze this sort of data. My quick suggestion is that each causal question deserves its own study, and I don’t think it’s going to work so well to sift through a pile of data pulling out statistically significant comparisons, dismissing results that don’t fit your story, and labeling results that you like as “significant at the 7% level.” It’s not that there’s anything magic about a 5% significance level, it’s that you want to look at all of your comparisons, and you’re asking for trouble if you keep coming up with reasons to count or discard patterns.

Two unrelated topics in one post: (1) Teaching useful algebra classes, and (2) doing more careful psychological measurements

Kevin Lewis and Paul Alper send me so much material, I think they need their own blogs. In the meantime, I keep posting the stuff they send me, as part of my desperate effort to empty my inbox.

1. From Lewis:

“Should Students Assessed as Needing Remedial Mathematics Take College-Level Quantitative Courses Instead? A Randomized Controlled Trial,” by A. W. Logue, Mari Watanabe-Rose, and Daniel Douglas, which begins:

Many college students never take, or do not pass, required remedial mathematics courses theorized to increase college-level performance. Some colleges and states are therefore instituting policies allowing students to take college-level courses without first taking remedial courses. However, no experiments have compared the effectiveness of these approaches, and other data are mixed. We randomly assigned 907 students to (a) remedial elementary algebra, (b) that course with workshops, or (c) college-level statistics with workshops (corequisite remediation). Students assigned to statistics passed at a rate 16 percentage points higher than those assigned to algebra (p < .001), and subsequently accumulated more credits. A majority of enrolled statistics students passed. Policies allowing students to take college-level instead of remedial quantitative courses can increase student success.

I like the idea of teaching statistics instead of boring algebra. That said, I think if algebra were taught well, it would be as useful as statistics. I think the most important parts of statistics are not the probabilistic parts so much as the quantitative reasoning. You can use algebra to solve lots of problems. For example, this age adjustment story is just a bunch of algebra. Algebra + data. But there’s no reason algebra has to be data-free, right?

Meanwhile, intro stat can be all about p-values, and then I hate it.

So what I’d really like to see is good intro quantitative classes. Call it algebra or call it real-world math or call it statistics or call it data science, I don’t really care.

2. Also from Lewis:

“Less Is More: Psychologists Can Learn More by Studying Fewer People,” by Matthew Normand, who writes:

Psychology has been embroiled in a professional crisis as of late. . . . one problem has received little or no attention: the reliance on between-subjects research designs. The reliance on group comparisons is arguably the most fundamental problem at hand . . .

But there is an alternative. Single-case designs involve the intensive study of individual subjects using repeated measures of performance, with each subject exposed to the independent variable(s) and each subject serving as their own control. . . .

Normand talks about “single-case designs,” which we also call “within-subject designs.” (Here we’re using experimental jargon in which the people participating in a study are called “subjects.”) Whatever terminology is being used, I agree with Normand. This is something Eric Loken and I have talked about a lot, that many of the horrible Psychological Science-style papers we’ve discussed use between-subject designs to study within-subject phenomena.

A notorious example was that study of ovulation and clothing, which posited hormonally-correlated sartorial changes within each woman during the month, but estimated this using a purely between-person design, with only a single observation for each woman in their survey.

Why use between-subject designs for studying within-subject phenomena? I see a bunch of reasons. In no particular order:

1. The between-subject design is easier, both for the experimenter and for any participant in the study. You just perform one measurement per person. No need to ask people a question twice, or follow them up, or ask them to keep a diary.

2. Analysis is simpler for the between-subject design. No need to worry about longitudinal data analysis or within-subject correlation or anything like that.

3. Concerns about poisoning the well. Ask the same question twice and you might be concerned that people are remembering their earlier responses. This can be an issue, and it’s worth testing for such possibilities and doing your measurements in a way to limit these concerns. But it should not be the deciding factor. Better a within-subject study with some measurement issues than a between-subject study that’s basically pure noise.

4. The confirmation fallacy. Lots of researchers think that if they’ve rejected a null hypothesis at a 5% level with some data, that they’ve proved the truth of their preferred alternative hypothesis. Statistically significant, so case closed, is the thinking. Then all concerns about measurements get swept aside: After all, who cares if the measurements are noisy, if you got significance? Such reasoning is wrong wrong wrong but lots of people don’t understand.
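To see how much precision is at stake in points 1-4, here’s a simulation with my own illustrative numbers (not from Normand’s paper): a small within-person effect, large person-to-person variation. The between-subject comparison must fight through the person-level variation; the within-subject design cancels it.

```python
import random, statistics, math

random.seed(0)
n = 200          # subjects (hypothetical)
delta = 0.5      # true within-person treatment effect
sd_between = 3.0 # person-to-person variation in baseline
sd_noise = 1.0   # measurement noise

baseline = [random.gauss(0, sd_between) for _ in range(n)]

# Between-subject design: one measurement per person, one condition each
half = n // 2
y_treat = [b + delta + random.gauss(0, sd_noise) for b in baseline[:half]]
y_ctrl = [b + random.gauss(0, sd_noise) for b in baseline[half:]]
se_between = math.sqrt(statistics.variance(y_treat) / half
                       + statistics.variance(y_ctrl) / half)

# Within-subject design: both conditions per person; baselines cancel
diffs = [(b + delta + random.gauss(0, sd_noise))
         - (b + random.gauss(0, sd_noise)) for b in baseline]
se_within = statistics.stdev(diffs) / math.sqrt(n)

print(se_between, se_within)  # between-SE is several times larger
```

With these numbers the between-subject standard error is roughly four times the within-subject one, i.e., you’d need something like 16 times the sample size to match the precision.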

Also relevant to this reduce-N-and-instead-learn-more-from-each-individual-person’s-trajectory perspective is this conversation I had with Seth about ten years ago.

“The Pitfall of Experimenting on the Web: How Unattended Selective Attrition Leads to Surprising (Yet False) Research Conclusions”


Kevin Lewis points us to this paper by Haotian Zhou and Ayelet Fishbach, which begins:

The authors find that experimental studies using online samples (e.g., MTurk) often violate the assumption of random assignment, because participant attrition—quitting a study before completing it and getting paid—is not only prevalent, but also varies systemically across experimental conditions. Using standard social psychology paradigms (e.g., ego-depletion, construal level), they observed attrition rates ranging from 30% to 50% (Study 1). The authors show that failing to attend to attrition rates in online panels has grave consequences. By introducing experimental confounds, unattended attrition misled them to draw mind-boggling yet false conclusions: that recalling a few happy events is considerably more effortful than recalling many happy events, and that imagining applying eyeliner leads to weight loss (Study 2). In addition, attrition rate misled them to draw a logical yet false conclusion: that explaining one’s view on gun rights decreases progun sentiment (Study 3). The authors offer a partial remedy (Study 4) and call for minimizing and reporting experimental attrition in studies conducted on the Web.
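The mechanism in the abstract is easy to reproduce in a toy simulation (hypothetical numbers and setup, not the authors’ actual paradigms): assignment is perfectly random and the treatment does nothing, but if one condition drives out low-motivation participants before they finish, the completers in that arm look systematically better.

```python
import random, statistics

random.seed(2)
n = 10000
# Random assignment; treatment has zero true effect on the outcome
motivation = [random.gauss(0, 1) for _ in range(n)]
outcome = motivation[:]  # outcome tracks motivation, same in both arms

# "Hard" condition: low-motivation participants quit before finishing
completed_hard = [o for m, o in zip(motivation, outcome) if m > -0.5]
completed_easy = outcome  # "easy" condition: everyone completes

attrition = 1 - len(completed_hard) / n   # roughly 30%, as in the paper's range
bias = statistics.mean(completed_hard) - statistics.mean(completed_easy)
print(attrition, bias)  # a sizable spurious "treatment effect"
```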

I started to read this but my attention wandered before I got to the end; I was on the internet at the time and got distracted by a bunch of cat pictures, lol.

“I thought it would be most unfortunate if a lab . . . wasted time and effort trying to replicate our results.”

Mark Palko points us to this news article by George Dvorsky:

A Harvard research team led by biologist Douglas Melton has retracted a promising research paper following multiple failed attempts to reproduce the original findings. . . .

In June 2016, the authors published an article in the open access journal PLOS One stating that the original study had deficiencies. Yet this peer-reviewed admission was not accompanied by a retraction. Until now.

Melton told Retraction Watch that he finally decided to issue the retraction to ensure zero confusion about the status of the paper, saying, “I thought it would be most unfortunate if a lab missed the PLOS ONE paper, then wasted time and effort trying to replicate our results.”

He said the experience was a valuable one, telling Retraction Watch, “It’s an example of how scientists can work together when they disagree, and come together to move the field forward . . . The history of science shows it is not a linear path.”

True enough. Each experiment, successful or not, takes us a step closer to an actual cure.

Are you listening, John Bargh? Roy Baumeister?? Andy Yap??? Editors of the Lancet???? Ted talk people????? NPR??????

I guess the above could never happen in a field like psychology, where the experts assure us that the replication rate is “statistically indistinguishable from 100%.”

In all seriousness, I’m glad that Melton and his colleagues recognize that there’s a cost to presenting shaky work as solid and thus sending other research teams down blind alleys for years or even decades. I don’t recall any apologies on those grounds ever coming from the usual never-admit-error crowd.

Sorry, but no, you can’t learn causality by looking at the third moment of regression residuals

Under the subject line “Legit?”, Kevin Lewis pointed me to this press release, “New statistical approach will help researchers better determine cause-effect.” I responded, “No link to any of the research papers, so cannot evaluate.”

In writing this post I thought I’d go further. The press release mentions 6 published articles so I googled the first one, from the British Journal of Mathematical and Statistical Psychology (hey, I’ve published there!) and found this paper, “Significance tests to determine the direction of effects in linear regression models.”

Uh oh, significance tests. It’s almost like they’re trying to piss me off!

I’m traveling so I can’t get access to the full article. From the abstract:

Previous studies have discussed asymmetric interpretations of the Pearson correlation coefficient and have shown that higher moments can be used to decide on the direction of dependence in the bivariate linear regression setting. The current study extends this approach by illustrating that the third moment of regression residuals may also be used to derive conclusions concerning the direction of effects. Assuming non-normally distributed variables, it is shown that the distribution of residuals of the correctly specified regression model (e.g., Y is regressed on X) is more symmetric than the distribution of residuals of the competing model (i.e., X is regressed on Y). Based on this result, 4 one-sample tests are discussed which can be used to decide which variable is more likely to be the response and which one is more likely to be the explanatory variable. A fifth significance test is proposed based on the differences of skewness estimates, which leads to a more direct test of a hypothesis that is compatible with direction of dependence. . . .

The third moment of regression residuals??? This is nuts!

OK, I can see the basic idea. You have a model in which x causes y; the model looks like y = x + error. The central limit theorem tells you, roughly, that y should be more normal-looking than x, hence all those statistical tests.
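That intuition can be checked in a quick simulation (a hypothetical setup of my own, not the paper’s actual tests): with a skewed x and symmetric error, the residuals of the correctly specified regression of y on x stay roughly symmetric, while the reversed regression’s residuals inherit skewness from x.

```python
import random

random.seed(1)
n = 20000
x = [random.expovariate(1.0) for _ in range(n)]  # skewed "cause"
y = [xi + random.gauss(0, 1) for xi in x]        # y = x + symmetric error

def residual_skew(pred, resp):
    """Fit resp ~ pred by least squares; return skewness of residuals."""
    mp, mr = sum(pred) / len(pred), sum(resp) / len(resp)
    b = (sum((p - mp) * (r - mr) for p, r in zip(pred, resp))
         / sum((p - mp) ** 2 for p in pred))
    a = mr - b * mp
    res = [r - (a + b * p) for p, r in zip(pred, resp)]
    m = sum(res) / len(res)
    s2 = sum((e - m) ** 2 for e in res) / len(res)
    return sum((e - m) ** 3 for e in res) / len(res) / s2 ** 1.5

print(residual_skew(x, y))  # correct direction: skewness near zero
print(residual_skew(y, x))  # reversed direction: clearly skewed
```

So the method does pick up a real signal under these assumptions; my complaint below is about how fragile those assumptions are in practice, not about the algebra.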

Really, though, this is going to depend so much on how things are measured. I can’t imagine it will be much help in understanding causation. Actually, I think it will hurt in that if anyone takes it seriously, they’ll just muddy the waters with various poorly-supported claims. Nothing wrong with doing some research in this area, but all that hype . . . jeez!