## Objects of the class “George Orwell”

George Orwell is an exemplar in so many ways: a famed truth-teller who made things up, a left-winger who mocked left-wingers, an author of a much-misunderstood novel (see “Objects of the class ‘Sherlock Holmes’”), and probably a few dozen more.

But here I’m talking about Orwell’s name being used as an adjective. More specifically, “Orwellian” is used to refer to the sort of doublespeak that Orwell deplored. When someone says something is Orwellian, they mean it’s something that Orwell would’ve hated.

Another example: Kafkaesque. A Kafkaesque world is not something Kafka would’ve wanted.

Just to be clear: I’m not saying there’s anything wrong with referring to doublespeak as Orwellian—the man did write a lot about it! It’s just interesting to think of things named after people who hated them.

## Emails I never bothered to answer

So, this came in the email one day:

Dear Professor Gelman,

I would like to shortly introduce myself: I am editor in the ** Department at the publishing house ** (based in ** and **).

As you may know, ** has taken over all journals of ** Press. We are currently restructuring some of the journals and are therefore looking for new editors for the journal **.

You have published in the journal, you work in the field . . . your name was recommended by Prof. ** as a potential editor for the journal. . . . We think you would be an excellent choice and I would like to ask you kindly whether you are interested to become an editor of the journal. In case you are interested (and even if you are not), we would be glad if you could maybe recommend us some additional potential candidates who could be interested to get involved with **. We are looking for a several editors who will cover the different areas of the field.

If you have any questions, I will gladly provide you with more information.

I look forward to hearing from you,

with best regards

**

Ummm, don’t take this the wrong way, but . . . why is it exactly that you think I would want to work for free on a project, just to make money for you?

## Christmas special: Survey research, network sampling, and Charles Dickens’ coincidences

It’s Christmas so what better time to write about Charles Dickens . . .

In traditional survey research we have been spoiled. If you work with atomistic data structures, a small sample looks like a little bit of the population. But a small sample of a network doesn’t look like the whole. For example, if you take a network and randomly sample some nodes, and then look at the network of all the edges connecting these nodes, you’ll get something much more sparse than the original. Concretely, suppose Alice knows Bob who knows Cassie who knows Damien, but Alice does not happen to know Damien directly. If only Alice and Damien are selected, they will appear to be disconnected because the missing links are not in the sample.

This brings us to a paradox of literature. Charles Dickens, like Tom Wolfe more recently, was celebrated for his novels that reconstructed an entire society, from high to low, in miniature. But Dickens is also notorious for his coincidences: his characters all seem very real but they’re always running into each other on the street (as illustrated in the map above, which comes from David Perdue) or interacting with each other in strange ways, or it turns out that somebody is somebody else’s uncle. How could this be, that Dickens’s world was so lifelike in some ways but filled with these unnatural coincidences?

My contention is that Dickens was coming up with his best solution to an unsolvable problem, which is to reproduce a network given a small sample. What is a representative sample of a network? If London has a million people and I take a sample of 100, what will their network look like? It will look diffuse and atomized because of all those missing connections. The network of this sample of 100 doesn’t look anything like the larger network of Londoners, any more than a disconnected set of human cells would look like a little person.
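Here is a minimal sketch of that thinning, under assumptions of my own: a random network of 20,000 people (not a million, just to keep it fast) in which each person knows 10 others on average, and a simple random sample of 100 of them. The function names and all the numbers are hypothetical; nothing here is calibrated to London, Victorian or otherwise.

```python
import random


def mean_degree(num_nodes, edges):
    """Average number of acquaintances per person."""
    return 2 * len(edges) / num_nodes


def induced_edges(edges, sampled):
    """Keep only the edges whose two endpoints were both sampled."""
    return [(a, b) for a, b in edges if a in sampled and b in sampled]


def simulate(num_nodes=20_000, target_mean_degree=10, sample_size=100, seed=1):
    # Assumed toy model: edges are placed between uniformly random pairs.
    rng = random.Random(seed)
    num_edges = num_nodes * target_mean_degree // 2
    edges = set()
    while len(edges) < num_edges:
        a, b = rng.sample(range(num_nodes), 2)
        edges.add((min(a, b), max(a, b)))
    # Simple random sample of people, then the network among them only.
    sampled = set(rng.sample(range(num_nodes), sample_size))
    kept = induced_edges(edges, sampled)
    return mean_degree(num_nodes, edges), mean_degree(sample_size, kept)


full, sub = simulate()
print(f"mean degree in the full network:   {full:.2f}")
print(f"mean degree among the sampled 100: {sub:.2f}")
```

In this toy setup the expected mean degree within the sample is roughly the full mean degree times the sampling fraction, so most of the sampled people end up knowing nobody else in the sample.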

So to construct something with realistic network properties, Dickens had to artificially fill in the network, to create the structure that would represent the interactions in society. You can’t make a flat map of the world that captures the shape of a globe; any projection makes compromises. Similarly you can’t take a sample of people and capture all its network properties, even in expectation: if we want the network density to be correct, we need to add in links, “coincidences” as it were. The problem is, we’re not used to thinking this way because with atomized analysis, we really can create samples that are basically representative of the population. With networks you can’t.

This may be the first, and last, bit of literary criticism to appear in the Journal of Survey Statistics and Methodology.

## How to include formulas (LaTeX) and code blocks in WordPress posts and replies

It’s possible to include LaTeX formulas like $\int e^x \, \mathrm{d}x$. I entered it as $latex \int e^x \, \mathrm{d}x$.

You can also generate code blocks like this

for (n in 1:N)
  y[n] ~ normal(0, 1);


The way to format them is to use <pre> to open the code block and </pre> to close it.

You can create links using the anchor (a) tag.

You can also quote someone else, like our friend lorem ipsum,

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

You open with <blockquote> and close with </blockquote>.

You can add bold (tag inside angle brackets is b), italics (tag is i) and typewriter text (tag is tt), but our idiotic style makes typewriter text smaller, so you need to wrap it in a big tag for it to render the same size as surrounding text.

The full set of tags allowed is:

address, a, abbr, acronym, area, article, aside, b, big,
blockquote, br, caption, cite, class, code, col, del,
details, dd, div, dl, dt, em, figure, figcaption, footer,
font, h1, h2, h3, h4, h5, h6, header, hgroup, hr, i,
img, ins, kbd, li, map, ol, p, pre, q, s, section, small,
span, strike, strong, sub, summary, sup, table, tbody,
td, tfoot, th, thead, tr, tt, u, ul, var


For more details, see: https://en.support.wordpress.com/code/

Too bad there’s no way for users without admin privileges to edit their work. It’s fiddly getting LaTeX or HTML right on the first try.

After some heavy escaping, you deserve some comic relief; it’ll give you some hint at what I had to do to show what I entered to you without it rendering.
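For the curious, the escaping boils down to replacing the angle brackets with HTML entities so that the browser shows the tag instead of interpreting it. Here is a small sketch using Python’s standard `html.escape`; the helper name `show_literal` is made up for illustration, and I’m assuming the comment form passes entities through unchanged.

```python
import html


def show_literal(tag):
    """Return text that displays `tag` literally instead of rendering it."""
    # html.escape turns <, > and & into &lt;, &gt; and &amp;
    return html.escape(tag)


for tag in ["<pre>", "</pre>", "<blockquote>", "</blockquote>"]:
    print(f"to display {tag} you would type {show_literal(tag)}")
```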

## p=.03, it’s gotta be true!

Howie Lempel writes:

Showing a white person a photo of Obama w/ artificially dark skin instead of artificially lightened skin before asking whether they support the Tea Party raises their probability of saying “yes” from 12% to 22%. 255 person Amazon Turk and Craigs List sample, p=.03.

Nothing too unusual about this one. But it’s particularly grating when hyper educated liberal elites use shoddy research to decide that their political opponents only disagree with them because they’re racist.

https://www.washingtonpost.com/news/wonk/wp/2016/05/13/how-psychologists-used-these-doctored-obama-photos-to-get-white-people-to-support-conservative-politics/

https://news.stanford.edu/2016/05/09/perceived-threats-racial-status-drive-white-americans-support-tea-party-stanford-scholar-says/

Hey, they could have a whole series of this sort of experiment:

– Altering the orange hue of Donald Trump’s skin and seeing if it affects how much people trust the guy . . .

– Making Hillary Clinton fatter and seeing if that somehow makes her more likable . . .

– Putting glasses on Rick Perry to see if that affects perceptions of his intelligence . . .

– Altering the shape of Elizabeth Warren’s face to make her look even more like a Native American . . .

The possibilities are endless. And, given the low low cost of Mechanical Turk and Craig’s List, surprisingly affordable. The pages of Psychological Science, PPNAS, and Frontiers in Psychology are wide open to you. As the man says, Never say no!

P.S. Just to be clear: I’m not saying that the above-linked conclusions are wrong or that such studies are inherently ridiculous. I just think you have to be careful about how seriously you take claims from reported p-values.

## Sethi on Schelling

Interesting appreciation from an economist.

## “Dirty Money: The Role of Moral History in Economic Judgments”

Recently in the sister blog . . . Arber Tasimi and his coauthor write:

Although traditional economic models posit that money is fungible, psychological research abounds with examples that deviate from this assumption. Across eight experiments, we provide evidence that people construe physical currency as carrying traces of its moral history. In Experiments 1 and 2, people report being less likely to want money with negative moral history (i.e., stolen money). Experiments 3–5 provide evidence against an alternative account that people’s judgments merely reflect beliefs about the consequences of accepting stolen money rather than moral sensitivity. Experiment 6 examines whether an aversion to stolen money may reflect contamination concerns, and Experiment 7 indicates that people report they would donate stolen money, thereby counteracting its negative history with a positive act. Finally, Experiment 8 demonstrates that, even in their recall of actual events, people report a reduced tendency to accept tainted money. Altogether, these findings suggest a robust tendency to evaluate money based on its moral history, even though it is designed to participate in exchanges that effectively erase its origins.

I’m not a big fan of the graphs in this paper (and don’t get me started on the tables!), but the experiments are great. I love this stuff.

## You Won’t BELIEVE How Trump Broke Up This Celebrity Couple!

A few months ago I asked if it was splitsville for tech zillionaire Peter Thiel and chess champion Garry Kasparov, after seeing this quote from Kasparov in April:

Trump sells the myth of American success instead of the real thing. . . . It’s tempting to rally behind him-but we should resist. Because the New York values Trump represents are the very worst kind. . . . He may have business experience, but unless the United States plans on going bankrupt, it’s experience we don’t need.

and this news item from May:

Thiel, co-founder of PayPal and Palantir and a director at Facebook, is now a Trump delegate in San Francisco, according to a Monday filing.

Based on this recent interview, I suspect the bromance is fully over.

I guess we can forget about Kasparov and Thiel ever finishing that book.

P.S. Commenter Ajg at above-linked post gets credit for the title of this post.

## This is not news.

Anne Pier Salverda writes:

I’m not sure if you’re keeping track of published failures to replicate the power posing effect, but this article came out earlier this month: “Embodied power, testosterone, and overconfidence as a causal pathway to risk-taking.”

From the abstract:

We were unable to replicate the findings of the original study and subsequently found no evidence for our extended hypotheses.

Gotta love that last sentence of the abstract:

As our replication attempt was conducted in the Netherlands, we discuss the possibility that cultural differences may play a moderating role in determining the physiological and psychological effects of power posing.

Let’s just hope that was a joke. Jokes are ok in academic papers, right?

## Michael found the bug in Stan’s new sampler

Gotcha!

Michael found the bug!

That was a lot of effort, during which time he produced ten pages of dense LaTeX to help Daniel and me understand the algorithm enough to help debug (we’re trying to write a bunch of these algorithmic details up for a more general audience, so stay tuned).

So what was the issue?

In Michael’s own words:

There were actually two bugs. The first is that the right subtree needs its own rho in order to compute the correct termination criterion. The second is that in order to compute the termination criterion you need the points on the left and right of each subtree (the orientation of left and right relative to forwards and backwards depends on in which direction you’re trying to extend the trajectory). That means you have to do one leapfrog step and take that point as left, then do the rest of the leapfrog steps and take the final point as right. But right now I’m taking the initial point as left, which is one off. A small difference (especially as the step size is decreased!) but enough to bias the samples.

I redacted the saltier language (sorry if that destroyed the flavor of the message, Michael [pun intended; this whole bug hunt has left me a bit punchy]).

I responded:

That is a small difference—amazing it has that much effect on sampling. These things are obviously balanced on a knife edge.

Michael then replied:

Well the effect is pretty small and is significant only when you need extreme precision, so it’s not entirely surprising [that our tests didn’t catch it] in hindsight. The source of the problem also explains why the bias went down as the step size was decreased. It also gives a lot of confidence in the general validity of previous results.

I’m just glad all that math was correct!

Whew. Me, too. Especially since the new approach seems both more efficient and more robust.

What do you mean by “new approach”?

Michael replaced the original NUTS algorithm’s slice sampler with a discrete sampler, which trickles through a bunch of the algorithmic steps, such as whether to jump to the latest subtree being built. We (by which I mean Michael) have also been making incremental changes to the adaptation. These started early on when we broke adaptation down into a step size and a regularized mass matrix estimate and then allowed dense mass matrices.

When will Stan be fixed?

It’ll take a few days for us to organize the new code and then a few more days to push it through the interfaces. Definitely in time for StanCon (100+ registrants and counting, with plenty of submitted case studies).

## You’ll have to figure this one out for yourselves.

So. The other day the following email comes in, subject line “Grabbing headlines using poor statistical methods,” from Clifford Anderson-Bergman:
Continue reading ‘You’ll have to figure this one out for yourselves.’ »

## Stan 2.10 through Stan 2.13 produce biased samples

[Update: bug found! See the follow-up post, Michael found the bug in Stan’s new sampler]

[Update: rolled in info from comments.]

After all of our nagging of people to use samplers that produce unbiased samples, we are mortified to have to announce that Stan versions 2.10 through 2.13 produce biased samples.

### The issue

Thanks to Matthew R. Becker for noticing this with a simple bivariate example and for filing the issue with a reproducible example:

### The change to Stan

Stan 2.10 changed the NUTS algorithm from using slice sampling along a Hamiltonian trajectory to a new algorithm that uses categorical sampling of points along the trajectory proportional to the density (plus biases to the second half of the chain, which is a subtle aspect of the original NUTS algorithm). The new approach is described here:

From Michael Betancourt on Stan’s users group:

Let me temper the panic by saying that the bias is relatively small and affects only variances but not means, which is why it snuck through all our testing and application analyses. Ultimately posterior intervals are smaller than they should be, but not so much that the inferences are misleading and the shrinkage will be noticeable only if you have more than thousands of effective samples, which is much more than we typically recommend.

### What we’re doing to fix it

Michael and I are poring over the proofs and the code, but it’s unfortunate timing with the holidays here as everyone’s traveling. We’ll announce a fix and make a new release as soon as we can. Let’s just say this is our only priority at the moment.

If all else fails, we’ll roll back the sampler to the 2.09 version in a couple days and do a new release with all the other language updates and bug fixes since then.

### What you can do until then

While some people seem to think the error is of small enough magnitude not to be worrisome (see comments), we’d rather see you all getting the right answers. Until we get this fixed, the only thing I can recommend is using straight up static HMC (which is not broken in the Stan releases) or rolling back to Stan 2.09 (easy with CmdStan, not sure how to do that with other interfaces).

### Even diagnosing problems like these is hard

Matthew Becker, the original poster, diagnosed the problem with fake data simulations, but it required a lot of effort.

The bug Matthew Becker reported was for this model:

parameters {
  vector[2] z;
}

model {
  matrix[2,2] sigma;
  vector[2] mu;

  mu[1] <- 0.0;
  mu[2] <- 3.0;
  sigma[1][1] <- 1.0 * 1.0;
  sigma[1][2] <- 0.5 * 1.0 * 2.0;
  sigma[2][1] <- 0.5 * 1.0 * 2.0;
  sigma[2][2] <- 2.0 * 2.0;

  z ~ multi_normal(mu, sigma);
}


So it's just a simple multivariate normal with 50% correlation and reasonably small locations and scales. It led to this result in Stan 2.13. It used four chains of 1M iterations each:

Inference for Stan model: TWOD_Gaussian_c1141a5e1a103986068b426ecd9ef5d2.
4 chains, each with iter=1000000; warmup=100000; thin=1;
post-warmup draws per chain=900000, total post-warmup draws=3600000.


and led to this posterior summary:

          mean  se_mean       sd     2.5%      25%      50%      75%    97.5%    n_eff     Rhat
z[0]   -5.1e-4   1.9e-3     0.95    -1.89    -0.61   7.7e-4     0.61     1.88   235552      1.0
z[1]       3.0   3.9e-3      1.9    -0.77     1.77      3.0     4.23     6.77   234959      1.0
lp__     -1.48   2.3e-3     0.96    -4.09    -1.84    -1.18     -0.8    -0.57   179274      1.0


rather than the correct (as known analytically and verified by our static HMC implementation):

          mean  se_mean       sd     2.5%      25%      50%      75%    97.5%    n_eff     Rhat
z[0]    6.7e-5   5.2e-4     0.99    -1.95    -0.67  -1.8e-4     0.67     1.95    3.6e6      1.0
z[1]       3.0   1.1e-3     1.99    -0.92     1.66      3.0     4.34     6.91    3.6e6      1.0
lp__     -1.54   6.8e-3      1.0    -4.25    -1.92    -1.23    -0.83    -0.57    21903      1.0


In particular, you can see that the posterior sd is too low for NUTS (though not by much: 0.95 vs. 1.0), and the posterior 95% intervals are (-1.89, 1.88) rather than (-1.95, 1.95) for z[1] (here, for some reason, listed as "z[0]").
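Since the target is a bivariate normal, the correct marginal summaries can be written down directly and compared against either table. The sketch below is my own check, not part of Matthew Becker's report: it reads the marginal means (0 and 3) and standard deviations (1 and 2) off the model above and uses Python's standard `statistics.NormalDist`; the helper name `central_interval` is hypothetical.

```python
from statistics import NormalDist


def central_interval(mean, sd, level=0.95):
    """Analytic central interval of a normal marginal."""
    tail = (1 - level) / 2
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


# Marginals implied by the model: sd = sqrt of the diagonal of sigma.
marginals = {"z[1]": (0.0, 1.0), "z[2]": (3.0, 2.0)}

for name, (mean, sd) in marginals.items():
    lo, hi = central_interval(mean, sd)
    print(f"{name}: sd = {sd:.2f}, 95% interval = ({lo:.2f}, {hi:.2f})")
```

The analytic endpoints are about ±1.96 for the first component and about (-0.92, 6.92) for the second, which is close to the static HMC table and visibly wider than the NUTS one.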

Again, our sincere apologies here for messing up so badly. I hope everyone can forgive us. It is going to cause us to focus considerable energy on functional tests that'll diagnose these issues. It's a challenging problem to balance the sensitivity and specificity of such tests.

## Steve Fienberg

I did not know Steve Fienberg well, but I met him several times and encountered his work on various occasions, which makes sense considering his research area was statistical modeling as applied to social science.

Fienberg’s most influential work must have been his books on the analysis of categorical data, work that was ahead of its time in being focused on the connection between models rather than hypothesis tests. He also wrote, with William Mason, the definitive paper on identification in age-period-cohort models, and he worked on lots of applied problems including census adjustment, disclosure limitation, and statistics in legal settings. The common theme in all this work is the combination of information from multiple sources, and the challenges involved in taking statistical inferences using these to make decisions in new settings. These ideas of integration and partial pooling are central to Bayesian data analysis, and so it makes sense that Fienberg made use of Bayesian methods throughout his career, and that he was a strong presence in the Carnegie Mellon statistics department, which has been one of the important foci of Bayesian research and education during the past few decades.

Fienberg’s CMU obituary quotes statistician and former Census Bureau director Bob Groves as saying,

Steve Fienberg’s career has no analogue in my [Groves’s] lifetime. . . . He contributed to advancements in theoretical statistics while at the same time nurturing the application of statistics in fields as diverse as forensic science, cognitive psychology, and the law. He was uniquely effective in his career because he reached out to others, respected them for their expertise, and perceptively saw connections among knowledge domains when others couldn’t see them. He thus contributed both to the field of statistics and to the broader human understanding of the world.

I’d say it slightly differently. I disagree that Fienberg’s career is unique in the way that Groves states. Others of Fienberg’s generation such as Don Rubin and Nan Laird have similarly made important theoretical or methodological contributions while also actively working on a broad variety of live applications. One can also point to researchers such as James Heckman and Lewis Sheiner who have come from outside to make important contributions to statistics while also doing important work in their own fields. And, to go to the next generation, I can for example point to my collaborators John Carlin and David Dunson, both of whom have had deep statistical insights while also contributing to the reform and development of their fields of application.

But please don’t take my qualification of Groves’s statement to be a criticism of Fienberg. Rather consider it as a plus. Fienberg is a model of an important way to be a statistician: to be someone deeply engaged with a variety of applied projects while at the same time making fundamental contributions to the core of statistics. Or, to put it another way, to work on statistical theory and methodology in the context of a deep engagement with a wide range of applications.

Lionel Trilling famously wrote this about George Orwell:

Orwell, by reason of the quality that permits us to say of him that he was a virtuous man, is a figure in our lives. He was not a genius, and this is one of the remarkable things about him. His not being a genius is an element of the quality that makes him what I am calling a figure. . . . if we ask what it is he stands for, what he is the figure of, the answer is: the virtue of not being a genius, of fronting the world with nothing more than one’s simple, direct, undeceived intelligence, and a respect for the powers one does have, and the work one undertakes to do. . . . what a relief! What an encouragement. For he communicates to us the sense that what he has done any one of us could do.

Or could do if we but made up our mind to do it, if we but surrendered a little of the cant that comforts us, if for a few weeks we paid no attention to the little group with which we habitually exchange opinions, if we took our chance of being wrong or inadequate, if we looked at things simply and directly, having only in mind our intention of finding out what they really are . . . He tells us that we can understand our political and social life merely by looking around us, he frees us from the need for the inside dope.

George Orwell is one of my heroes. I am not saying that Steve Fienberg is the George Orwell of statistics, whatever that would mean. What I do think is that the above traits identified by Trilling are related to what I admire most about Fienberg, and this is why I think it’s a fine accomplishment indeed for Fienberg to have not been a unique example of a statistician contributing both to theory and applications but an exemplar of this type. Laplace, Galton, and Fisher also fall in this category but none of us today can hope to match the scale of their contributions. Fienberg through his efforts changed the world in some small bit, as we all should hope to do.

## On deck very soon

A bunch of the 170 are still in the queue. I haven’t been adding to the scheduled posts for a while. Instead I’ve been inserting topical items from time to time (I even got some vicious hate mail for my article on the electoral college), and I’ve been shoving material for new posts into a big file that now has a couple hundred items. I’m not quite sure what to do with that one. Maybe I’ll write all my posts for 2017 on a single day and get that over with? Also, sometimes our co-bloggers post here, and that’s cool.

Anyway, three people emailed me today about a much-publicized science news item that pissed them off. It’s not really topical but maybe I’ll post on it, just to air the issue out. And I have a couple literary ideas I wanted to share. So maybe I’ll do something I haven’t done for a few months, and bump a few of this week’s posts to the end of the queue.

Before that, though, I have a post that is truly topical and yet will never be urgent. I’ll schedule it to appear in the usual slot, between 9 and 10 in the morning.

## Low correlation of predictions and outcomes is no evidence against hot hand

Josh Miller (of Miller & Sanjurjo) writes:

On correlations, you know, the original Gilovich, Vallone, and Tversky paper found that the Cornell players’ “predictions” of their teammates’ shots correlated 0.04, on average. No evidence they can see the hot hand, right?

Here is an easy correlation question: suppose Bob shoots with probability ph=.55 when he is hot and pn=.45 when he is not hot. Suppose Lisa can perfectly detect when he is hot, and when he is not. If Lisa predicts based on her perfect ability to detect when Bob is hot, what correlation would you expect?

With that setup, I could only assume the correlation would be low.

I did the simulation:

> n <- 10000
> bob_probability <- rep(c(.55,.45),c(.13,.87)*n)
> lisa_guess <- round(bob_probability)
> bob_outcome <- rbinom(n,1,bob_probability)
> cor(lisa_guess, bob_outcome)
[1] 0.06


Of course, in this case I didn’t even need to compute lisa_guess as it’s 100% correlated with bob_probability.
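The same number can be checked without simulating anything, because the correlation between two binary variables has a closed form. This is a sketch under the setup above (hit probabilities .55 and .45, hot 13% of the time, as in the `rep()` call); the function and argument names are made up for illustration.

```python
import math


def binary_correlation(p_hot, p_not, frac_hot):
    """Correlation between a perfect hot/not-hot call and the 0/1 shot outcome."""
    # Overall probability of making a shot.
    p_make = frac_hot * p_hot + (1 - frac_hot) * p_not
    # Cov(guess, outcome) = E[guess * outcome] - E[guess] * E[outcome]
    cov = frac_hot * p_hot - frac_hot * p_make
    sd_guess = math.sqrt(frac_hot * (1 - frac_hot))
    sd_outcome = math.sqrt(p_make * (1 - p_make))
    return cov / (sd_guess * sd_outcome)


rho = binary_correlation(p_hot=0.55, p_not=0.45, frac_hot=0.13)
print(round(rho, 3))
```

This gives a correlation of about 0.067, in line with the simulated 0.06: even a perfect detector of hotness is barely correlated with the shot outcomes.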

This is a great story, somewhat reminiscent of the famous R-squared = .01 example.

P.S. This happens to be closely related to the measurement error/attenuation bias issues that Miller told me about a couple years ago. And Jordan Ellenberg in comments points to a paper from Kevin Korb and Michael Stillwell, apparently from 2002, entitled “The Story of The Hot Hand: Powerful Myth or Powerless Critique,” that discusses related issues in more detail.

The point is counterintuitive (or, at least, counter to the intuitions of Gilovich, Vallone, Tversky, and a few zillion other people, including me before Josh Miller stepped into my office that day a couple years ago) and yet so simple to demonstrate. That’s cool.

Just to be clear, right here my point is not the small-sample bias of the lagged hot-hand estimate (the now-familiar point that there can be a real hot hand but it could appear as zero using Gilovich et al.’s procedure) but rather the attenuation of the estimate: the less-familiar point that even a large hot hand effect will show up as something tiny when estimated using 0/1 data. As Korb and Stillwell put it, “binomial data are relatively impoverished.”

This finding (which is mathematically obvious, once you see it, and can be demonstrated in 5 lines of code) is related to other obvious-but-not-so-well-known examples of discrete data being inherently noisy. One example is the R-squared=.01 problem linked to at the end of the above post, and yet another is the beauty-and-sex-ratio problem, where a researcher published paper after paper of what was essentially pure noise, in part because he did not seem to realize how little information was contained in binary data.

Again, none of this was a secret. The problem was sitting in open sight, and people have been writing about this statistical power issue forever. Here, for example, is a footnote from one of Miller and Sanjurjo’s papers:

Funny how it took this long for it to become common knowledge. Almost.

P.P.S. I just noticed another quote from Korb and Stillwell (2002):

Kahneman and Tversky themselves, the intellectual progenitors of the Hot Hand study, denounced the neglect of power in null hypothesis significance testing, as a manifestation of a superstitious belief in the “Law of Small Numbers”. Notwithstanding all of that, Gilovich et al. base their conclusion that the hot hand phenomenon is illusory squarely upon a battery of significance tests, having conducted no power analysis whatsoever! This is perhaps the ultimate illustration of the intellectual grip of the significance test over the practice of experimental psychology.

I agree with the general sense of this rant, but I’d add that, at least informally, I think Gilovich et al., and their followers, came to their conclusion not just based on non-rejection of significance tests but also based on the low value of their point estimates. Hence the relevance of the issue discussed in my post above, regarding attenuation of estimates. It’s not just that Gilovich et al. found no statistically significant differences, it’s also that their estimates were biased in a negative direction (that was the key point of Miller and Sanjurjo) and pulled toward zero (the point being made above). Put all that together and it looked to Gilovich et al. like strong evidence for a null, or essentially null, effect.

## What’s powdery and comes out of a metallic-green cardboard can?

This (by Jason Torchinsky, from Stay Free magazine, around 1998?) is just hilarious. We used to have both those shake-out-the-powder cans, Comet and that parmesan cheese, in our house when I was growing up.

## Would Bernie Sanders have lost the presidential election?

Nobody knows what would’ve happened had Bernie Sanders been the Democratic nominee in 2016. My guess based on my reading of the political science literature following Steven Rosenstone’s classic 1983 book, Forecasting Presidential Elections, is that Sanders would’ve done a bit worse than Hillary Clinton, because Clinton is a centrist within the Democratic party and Sanders is more on the ideological extreme. This is similar to the reasoning that Ted Cruz, as the most conservative of the Republican candidates, would’ve been disadvantaged in the general election.

But I disagree with Kevin Drum, who writes, “Bernie Sanders Would Have Lost the Election in a Landslide.” Drum compares Sanders to failed Democratic candidates George McGovern, Walter Mondale, and Michael Dukakis—but they were all running against incumbent Republicans under economic conditions which were inauspicious for the Democratic opposition.

My guess would be that Sanders’s ideological extremism could’ve cost the Democrats a percentage or two of the vote. So, yes, a priori, before the campaign, I’d say that Hillary Clinton was the stronger general election candidate. And I agree with Drum that, just as lots of mud was thrown at Clinton, the Russians would’ve been able to find some dirt on Sanders too.

But here’s the thing. Hillary Clinton won the popular vote by 3 million votes. Her votes were just not in the right places. Sanders could’ve won a million or two votes less than Clinton, and still won the election. Remember, John Kerry lost to George W. Bush by 3 million votes but still almost won in the Electoral College: he was short just 120,000 votes in Ohio.

So, even if Sanders was a weaker general election candidate than Clinton, he still could’ve won in this particular year.

Or, to put it another way, Donald Trump lost the popular vote by 3 million votes but managed to win the election because of the vote distribution. A more mainstream Republican candidate could well have received more votes (even a plurality!) without winning the electoral college.

The 2016 election was just weird, and it’s reasonable to say that (a) Sanders would’ve been a weaker candidate than Clinton, but (b) in the event, he could’ve won.
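To make the arithmetic behind (a) and (b) concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the four "states," their electoral votes, their vote counts, and the shifts applied to them are made up and are not 2016 returns. The only point is that a candidate can get a smaller share of the national vote and still win more electoral votes, if the losses fall in safe states and the gains fall in close ones.

```python
# Toy illustration only: every state, vote count, and share below is made up.
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    name: str
    electoral_votes: int
    votes_cast: int    # hypothetical two-party votes cast
    dem_share: float   # hypothetical Democratic share of the two-party vote


STATES = [
    State("Big safe state", 20, 10_000_000, 0.62),
    State("Swing state A", 12, 5_000_000, 0.496),
    State("Swing state B", 10, 4_000_000, 0.497),
    State("Safe opposition state", 15, 6_000_000, 0.44),
]
TOTAL_EV = sum(s.electoral_votes for s in STATES)  # size of this toy electoral map
EV_TO_WIN = TOTAL_EV // 2 + 1                      # bare majority of electoral votes


def outcome(states, shift_by_state=None):
    """Return (national Democratic two-party share, Democratic electoral votes).

    shift_by_state maps a state name to a change in Democratic share
    relative to the baseline; states not listed are left unchanged.
    """
    shift_by_state = shift_by_state or {}
    dem_votes = 0.0
    total_votes = 0
    dem_ev = 0
    for s in states:
        share = s.dem_share + shift_by_state.get(s.name, 0.0)
        dem_votes += share * s.votes_cast
        total_votes += s.votes_cast
        if share > 0.5:  # winner-take-all within each state
            dem_ev += s.electoral_votes
    return dem_votes / total_votes, dem_ev


# Baseline candidate: a clear popular-vote lead, but votes "not in the right places".
baseline_share, baseline_ev = outcome(STATES)

# Hypothetical alternative candidate: 3 points worse in the safe state,
# 1 point better in each swing state.
alt_shift = {"Big safe state": -0.03, "Swing state A": 0.01, "Swing state B": 0.01}
alt_share, alt_ev = outcome(STATES, alt_shift)

print(f"baseline:    {baseline_share:.1%} of the vote, {baseline_ev} of {TOTAL_EV} electoral votes")
print(f"alternative: {alt_share:.1%} of the vote, {alt_ev} of {TOTAL_EV} electoral votes")
print(f"electoral votes needed to win: {EV_TO_WIN}")
```

In this toy map the alternative candidate does worse nationally than the baseline candidate but clears the electoral-vote threshold that the baseline candidate misses. That says nothing about whether such a pattern of shifts would actually have occurred; it only shows that the scenario is arithmetically coherent.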

P.S. Drum responds to my points above with a post entitled, “Bernie Woulda Lost.” Actually that title is misleading because then in his post he writes, “I won’t deny that Sanders could have won. Gelman is right that 2016 was a weird year, and you never know what might have happened.”

But here’s Drum’s summary:

Instead of Clinton’s 51-49 percent victory in the popular vote, my [Drum’s] guess is that Sanders would have lost 47-53 or so.

Drum elaborates:

Sanders would have found it almost impossible to win those working-class votes [in the midwest]. There’s no way he could have out-populisted Trump, and he had a ton of negatives to overcome.

We know that state votes generally follow the national vote, so if Sanders had lost 1-2 percentage points compared to Clinton, he most likely would have lost 1-2 percentage points in Wisconsin, Michigan, and Pennsylvania too. What’s the alternative? That he somehow loses a million votes in liberal California but gains half a million votes in a bunch of swing states in the Midwest? What’s the theory behind that?

OK, there are a few things going on here.

1. Where does that 47-53 estimate come from? Drum’s saying that Sanders would’ve done a full 4 percentage points worse than Clinton in the popular vote. 4 percentage points is huge. It’s huge historically—Rosenstone in his aforementioned book estimates the electoral penalty for ideological extremism to be much less than that—and even more huge today in our politically polarized environment. So I don’t really see where that 4 percentage points is coming from. 1 or 2 percentage points, sure, which is why in my post above I did not say that I thought Sanders necessarily would’ve won, I just say it could’ve happened, and my best guess is that the election would’ve been really close.

As I said, I see Sanders’s non-centered political positions as costing him votes, just not nearly as much as Drum is guessing. And, again, I have no idea where Drum’s estimated 4 percentage point shift is coming from. However, there is one other thing, which is that Sanders is a member of a religious minority. It’s said that Romney being a Mormon cost him a bunch of votes in 2012, and similarly it’s not unreasonable to assume that Sanders being Jewish would cost him too. It’s hard to say: one might guess that anyone who would vote against someone just for being a Jew would already be voting for Trump, but who knows?

2. Drum correctly points out that swings are national and of course I agree with that (see, for example, item 9 here), but of course there were some departures from uniform swing. Drum attributes this to Mitt Romney being “a pro-trade stiff who was easy to caricature as a private equity plutocrat”—but some of this characterization applied to Hillary Clinton too. So I don’t think we should take the Clinton-Trump results as a perfect template for what would’ve happened, had the candidates been different.

Here are the swings:

To put it another way: suppose Clinton had run against Scott Walker instead of Donald Trump. I’m guessing the popular vote totals might have been very similar to what actually happened, but with a different distribution of votes.

Drum writes, “if Sanders had lost 1-2 percentage points compared to Clinton, he most likely would have lost 1-2 percentage points in Wisconsin, Michigan, and Pennsylvania too. What’s the alternative? That he somehow loses a million votes in liberal California but gains half a million votes in a bunch of swing states in the Midwest? What’s the theory behind that?”

My response: The theory is not that Sanders “loses” a million votes in liberal California but that he doesn’t do as well there as Clinton did—not an unreasonable assumption given that Clinton won the Democratic primary there. Similarly with New York. Just cos California and New York are liberal states, that doesn’t mean that Sanders would outperform Clinton in those places in the general election: after all, the liberals in those states would be voting for either of them over the Republican. And, yes, I think the opposite could’ve happened in the Midwest. Clinton and Sanders won among different groups and in different states in the primaries, and the gender gap in the general election increased a lot in 2016, especially among older and better-educated voters, so there’s various evidence suggesting that the two candidates were appealing to different groups of voters. My point is not that Sanders was a stronger candidate than Clinton on an absolute scale—as I wrote above, I don’t know, but my guess is that he would’ve done a bit worse in the popular vote—but rather that the particular outcome we saw was a bit of a fluke, and I see no reason to think a Sanders candidacy would’ve seen the same state-by-state swings as happened to occur with Clinton. Drum considers the scenario suggested above to be “bizarre” but I think he’s making the mistake of taking the particular Clinton-Trump outcome as a baseline. If you take Obama-Romney as a starting point and go from there, everything looks different.

Finally, Drum writes that my post “sounds like special pleading.” I looked up that term and it’s defined as “argument in which the speaker deliberately ignores aspects that are unfavorable to their point of view.” I don’t think I was doing that. I was just expressing uncertainty. Drum wrote the declarative sentence, “Bernie Sanders Would Have Lost the Election in a Landslide,” and I responded with doubt. My doubt regarding landslide claims is not new. For example, here I am on 31 Aug 2016:

Trump-Clinton Probably Won’t Be a Landslide. The Economy Says So.

I wasn’t specially pleading then, and I’m not specially pleading now. I’m just doing my best to assess the evidence.

## “Calm Down. American Life Expectancy Isn’t Falling.”

Ben Hanowell writes:

In the middle of December 2016 there were a lot of headlines about the drop in US life expectancy from 2014 to 2015. Most of these articles painted a grim picture of US population health. Many reporters wrote about a “trend” of decreasing life expectancy in America.

The trouble is that the drop in US life expectancy last year was the smallest among six drops between 1960 and 2015. What’s more, life expectancy dropped in 2015 by only a little over a month. That’s half the size of the next smallest drop and two-thirds the size of the average among those six drops. Compare that to the standard deviation in year-over-year change in life expectancy, which is nearly three months. In terms of percent change, 2015 life expectancy dropped by about 0.15%… but the standard deviation of year-over-year percent change in life expectancy is nearly 0.4%.

Most importantly, of course, life expectancy in the US has increased by about two months per year on average since 1960. [see above graph]

Hanowell has the full story at his blog:

The media is abuzz about a small drop in life expectancy in 2015. Yet despite sensationalist headlines, human lifespan has actually risen globally and nationally for decades if not centuries with no signs of a reversal. Alarmist news headlines follow noise rather than signal, causing us to lose sight of what’s really important: understanding how human lifespan has improved; how we can maintain that progress; how social institutions will cope with a rapidly aging population; and trends in vital statistics more fine-grained than overall life expectancy at birth.

Don’t believe the hype. Life expectancy isn’t plummeting.

Hanowell then goes through the steps:

What Is Life Expectancy?

Fact: Human Lifespan Has Risen Globally for Over 250 Years

Then he gets to the main point:

Fact: There’s No Evidence American Life Expectancy at Birth Is Falling

Okay. So the human lifespan has been increasing over the last few centuries in the U.S. and other nations. There still could have been a recent slowdown or reversal, right? Well, yes, but there’s virtually no evidence for it. The 2015 annual drop in lifespan is a mere 1.2 months of life. That’s 50% smaller than the average among six annual drops since 1960. Yet between 1960 and 2015, life expectancy in the U.S. increased by about two months per year on average. In 1960, newborns could expect to live just over 71 years. Now they can expect to live just under 79 years.

If words aren’t enough to convince you, here is an annotated picture of the numbers.

And then he gives the image that I’ve reproduced at the top of this post.
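For readers who want to check the numbers in the quoted passage, here is a rough back-of-the-envelope calculation in Python. The inputs are rounded versions of the figures quoted above (71 and 79 years for "just over 71" and "just under 79," and 3 months for "nearly three months"), so the rounding is my assumption and the outputs are approximate.

```python
# Rounded figures taken from the passages quoted above; the rounding is an assumption.
LIFE_EXP_1960_YEARS = 71.0      # "just over 71 years"
LIFE_EXP_2015_YEARS = 79.0      # "just under 79 years"
DROP_2015_MONTHS = 1.2          # reported drop from 2014 to 2015
SD_ANNUAL_CHANGE_MONTHS = 3.0   # "nearly three months", rounded up

years_elapsed = 2015 - 1960

# Average gain in life expectancy per calendar year, in months.
avg_gain_months_per_year = (
    (LIFE_EXP_2015_YEARS - LIFE_EXP_1960_YEARS) * 12 / years_elapsed
)

# Size of the 2015 drop relative to typical year-to-year variation.
drop_in_sd_units = DROP_2015_MONTHS / SD_ANNUAL_CHANGE_MONTHS

# How long the average trend would take to make up the 2015 drop.
years_to_recover = DROP_2015_MONTHS / avg_gain_months_per_year

print(f"average gain: {avg_gain_months_per_year:.2f} months per year")
print(f"2015 drop: {drop_in_sd_units:.2f} standard deviations")
print(f"years of average gain needed to offset the drop: {years_to_recover:.2f}")
```

With these rounded inputs the average gain comes out a bit under two months per year, and the 2015 drop is well under one standard deviation of the usual year-to-year change, which is consistent with Hanowell's description of it as noise around a rising trend.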

What, then?

Hanowell continues:

Let’s Stop Crying Wolf About Falling American Life Expectancy

Here are some examples of sensationalist, alarmist headlines about life expectancy:

U.S. life expectancy declines for first time in 20 years (BBC News)
Drugs blamed for fall in U.S. life expectancy (The Times)
Dying younger: U.S. life expectancy a ‘real problem’ (USA Today)
Heart disease, Alzheimer’s and accidents lead to drop in U.S. life expectancy (Newsweek)

We’ve already seen that American life expectancy is probably not a “real problem.” Quite the opposite. There may be an explanation for this short-term drop. Maybe The Times is right and it has something to do with the so-called “opioid epidemic.” Maybe Newsweek is right and we should chalk it up to heart disease and Alzheimer’s (although probably not). Maybe it’s something else entirely.

By sensationalizing short-term trends without the proper long-term context, we lose sight of the progress we’ve made. That leaves us less informed about how we’ve come so far in the first place, and where to go from here.

What We Should Be Talking About Instead of Falling Life Expectancy

(Because It Isn’t Falling)

Falling American lifespan isn’t a pressing problem. What should we focus on instead? Here are a couple of ideas:

Understand How We Came This Far and How to Keep Going . . .

Improve Health and Quality of Life at Advanced Ages Without Overwhelming Social Institutions . . .

Pay Greater Attention to Trends in Finer-Grained Vital Statistics Than Overall Life Expectancy . . .

Hanowell concludes his post as follows:

Recent headlines about a drop in expected American lifespan are misleading. Although life expectancy dropped by a small amount between 2014 and 2015, the long-term trend shows climbing lifespan. Instead of worrying about a problem for which there is no evidence, we should be focusing on meeting the challenges that come with longer human lifespans, and understanding why lifespan differs by demographic characteristics.

And then he has a question for me:

How can we encourage journalists and the prominent scientists they quote that you can still make a story about steadily increasing life expectancy despite occasional faltering, and it won’t hurt your chances of it “going viral” or getting research funding next year? Because to me, steadily increasing life expectancy is a more interesting story once you take into account how we got here, and what we’ll need to do to keep up with our own needs while taking care of the elderly.

## An efficiency argument for post-publication review

This came up in a discussion last week:

We were talking about problems with the review process in scientific journals, and a commenter suggested that prepublication review should be more rigorous:

There are a lot of statistical missteps you just can’t catch until you actually have the replication data in front of you to work with and look at. Andrew, do you think we will ever see a system implemented where you have to submit the replication code with the initial submission of the paper, rather than only upon publication (or not at all)? If reviewers had the replication files, they could catch many more of these types of arbitrary specification and fishing problems that produce erroneous results, saving the journal from the need for a correction. . . . I review papers all the time and sometimes I suspect there might be something weird going on in the data, but without the data itself I often just have to take the author(s) word for it that when they say they do X, they actually did X, etc. . . . then bad science gets through and people can only catch the mistakes post-publication, triggering all this bs from journals about not publishing corrections.

I responded that, no, I don’t think that beefing up prepublication review is a good idea:

As a reviewer I am not going to want to spend the time finding flaws in a submitted paper. I’ve always been told that it is the author, not the journal, who is responsible for the correctness of the claims. As a reviewer, I will, however, write that the paper does not give enough information and I can’t figure out what it’s doing.

Ultimately I think the only only only solution here is post-publication review. The advantage of post-publication review is that its resources are channeled to the more important cases: papers on important topics (such as Reinhart and Rogoff) or papers that get lots of publicity (such as power pose). In contrast, with regular journal submission, every paper gets reviewed, and it would be a huge waste of effort for all these papers to be carefully scrutinized. We have better things to do.

This is an efficiency argument. Reviewing resources are limited (recall that millions of scientific papers are published each year) so it makes sense to devote them to work that people care about.

And, remember, the problem with peer review is the peers.

## Hark, hark! the p-value at heaven’s gate sings

Three different people pointed me to this post, in which food researcher and business school professor Brian Wansink advises Ph.D. students to “never say no”: When a research idea comes up, check it out, put some time into it and you might get some success.

I like that advice and I agree with it. Or, at least, this approach worked for me when I was a student and it continues to work for me now, and my favorite students are those who follow this approach. That said, there could be some selection bias here, that the students who say Yes to new projects are the ones who are more likely to be able to make use of such opportunities. Maybe the students who say No would just end up getting distracted and making no progress, were they to follow this advice. I’m not sure. As an advisor myself, I recommend saying Yes to everything, but in part I’m using this advice to take advantage of the selection process, in that students who don’t like this advice might decide not to work with me.

Wansink’s post is dated 21 Nov but it’s only today, 15 Dec, that three people told me about it, so it must just have hit social media in some way.

The controversial and share-worthy aspect of the post is not the advice for students to be open to new research projects, but rather some of the specifics. Here’s Wansink: