
“The hype economy”

Palko writes:

I have no idea whether it is real or apocryphal, but there’s an often referred to study with primates where they earned tokens that could be exchanged for food. According to the standard version, the subjects soon came to value those tokens more than the treats they could be exchanged for.

The hype economy works along similar lines. The ability to get people talking about something (preferably but not always necessarily in a positive way) is tremendously valuable by most traditional standards. For entertainers, it can bring in large audiences. For goods and services, it can drive sales and help maintain customer loyalty. For politicians, it can be votes. For public policy initiatives, it can generate and shore up support.

At some point though (and it’s a point we passed quite a while back) the ability to generate buzz becomes disconnected both from the attributes which are supposed to drive it and the objectives it is supposed to serve. It then takes on a life of its own. Hype becomes the primary if not sole metric by which anything is judged.

I agree with Palko that this is happening, and of course we see this in science too, where the goal is to publish a paper in a high-impact journal, or to get lots of citations, or lots of awards and reputation points. The idea would be that these tokens are signs of scientific progress and success. To the extent that’s happening, it’s not so unreasonable to aim for the tokens under the theory that, if you do what it takes to get the tokens, true success will come along the way. Just as, if your goal is to be a top athlete, it’s a reasonable strategy to aim for a championship. But this all becomes a problem if there are other ways to get the tokens than through true scientific progress.

One thing I’d like to add to Palko’s conception is that the ability to aim for buzz rather than sales (in the economics context) will only work if you are already comfortable, right? So an internet entrepreneur can go for clicks and buzz only if he or she is already putting food on the table in some other way. I’m not sure how this fits into the rest of the story.

P.S. Also this:

One of the classic recipes for disaster in a market bubble is focusing exclusively on a sharp increase in demand while ignoring an even greater increase in supply.

I guess everyone who’s putting money into this, is thinking they won’t be the last one to get out.

Tom Wolfe

I’m a big Tom Wolfe fan.

My favorites are The Painted Word and From Bauhaus to Our House, and I have no patience for the boosters (oh, sorry, “experts”) of modern art of the all-black-painting variety or modern architecture of the can’t-find-the-front-door variety who can’t handle Wolfe’s criticism.

I also enjoyed Bonfire of the Vanities, with my only real complaint being the ending. Or maybe I should say the structure of the book. Wolfe sets all these great characters and plot elements in motion and then he doesn’t really resolve anything; it was kinda like he just got tired and decided to stop. That said, endings are hard. Robert Heinlein and Roald Dahl are two other writers who were great in the set-up but often had problems with the follow-through.

The characters in Bonfire were two-dimensional, but I can attribute that to Wolfe being more of a reporter than a novelist. When you’re reporting, you don’t need to flesh out your characters’ full dimensionality because, after all, they’re real people—as a reporter, you’re just telling part of the story. With a novel you have to put in that extra effort. In any case, two-dimensional ain’t bad. Gone Girl was pretty good, and all its characters were one-dimensional. 1984 is a great novel, and one might argue that its characters are zero-dimensional. The Rotter’s Club, these guys are three dimensional, but he’s Jonathan Coe, for chrissake. And Rabbit’s positively four-dimensional (remember, time is the fourth dimension), but Rabbit’s the greatest creation of a great novelist. Bonfire of the Vanities is an excellent book that we should define by its many strengths—its vividness, its up-to-the-minuteness, its Dickensianness, etc.—not by its few weaknesses.

What else? I never read The Right Stuff—the movie was so good, I felt no need to read the book—and was never able to get through his classic reportages of the car guys and the surfers and all those other pieces from the 60s. I found the whole Style!!! thing just too exhausting. It’s not that I’ll never read stream of consciousness—I read On the Road back when I was 20 or so, and it was great!—but something about those Wolfe essays, they just seemed so willed. I’m sure they were great at the time, but from the standpoint of decades later, I find the understated style of Gay Talese much more convincing. I did, however, like some of Wolfe’s later essays, such as his justification for Bonfire (that article about the billion-footed beast) and his attack on Mailer/Irving/Updike. Perhaps it’s just my taste that I preferred Wolfe when he was writing straight.

And then there was Wolfe’s attack on evolution. That was just foolish. But, hey, nobody’s perfect. Wolfe was proud of his ability to defend ridiculous positions, and in other settings that made for great writing.

After Wolfe died, I read a bunch of obituaries. And I learned a few things.

First off, I learned that he was tall. Who knew? In all those photos, I just somehow assumed he was short. Really short, like 5’3″ or something. Maybe it was how he dressed, like a dandy?

I also learned that Wolfe was middle-of-the-road, politically. I’d always thought of him as conservative, but I guess that was just in comparison to the rest of the literary establishment. According to Kyle Smith, he “habitually voted for the winner in every presidential election, except when he picked Mitt Romney in 2012 and Ross Perot in 1992.”

Finally, I learned that, in his famous article, Radical Chic, Wolfe was unfair to Leonard Bernstein. By this I don’t mean that he was making fun of Bernstein, quoting Bernstein out of context, not showing sufficient respect for Bernstein, etc. What I mean is that he put words into Bernstein’s mouth, put thoughts in Bernstein’s head, based on no evidence at all. Jay Livingston has the story. I’d never actually read that particular essay so I had no idea, and it didn’t come up in any of the other obituaries that I read. I guess maybe Radical Chic should be taken as fiction or satire; Bernstein was a public figure; you can say what you want about public figures if you’re writing fiction or satire; in any case Wolfe was still a brilliant writer and cultural critic. Still, it made me sad to learn this. Making fun of Bernstein, fine. Attributing thoughts to him—and not just any thoughts, but thoughts that make him look particularly foolish—not so cool. Then again, Wolfe was many things but I doubt he ever would’ve claimed to be cool.

Graphs and tables, tables and graphs

Jesse Wolfhagen writes:

I was surprised to see a reference to you in a Quartz opinion piece entitled “Stop making charts when a table is better”. While the piece itself makes that case that there are many kinds of charts that are simply restatements of tabular data, I was surprised that you came up as an advocate of tables being a “more honest” way to present information. It seems hard to see downstream effects by looking solely at tables.
So then I looked at the link, which led me to your blog post from 2009. Specifically, from April 1, 2009. Yes, like any good satire, your post was taken at face value!
So while yes, there are reasons for tables and reasons for charts and misuses of both formats, I might humbly suggest you put a tag on your future April 1 posts (on, say April 2), because it’s the internet age: satire and close inspection of dates are dead, but text searching and confirmation bias are alive and well.

Yup. For anyone who has further interest in the particular topic of tables and graphs, I recommend this paper from 2011, which begins:

The statistical community is divided when it comes to graphical methods and models. Graphics researchers tend to disparage models and to focus on direct representations of data, mediated perhaps by research on perceptions but certainly not by probability distributions. From the other side, modelers tend to think of graphics as a cute toy for exploring raw data but not much help when it comes to the serious business of modeling. In order to better understand the benefits and limitations of graphs in statistical analysis, this article presents a series of criticisms of graphical methods in the voice of a hypothetical old-school analytical statistician or social scientist. We hope to elicit elaborations and extensions of these and other arguments on the limitations of graphics, along with responses from graphical researchers who might have different perceptions of these issues.

Still relevant seven years later, I think.

“Using numbers to replace judgment”

Julian Marewski and Lutz Bornmann write:

In science and beyond, numbers are omnipresent when it comes to justifying different kinds of judgments. Which scientific author, hiring committee-member, or advisory board panelist has not been confronted with page-long “publication manuals”, “assessment reports”, “evaluation guidelines”, calling for p-values, citation rates, h-indices, or other statistics in order to motivate judgments about the “quality” of findings, applicants, or institutions? Yet, many of those relying on and calling for statistics do not even seem to understand what information those numbers can actually convey, and what not. Focusing on the uninformed usage of bibliometrics as worrysome outgrowth of the increasing quantification of science and society, we place the abuse of numbers into larger historical contexts and trends. These are characterized by a technology-driven bureaucratization of science, obsessions with control and accountability, and mistrust in human intuitive judgment. The ongoing digital revolution increases those trends. We call for bringing sanity back into scientific judgment exercises.

I agree. Vaguely along the same lines is our recent paper on the fallacy of decontextualized measurement.

This happens a lot, that the things that people do specifically to make their work feel more scientific, actually pull them away from scientific inquiry.

Another way to put it is that subjective judgment is unavoidable. When Blake McShane and the rest of us were writing our paper on abandoning statistical significance, one potential criticism we had to address was: What’s the alternative? If researchers, journal editors, policymakers, etc., don’t have “statistical significance” to make their decisions, what can they do? Our response was that decision makers already are using their qualitative judgment to make decisions. PNAS, for example, doesn’t publish every submission that is sent to them with “p less than .05”; no, they still reject most of them, on other grounds (perhaps because their claims aren’t dramatic enough). Journals may use statistical significance as a screener, but they still have to make hard decisions based on qualitative judgment. We, and Marewski and Bornmann, are saying that such judgment is necessary, and it can be counterproductive to add a pseudo-objective overlay on top of that.

2018: How did people actually vote? (The real story, not the exit polls.)

Following up on the post that we linked to last week, here’s Yair’s analysis, using Mister P, of how everyone voted.

Like Yair, I think these results are much better than what you’ll see from exit polls, partly because the analysis is more sophisticated (MRP gives you state-by-state estimates in each demographic group), partly because he’s using more data (tons of pre-election polls), and partly because I think his analysis does a better job of correcting for bias (systematic differences between the sample and population).
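To make the bias-correction step concrete, here is a minimal sketch of the poststratification half of MRP. All the numbers are made up, the cell names are hypothetical, and the cell-level estimates are simply taken as given; in a real analysis they would come from the multilevel regression, which this sketch skips entirely.

```python
# Poststratification with hypothetical numbers.
# Each cell: (age group, estimated Democratic share, respondents in sample, voters in population)
cells = [
    ("18-29", 0.65, 50, 2000),
    ("30-44", 0.58, 150, 2500),
    ("45-64", 0.50, 400, 3500),
    ("65+", 0.45, 400, 2000),
]

def raw_sample_estimate(cells):
    """Average the cell estimates weighted by how many respondents fell in each cell."""
    n = sum(c[2] for c in cells)
    return sum(c[1] * c[2] for c in cells) / n

def poststratified_estimate(cells):
    """Average the cell estimates weighted by each cell's share of the target population."""
    total = sum(c[3] for c in cells)
    return sum(c[1] * c[3] for c in cells) / total

raw = raw_sample_estimate(cells)
post = poststratified_estimate(cells)
print(f"raw sample estimate:     {raw:.4f}")
print(f"poststratified estimate: {post:.4f}")
```

In this toy setup the sample underrepresents younger voters, so the raw average comes out lower than the poststratified one. The correction is only as good as the cell estimates and the population counts that go into it.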

As Yair puts it:

We spent the last week combining all of the information available — pre-election projections, early voting, county and congressional district election results, precinct results where we have it available, and polling data — to come up with our estimates.

In future election years, maybe Yair’s results, or others constructed using similar methods, will become the standard, and we’ll be able to forget exit polls, or relegate them to a more minor part of our discourse.

Anyway, here’s what Yair found:

The breakdown by age. Wow:

Changes since the previous midterm election, who voted and how they voted:

Ethnicity and education:



Yair’s got more at the link.

And here’s our summary of what happened in 2018, that we posted a few days after the election.

Hey, check this out: Columbia’s Data Science Institute is hiring research scientists and postdocs!

Here’s the official announcement:

The Institute’s Postdoctoral and Research Scientists will help anchor Columbia’s presence as a leader in data-science research and applications and serve as resident experts in fostering collaborations with the world-class faculty across all schools at Columbia University. They will also help guide, plan and execute data-science research, applications and technological innovations that address societal challenges and related University-wide initiatives.

Postdoc Fellows

Requirements: PhD degree


Research Scientist (Open Rank)

Requirements: PhD degree + (see position description for more information)

Research Scientist who will conduct independent cutting-edge research in the foundations or application of data science or related fields, or be involved in interdisciplinary research through a collaboration between the Data Science Institute and the various schools at the University.


Research Scientist (Open Rank)

Requirements: PhD degree + (see position description for more information)

Research Scientist who will serve as Columbia’s resident experts to foster collaborations with faculty across all the schools at Columbia University.


Candidates for all Research Scientists positions must apply using the links above that direct to the Columbia HR portal for each position, whereas the Postdoc Fellows submit materials via:

I’m part of the Data Science Institute so if you want to work with me or others here at Columbia, you should apply.

The State of the Art

Christie Aschwanden writes:

Not sure you will remember, but last fall at our panel at the World Conference of Science Journalists I talked with you and Kristin Sainani about some unconventional statistical methods being used in sports science. I’d been collecting material for a story, and after the meeting I sent the papers to Kristin. She fell down the rabbit hole too, and ended up writing a rebuttal of them, which is just published ahead of print in a sports science journal.

The authors of the work she is critiquing have written a response on their website (the title, “The Vindication of Magnitude-Based Inference”) though they seem to have taken it down for revisions at the moment.

I’m attaching the paper that the proponents of this method (“magnitude-based inference”) wrote to justify it. Kristin’s paper is at least the third to critique MBI. Will Hopkins, who is the mastermind behind it, has doubled down. His method seems to be gaining traction. A course that teaches researchers how to use it has been endorsed by British Association of Sport and Exercise Science and Exercise & Sports Science Australia.

My reply:

The whole thing seems pretty pointless to me. I agree with Sainani that the paper on MBI does not make sense. But I also disagree with all the people involved in this debate in that I don’t think that “type 1 error rate” has any relevance to sports science, or to science more generally. See for example here and here.

I think scientists should be spending more time collecting good data and reporting their raw results for all to see, and less time trying to come up with methods for extracting a spurious certainty out of noisy data. I think this whole type 1, type 2 error thing is a horrible waste of time which is distracting researchers from the much more important problem of getting high quality measurements.

See here for further discussion of this general point.

P.S. These titles are great, no?

Robustness checks are a joke

Someone pointed to this post from a couple years ago by Uri Simonsohn, who correctly wrote:

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks.

Simonsohn followed up with an amusing story:

To demonstrate the problem I [Simonsohn] conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness,” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined if respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01 (STATA code)

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.

Simonsohn titled his post, “P-hacked Hypotheses Are Deceivingly Robust,” but really the point here has nothing to do with p-hacking (or, more generally, forking paths).

Part of the problem is that robustness checks are typically done for the purpose of confirming one’s existing beliefs, and that’s typically a bad game to be playing. More generally, the statistical properties of these methods are not well understood. Researchers typically have a deterministic attitude, identifying statistical significance with truth (as for example here).
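The pattern in Simonsohn's example, an estimate that barely moves as covariates are added, can be reproduced in a small simulation. This is a sketch on fake data, not a reanalysis of the GSS: the sample size, the single covariate, and the horoscope-reading rates are all invented, and stratification stands in for the regression specifications in his table. ID parity has no effect at all here, so whatever difference shows up is noise.

```python
import random

random.seed(1)

# Hypothetical survey: the outcome depends on a covariate (sex) but NOT on ID parity.
n = 2000
rows = []
for resp_id in range(1, n + 1):
    odd = resp_id % 2                     # "treatment": odd respondent ID
    female = random.random() < 0.5
    p_read = 0.60 if female else 0.40     # made-up horoscope-reading rates
    reads = 1 if random.random() < p_read else 0
    rows.append((odd, female, reads))

def mean(xs):
    return sum(xs) / len(xs)

def diff(rows):
    """Odd-minus-even difference in the share who read horoscopes."""
    return (mean([r[2] for r in rows if r[0] == 1])
            - mean([r[2] for r in rows if r[0] == 0]))

# Specification 1: no covariates.
raw = diff(rows)

# Specification 2: adjust for the covariate by stratifying on it, then average
# the within-stratum differences weighted by stratum size.
adjusted = 0.0
for level in (True, False):
    stratum = [r for r in rows if r[1] == level]
    adjusted += diff(stratum) * len(stratum) / n

print(f"unadjusted difference: {raw:+.4f}")
print(f"adjusted difference:   {adjusted:+.4f}")
```

Because ID parity is assigned independently of everything else, adjusting for a covariate can hardly change the estimate, whether that estimate reflects something real or pure chance. Agreement across specifications therefore says little about whether the underlying claim is true.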

Chocolate milk! Another stunning discovery from an experiment on 24 people!

Mike Hull writes:

I was reading over this JAMA Brief Report and could not figure out what they were doing with the composite score. Here are the cliff notes:

Study tested milk vs dark chocolate consumption on three eyesight performance parameters:

(1) High-contrast visual acuity
(2) Small-letter contrast sensitivity
(3) Large-letter contrast sensitivity

Only small-letter contrast sensitivity was significant, but then the authors do this:

Visual acuity was expressed as log of the minimum angle of resolution (logMAR) and contrast sensitivity as the log of the inverse of the minimum detectable contrast (logCS). We scored logMAR and logCS as the number of letters read correctly (0.02 logMAR per visual acuity letter and 0.05 logCS per letter).

Because all 3 measures of spatial vision showed improvement after consumption of dark chocolate, we sought to combine these data in a unique and meaningful way that encompassed different contrasts and letter sizes (spatial frequencies). To quantify overall improvement in spatial vision, we computed the sum of logMAR (corrected for sign) and logCS values from each participant to achieve a composite score that spans letter size and contrast. Composite score results were analyzed using Bland-Altman analysis, with P < .05 indicating significance.

There are more details in the short Results section, but the conclusion was that “Twenty-four participants (80%) showed some improvement with dark chocolate vs milk chocolate (Wilcoxon signed-rank test, P < .001).” Any idea what’s going on here? Trial pre-registration here.

I replied that I don’t have much to say on this one. They seemed to have scoured through their data so I’m not surprised they found low p-values. Too bad to see this sort of thing appearing in JAMA. I guess chocolate’s such a fun topic, there’s always room for another claim to be made for it.
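For readers trying to follow the composite score in the quoted Methods text, here is one possible reading of the arithmetic, as a rough sketch. The per-letter conversion factors come from the quoted passage. The function name, the assumption that "corrected for sign" means negating logMAR (lower logMAR is better, higher logCS is better), the assumption that both logCS measures enter the sum, and the example participant are all guesses for illustration.

```python
# Conversion factors as quoted from the paper's Methods text.
LOGMAR_PER_LETTER = 0.02   # each extra acuity letter lowers logMAR by 0.02
LOGCS_PER_LETTER = 0.05    # each extra contrast letter raises logCS by 0.05

def composite_change(acuity_letters, small_cs_letters, large_cs_letters):
    """Change in composite score from extra letters read (positive = better).

    Assumption: "corrected for sign" means logMAR is negated before summing.
    """
    delta_logmar = -LOGMAR_PER_LETTER * acuity_letters
    delta_logcs = LOGCS_PER_LETTER * (small_cs_letters + large_cs_letters)
    return -delta_logmar + delta_logcs

# Hypothetical participant: 1 more acuity letter, 2 more small-letter and
# 1 more large-letter contrast letters read after dark chocolate.
example = composite_change(1, 2, 1)
print(round(example, 2))
```

On this reading, one extra contrast-sensitivity letter moves the composite 2.5 times as much as one extra acuity letter.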

“Law professor Alan Dershowitz’s new book claims that political differences have lately been criminalized in the United States. He has it wrong. Instead, the orderly enforcement of the law has, ludicrously, been framed as political.”

This op-ed by Virginia Heffernan is about politics, but it reminded me of the politics of science.

Heffernan starts with the background:

This last year has been a crash course in startlingly brutal abuses of power. For decades, it seems, a caste of self-styled overmen has felt liberated to commit misdeeds with impunity: ethical, sexual, financial and otherwise.

There’s hardly room to name them all here, though of course icons of power-madness such as Donald Trump and Harvey Weinstein are household names. In plain sight, even more or less regular schmos — including EPA administrator Scott Pruitt, disgraced carnival barker Bill O’Reilly and former New York Atty. Gen. Eric Schneiderman — seem to have fancied themselves exempt from the laws that bind the rest of us.

These guys are not exactly men of Nobel-level accomplishment or royal blood. Like the rest of us, they live in a democracy, under rule of law. Still, they like to preen. . . .

On Friday, a legal document surfaced that suggested Donald Trump and Michael Cohen, his erstwhile personal lawyer, might have known about Schneiderman’s propensity for sexual violence as early as 2013, when Trump tweeted menacingly about Schneiderman’s being a crook . . . Cohen’s office also saw big sums from blue-chip companies not known for “Sopranos”-style nonsense, specifically, Novartis and AT&T. . . .

Also this week, the Observer alleged that Trump confederates hired the same gang of former Israeli intelligence officers to frame and intimidate proponents of the Iran deal that Harvey Weinstein once viciously sicced on his victims.

Then, after this overview of the offenders, she discusses their enablers:

Law professor Alan Dershowitz’s new book claims that political differences have lately been criminalized in the United States. He has it wrong. Instead, the orderly enforcement of the law has, ludicrously, been framed as political.

As with politics, so with science: Most people, I think, are bothered by these offenses, and are even more bothered by the idea that they have been common practice. And some of us are so bothered that we make a fuss about it. But there are others—Alan Dershowitz types—who are more bothered by those who make the offense public, who have loyalty not to government but to the political establishment. Or, in the science context, have loyalty not to science but to the scientific establishment.

In politics, we say that the consent of the governed is essential to good governance, thus there is an argument that, at least in the short term, it’s better to hush up crimes rather than to let them be known. Similarly, in science, there are those who prefer happy talk and denial, perhaps because they feel that the institution of science is under threat. As James Heathers puts it, these people hype bad science and attack its critics because criticism is bad for business.

Who knows? Maybe Dershowitz and the defenders of junk science are right! Maybe a corrupt establishment is better than the uncertainty of the new.

They might be right, but I wish they’d be honest about what they’re doing.

Hey! Here’s what to do when you have two or more surveys on the same population!

This problem comes up a lot: We have multiple surveys of the same population and we want a single inference. The usual approach, applied carefully by news organizations such as Real Clear Politics and Five Thirty Eight, and applied sloppily by various attention-seeking pundits every two or four years, is “poll aggregation”: you take the estimate from each poll separately, if necessary correct these estimates for bias, then combine them with some sort of weighted average.

But this procedure is inefficient and can lead to overconfidence (see discussion here, or just remember the 2016 election).

A better approach is to pool all the data from all the surveys together. A survey response is a survey response! Then when you fit your model, include indicators for the individual surveys (varying intercepts, maybe varying slopes too), and include that uncertainty in your inferences. Best of both worlds: you get the efficiency from counting each survey response equally, and you get an appropriate accounting of uncertainty from the multiple surveys.

OK, you can’t always do this: To do it, you need all the raw data from the surveys. But it’s what you should be doing, and if you can’t, you should recognize what you’re missing.
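Here is a rough sketch of why the survey-level variation matters, using simulated data. Everything in it is hypothetical: the six surveys, their sizes, and the house effects. A one-way random-effects ANOVA (method of moments) stands in for the multilevel model you would actually fit, which could also include respondent-level predictors and varying slopes.

```python
import math
import random

random.seed(2)

# Hypothetical setup: 6 surveys of the same population, true support 0.50,
# each survey shifted by its own "house effect" (sd 0.02).
true_support = 0.50
sizes = [400, 600, 800, 1000, 1200, 1500]
surveys = []
for n_j in sizes:
    p_j = true_support + random.gauss(0, 0.02)
    surveys.append([1 if random.random() < p_j else 0 for _ in range(n_j)])

J = len(surveys)
N = sum(len(s) for s in surveys)
means = [sum(s) / len(s) for s in surveys]
grand = sum(sum(s) for s in surveys) / N

# Naive pooling: treat all N responses as one simple random sample.
naive_se = math.sqrt(grand * (1 - grand) / N)

# Method-of-moments fit of  y_ij = mu + a_j + e_ij,
# with a_j ~ (0, tau2) for surveys and e_ij ~ (0, sigma2) for respondents.
ss_within = sum((y - m) ** 2 for s, m in zip(surveys, means) for y in s)
sigma2 = ss_within / (N - J)
ms_between = sum(len(s) * (m - grand) ** 2 for s, m in zip(surveys, means)) / (J - 1)
n0 = (N - sum(len(s) ** 2 for s in surveys) / N) / (J - 1)
tau2 = max(0.0, (ms_between - sigma2) / n0)

# Estimate of mu and its standard error, including survey-to-survey variation.
weights = [1 / (tau2 + sigma2 / len(s)) for s in surveys]
mu_hat = sum(w * m for w, m in zip(weights, means)) / sum(weights)
model_se = math.sqrt(1 / sum(weights))

print(f"naive pooled estimate: {grand:.4f} (se {naive_se:.4f})")
print(f"with survey effects:   {mu_hat:.4f} (se {model_se:.4f})")
```

The naive standard error shrinks as more responses are added, no matter how much the surveys disagree with each other. The version with survey effects carries that disagreement into the uncertainty, which is the overconfidence problem in miniature.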

2018: Who actually voted? (The real story, not the exit polls.)

Continuing from our earlier discussion . . . Yair posted some results from his MRP analysis of voter turnout:

1. The 2018 electorate was younger than in 2014, though not as young as exit polls suggest.

2. The 2018 electorate was also more diverse, with African American and Latinx communities surpassing their share of votes cast in 2014.

3. Voters in 2018 were more educated than all the years in our dataset going back to 2006. . . . the exit poll shows the opposite trend. As noted earlier, they substantially changed their weighting scheme on education levels, so these groups can’t be reliably compared across years in the exit poll.

Details here.

Matching (and discarding non-matches) to deal with lack of complete overlap, then regression to adjust for imbalance between treatment and control groups

John Spivack writes:

I am contacting you on behalf of the biostatistics journal club at our institution, the Mount Sinai School of Medicine. We are working Ph.D. biostatisticians and would like the opinion of a true expert on several questions having to do with observational studies—questions that we have not found to be well addressed in the literature.

And here are their questions:

(1) Among the popular implementations of propensity score-based methods for analyzing observational data, matching, stratification based on quintiles (for instance), and weighting (by inverse probability of assigned treatment, say) is there a clear preference? Does the answer depend on data-type?
I personally like stratification by quintiles of propensity score (followed by an analysis pooled over the quintile groups) because it is simple and robust with no complicated matching algorithms, and no difficult choices over different types of weights. Is this method acceptable for high quality publications?

Also, given that the main threats to the validity of observational studies are elsewhere (unmeasured confounders, treatment heterogeneity, etc.), is the choice of which implementation of propensity score to use really as important as the volume of the literature devoted to the subject would suggest?

(2) Let’s say we’re estimating a treatment effect using an estimated-propensity-score-based-reweighting of the sample (weighting by the inverse of the estimated probability of the assigned treatment, say). The authoritative references (eg Lunceford and Davidian) seem to say that one must take account of the fact that the weights are themselves estimated to produce a valid standard error for any final estimate of treatment effect. Complex formulas are sometimes provided for the standard errors of particular estimators, expressed for instance, as a sandwich variance.
In practice, however, whether for weighted analyses or matched ones, we seldom make this kind of adjustment to the standard error and just proceed to do a ‘standard’ analysis of the weighted sample.

If you conceptualize the ‘experiment’ being performed as follows:

(a) Draw a sample from the (observational) population, including information on subjects’ covariates and treatment assignments
(b) Re-weight the sample by means of an estimated propensity score (estimated only using that sample).
(c) Observe the outcomes and perform the weighted analysis (for instance using inverse-probability-of-assigned-treatment weights) to calculate an estimate of treatment effect.

Then, yes, over a large number of iterations of this experiment the sampling distribution of the estimate of treatment effect will be affected by the variation over multiple iterations of the covariate balance between treatment groups (and the resulting variation of the weights) and this will contribute to the variance of the sampling distribution of the estimator.

However, there is another point of view. If a clinical colleague showed up for a consultation with a ‘messy’ dataset from an imperfectly designed or imperfectly randomized study, we would often accept the dataset as-is, using remedies for covariate imbalances, outliers, etc. as needed, in hopes of producing a valid result. In effect, we would be conditioning on the given treatment assignments rather than attempting an unconditional analysis over the (unknowable) distribution of multiple bad study designs. It seems to me that there is nothing very wrong with this method.

Applied to a reweighted sample (or a matched sample), why would such a conditional analysis be invalid, provided we apply the same standards of model checking, covariate balance, etc. that we use in other possibly messy observational datasets? In fact, wouldn’t conditioning on all available information, including treatment assignments and sample covariate distributions, lead to greater efficiency, and be closer in spirit to established statistical principles (like the extended form of the ancillarity principle)? To a non-expert, wouldn’t this seem like a strong enough argument in favor of our usual way of doing things?

My reply:

(1) The concern in causal inference is mismatch between the treatment and control groups. I have found it helpful to distinguish between two forms of mismatch:
– lack of complete overlap on observed pre-treatment predictors
– imbalance on observed pre-treatment predictors

From my book with Jennifer, here are a couple pictures to distinguish the two concepts.

Lack of complete overlap:


The point of matching, as I see it, is to restrict the range of inference to the zone of overlap (that is, to define the average treatment effect in a region where the data can directly answer such a question). You match so as to get a subset of the data with complete overlap (in some sense) and then discard the points that did not match: the treated units for which there was no similar control unit, and the control units for which there was no treated unit.

Stratification and weighting are ways of handling imbalance. More generally, we can think of stratification and weighting as special cases of regression modeling, with the goal being to adjust for known differences between sample and population. But you can really only adjust for these differences where there is overlap. Outside the zone of overlap, your inferences for treatment effects are entirely assumption-based.

To put it another way, matching (and throwing away the non-matches) is about identification, or robustness. Stratification/weighting/regression are for bias correction.

Propensity scores are a low-dimensional approximation, a particular way of regularizing your regression adjustment. Use propensity scores if you want.

(2) If someone were to give me the propensity score, I’d consider using it as a regression predictor—not alone, but with other key predictors such as age, sex, smoking history, etc. I would not do inverse-propensity-score weighting, as this just seems like a way to get a lot of noise and get some mysterious estimate that I wouldn’t trust anyway.

You write, “we would be conditioning on the given treatment assignments.” You always have to be conditioning on the given treatment assignments: that’s what you have! That said, you’ll want to know the distribution of what the treatment assignments could’ve been, had the experiment gone differently. That’s relevant to your interpretation of the data. If you do everything with a regression model, your implicit assumption is that treatment assignment depends only on the predictors in your model. That’s a well known principle in statistics and is discussed in various places including chapter 8 of BDA3 (chapter 7 in the earlier editions).

(3) One more thing. If you have lack of complete overlap and you do matching, you should also do regression. Matching to restrict your data to a region of overlap, followed by regression to adjust for imbalance. Don’t try to make matching do it all. This is an old, old point, discussed back in 1970 in Donald Rubin’s Ph.D. thesis.
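Here is a minimal sketch of that two-step recipe on simulated data. Everything in it is an assumption made for illustration: the single covariate `x`, the crude caliper rule standing in for a real matching procedure, the true effect of 2, and the outcome curve, which is linear inside the zone of overlap and bends only where there are no controls. It is not the method from any particular paper.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(XtX, Xty)

def treatment_coef(units):
    """Coefficient on treatment z in the linear regression y ~ 1 + z + x."""
    rows = [[1.0, float(u["z"]), u["x"]] for u in units]
    return ols(rows, [u["y"] for u in units])[1]

def restrict_to_overlap(units, caliper):
    """Keep units with at least one opposite-group unit within `caliper` in x."""
    treated_x = [u["x"] for u in units if u["z"] == 1]
    control_x = [u["x"] for u in units if u["z"] == 0]
    kept = []
    for u in units:
        others = control_x if u["z"] == 1 else treated_x
        if any(abs(u["x"] - xo) <= caliper for xo in others):
            kept.append(u)
    return kept

def simulate(n_per_group, true_effect, rng):
    """Hypothetical data: controls have x in [0, 6], treated have x in [4, 10]."""
    units = []
    for z, (lo, hi) in ((0, (0.0, 6.0)), (1, (4.0, 10.0))):
        for _ in range(n_per_group):
            x = rng.uniform(lo, hi)
            # Linear in x up to 7, steeper beyond (a region with no controls).
            y = true_effect * z + x + 3.0 * max(0.0, x - 7.0) + rng.gauss(0.0, 0.5)
            units.append({"z": z, "x": x, "y": y})
    return units

rng = random.Random(1)
units = simulate(500, 2.0, rng)
matched = restrict_to_overlap(units, caliper=0.25)

effect_all = treatment_coef(units)        # leans on linearity outside the overlap
effect_matched = treatment_coef(matched)  # regression within the overlap only

print(f"units kept after matching: {len(matched)} of {len(units)}")
print(f"regression on all data:     {effect_all:.2f}")
print(f"regression on matched data: {effect_matched:.2f}")
```

In this made-up setup the regression on all the data has to extrapolate a straight line into a region where there are no controls, so its estimate depends on an assumption the data cannot check. The matched subset gives up those units and then still uses regression to adjust for the remaining imbalance in `x`.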

2018: What really happened?

We’re always discussing election results on three levels: their direct political consequences, their implications for future politics, and what we can infer about public opinion.

In 2018 the Democrats broadened their geographic base, as we can see in this graph from Yair Ghitza:

Party balancing

At the national level, what happened is what we expected to happen two weeks ago, two months ago, and two years ago: the Democrats bounced back. Their average district vote for the House of Representatives increased by enough to give them clear control of the chamber, even in the face of difficulties of geography and partisan districting.

This was party balancing, which we talked about a few months ago: At the time of the election, the Republicans controlled the executive branch, both houses of congress, and the judiciary, so it made sense that swing voters were going to swing toward the Democrats. Ironically, one reason the Democrats did not regain the Senate in 2018 is . . . party balancing in 2016! Most people thought Hillary Clinton would win the presidency, so lots of people voted Republican for congress to balance that.

The swing in votes toward the Democrats was large (in the context of political polarization). As Nate Cohn wrote, the change in seats was impressive, given that there weren’t very many swing districts for the Democrats to aim for.

Meanwhile, as expected, the Senate remained in Republican control. Some close races went 51-49 rather than 49-51, which doesn’t tell us much about public opinion but is politically consequential.

Where did it happen?

The next question is geographic. Nationally, voters swung toward the Democrats. I was curious where this happened, so I did some googling and found this map by Josh Holder, Cath Levett, Daniel Levitt, and Peter Andringa:

This map omits districts that were uncontested in one election or the other so I suspect it understates the swing, but it gives the general idea.

Here’s another way to look at the swings.

Yair made a graph plotting the vote swing from 2016 to 2018, for each election contested in both years, plotting vs. the Democratic share of the two-party vote in 2016.

The result was pretty stunning—so much that I put the graph at the top of this post. So please scroll up and take a look again, then scroll back down here to keep reading.

Here’s the key takeaway. The Democrats’ biggest gains were in districts where the Republicans were dominant.

In fact, if you look at the graph carefully (and you also remember that we’re excluding uncontested elections, so we’re missing part of the story), you see the following:
– In strong Republican districts (D’s receiving less than 40% of the vote in 2016), Democrats gained almost everywhere, with an average gain of, ummm, it looks something like 8 percentage points.
– In swing districts (D’s receiving 40-60% of the vote in 2016), D’s improved, but only by about 4 percentage points on average. A 4% swing in the vote is a lot, actually! It’s just not 8%.
– In districts where D’s were already dominating, the results were, on average, similar to what happened in 2016.

I don’t know how much this was a national strategy and how much it just happened, but let me point out two things:

1. For the goal of winning the election, it would have been to the Democrats’ advantage to concentrate their gains in the zone where they’d received between 40 and 55% of the vote in the previous election. Conversely, these are the places where the Republicans would’ve wanted to focus their efforts too.

2. Speaking more generally, the Democrats have had a problem, both at the congressional and presidential levels, of “wasted votes”: winning certain districts with huge majorities and losing a lot of the closer districts. Thus, part of Democratic strategy has been to broaden their geographic base. The above scatterplot suggests that the 2018 election was a step in the right direction for them in this regard.

Not just a statistical artifact

When Yair sent me that plot, I had a statistical question: Could it be “regression to the mean”? We might expect, absent any election-specific information, that the D’s would improve in districts where they’d done poorly, and they’d decline in districts where they’d done well. So maybe I’ve just been overinterpreting a pattern that tells us nothing interesting at all?

To address this possible problem, Yair made two more graphs, repeating the above scatterplot, but showing the 2014-to-2016 shift vs. the 2014 results, and the 2012-to-2014 shift vs. the 2012 results. Here’s what he found:

So the answer is: No, it’s not regression to the mean, it’s not a statistical artifact. The shift from 2016 to 2018—the Democrats gaining strength in Republican strongholds—is real. And it can have implications for statewide and presidential elections as well. This is also consistent with results we saw in various special elections during the past two years.
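The logic of that check can be sketched with a small simulation. All numbers here are made up: 400 hypothetical districts, a noise level of 3 percentage points, and a "real" shift that borrows the rough 8, 4, and 0 point pattern described above. The first pair of elections has no real change at all, so any pattern in its swing is pure regression to the mean. The second pair adds the real shift.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
n_districts = 400
noise_sd = 0.03  # hypothetical election-to-election noise in vote share

# Stable underlying Democratic share of the two-party vote in each district.
base = [rng.uniform(0.25, 0.75) for _ in range(n_districts)]

def no_shift(b):
    return 0.0

def real_shift(b):
    """Hypothetical real change: bigger Democratic gains where they were weak."""
    if b < 0.40:
        return 0.08
    if b < 0.60:
        return 0.04
    return 0.0

def election(shift):
    """One election: stable share plus any real shift plus independent noise."""
    return [b + shift(b) + rng.gauss(0.0, noise_sd) for b in base]

first = election(no_shift)
second = election(no_shift)    # nothing real changed since `first`
third = election(real_shift)   # a real shift concentrated in weak districts

swing_12 = [b - a for a, b in zip(first, second)]
swing_23 = [b - a for a, b in zip(second, third)]

slope_rtm = slope(first, swing_12)    # regression to the mean only
slope_real = slope(second, swing_23)  # regression to the mean plus real shift

print(f"slope of swing vs. previous share, no real change: {slope_rtm:.3f}")
print(f"slope of swing vs. previous share, real shift:     {slope_real:.3f}")
```

Regression to the mean alone gives a mild negative slope, and it would give roughly the same mild slope for any pair of elections. A real shift shows up as a much steeper slope in the one pair of years where it happened, which is why comparing against earlier pairs of elections is informative.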

The current narrative is wrong

As Yair puts it:

Current narrative: Dems did better in suburban/urban, Reps did better in rural, continuing trend from 2012-2016. I [Yair] am seeing the opposite:

This isn’t increasing polarization/sorting. This also isn’t mean-reversion. D areas stayed D, R areas jumped ship to a large degree. A lot of these are the rural areas that went the other way from 2012-2016.

Not just in CD races, also in Gov and Senate races . . . Massive, 20+ point shift in margin in Trump counties. Remember this is with really high turnout. . . . This offsets the huge shift from 2012-2016, often in the famous “Obama-Trump” counties.

Yair adds:

The reason people are missing this story right now: focusing on who won/lost means they’re looking at places that went from just below 50 to just above 50. Obviously that’s more important for who actually governs. But this is the big public opinion shift.

I suspect that part of this could be strategy from both parties.

– On one side, the Democrats knew they had a problem in big swathes of the country and they made a special effort to run strong campaigns everywhere: part of this was good sense given their good showing in special elections, and part of it was an investment in their future, to lay out a strong Democratic brand in areas where they’ll need to be competitive in future statewide and presidential elections.

– On the other side, the Republicans had their backs to the wall and so they focused their effort on the seats they needed to hold if they had a chance of maintaining their House majority.

From that standpoint, the swings above do not completely represent natural public opinion swings from national campaigns. But they are meaningful: they’re real votes, and they’re in places where the Democrats need to gain votes in the future.

There are also some policy implications: If Democratic challengers are more competitive in previously solid Republican districts, enough so that the Republican occupants of these seats are more afraid of losing centrist votes in future general elections than losing votes on the right in future primaries, this could motivate these Republicans to vote more moderately in congress. I don’t know, but it seems possible.

I sent the above to Yair, and he added the following comments:

“Recapping the recent plagiarism scandal”

Benjamin Carlisle writes:

A year ago, I received a message from Anna Powell-Smith about a research paper written by two doctors from Cambridge University that was a mirror image of a post I wrote on my personal blog roughly two years prior. The structure of the document was the same, as was the rationale, the methods, and the conclusions drawn. There were entire sentences that were identical to my post. Some wording changes were introduced, but the words were unmistakably mine. The authors had also changed some of the details of the methods, and in doing so introduced technical errors, which confounded proper replication. The paper had been press-released by the journal, and even noted by Retraction Watch. . . .

At first, I was amused by the absurdity of the situation. The blog post was, ironically, a method for preventing certain kinds of scientific fraud. [Carlisle’s original post was called “Proof of prespecified endpoints in medical research with the bitcoin blockchain,” the paper that did the copying was called “How blockchain-timestamped protocols could improve the trustworthiness of medical science,” so it’s amusing that copying had been done on a paper on trustworthiness. — ed.] . . .

The journal did not catch the similarities between this paper and my blog in the first place, and the peer review of the paper was flawed as well.

OK, so far, nothing so surprising. Peer review often gets it wrong, and, in any case, you’re allowed to keep submitting your paper to new journals until it gets accepted somewhere. Indeed, F1000 Research is not a top journal, so maybe the paper was rejected a few times before appearing there. Or maybe not, I have no idea.

But then the real bad stuff started to happen. Here’s Carlisle:

After the journal’s examination of the case, they informed us that updating the paper to cite me after the fact would undo any harm done by failing to credit the source of the paper’s idea. A new version was hastily published that cited me, using a non-standard citation format that omitted the name of my blog, the title of my post, and the date of original publication.

Wow. That’s the kind of crap—making a feeble correction without giving any credit—that gets done by Reuters and Perspectives on Psychological Science. I hate to see the editor of a real journal act that way.

Carlisle continues:

I was shocked by the journal’s response. Authorship of a paper confers authority in a subject matter, and their cavalier attitude toward this, especially given the validity issues I had raised with them, seemed irresponsible to me. In the meantime, the paper was cited favourably by the Economist and in the BMJ, crediting Irving and Holden [the authors of the paper that copied Carlisle’s work].

I went to Retraction Watch with this story, which brought to light even more problems with this example of open peer review. The peer reviewers were interviewed, and rather than re-evaluating their support for the paper, they doubled down, choosing instead to disparage my professional work and call me a liar. . . .

The journal refused to retract the paper. It was excellent press for the journal and for the paper’s putative authors, and it would have been embarrassing for them to retract it. The journal had rolled out the red carpet for this paper after all, and it was quickly accruing citations.

That post appeared in June, 2017. But then I clicked on the link to the published article and found this:

So the paper did end up getting retracted—but, oddly enough, not for the plagiarism.

On the plus side, the open peer review is helpful. Much better than Perspectives on Psychological Science. Peer review is not perfect. But saying you do peer review, and then not doing it, that’s really bad.

The Carlisle story is old news, and I know that some people feel that talking about this sort of thing is a waste of time compared to doing real science. And, sure, I guess it is. But here’s the thing: fake science competes with real science. NPR, the Economist, Gladwell, Freakonomics, etc.: they’ll report on fake science instead of reporting on real science. After all, fake science is more exciting! When you’re not constrained by silly things such as data, replication, coherence with the literature, you can really make fast progress! The above story is interesting in that it appears to feature an alignment of low-quality research and unethical research practices. These two things don’t have to go together but often it seems that they do.

Melanie Mitchell says, “As someone who has worked in A.I. for decades, I’ve witnessed the failure of similar predictions of imminent human-level A.I., and I’m certain these latest forecasts will fall short as well.”

Melanie Mitchell’s piece, Artificial Intelligence Hits the Barrier of Meaning (NY Times behind limited paywall), is spot-on regarding the hype surrounding the current A.I. boom. It’s soon to come out in book length from FSG, so I suspect I’ll hear about it again in the New Yorker.

Like Professor Mitchell, I started my Ph.D. at the tail end of the first A.I. revolution. Remember, the one based on rule-based expert systems? I went to Edinburgh to study linguistics and natural language processing because it was strong in A.I., computer science theory, linguistics, and cognitive science.

On which natural language tasks can computers outperform or match humans? Search is good, because computers are fast and it’s a task at which humans aren’t so hot. That includes things like speech-based call routing in heterogeneous call centers (something I worked on at Bell Labs).

Then there’s spell checking. That’s fantastic. It leverages simple statistics about word frequency and typos/brainos and is way better than most humans at spelling. The same algorithms are used for speech recognition and RNA-seq alignment to the genome. These all sprang out of Claude Shannon’s 1948 paper, “A Mathematical Theory of Communication”, which has over 100K citations. It introduced, among other things, n-gram language models at the character and word level (still used for speech recognition and classification today with different estimators). As far as I know that paper contained the first posterior predictive checks: generating examples from the trained language models and comparing them to real language. David MacKay’s info theory book (the only ML book I actually like) is a great introduction to this material and even BDA3 added a spell-checking example. But it’s hardly A.I. in the big “I” sense of “A.I.”.
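To make the "simple statistics" point concrete, here is a toy sketch of a frequency-based spelling corrector. It is a heavily simplified noisy-channel setup and not how any production spell checker works: the corpus is a made-up sentence, every single-character typo is treated as equally likely, and candidates more than one edit away are ignored.

```python
from collections import Counter

# Hypothetical toy corpus; a real corrector would count words in a large text.
CORPUS = "the cat sat on the mat the cat ate the rat a hat is not a cat"
WORD_COUNTS = Counter(CORPUS.split())
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit of `word`."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing plausible nearby, so leave the word alone
    # Word frequency plays the role of the prior over intended words.
    return max(candidates, key=lambda w: WORD_COUNTS[w])

for typo in ["teh", "cta", "mqt", "xat"]:
    print(typo, "->", correct(typo))
```

The last example is the interesting one: "xat" is one edit away from several known words, and the corrector picks the one that is most frequent in the corpus. That is the word-frequency statistic doing the work, with no understanding of meaning anywhere in sight.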

Speech recognition has made tremendous strides (I worked on it at Bell Labs in the late 90s then at SpeechWorks in the early 00s), but its performance is still so far short of human levels as to make the difference more qualitative than quantitative, a point Mitchell makes in her essay. It would no more fool you into thinking it was human than an animatronic Disney character bolted to the floor. Unlike games like chess or go, it’s going to be hard to do better than people at language, but it would certainly be possible. But it would be hard to do that the same way they built, say Deep Blue, the IBM chess-playing hardware that evaluated so many gazillions of board positions per turn with very clever heuristics to prune search. That didn’t play chess like a human. If the better language was like that, humans wouldn’t understand it. IBM Watson (natural language Jeopardy playing computer) was closer to behaving like humans with its chain of associative reasoning—to me, that’s the closest we’ve gotten to something I’d call “A.I.”. It’s a shame IBM’s oversold it since then.

Human-level general purpose A.I. is going to be an incredibly tough nut to crack. I don’t see any reason it’s an insurmountable goal. It’s not going to happen in a decade without a major breakthrough. Better classifiers just aren’t enough. People are very clever, insanely good at subtle chains of associative reasoning (though not so great at logic) and learning from limited examples (Andrew’s sister Susan Gelman, a professor at Michigan, studies concept learning by example). We’re also very contextually aware and focused, which allows us to go deep, but can cause us to miss the forest for the trees.

Watch out for naively (because implicitly based on flat-prior) Bayesian statements based on classical confidence intervals! (Comptroller of the Currency edition)

Laurent Belsie writes:

An economist formerly with the Consumer Financial Protection Bureau wrote a paper on whether a move away from forced arbitration would cost credit card companies money. He found that the results are statistically insignificant at the 95 percent (and 90 percent) confidence level.

But the Office of the Comptroller of the Currency used his figures to argue that although statistically insignificant at the 90 percent level, “an 88 percent chance of an increase of some amount and, for example, a 56 percent chance that the increase is at least 3 percentage points, is economically significant because the average consumer faces the risk of a substantial rise in the cost of their credit cards.”

The economist tells me it’s a statistical no-no to draw those inferences and he references your paper with John Carlin, “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors,” Perspectives on Psychological Science, 2014.

My question is: Is that statistical mistake a really subtle one and a common faux pas among economists – or should the OCC know better? If a lobbying group or a congressman made a rookie statistical mistake, I wouldn’t be surprised. But a federal agency?

The two papers in question are here and here.

My reply:

I don’t think it’s appropriate to talk about that 88 percent etc. for reasons discussed on pages 71-72 of this paper.

I can’t comment on the economics or the data, but if it’s just a question of summarizing the regression coefficient, you can’t say much more than to just give the 95% conf interval (estimate +/- 2 standard errors) and go from there.

Regarding your question: Yes, this sort of mistake is subtle and can be made by many people, including statisticians and economists. It’s not a surprise to see even a trained professional making this sort of error.

The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong.
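Here is a sketch of where numbers like "88 percent" come from and why they are fragile. The estimate and standard error below are hypothetical, chosen only so that the flat-prior calculation roughly reproduces the quoted 88 and 56 percent; they are not taken from the paper. The zero-centered normal prior with standard deviation 2 is likewise just an assumption for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_exceeds(threshold, mean, sd):
    """P(theta > threshold) when theta ~ normal(mean, sd)."""
    return 1.0 - normal_cdf((threshold - mean) / sd)

def normal_posterior(estimate, se, prior_mean, prior_sd):
    """Posterior mean and sd for a normal estimate combined with a normal prior."""
    w_data = 1.0 / se ** 2
    w_prior = 1.0 / prior_sd ** 2
    post_var = 1.0 / (w_data + w_prior)
    post_mean = post_var * (w_data * estimate + w_prior * prior_mean)
    return post_mean, math.sqrt(post_var)

# Hypothetical regression coefficient (percentage points) and standard error.
estimate, se = 3.44, 2.93

# Classical summary: estimate +/- 2 standard errors.
interval = (estimate - 2 * se, estimate + 2 * se)

# Flat prior: the posterior is just normal(estimate, se).
p_flat_positive = prob_exceeds(0.0, estimate, se)
p_flat_over_3 = prob_exceeds(3.0, estimate, se)

# Same data with an assumed prior saying large effects are unlikely a priori.
post_mean, post_sd = normal_posterior(estimate, se, prior_mean=0.0, prior_sd=2.0)
p_prior_positive = prob_exceeds(0.0, post_mean, post_sd)
p_prior_over_3 = prob_exceeds(3.0, post_mean, post_sd)

print(f"95% interval: [{interval[0]:.1f}, {interval[1]:.1f}]")
print(f"flat prior:   P(>0) = {p_flat_positive:.2f}, P(>3) = {p_flat_over_3:.2f}")
print(f"normal prior: P(>0) = {p_prior_positive:.2f}, P(>3) = {p_prior_over_3:.2f}")
```

The interval comfortably includes zero, yet the flat-prior reading of the same two numbers turns into confident-sounding probability statements. Under even a mildly skeptical prior those probabilities drop, which is the sense in which the flat-prior versions are too strong.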

“35. What differentiates solitary confinement, county jail and house arrest” and 70 others

Thomas Perneger points us to this amusing quiz on statistics terminology:

Lots more where that came from.

Postdocs and Research fellows for combining probabilistic programming, simulators and interactive AI

Here’s a great opportunity for those interested in probabilistic programming and workflows for Bayesian data analysis:

We (including me, Aki) are looking for outstanding postdoctoral researchers and research fellows to work on an exciting new project at the crossroads of probabilistic programming, simulator-based inference and user interfaces. You will have an opportunity to work with top research groups in the Finnish Center for Artificial Intelligence, at both Aalto University and the University of Helsinki, and to cooperate with several industry partners.

The topics for which we are recruiting are

  • Machine learning for simulator-based inference
  • Intelligent user interfaces and techniques for interacting with AI
  • Interactive workflow support for probabilistic programming based modeling

Find the full descriptions here

“Statistical and Machine Learning forecasting methods: Concerns and ways forward”

Roy Mendelssohn points us to this paper by Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos, which begins:

Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined.

and continues:

Moreover, we observed that their computational requirements are considerably greater than those of statistical methods.

Mendelssohn writes:

For time series, ML models don’t work as well as traditional methods, at least to date. I have read a little on some of the methods. Some have layers of NNs. The residuals from one layer are passed to the next. I would hate to guess what the “equivalent number of parameters” would be (yes I know these are non-parametric but there has to be a lot of over-fitting going on).

I haven’t looked much at these particular models, but for the general problem of “equivalent number of parameters,” let me point you to this paper and this paper with Aki et al.