
Hey! Here’s what to do when you have two or more surveys on the same population!

This problem comes up a lot: We have multiple surveys of the same population and we want a single inference. The usual approach, applied carefully by news organizations such as Real Clear Politics and Five Thirty Eight, and applied sloppily by various attention-seeking pundits every two or four years, is “poll aggregation”: you take the estimate from each poll separately, if necessary correct these estimates for bias, then combine them with some sort of weighted average.

But this procedure is inefficient and can lead to overconfidence (see discussion here, or just remember the 2016 election).

A better approach is to pool all the data from all the surveys together. A survey response is a survey response! Then when you fit your model, include indicators for the individual surveys (varying intercepts, maybe varying slopes too), and include that uncertainty in your inferences. Best of both worlds: you get the efficiency from counting each survey response equally, and you get an appropriate accounting of uncertainty from the multiple surveys.

OK, you can’t always do this: To do it, you need all the raw data from the surveys. But it’s what you should be doing, and if you can’t, you should recognize what you’re missing.
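
As a toy illustration of the efficiency point (with made-up numbers, and ignoring the house effects and survey-level intercepts that a real analysis would need), here's a sketch comparing the standard error of an equal-weight poll aggregate with that of the fully pooled estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.52                    # true population proportion (made up)
sizes = [200, 500, 2000]         # three surveys of very different sizes

# Simulate raw responses from each survey.
surveys = [rng.binomial(1, p_true, n) for n in sizes]

# Poll aggregation: average the per-poll estimates with equal weights.
poll_means = [s.mean() for s in surveys]
aggregated = float(np.mean(poll_means))
var_agg = sum(p_true * (1 - p_true) / n for n in sizes) / len(sizes) ** 2

# Pooling: every survey response counts equally.
pooled_responses = np.concatenate(surveys)
pooled = pooled_responses.mean()
var_pool = p_true * (1 - p_true) / len(pooled_responses)

print(f"aggregated estimate {aggregated:.3f}, SE {var_agg ** 0.5:.4f}")
print(f"pooled estimate     {pooled:.3f}, SE {var_pool ** 0.5:.4f}")
```

With unequal sample sizes, the equal-weight average wastes the large survey's precision, so the pooled standard error comes out noticeably smaller. Inverse-variance weighting would close that gap, which is the sense in which pooling and careful weighting coincide.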

2018: Who actually voted? (The real story, not the exit polls.)

Continuing from our earlier discussion . . . Yair posted some results from his MRP analysis of voter turnout:

1. The 2018 electorate was younger than in 2014, though not as young as exit polls suggest.

2. The 2018 electorate was also more diverse, with African American and Latinx communities surpassing their share of votes cast in 2014.

3. Voters in 2018 were more educated than all the years in our dataset going back to 2006. . . . the exit poll shows the opposite trend. As noted earlier, they substantially changed their weighting scheme on education levels, so these groups can’t be reliably compared across years in the exit poll.

Details here.

Matching (and discarding non-matches) to deal with lack of complete overlap, then regression to adjust for imbalance between treatment and control groups

John Spivack writes:

I am contacting you on behalf of the biostatistics journal club at our institution, the Mount Sinai School of Medicine. We are working Ph.D. biostatisticians and would like the opinion of a true expert on several questions having to do with observational studies—questions that we have not found to be well addressed in the literature.

And here are their questions:

(1) Among the popular implementations of propensity score-based methods for analyzing observational data (matching, stratification based on quintiles, and weighting by inverse probability of assigned treatment, say), is there a clear preference? Does the answer depend on data type?
I personally like stratification by quintiles of propensity score (followed by an analysis pooled over the quintile groups) because it is simple and robust, with no complicated matching algorithms and no difficult choices over different types of weights. Is this method acceptable for high-quality publications?

Also, given that the main threats to the validity of observational studies are elsewhere (unmeasured confounders, treatment heterogeneity, etc.), is the choice of which implementation of propensity score to use really as important as the volume of the literature devoted to the subject would suggest?

(2) Let’s say we’re estimating a treatment effect using an estimated-propensity-score-based reweighting of the sample (weighting by the inverse of the estimated probability of the assigned treatment, say). The authoritative references (e.g., Lunceford and Davidian) seem to say that one must take account of the fact that the weights are themselves estimated to produce a valid standard error for any final estimate of treatment effect. Complex formulas are sometimes provided for the standard errors of particular estimators, expressed, for instance, as a sandwich variance.
In practice, however, whether for weighted analyses or matched ones, we seldom make this kind of adjustment to the standard error and just proceed to do a ‘standard’ analysis of the weighted sample.

If you conceptualize the ‘experiment’ being performed as follows:

(a) Draw a sample from the (observational) population, including information on subjects’ covariates and treatment assignments
(b) Re-weight the sample by means of an estimated propensity score (estimated only using that sample).
(c) Observe the outcomes and perform the weighted analysis (for instance, using inverse-probability-of-assigned-treatment weights) to calculate an estimate of treatment effect.

Then, yes, over a large number of iterations of this experiment the sampling distribution of the estimate of treatment effect will be affected by the variation over multiple iterations of the covariate balance between treatment groups (and the resulting variation of the weights) and this will contribute to the variance of the sampling distribution of the estimator.

However, there is another point of view. If a clinical colleague showed up for a consultation with a ‘messy’ dataset from an imperfectly designed or imperfectly randomized study, we would often accept the dataset as-is, using remedies for covariate imbalances, outliers, etc. as needed, in hopes of producing a valid result. In effect, we would be conditioning on the given treatment assignments rather than attempting an unconditional analysis over the (unknowable) distribution of multiple bad study designs. It seems to me that there is nothing very wrong with this method.

Applied to a reweighted sample (or a matched sample), why would such a conditional analysis be invalid, provided we apply the same standards of model checking, covariate balance, etc. that we use in other possibly messy observational datasets? In fact, wouldn’t conditioning on all available information, including treatment assignments and sample covariate distributions, lead to greater efficiency, and be closer in spirit to established statistical principles (like the extended form of the ancillarity principle)? To a non-expert, wouldn’t this seem like a strong enough argument in favor of our usual way of doing things?

My reply:

(1) The concern in causal inference is mismatch between the treatment and control groups. I have found it helpful to distinguish between two forms of mismatch:
– lack of complete overlap on observed pre-treatment predictors
– imbalance on observed pre-treatment predictors

From my book with Jennifer, here are a couple pictures to distinguish the two concepts.

Lack of complete overlap:


The point of matching, as I see it, is to restrict the range of inference to the zone of overlap (that is, to define the average treatment effect in a region where the data can directly answer such a question). You match so as to get a subset of the data with complete overlap (in some sense) and then discard the points that did not match: the treated units for which there was no similar control unit, and the control units for which there was no treated unit.

Stratification and weighting are ways of handling imbalance. More generally, we can think of stratification and weighting as special cases of regression modeling, with the goal being to adjust for known differences between sample and population. But you can really only adjust for these differences where there is overlap. Outside the zone of overlap, your inferences for treatment effects are entirely assumption-based.

To put it another way, matching (and throwing away the non-matches) is about identification, or robustness. Stratification/weighting/regression are for bias correction.

Propensity scores are a low-dimensional approximation, a particular way of regularizing your regression adjustment. Use propensity scores if you want.

(2) If someone were to give me the propensity score, I’d consider using it as a regression predictor—not alone, but with other key predictors such as age, sex, smoking history, etc. I would not do inverse-propensity-score weighting, as this just seems like a way to add a lot of noise and produce some mysterious estimate that I wouldn’t trust anyway.

You write, “we would be conditioning on the given treatment assignments.” You always have to be conditioning on the given treatment assignments: that’s what you have! That said, you’ll want to know the distribution of what the treatment assignments could’ve been, had the experiment gone differently. That’s relevant to your interpretation of the data. If you do everything with a regression model, your implicit assumption is that treatment assignment depends only on the predictors in your model. That’s a well known principle in statistics and is discussed in various places including chapter 8 of BDA3 (chapter 7 in the earlier editions).

(3) One more thing. If you have lack of complete overlap and you do matching, you should also do regression. Matching to restrict your data to a region of overlap, followed by regression to adjust for imbalance. Don’t try to make matching do it all. This is an old, old point, discussed back in 1970 in Donald Rubin’s Ph.D. thesis.
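
Here's a minimal sketch of that two-step recipe on made-up simulated data with a single covariate: trim to the region of overlap first, then regression-adjust for the remaining imbalance. A real analysis would of course work with many covariates, typically compressed into a propensity score:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(0, 1, n)                           # one observed pre-treatment covariate
t = (x + rng.normal(0, 1, n) > 0).astype(float)   # assignment depends on x: imbalance
tau = 2.0                                         # true treatment effect (made up)
y = tau * t + 3.0 * x + rng.normal(0, 1, n)       # outcome depends strongly on x too

# Step 1 (matching's job): restrict to the region of overlap on x, discarding
# treated units with no comparable controls and vice versa.
lo = max(x[t == 1].min(), x[t == 0].min())
hi = min(x[t == 1].max(), x[t == 0].max())
keep = (x >= lo) & (x <= hi)

# Step 2 (regression's job): adjust for the remaining imbalance on x.
X = np.column_stack([np.ones(keep.sum()), t[keep], x[keep]])
beta, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
print(f"estimated effect {beta[1]:.2f} (true {tau})")
```

A raw difference in means on the trimmed sample would still be badly biased here, since the treated units have systematically higher x; the regression step is what removes that bias.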

2018: What really happened?

We’re always discussing election results on three levels: their direct political consequences, their implications for future politics, and what we can infer about public opinion.

In 2018 the Democrats broadened their geographic base, as we can see in this graph from Yair Ghitza:

Party balancing

At the national level, what happened is what we expected to happen two weeks ago, two months ago, and two years ago: the Democrats bounced back. Their average district vote for the House of Representatives increased by enough to give them clear control of the chamber, even in the face of difficulties of geography and partisan districting.

This was party balancing, which we talked about a few months ago: At the time of the election, the Republicans controlled the executive branch, both houses of congress, and the judiciary, so it made sense that swing voters were going to swing toward the Democrats. Ironically, one reason the Democrats did not regain the Senate in 2018 is . . . party balancing in 2016! Most people thought Hillary Clinton would win the presidency, so lots of people voted Republican for congress to balance that.

The swing in votes toward the Democrats was large (in the context of political polarization). As Nate Cohn wrote, the change in seats was impressive, given that there weren’t very many swing districts for the Democrats to aim for.

Meanwhile, as expected, the Senate remained in Republican control. Some close races went 51-49 rather than 49-51, which doesn’t tell us much about public opinion but is politically consequential.

Where did it happen?

The next question is geographic. Nationally, voters swung toward the Democrats. I was curious where this happened, so I did some googling and found this map by Josh Holder, Cath Levett, Daniel Levitt, and Peter Andringa:

This map omits districts that were uncontested in one election or the other, so I suspect it understates the swing, but it gives the general idea.

Here’s another way to look at the swings.

Yair made a graph of the vote swing from 2016 to 2018, for each election contested in both years, plotted against the Democratic share of the two-party vote in 2016.

The result was pretty stunning—so much so that I put the graph at the top of this post. So please scroll up and take a look again, then scroll back down here to keep reading.

Here’s the key takeaway. The Democrats’ biggest gains were in districts where the Republicans were dominant.

In fact, if you look at the graph carefully (and you also remember that we’re excluding uncontested elections, so we’re missing part of the story), you see the following:
– In strong Republican districts (D’s receiving less than 40% of the vote in 2016), Democrats gained almost everywhere, with an average gain of, ummm, it looks something like 8 percentage points.
– In swing districts (D’s receiving 40-60% of the vote in 2016), D’s improved, but only by about 4 percentage points on average. A 4% swing in the vote is a lot, actually! It’s just not 8%.
– In districts where D’s were already dominating, the results were, on average, similar to what happened in 2016.

I don’t know how much this was a national strategy and how much it just happened, but let me point out two things:

1. For the goal of winning the election, it would have been to the Democrats’ advantage to concentrate their gains in the zone where they’d received between 40 and 55% of the vote in the previous election. Conversely, these are the places where the Republicans would’ve wanted to focus their efforts too.

2. Speaking more generally, the Democrats have had a problem, both at the congressional and presidential levels, of “wasted votes”: winning certain districts with huge majorities and losing a lot of the closer districts. Thus, part of Democratic strategy has been to broaden their geographic base. The above scatterplot suggests that the 2018 election was a step in the right direction for them in this regard.

Not just a statistical artifact

When Yair sent me that plot, I had a statistical question: Could it be “regression to the mean”? We might expect, absent any election-specific information, that the D’s would improve in districts where they’d done poorly, and they’d decline in districts where they’d done well. So maybe I’ve just been overinterpreting a pattern that tells us nothing interesting at all?

To address this possible problem, Yair made two more graphs, repeating the above scatterplot, but showing the 2014-to-2016 shift vs. the 2014 results, and the 2012-to-2014 shift vs. the 2012 results. Here’s what he found:

So the answer is: No, it’s not regression to the mean, it’s not a statistical artifact. The shift from 2016 to 2018—the Democrats gaining strength in Republican strongholds—is real. And it can have implications for statewide and presidential elections as well. This is also consistent with results we saw in various special elections during the past two years.
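
To see why the check was worth doing: pure noise alone produces a negative swing-vs-prior-share slope. Here's a toy simulation (with made-up noise scales) in which district opinion doesn't change at all, yet the estimated swing still runs against the prior result; this is why comparing against earlier election cycles, as Yair did, is the right diagnostic:

```python
import numpy as np

rng = np.random.default_rng(2)
n_districts = 435
d = rng.normal(0.50, 0.12, n_districts)   # latent district partisanship (made up)
e1 = rng.normal(0, 0.05, n_districts)     # election-specific noise, year 1
e2 = rng.normal(0, 0.05, n_districts)     # election-specific noise, year 2
y1 = d + e1                               # observed Democratic share, year 1
y2 = d + e2                               # observed share, year 2: no real change

swing = y2 - y1
slope = np.cov(y1, swing)[0, 1] / np.var(y1, ddof=1)  # OLS slope of swing on y1
print(f"slope of swing vs. prior share: {slope:.3f}")
```

The slope comes out reliably negative even though nothing happened, so a negative swing-vs-prior pattern on its own proves little; what rules out the artifact is that the 2014-to-2016 and 2012-to-2014 plots show no comparable pattern.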

The current narrative is wrong

As Yair puts it:

Current narrative: Dems did better in suburban/urban, Reps did better in rural, continuing trend from 2012-2016. I [Yair] am seeing the opposite.

This isn’t increasing polarization/sorting. This also isn’t mean-reversion. D areas stayed D, R areas jumped ship to a large degree. A lot of these are the rural areas that went the other way from 2012-2016.

Not just in CD races, also in Gov and Senate races . . . Massive, 20+ point shift in margin in Trump counties. Remember this is with really high turnout. . . . This offsets the huge shift from 2012-2016, often in the famous “Obama-Trump” counties.

Yair adds:

The reason people are missing this story right now: focusing on who won/lost means they’re looking at places that went from just below 50 to just above 50. Obviously that’s more important for who actually governs. But this is the big public opinion shift.

I suspect that part of this could be strategy from both parties.

– On one side, the Democrats knew they had a problem in big swathes of the country and they made a special effort to run strong campaigns everywhere: part of this was good sense given their good showing in special elections, and part of it was an investment in their future, to lay out a strong Democratic brand in areas where they’ll need to be competitive in future statewide and presidential elections.

– On the other side, the Republicans had their backs to the wall and so they focused their effort on the seats they needed to hold if they had a chance of maintaining their House majority.

From that standpoint, the swings above do not completely represent natural public opinion swings from national campaigns. But they are meaningful: they’re real votes, and they’re in places where the Democrats need to gain votes in the future.

There are also some policy implications: If Democratic challengers are more competitive in previously solid Republican districts, enough so that the Republican occupants of these seats are more afraid of losing centrist votes in future general elections than losing votes on the right in future primaries, this could motivate these Republicans to vote more moderately in congress. I don’t know, but it seems possible.

I sent the above to Yair, and he added the following comments:
Continue reading ‘2018: What really happened?’ »

“Recapping the recent plagiarism scandal”

Benjamin Carlisle writes:

A year ago, I received a message from Anna Powell-Smith about a research paper written by two doctors from Cambridge University that was a mirror image of a post I wrote on my personal blog roughly two years prior. The structure of the document was the same, as was the rationale, the methods, and the conclusions drawn. There were entire sentences that were identical to my post. Some wording changes were introduced, but the words were unmistakably mine. The authors had also changed some of the details of the methods, and in doing so introduced technical errors, which confounded proper replication. The paper had been press-released by the journal, and even noted by Retraction Watch. . . .

At first, I was amused by the absurdity of the situation. The blog post was, ironically, a method for preventing certain kinds of scientific fraud. [Carlisle’s original post was called “Proof of prespecified endpoints in medical research with the bitcoin blockchain,” the paper that did the copying was called “How blockchain-timestamped protocols could improve the trustworthiness of medical science,” so it’s amusing that copying had been done on a paper on trustworthiness. — ed.] . . .

The journal did not catch the similarities between this paper and my blog in the first place, and the peer review of the paper was flawed as well.

OK, so far, nothing so surprising. Peer review often gets it wrong, and, in any case, you’re allowed to keep submitting your paper to new journals until it gets accepted somewhere. Indeed, F1000 Research is not a top journal, so maybe the paper was rejected a few times before appearing there. Or maybe not; I have no idea.

But then the real bad stuff started to happen. Here’s Carlisle:

After the journal’s examination of the case, they informed us that updating the paper to cite me after the fact would undo any harm done by failing to credit the source of the paper’s idea. A new version was hastily published that cited me, using a non-standard citation format that omitted the name of my blog, the title of my post, and the date of original publication.

Wow. That’s the kind of crap—making a feeble correction without giving any credit—that gets done by Reuters and Perspectives on Psychological Science. I hate to see the editor of a real journal act that way.

Carlisle continues:

I was shocked by the journal’s response. Authorship of a paper confers authority in a subject matter, and their cavalier attitude toward this, especially given the validity issues I had raised with them, seemed irresponsible to me. In the meantime, the paper was cited favourably by the Economist and in the BMJ, crediting Irving and Holden [the authors of the paper that copied Carlisle’s work].

I went to Retraction Watch with this story, which brought to light even more problems with this example of open peer review. The peer reviewers were interviewed, and rather than re-evaluating their support for the paper, they doubled down, choosing instead to disparage my professional work and call me a liar. . . .

The journal refused to retract the paper. It was excellent press for the journal and for the paper’s putative authors, and it would have been embarrassing for them to retract it. The journal had rolled out the red carpet for this paper after all, and it was quickly accruing citations.

That post appeared in June, 2017. But then I clicked on the link to the published article and found this:

So the paper did end up getting retracted—but, oddly enough, not for the plagiarism.

On the plus side, the open peer review is helpful. Much better than Perspectives on Psychological Science. Peer review is not perfect. But saying you do peer review, and then not doing it, that’s really bad.

The Carlisle story is old news, and I know that some people feel that talking about this sort of thing is a waste of time compared to doing real science. And, sure, I guess it is. But here’s the thing: fake science competes with real science. NPR, the Economist, Gladwell, Freakonomics, etc.: they’ll report on fake science instead of reporting on real science. After all, fake science is more exciting! When you’re not constrained by silly things such as data, replication, coherence with the literature, you can really make fast progress! The above story is interesting in that it appears to feature an alignment of low-quality research and unethical research practices. These two things don’t have to go together but often it seems that they do.

Melanie Mitchell says, “As someone who has worked in A.I. for decades, I’ve witnessed the failure of similar predictions of imminent human-level A.I., and I’m certain these latest forecasts will fall short as well.”

Melanie Mitchell‘s piece, Artificial Intelligence Hits the Barrier of Meaning (NY Times behind limited paywall), is spot-on regarding the hype surrounding the current A.I. boom. It’s soon to come out in book length from FSG, so I suspect I’ll hear about it again in the New Yorker.

Like Professor Mitchell, I started my Ph.D. at the tail end of the first A.I. revolution. Remember, the one based on rule-based expert systems? I went to Edinburgh to study linguistics and natural language processing because it was strong in A.I., computer science theory, linguistics, and cognitive science.

On which natural language tasks can computers outperform or match humans? Search is good, because computers are fast and it’s a task at which humans aren’t so hot. That includes things like speech-based call routing in heterogeneous call centers (something I worked on at Bell Labs).

Then there’s spell checking. That’s fantastic. It leverages simple statistics about word frequency and typos/brainos and is way better than most humans at spelling. The same algorithms are used for speech recognition and RNA-seq alignment to the genome. These all sprang from Claude Shannon’s 1948 paper, “A Mathematical Theory of Communication”, which has over 100K citations. It introduced, among other things, n-gram language models at the character and word level (still used for speech recognition and classification today with different estimators). As far as I know, that paper contained the first posterior predictive checks—generating examples from the trained language models and comparing them to real language. David MacKay’s info theory book (the only ML book I actually like) is a great introduction to this material, and even BDA3 added a spell-checking example. But it’s hardly A.I. in the big “I” sense of “A.I.”.
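
Shannon's generate-and-compare check is easy to sketch. Here's a toy word-bigram model (the corpus is made up; a real check would train on a large sample of text) that generates a sequence by repeatedly sampling a word that followed the current word in training:

```python
import random
from collections import defaultdict

# Tiny made-up corpus; a real check would train on a large sample of text.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat saw the dog .").split()

# Fit the bigram model: record which words followed each word.
follows = defaultdict(list)
for w1, w2 in zip(corpus, corpus[1:]):
    follows[w1].append(w2)

# Generate Shannon-style: repeatedly sample a successor of the current word,
# then eyeball how the output compares to real language.
rng = random.Random(0)
word = "the"
sample = [word]
for _ in range(10):
    word = rng.choice(follows[word])
    sample.append(word)
print(" ".join(sample))
```

The output is locally plausible and globally incoherent, which is exactly the comparison to real language that makes this a predictive check on the model.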

Speech recognition has made tremendous strides (I worked on it at Bell Labs in the late 90s and then at SpeechWorks in the early 00s), but its performance is still so far short of human levels as to make the difference more qualitative than quantitative, a point Mitchell makes in her essay. It would no more fool you into thinking it was human than an animatronic Disney character bolted to the floor. Unlike with games like chess or Go, it’s going to be hard to do better than people at language, though it should certainly be possible. But it would be hard to do it the same way they built, say, Deep Blue, the IBM chess-playing hardware that evaluated gazillions of board positions per turn with very clever heuristics to prune search. That didn’t play chess like a human. If better-than-human language were produced like that, humans wouldn’t understand it. IBM Watson (the natural-language Jeopardy-playing computer) came closer to behaving like a human with its chains of associative reasoning—to me, that’s the closest we’ve gotten to something I’d call “A.I.” It’s a shame IBM has oversold it since then.

Human-level general-purpose A.I. is going to be an incredibly tough nut to crack. I don’t see any reason it’s an insurmountable goal, but it’s not going to happen in a decade without a major breakthrough. Better classifiers just aren’t enough. People are very clever, insanely good at subtle chains of associative reasoning (though not so great at logic) and learning from limited examples (Andrew’s sister Susan Gelman, a professor at Michigan, studies concept learning by example). We’re also very contextually aware and focused, which allows us to go deep, but can cause us to miss the forest for the trees.

Watch out for naively (because implicitly based on flat-prior) Bayesian statements based on classical confidence intervals! (Comptroller of the Currency edition)

Laurent Belsie writes:

An economist formerly with the Consumer Financial Protection Bureau wrote a paper on whether a move away from forced arbitration would cost credit card companies money. He found that the results are statistically insignificant at the 95 percent (and 90 percent) confidence level.

But the Office of the Comptroller of the Currency used his figures to argue that although statistically insignificant at the 90 percent level, “an 88 percent chance of an increase of some amount and, for example, a 56 percent chance that the increase is at least 3 percentage points, is economically significant because the average consumer faces the risk of a substantial rise in the cost of their credit cards.”

The economist tells me it’s a statistical no-no to draw those inferences and he references your paper with John Carlin, “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors,” Perspectives on Psychological Science, 2014.

My question is: Is that statistical mistake a really subtle one and a common faux pas among economists – or should the OCC know better? If a lobbying group or a congressman made a rookie statistical mistake, I wouldn’t be surprised. But a federal agency?

The two papers in question are here and here.

My reply:

I don’t think it’s appropriate to talk about that 88 percent etc. for reasons discussed in pages 71-72 of this paper.

I can’t comment on the economics or the data, but if it’s just a question of summarizing the regression coefficient, you can’t say much more than to give the 95% confidence interval (estimate +/- 2 standard errors) and go from there.

Regarding your question: Yes, this sort of mistake is subtle and can be made by many people, including statisticians and economists. It’s not a surprise to see even a trained professional making this sort of error.

The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong.
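
To see where a figure like "an 88 percent chance of an increase" comes from, and why it's too strong: under a flat prior the posterior is just Normal(estimate, se), so the probability of a positive effect is Phi(estimate/se). Here's a sketch with hypothetical numbers (chosen so the flat-prior probability is roughly 88%) showing how even a mildly skeptical prior pulls that probability down:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical numbers, chosen so the flat-prior posterior probability of a
# positive effect is roughly the 88% figure quoted by the OCC.
est, se = 1.175, 1.0

# Flat prior: posterior is Normal(est, se), so P(effect > 0) = phi(est / se).
p_flat = phi(est / se)

# Mildly skeptical prior Normal(0, tau): posterior is the precision-weighted
# combination of prior and data.
tau = 1.0
post_var = 1 / (1 / se**2 + 1 / tau**2)
post_mean = post_var * (est / se**2)
p_skeptical = phi(post_mean / sqrt(post_var))

print(f"flat-prior P(effect > 0):      {p_flat:.2f}")
print(f"skeptical-prior P(effect > 0): {p_skeptical:.2f}")
```

The flat-prior number is not wrong arithmetic; it is the Bayesian answer to a question whose prior nobody would defend, which is why such probabilities tend to be too strong.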

“35. What differentiates solitary confinement, county jail and house arrest” and 70 others

Thomas Perneger points us to this amusing quiz on statistics terminology:

Lots more where that came from.

Postdocs and Research fellows for combining probabilistic programming, simulators and interactive AI

Here’s a great opportunity for those interested in probabilistic programming and workflows for Bayesian data analysis:

We (including me, Aki) are looking for outstanding postdoctoral researchers and research fellows to work on an exciting new project at the crossroads of probabilistic programming, simulator-based inference, and user interfaces. You will have an opportunity to work with top research groups in the Finnish Center for Artificial Intelligence, at both Aalto University and the University of Helsinki, and to cooperate with several industry partners.

The topics for which we are recruiting are:

  • Machine learning for simulator-based inference
  • Intelligent user interfaces and techniques for interacting with AI
  • Interactive workflow support for probabilistic programming based modeling

Find the full descriptions here.

“Statistical and Machine Learning forecasting methods: Concerns and ways forward”

Roy Mendelssohn points us to this paper by Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos, which begins:

Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined.

and continues:

Moreover, we observed that their computational requirements are considerably greater than those of statistical methods.

Mendelssohn writes:

For time series, ML models don’t work as well as traditional methods, at least to date. I have read a little on some of the methods. Some have layers of NNs. The residuals from one layer are passed to the next. I would hate to guess what the “equivalent number of parameters” would be (yes I know these are non-parametric but there has to be a lot of over-fitting going on).

I haven’t looked much at these particular models, but for the general problem of “equivalent number of parameters,” let me point you to this paper and this paper with Aki et al.

Why it can be rational to vote

Just a reminder.

The purported CSI effect and the retroactive precision fallacy

Regarding our recent post on the syllogism that ate science, someone points us to this article, “The CSI Effect: Popular Fiction About Forensic Science Affects Public Expectations About Real Forensic Science,” by N. J. Schweitzer and Michael J. Saks.

We’ll get to the CSI Effect in a bit, but first I want to share the passage from the article that my correspondent pointed out. It’s this bit from footnote 16:

Preliminary analyses (power analyses) suggested a sample size of 48 would be sufficient to detect the CSI effect, if it existed. In addition, the subsequent significance tests adjust for sample size by holding smaller samples to a higher standard when determining statistical significance. In other words, finding that a difference is statistically significant is the same as saying that the sample size was of sufficient size to test for the effect.

Emphasis added. This is a great quote because it expresses this error so clearly. What to call it? The “retroactive precision fallacy”?

For a skeptical take on the CSI effect, see this article by Jason Chin and Larysa Workewych, which begins:

The CSI effect posits that exposure to television programs that portray forensic science (e.g., CSI: Crime Scene Investigation) can change the way jurors evaluate forensic evidence. We review (1) the theory behind the CSI effect; (2) the perception of the effect among legal actors; (3) the academic treatment of the effect; and, (4) how courts have dealt with the effect. We demonstrate that while legal actors do see the CSI effect as a serious issue, there is virtually no empirical evidence suggesting it is a real phenomenon. Moreover, many of the remedies employed by courts may do no more than introduce bias into juror decision-making or even trigger the CSI effect when it would not normally occur.

My correspondent writes:

Some people were worried that the sophisticated version of CSI that is portrayed on TV sets up an unrealistic image and so jurors (who watch the show) will be more critical of the actual evidence, which is much lower tech. There have been a handful of studies trying to demonstrate this and two did (including the one at issue).

I was pretty shocked at the poor level of rigour across the board – I think that’s what happens when legal scholars (the other study to show the effect was done by a judge) try to do empirical/behavioural work.

The truly sad thing is that many courts give “anti-CSI Effect” instructions based on these two studies that seem to show the effect. Those instructions do seem to be damaging to me – the judge tells the jury that the prosecution need not bring forensic evidence at all. The number of appeals and court time spent on this shoddy line of research is also a bit problematic.

So, two issues here. First, is the CSI effect “real” (that is, is this a large and persistent effect)? Second, the article on the CSI effect demonstrates a statistical fallacy: the view that, once a statistically significant result has been found, all concerns about inferential uncertainty due to variation in the data are retroactively removed.
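The fallacy is easy to demonstrate by simulation. The sketch below uses made-up numbers (a small true effect, 24 subjects per group, matching the footnote’s n = 48): an underpowered study can still produce statistically significant results, but the significant estimates greatly exaggerate the true effect, so reaching significance does not retroactively certify that the sample was big enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.1   # hypothetical small true effect, in sd units
sigma = 1.0
n = 24              # per group, i.e. 48 subjects total as in the footnote
n_sims = 100_000

# simulate many two-group experiments
x = rng.normal(true_effect, sigma, (n_sims, n))
y = rng.normal(0.0, sigma, (n_sims, n))
t, p = stats.ttest_ind(x, y, axis=1)
sig = p < 0.05

power = sig.mean()
# average estimated effect among the "significant" studies, vs. the truth
exaggeration = (x[sig].mean(axis=1) - y[sig].mean(axis=1)).mean() / true_effect
print(f"power: {power:.2f}")
print(f"exaggeration among significant results: {exaggeration:.1f}x")
```

Under these assumed numbers the power is well below 10%, and the studies that happen to reach p < 0.05 overestimate the effect severalfold: this is the Type M (magnitude) error that the footnote’s reasoning ignores.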

Cornell prof (but not the pizzagate guy!) has one quick trick to getting 1700 peer reviewed publications on your CV

From the university webpage:

Robert J. Sternberg is Professor of Human Development in the College of Human Ecology at Cornell University. . . . Sternberg is the author of over 1700 refereed publications. . . .

How did he compile over 1700 refereed publications? Nick Brown tells the story:

I [Brown] was recently contacted by Brendan O’Connor, a graduate student at the University of Leicester, who had noticed that some of the text in Dr. Sternberg’s many articles and chapters appeared to be almost identical. . . .

Exhibit 1 . . . this 2010 article by Dr. Sternberg was basically a mashup of this article of his from the same year and this book chapter of his from 2002. One of the very few meaningful differences in the chunks that were recycled between the two 2010 articles is that the term “school psychology” is used in the mashup article to replace “cognitive education” from the other; this may perhaps not be unrelated to the fact that the former was published in School Psychology International (SPI) and the latter in the Journal of Cognitive Education and Psychology (JCEP). If you want to see just how much of the SPI article was recycled from the other two sources, have a look at this. Yellow highlighted text is copied verbatim from the 2002 chapter, green from the JCEP article. You can see that about 95% of the text is in one or the other colour . . .

Brown remarks:

Curiously, despite Dr. Sternberg’s considerable appetite for self-citation (there are 26 citations of his own chapters or articles, plus 1 of a chapter in a book that he edited, in the JCEP article; 25 plus 5 in the SPI article), neither of the 2010 articles cites the other, even as “in press” or “manuscript under review”; nor does either of them cite the 2002 book chapter. If previously published work is so good that you want to copy big chunks from it, why would you not also cite it?

Hmmmmm . . . I have an idea! Sternberg wants to increase his citation count. So he cites himself all the time. But he doesn’t want people to know that he publishes essentially the same paper over and over again. So in those cases, he doesn’t cite himself. Cute, huh?

Brown continues:

Exhibit 2

Inspired by Brendan’s discovery, I [Brown] decided to see if I could find any more examples. I downloaded Dr. Sternberg’s CV and selected a couple of articles at random, then spent a few minutes googling some sentences that looked like the kind of generic observations that an author in search of making “efficient” use of his time might want to re-use. On about the third attempt, after less than ten minutes of looking, I found a pair of articles, from 2003 and 2004, by Dr. Sternberg and Dr. Elena Grigorenko, with considerable overlaps in their text. About 60% of the text in the later article (which is about the general school student population) has been recycled from the earlier one (which is about gifted children) . . .

Neither of these articles cites the other, even as “in press” or “manuscript in preparation”.

And there’s more:

Exhibit 3

I [Brown] wondered whether some of the text that was shared between the above pair of articles might have been used in other publications as well. It didn’t take long(*) to find Dr. Sternberg’s contribution (chapter 6) to this 2012 book, in which the vast majority of the text (around 85%, I estimate) has been assembled almost entirely from previous publications: chapter 11 of this 1990 book by Dr. Sternberg (blue), this 1998 chapter by Dr. Janet Davidson and Dr. Sternberg (green), the above-mentioned 2003 article by Dr. Sternberg and Dr. Grigorenko (yellow), and chapter 10 of this 2010 book by Dr. Sternberg, Dr. Linda Jarvin, and Dr. Grigorenko (pink). . . .

Once again, despite the fact that this chapter cites 59 of Dr. Sternberg’s own publications and another 10 chapters by other people in books that he (co-)edited, none of those citations are to the four works that were the source of all the highlighted text in the above illustration.

Now, sometimes one finds book chapters that are based on previous work. In such cases, it is the usual practice to include a note to that effect. And indeed, two chapters (numbered 26 and 27) in that 2012 book edited by Dr. Dawn Flanagan and Dr. Patti Harrison, contain an acknowledgement along the lines of “This chapter is adapted from [source]. Copyright 20xx by [copyright holder]. Adapted by permission”. But there is no such disclosure in chapter 6.

Exhibit 4

It appears that Dr. Sternberg has assembled a chapter almost entirely from previous work on more than one occasion. Here’s a recent example of a chapter made principally from his earlier publications. . . .

This chapter cites 50 of Dr. Sternberg’s own publications and another 7 chapters by others in books that he (co-)edited. . . .

However, none of the citations of that book indicate that any of the text taken from it is being correctly quoted, with quote marks (or appropriate indentation) and a page number. The four other books from which the highlighted text was taken were not cited. No disclosure that this chapter has been adapted from previously published material appears in the chapter, or anywhere else in the 2017 book . . .

In the context of a long and thoughtful discussion, James Heathers supplies the rules from the American Psychological Association code of ethics:

And here’s Cornell’s policy:

OK, that’s the policy for Cornell students. Apparently not the policy for faculty.

One more thing

Bobbie Spellman, former editor of the journal Perspectives on Psychological Science, is confident “beyond a reasonable doubt” that Sternberg was not telling the truth when he said that “all papers in Perspectives go out for peer review, including his own introductions and discussions.” Unless, as Spellman puts it, “you believe that ‘peer review’ means asking some folks to read it and then deciding whether or not to take their advice before you approve publication of it.”

So, there you have it. The man is obsessed with citing his own work—except on the occasions when he does a cut-and-paste job, in which case he is suddenly shy about mentioning his other publications. And, as editor, he reportedly says he sends out everything for peer review, but then doesn’t.

P.S. From his (very long) C.V.:

Sternberg, R. J. (2015). Epilogue: Why is ethical behavior challenging? A model of ethical reasoning. In R. J. Sternberg & S. T. Fiske (Eds.), Ethical challenges in the behavioral and brain sciences: Case studies and commentaries (pp. 218-226). New York: Cambridge University Press.

This guy should join up with Bruno Frey and Brad Bushman: the 3 of them would form a very productive Department of Cut and Paste. Department chair? Ed Wegman, of course.

“We are reluctant to engage in post hoc speculation about this unexpected result, but it does not clearly support our hypothesis”

Brendan Nyhan and Thomas Zeitzoff write:

The results do not provide clear support for the lack-of-control hypothesis. Self-reported feelings of low and high control are positively associated with conspiracy belief in observational data (model 1; p<.05 and p<.01, respectively). We are reluctant to engage in post hoc speculation about this unexpected result, but it does not clearly support our hypothesis. Moreover, our experimental treatment effect estimate for our low-control manipulation is null relative to both the high-control condition (the preregistered hypothesis test) as well as the baseline condition (a RQ) in both the combined (table 2) and individual item results (table B7). Finally, we find no evidence that the association with self-reported feelings of control in model 1 of table 2 or the effect of the control treatments in model 2 are moderated by anti-Western or anti-Jewish attitudes (results available on request). Our expectations are thus not supported.

It is good to see researchers openly express their uncertainty and be clear about the limitations of their data.

“Simulations are not scalable but theory is scalable”

Eren Metin Elçi writes:

I just watched this video on the value of theory in applied fields (like statistics), it really resonated with my previous research experiences in statistical physics and on the interplay between randomised perfect sampling algorithms and Markov Chain mixing as well as my current perspective on the status quo of deep learning. . . .

So essentially in this post I give more evidence for [the] statements “simulations are not scalable but theory is scalable” and “theory scales” from different disciplines. . . .

The theory of finite size scaling in statistical physics: I devoted quite a significant amount of my PhD and post-doc research to finite size scaling, where I applied and checked the theory of finite size scaling for critical phenomena. In a nutshell, the theory of finite size scaling allows us to study the behaviour and infer properties of physical systems in thermodynamic limits (close to phase transitions) through simulating (sequences) of finite model systems. This is required, since our current computational methods are far from being, and probably will never be, able to simulate real physical systems. . . .

Here comes a question I have been thinking about for a while . . . is there a (universal) theory that can quantify how deep learning models behave on larger problem instances, based on results from sequences of smaller problem instances? As an example, how do we have to adapt a, say, convolutional neural network architecture and its hyperparameters to sequences of larger (unexplored) problem instances (e.g. increasing the resolution of colour fundus images for the diagnosis of diabetic retinopathy, see “Convolutional Neural Networks for Diabetic Retinopathy” [4]) in order to guarantee a fixed precision over the whole sequence of problem instances without the need of ad-hoc and manual adjustments to the architecture and hyperparameters for each new problem instance? A very early approach of a finite size scaling analysis of neural networks (admittedly for a rather simple “architecture”) can be found here [5]. An analogue to this, which just crossed my mind, is the study of Markov chain mixing times . . .

It’s so wonderful to learn about these examples where my work is inspiring young researchers to look at problems in new ways!

My two talks in Austria next week, on two of your favorite topics!

Innsbruck, 7 Nov 2018:

The study of American politics as a window into understanding uncertainty in science

We begin by discussing recent American elections in the context of political polarization, and we consider similarities and differences with European politics. We then discuss statistical challenges in the measurement of public opinion: inference from opinion polls with declining response rates has much in common with challenges in big-data analytics. From here we move to the recent replication crisis in science, and we argue that Bayesian methods are well suited to resolve some of these problems, if researchers can move away from inappropriate demands for certainty. We illustrate with examples in many different fields of research, our own and others’.

Some background reading:

19 things we learned from the 2016 election (with Julia Azari),
The mythical swing voter (with Sharad Goel, Doug Rivers, and David Rothschild).
The failure of null hypothesis significance testing when studying incremental changes, and what to do about it.
Honesty and transparency are not enough.
The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

Vienna, 9 Nov 2018:

Bayesian Workflow

Methods in statistics and data science are often framed as solutions to particular problems, in which a particular model or method is applied to a dataset. But good practice typically requires multiplicity, in two dimensions: fitting many different models to better understand a single dataset, and applying a method to a series of different but related problems. To understand and make appropriate inferences from real-world data analysis, we should account for the set of models we might fit, and for the set of problems to which we would apply a method. This is known as the reference set in frequentist statistics or the prior distribution in Bayesian statistics. We shall discuss recent research of ours that addresses these issues, involving the following statistical ideas: Type M errors, the multiverse, weakly informative priors, Bayesian stacking and cross-validation, simulation-based model checking, divide-and-conquer algorithms, and validation of approximate computations.

Some background reading:

Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors (with John Carlin).
Increasing transparency through a multiverse analysis (with Sara Steegen, Francis Tuerlinckx, and Wolf Vanpaemel).
Prior choice recommendations wiki (with Daniel Simpson and others).
Using stacking to average Bayesian predictive distributions (with Yuling Yao, Aki Vehtari, and Daniel Simpson).
Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC (with Aki Vehtari and Jonah Gabry).
Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data (with Aki Vehtari, Tuomas Sivula, Pasi Jylanki, Dustin Tran, Swupnil Sahai, Paul Blomstedt, John Cunningham, David Schiminovich, and Christian Robert).
Yes, but did it work?: Evaluating variational inference (with Yuling Yao, Aki Vehtari, and Daniel Simpson).
Visualization in Bayesian workflow (with Jonah Gabry, Daniel Simpson, Aki Vehtari, and Michael Betancourt).

I felt like in Vienna I should really speak on this paper, but I don’t think I’ll be talking to an audience of philosophers. I guess the place has changed a bit since 1934.

P.S. I was careful to arrange zero overlap between the two talks. Realistically, though, I don’t expect many people to go to both!

“What Happened Next Tuesday: A New Way To Understand Election Results”

Yair just published a long post explaining (a) how he and his colleagues use Mister P and the voter file to get fine-grained geographic and demographic estimates of voter turnout and vote preference, and (b) why this makes a difference.

The relevant research paper is here.

As Yair says in his above-linked post, he and others are now set up to report adjusted pre-election poll data on election night or shortly after, as a replacement for exit polls, which are so flawed.

Facial feedback: “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.”

Fritz Strack points us to this article, “When Both the Original Study and Its Failed Replication Are Correct: Feeling Observed Eliminates the Facial-Feedback Effect,” by Tom Noah, Yaacov Schul, and Ruth Mayo, who write:

According to the facial-feedback hypothesis, the facial activity associated with particular emotional expressions can influence people’s affective experiences. Recently, a replication attempt of this effect in 17 laboratories around the world failed to find any support for the effect. We hypothesize that the reason for the failure of replication is that the replication protocol deviated from that of the original experiment in a critical factor. In all of the replication studies, participants were alerted that they would be monitored by a video camera, whereas the participants in the original study were not monitored, observed, or recorded. . . . we replicated the facial-feedback experiment in 2 conditions: one with a video-camera and one without it. The results revealed a significant facial-feedback effect in the absence of a camera, which was eliminated in the camera’s presence. These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.

We’ve discussed the failed replications of facial feedback before, so it seemed worth following up with this new paper that provides an explanation for the failed replication that preserves the original effect.

Here are my thoughts.

1. The experiments in this new paper are preregistered. I haven’t looked at the preregistration plan, but even if not every step was followed exactly, preregistration does seem like a good step.

2. The main finding is that facial feedback worked in the no-camera condition but not in the camera condition:

3. As you can almost see in the graph, the difference between these results is not itself statistically significant—not at the conventional p=0.05 level for a two-sided test. The result has a p-value of 0.102, which the authors describe as “marginally significant in the expected direction . . . . p=.051, one-tailed . . .” Whatever. It is what it is.
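For what it’s worth, the two reported numbers are consistent with each other: for a symmetric test statistic, the two-sided p-value is exactly twice the one-sided one. A quick check, assuming a normal-theory test:

```python
from scipy import stats

# recover the test statistic implied by the reported one-tailed p = .051
z = stats.norm.isf(0.051)            # upper-tail quantile, about 1.64
p_two_sided = 2 * stats.norm.sf(z)   # symmetric distribution: double the tail
print(round(p_two_sided, 3))         # 0.102
```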

4. The authors are playing a dangerous game when it comes to statistical power. From one direction, I’m concerned that the studies are way too noisy: it says that their sample size was chosen “based on an estimate of the effect size of Experiment 1 by Strack et al. (1988),” but for the usual reasons we can expect that to be a huge overestimate of effect size, hence the real study has nothing like 80% power. From the other direction, the authors use low power to explain away non-statistically-significant results (“Although the test . . . was greatly underpowered, the preregistered analysis concerning the interaction . . . was marginally significant . . .”).
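To see how much this matters, here is an illustration with hypothetical numbers (the effect sizes below are made up for the sketch, not taken from the paper): if the sample size is chosen to give 80% power under a published estimate of d = 0.8, but the true effect is only d = 0.3, the study’s real power is nowhere near 80%.

```python
from scipy import stats

def power_two_sample(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test at effect size d (Cohen's d)."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5        # noncentrality parameter
    t_crit = stats.t.isf(alpha / 2, df)
    # probability the noncentral t statistic lands in either rejection region
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

n = 26  # per group: roughly the n that gives 80% power if d = 0.8 were true
print(f"nominal power at d=0.8: {power_two_sample(0.8, n):.2f}")
print(f"actual power at d=0.3:  {power_two_sample(0.3, n):.2f}")
```

Under these assumed numbers the “80% power” design in fact has power under 20%, which is why basing a power analysis on a published (and likely exaggerated) effect size is so treacherous.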

5. I’m concerned that the study is too noisy, and I’d prefer a within-person experiment.

6. In their discussion section, the authors write:

Psychology is a cumulative science. As such, no single study can provide the ultimate, final word on any hypothesis or phenomenon. As researchers, we should strive to replicate and/or explicate, and any one study should be considered one step in a long path. In this spirit, let us discuss several possible ways to explain the role that the presence of a camera can have on the facial-feedback effect.

That’s all reasonable. I think the authors should also consider the hypothesis that what they’re seeing is more noise. Their theory could be correct, but another possibility is that they’re chasing another dead end. This sort of thing can happen when you stare really hard at noisy data.

7. The authors write, “These findings suggest that minute differences in the experimental protocol might lead to theoretically meaningful changes in the outcomes.” I have no idea, but if this is true, it would definitely be good to know.

8. The treatments are for people to hold a pen in their lips or their teeth in some specified ways. It’s not clear to me why any effects of these treatments (assuming the effects end up being reproducible) should be attributed to facial feedback rather than some other aspect of the treatment such as priming or implicit association. I’m not saying there isn’t facial feedback going on; I just have no idea. I agree with the authors that their results are consistent with the facial-feedback model.

P.S. Strack also points us to this further discussion by E. J. Wagenmakers and Quentin Gronau, which I largely find reasonable, but I disagree with their statement regarding “the urgent need to preregister one’s hypotheses carefully and comprehensively, and then religiously stick to the plan.” Preregistration is fine, and I agree with their statement that generating fake data is a good way to test it out (one can also preregister using alternative data sets, as here), but I hardly see it as “urgent.” It’s just one part of the picture.

Raghuveer Parthasarathy’s big idea for fixing science

Raghuveer Parthasarathy writes:

The U.S. National Science Foundation ran an interesting call for proposals recently called the “Idea Machine,” aiming to gather “Big Ideas” to shape the future of research. It was open not just to scientists, but to anyone interested in potentially identifying grand challenges and new directions.

He continues:

(i) There are non-obvious, or unpopular, ideas that are important. I’ll perhaps discuss this in a later post. (What might you come up with?)

(ii) There is a very big idea, perhaps bigger than all the others, that I’d bet isn’t one of the ~1000 other submissions: fixing science itself.

And then he presents his Big Idea: A Sustainable Scientific Enterprise:

The scientific enterprise has never been larger, or more precarious. Can we reshape publicly funded science, matching trainees to viable careers, fostering reproducibility, and encouraging risk?

Science has transformed civilization. This statement is so obviously true that it can come as a shock to learn of the gloomy view that many scientists have of the institutions, framework, and organizational structure of contemporary scientific research. Issues of reproducibility plague many fields, fueled in part by structural incentives for eye-catching but fragile results. . . . over 2 million scientific papers are published each year . . . representing both a steady increase in our understanding of the universe and a barrage of noise driven by pressures to generate output. All of these issues together limit the ability of scientists and of science to tackle important questions that humanity faces. A grand challenge for science, therefore, is to restructure the scientific enterprise to make it more sustainable, productive, and capable of driving innovation. . . .

Methods of scholarly communication that indicate progress in small communities can easily become simple tick-boxes of activity in large, impersonal systems. Continual training of students as new researchers, who then train more students, is very effective for exponentially expanding a small community, as was the goal in the U.S. after World War II, but is clearly incompatible with a sustainable, stable population. The present configuration is so well-suited to expansion, and so ill-suited to stability . . .

It is hard to overstate the impact of science on society: every mobile phone, DNA test, detection of a distant planet, material phase transition, airborne jetliner, radio-tracked wolf, and in-vitro fertilized baby is a testament to the stunning ability of our species to explore, understand, and engineer the natural world. There are many challenges that remain unsolved . . .

In some fields, a lot of what’s published is wrong. More commonly, much of what’s published is correct but minor and unimportant. . . . Of course, most people don’t want to do boring work; the issue is one of structures and incentives. [The paradox is that funding agencies always want proposals to aim high, and journals always want exciting papers, but they typically want a particular sort of ambition, a particular sort of exciting result—the new phenomenon, the cure for cancer, etc., which paradoxically is often associated with boring, mechanistic, simplistic models of the world. — ed.] . . .

Ultimately, the real test of scientific reforms is the progress we make on “big questions.” We will hopefully look back on the post-reform era as the one in which challenges related to health, energy and the environment, materials, and more were tackled with unprecedented success. . . . science thrives by self-criticism and skepticism, which should be applied to the institutions of science as well as its subject matter if we are to maximize our chances of successfully tackling the many complex challenges our society, and our planet, face.

“Radio-tracked wolf” . . . I like that!

In all seriousness, I like this Big Idea a lot. It’s very appropriate for NSF, and I think it should and does have a good chance of being a winner in this competition. I submitted a Big Idea too—Truly Data-Based Decision Making—and I like it, I think it’s great stuff, indeed it’s highly compatible with Parthasarathy’s. He’s tackling the social and institutional side of the problem, while my proposal is more technical. They go together. Whether or not either of our ideas is selected in this particular competition, I hope NSF takes Parthasarathy’s ideas seriously and moves toward implementing some of them.

You are welcome in comments to discuss non-obvious, or unpopular, ideas that are important. (Please do better than this list; thank you.)

“2010: What happened?” in light of 2018

Back in November 2010 I wrote an article that I still like, attempting to answer the question: “How could the voters have swung so much in two years? And, why didn’t Obama give Americans a better sense of his long-term economic plan in 2009, back when he still had a political mandate?”

My focus was on the economic slump at the time: how it happened, what were the Obama team’s strategies for boosting the economy, and in particular why the Democrats didn’t do more to prime the pump in 2009-2010, when they controlled the presidency and both houses of congress and had every motivation to get the economy moving again.

As I wrote elsewhere, I suspect that, back when Obama was elected in 2008 in the midst of an economic crisis, lots of people thought it was 1932 all over again, but it was really 1930:

Obama’s decisive victory echoed Roosevelt’s in 1932. But history doesn’t really repeat itself. . . With his latest plan of a spending freeze, Obama is being labeled by many liberals as the second coming of Herbert Hoover—another well-meaning technocrat who can’t put together a political coalition to do anything to stop the slide. Conservatives, too, may have switched from thinking of Obama as a scary realigning Roosevelt to viewing him as a Hoover from their own perspective—as a well-meaning fellow who took a stock market crash and made it worse through a series of ill-timed government interventions.

My take on all this in 2010 was that, when they came into office, the Obama team was expecting a recovery in any case (as in this notorious graph) and, if anything, were concerned about reheating the economy too quickly.

My perspective on this is a mix of liberal and conservative perspectives: liberal, or Keynesian, in that I’m accepting the idea that government spending can stimulate the economy and do useful things; conservative in that I’m accepting the idea that there’s some underlying business cycle or reality that governments will find it difficult to avoid. “I was astonished to see the recession in Baghdad, for I had an appointment with him tonight in Samarra.”

I have no deep understanding of macroeconomics, though, so you can think of my musings here as representing a political perspective on economic policy—a perspective that is relevant, given that I’m talking about the actions of politicians.

In any case, a big story of the 2010 election was a feeling that Obama and the Democrats were floundering on the economy, which added some force to the expected “party balancing” in which the out-party gains in congress in the off-year election.

That was then, this is now

Now on to 2018, where the big story is, and has been, the expected swing toward the Democrats (party balancing plus the unpopularity of the president), but where the second biggest story is that, yes, Trump and his party are unpopular, but not as unpopular as he was a couple months ago. And a big part of that story is the booming economy, and a big part of that story is the large and increasing budget deficit, which defies Keynesian and traditional conservative prescriptions (you’re supposed to run a surplus, not a deficit, in boom times).

From that perspective, I wonder if the Republicans’ current pro-cyclical fiscal policy, so different from traditional conservative recommendations, is consistent with a larger pattern in the last two years in which the Republican leadership feels that it’s living on borrowed time. The Democrats received more votes in the last presidential election and are expected to outpoll the Republicans in the upcoming congressional elections too, so the Republicans may well feel more pressure to get better economic performance now, both to keep themselves in power by keeping the balls in the air as long as possible, and because if they’re gonna lose power, they want to grab what they can when they can still do it.

In contrast, the Democratic leadership in 2008 expected to be in charge for a long time, so (a) they were in no hurry to implement policies that they could do at their leisure, and (b) they just didn’t want to screw things up and lose their permanent majority.

Different perspectives and expectations lead to different strategies.