
“Using numbers to replace judgment”

Julian Marewski and Lutz Bornmann write:

In science and beyond, numbers are omnipresent when it comes to justifying different kinds of judgments. Which scientific author, hiring committee-member, or advisory board panelist has not been confronted with page-long “publication manuals”, “assessment reports”, “evaluation guidelines”, calling for p-values, citation rates, h-indices, or other statistics in order to motivate judgments about the “quality” of findings, applicants, or institutions? Yet, many of those relying on and calling for statistics do not even seem to understand what information those numbers can actually convey, and what not. Focusing on the uninformed usage of bibliometrics as a worrisome outgrowth of the increasing quantification of science and society, we place the abuse of numbers into larger historical contexts and trends. These are characterized by a technology-driven bureaucratization of science, obsessions with control and accountability, and mistrust in human intuitive judgment. The ongoing digital revolution increases those trends. We call for bringing sanity back into scientific judgment exercises.

I agree. Vaguely along the same lines is our recent paper on the fallacy of decontextualized measurement.

This happens a lot, that the things that people do specifically to make their work feel more scientific, actually pull them away from scientific inquiry.

Another way to put it is that subjective judgment is unavoidable. When Blake McShane and the rest of us were writing our paper on abandoning statistical significance, one potential criticism we had to address was: What’s the alternative? If researchers, journal editors, policymakers, etc., don’t have “statistical significance” to make their decisions, what can they do? Our response was that decision makers already are using their qualitative judgment to make decisions. PNAS, for example, doesn’t publish every submission that is sent to them with “p less than .05”; no, they still reject most of them, on other grounds (perhaps because their claims aren’t dramatic enough). Journals may use statistical significance as a screener, but they still have to make hard decisions based on qualitative judgment. We, and Marewski and Bornmann, are saying that such judgment is necessary, and it can be counterproductive to add a pseudo-objective overlay on top of that.

2018: How did people actually vote? (The real story, not the exit polls.)

Following up on the post that we linked to last week, here’s Yair’s analysis, using Mister P, of how everyone voted.

Like Yair, I think these results are much better than what you’ll see from exit polls, partly because the analysis is more sophisticated (MRP gives you state-by-state estimates in each demographic group), partly because he’s using more data (tons of pre-election polls), and partly because I think his analysis does a better job of correcting for bias (systematic differences between the sample and population).
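The poststratification step that produces those group-level estimates can be sketched in a few lines. Everything below is illustrative: the cell estimates and population shares are made-up numbers standing in for the output of a fitted multilevel model and for Census counts, not anything from Yair’s actual analysis.

```python
# Hypothetical cell-level estimates of P(vote D), standing in for the
# output of a fitted multilevel model
cell_estimates = {
    ("18-29", "college"): 0.68,
    ("18-29", "no college"): 0.55,
    ("65+", "college"): 0.48,
    ("65+", "no college"): 0.41,
}

# Hypothetical population shares for the same cells (in a real analysis,
# from Census counts); they must sum to 1
population_shares = {
    ("18-29", "college"): 0.10,
    ("18-29", "no college"): 0.20,
    ("65+", "college"): 0.30,
    ("65+", "no college"): 0.40,
}

def poststratify(estimates, shares):
    """Population estimate = share-weighted average of cell estimates."""
    return sum(estimates[cell] * shares[cell] for cell in estimates)

overall = poststratify(cell_estimates, population_shares)
```

The same weighted average, restricted to the cells in one state or demographic group, gives the subgroup estimates; that's why MRP can report how (say) young college graduates voted in each state.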

As Yair puts it:

We spent the last week combining all of the information available — pre-election projections, early voting, county and congressional district election results, precinct results where we have them available, and polling data — to come up with our estimates.

In future election years, maybe Yair’s results, or others constructed using similar methods, will become the standard, and we’ll be able to forget exit polls, or relegate them to a more minor part of our discourse.

Anyway, here’s what Yair found:

The breakdown by age. Wow:

Changes since the previous midterm election, who voted and how they voted:

Ethnicity and education:



Yair’s got more at the link.

And here’s our summary of what happened in 2018, that we posted a few days after the election.

Hey, check this out: Columbia’s Data Science Institute is hiring research scientists and postdocs!

Here’s the official announcement:

The Institute’s Postdoctoral and Research Scientists will help anchor Columbia’s presence as a leader in data-science research and applications and serve as resident experts in fostering collaborations with the world-class faculty across all schools at Columbia University. They will also help guide, plan and execute data-science research, applications and technological innovations that address societal challenges and related University-wide initiatives.

Postdoc Fellows

Requirements: PhD degree


Research Scientist (Open Rank)

Requirements: PhD degree + (see position description for more information)

Research Scientist who will conduct independent cutting-edge research in the foundations or application of data science or related fields, or be involved in interdisciplinary research through a collaboration between the Data Science Institute and the various schools at the University.


Research Scientist (Open Rank)

Requirements: PhD degree + (see position description for more information)

Research Scientist who will serve as one of Columbia’s resident experts to foster collaborations with faculty across all the schools at Columbia University.


Candidates for all Research Scientists positions must apply using the links above that direct to the Columbia HR portal for each position, whereas the Postdoc Fellows submit materials via:

I’m part of the Data Science Institute so if you want to work with me or others here at Columbia, you should apply.

The State of the Art

Christie Aschwanden writes:

Not sure you will remember, but last fall at our panel at the World Conference of Science Journalists I talked with you and Kristin Sainani about some unconventional statistical methods being used in sports science. I’d been collecting material for a story, and after the meeting I sent the papers to Kristin. She fell down the rabbit hole too, and ended up writing a rebuttal of them, which is just published ahead of print in a sports science journal.

The authors of the work she is critiquing have written a response on their website (the title, “The Vindication of Magnitude-Based Inference”) though they seem to have taken it down for revisions at the moment.

I’m attaching the paper that the proponents of this method (“magnitude-based inference”) wrote to justify it. Kristin’s paper is at least the third to critique MBI. Will Hopkins, who is the mastermind behind it, has doubled down. His method seems to be gaining traction. A course that teaches researchers how to use it has been endorsed by the British Association of Sport and Exercise Science and by Exercise & Sports Science Australia.

My reply:

The whole thing seems pretty pointless to me. I agree with Sainani that the paper on MBI does not make sense. But I also disagree with all the people involved in this debate in that I don’t think that “type 1 error rate” has any relevance to sports science, or to science more generally. See for example here and here.

I think scientists should be spending more time collecting good data and reporting their raw results for all to see, and less time trying to come up with methods for extracting a spurious certainty out of noisy data. I think this whole type 1, type 2 error thing is a horrible waste of time which is distracting researchers from the much more important problem of getting high quality measurements.

See here for further discussion of this general point.

P.S. These titles are great, no?

Robustness checks are a joke

Someone pointed to this post from a couple years ago by Uri Simonsohn, who correctly wrote:

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks.

Simonsohn followed up with an amusing story:

To demonstrate the problem I [Simonsohn] conducted exploratory analyses on the 2010 wave of the General Social Survey (GSS) until discovering an interesting correlation. If I were writing a paper about it, this is how I may motivate it:

Based on the behavioral priming literature in psychology, which shows that activating one mental construct increases the tendency of people to engage in mentally related behaviors, one may conjecture that activating “oddness” may lead people to act in less traditional ways, e.g., seeking information from non-traditional sources. I used data from the GSS and examined if respondents who were randomly assigned an odd respondent ID (1,3,5…) were more likely to report reading horoscopes.

The first column in the table below shows this implausible hypothesis was supported by the data, p<.01 (STATA code)

People are about 11 percentage points more likely to read the horoscope when they are randomly assigned an odd number by the GSS. Moreover, this estimate barely changes across alternative specifications that include more and more covariates, despite the notable increase in R2.

Simonsohn titled his post, “P-hacked Hypotheses Are Deceivingly Robust,” but really the point here has nothing to do with p-hacking (or, more generally, forking paths).

Part of the problem is that robustness checks are typically done for the purpose of confirming one’s existing beliefs, and that’s typically a bad game to be playing. More generally, the statistical properties of these methods are not well understood. Researchers typically have a deterministic attitude, identifying statistical significance with truth (as for example here).
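Simonsohn’s mechanism is easy to reproduce: a control variable that is (nearly) orthogonal to the suspect predictor cannot move its coefficient much, so the “robustness check” is guaranteed to pass. Below is a minimal simulation sketch; the data, the baked-in 0.11 “effect,” and the little OLS helper are all made up for illustration.

```python
# Why a p-hacked coefficient survives robustness checks: covariates that are
# (nearly) uncorrelated with the suspect predictor barely move its estimate.
# All data here are simulated.
import random

random.seed(1)
n = 1000
x = [i % 2 for i in range(n)]                  # "odd ID" indicator
z = [random.gauss(0, 1) for _ in range(n)]     # a control, independent of x
# Outcome with a chance association with x baked in (the "discovered" effect)
y = [0.11 * xi + 0.5 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

def ols2(y, x, z=None):
    """Coefficient on x from OLS of y on x (and optionally z),
    via centered normal equations."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cx = [xi - mx for xi in x]
    cy = [yi - my for yi in y]
    if z is None:
        return sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)
    mz = sum(z) / n
    cz = [zi - mz for zi in z]
    sxx = sum(a * a for a in cx)
    szz = sum(a * a for a in cz)
    sxz = sum(a * b for a, b in zip(cx, cz))
    sxy = sum(a * b for a, b in zip(cx, cy))
    szy = sum(a * b for a, b in zip(cz, cy))
    det = sxx * szz - sxz * sxz
    return (sxy * szz - sxz * szy) / det       # coefficient on x

b_simple = ols2(y, x)          # "column 1" specification
b_controlled = ols2(y, x, z)   # "robustness check" with a covariate added
# The two estimates differ only trivially, because z is essentially
# orthogonal to x -- exactly Simonsohn's point.
```

Adding more such covariates only repeats the pattern: the spurious estimate sits there, stable across specifications, looking deceivingly robust.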

Chocolate milk! Another stunning discovery from an experiment on 24 people!

Mike Hull writes:

I was reading over this JAMA Brief Report and could not figure out what they were doing with the composite score. Here are the cliff notes:

Study tested milk vs dark chocolate consumption on three eyesight performance parameters:

(1) High-contrast visual acuity
(2) Small-letter contrast sensitivity
(3) Large-letter contrast sensitivity

Only small-letter contrast sensitivity was significant, but then the authors do this:

Visual acuity was expressed as log of the minimum angle of resolution (logMAR) and contrast sensitivity as the log of the inverse of the minimum detectable contrast (logCS). We scored logMAR and logCS as the number of letters read correctly (0.02 logMAR per visual acuity letter and 0.05 logCS per letter).

Because all 3 measures of spatial vision showed improvement after consumption of dark chocolate, we sought to combine these data in a unique and meaningful way that encompassed different contrasts and letter sizes (spatial frequencies). To quantify overall improvement in spatial vision, we computed the sum of logMAR (corrected for sign) and logCS values from each participant to achieve a composite score that spans letter size and contrast. Composite score results were analyzed using Bland-Altman analysis, with P < .05 indicating significance.

There are more details in the short Results section, but the conclusion was that “Twenty-four participants (80%) showed some improvement with dark chocolate vs milk chocolate (Wilcoxon signed-rank test, P < .001)." Any idea what's going on here? Trial pre-registration here.

I replied that I don’t have much to say on this one. They seemed to have scoured through their data so I’m not surprised they found low p-values. Too bad to see this sort of thing appearing in JAMA. I guess chocolate’s such a fun topic, there’s always room for another claim to be made for it.
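For what it’s worth, here is a literal reading of the scoring described in the quoted Methods. The letter-to-log conversions (0.02 logMAR and 0.05 logCS per letter) are from the quote; the chart anchors (`logmar_at_zero`, `logcs_at_zero`) and the example letter counts are guesses, since the Brief Report doesn’t spell them out.

```python
# Literal reading of the quoted Methods (anchors and inputs hypothetical):
# score each test in letters read correctly, convert to logMAR / logCS,
# and sum into a composite. Lower logMAR is better acuity, so its sign is
# flipped ("corrected for sign") before adding the logCS terms, where
# higher is better.

LOGMAR_PER_LETTER = 0.02   # per visual-acuity letter, from the quote
LOGCS_PER_LETTER = 0.05    # per contrast-sensitivity letter, from the quote

def composite_score(acuity_letters, small_cs_letters, large_cs_letters,
                    logmar_at_zero=1.0, logcs_at_zero=0.0):
    """Composite spatial-vision score for one participant (illustrative).

    The baselines are assumptions, not values given in the paper.
    """
    logmar = logmar_at_zero - LOGMAR_PER_LETTER * acuity_letters
    logcs_small = logcs_at_zero + LOGCS_PER_LETTER * small_cs_letters
    logcs_large = logcs_at_zero + LOGCS_PER_LETTER * large_cs_letters
    return -logmar + logcs_small + logcs_large   # sign-corrected sum
```

Note what the composite does: it folds a non-significant acuity measure and a non-significant large-letter measure in with the one significant result, then the whole thing is tested again. That kind of post hoc aggregation is exactly the data-scouring I mean.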

“Law professor Alan Dershowitz’s new book claims that political differences have lately been criminalized in the United States. He has it wrong. Instead, the orderly enforcement of the law has, ludicrously, been framed as political.”

This op-ed by Virginia Heffernan is about politics, but it reminded me of the politics of science.

Heffernan starts with the background:

This last year has been a crash course in startlingly brutal abuses of power. For decades, it seems, a caste of self-styled overmen has felt liberated to commit misdeeds with impunity: ethical, sexual, financial and otherwise.

There’s hardly room to name them all here, though of course icons of power-madness such as Donald Trump and Harvey Weinstein are household names. In plain sight, even more or less regular schmos — including EPA administrator Scott Pruitt, disgraced carnival barker Bill O’Reilly and former New York Atty. Gen. Eric Schneiderman — seem to have fancied themselves exempt from the laws that bind the rest of us.

These guys are not exactly men of Nobel-level accomplishment or royal blood. Like the rest of us, they live in a democracy, under rule of law. Still, they like to preen. . . .

On Friday, a legal document surfaced that suggested Donald Trump and Michael Cohen, his erstwhile personal lawyer, might have known about Schneiderman’s propensity for sexual violence as early as 2013, when Trump tweeted menacingly about Schneiderman’s being a crook . . . Cohen’s office also saw big sums from blue-chip companies not known for “Sopranos”-style nonsense, specifically, Novartis and AT&T. . . .

Also this week, the Observer alleged that Trump confederates hired the same gang of former Israeli intelligence officers to frame and intimidate proponents of the Iran deal that Harvey Weinstein once viciously sicced on his victims.

Then, after this overview of the offenders, she discusses their enablers:

Law professor Alan Dershowitz’s new book claims that political differences have lately been criminalized in the United States. He has it wrong. Instead, the orderly enforcement of the law has, ludicrously, been framed as political.

As with politics, so with science: Most people, I think, are bothered by these offenses, and are even more bothered by the idea that they have been common practice. And some of us are so bothered that we make a fuss about it. But there are others—Alan Dershowitz types—who are more bothered by those who make the offense public, who have loyalty not to government but to the political establishment. Or, in the science context, have loyalty not to science but to the scientific establishment.

In politics, we say that the consent of the governed is essential to good governance, thus there is an argument that, at least in the short term, it’s better to hush up crimes rather than to let them be known. Similarly, in science, there are those who prefer happy talk and denial, perhaps because they feel that the institution of science is under threat. As James Heathers puts it, these people hype bad science and attack its critics because criticism is bad for business.

Who knows? Maybe Dershowitz and the defenders of junk science are right! Maybe a corrupt establishment is better than the uncertainty of the new.

They might be right, but I wish they’d be honest about what they’re doing.

Hey! Here’s what to do when you have two or more surveys on the same population!

This problem comes up a lot: We have multiple surveys of the same population and we want a single inference. The usual approach, applied carefully by news organizations such as Real Clear Politics and FiveThirtyEight, and applied sloppily by various attention-seeking pundits every two or four years, is “poll aggregation”: you take the estimate from each poll separately, if necessary correct these estimates for bias, then combine them with some sort of weighted average.

But this procedure is inefficient and can lead to overconfidence (see discussion here, or just remember the 2016 election).

A better approach is to pool all the data from all the surveys together. A survey response is a survey response! Then when you fit your model, include indicators for the individual surveys (varying intercepts, maybe varying slopes too), and include that uncertainty in your inferences. Best of both worlds: you get the efficiency from counting each survey response equally, and you get an appropriate accounting of uncertainty from the multiple surveys.

OK, you can’t always do this: To do it, you need all the raw data from the surveys. But it’s what you should be doing, and if you can’t, you should recognize what you’re missing.
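Here’s a minimal sketch of the idea with stand-in numbers. A real analysis would fit a multilevel (varying-intercept) regression to the pooled raw responses; this toy version just partially pools three survey means with a method-of-moments estimate of the between-survey variance, which captures the key point: the survey-to-survey variation propagates into the final estimate and its standard error instead of being averaged away.

```python
# Pooling raw responses from several surveys of the same population,
# with between-survey variation carried into the final uncertainty.
# All responses here are hypothetical (1 = supports candidate, 0 = not).
import math

surveys = [
    [1] * 520 + [0] * 480,   # survey A, n = 1000
    [1] * 265 + [0] * 235,   # survey B, n = 500
    [1] * 180 + [0] * 120,   # survey C, n = 300
]

means = [sum(s) / len(s) for s in surveys]
ses = [math.sqrt(m * (1 - m) / len(s)) for m, s in zip(means, surveys)]

# Between-survey variance tau^2 by method of moments (floored at zero);
# a full multilevel model would estimate this jointly with everything else
grand = sum(means) / len(means)
tau2 = max(0.0, sum((m - grand) ** 2 for m in means) / (len(means) - 1)
           - sum(se ** 2 for se in ses) / len(ses))

# Each survey weighted by 1/(se^2 + tau^2): the extra uncertainty from
# between-survey differences is built into both the estimate and its SE
weights = [1 / (se ** 2 + tau2) for se in ses]
pooled = sum(w * m for w, m in zip(weights, means)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
```

If tau² comes out large, no amount of extra polling drives the standard error to zero, which is the overconfidence that naive poll averaging misses.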

2018: Who actually voted? (The real story, not the exit polls.)

Continuing from our earlier discussion . . . Yair posted some results from his MRP analysis of voter turnout:

1. The 2018 electorate was younger than in 2014, though not as young as exit polls suggest.

2. The 2018 electorate was also more diverse, with African American and Latinx communities surpassing their share of votes cast in 2014.

3. Voters in 2018 were more educated than all the years in our dataset going back to 2006. . . . the exit poll shows the opposite trend. As noted earlier, they substantially changed their weighting scheme on education levels, so these groups can’t be reliably compared across years in the exit poll.

Details here.

Matching (and discarding non-matches) to deal with lack of complete overlap, then regression to adjust for imbalance between treatment and control groups

John Spivack writes:

I am contacting you on behalf of the biostatistics journal club at our institution, the Mount Sinai School of Medicine. We are working Ph.D. biostatisticians and would like the opinion of a true expert on several questions having to do with observational studies—questions that we have not found to be well addressed in the literature.

And here are their questions:

(1) Among the popular implementations of propensity-score-based methods for analyzing observational data (matching, stratification based on quintiles, and weighting by inverse probability of assigned treatment, say), is there a clear preference? Does the answer depend on the type of data?
I personally like stratification by quintiles of the propensity score (followed by an analysis pooled over the quintile groups) because it is simple and robust, with no complicated matching algorithms and no difficult choices among different types of weights. Is this method acceptable for high-quality publications?

Also, given that the main threats to the validity of observational studies are elsewhere (unmeasured confounders, treatment heterogeneity, etc.), is the choice of which implementation of propensity score to use really as important as the volume of the literature devoted to the subject would suggest?

(2) Let’s say we’re estimating a treatment effect using an estimated-propensity-score-based reweighting of the sample (weighting by the inverse of the estimated probability of the assigned treatment, say). The authoritative references (e.g., Lunceford and Davidian) seem to say that one must take account of the fact that the weights are themselves estimated to produce a valid standard error for any final estimate of treatment effect. Complex formulas are sometimes provided for the standard errors of particular estimators, expressed, for instance, as a sandwich variance.
In practice, however, whether for weighted analyses or matched ones, we seldom make this kind of adjustment to the standard error and just proceed to do a ‘standard’ analysis of the weighted sample.

If you conceptualize the ‘experiment’ being performed as follows:

(a) Draw a sample from the (observational) population, including information on subjects’ covariates and treatment assignments
(b) Re-weight the sample by means of an estimated propensity score (estimated only using that sample).
(c) Observe the outcomes and perform the weighted analysis (for instance, using inverse-probability-of-assigned-treatment weights) to calculate an estimate of treatment effect.

Then, yes, over a large number of iterations of this experiment the sampling distribution of the estimate of treatment effect will be affected by the variation over multiple iterations of the covariate balance between treatment groups (and the resulting variation of the weights) and this will contribute to the variance of the sampling distribution of the estimator.

However, there is another point of view. If a clinical colleague showed up for a consultation with a ‘messy’ dataset from an imperfectly designed or imperfectly randomized study, we would often accept the dataset as-is, using remedies for covariate imbalances, outliers, etc. as needed, in hopes of producing a valid result. In effect, we would be conditioning on the given treatment assignments rather than attempting an unconditional analysis over the (unknowable) distribution of multiple bad study designs. It seems to me that there is nothing very wrong with this method.

Applied to a reweighted sample (or a matched sample), why would such a conditional analysis be invalid, provided we apply the same standards of model checking, covariate balance, etc. that we use in other possibly messy observational datasets? In fact, wouldn’t conditioning on all available information, including treatment assignments and sample covariate distributions, lead to greater efficiency, and be closer in spirit to established statistical principles (like the extended form of the ancillarity principle)? To a non-expert, wouldn’t this seem like a strong enough argument in favor of our usual way of doing things?

My reply:

(1) The concern in causal inference is mismatch between the treatment and control groups. I have found it helpful to distinguish between two forms of mismatch:
– lack of complete overlap on observed pre-treatment predictors
– imbalance on observed pre-treatment predictors

From my book with Jennifer, here are a couple pictures to distinguish the two concepts.

Lack of complete overlap:


The point of matching, as I see it, is to restrict the range of inference to the zone of overlap (that is, to define the average treatment effect in a region where the data can directly answer such a question). You match so as to get a subset of the data with complete overlap (in some sense) and then discard the points that did not match: the treated units for which there was no similar control unit, and the control units for which there was no treated unit.

Stratification and weighting are ways of handling imbalance. More generally, we can think of stratification and weighting as special cases of regression modeling, with the goal being to adjust for known differences between sample and population. But you can really only adjust for these differences where there is overlap. Outside the zone of overlap, your inferences for treatment effects are entirely assumption-based.

To put it another way, matching (and throwing away the non-matches) is about identification, or robustness. Stratification/weighting/regression are for bias correction.

Propensity scores are a low-dimensional approximation, a particular way of regularizing your regression adjustment. Use propensity scores if you want.

(2) If someone were to give me the propensity score, I’d consider using it as a regression predictor—not alone, but with other key predictors such as age, sex, smoking history, etc. I would not do inverse-propensity-score weighting, as this just seems like a way to add a lot of noise and get some mysterious estimate that I wouldn’t trust anyway.

You write, “we would be conditioning on the given treatment assignments.” You always have to be conditioning on the given treatment assignments: that’s what you have! That said, you’ll want to know the distribution of what the treatment assignments could’ve been, had the experiment gone differently. That’s relevant to your interpretation of the data. If you do everything with a regression model, your implicit assumption is that treatment assignment depends only on the predictors in your model. That’s a well known principle in statistics and is discussed in various places including chapter 8 of BDA3 (chapter 7 in the earlier editions).

(3) One more thing. If you have lack of complete overlap and you do matching, you should also do regression. Matching to restrict your data to a region of overlap, followed by regression to adjust for imbalance. Don’t try to make matching do it all. This is an old, old point, discussed back in 1970 in Donald Rubin’s Ph.D. thesis.
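A toy version of that two-step recipe, with simulated data: nearest-neighbor matching on a scalar score (with a caliper) restricts the sample to the overlap region and discards the non-matches, and a regression on the score difference within matched pairs then adjusts for the residual imbalance. The data-generating numbers, the caliper, and the true effect of 2.0 are all arbitrary choices for illustration.

```python
# Match to the region of overlap, discard non-matches, then regress.
import random

random.seed(7)

def nearest_neighbor_match(treated, controls, caliper):
    """Pair each treated unit (score, y) with its closest unused control;
    drop treated units with no control inside the caliper."""
    unused = sorted(controls)                     # sort by score
    pairs = []
    for t in sorted(treated):
        best = min(unused, key=lambda c: abs(c[0] - t[0]), default=None)
        if best is not None and abs(best[0] - t[0]) <= caliper:
            pairs.append((t, best))
            unused.remove(best)                   # matching without replacement
    return pairs

# Simulated units: (score, outcome); outcome depends on the score, plus a
# treatment effect of 2.0 for treated units. Note the treated scores run
# higher than the control scores, so overlap is incomplete.
treated = [(s, 3 * s + 2.0 + random.gauss(0, 0.5))
           for s in (random.uniform(0.3, 0.9) for _ in range(50))]
controls = [(s, 3 * s + random.gauss(0, 0.5))
            for s in (random.uniform(0.1, 0.7) for _ in range(100))]

pairs = nearest_neighbor_match(treated, controls, caliper=0.05)

# Within the matched sample, regress the treated-minus-control outcome
# difference on the score difference; the intercept is the adjusted effect.
dx = [t[0] - c[0] for t, c in pairs]
dy = [t[1] - c[1] for t, c in pairs]
n = len(pairs)
mx, my = sum(dx) / n, sum(dy) / n
sxx = sum(a * a for a in dx) - n * mx * mx
sxy = sum(a * b for a, b in zip(dx, dy)) - n * mx * my
slope = sxy / sxx if sxx > 1e-12 else 0.0
effect = my - slope * mx       # regression-adjusted estimate
```

Treated units with scores above the control range fail the caliper and are dropped, so the estimate applies only to the overlap region; the regression step then mops up the small within-pair score differences that matching leaves behind.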

2018: What really happened?

We’re always discussing election results on three levels: their direct political consequences, their implications for future politics, and what we can infer about public opinion.

In 2018 the Democrats broadened their geographic base, as we can see in this graph from Yair Ghitza:

Party balancing

At the national level, what happened is what we expected to happen two weeks ago, two months ago, and two years ago: the Democrats bounced back. Their average district vote for the House of Representatives increased by enough to give them clear control of the chamber, even in the face of difficulties of geography and partisan districting.

This was party balancing, which we talked about a few months ago: At the time of the election, the Republicans controlled the executive branch, both houses of congress, and the judiciary, so it made sense that swing voters were going to swing toward the Democrats. Ironically, one reason the Democrats did not regain the Senate in 2018 is . . . party balancing in 2016! Most people thought Hillary Clinton would win the presidency, so lots of people voted Republican for congress to balance that.

The swing in votes toward the Democrats was large (in the context of political polarization). As Nate Cohn wrote, the change in seats was impressive, given that there weren’t very many swing districts for the Democrats to aim for.

Meanwhile, as expected, the Senate remained in Republican control. Some close races went 51-49 rather than 49-51, which doesn’t tell us much about public opinion but is politically consequential.

Where did it happen?

The next question is geographic. Nationally, voters swung toward the Democrats. I was curious where this happened, so I did some googling and found this map by Josh Holder, Cath Levett, Daniel Levitt, and Peter Andringa:

This map omits districts that were uncontested in one election or the other so I suspect it understates the swing, but it gives the general idea.

Here’s another way to look at the swings.

Yair made a graph plotting the vote swing from 2016 to 2018, for each election contested in both years, plotting vs. the Democratic share of the two-party vote in 2016.

The result was pretty stunning—so much that I put the graph at the top of this post. So please scroll up and take a look again, then scroll back down here to keep reading.

Here’s the key takeaway. The Democrats’ biggest gains were in districts where the Republicans were dominant.

In fact, if you look at the graph carefully (and you also remember that we’re excluding uncontested elections, so we’re missing part of the story), you see the following:
– In strong Republican districts (D’s receiving less than 40% of the vote in 2016), Democrats gained almost everywhere, with an average gain of, ummm, it looks something like 8 percentage points.
– In swing districts (D’s receiving 40-60% of the vote in 2016), D’s improved, but only by about 4 percentage points on average. A 4% swing in the vote is a lot, actually! It’s just not 8%.
– In districts where D’s were already dominating, the results were, on average, similar to what happened in 2016.

I don’t know how much this was a national strategy and how much it just happened, but let me point out two things:

1. For the goal of winning the election, it would have been to the Democrats’ advantage to concentrate their gains in the zone where they’d received between 40 and 55% of the vote in the previous election. Conversely, these are the places where the Republicans would’ve wanted to focus their efforts too.

2. Speaking more generally, the Democrats have had a problem, both at the congressional and presidential levels, of “wasted votes”: winning certain districts with huge majorities and losing a lot of the closer districts. Thus, part of Democratic strategy has been to broaden their geographic base. The above scatterplot suggests that the 2018 election was a step in the right direction for them in this regard.

Not just a statistical artifact

When Yair sent me that plot, I had a statistical question: Could it be “regression to the mean”? We might expect, absent any election-specific information, that the D’s would improve in districts where they’d done poorly, and they’d decline in districts where they’d done well. So maybe I’ve just been overinterpreting a pattern that tells us nothing interesting at all?

To address this possible problem, Yair made two more graphs, repeating the above scatterplot, but showing the 2014-to-2016 shift vs. the 2014 results, and the 2012-to-2014 shift vs. the 2012 results. Here’s what he found:

So the answer is: No, it’s not regression to the mean, it’s not a statistical artifact. The shift from 2016 to 2018—the Democrats gaining strength in Republican strongholds—is real. And it can have implications for statewide and presidential elections as well. This is also consistent with results we saw in various special elections during the past two years.

The current narrative is wrong

As Yair puts it:

Current narrative: Dems did better in suburban/urban, Reps did better in rural, continuing trend from 2012-2016. I [Yair] am seeing the opposite:

This isn’t increasing polarization/sorting. This also isn’t mean-reversion. D areas stayed D, R areas jumped ship to a large degree. A lot of these are the rural areas that went the other way from 2012-2016.

Not just in CD races, also in Gov and Senate races . . . Massive, 20+ point shift in margin in Trump counties. Remember this is with really high turnout. . . . This offsets the huge shift from 2012-2016, often in the famous “Obama-Trump” counties.

Yair adds:

The reason people are missing this story right now: focusing on who won/lost means they’re looking at places that went from just below 50 to just above 50. Obviously that’s more important for who actually governs. But this is the big public opinion shift.

I suspect that part of this could be strategy from both parties.

– On one side, the Democrats knew they had a problem in big swathes of the country and they made a special effort to run strong campaigns everywhere: part of this was good sense given their good showing in special elections, and part of it was an investment in their future, to lay out a strong Democratic brand in areas where they’ll need to be competitive in future statewide and presidential elections.

– On the other side, the Republicans had their backs to the wall and so they focused their effort on the seats they needed to hold if they had a chance of maintaining their House majority.

From that standpoint, the swings above do not completely represent natural public opinion swings from national campaigns. But they are meaningful: they’re real votes, and they’re in places where the Democrats need to gain votes in the future.

There are also some policy implications: If Democratic challengers are more competitive in previously solid Republican districts, enough so that the Republican occupants of these seats are more afraid of losing centrist votes in future general elections than losing votes on the right in future primaries, this could motivate these Republicans to vote more moderately in congress. I don’t know, but it seems possible.

I sent the above to Yair, and he added some further comments, which you can find in the full post.

“Recapping the recent plagiarism scandal”

Benjamin Carlisle writes:

A year ago, I received a message from Anna Powell-Smith about a research paper written by two doctors from Cambridge University that was a mirror image of a post I wrote on my personal blog roughly two years prior. The structure of the document was the same, as was the rationale, the methods, and the conclusions drawn. There were entire sentences that were identical to my post. Some wording changes were introduced, but the words were unmistakably mine. The authors had also changed some of the details of the methods, and in doing so introduced technical errors, which confounded proper replication. The paper had been press-released by the journal, and even noted by Retraction Watch. . . .

At first, I was amused by the absurdity of the situation. The blog post was, ironically, a method for preventing certain kinds of scientific fraud. [Carlisle’s original post was called “Proof of prespecified endpoints in medical research with the bitcoin blockchain,” the paper that did the copying was called “How blockchain-timestamped protocols could improve the trustworthiness of medical science,” so it’s amusing that copying had been done on a paper on trustworthiness. — ed.] . . .

The journal did not catch the similarities between this paper and my blog in the first place, and the peer review of the paper was flawed as well.

OK, so far, nothing so surprising. Peer review often gets it wrong, and, in any case, you’re allowed to keep submitting your paper to new journals until it gets accepted somewhere. Indeed, F1000 Research is not a top journal, so maybe the paper was rejected a few times before appearing there. Or maybe not; I have no idea.

But then the real bad stuff started to happen. Here’s Carlisle:

After the journal’s examination of the case, they informed us that updating the paper to cite me after the fact would undo any harm done by failing to credit the source of the paper’s idea. A new version was hastily published that cited me, using a non-standard citation format that omitted the name of my blog, the title of my post, and the date of original publication.

Wow. That’s the kind of crap—making a feeble correction without giving any credit—that gets done by Reuters and Perspectives on Psychological Science. I hate to see the editor of a real journal act that way.

Carlisle continues:

I was shocked by the journal’s response. Authorship of a paper confers authority in a subject matter, and their cavalier attitude toward this, especially given the validity issues I had raised with them, seemed irresponsible to me. In the meantime, the paper was cited favourably by the Economist and in the BMJ, crediting Irving and Holden [the authors of the paper that copied Carlisle’s work].

I went to Retraction Watch with this story, which brought to light even more problems with this example of open peer review. The peer reviewers were interviewed, and rather than re-evaluating their support for the paper, they doubled down, choosing instead to disparage my professional work and call me a liar. . . .

The journal refused to retract the paper. It was excellent press for the journal and for the paper’s putative authors, and it would have been embarrassing for them to retract it. The journal had rolled out the red carpet for this paper after all, and it was quickly accruing citations.

That post appeared in June 2017. But then I clicked on the link to the published article and found this:

So the paper did end up getting retracted—but, oddly enough, not for the plagiarism.

On the plus side, the open peer review is helpful. Much better than Perspectives on Psychological Science. Peer review is not perfect. But saying you do peer review, and then not doing it, that’s really bad.

The Carlisle story is old news, and I know that some people feel that talking about this sort of thing is a waste of time compared to doing real science. And, sure, I guess it is. But here’s the thing: fake science competes with real science. NPR, the Economist, Gladwell, Freakonomics, etc.: they’ll report on fake science instead of reporting on real science. After all, fake science is more exciting! When you’re not constrained by silly things such as data, replication, coherence with the literature, you can really make fast progress! The above story is interesting in that it appears to feature an alignment of low-quality research and unethical research practices. These two things don’t have to go together but often it seems that they do.

Melanie Mitchell says, “As someone who has worked in A.I. for decades, I’ve witnessed the failure of similar predictions of imminent human-level A.I., and I’m certain these latest forecasts will fall short as well.”

Melanie Mitchell‘s piece, Artificial Intelligence Hits the Barrier of Meaning (NY Times, behind a limited paywall), is spot-on regarding the hype surrounding the current A.I. boom. It’s soon to come out in book length from FSG, so I suspect I’ll hear about it again in the New Yorker.

Like Professor Mitchell, I started my Ph.D. at the tail end of the first A.I. revolution. Remember, the one based on rule-based expert systems? I went to Edinburgh to study linguistics and natural language processing because it was strong in A.I., computer science theory, linguistics, and cognitive science.

On which natural language tasks can computers outperform or match humans? Search is good, because computers are fast and it’s a task at which humans aren’t so hot. That includes things like speech-based call routing in heterogeneous call centers (something I worked on at Bell Labs).

Then there’s spell checking. That’s fantastic. It leverages simple statistics about word frequency and typos/brainos and is way better than most humans at spelling. The same algorithms are used for speech recognition and for RNA-seq alignment to the genome. These all sprang out of Claude Shannon’s 1948 paper, “A Mathematical Theory of Communication”, which has over 100K citations. It introduced, among other things, n-gram language models at the character and word level (still used for speech recognition and classification today, with different estimators). As far as I know, that paper contained the first posterior predictive checks—generating examples from the trained language models and comparing them to real language. David MacKay’s info theory book (the only ML book I actually like) is a great introduction to this material, and even BDA3 added a spell-checking example. But it’s hardly A.I. in the big “I” sense of “A.I.”.
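Shannon’s generate-and-compare exercise is easy to reproduce. Here’s a minimal character n-gram sketch in Python—a toy illustration of the idea, not any particular production system:

```python
import random
from collections import Counter, defaultdict

def train_ngram(text, n=3):
    """Count n-gram continuations: a context of n-1 characters -> next-character counts."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        counts[text[i:i + n - 1]][text[i + n - 1]] += 1
    return counts

def generate(counts, n, seed, length=60, rng=None):
    """Sample text from the fitted model, Shannon-style."""
    rng = rng or random.Random(0)
    out = seed
    while len(out) < length:
        options = counts.get(out[-(n - 1):])
        if not options:  # unseen context: stop early
            break
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

# Fit on any corpus, then eyeball the generated text against the real thing:
model = train_ngram("the cat sat on the mat and the cat sat still", n=3)
print(generate(model, n=3, seed="th"))
```

Comparing such generated text to real language is exactly the posterior predictive check described above; as n grows, the samples look eerily more language-like.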

Speech recognition has made tremendous strides (I worked on it at Bell Labs in the late 90s, then at SpeechWorks in the early 00s), but its performance is still so far short of human levels as to make the difference more qualitative than quantitative, a point Mitchell makes in her essay. It would no more fool you into thinking it was human than an animatronic Disney character bolted to the floor. Unlike with games like chess or Go, it’s going to be hard to do better than people at language, though it should be possible. But it would be hard to do it the way they built, say, Deep Blue, the IBM chess-playing hardware that evaluated so many gazillions of board positions per turn, with very clever heuristics to prune the search. That didn’t play chess like a human. If better-than-human language were produced that way, humans wouldn’t understand it. IBM Watson (the natural-language Jeopardy-playing computer) was closer to behaving like a human, with its chains of associative reasoning—to me, that’s the closest we’ve gotten to something I’d call “A.I.”. It’s a shame IBM has oversold it since then.

Human-level general-purpose A.I. is going to be an incredibly tough nut to crack. I don’t see any reason to think it’s an insurmountable goal, but it’s not going to happen in a decade without a major breakthrough. Better classifiers just aren’t enough. People are very clever: insanely good at subtle chains of associative reasoning (though not so great at logic) and at learning from limited examples (Andrew’s sister, Susan Gelman, a professor at Michigan, studies concept learning by example). We’re also very contextually aware and focused, which allows us to go deep, but can cause us to miss the forest for the trees.

Watch out for naive (because implicitly flat-prior) Bayesian statements based on classical confidence intervals! (Comptroller of the Currency edition)

Laurent Belsie writes:

An economist formerly with the Consumer Financial Protection Bureau wrote a paper on whether a move away from forced arbitration would cost credit card companies money. He found that the results are statistically insignificant at the 95 percent (and 90 percent) confidence level.

But the Office of the Comptroller of the Currency used his figures to argue that although statistically insignificant at the 90 percent level, “an 88 percent chance of an increase of some amount and, for example, a 56 percent chance that the increase is at least 3 percentage points, is economically significant because the average consumer faces the risk of a substantial rise in the cost of their credit cards.”

The economist tells me it’s a statistical no-no to draw those inferences and he references your paper with John Carlin, “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors,” Perspectives on Psychological Science, 2014.

My question is: Is that statistical mistake a really subtle one and a common faux pas among economists – or should the OCC know better? If a lobbying group or a congressman made a rookie statistical mistake, I wouldn’t be surprised. But a federal agency?

The two papers in question are here and here.

My reply:

I don’t think it’s appropriate to talk about that 88 percent etc. for reasons discussed in pages 71-72 of this paper.

I can’t comment on the economics or the data, but if it’s just a question of summarizing the regression coefficient, you can’t say much more than to give the 95% confidence interval (estimate +/- 2 standard errors) and go from there.

Regarding your question: Yes, this sort of mistake is subtle and can be made by many people, including statisticians and economists. It’s not a surprise to see even a trained professional making this sort of error.

The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong.
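To see where numbers like “88 percent” come from: reinterpret the classical estimate and standard error as a flat-prior posterior, N(estimate, se²), and read off tail probabilities. A sketch—the estimate and standard error here are hypothetical, picked only to roughly reproduce the quoted figures:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def flat_prior_prob_exceeds(estimate, se, threshold=0.0):
    """P(true effect > threshold) if the classical (estimate, se) is
    reinterpreted as a flat-prior posterior N(estimate, se**2)."""
    return 1 - norm_cdf((threshold - estimate) / se)

# Hypothetical estimate and standard error, chosen only so the outputs
# roughly match the probabilities quoted by the OCC:
est, se = 3.44, 2.93
print(round(flat_prior_prob_exceeds(est, se, 0.0), 2))  # "chance of an increase of some amount"
print(round(flat_prior_prob_exceeds(est, se, 3.0), 2))  # "chance the increase is at least 3 points"
```

The arithmetic is trivial; the problem is the hidden assumption. Those tail probabilities are only as good as the implicit flat prior, which is exactly what makes them too strong.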

“35. What differentiates solitary confinement, county jail and house arrest” and 70 others

Thomas Perneger points us to this amusing quiz on statistics terminology:

Lots more where that came from.

Postdocs and Research fellows for combining probabilistic programming, simulators and interactive AI

Here’s a great opportunity for those interested in probabilistic programming and workflows for Bayesian data analysis:

We (including me, Aki) are looking for outstanding postdoctoral researchers and research fellows to work on an exciting new project at the crossroads of probabilistic programming, simulator-based inference, and user interfaces. You will have the opportunity to work with top research groups in the Finnish Center for Artificial Intelligence, at both Aalto University and the University of Helsinki, and to cooperate with several industry partners.

The topics for which we are recruiting are

  • Machine learning for simulator-based inference
  • Intelligent user interfaces and techniques for interacting with AI
  • Interactive workflow support for probabilistic programming based modeling

Find the full descriptions here

“Statistical and Machine Learning forecasting methods: Concerns and ways forward”

Roy Mendelssohn points us to this paper by Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos, which begins:

Machine Learning (ML) methods have been proposed in the academic literature as alternatives to statistical ones for time series forecasting. Yet, scant evidence is available about their relative performance in terms of accuracy and computational requirements. The purpose of this paper is to evaluate such performance across multiple forecasting horizons using a large subset of 1045 monthly time series used in the M3 Competition. After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined.

and continues:

Moreover, we observed that their computational requirements are considerably greater than those of statistical methods.

Mendelssohn writes:

For time series, ML models don’t work as well as traditional methods, at least to date. I have read a little on some of the methods. Some have layers of NNs. The residuals from one layer are passed to the next. I would hate to guess what the “equivalent number of parameters” would be (yes I know these are non-parametric but there has to be a lot of over-fitting going on).

I haven’t looked much at these particular models, but for the general problem of “equivalent number of parameters,” let me point you to this paper and this paper with Aki et al.
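For concreteness, one standard formalization of the “equivalent number of parameters” from that line of work is the WAIC variance term: the sum, over data points, of the posterior variance of the log predictive density. A minimal generic sketch, not tied to any particular model:

```python
import statistics

def p_waic(log_lik):
    """WAIC-style effective number of parameters: the sum over data points
    of the (sample) variance, across posterior draws, of the log likelihood.
    log_lik[s][i] = log p(y_i | theta_s) for posterior draw s, observation i."""
    n_points = len(log_lik[0])
    return sum(
        statistics.variance(draw[i] for draw in log_lik)
        for i in range(n_points)
    )

# If all posterior draws imply identical predictions, the model has
# effectively zero fitted parameters by this measure:
print(p_waic([[-1.0, -2.0], [-1.0, -2.0]]))  # 0.0
```

The more a model’s predictions swing with the posterior draws (as one would expect from stacked, over-fit networks), the larger this sum gets.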

Why it can be rational to vote

Just a reminder.

The purported CSI effect and the retroactive precision fallacy

Regarding our recent post on the syllogism that ate science, someone points us to this article, “The CSI Effect: Popular Fiction About Forensic Science Affects Public Expectations About Real Forensic Science,” by N. J. Schweitzer and Michael J. Saks.

We’ll get to the CSI Effect in a bit, but first I want to share the passage from the article that my correspondent pointed out. It’s this bit from footnote 16:

Preliminary analyses (power analyses) suggested a sample size of 48 would be sufficient to detect the CSI effect, if it existed. In addition, the subsequent significance tests adjust for sample size by holding smaller samples to a higher standard when determining statistical significance. In other words, finding that a difference is statistically significant is the same as saying that the sample size was of sufficient size to test for the effect.

Emphasis added. This is a great quote because it expresses this error so clearly. What to call it? The “retroactive precision fallacy”?

For a skeptical take on the CSI effect, see this article by Jason Chin and Larysa Workewych, which begins:

The CSI effect posits that exposure to television programs that portray forensic science (e.g., CSI: Crime Scene Investigation) can change the way jurors evaluate forensic evidence. We review (1) the theory behind the CSI effect; (2) the perception of the effect among legal actors; (3) the academic treatment of the effect; and, (4) how courts have dealt with the effect. We demonstrate that while legal actors do see the CSI effect as a serious issue, there is virtually no empirical evidence suggesting it is a real phenomenon. Moreover, many of the remedies employed by courts may do no more than introduce bias into juror decision-making or even trigger the CSI effect when it would not normally occur.

My correspondent writes:

Some people were worried that the sophisticated version of CSI that is portrayed on TV sets up an unrealistic image and so jurors (who watch the show) will be more critical of the actual evidence, which is much lower tech. There have been a handful of studies trying to demonstrate this and two did (including the one at issue).

I was pretty shocked at the poor level of rigour across the board – I think that’s what happens when legal scholars (the other study to show the effect was done by a judge) try to do empirical/behavioural work.

The truly sad thing is that many courts give “anti-CSI Effect” instructions based on these two studies that seem to show the effect. Those instructions do seem to be damaging to me – the judge tells the jury that the prosecution need not bring forensic evidence at all. The number of appeals and court time spent on this shoddy line of research is also a bit problematic.

So, two issues here. First, is the CSI effect “real” (that is, is this a large and persistent effect)? Second, the article on the CSI effect demonstrates a statistical fallacy: the view that, once a statistically significant result has been found, this retroactively removes all concerns about inferential uncertainty due to variation in the data.

Cornell prof (but not the pizzagate guy!) has one quick trick for getting 1700 peer-reviewed publications on your CV

From the university webpage:

Robert J. Sternberg is Professor of Human Development in the College of Human Ecology at Cornell University. . . . Sternberg is the author of over 1700 refereed publications. . . .

How did he compile over 1700 refereed publications? Nick Brown tells the story:

I [Brown] was recently contacted by Brendan O’Connor, a graduate student at the University of Leicester, who had noticed that some of the text in Dr. Sternberg’s many articles and chapters appeared to be almost identical. . . .

Exhibit 1 . . . this 2010 article by Dr. Sternberg was basically a mashup of this article of his from the same year and this book chapter of his from 2002. One of the very few meaningful differences in the chunks that were recycled between the two 2010 articles is that the term “school psychology” is used in the mashup article to replace “cognitive education” from the other; this may perhaps not be unrelated to the fact that the former was published in School Psychology International (SPI) and the latter in the Journal of Cognitive Education and Psychology (JCEP). If you want to see just how much of the SPI article was recycled from the other two sources, have a look at this. Yellow highlighted text is copied verbatim from the 2002 chapter, green from the JCEP article. You can see that about 95% of the text is in one or the other colour . . .

Brown remarks:

Curiously, despite Dr. Sternberg’s considerable appetite for self-citation (there are 26 citations of his own chapters or articles, plus 1 of a chapter in a book that he edited, in the JCEP article; 25 plus 5 in the SPI article), neither of the 2010 articles cites the other, even as “in press” or “manuscript under review”; nor does either of them cite the 2002 book chapter. If previously published work is so good that you want to copy big chunks from it, why would you not also cite it?

Hmmmmm . . . I have an idea! Sternberg wants to increase his citation count. So he cites himself all the time. But he doesn’t want people to know that he publishes essentially the same paper over and over again. So in those cases, he doesn’t cite himself. Cute, huh?

Brown continues:

Exhibit 2

Inspired by Brendan’s discovery, I [Brown] decided to see if I could find any more examples. I downloaded Dr. Sternberg’s CV and selected a couple of articles at random, then spent a few minutes googling some sentences that looked like the kind of generic observations that an author in search of making “efficient” use of his time might want to re-use. On about the third attempt, after less than ten minutes of looking, I found a pair of articles, from 2003 and 2004, by Dr. Sternberg and Dr. Elena Grigorenko, with considerable overlaps in their text. About 60% of the text in the later article (which is about the general school student population) has been recycled from the earlier one (which is about gifted children) . . .

Neither of these articles cites the other, even as “in press” or “manuscript in preparation”.

And there’s more:

Exhibit 3

I [Brown] wondered whether some of the text that was shared between the above pair of articles might have been used in other publications as well. It didn’t take long(*) to find Dr. Sternberg’s contribution (chapter 6) to this 2012 book, in which the vast majority of the text (around 85%, I estimate) has been assembled almost entirely from previous publications: chapter 11 of this 1990 book by Dr. Sternberg (blue), this 1998 chapter by Dr. Janet Davidson and Dr. Sternberg (green), the above-mentioned 2003 article by Dr. Sternberg and Dr. Grigorenko (yellow), and chapter 10 of this 2010 book by Dr. Sternberg, Dr. Linda Jarvin, and Dr. Grigorenko (pink). . . .

Once again, despite the fact that this chapter cites 59 of Dr. Sternberg’s own publications and another 10 chapters by other people in books that he (co-)edited, none of those citations are to the four works that were the source of all the highlighted text in the above illustration.

Now, sometimes one finds book chapters that are based on previous work. In such cases, it is the usual practice to include a note to that effect. And indeed, two chapters (numbered 26 and 27) in that 2012 book edited by Dr. Dawn Flanagan and Dr. Patti Harrison contain an acknowledgement along the lines of “This chapter is adapted from [source]. Copyright 20xx by [copyright holder]. Adapted by permission”. But there is no such disclosure in chapter 6.

Exhibit 4

It appears that Dr. Sternberg has assembled a chapter almost entirely from previous work on more than one occasion. Here’s a recent example of a chapter made principally from his earlier publications. . . .

This chapter cites 50 of Dr. Sternberg’s own publications and another 7 chapters by others in books that he (co-)edited. . . .

However, none of the citations of that book indicate that any of the text taken from it is being correctly quoted, with quote marks (or appropriate indentation) and a page number. The four other books from which the highlighted text was taken were not cited. No disclosure that this chapter has been adapted from previously published material appears in the chapter, or anywhere else in the 2017 book . . .

In the context of a long and thoughtful discussion, James Heathers supplies the rules from the American Psychological Association code of ethics:

And here’s Cornell’s policy:

OK, that’s the policy for Cornell students. Apparently not the policy for faculty.

One more thing

Bobbie Spellman, former editor of the journal Perspectives on Psychological Science, is confident “beyond a reasonable doubt” that Sternberg was not telling the truth when he said that “all papers in Perspectives go out for peer review, including his own introductions and discussions.” Unless, as Spellman puts it, “you believe that ‘peer review’ means asking some folks to read it and then deciding whether or not to take their advice before you approve publication of it.”

So, there you have it. The man is obsessed with citing his own work—except on the occasions when he does a cut-and-paste job, in which case he is suddenly shy about mentioning his other publications. And, as editor, he reportedly says he sends out everything for peer review, but then doesn’t.

P.S. From his (very long) C.V.:

Sternberg, R. J. (2015). Epilogue: Why is ethical behavior challenging? A model of ethical reasoning. In R. J. Sternberg & S. T. Fiske (Eds.), Ethical challenges in the behavioral and brain sciences: Case studies and commentaries (pp. 218-226). New York: Cambridge University Press.

This guy should join up with Bruno Frey and Brad Bushman: the 3 of them would form a very productive Department of Cut and Paste. Department chair? Ed Wegman, of course.