Kaiser’s beef


The Numbersense guy writes in:

Have you seen this?

It has one of your pet peeves… let’s draw some data-driven line in the categorical variable and show significance.

To make it worse, he adds a final paragraph saying essentially this is just a silly exercise that I hastily put together and don’t take it seriously!

Kaiser was pointing me to a news article by economist Justin Wolfers, entitled “Fewer Women Run Big Companies Than Men Named John.”

Here’s what I wrote back to Kaiser:

I took a look and it doesn’t seem so bad. Basically the sex difference is so huge that it can be dramatized in this clever way. So I’m not quite sure what you dislike about it.

Kaiser explained:

Here’s my beef with it…

Just to make up some numbers. Let’s say there are 500 male CEOs and 25 female CEOs so the aggregate index is 20.

Instead of reporting that number, they reduce the count of male CEOs while keeping the females fixed. So let’s say 200 of those male CEOs are named Richard, William, John, and whatever the 4th name is. So they now report an index of 200/25 = 8.

Problem 1 is that this only “works” if they cherry pick the top male names, probably the 4 most common names from the period when most CEOs were born. As he admitted at the end, this index is not robust as names change in popularity over time. Kind of like that economist who said that anyone whose surname begins with A-N has a better chance of winning the Nobel Prize (or some such thing).

Problem 2: we may need an experiment to discover which of the following two statements is more effective/persuasive:

a) there are 20 male CEOs for every female CEO in America
b) there are 8 male CEOs named Richard, William, John and David for every female CEO in America

For me, I think b) is more complex to understand and in fact the magnitude of the issue has been artificially reduced by restricting to 4 names!

How about that?

I replied that I agree that the picking-names approach destroys much of the quantitative comparisons. Still, I think the point here is that the differences are so huge that this doesn’t matter. It’s a dramatic comparison. The relevant point, perhaps, is that these ratios shouldn’t be used as any sort of “index” for comparisons between scenarios. If Wolfers just wants to present the story as a way of dramatizing the underrepresentation of women, that works. But it would not be correct to use this to compare representation of women in different fields or in different eras.

I wonder if the problem is that econ has these gimmicky measures, for example the cost-of-living index constructed using the price of the Big Mac, etc. I don’t know why, but these sorts of gimmicks seem to have some sort of appeal.

John Lott as possible template for future career of “Bruno” LaCour

The recent story about the retracted paper on political persuasion reminded me of the last time that a politically loaded survey was discredited because the researcher couldn’t come up with the data.

I’m referring to John Lott, the “economist, political commentator, and gun rights advocate” (in the words of Wikipedia) who is perhaps more well known on the internet by the name of Mary Rosh, an alter ego he created to respond to negative comments (among other things, Lott used the Rosh handle to refer to himself as “the best professor I ever had”).

Again from Wikipedia:

Lott claimed to have undertaken a national survey of 2,424 respondents in 1997, the results of which were the source for claims he had made beginning in 1997. However, in 2000 Lott was unable to produce the data, or any records showing that the survey had been undertaken. He said the 1997 hard drive crash that had affected several projects with co-authors had destroyed his survey data set, the original tally sheets had been abandoned with other personal property in his move from Chicago to Yale, and he could not recall the names of any of the students who he said had worked on it. . . .

On the other hand, Rosh Lott has continued to insist that the survey actually happened. So he shares that with Michael LaCour, the coauthor of the recently retracted political science paper.

I have nothing particularly new to say about either case, but I was thinking that some enterprising reporter might call up Lott and see what he thinks about all this.

Also, Lott’s career offers some clues as to what might happen next to LaCour. Lott’s academic career dissipated and now he seems to spend his time running an organization called the Crime Prevention Research Center which is staffed by conservative scholars, so I guess he pays the bills by raising funds for this group.

One could imagine LaCour doing something similar—but he got caught with data problems before receiving his UCLA social science PhD, so his academic credentials aren’t so strong. But, speaking more generally, given that it appears that respected scholars (and, I suppose, funders, but I can’t be so sure of that as I don’t see a list of funders on the website) are willing to work with Lott, despite the credibility questions surrounding his research, I suppose that the same could occur with LaCour. Perhaps, like Lott, he has the right mixture of ability, brazenness, and political commitment to have a successful career in advocacy.

The above might all seem like unseemly speculation—and maybe it is—but this sort of thing is important. Social science isn’t just about the research (or, in this case, the false claims masquerading as research); it’s also about the social and political networks that promote the work.

Creativity is the ability to see relationships where none exist

Brent Goldfarb and Andrew King, in a paper to appear in the journal Strategic Management, write:

In a recent issue of this journal, Bettis (2012) reports a conversation with a graduate student who forthrightly announced that he had been trained by faculty to “search for asterisks”. The student explained that he sifted through large databases for statistically significant results and “[w]hen such models were found, he helped his mentors propose theories and hypotheses on the basis of which the ‘asterisks’ might be explained” (p. 109). Such an approach, Bettis notes, is an excellent way to find seemingly meaningful patterns in random data. He expresses concern that these practices are common, but notes that unfortunately “we simply do not have any baseline data on how big or small are the problems” (Bettis, 2012: p. 112).

In this article, we [Goldfarb and King] address the need for empirical evidence . . . in research on strategic management. . . .

Bettis (2012) reports that computer power now allows researchers to sift repeatedly through data in search of patterns. Such specification searches can greatly increase the probability of finding an apparently meaningful relationship in random data. . . . just by trying four functional forms for X, a researcher can increase the chance of a false positive from one in twenty to about one in six. . . .

Simmons et al. (2011) contend that some authors also push almost significant results over thresholds by removing or gathering more data, by dropping experimental conditions, by adding covariates to specified models, and so on.

And, beyond this, there’s the garden of forking paths: even if a researcher performs only one analysis of a given dataset, the multiplicity of choices involved in data coding and analysis is such that we can typically assume that different comparisons would have been studied had the data been different. That is, you can have misleading p-values without any cheating or “fishing” or “hacking” going on.
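The “one in twenty to about one in six” arithmetic in the quoted passage can be roughly checked. Here is a minimal sketch, assuming the tests are independent (my simplifying assumption, not something the quoted paper states; tests of different functional forms of the same X would be positively correlated, which would pull the rate down somewhat). The function name is made up for illustration.

```python
def familywise_false_positive_rate(alpha: float, k: int) -> float:
    """Chance of at least one 'significant' result among k independent
    tests when every null hypothesis is true (independence is assumed)."""
    return 1 - (1 - alpha) ** k


if __name__ == "__main__":
    for k in (1, 2, 4, 10):
        rate = familywise_false_positive_rate(0.05, k)
        print(f"{k:2d} tries: {rate:.3f}")
```

Under independence, four tries give a bit under 0.19, in the same ballpark as the quoted figure. Note that this only covers explicit specification searches; the forking-paths problem does not require any repeated testing at all.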

Goldfarb and King continue:

When evidence is uncertain, a single example is often considered representative of the whole (Tversky & Kahneman, 1973). Such inference is incorrect, however, if selection occurs on significant results. In fact, if “significant” results are more likely to be published, coefficient estimates will inflate the true magnitude of the studied effect — particularly if a low powered test has been used (Stanley, 2005).

They conducted a study of “estimates reported in 300 published articles in a random stratified sample from five top outlets for research on strategic management . . . [and] 60 additional proposals submitted to three prestigious strategy conferences.”

And here’s what they find:

We estimate that between 24% and 40% of published findings based on “statistically significant” (i.e. p<0.05) coefficients could not be distinguished from the Null if the tests were repeated once. Our best guess is that for about 70% of non-confirmed results, the coefficient should be interpreted to be zero. For the remaining 30%, the true B is not zero, but insufficient test power prevents an immediate replication of a significant finding. We also calculate that the magnitude of coefficient estimates of most true effects are inflated by 13%.

I’m surprised their estimated exaggeration factor is only 13%; I’d have expected much higher, even if only conditioning on “true” effects (however that is defined).
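To get a feel for how much the significance filter can inflate estimates, here is a rough sketch of the expected exaggeration factor. It assumes a normally distributed estimate with known standard error and a two-sided test at the 5% level; these are my illustrative assumptions, and this is not the procedure Goldfarb and King used. The function name and arguments are hypothetical.

```python
from statistics import NormalDist


def power_and_exaggeration(true_effect: float, se: float, alpha: float = 0.05):
    """Return (power, exaggeration factor) for a positive true effect.

    The exaggeration factor is E[|estimate| given significance] divided by
    the true effect, with estimate ~ Normal(true_effect, se).
    """
    z = NormalDist()
    c = z.inv_cdf(1 - alpha / 2)   # critical value, about 1.96
    mu = true_effect / se          # true effect in standard-error units
    p_hi = 1 - z.cdf(c - mu)       # significant with the correct sign
    p_lo = z.cdf(-c - mu)          # significant with the wrong sign
    power = p_hi + p_lo
    # E[|X|; |X| > c] for X ~ Normal(mu, 1), from truncated-normal means
    mean_abs_sig = (mu * p_hi + z.pdf(c - mu)) + (-mu * p_lo + z.pdf(c + mu))
    return power, mean_abs_sig / power / mu


if __name__ == "__main__":
    for effect in (0.5, 1.0, 2.0, 2.8):
        power, factor = power_and_exaggeration(effect, se=1.0)
        print(f"effect = {effect} se: power {power:.2f}, exaggeration {factor:.2f}")
```

In this simple setup the factor is only a little above 1.1 when power is around 80%, but it climbs well past 2 when the true effect is about one standard error. So the size of the inflation depends heavily on how well powered the studies are.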

I have not tried to follow the details of the authors’ data collection and analysis process and thus can neither criticize nor endorse their specific findings. But I’m sympathetic to their general goals and perspective.

As a commenter wrote in an earlier discussion, it is the combination of a strawman with the concept of “statistical significance” (i.e., the filtering step) that seems to be a problem, not the p-value per se.

Weggy update: it just gets sadder and sadder

Uh oh, lots on research misconduct lately. Newest news is that noted Wikipedia-lifter Ed Wegman sued John Mashey, one of his critics, for $2 million. Then he backed down and decided not to sue after all.

Best quote from Mashey’s write-up:

None of this made any sense to me, but then I am no lawyer. As it turned out, it made no sense to good lawyers either . . .

Lifting an encyclopedia is pretty impressive and requires real muscles. Lifting from Wikipedia, not so much.

In all seriousness, this is really bad behavior. Copying and garbling material from other sources and not giving the references? Uncool. Refusing to admit error? That’ll get you a regular column in a national newspaper. A 2 million dollar lawsuit? That’s unconscionable escalation, it goes beyond chutzpah into destructive behavior. I don’t imagine that Raymond Keene Bernard Dunn would be happy about what is being done in his name.

Can talk therapy halve the rate of cancer recurrence? How to think about the statistical significance of this finding? Is it just another example of the garden of forking paths?

James Coyne (who we last encountered in the sad story of Ellen Langer) writes:

I’m writing to you now about another matter about which I hope you will offer an opinion. Here is a critique of a study, as well as the original study that claimed to find an effect of group psychotherapy on time to recurrence and survival of early breast cancer patients. In the critique I note that confidence intervals for the odds ratio of raw events for both death and recurrence have P values between .3 and .5. The authors’ claims are based on dubious adjusted analyses. I’ve tried for a number of years to get the data for reanalysis, but the latest effort ended in the compliance officer for Ohio State University pleading that the data were the investigator’s intellectual property. The response apparently written by the investigator invoked you as a rationale for her analytic decisions. I wonder if you could comment on that.

Here is the author’s invoking of you:

In analyzing the data and writing the manuscript, Andersen et al. (2008) were fully aware of opinions and data regarding the use of covariates. See, for example, a recent discussion (2011) among investigators about this issue and the response of Andrew Gelman, an expert on applied Bayesian data analysis and hierarchical models. Gelman (2011) provided positive recommendations for covariate inclusion, which are corroborated by studies examining covariate selection and entry that appeared prior to and now following Gelman’s statement in 2011.

Here’s what Coyne sent me:

“Psychologic Intervention Improves Survival for Breast Cancer Patients: A Randomized Clinical Trial,” a 2008 article by Barbara Andersen, Hae-Chung Yang, William Farrar, Deanna Golden-Kreutz, Charles Emery, Lisa Thornton, Donn Young, and William Carson, which reported that a talk-therapy intervention reduced the risk of breast cancer recurrence and death from breast cancer, with a hazard ratio of approximately 50% (that is, the instantaneous risk of recurrence, or of death, at any point was claimed to have been reduced by half).

“Finding What Is Not There: Unwarranted Claims of an Effect of Psychosocial Intervention on Recurrence and Survival,” a 2009 article by Michael Stefanek, Steven Palmer, Brett Thombs, and James Coyne, arguing that the claims in the aforementioned article were implausible on substantive grounds and could be explained by a combination of chance variation and opportunistic statistical analysis.

A report from Ohio State University ruling that Barbara Andersen, the lead researcher on the controversial study, was not required to share her raw data with Stefanek et al., as they had requested so they could perform an independent analysis.

I took a look and replied to Coyne as follows:

1. I noticed this bit in the Ohio State report:

“The data, if disclosed, would reveal pending research ideas and techniques. Consequently, the release of such information would put those using such data for research purposes in a substantial competitive disadvantage as competitors and researchers would have access to the unpublished intellectual property of the University and its faculty and students.”

I see what they’re saying but it still seems a bit creepy to me. Think of it from the point of view of the funders of the study, or the taxpayers, or the tuition-paying students. I can’t imagine that they all care so much about the competitive position of the university (or, as they put it, the “University”).

Also, given that the article was published in 2008, how could it be that the data could “reveal pending research ideas and techniques” in 2014? I mean, sure, my research goes slowly too, but . . . 6 years???

I read the report you sent me, that has quotes from your comments along with the author’s responses. It looks like the committee did not make a judgment on this? They just seemed to report what you wrote, and what the authors wrote, without comments.

Regarding the more general points about preregistration, I have mixed feelings. On one hand, I agree that, because of the garden of forking paths, it’s hard to know what to make of the p-values that come out of a study that had flexible rules on data collection, multiple endpoints, and the like. On the other hand, I’ve never done a preregistered study myself. So I do feel that if a non-preregistered study is analyzed _appropriately_, it should be possible to get useful inferences. For example, if there are multiple endpoints, it’s appropriate to analyze all the endpoints, not to just pick one. When a study has a data-dependent stopping rule, the information used in the stopping rule should be included in the analysis. And so on.

On a more specific point, you argue that the study in question used a power analysis that was too optimistic. You perhaps won’t be surprised to hear that I am inclined to believe you on that, given that all the incentives go in the direction of making optimistic assumptions about treatment effects. Looking at the details: “The trial was powered to detect a doubling of time to an endpoint . . . cancer recurrences.” Then in the report when they defend the power analysis, they talk about survival rates but I don’t see anything about time to an endpoint. They then retreat to a retrospective justification, that “we conducted the power analysis based on the best available data sources of the early 1990’s, and multiple funding agencies (DoD, NIH, ACS) evaluated and approved the validity of our study proposal and, most importantly, the power analysis for the trial.” So their defense here is ultimately procedural rather than substantive: Maybe their assumptions were too optimistic, but everyone was optimistic back then. This doesn’t much address the statistical concerns but it is relevant to implications of ethical malfeasance.

Regarding the reference to my work: Yes, I have recommended that, even in a randomized trial, it can make sense to control for relevant background variables. This is actually a continuing area of research in that I think that we should be using informative priors to stabilize these adjustments, to get something more reasonable than would be obtained by simple least squares. I do agree with you that it is appropriate to do an unadjusted analysis as well. Unfortunately researchers do not always realize this.

Regarding some of the details of the regression analysis: the discussion brings up various rules and guidelines, but really it depends on contexts. I agree with the report that it can be ok for the number of adjustment variables to exceed 1/10 of the number of data points. There’s also some discussion of backward elimination of predictors. I agree with you that this is in general a bad idea (and certainly the goal in such a setting should not be “to reach a parsimonious model” as claimed in this report). However, practical adjustment can involve adding and removing variables, and this can sometimes take the form of backward elimination. So it’s hard to say what’s right, just from this discussion. I went into the paper and they wrote, “By using a backward elimination procedure, any covariates with P < .25 with an endpoint remained in the final model for that endpoint.” This indeed is poor practice; regrettably, it may well be standard practice.

2. Now I was curious so I read all of the 2008 paper. I was surprised to hear that psychological intervention improves survival for breast cancer patients. It says that the intervention will “alter health behaviors, and maintain adherence to cancer treatment and care.” Sure, ok, but, still, it’s pretty hard to imagine that this will double the average time to recurrence. Doubling is a lot! Later in the paper they mention “smoking cessation” as one of the goals of the treatment. I assume that smoking cessation would reduce recurrence rates. But I don’t see any data on smoking in the paper, so I don’t know what to do with this.

I’m also puzzled because, in their response to your comments, the author or authors say that time-to-recurrence was the unambiguous primary endpoint, but in the abstract they don’t say anything about time-to-recurrence, instead giving proportion of recurrence and survival rates conditional on the time period of the study. Also, the title says Survival, not Time to Recurrence.

The estimated effect sizes (an approx 50% reduction in recurrence and 50% reduction in death) are implausibly large, but of course this is what you get from the statistical significance filter. Given the size of the study, the reported effects would have to be just about this large, else they wouldn’t be statistically significant.

OK, now to the results: “With 11 years median follow-up, disease recurrence had occurred for 62 of 212 (29%) women, 29 in the Intervention arm and 33 in the Assessment–only arm.” Ummm, that’s 29/114 = 25% for the intervention group and 33/113 = 29% in the control group, a difference of 4 percentage points. So I don’t see how they can get those dramatic results shown in figure 3. To put it another way, in their dataset, the probability of recurrence-free survival was 75/114 = 66% in the treatment group and 65/113 = 58% in the control group. (Or, if you exclude the people who dropped out of the study, 75/109 = 69% in treatment group and 65/103 = 63% in control group). A 6 or 8 percentage point difference ain’t nothing, but Figure 3 shows much bigger effects. OK, I see, Figure 3 is just showing survival for the first 5 years. But, if differences are so dramatic after 5 years and then reduce in the following years, that’s interesting too. Overall I’m baffled by the way in which this article goes back and forth between different time durations.
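The raw recurrence comparison is easy to redo by hand. Here is a minimal sketch using the counts quoted above (29 of 114 and 33 of 113) and a simple unpooled standard error for a difference in proportions. That choice of standard error is my own quick check, not the analysis used in the paper or in the critique, and the function name is made up.

```python
from math import sqrt


def compare_proportions(x1: int, n1: int, x2: int, n2: int):
    """Return (p1, p2, difference p2 - p1, unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1, p2, p2 - p1, se


if __name__ == "__main__":
    # Recurrences: intervention arm first, then assessment-only arm
    p_treat, p_ctrl, diff, se = compare_proportions(29, 114, 33, 113)
    print(f"intervention {p_treat:.1%}, control {p_ctrl:.1%}")
    print(f"difference {diff:.1%}, standard error {se:.1%}")
```

The difference comes out to roughly 4 percentage points with a standard error of about 6, so the raw gap is well under one standard error from zero.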

3. Now time to read your paper with Stefanek et al. Hmmm, at one point you write, “There were no differences in unadjusted rates of recurrence or survival between the intervention and control groups.” But there were such differences, no? The 4% reported above? I agree that this difference is not statistically significant and can be explained by chance, but I wouldn’t call it “no difference.”

Overall, I am sympathetic with your critique, partly on general grounds and partly because, yes, there are lots of reasonable adjustments that could be done to these data. The authors of the article in question spend lots of time saying that the treatment and control groups are similar on their pre-treatment variables—but then it turns out that the adjustment for pre-treatment variables is necessary for their findings. This does seem like a “garden of forking paths” situation to me. And the response of the author or authors is, sadly, consistent with what I’ve seen in other settings: a high level of defensiveness coupled with a seeming lack of interest in doing anything better.

I am glad that it was possible for you to publish this critique. Sometimes it seems that this sort of criticism faces a high hurdle to reach publication.

I sent the above to Coyne, who added this:

For me it’s a matter of not only scientific integrity, but what we can reasonably tell cancer patients about what will extend their lives. They are vulnerable and predisposed to grab at anything they can, but also to feel responsible when their cancer progresses in the face of information that it should be controllable by positive thinking or by taking advantage of some psychological intervention. I happen to believe in support groups as an opportunity for cancer patients to find support and the rewards of offering support to others in the same predicament. If patients want those experiences, they should go to readily available support groups. However they should not go with the illusion that it is prolonging their life or that not going is shortening it.

I have done a rather extensive and thorough systematic review and analysis of the literature, and I can find no evidence that, in clinical trials in which survival was an a priori outcome, an advantage was found for psychological interventions.

BREAKING . . . Princeton decides to un-hire Kim Jong-Un for tenure-track assistant professorship in aeronautical engineering


Full story here.

Here’s the official quote:

As you’ve correctly noted, at this time the individual is not a Princeton University employee. We will review all available information and determine next steps.

And here’s what Kim has to say:

I’m gathering evidence and relevant information so I can provide a single comprehensive response. I will do so at my earliest opportunity.

“In my previous post on the topic, I expressed surprise at the published claim but no skepticism”


Don’t believe everything you read in the tabloids, that’s for sure.

P.S. I googled to see what else was up with this story and found this article which reported that someone claimed that Don Green’s retraction (see above link for details) was the first for political science.

I guess it depends on how you define “retraction” and how you define “political science.” Cos a couple of years ago I published this:

In the paper, “Should the Democrats move to the left on economic policy?” AOAS 2 (2), 536-549 (2008), by Andrew Gelman and Cexun Jeffrey Cai, because of a data coding error on one of the variables, all our analysis of social issues is incorrect. Thus, arguably, all of Section 3 is wrong until proven otherwise. We thank Yang Yang Hu for discovering this error and demonstrating its importance.

Officially this is a correction not a retraction. And, although it’s entirely a political science paper, it was not published in a political science journal. So maybe it doesn’t count. I’d guess there are others, though. I don’t think Aristotle ever retracted his claim that slavery is cool, but give him time, the guy has a lot on his plate.

Objects of the class “Foghorn Leghorn”

Reprinting a classic from 2010:


The other day I saw some kids trying to tell knock-knock jokes. The only one they really knew was the one that goes: Knock knock. Who’s there? Banana? Banana who? Knock knock. Who’s there? Banana? Banana who? Knock knock. Who’s there? Orange. Orange who? Orange you glad I didn’t say banana?

Now that’s a fine knock-knock joke, among the best of its kind, but what interests me here is that it’s clearly not a basic k-k; rather, it’s an inspired parody of the form. For this to be the most famous knock-knock joke—in some circles, the only knock-knock joke—seems somehow wrong to me. It would be as if everybody were familiar with Duchamp’s Mona-Lisa-with-a-moustache while never having heard of Leonardo’s original.

Here’s another example: Spinal Tap, which lots of people have heard of without being familiar with the hair-metal acts that inspired it.

The poems in Alice’s Adventures in Wonderland and Through the Looking Glass are far far more famous now than the objects of their parody.

I call this the Foghorn Leghorn category, after the Warner Brothers cartoon rooster (“I say, son . . . that’s a joke, son”) who apparently was based on a famous radio character named Senator Claghorn. Claghorn has long been forgotten, but, thanks to reruns, we all know about that silly rooster.

And I think “Back in the USSR” is much better known than the original “Back in the USA.”

Here’s my definition: a parody that is more famous than the original.

Some previous cultural concepts

Objects of the class “Whoopi Goldberg”

Objects of the class “Weekend at Bernie’s”

P.S. Commenter Jhe has a theory:

I’m not entirely surprised that often the parody is better known than its object. The parody illuminates some aspect of culture which did not necessarily stand out until the parody came along. The parody takes the class of objects being parodied and makes them obvious and memorable.

Bayesian inference: The advantages and the risks

This came up in an email exchange regarding a plan to come up with and evaluate Bayesian prediction algorithms for a medical application:

I would not refer to the existing prediction algorithm as frequentist. Frequentist refers to the evaluation of statistical procedures but it doesn’t really say where the estimate or prediction comes from. Rather, I’d say that the Bayesian prediction approach succeeds by adding model structure and prior information.

The advantages of Bayesian inference include:
1. Including good information should improve prediction,
2. Including structure can allow the method to incorporate more data (for example, hierarchical modeling allows partial pooling so that external data can be included in a model even if these external data share only some characteristics with the current data being modeled).

The risks of Bayesian inference include:
3. If the prior information is wrong, it can send inferences in the wrong direction.
4. Bayes inference combines different sources of information; thus it is no longer an encapsulation of a particular dataset (which is sometimes desired, for reasons that go beyond immediate predictive accuracy and instead touch on issues of statistical communication).

OK, that’s all background. The point is that we can compare Bayesian inference with existing methods. The point is not that the philosophies of inference are different—it’s not Bayes vs frequentist, despite what you sometimes hear. Rather, the issue is that we’re adding structure and prior information and partial pooling, and we have every reason to think this will improve predictive performance, but we want to check.
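To make “partial pooling” concrete, here is a minimal sketch of the textbook normal-normal case, where a group’s estimate is a precision-weighted compromise between its own mean and the population-level mean. It assumes the within-group sd (sigma), between-group sd (tau), and population mean are known, which is a simplification for illustration; in a real hierarchical model these would be estimated. All names are hypothetical.

```python
def partial_pool(group_mean: float, n: int, sigma: float,
                 prior_mean: float, tau: float) -> float:
    """Posterior mean for one group in a normal-normal model with known
    sigma (within-group sd) and tau (between-group sd)."""
    data_precision = n / sigma ** 2   # information from the group's own data
    prior_precision = 1 / tau ** 2    # information from the population model
    return ((data_precision * group_mean + prior_precision * prior_mean)
            / (data_precision + prior_precision))


if __name__ == "__main__":
    # A small group gets pulled strongly toward the prior mean,
    # a large group hardly at all.
    for n in (4, 40, 400):
        print(n, partial_pool(group_mean=10.0, n=n, sigma=2.0,
                              prior_mean=0.0, tau=1.0))
```

This also illustrates the risk side: if the prior mean is wrong, small groups are pulled in the wrong direction.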

To evaluate, I think we can pretty much do what you say: ROC as basic summary and do graphical exploration, cross-validation (and related methods such as WAIC), and external validation.
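As one way to get the ROC summary, the area under the curve can be computed directly as the fraction of (positive, negative) pairs that the prediction ranks correctly, with ties counted as half. This is a minimal sketch with made-up function and variable names. For an honest comparison the scores should come from held-out data, for example cross-validation folds, which is not shown here.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted risks; labels: 1 for cases, 0 for non-cases.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    # Count correctly ordered pairs; a tie counts as half a win
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The same function can be applied to the existing algorithm’s predictions and to the Bayesian model’s predictions on the same held-out cases.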

New Alan Turing preprint on Arxiv!


Dan Kahan writes:

I know you are on 30-day delay, but since the blog version of you will be talking about Bayesian inference in couple of hours, you might like to look at paper by Turing, who is on 70-yr delay thanks to British declassification system, who addresses the utility of using likelihood ratios for helping to form a practical measure of evidentiary weight (“bans” & “decibans”) that can guide cryptographers (who presumably will develop sense of professional judgment calibrated to the same).

Actually it’s more like a 60-day delay, but whatever.

The Turing article is called “The Applications of Probability to Cryptography,” it was written during the Second World War, and it’s awesome.

Here’s an excerpt:

The evidence concerning the possibility of an event occurring usually divides into a part about which statistics are available, or some mathematical method can be applied, and a less definite part about which one can only use one’s judgement. Suppose for example that a new kind of traffic has turned up and that only three messages are available. Each message has the letter V in the 17th place and G in the 18th place. We want to know the probability that it is a general rule that we should find V and G in these places. We first have to decide how probable it is that a cipher would have such a rule, and as regards this one can probably only guess, and my guess would be about 1/5,000,000. This judgement is not entirely a guess; some rather insecure mathematical reasoning has gone into it, something like this:-

The chance of there being a rule that two consecutive letters somewhere after the 10th should have certain fixed values seems to be about 1/500 (this is a complete guess). The chance of the letters being the 17th and 18th is about 1/15 (another guess, but not quite as much in the air). The probability of a letter being V or G is 1/676 (hardly a guess at all, but expressing a judgement that there is no special virtue in the bigramme VG). Hence the chance is 1/(500 × 15 × 676) or about 1/5,000,000. This is however all so vague, that it is more usual to make the judgment “1/5,000,000” without explanation.

The question as to what is the chance of having a rule of this kind might of course be resolved by statistics of some kind, but there is no point in having this very accurate, and of course the experience of the cryptographer itself forms a kind of statistics.

The remainder of the problem is then solved quite mathematically. . . .

He’s so goddamn reasonable. He’s everything I aspire to.

Reasonableness is, I believe, an underrated trait in research. By “reasonable,” I don’t mean a supine acceptance of the status quo, but rather a sense of the connections of the world, a sort of generalized numeracy, an openness and honesty about one’s sources of information. “This judgement is not entirely a guess; some rather insecure mathematical reasoning has gone into it”—exactly!

Damn this guy is good. I’m glad to see he’s finally posting his stuff on Arxiv.
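For the record, Turing’s back-of-the-envelope prior in the excerpt is easy to reproduce. The sketch below multiplies his three guesses and then expresses the resulting prior odds in decibans, taking a ban as a factor of 10 in the odds and a deciban as a tenth of that. Putting the prior on the deciban scale is my own illustration; the excerpt stops before the likelihood calculation, so that part is not attempted. Variable names are made up.

```python
from math import log10

# Turing's three guesses from the excerpt
p_rule_exists = 1 / 500     # some rule fixes two consecutive letters after the 10th
p_positions = 1 / 15        # the fixed letters are the 17th and 18th
p_letters = 1 / 26 ** 2     # the bigram is VG in particular (26 * 26 = 676)

prior = p_rule_exists * p_positions * p_letters


def decibans(odds: float) -> float:
    """Express odds on a log scale: 10 decibans = 1 ban = a factor of 10."""
    return 10 * log10(odds)


if __name__ == "__main__":
    prior_odds = prior / (1 - prior)
    print(f"prior is about 1 in {1 / prior:,.0f}")
    print(f"prior odds are about {decibans(prior_odds):.1f} decibans")
```

The product is 1 in 5,070,000, which Turing rounds to 1/5,000,000. On the deciban scale that is roughly 67 decibans against the rule before looking at the three messages.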