
Cloak and dagger

Elan B. writes:

I saw this JAMA Pediatrics article [by Julia Raifman, Ellen Moscoe, and S. Bryn Austin] getting a lot of press for claiming that LGBT suicide attempts went down 14% after gay marriage was legalized.
The heart of the study is comparing suicide attempt rates (in last 12 months) before and after exposure — gay marriage legalization in their state. For LGBT teens, this dropped from 28.5% to 24.5%.
In order to test whether this drop was just part of an ongoing downward trend in LGBT suicide attempts, they do a placebo test, checking whether rates dropped 2 years before legalization. In the text of the article, they simply state that there is no drop.
But then you open up the supplement and find that about half of the drop in rates — 2.2% — already came 2 years before legalization. However, since 0 is contained in the 95% confidence interval, it’s not significant! Robustness check passed.
In figure 1 of the article, they graph suicide attempts before legalization to show that the trends are flat, but they don’t show the LGBT-specific rates, even though they have those data for some of the states.
Very suspicious to me, what do you think?

My reply: I wouldn’t quite say “suspicious.” I expect these researchers are doing their best; these are just hard problems. What they’ve found is an association which they want to present as causation, and they don’t fully recognize that limitation in their paper.

Here are the key figures:

And from here it’s pretty clear that the trends are noisy, so that little differences in the model can make big differences in the results, especially when you’re playing the statistical significance game. That’s fine—if the trends are noisy, they’re noisy, and your analysis needs to recognize this, and in any case it’s a good idea to explore such data.

I also share Elan’s concern about the whole “robustness check” approach to applied statistics, in which a central analysis is presented and then various alternatives are presented, with the goal being to show the same thing as the main finding (for perturbation-style robustness checks) or to show nothing (for placebo-style robustness checks).

One problem with this mode of operation is that robustness checks themselves have many researcher degrees of freedom, so it’s not clear what we can take from these. Just for example, if you do a perturbation-style robustness check and you find a result in the same direction but not statistically significant (or, as the saying goes, “not quite” statistically significant), you can call it a success because it’s in the right direction and, if anything, it makes you feel even better that the main analysis, which you chose, succeeded. But if you do a placebo-style robustness check and you find a result in the same direction but not statistically significant, you can just call it a zero and claim success in that way.

So I think there’s a problem in that there’s a pressure for researchers to seek, and claim, more certainty and rigor than is typically possible from social science data. If I’d written this paper, I think I would’ve started with various versions of the figures above, explored the data more, then moved to the regression line, but always going back to the connection between model, data, and substantive theories. But that’s not what I see here: in the paper at hand, there’s the more standard pattern of some theory and exploration motivating a model, then statistical significance is taken as tentative proof, to be shored up with robustness studies, then the result is taken as a stylized fact and it’s story time. There’s nothing particularly bad about this particular paper, indeed their general conclusions might well be correct (or not). They’re following the rules of social science research and it’s hard to blame them for that. I don’t see this paper as “junk science” in the way of the himmicanes, air rage, or ages-ending-in-9 papers (I guess that’s why it appeared in JAMA, which is maybe a bit more serious-minded than PPNAS or Lancet); rather, it’s a reasonable bit of data exploration that could be better. I’d say that a recognition that it is data exploration could be a first step to encouraging researchers to think more seriously about how best to explore such data. If they really do have direct data on suicide rates of gay people, that would seem like a good place to look, as Elan suggests.

Clay pigeon

Sam Harper writes:

Not that you are collecting these kinds of things, but I wanted to point to (yet) another benefit of the American Economic Association’s requirement of including replication datasets (unless there are confidentiality constraints) and code in order to publish in most of their journals—certainly for the top-tier ones like Am Econ Review: correcting coding mistakes!
  1. Lundstrom, Samuel (2017). “The Impact of Family Income on Child Achievement: Evidence from the Earned Income Tax Credit: Comment.” The American Economic Review 107(2): 623-628.
  2. Dahl, Gordon B., and Lance Lochner (2017). “The Impact of Family Income on Child Achievement: Evidence from the Earned Income Tax Credit: Reply.” The American Economic Review 107(2): 629-631.
The papers are no doubt gated (I attached them if you are interested), but I thought it was refreshing to see what I consider to be close to a model exchange between the original authors and the replicator: the replicator is able to reproduce nearly everything but finds a serious coding error, corrects it, and generates new (and presumably improved) estimates, and the original authors admit they made a coding error without making much of a fuss and also generate revised estimates. Post-publication review doing what it should. The tone is also likely more civil because the effort to reproduce largely succeeded and the original authors did not have to eat crow or say that they made a mistake that substantively changed their interpretation (though economists’ obsession with statistical significance is still disappointing). Credit to Lundstrom for not trying to over-hype the change in the results.
As an epidemiologist I do feel embarrassed that the biomedical community is still so far behind other disciplines when it comes to taking reproducible science seriously—especially the “high impact” general medical journals. We should not have to take our cues from economists, though perhaps it helps that much of the work they do uses public data.
I haven’t looked into this one but I agree with the general point.

Looking for rigor in all the wrong places (my talk this Thursday in the Columbia economics department)

Looking for Rigor in All the Wrong Places

What do the following ideas and practices have in common: unbiased estimation, statistical significance, insistence on random sampling, and avoidance of prior information? All have been embraced as ways of enforcing rigor but all have backfired and led to sloppy analyses and erroneous inferences. We discuss these problems and some potential solutions in the context of problems in social science research, and we consider ways in which future statistical theory can be better aligned with practice.

The seminar will be held Thursday, February 23rd, at the Economics Department, International Affairs Building (420 W. 118th Street), room 1101, from 2:30 to 4:00 pm.

I don’t have one particular paper, but here are a few things that people could read:

http://www.stat.columbia.edu/~gelman/research/published/rd_china_5.pdf
http://www.stat.columbia.edu/~gelman/research/unpublished/regression_discontinuity_16sep6.pdf
http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf

Unethical behavior vs. being a bad guy

I happened to come across this article and it reminded me of the general point that it’s possible to behave unethically without being a “bad guy.”

The story in question involves some scientists who did some experiments about thirty years ago on the biological effects of low-frequency magnetic fields. They published their results in a series of papers which I read when I was a student, and I found some places where I thought their analysis could be improved.

The topic seemed somewhat important—at the time, there was concern about cancer risks from exposure to power lines and other sources of low-frequency magnetic fields—so I sent a letter to the authors of the paper, pointing out two ways I thought their analysis could be improved, and requesting their raw data. I followed up the letter with a phone call.

Just for some context:

1. At no time did I think, or do I think, that they were doing anything unethical in their data collection or analysis. I just thought that they weren’t making full use of the data they had. Their unethical behavior, as I see it, came at the next stage, when they refused to share their data.

2. Those were simpler times. I assumed by default that published work was high quality, so when I saw what seemed like a flaw in the analysis, I wasn’t so sure—I was very open to the possibility that I’d missed something myself—and I didn’t see the problems in that paper as symptomatic of any larger issues.

3. I was not trying to “gotcha” these researchers. I thought they too would be interested in getting more information out of their data.

To continue with the story: When I called on the phone, the lead researcher on the project said he didn’t want to share the data: they were in lab notebooks, it would take effort to copy them, and his statistician had assured him that the analysis was just fine as is.

I think this was unethical behavior, given that: (a) at the time, this work was considered to have policy implications; (b) there was no good reason for the researcher to think that his statistician had particular expertise in this sort of analysis; (c) I’d offered some specific ways in which the data analysis could be improved so there was a justification for my request; (d) the work had been done at the Environmental Protection Agency, which is part of the U.S. government; (e) the dataset was pretty small so how hard could it be to photocopy some pages of lab notebooks and drop them in the mail; and, finally (f) the work was published in a scientific journal that was part of the public record.

A couple decades later, I wrote about the incident, and the biologist and the statistician responded with defenses of their actions. I felt at the time of the original event, and after reading their letters, and I still feel, that these guys were trying to do their best, that they were acting according to what they perceived to be their professional standards, and that they were not trying to impede the progress of science and public health.

To put it another way, I did not, and do not, think of them as “bad guys.” Not that this is so important—there’s no reason why these two scientists should particularly care about my opinion of them, nor am I any kind of moral arbiter here. I’m just sharing my perspective to make the more general point that it is possible to behave unethically without being a bad person.

I do think the lack of data sharing was unethical—not as unethical as fabricating data (Lacour), or hiding data (Hauser) or brushing aside a barrage of legitimate criticism from multiple sources (Cuddy), or lots of other examples we’ve discussed over the years on this blog—but I do feel it is a real ethical lapse, for reasons (a)-(f) given above. But I don’t think of this as the product of “bad guys.”

My point is that it’s possible to go about your professional career, doing what you think is right, but still making some bad decisions: actions which were not just mistaken in retrospect, but which can be seen as ethical violations on some scale.

One way to view this is that everyone involved in research—including those of us who see ourselves as good guys—should be aware that we can make unethical decisions at work. “Unethical” labels the action, not the person, and ethics is a product of a situation as well as of the people involved.

Should the Problems with Polls Make Us Worry about the Quality of Health Surveys? (my talk at CDC tomorrow)

My talk at the CDC, Tuesday, February 21, 2017, 12:00 noon, 2400 Century Center, Room 1015C:

Should the Problems with Polls Make Us Worry about the Quality of Health Surveys?

Response rates in public opinion polls have been steadily declining for more than half a century and are currently heading toward the 0% mark. We have learned much in recent years about the problems this is causing and how we can improve data collection and statistical analysis to get better estimates of opinion and opinion trends. In this talk, we review research in this area and then discuss the relevance of this work to similar problems in health surveys.

P.S. I gave the talk. There were no slides. OK, I did send along a subset of these, but I spent only about 5 minutes on them out of a 40-minute lecture, so the slides will give you close to zero sense of what I was talking about. I have further thoughts about the experience which I’ll save for a future post, but for now just let me say that if you weren’t at the talk, and you don’t know anyone who was there, then the slides won’t help.

Blind Spot

X pointed me to this news article reporting an increase in death rate among young adults in the United States:

According to a study published on January 26 in the scientific journal The Lancet, the death rate of young Americans aged 25 to 35 rose between 1999 and 2014, even as that rate has been falling steadily across the richest countries as a whole for forty years. . . . It is mainly young white women who are pushing the numbers up . . . Analysis of statistics collected from the National Center for Health Statistics shows that the death rate of 25-year-old white women rose by an average of 3% per year over the fifteen years considered, and by 2.3% for women in their thirties. For young men of the same ages, the annual increase in the death rate is 1.9%.

I ran this by Jonathan Auerbach to see what he thought. After all, it’s the Lancet, which seems to specialize in papers of high publicity and low content, so it’s not like I’m gonna believe anything in there without careful scrutiny.

As part of our project, Jonathan had already run age-adjusted estimates for different ethnic groups for each decade of age. These time series should be better than what was in the paper discussed in the above news article because, in addition to age adjusting, we also got separate estimated trends for each state, fitting some sort of hierarchical model in Stan.

Jonathan reported that we found a similar increase in death rates for women after adjustment. But there are comparable increases for men after breaking down by state.

Here are the estimated trends in age-adjusted death rates for non-Hispanic white women aged 25-34:

And here are the estimated trends for men:

In the graphs for the women, certain states with too few observations were removed. (It would be fine to estimate these trends from the raw data, but for simplicity we retrieved some aggregates from the CDC website, and it didn’t provide numbers in every state and every year.)

Anyway, the above graphs show what you can do with Stan. We’re not quite sure what to do with all these analyses: we don’t have stories to go with them so it’s not clear where they could be published. But at least we can blog them in response to headlines on mortality trends.
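If you want to play along at home, here’s a minimal sketch of the kind of partial-pooling model I’m describing, written with rstanarm (Stan under the hood). The data below are simulated stand-ins, not Jonathan’s actual CDC aggregates, and this is not his actual model:

# Minimal sketch: partial pooling of state-level linear trends in log death rates.
# Everything here is simulated; it just shows the structure of the model.
library(rstanarm)

set.seed(1)
d <- expand.grid(state = 1:50, year = 1999:2014)
d$year_c <- d$year - 2006                      # center the year
state_int   <- rnorm(50, -7.0, 0.15)           # state-level baseline log rates
state_slope <- rnorm(50, 0.01, 0.01)           # state-level trends
d$log_rate <- state_int[d$state] + state_slope[d$state] * d$year_c + rnorm(nrow(d), 0, 0.05)

# Common trend plus state-specific intercepts and slopes, partially pooled
fit <- stan_lmer(log_rate ~ year_c + (1 + year_c | state), data = d)
print(fit, digits = 3)

The point of the partial pooling is that states with few observations get pulled toward the national trend instead of being dropped or taken at face value.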

P.S. The Westlake titles keep on coming. It’s not just that they are so catchy—after all, that’s their point—but how apt they are, each time. And the amazing thing is, I’m using them in order. Those phrases work for just about anything. I’m just looking forward to a month or so on when I’ve worked my way down to the comedy titles lower down on the list.

Accessing the contents of a stanfit object

I was just needing this. Then, lo and behold, I found it on the web. It’s credited to Stan Development Team but I assume it was written by Ben and Jonah. Good to have this all in one place.

ComSciCon: Science Communication Workshop for Graduate Students

“Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.”

Javier Benitez writes:

This is a paragraph from Kathryn Montgomery’s book, How Doctors Think:

If medicine were practiced as if it were a science, even a probabilistic science, my daughter’s breast cancer might never have been diagnosed in time. At 28, she was quite literally off the charts, far too young, an unlikely patient who might have eluded the attention of anyone reasoning “scientifically” from general principles to her improbable case. Luckily, medicine is a practice that ignores the requirements of science in favor of patient care.

I [Benitez] am not sure I agree with her assessment. I have been doing some reading on the history and philosophy of science—there’s not much on the philosophy of medicine—and this is a tough question to answer, at least for me.

I would think that science, done right, should help, not hinder, the cause of cancer decision making. (Incidentally, the relevant science here would necessarily be probabilistic, so I wouldn’t speak of “even” a probabilistic science as if it were worth considering any deterministic science of cancer diagnosis.)
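To make “probabilistic” concrete, here’s the sort of calculation I have in mind. The numbers are entirely made up for illustration; they’re not from Montgomery’s book or from any cancer study:

# Hypothetical numbers, purely illustrative
prior <- 1e-4    # assumed baseline probability of breast cancer at age 28
sens  <- 0.85    # assumed P(suspicious clinical finding | cancer)
spec  <- 0.95    # assumed P(no suspicious finding | no cancer)

post <- prior * sens / (prior * sens + (1 - prior) * (1 - spec))
post             # roughly 0.002: still small, but the odds have shifted by a factor of 17

A low base rate doesn’t tell the doctor to stop looking; it tells her how much more evidence she needs. That’s exactly the kind of reasoning a probabilistic science is supposed to support.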

So how to think about the above quote? I have a few directions, in no particular order:

1. Good science should help, but bad science could hurt. It’s possible that there’s enough bad published work in the field of cancer diagnosis that a savvy doctor is better off ignoring a lot of it, performing his or her own meta-analysis, as it were, partially pooling the noisy and biased findings toward some more reasonable theory-based model.

2. I haven’t read the book where this quote comes from, but the natural question is, How did the doctor diagnose the cancer in that case? Presumably the information used by the doctor could be folded into a scientific diagnostic procedure.

3. There’s also the much-discussed cost-benefit angle. Early diagnosis can save lives, but it can also have costs in dollars and health when there is misdiagnosis.

To the extent that I have a synthesis of all these ideas, it’s through the familiar idea of anomalies. Science (that is, probability theory plus data plus models of data plus empirical review and feedback) is supposed to be the optimal way to make decisions under uncertainty. So if doctors have a better way of doing it, this suggests that the science they’re using is incomplete, and they should be able to do better.

The idea here is to think of the “science” of cancer diagnosis not as a static body of facts or even as a method of inquiry, but as a continuously-developing network of conjectures and models and data.

To put it another way, it can make sense to “ignore the requirements of science.” And when you make that decision, you should explain why you’re doing it—what information you have that moves you away from what would be the “science-based” decision.


Pizzagate and Kahneman, two great flavors etc.

1. The pizzagate story (of Brian Wansink, the Cornell University business school professor and self-described “world-renowned eating behavior expert for over 25 years”) keeps developing.

Last week someone forwarded me an email from the deputy dean of the Cornell business school regarding concerns about some of Wansink’s work. This person asked me to post the letter (which he assured me “was written with the full expectation that it would end up being shared”) but I wasn’t so interested in this institutional angle so I passed it along to Retraction Watch, along with links to Wansink’s contrite note and a new post by Jordan Anaya listing some newly-discovered errors in yet another paper by Wansink.

Since then, Retraction Watch ran an interview with Wansink, in which the world-renowned eating behavior expert continued with a mixture of contrition and evasion, along with insights into his workflow, for example this:

Also, we realized we asked people how much pizza they ate in two different ways – once, by asking them to provide an integer of how many pieces they ate, like 0, 1, 2, 3 and so on. Another time we asked them to put an “X” on a scale that just had a “0” and “12” at either end, with no integer mark in between.

This is weird for two reasons. First, how do you say “we realized we asked . . .”? What’s to realize? If you asked the question that way, wouldn’t you already know this? Second, who eats 12 pieces of pizza? I guess they must be really small pieces!

Wansink also pulls one out of the Bargh/Baumeister/Cuddy playbook:

Across all sorts of studies, we’ve had really high replication of our findings by other groups and other studies. This is particularly true with field studies. One reason some of these findings are cited so much is because other researchers find the same types of results.

Ummm . . . I’ll believe it when I see the evidence. And not before.

In our struggle to understand Wansink’s mode of operation, I think we should start from the position that he’s not trying to cheat; rather, he just doesn’t know what he’s doing. Think of it this way: it’s possible that he doesn’t write the papers that get published, he doesn’t produce the tables with all the errors, he doesn’t analyze the data, maybe he doesn’t even collect the data. I have no idea who was out there passing out survey forms in the pizza restaurant—maybe some research assistants? He doesn’t design the survey forms—that’s how it is that he just realized that they asked that bizarre 0-to-12-pieces-of-pizza question. Also he’s completely out of the loop on statistics. When it comes to stats, this guy makes Satoshi Kanazawa look like Uri Simonsohn. That explains why his response to questions about p-hacking or harking was, “Well, we weren’t testing a registered hypothesis, so there’d be no way for us to try to massage the data to meet it.”

What Wansink has been doing for several years is organizing studies, making sure they get published, and doing massive publicity. For years and years and years, he’s been receiving almost nothing but positive feedback. (Yes, five years ago someone informed his lab of serious, embarrassing flaws in one of his papers, but apparently that inquiry was handled by one of his postdocs. So maybe the postdoc never informed Wansink of the problem, or maybe Wansink just thought this was a one-off in his lab, somebody else’s problem, and ignored it.)

When we look at things from the perspective of Wansink receiving nothing but acclaim for so many years and from so many sources (from students and postdocs in his lab, students in his classes, the administration of Cornell University, the U.S. government, news media around the world, etc., not to mention the continuing flow of accepted papers in peer-reviewed journals), the situation becomes more clear. It would be a big jump for him to accept that this is all a house of cards, that there’s no there there, etc.

Here’s an example of how this framing can help our understanding:

Someone emailed this question to me regarding that original “failed study” that got the whole ball rolling:

I’m still sort of surprised that they weren’t able to p-hack the original hypothesis, which was presumably some correlate with the price paid (either perceived quality, or amount eaten, or time spent eating, or # trips to the bathroom, or …).

My response:

I suspect the answer is that Wansink was not “p-hacking” or trying to game the system. My guess is that he’s legitimately using these studies to inform his thinking–that is, he forms many of his hypotheses and conclusions based on his data. So when he was expecting to see X, but he didn’t see X, he learned something! (Or thought he learned something; given the noise level in his experiments, it might be that his original hypothesis happened to be true, irony of ironies.) Sure, if he’d seen X at p=0.06, I expect he would’ve been able to find a way to get statistical significance, but when X didn’t show up at all, he saw it as a failed study. So, from Wansink’s point of view, the later work by the student really did have value in that they learned something new from their data.

I really don’t like the “p-hacking” frame because it “gamifies” the process in a way that I don’t think is always appropriate. I prefer the “forking paths” analogy: Wansink and his students went down one path that led nowhere, then they tried other paths.

2. People keep pointing me to a recent statement by Daniel Kahneman in a comment on a blog by Ulrich Schimmack, Moritz Heene, and Kamini Kesavan, who wrote that the “priming research” of Bargh and others that was featured in Kahneman’s book “is a train wreck” and should not be considered “as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.” Here’s Kahneman:

I accept the basic conclusions of this blog. To be clear, I do so (1) without expressing an opinion about the statistical techniques it employed and (2) without stating an opinion about the validity and replicability of the individual studies I cited.

What the blog gets absolutely right is that I placed too much faith in underpowered studies. As pointed out in the blog, and earlier by Andrew Gelman, there is a special irony in my mistake because the first paper that Amos Tversky and I published was about the belief in the “law of small numbers,” which allows researchers to trust the results of underpowered studies with unreasonably small samples. We also cited Overall (1969) for showing “that the prevalence of studies deficient in statistical power is not only wasteful but actually pernicious: it results in a large proportion of invalid rejections of the null hypothesis among published results.” Our article was written in 1969 and published in 1971, but I failed to internalize its message.

My position when I wrote “Thinking, Fast and Slow” was that if a large body of evidence published in reputable journals supports an initially implausible conclusion, then scientific norms require us to believe that conclusion. Implausibility is not sufficient to justify disbelief, and belief in well-supported scientific conclusions is not optional. This position still seems reasonable to me – it is why I think people should believe in climate change. But the argument only holds when all relevant results are published.

I knew, of course, that the results of priming studies were based on small samples, that the effect sizes were perhaps implausibly large, and that no single study was conclusive on its own. What impressed me was the unanimity and coherence of the results reported by many laboratories. I concluded that priming effects are easy for skilled experimenters to induce, and that they are robust. However, I now understand that my reasoning was flawed and that I should have known better. Unanimity of underpowered studies provides compelling evidence for the existence of a severe file-drawer problem (and/or p-hacking). The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for a broad hypotheses: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions. A case can therefore be made for priming on this indirect evidence. But I have changed my views about the size of behavioral priming effects – they cannot be as large and as robust as my chapter suggested.

I am still attached to every study that I cited, and have not unbelieved them, to use Daniel Gilbert’s phrase. I would be happy to see each of them replicated in a large sample. The lesson I have learned, however, is that authors who review a field should be wary of using memorable results of underpowered studies as evidence for their claims.
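Kahneman’s “inescapable argument” is easy to quantify with a back-of-the-envelope calculation (my numbers, not his):

# If each priming study had, say, 50% power (generous for samples that small),
# the chance that 20 independent studies would all come out statistically significant is:
power <- 0.5
n_studies <- 20
power^n_studies    # about 1 in a million

So a literature of uniformly significant underpowered studies is, by itself, evidence of a file drawer, p-hacking, or both.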

Following up on Kahneman’s remarks, neuroscientist Jeff Bowers added:

There is another reason to be sceptical of many of the social priming studies. You [Kahneman] wrote:

I still believe that actions can be primed, sometimes even by stimuli of which the person is unaware. There is adequate evidence for all the building blocks: semantic priming, significant processing of stimuli that are not consciously perceived, and ideo-motor activation. I see no reason to draw a sharp line between the priming of thoughts and the priming of actions.

However, there is an important constraint on subliminal priming that needs to be taken into account. That is, such effects are very short-lived, on the order of seconds. So any claim that a masked prime affects behavior for an extended period of time seems at odds with these more basic findings. Perhaps social priming is more powerful than basic cognitive findings, but it does raise questions. Here is a link to an old paper showing that masked *repetition* priming is short-lived. Presumably semantic effects will be even more transient.

And psychologist Hal Pashler followed up:

One might ask if this is something about repetition priming, but associative semantic priming is also fleeting. In our JEP:G paper failing to replicate money priming we noted:

For example, Becker, Moscovitch, Behrmann, and Joordens (1997) found that lexical decision priming effects disappeared if the prime and target were separated by more than 15 seconds, and similar findings were reported by Meyer, Schvaneveldt, and Ruddy (1972). In brief, classic priming effects are small and transient even if the prime and measure are strongly associated (e.g., NURSE-DOCTOR), whereas money priming effects are [purportedly] large and relatively long-lasting even when the prime and measure are seemingly unrelated (e.g., a sentence related to money and the desire to be alone).

Kahneman’s statement is stunning because it seems so difficult for people to admit their mistakes, and in this case he’s not just saying he got the specifics wrong, he’s pointing to a systematic error in his ways of thinking.

You don’t have to be Thomas Kuhn to know that you can learn more from failure than success, and that a key way forward is to push push push to understand anomalies. Not to sweep them under the rug but to face them head-on.

3. Now return to Wansink. He’s in a tough situation. His career is based on publicity, and now he has bad publicity. And there’s no easy solution for him: once he starts to recognize problems with his research methods, the whole edifice collapses. Similarly for Baumeister, Bargh, Cuddy, etc. The cost of admitting error is so high that they’ll go to great lengths to avoid facing the problems in their research.

It’s easier for Kahneman to admit his errors because, yes, this does suggest that some of the ideas behind “heuristics and biases” or “behavioral economics” have been overextended (yes, I’m looking at you, claims of voting and political attitudes being swayed by shark attacks, college football, and subliminal smiley faces), but his core work with Tversky is not threatened. Similarly, I can make no-excuses corrections of my paper that was wrong because of our data coding error, and my other paper with the false theorem.

P.S. Hey! I just realized that the above examples illustrate two of Clarke’s three laws.

Vine regression?

Jeremy Neufeld writes:

I’m an undergraduate student at the University of Maryland and I was recently referred to this paper (Vine Regression, by Roger Cooke, Harry Joe, and Bo Chang), along with an accompanying summary blog post by the main author, as potentially useful in policy analysis. With the big claims it makes, I am not sure if it passes the sniff test. Do you know anything about vine regression? How would it avoid overfitting?

My reply: Hey, as a former University of Maryland student myself I’ll definitely respond! I looked at the paper, and it seems to be presenting a class of multivariate models, a method for fitting the models to data, and some summaries. The model itself appears to be a mixture of multivariate normals of different dimensions, fit to the covariance matrix of a rank transformation of the raw data—I think they’re ranking each variable on its marginal distribution, but I’m not completely sure, and I’m not quite sure how they deal with discreteness in the data. Then somehow they’re transforming back to the original space of the data; maybe they do some interpolation to get continuous values, and I’m also not quite sure what happens when they extrapolate beyond the range of the original ranks.

The interesting part of the model is the mixture of submodels of different dimensions. I’m generally suspicious of such approaches, as continuous smoothing is more to my taste. That said, the usual multivariate models we fit are so oversimplified that I could well imagine that this mixture model could do well. So I’m supportive of the approach. I think maybe they could fit their model in Stan—if so, that would probably make the computation less of a hassle for them.
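Here’s my guess at what that rank step looks like in practice—just a sketch, since, as I said, I’m not completely sure this is what they do. The data are fake:

# Transform each variable to normal scores via its empirical marginal distribution,
# then model the dependence structure on that transformed scale.
to_normal_scores <- function(x) qnorm(rank(x, ties.method = "average") / (length(x) + 1))

set.seed(2)
raw <- data.frame(iq  = rnorm(500, 100, 15),
                  bfw = rpois(500, 10),           # breast-feeding weeks: discrete, so ties matter
                  inc = rlnorm(500, 10, 0.5))
z <- as.data.frame(lapply(raw, to_normal_scores))
round(cor(z), 2)    # the correlation structure the copula/vine machinery would work with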

The one thing I really don’t understand at all in this paper is their treatment of causal inference. The model is entirely associational—that’s fine, I love descriptive data analysis!—and they’re fitting a multivariate model to some observational data. But then in section 3.1 of their paper they use explicit causal language: “the effect of breast feeding on IQ . . . If we change the BFW for an individual, how might that affect the individual’s IQ?” The funny thing is, right after that they again remind the reader that this is just descriptive statistics (“we integrate the scaled difference of two regression functions which differ only in that one has weeks more breast feeding than the other”), but then they snap right back to the causal language. So that part just baffles me. They have a complicated, flexible tool for data description, but for some reason they then seem to make the tyro mistake of giving a causal interpretation to regression coefficients fit to observational data. That’s not really so important, though; I think you can ignore the causal statements and the method could still be useful. It seems worth trying out.

Krzysztof Sakrejda speaks in NYC on Bayesian hierarchical survival-type model for Dengue infection

Daniel writes:

Krzysztof Sakrejda is giving a cool talk next Tues 5:30-7pm downtown on a survival model for Dengue infection using Stan. If you’re interested, please register asap. Google is asking for the names for security by tomorrow morning.

Workshop on German national educational panel study

Jutta von Maurice of the Leibniz Institute for Educational Trajectories in Germany writes:

In August this year, we plan to hold a user workshop in New York. We have data on educational processes and competence development from early childhood to late adulthood (n = 60,000), and these data might be of special interest for international comparisons. Within the planned workshop we will focus on social disparities (the workshop is scheduled between the RC28 and AERA meetings).

The workshop is free and will be held at Columbia University on 11 Aug 2017. Here’s the official announcement, and here’s their link.

Combining results from multiply imputed datasets

Aaron Haslam writes:

I have a question regarding combining the estimates from multiply imputed datasets. In the third edition of BDA, at the top of page 452, you mention that with Bayesian analyses all you have to do is mix together the simulations. I want to clarify: does this mean you simply combine the posteriors from the MCMC runs on the different datasets? For instance, with a current study I am working on, I have 5 imputed datasets with missing outcome data imputed. I would generate individual posteriors for each of these datasets, then mix them together to obtain a combined posterior, and then calculate the summary statistics on this combined posterior.

I replied that yes, that is what I would do. But then I thought I’d post here in case anyone has other thoughts on the matter.
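For concreteness, here’s a minimal sketch of the workflow Haslam describes, using rstanarm. The model, variable names, and fake “imputed” datasets are all hypothetical:

library(rstanarm)

# Stand-ins for the 5 completed (imputed) datasets
set.seed(3)
imputed_list <- replicate(5, {
  x <- rnorm(200)
  data.frame(x = x, y = 1 + 2 * x + rnorm(200))
}, simplify = FALSE)

# Fit the same model to each completed dataset
fits <- lapply(imputed_list, function(d)
  stan_glm(y ~ x, data = d, family = gaussian(), refresh = 0))

# "Mix the simulations": stack the posterior draws, then summarize the stacked draws
draws <- do.call(rbind, lapply(fits, as.matrix))
t(apply(draws, 2, function(v) c(mean = mean(v), quantile(v, c(0.025, 0.975)))))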

Cry of Alarm

Stan Liebowitz writes:

Is it possible to respond to a paper that you are not allowed to discuss?

The question above relates to some unusual behavior from a journal editor. As background, I [Liebowitz] have been engaged in a long-running dispute regarding the analysis contained in an influential paper published in one of the top three economics journals, the Journal of Political Economy, in 2007. That paper was written by Harvard’s Felix Oberholzer-Gee and Kansas’s Koleman Strumpf (OS) and appeared to demonstrate that piracy did not reduce record sales. I have been suspicious of that paper for many reasons. Partly because the authors publicly claimed that they would make their download data public, but never did, and four years later they told reporters they had signed a non-disclosure agreement (while refusing to provide the agreement to those reporters). Partly because Oberholzer-Gee is a coauthor with (the admitted self-plagiarist) Bruno Frey of two virtually identical papers (one in the JPE and one in the AER) that do not cite one another. But mostly because OS have made claims that they either knew were false, or should have known were false.

Although I have been critical of OS (2007) since its publication, it was not until September of 2016 that I published a critique in one of the few economics journals willing to publish comments and replications, Econ Journal Watch (EJW). [I also have a replication of a portion of their paper not reliant on their download data, which is currently under review at a different journal.] The editors of EJW invited Oberholzer-Gee and Strumpf to submit a response to my critique, to be published concurrently with my critique, but OS instead published their defense in a different journal, Information Economics and Policy (IEP, an Elsevier journal behind a paywall).

OS’s choice of IEP was not surprising. Among other factors, the editor of the journal, Lisa George, was a student of Oberholzer-Gee (he served on her dissertation committee), had coauthored two papers with him, and listed him as one of four references on her CV. IEP clearly fast-tracked the OS paper—it was first submitted to the journal on October 13, and the final draft, dated October 26, thanked three referees and the editor. The paper was published in December, although it often takes over a year from submission to publication in IEP.[1]

I had spent years attempting to get OS to publicly answer questions about their paper, so I was delighted that OS finally publicly defended their paper. Their published defense still left many questions unanswered, however, such as why the reported mean value of their key instrument was four times as large as its true value, but at least OS were now on the record, trying to explain some of their questionable data and results.

As a critic of their work, I took their published defense as a vindication of my concerns. Although their defense was superficially plausible, and was voiced in a confident tone, it was chock full of errors. For example, in EJW I had noted that OS’s data on piracy, which was the main novelty of their analysis, exhibited unusual temporal variability. I knew that OS might claim that this variability was a byproduct from a process of matching their raw piracy data to data on album sales, so I measured the variability of their raw piracy data prior to the matching process, and included a paragraph in EJW explicitly noting that fact. Yet in IEP, OS mischaracterized my analysis and claimed that the surprisingly large temporal variability was due to the matching process. Not only was their claim about my analysis misleading, but their assertion that the matching process could have materially influenced the variability of their data was also incorrect, which was clearly revealed by visual inspection of the data and a correlation of 0.97 between the matched and unmatched series. The icing on the cake was their attempt to demonstrate the validity of their temporal data by claiming a +0.49 correlation of their weekly data with another data set they considered to be unusually reliable. In fact, the correct correlation between those data sets was ‑0.68 (my rejoinder provides the calculations, raw data, and copies of the web pages from which the data were taken). All these errors were found in just the first section of their paper, with later sections continuing in the same vein.

After I became aware of the OS paper in IEP, I contacted the IEP editor and complained that I had not been extended the courtesy of defending my article against their criticisms. Professor George seemed to understand that fair play would require at least the belated pretense of allowing me to provide a rejoinder:

I welcome a submission from you responding to the Oberholzer – Strumpf paper and indeed intended to contact you about this myself in the coming weeks.

She also seemed to be trying to inflate the impact factor of her journal:

As you might be aware, IEP contributors and readers have rather deep expertise in the area of piracy. I would thus [ask] that in your response you take care to cite relevant publications from the journal. I have found that taking care with the literature review makes the referee process proceed more smoothly.

The errors made by OS in IEP seemed so severe that I thought it likely that IEP would try to delay or reject my submission, both to protect OS and to protect the reputation of IEP’s refereeing process. Still, I had trouble envisioning the reasons IEP might give if it decided to reject my paper. I decided, therefore, to submit my rejoinder to IEP; but to avoid a decision dragging on for months or years, I emphatically told Professor George that I expected a quick decision, and that I planned to withdraw the submission if I hadn’t heard within two months.

Wondering what grounds IEP might use to reject my paper indicated an apparent lack of imagination on my part. Although the referees did not find any errors in my paper, the editor told me that she was no longer interested in “continued debate on this one paper [OS, 2007]” and that such debate was “not helpful to researchers actively working in this area, or to IEP readers.” Apparently one side of the debate was useful to her readers in December, when she published the OS article, but that utility had presumably evaporated by January when it came to presenting the other side of the debate.

Since Professor George was supposedly planning to “invite” me to respond to OS’s article, she apparently feels the need to keep up that charade, and does so by redefining the meaning of the word “response.” She stated: “I want to emphasize that in rejecting your submission I did not shut the door on a response…IEP would welcome a new submission from you on the topic of piracy that introduces new evidence or connects existing research in novel ways.”

Apparently, I can provide a “response,” but I am not allowed to discuss the paper to which I am supposedly responding. That appears to be a rather Orwellian request.

I have complained to Elsevier about the incestuous and biased editorial process apparently afflicting IEP. We will see what comes of it. The bigger issue is the quality of the original OS article, the validity of which seems even more questionable than before, given the authors’ apparent inability to defend their analysis. This story is not yet over.

[1] The other papers in that issue were first received March 2014, September 2015, February 2015, November 2015, and April 2016.

Wow. We earlier heard from Stan Liebowitz on economics and ethics here and here. The above story sounds particularly horrible, but of course we’re only hearing one side of it here. So if any of the others involved in this episode (I guess that would be Oberholzer, Strumpf, or George) have anything to add, they should feel free to do so in the comments, or they could contact me directly.

P.S. I hope everyone’s liking the new blog titles. I’ve so far used the first five on the list. They work pretty well.

How important is gerrymandering? and How to most effectively use one’s political energy?

Andy Stein writes:

I think a lot of people (me included) would be interested to read an updated blog post from you on gerrymandering, even if your conclusions haven’t changed at all from your 2009 blog post [see also here]. Lots of people are talking about it now and Obama seems like he’ll be working on it this year and there’s a Tufts summer school course where they’re training quantitative PhDs to be expert witnesses. Initially, I thought it would be fun to attend, but as best I can tell from the limited reading I’ve done, it doesn’t seem like gerrymandering itself has that big of an effect. It seems to be that because Democrats like cities, even compact districts favor Republicans.

I’d also be curious to read a post from you on the most effective ways to use one’s political energy for something productive. The thing I’m trying to learn more about now is how I can help work on improving our criminal justice system at the state level, since state politics seem more manageable and less tribal than national politics.

Here’s what I wrote in 2009:

Declining competitiveness in U.S. House elections cannot be explained by gerrymandering. I’m not saying that “gerrymandering” is a good thing—I’d prefer bipartisan redistricting or some sort of impartial system—but the data do not support the idea that redistricting is some sort of incumbent protection plan or exacerbator of partisan division.

In addition, political scientists have frequently noted that Democrats and Republicans have become increasingly polarized in the Senate as well as in the House, even though Senate seats are not redistricted.

And here’s how Alan Abramowitz, Brad Alexander, and Matthew Gunning put it:

The increasing correlation among district partisanship, incumbency, and campaign spending means that the effects of these three variables tend to reinforce each other to a greater extent than in the past. The result is a pattern of reinforcing advantages that leads to extraordinarily uncompetitive elections.

I added:

I’m not saying that gerrymandering is always benign; there are certainly some places where it has been used to make districts with unnecessarily high partisan concentrations. But, in aggregate, that’s not what has happened, at least according to our research.

But that was then, etc., so it’s reasonable for Stein to ask what’s happened in the eight years since. The short answer is that I’ve not studied the problem. I’ve read some newspaper articles suggesting that a few states have major gerrymanders in the Republican party’s favor, but that’s no substitute for a systematic analysis along the lines of our 1994 paper. My guess (again, without looking at the data) is that gerrymandering in some states is currently giving a few seats to the Republicans in the House of Representatives but that it does not explain the larger pattern of polarization in Congress that we’ve seen in the past few years with party-line or near-party-line votes on health care policy, confirmations for cabinet nominees, etc.

That said, the redistricting system in the United States is inherently partisan, so it’s probably a good idea for activists to get involved on both sides so that the fight in every state is balanced.

Regarding your other question, on effective ways to use one’s political energy for something productive: I have no idea. Working on particular legislative battles can have some effect, and direct personal contact is supposed to make a difference: I guess that can involve directly talking with voters or political activists, or getting involved in activities and organizations that involve people talking with each other about politics. The other big thing is legislative primary election campaigns. I think that most primary elections are not seriously contested, and primaries can sometimes seem like a sideshow—but powerful incumbent politicians typically started off their careers by winning primary elections. So your primary campaign today could determine the political leaders of the future. And there’s also the indirect effect of influencing incumbent legislators who don’t want to lose in the primary.

All this counsel could apply to activists anywhere on the political spectrum. That said, I’d like to think of this as positive-sum advice in that (a) I hope that if activists on both sides are involved in redistricting, this will help keep the entire system fair, and (b) my advice regarding political participation should, if applied to both sides, keep politicians more responsive to the voters, which I think would be a net gain, even when some of these voters hold positions with which I disagree.

Lasso regression etc in Stan

Someone on the users list asked about lasso regression in Stan, and Ben replied:

In the rstanarm package we have stan_lm(), which is sort of like ridge regression, and stan_glm() with family = gaussian and prior = laplace() or prior = lasso(). The latter estimates the shrinkage as a hyperparameter while the former fixes it to a specified value. Again, there are possible differences in scaling but you should get good predictions. Also, there is prior = hs() or prior = hs_plus() that implement hierarchical shrinkage on the coefficients.

We discussed horseshoe in Stan awhile ago, and there’s more to be said on this topic, including the idea of postprocessing the posterior inferences if there’s a desire to pull some coefficients all the way to zero. And informative priors on the scaling parameters: yes, these hyperparameters can be estimated from data alone, but such estimates can be unstable, and some prior information should be helpful. What we really need are a bunch of examples applying these models to real problems.
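For reference, here are bare-bones versions of the calls Ben describes; the outcome, predictors, and data are made up just so the code runs:

library(rstanarm)

# Fake data: 10 predictors, 2 of which matter
set.seed(4)
dat <- as.data.frame(matrix(rnorm(200 * 10), 200, 10))
names(dat) <- paste0("x", 1:10)
dat$y <- 1 + 0.5 * dat$x1 - 0.5 * dat$x2 + rnorm(200)

fit_lasso <- stan_glm(y ~ ., data = dat, family = gaussian(),
                      prior = lasso(), refresh = 0)      # shrinkage scale estimated as a hyperparameter
fit_laplace <- stan_glm(y ~ ., data = dat, family = gaussian(),
                        prior = laplace(), refresh = 0)  # double-exponential prior with fixed scale
fit_hs <- stan_glm(y ~ ., data = dat, family = gaussian(),
                   prior = hs(), refresh = 0)            # hierarchical shrinkage (horseshoe)

None of these will pull coefficients exactly to zero; if you want sparsity you’d postprocess the posterior, as discussed above.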

Identifying Neighborhood Effects

Dionissi Aliprantis writes:

I have just published a paper (online here) on what we can learn about neighborhood effects from the results of the Moving to Opportunity housing mobility experiment. I wanted to suggest the paper (and/or the experiment more broadly) as a topic for your blog, as I am hoping the paper can start some constructive conversations.

The article is called “Assessing the evidence on neighborhood effects from Moving to Opportunity,” and here’s the abstract:

The Moving to Opportunity (MTO) experiment randomly assigned housing vouchers that could be used in low-poverty neighborhoods. Consistent with the literature, I [Aliprantis] find that receiving an MTO voucher had no effect on outcomes like earnings, employment, and test scores. However, after studying the assumptions identifying neighborhood effects with MTO data, this paper reaches a very different interpretation of these results than found in the literature. I first specify a model in which the absence of effects from the MTO program implies an absence of neighborhood effects. I present theory and evidence against two key assumptions of this model: that poverty is the only determinant of neighborhood quality and that outcomes only change across one threshold of neighborhood quality. I then show that in a more realistic model of neighborhood effects that relaxes these assumptions, the absence of effects from the MTO program is perfectly compatible with the presence of neighborhood effects. This analysis illustrates why the implicit identification strategies used in the literature on MTO can be misleading.

I haven’t had a chance to read the paper, but I can share this horrible graph:

And Figure 4 is even worse!

But don’t judge a paper by its graphs. There could well be interesting stuff here, so feel free to discuss.

Crossfire

The Kangaroo with a feather effect

OK, guess the year of this quote:

Experimental social psychology today seems dominated by values that suggest the following slogan: “Social psychology ought to be and is a lot of fun.” The fun comes not from the learning, but from the doing. Clever experimentation on exotic topics with a zany manipulation seems to be the guaranteed formula for success which, in turn, appears to be defined as being able to effect a tour de force. One sometimes gets the impression that an ever-growing coterie of social psychologists is playing (largely for one another’s benefit) a game of “can you top this?” Whoever can conduct the most contrived, flamboyant, and mirth-producing experiments receives the highest score on the kudometer. There is, in short, a distinctly exhibitionistic flavor to much current experimentation, while the experimenters themselves often seem to equate notoriety with achievement.

It’s from Kenneth Ring, Journal of Experimental Social Psychology, 1967.

Except for the somewhat old-fashioned words (“zany,” “mirth”), the old-fashioned neologism (“kudometer”) and the lack of any reference to himmicanes, power pose, or “cute-o-nomics,” the above paragraph could’ve been written yesterday, or five years ago, or any time during the career of Paul Meehl.

Or, as authority figures Susan Fiske, Daniel Schacter, and Shelley Taylor would say, “Every few decades, critics declare a crisis, point out problems, and sometimes motivate solutions.”

I learned about the above Kenneth Ring quote from this recent post by Richard Morey who goes positively medieval on the recently retracted paper by psychology professor Will Hart, a case that was particularly ridiculous because it seems that the analysis in that paper was faked by the student who collected the data . . . but was not listed as a coauthor or even thanked in the paper’s acknowledgments!

In his post, Morey describes how bad this article was, as science, even if all the data had been reported correctly. In particular, he described how the hypothesized effect sizes were much larger than could make sense based on common-sense reasoning, and how the measurements are too noisy to possibly detect reasonable-sized effects. These are problems we see over and over again; they’re central to the Type M and Type S error epidemic and the “What does not kill my statistical significance makes it stronger” fallacy. I feel kinda bad that Morey has to use, as an example, a retracted paper by a young scholar who probably doesn’t know any better . . . but I don’t feel so bad. The public record is the public record. If the author of that paper was willing to publish his paper, he should be willing to let it be criticized. Indeed, from the standpoint of the scientist (not the careerist), getting your papers criticized by complete strangers is one of the big benefits of publication. I’ve often found it difficult to get anyone to read my draft articles, and it’s a real privilege to get people like Richard Morey to notice your work and take the trouble to point out its fatal flaws.
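For readers who haven’t seen the Type M and Type S calculation, here’s a cut-down version of the “retrodesign” idea from my paper with John Carlin. The effect size and standard error below are made up; they’re not taken from the Hart paper:

# Given an assumed true effect A and standard error s, what do the
# statistically significant estimates look like?
retrodesign <- function(A, s, alpha = 0.05, n_sims = 1e5) {
  z <- qnorm(1 - alpha / 2)
  power  <- pnorm(A / s - z) + pnorm(-A / s - z)
  type_s <- pnorm(-A / s - z) / power        # P(wrong sign | statistically significant)
  est <- rnorm(n_sims, A, s)                 # hypothetical replications
  sig <- abs(est) > z * s
  type_m <- mean(abs(est[sig])) / A          # expected exaggeration factor
  c(power = power, type_s = type_s, type_m = type_m)
}

retrodesign(A = 2, s = 8)    # a small true effect measured very noisily

With numbers like these, the rare “significant” estimate has to be many times larger than the true effect, and it has a nontrivial chance of being in the wrong direction. That’s the trap this sort of noisy study falls into, faked data or no.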

Oh, and by the way, Morey did not find these flaws in response to that well-publicized retraction. The story actually happened in the opposite order. Here’s Morey:

When I got done reading the paper, I immediately requested the data from the author. When I heard nothing, I escalated it within the University of Alabama. After many, many months with no useful response (“We’ll get back to you!”), I sent a report to Steve Lindsay at Psychological Science, who, to his credit, acted quickly and requested the data himself. The University then told him that they were going to retract the paper…and we never even had to say why we were asking for the data in the first place. . . .

The basic problem here is not the results, but the basic implausibility of the methods combined with the results. Presumably, the graduate student did not force Hart to measure memory using four lexical decision trials per condition. If someone claims to have hit a bullseye from 500m in hurricane-force winds with a pea-shooter, and then claims years later that a previously-unmentioned assistant faked the bullseye, you’ve got a right to look at them askance.

At this point I’d like to say that Hart’s paper should never have been accepted for publication in the first place—but that misses the point, as everything will get published, if you just keep submitting it to journal after journal. If you can’t get it into Nature, go for Plos-One, and if they turn you down, there’s always Psychological Science or JPSP (but that’ll probably only work if you’re (a) already famous and (b) write something on ESP).

The real problem is that this sort of work is standard operating practice in the field of psychology, no better and no worse (except for the faked data) than the papers on himmicanes, air rage, etc., endorsed by the prestigious National Academy of Sciences. As long as this stuff is taken seriously, it’s still a crisis, folks.

Stan and BDA on actuarial syllabus!

Avi Adler writes:

I am pleased to let you know that the Casualty Actuarial Society has announced two new exams and released their initial syllabi yesterday. Specifically, 50%–70% of the Modern Actuarial Statistics II exam covers Bayesian Analysis and Markov Chain Monte Carlo. The official text we will be using is BDA3 and while we are not mandating the use of a software package, we are strongly recommending Stan:

The first MAS II exam will be given in 2018 and things may change, but now that the information has been released I hope you find this of use as a response to your requests for “Stan usage in the wild.”

The insurance industry isn’t the wildest thing out there. Still, I’m happy to hear this news.