Per Pettersson-Lidbom from the Department of Economics at Stockholm University writes:
I have followed your discussions about replication, criticism, and the self-correcting process of science. I would like to share some sad stories from economics related to these issues. They concern three papers published in highly respected journals: the study by Dahlberg, Edmark and Lundqvist (2012, henceforth DEL) published in the Journal of Political Economy, the study by Lundqvist, Dahlberg and Mörk (2014, henceforth LDM) published in American Economic Journal: Economic Policy, and the study by Aidt and Shvets (2012, henceforth AS), also published in AEJ: Economic Policy. I decided to write comments on all three papers after I discovered that they all have serious flaws. Here are my stories (I will try to keep them as short as possible).
Starting with DEL’s analysis of whether there exists a causal relationship between ethnic diversity and preferences for redistribution, we (myself and Lena Nekby) discovered three significant problems with their statistical analysis: (i) an unreliable and potentially invalid measure of preferences for redistribution, (ii) an endogenously selected sample, and (iii) a mismeasurement of the instrumental variable (the refugee placement policy). We made DEL aware of some of these problems before they resubmitted their paper to JPE, but they did not pay any attention to our critique. We therefore decided to write a comment to JPE (we had to collect all the raw data ourselves since DEL refused to share theirs). When we re-analyzed the data, we found that correcting for any one of these three problems reveals no evidence of any relationship between ethnic diversity and preferences for redistribution. However, JPE desk-rejected our paper twice without sending it to referees (the first time by the same editor who had handled DEL, the second time by another editor after the original editor had stepped down). We then submitted our paper to six other respected economics journals, but it was always rejected, typically without being sent to referees. Nonetheless, most of the editors agreed with our critique but said that it was JPE’s responsibility to publish it. Eventually, the Scandinavian Journal of Economics decided to publish our paper.
The second example is AS, which studies the effect of electoral incentives on the allocation of public services across U.S. legislative districts. I realized that they have three serious problems in their difference-in-differences design: (i) serial correlation in the errors, (ii) functional-form issues, and (iii) omitted time-invariant factors at the district level, since AS do not control for district fixed effects. When I reanalyzed their data (posted on the journal’s website), I found that correcting for any one of these three problems reveals no evidence of any relationship. I submitted my comment to AEJ: Policy long before the paper was published, but I was told by the editor that they do not publish comments. Instead, I was told to post a comment on their website, and that is what I did (see https://www.aeaweb.org/articles.php?doi=10.1257/pol.4.3.1).
The third example is LDM, which uses a type of regression-discontinuity design (a kink design) to estimate causal effects of intergovernmental grants on local public employment. I discovered that their results depend on (i) extremely large bandwidths and (ii) mis-specified functional forms of the forcing variable, since they omit interactions in the second- and third-order polynomial specifications. I show that when correcting for either of these problems, there is no regression kink that can be used for identification. I again wrote to the editor of AEJ: Policy (another editor this time) long before the paper was published, making them aware of this problem, but I was once more told that AEJ: Policy does not publish comments. Again, I was told to post my comment on their website, and so I did (see https://www.aeaweb.org/articles.php?doi=10.1257/pol.6.1.167).
What bothers me most about my experience with replicating and checking the robustness of other people’s work is two things: (i) the reluctance of economics journals to publish comments on papers that are found to be completely and indisputably wrong (I don’t think posting a comment on a journal’s website is a satisfactory procedure. I am probably the only one stupid enough to do it!) and (ii) that researchers can get away with scientific fraud. The last point concerns my discovery that both DEL and LDM (the two papers have two authors in common) intentionally misreport their results. For example, DEL analyze at least nine outcomes but report only the three that confirm their hypothesis. Had they reported the other results, it would have been clear that there is no relationship between ethnic diversity and attitudes toward redistribution. DEL also make a number of sample restrictions, often unreported, which reduce the number of observations from 9,620 to 3,834, thereby creating a huge sample-selection problem. Again, had they reported the results from the full sample, it would have been very clear that there is no relationship. DEL also misreport the definition of their instrumental variable, even though previous work has used exactly the same variable with the correct definition. Had they reported the correct definition, it would have been obvious that their instrument is a poor one, since it does not measure what it is purported to measure. Turning to LDM, there are four estimates in their Table 2 (which shows the first-stage relationship) that have been left out intentionally. Had they reported these four estimates, it would have been very clear that the first-stage relationship is not robust, since the sign of the estimate switches from positive (about 3) to negative (about -3).
Moreover, had they reported smaller bandwidths (for example, a data-driven optimal RD bandwidth), it would also have been clear that there is no first-stage relationship, since for smaller bandwidths almost all the estimates are negative. And had they reported the correct polynomial functions, it would likewise have been very clear that the first-stage estimate is not robust.
So the bottom line of all this is that “the self-correcting process of science” does not work very well in economics. I wonder if you have any suggestions for how I should handle this type of problem, since you have had similar experiences.
I don’t have the energy to look into the above cases in detail.
But, stepping back and thinking about these issues more generally, I do think there’s an unfortunate “incumbency advantage” by which published papers with “p less than .05” are taken as true unless a large effort is amassed to take them down. Criticisms are often held to a much higher standard than was applied in reviewing the original paper, and, as noted above, many journals don’t publish letters at all. Other problems include various forms of fraud (as alleged above) and a more general reluctance of authors even to admit honest mistakes (as in the defensive reaction of Case and Deaton to our relatively minor technical corrections to their death-rate-trends paper).
Hence, I’m sharing Pettersson-Lidbom’s stories, neither endorsing nor disputing their particulars, but as an example of how criticisms in scholarly research just hang in the air, unresolved. Scientific journals are set up to promote discoveries, not to handle corrections.
In journals, it’s all about the wedding, never about the marriage.