Dispute about ethics of data sharing

Several months ago, Sam Behseta, the new editor of Chance magazine, asked me if I’d like to have a column. I said yes, I’d like to write on ethics and statistics. My first column was called “Open Data and Open Methods” and I discussed the ethical obligation to share data and make our computations transparent wherever possible. In my column, I recounted a story from a bit over 20 years ago when I noticed a problem in a published analysis (involving electromagnetic fields and calcium flow in chicken brains) and contacted the researcher in charge of the study, who would not share his data with me.

Two of the people from that research team—biologist Carl Blackman and statistician Dennis House—saw my Chance column and felt that I had misrepresented the situation and had criticized them unfairly.

Blackman and House expressed their concerns in letters to the editor which were just published, along with my reply, in the latest issue of Chance.

Seeing as I posted my article here, I thought it only appropriate to post the letters. Here they are. I encourage all of you who are interested in ethics and data sharing to take a look. As I wrote in my response, I appreciate the letters of Dr. Blackman and Mr. House and I hope that readers will benefit from seeing both their perspectives and mine—just as researchers in general can benefit from seeing multiple analyses of publicly shared data.

P.S. Please don’t put any criticisms of Blackman or House (or me!) in the comments. I appreciate that they put in the effort to respond, and my purpose in posting their letters here is to give a forum for their views. General comments about ethics and data sharing would be fine, but no need to focus on this particular case.

31 thoughts on “Dispute about ethics of data sharing”

  1. Blackman says in his reply, “The data in question had been collected in 1981–1983, nearly six years previously, and by 1984, we had tested and confirmed the fundamental hypothesis suggested from the original p-value plot, viz., that the local static magnetic field was involved in determining which frequencies could be biologically effective (Blackman et al., 1985). Thus, in 1988, I concluded that it was unlikely that further statistical examination of the data that was used to generate the original p-value plots would be profitable in advancing the understanding of that science.”

    This strikes me as exactly what you were commenting on. Blackman et al. did some research in 1981–1983, and in 1984 reached conclusions that were published in 1985. They were so sure of their conclusions that Blackman felt nothing could be gained by another pair of eyes on the data. Isn’t that exactly the issue: researchers being so sure (from their own research) that they keep data to themselves?

    Yes, there were time/effort tradeoffs, etc., but this “we’re so certain that you wouldn’t find anything that we won’t bother to put our data together for you” is a very real issue in several fields of study today.

  2. P.S. Please don’t put any criticisms of Blackman or House (or me!) in the comments. I appreciate that they put in the effort to respond, and my purpose in posting their letters here is to give a forum for their views. General comments about ethics and data sharing would be fine, but no need to focus on this particular case.

    We have come to discuss ethics, not Blackman, House or Gelman, because these three gentlemen are honourable men. Does that avoid said criticism?

  3. While academia/science is a community for furthering knowledge, it is still a business. People put in thousands of dollars and hours to obtain, clean, and analyze datasets. I don’t believe in free handouts. Discussion of analysis is fine and definitely encouraged, but the decision to release the dataset is up to the PI leading the construction of the dataset. The decision is not a matter of ethics; it’s simply a business decision.

    • Researcher:

      These people worked for the Environmental Protection Agency. The EPA is not a business. It is a government agency. And, no, I don’t think it’s up to the PI whether to release data from publicly funded research.

      • I was making “general comments about ethics and data sharing.” Would you consider public universities a business? Should datasets from projects funded by federal agencies (NSF, EPA, etc.) be freely available? A lot of research and researchers are funded by such sources and I can’t imagine them all handing out their dataset to folks with ideas of further analysis. I wouldn’t say they were “unethical” if they declined my request—perhaps unscholarly, but definitely not unethical.

        I agree sharing data promotes science, but I understand why some people would not want to share the fruits of their labor.

        • Researcher:

          I do not consider universities a business. With some exceptions, they are nonprofit organizations. In any case, yes, I believe that datasets from publicly funded projects should be freely available (after addressing confidentiality issues, which did not come up with an experiment on chicken brains).

        • Researcher:

          This whole discussion could be avoided if NSF, EPA, etc., made it an explicit funding requirement to publish all data online at the end of a study.

    • Maybe then this is about the rules the funding agencies should put down, requiring open access to data as the general rule, with only reasoned exceptions. Some time ago we visited (as tourists) the ESO (European Southern Observatory) Very Large Telescope at Paranal, south of Antofagasta, Chile. Four ten-meter diameter telescopes!

      There, the rules are clear: the lead researchers behind any project have exclusive access to the data for four months, then the data become public. But of course it is easier when the data are in the computer with standardized formats from time zero.

  4. I think it’s a mistake to make claims like “these people didn’t share their data because they’re afraid I might prove them wrong,” in this case or other cases, if the only evidence is that they have not shared their data.

    I am sometimes given data with the request that they not be distributed (this is the case with several datasets I am currently using, for example). I always say that I can’t promise to keep the data confidential because of the Freedom of Information Act but that I will not distribute them simply because someone requests them.

    Nowadays it is probably rare for someone to be unable to send a dataset that they’ve worked with in the past few years: it’s right there on your hard drive, just hit send. But in the 1980s and well into the 90s this could be an issue, and back then it might require a couple of hours to find the raw data and documentation.

    And “code rot” and poor memory can still cause problems: many people are very sloppy and will, for instance, edit “raw data” files to fix anomalies or deal with outliers or whatever. An experimenter gives an Excel file to someone, that person finds some funny things in the data and fixes them, and later performs an analysis (or gives the file to a colleague for the analysis)…a few years later, if asked for the original data, the person who published the results may not remember, or may never have known, that they weren’t in fact working with the original data.

    I could go on. There are several reasons people didn’t, and still don’t, share their data. Not all represent ethical lapses. (If I assure someone that I will -try- to keep their data confidential, and on that basis they give it to me, and I publish something that uses the data, am I being ethical if I share the data upon request, or if I don’t, or was I unethical to accept the data in the first place?)

    One could argue, and in fact I would argue, that it is unethical to use ‘it would be too hard to compile the data’ as an excuse for failing to share data, at least within 7 years or so of publishing a data analysis. Upon completing a project, researchers should take the time to put all of the data and necessary documentation together in a safe place — in the modern era this is a computer file that gets backed up — and should share the information upon request unless there’s a confidentiality agreement or other good reason not to. But failing to keep a good computer record, or relying on fallible memory to remember which of several files represents the “original data” when asked years later, are venial sins. Most of us are not free of them.

      • Umm…you said you don’t want to discuss this particular case, so I’m a bit at a loss for what to say. You said in your original piece that the reason they didn’t want to share their data was that they were afraid you would contradict them, and one of the main things I’m saying is that it is a mistake to think that that is why someone is refusing to share their data (in the absence of other evidence). So you can’t agree with me and with your own article.

        • Phil:

          What I’m agreeing with is your statement that there are many reasons people don’t share their data. As I wrote, I believe that part of the reason Blackman didn’t share his data with me is that it would’ve been effort to get the data together. Nonetheless, I think he should’ve shared. Yes, it would’ve taken effort, but if the study was worth doing in the first place, I think it would be worth sharing the data. Especially with someone such as myself who had presented good reasons why alternative analyses would make sense. The researchers were mistaken to be so confident that their analysis couldn’t be improved upon.

        • What I wonder is, couldn’t Andrew’s first column have made its points equally well without mentioning the names of Messrs. Blackman and House?

          It’s a bit unfair to first actively mention them, and then mount the high-horse of soliciting “general comments on ethics”.

          If you want a neutral, unheated discussion, don’t invoke painful specifics in the first place!

        • Rahul:

          I discussed the specifics of the case because I wanted to address the particular statistical issues regarding the scientific problem and the data analysis rather than to merely discuss the data-disclosure problem in general.

    • Phil: That was my one experience of being asked for data from one of my publications from the 1980s [quality scores on published papers] – I actually was not sure if it was possible to regenerate it – but it would have been onerous. So I said no.

      And one always has to decide “what’s a good use of one’s time given one’s best bet of the value of it,” and my bet – when I do not know someone – is again no.

      But people are almost always defensive (both sides here?), so I appreciate your comments.

  5. I read the exchange. Without criticising anyone, I find it amusing how all parties in the exchange appeal to hierarchy (their own place in it, naturally–I am referring to “masters-level”, “Ph.D. student”, etc). Gelman’s description of the variation and alternate ways of handling it is straightforward and easily understandable, and is not refuted (Blackman tries, but his rationale for not presenting an alternate analysis is weak). It’s humorous that Blackman makes the comment “I do not intend to defend the classical model of statistical inference against the newer model he is championing”–someone must have tipped him off that Gelman is Mr. Bayesian! There are no hierarchical models or MCMC computations in Gelman’s statistical criticisms of this article, so Blackman seems to be replying to something that doesn’t exist.

    What there really needs to be is a requirement that NIH, NSF, EPA, etc., devote a certain amount of their research budget to quality control. These would be grants issued to people like Gelman to reanalyze data, critique scientific analyses, and so on. Relying on reviewers doesn’t work.

  6. Blackman lost me when he said that if he had known that you were a mere grad student, he “should speak to his research adviser to determine the scientific and educational value of his working with these data.” That is the snottiest sentence that I have read in a while. Your adviser may not have even had an opinion about the data.

    • To the extent that there are “two sides”, neither one comes off looking very well from the snottiness point of view. In Andrew’s case I know that the way he comes across in this exchange is not a reflection of his actual character (I speak as someone who has known him quite well for thirty years), and I think it’s possible that Blackman also has simply expressed himself poorly. But nobody’s views on ethics (or just about anything else) should be ignored simply because they come across as snotty; we should grade on content, not style.

      Unfortunately, from the content perspective I _also_ think neither side has much to be proud of in this exchange, although since Andrew doesn’t want a discussion of this particular case I won’t go into what I think each side did wrong.

      And the articles do serve Andrew’s purpose of raising an important issue, on which I agree with Andrew: if you publish conclusions based on analyzing data, you should make the data available to others. Since this is in fact not at all a standard thing to do, I think it’s a bit unfair to pick on any small group of people for failing to do it; it’s really an ethical problem for the whole profession, just as it was an ethical problem back in the 50s when (supposedly, anyway) most doctors wouldn’t tell people that they had terminal cancer.

  7. I find the most common way in which alternative analyses are stymied is simply by ignoring requests for data. I’ve emailed a dozen or so authors for data and been ignored completely roughly half the time. I find that frustrating, but if they ignore you, what can you do?

    • Wouldn’t an FOIA request be a good workaround for such recalcitrant researchers? Of course, it only works in cases where they are funded by a government agency.

  8. I found it ironic that a discussion of “open data” was happening on an essentially “closed” forum:

    “Some content is only viewable by ASA members who subscribe to CHANCE. Log in to Members Only to continue reading.”

  9. The philosopher of science Karl Popper insisted, and was lauded for his insistence, that the test of the reliability of a scientific premise or theory is in the evidence, and that an untested thesis is not science yet, while an untestable thesis is non-science.

    If Einstein was right (and we know he was) when he said that experimental evidence can nullify the most elegant of theories, then anyone who publishes should be ready and cooperative in allowing other scientists to test and attempt to replicate the experiment or research project to affirm the results reported. If the research is publicly funded, the research is in the public domain and the researchers should be accountable to the public–and by that we should mean the public, not some agency with an agenda.

    What’s the problem with allowing others to review one’s data and attempt to replicate the results? In fact, that is what one should be as a good scientist: the best critic of the results and the methods used, before it goes to publication. It’s not even a close call–“trust me” science is not science, black-box science is not science; science is testing and verification, and it is transparent and subject to criticism and debate on methods and analysis. Trouble is, too many people are in the business of promoting themselves, their incomes, their positions, and the revenue expected from another round of grants for research.

    As for the distracting discussions about whether a dataset is proprietary and methods might also be proprietary: when the big test of the bending of light was performed, was that proprietary? Sometimes the people who do trans-science, like epidemiologists torturing data for small associations, use methods to enhance their results–is that a proprietary trick?

  10. As a researcher who gathers data, and who has unsuccessfully requested data from others, I’m not very sympathetic to the data-sharing argument. Developing a research project, applying for funds and ethical approval, and carrying out the fieldwork takes a huge amount of time and effort. It is simply not in my interests to pay the costs of this work without every opportunity to extract some benefit.

    Of course, if my comparative advantage were in statistics, rather than fieldwork, it would be in my interests for an open-access data regime. I would clearly benefit from such an arrangement because other fools would have to pay the costs of gathering data, and, seeing as how I could probably outgun them in the analysis of said data, I would readily get published.

    In an ideal world, we could have it all. I could do my fieldwork, you could make a couple of graphs with the data, and everyone would be happy. But in the real academic world, there are only so many jobs in desirable locations, endowed chairs, and slots in top journals. Why should I give my competition a free ride?

    I don’t think the ethics are very clear either. It is very easy to occupy the moral high ground when it aligns with your self-interest. One could make a different argument: as with drug development in the pharmaceutical industry, those who invest in data gathering may require a protective period to allow the investment to be worthwhile. The alternative may well be the underprovision of novel data.

    Moreover, the ethical case is still ambiguous even if the data were collected using public funds. Only part of the cost of gathering data is the dollar amount covered by a grant. A large chunk (especially in the social sciences) of the total cost is my time and effort in creating a successful grant.

    • Data gatherer:

      1. I think that for publicly funded work, the cost of making the data available should be part of the grant. Just as it is typically part of the grant to pay for travel to conferences to present the work. I would not recommend funding a grant proposal that had no resources allocated to making the data available to others. If data are not available to others, the project is a black hole, and all depends on trust of the funded researchers. Not such a good idea.

      2. The project discussed in my article was conducted at a government lab. I don’t care if it’s in the interests of the researcher to share the data; I think it is their duty to satisfy any reasonable request (and, yes, I think a letter from a statistician presenting a cogent argument for reanalysis is indeed a reasonable request). The researchers are already being paid a salary, I don’t think they have any right to withhold the data to “extract some benefits.” In fairness to Dr. Blackman and Mr. House, I never had any impression that they were trying to extract any benefits from me; rather, as discussed in the exchange of letters, they were busy and did not want to spend the time to put the data together. At no time were they trying to extract anything from me.

      3. You write, “Why should I give my competition a free ride?” I have three answers. First, being a scientist is a joyful job and I do think it is our duty to share our results—especially when they are publicly funded. Second, I was not competing with Blackman or House. Third, of course I would have cited them (and perhaps collaborated with them, if they’d been interested), so they would be getting even more credit through the wider distribution of their research.
