David Karger writes:
Your recent post on sharing data was of great interest to me, as my own research in computer science asks how to incentivize and lower barriers to data sharing. I was particularly curious about your highlighting of effort as the major disincentive to sharing. I would love to hear more, as this question of effort is one we specifically target in our development of tools for data authoring and publishing.
As a straw man, let me point out that sharing data technically requires no more than posting an Excel spreadsheet online, and that you likely already produced that spreadsheet during your own analytic work. So, in what way does such low-tech publishing fail to meet your data sharing objectives?
Our own hypothesis has been that the effort is really quite low, with the problem being a lack of *immediate/tangible* benefits (as opposed to the long-term values you accurately describe). To attack this problem, we’re developing tools (and, since it appears that you are blogging on WordPress, a WordPress plugin) that let non-programmers easily publish rich, interactive visualizations of their data. We speculate that this flashier approach to publishing data might give authors more of a feeling of accomplishment, more hope of impact, and a stronger sense of authorship/ownership, than putting up a dull data file.
But, while we’ve focused on incentives, deterrents are an equally important aspect of the problem, so I’d love to hear more about your experiences.
For context, here’s a post on my group’s blog at the tail end of a discussion with a colleague about some of these data sharing issues. You can backtrack to the rest of the discussion from it. Quick snippet:
Given the glaringly obvious (at least to me) benefits of structured data, there must be some barrier in place that is preventing its pervasive use by end-users. Identifying the barrier is the crucial first step to breaking through it. I’ve argued that the (technical) barrier is the lack of good authoring tools for this structured data; this has sparked an argument with Stefano Mazzocchi who has focused on the (social) barrier of reluctance to share data that might be appropriated without compensating value to the author.
I claimed by analogy that people will be happy to share structured data (given the right authoring tools) the same way they have been happy to share unstructured data. Stefano responded that data is different, with a disincentive to share because people might walk off with your data. I answered that these disincentives only arise in a class of “open source collaboration” settings that leave out a large number of sharing scenarios. In his latest rebuttal, Stefano suggests that his “questioning whether the nature of the content can influence the growth dynamics of a sharing ecosystem makes David dismiss it as being related to a particular class of people (programmers) or to a particular class of business models (my employer’s).”
OK, here goes. What’s getting in the way of sharing data and code? From my perspective, I really do see the benefits of sharing but in practice the costs seem to be too high:
1. My data aren’t usually in Excel spreadsheets. Often the data come in a mix of formats. For example, in our storable votes paper we analyzed data from several experiments, and the data were in slightly different formats, with different column identifiers, etc.
2. It’s not just data, it’s data + code. I’m not always so clean with my code. I’ll write an R script to do some analysis, then another R script to fit a second model, and I’ll pass data and R objects around between directories. It’s not good but that’s how it often works out. The result is that it’s not always so easy even for me to replicate my own analyses. I suppose I could dump the whole directory on the web but that would be a mess. For Red State Blue State it’s been a real problem. We did dozens of analyses and they’re all over the place in different directories. Sometimes I’m reduced to searching my email directory for code that I might have sent to or received from a coauthor.
3. Often we’re not supposed to release the data. Again, we could handle this by including the code and just putting in a link to the data but it’s one more step.
4. Sometimes I post a pretty graph on the blog and people ask for the code. I’d like to just paste in the code in the blog post but it’s not so easy with HTML. There are various HTML options (“code,” etc.) but I can never keep them all straight, and I recall that they often don’t preserve whitespace so that the posted code is hard to read. I could save the code in a file and link to the file but that takes a few more steps.
5. That said, when people ask me for data or code, I almost always post it or send it to them.
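On item 4: the HTML `<pre>` element does preserve whitespace, so one low-effort route is to escape the code and wrap it in `<pre><code>` before pasting it into a post. Here is a minimal Python sketch of that idea. The function name `code_to_html` is made up for illustration, and it assumes the blog editor leaves pasted raw HTML alone, which I have not checked for any particular platform.

```python
import html


def code_to_html(source: str) -> str:
    """Wrap source code in <pre><code> so a browser keeps its whitespace."""
    # Escape &, <, and > so that code such as `x <- y` is not read as markup.
    escaped = html.escape(source, quote=False)
    return "<pre><code>" + escaped + "</code></pre>"


# A small R snippet with indentation and a "<" that would otherwise break HTML.
r_snippet = "fit <- lm(y ~ x)\nif (a < b) {\n    print(fit)\n}"
print(code_to_html(r_snippet))
```

The printed string can be pasted into the HTML view of a post; line breaks and indentation inside the `<pre>` block are shown as written.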
As I wrote in my Chance article on open data and open methods, I support there being some requirement or norm to share data and code. If people know ahead of time that they will have to share data and code as requested, that will provide the incentive to researchers to make their data accessible, which should ultimately benefit them, other researchers, and science as a whole.
Again, I’m speaking of my own incentives and disincentives here, but perhaps the norm of replicability would have a chance to improve the behavior of the Hausers and Wegmans who spend so much time muddying the waters and covering their tracks. The more effort it takes to hide your offenses, the more incentive there is not to do it in the first place. Recall my theory that a prime motivation for plagiarism and other scholarly misconduct is . . . laziness.
P.S. In comments, Karger adds the following excellent thoughts:
Key point first: as my [Karger's] research is specifically aimed at figuring out how to encourage data sharing, I would be very happy to speak to any other scientists who have the time to tell me about their incentives, deterrents, and experiences sharing their data. Please get in touch.
Turning to Andrew’s list of reasons, I am particularly focused on the observation that it is too much work for the “lazy” scientist to prepare their data (and code) for distribution. Andrew says: “I suppose I could dump the whole directory on the web but that would be a mess.”
It seems to me that the “logical” thing to do in this case would be to just put up all the data in whatever forms, leaving it up to the consumer to perform whatever understanding/alignment/integration they needed. This would be an improvement on the current situation, as aligning data is surely less work than finding all the data from scratch. Presumably, the deterrent here is the feeling that it would be embarrassing/unprofessional to put up data in that form.
So one way to pursue this, instead of getting everyone to “behave better” in preparing their data, would be to convince everyone to tolerate “worse” behavior (less-nicely curated data). This might be an easier task, as it doesn’t actually require people to improve their behavior….
Could we convince people to just sloppily *describe* the data in the data files, to a point where someone else could do the integration? As opposed to doing the integration yourself?
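As a rough illustration of what a “sloppy description” could look like in practice, here is a minimal Python sketch that walks a directory and lists each file with its size and, for CSV files, its column names. This is one possible reading of the suggestion, not a tool from Karger’s group; the function name `describe_directory` is hypothetical, and the sketch assumes that the first row of each `.csv` file holds the column identifiers.

```python
import csv
from pathlib import Path


def describe_directory(root: str) -> str:
    """Return a plain-text listing of the files under root, with CSV headers."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        entry = f"{path.relative_to(root).as_posix()} ({path.stat().st_size} bytes)"
        if path.suffix.lower() == ".csv":
            # Assumption: the first row of a .csv file holds the column names.
            with path.open(newline="") as f:
                header = next(csv.reader(f), [])
            entry += " columns: " + ", ".join(header)
        lines.append(entry)
    return "\n".join(lines)
```

Calling `print(describe_directory("."))` in an analysis directory gives a listing that could be posted next to the raw dump, leaving the alignment and integration to whoever downloads it.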
The issue of packaging code reflects kind of the same problem—running up against the unwillingness to put up less-than-perfect material. Such a discussion arises at http://techblog.netflix.com/2012/07/open-source-at-netflix-by-ruslan.html : ‘we’ve observed that the peer pressure from “Social Coding” has driven engineers to make sure code is clean and well structured, documentation is useful and up to date. What we’ve learned is that a component may be “Good enough for running in production, but not good enough for Github”.’
Again, from a logical perspective putting up mediocre code is more beneficial than putting up nothing. How do we overcome the self-censorship?
Andrew adds “when people ask me for data or code, I almost always post it or send it to them.” This demonstrates that there are no insurmountable obstacles here. And the problem with this approach is that there’s probably someone on the other side saying “if the data were available, I’d do something with it, but I’m not going to be so selfish as to make him produce it for me”. I’d like to figure out how to close that gap.
Here’s an idea for a little experiment: when you post, put in a link saying “data for this post can be found here”. Take them to a page saying “ok, I lied—it isn’t here, but just type your email address here and I’ll send it to you”. How many people would click? How many would type their email address?
I [Karger] speculate that there are people whose curiosity would lead them to look at the data and then do something interesting with it, but who don’t quite have the energy to seek out the data if it isn’t in front of them.
I’ll give Karger’s suggestions a try . . . either before or after I get around to putting all my references in BibTeX.