
Reproducible science FAIL (so far): What’s stoppin people from sharin data and code?

David Karger writes:

Your recent post on sharing data was of great interest to me, as my own research in computer science asks how to incentivize and lower barriers to data sharing. I was particularly curious about your highlighting of effort as the major dis-incentive to sharing. I would love to hear more, as this question of effort is one we specifically target in our development of tools for data authoring and publishing.

As a straw man, let me point out that sharing data technically requires no more than posting an excel spreadsheet online. And that you likely already produced that spreadsheet during your own analytic work. So, in what way does such low-tech publishing fail to meet your data sharing objectives?

Our own hypothesis has been that the effort is really quite low, with the problem being a lack of *immediate/tangible* benefits (as opposed to the long-term values you accurately describe). To attack this problem, we’re developing tools (and, since it appears that you are blogging on wordpress, a wordpress plugin) that let non-programmers easily publish rich, interactive visualizations of their data. We speculate that this flashier approach to publishing data might give authors more of a feeling of accomplishment, more hope of impact, and a stronger sense of authorship/ownership, than putting up a dull data file.

But, while we’ve focused on incentives, deterrents are an equally important aspect of the problem, so I’d love to hear more about your experiences.


Karger continues:

For context here’s a post on my group’s blog at the tail end of a discussion with a colleague about some of these data sharing issues. You can backtrack to the rest of the discussion from it. Quick snippet:

Given the glaringly obvious (at least to me) benefits of structured data, there must be some barrier in place that is preventing its pervasive use by end-users. Identifying the barrier is the crucial first step to breaking through it. I’ve argued that the (technical) barrier is the lack of good authoring tools for this structured data; this has sparked an argument with Stefano Mazzocchi who has focused on the (social) barrier of reluctance to share data that might be appropriated without compensating value to the author.

I claimed by analogy that people will be happy to share structured data (given the right authoring tools) the same way they have been happy to share unstructured data. Stefano responded that data is different, with a disincentive to share because people might walk off with your data. I answered that these disincentives only arise in a class of “open source collaboration” settings that leave out a large number of sharing scenarios. In his latest rebuttal, Stefano suggests that his “questioning whether the nature of the content can influence the growth dynamics of a sharing ecosystem makes David dismiss it as being related to a particular class of people (programmers) or to a particular class of business models (my employer’s).”

OK, here goes. What’s getting in the way of sharing data and code? From my perspective, I really do see the benefits of sharing but in practice the costs seem to be too high:

1. My data aren’t usually in excel spreadsheets. Often the data come in a mix of formats. For example, in our storable votes paper we analyzed data from several experiments and the data were in slightly different formats, with different column identifiers, etc.

2. It’s not just data, it’s data + code. I’m not always so clean with my code. I’ll write an R script to do some analysis, then another R script to fit a second model, I’ll pass data and R objects around between directories. It’s not good but that’s how it often works out. The result is that it’s not always so easy even for me to replicate my own analyses. I suppose I could dump the whole directory on the web but that would be a mess. For Red State Blue State it’s been a real problem. We did dozens of analyses and they’re all over the place in different directories. Sometimes I’m reduced to searching my email directory for code that I might have sent to or received from a coauthor.

3. Often we’re not supposed to release the data. Again, we could handle this by including the code and just putting in a link to the data but it’s one more step.

4. Sometimes I post a pretty graph on the blog and people ask for the code. I’d like to just paste in the code in the blog post but it’s not so easy with html. There are various html options (“code,” etc.) but I can never keep them all straight, and I recall that they often don’t preserve whitespace so that the posted code is hard to read. I could save the code in a file and link to the file but that takes a few more steps.

5. That said, when people ask me for data or code, I almost always post it or send it to them.
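On the whitespace problem in item 4, here is a minimal sketch of one possible workaround, not something the post itself uses: escape the code and wrap it in `<pre><code>` tags, which browsers render with spaces and line breaks intact. The function name `code_to_html` and the sample R snippet are hypothetical, chosen only for illustration.

```python
import html


def code_to_html(source: str) -> str:
    """Wrap source code in <pre><code> so a browser keeps its whitespace."""
    # html.escape turns &, < and > into entities, so the snippet is shown
    # literally instead of being parsed as markup (important for R's "<-").
    escaped = html.escape(source, quote=False)
    return "<pre><code>" + escaped + "</code></pre>"


# Hypothetical R snippet, used only as sample input.
r_snippet = "fit <- lm(y ~ x, data = d)\nif (a < b) {\n    plot(fit)\n}"
print(code_to_html(r_snippet))
```

The printed string can be pasted into a post's HTML view. Whether a given blogging platform leaves `<pre>` blocks untouched is an assumption that would need checking.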

As I wrote in my Chance article on open data and open methods, I support there being some requirement or norm to share data and code. If people know ahead of time that they will have to share data and code as requested, that will provide the incentive to researchers to make their data accessible, which should ultimately benefit them, other researchers, and science as a whole.

Again, I’m speaking of my own incentives and disincentives here, but perhaps the norm of replicability would have a chance to improve the behavior of the Hausers and Wegmans who spend so much time muddying the waters and covering their tracks. The more effort it takes to hide your offenses, the more incentive there is not to do it in the first place. Recall my theory that a prime motivation for plagiarism and other scholarly misconduct is . . . laziness.

P.S. In comments, Karger adds the following excellent thoughts:

Key point first: as my [Karger’s] research is specifically aimed at figuring out how to encourage data sharing, I would be very happy to speak to any other scientists who have the time to tell me about their incentives, deterrents, and experiences sharing their data. Please get in touch.

Turning to Andrew’s list of reasons, I am particularly focused on the observation that it is too much work for the “lazy” scientist to prepare their data (and code) for distribution. Andrew says: “I suppose I could dump the whole directory on the web but that would be a mess.”

It seems to me that the “logical” thing to do in this case would be to just put up all the data in whatever forms, leaving it up to the consumer to perform whatever understanding/alignment/integration they needed. This would be an improvement on the current situation, as aligning data is surely less work than finding all the data from scratch. Presumably, the deterrent here is the feeling that it would be embarrassing/unprofessional to put up data in that form.

So one way to pursue this, instead of getting everyone to “behave better” in preparing their data, would be to convince everyone to tolerate “worse” behavior (less-nicely curated data). This might be an easier task, as it doesn’t actually require people to improve their behavior….

Could we convince people to just sloppily *describe* the data in the data files, to a point where someone else could do the integration? As opposed to doing the integration yourself?
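As a rough illustration of what a “sloppy description” could look like in practice, here is a sketch that lists each CSV file in a directory with its column names and row count. It assumes the data are CSV files with a header row. The function names and the sample files are hypothetical, not part of any tool mentioned here.

```python
import csv
import tempfile
from pathlib import Path


def describe_csv_files(data_dir):
    """Return one rough description per CSV file: name, columns, row count."""
    descriptions = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])       # first line is assumed to be a header
            n_rows = sum(1 for _ in reader)  # remaining lines are data rows
        descriptions.append({"file": path.name, "columns": header, "rows": n_rows})
    return descriptions


def format_manifest(descriptions):
    """Turn the descriptions into a plain-text manifest, one line per file."""
    return "\n".join(
        f"{d['file']}: {d['rows']} rows; columns: {', '.join(d['columns'])}"
        for d in descriptions
    )


# Demo on two made-up files whose column names do not match each other.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "experiment1.csv").write_text("subject,vote\n1,yes\n2,no\n")
    Path(tmp, "experiment2.csv").write_text("id,choice,round\n1,A,1\n")
    print(format_manifest(describe_csv_files(tmp)))
```

A manifest like this does none of the integration. It only tells a reader what is in each file, which is the modest goal described above.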

The issue of packaging code reflects kind of the same problem—running up against the unwillingness to put up less-than-perfect material. Such a discussion arises at http://techblog.netflix.com/2012/07/open-source-at-netflix-by-ruslan.html : ‘we’ve observed that the peer pressure from “Social Coding” has driven engineers to make sure code is clean and well structured, documentation is useful and up to date. What we’ve learned is that a component may be “Good enough for running in production, but not good enough for Github”.’

Again, from a logical perspective putting up mediocre code is more beneficial than putting up nothing. How do we overcome the self-censorship?

Andrew adds “when people ask me for data or code, I almost always post it or send it to them.” This demonstrates that there are no insurmountable obstacles here. And the problem with this approach is that there’s probably someone on the other side saying “if the data were available, I’d do something with it, but I’m not going to be so selfish as to make him produce it for me”. I’d like to figure out how to close that gap.

Here’s an idea for a little experiment: when you post, put in a link saying “data for this post can be found here”. Take them to a page saying “ok, I lied—it isn’t here, but just type your email address here and I’ll send it to you”. How many people would click? How many would type their email address?

I [Karger] speculate that there are people whose curiosity would lead them to look at the data and then do something interesting with it, but who don’t quite have the energy to seek out the data if it isn’t in front of them.

I’ll give Karger’s suggestions a try . . . either before or after I get around to putting all my references in Bibtex.

37 Comments

  1. Bill Mill says:

    For sharing and displaying code, you might want to use gists: https://gist.github.com/

    Each gist comes with an easy link to embed the code snippet, with highlighting, on a web page. (“Embed” on a gist functions the same way youtube’s “embed” button does.)

  2. I agree that data+code should be released together. I have a rant about it here: http://howdy.physics.nyu.edu/index.php/Data_release_policy

    I think that *if* you imagine someone being able to download your data and code as a big ball, unpack it, type “make” or equivalent, and see all the plots you put in your paper, *then* it (a) makes your data and code much more useful to others (data documented by code and code documented by data), and (b) makes your data and code much more useful to *you*, in the sense that you don’t get the “it’s not always so easy even for me to replicate my own analyses” effect you mention. I also think all this should be a *requirement* from the journals, but I don’t see much movement in that direction!
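A minimal sketch of the “type make or equivalent” idea, written in Python rather than as a Makefile: rebuild an output only when it is missing or older than its inputs. The file names and the `summarize` step are hypothetical stand-ins for a real analysis.

```python
import tempfile
from pathlib import Path


def needs_rebuild(target, sources):
    """True if target is missing or older than any source (make's basic rule)."""
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    return any(Path(s).stat().st_mtime > target_time for s in sources)


def build(target, sources, recipe):
    """Run recipe(target, sources) only when the target is out of date."""
    if needs_rebuild(target, sources):
        recipe(target, sources)
        return True
    return False


def summarize(target, sources):
    # Hypothetical analysis step: count the data rows in each input file.
    lines = []
    for s in sources:
        n = len(Path(s).read_text().splitlines()) - 1  # subtract the header line
        lines.append(f"{Path(s).name}: {n} rows")
    Path(target).write_text("\n".join(lines) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp, "votes.csv")
    data.write_text("subject,vote\n1,yes\n2,no\n")
    out = Path(tmp, "summary.txt")
    print(build(out, [data], summarize))  # first call runs the recipe
    print(build(out, [data], summarize))  # second call finds the output up to date
```

Real projects would more likely use make itself or a similar tool. The sketch only shows why one command can regenerate every result once each output is tied to its inputs.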

    • Kjetil Halvorsen says:

      This could be done easily by packaging everything as an R package. If you then submit to CRAN (I don’t know if this kind of usage of CRAN is inside the guidelines!), the tracking of necessary changes with new versions becomes automatic, and if your code is good, the maintenance should be easy.

  3. Jonathan Gilligan says:

    What is “data”? What is “code”?

    Consider some extreme cases on the data side:

    High energy physics, where the raw data may be tens of thousands of terabytes. Obviously you can’t upload this with every paper. This data is repeatedly processed to produce smaller sets. What level of abstraction or filtering should characterize the stored data?

    Human-subject research: Back around 2000, the Shelby amendment requiring sharing of raw data from federally funded studies ran into questions about its implications for confidentiality of subjects. When sharing data, what are the requirements for ensuring that it does not inadvertently make it possible for data mining to de-anonymize and identify the subjects? This can be a big issue with genetic studies and social science surveys.

    Proprietary data: What are the requirements when you license proprietary data from someone else?

    And on the code side:

    If your data analysis code is home-brewed fortran, c, or suchlike it may be nontrivial to make it portable to machines other than yours. Scientific software written for one Linux distribution may be difficult to compile on another distribution, even on another release of the same distribution. It may compile fine on one compiler release and break on the next release of the same compiler (I’ve run into this with the NCAR Graphics Libraries). How much effort does the researcher need to put into documenting the code, making it portable, and providing assistance for others trying to compile it? “Download a tarball, unpack it, and type make” is an illusion for most nontrivial scientific code I’ve seen.

    There’s good discussion at realclimate.org of the difficulty (and even futility) of sharing code as a means to enhance reproducibility.

    And if your code includes homebrewed adaptations of copyrighted material, such as Numerical Recipes, do you have to edit the code to remove this? And how useful will it be with those sections removed?

    In short, releasing data and code is a laudable idea but in many cases it is much more complicated than it might seem at first glance.

  4. Woody Setzer says:

    Not only should data+code be released, but generally a lot of metadata needs to be released as well, to put the data and code in context. Even for a modest dataset, much of that metadata may not be written down in a compact form, but is scattered in comments through lab notebooks, comments in code, maybe scraps of paper and emails. Given that many/most researchers probably start from a state of organization that is worse than Andrew describes, it is easy for even non-lazy folks to decide to “think about that tomorrow” when it comes to building an archive of data + code + metadata for public release.

  5. Erik says:

    I’m not sympathetic to any reasons for not sharing data except #3.

    The barrier to sharing code and data is increasingly disappearing with git and github. It’s incredibly easy to set up a repository, add your code and data, and make it available for anyone to quickly read and download. The benefits are incredibly high at the individual level since you can track your code changes, add documentation, and get a great code review system. The social benefits are also pretty large as you can see how research progressed and easily replicate what someone does.

    R’s package functionality is a framework for research as it has data and code functionality built in and it forces you to think of your analysis in terms of repeatable functions. You can make your packages more robust with in-code documentation using roxygen, test functionality to make sure your functions work like you expect them to (with runit or testthat), and numerous other packages, like devtools.

    I’d argue that if one can’t replicate their own analyses, then that’s a flaw in their research methodology. Git and R packages are a great combination to correct those flaws and your research will benefit in turn.

    • Andrew says:

      Erik:

      I think I see where you’re coming from, but don’t forget there are tradeoffs. I learned Fortran 35 yrs ago and S (equivalently, R) 25 yrs ago. I’ve never learned git, roxygen, runit, testthat, or devtools (actually, I don’t even know what these are), nor have I ever created an R package (although others have done that for me). On the other hand, I’ve written 6 books, some research articles, and a couple thousand blog entries. I agree with you that the nonreplicability of my research is a flaw; on the other hand, all our research has flaws and we are always making choices about how best to spend our work time. You might be right that I should, for example, write 4 fewer papers this year and spend the time saved learning git; my only point here is that such a decision would represent a choice, it’s not obvious that’s the best use of my time.

      • Erik says:

        Those are fair points. When you learn a technology you tend to forget the troubles you went through to pick it up.

        In response, though, I think you’re undervaluing the benefits of using a better project management system (e.g. R packages) and source control (e.g. git). You’ll go a little slower this month as you pick up git and start making packages, but from then on you steadily accelerate until you’re going way faster than you were before you started. The point being that it’s not a simple tradeoff: it’s true that your next paper goes slower, but all the rest go faster.

        Quick glossary:
        -git: source control system. It’s linked to github.com, a social code sharing site that allows others to copy your code, comment on it, and submit changes.
        -roxygen2: R package that allows you to document your functions in the source code so that you don’t have to maintain in-package documentation and a separate help file
        -devtools: R package that facilitates the development of packages.

  6. chris says:

    It’s ugly, but you guys are ignoring competition.

    If you are an existing business who has the ability to spend $1M developing a dataset and then invest $50k into developing a product that does something useful based on that dataset, you would want to publish the results to get visibility, but absolutely would not want to share the data so the next startup can come along and spend the $50k to become your competitor.

    I know you meant in academia, but academia is just as competitive (at least during the brief time I spent there). Translate $ into time invested, and the goal is tenure or a new grant and you have a similar ugly scenario. You spend ten years developing your knowledge and codebase, do you give it away and risk someone beating you to your next result? There are only so many tenured positions.

    Of course once you have tenure (or lots of $) then the competition lessens, but I’d argue it just changes a bit and never goes away.

    From the perspective of a Lab, where the competition aspect pretty much goes away, laziness kicks in. We have enough hurdles getting permission to publish a paper at all, much less going through prepub review to release actual code and data. I’d rather not publish the paper than spend a “year” doing paperwork.

    • Ely Spears says:

      I absolutely second this suggestion. I witnessed a lot of closed-access behaviors in my grad school years too. When a researcher (particularly young, or just anyone in a hot field where getting signal above noise is very tough) spends great effort to secure a data set and code routine that allows them more or less push-button access to answering questions that previously were unanswered or seen as too difficult to answer, then they don’t want to let go of control of it until they have milked it dry, lest they get scooped by a competitor who benefits from their effort. This happened to me time and time again when emailing authors and their students for data, code, or even a brief documentation or readme about the code.

      I’m surprised that social causes and/or laziness causes are being considered as first order answers. I think the data monopoly and no-scooping-me-in-my-niche-data-space causes are more likely.

      And unless some open access journal becomes prestigious and includes the state of available code/data in the review process, I just see no real incentive. Sure I’ll release data years later once I’ve gotten all I can from it, but in the meantime I too am just making sure I squeeze everything out of it so others can’t.

      • Andrew says:

        Ely:

        For me, the motivations are 100% laziness and externally-imposed data restrictions, 0% competition. I’d looove if people would come and squeeze more out of my data and analyses. And I do have a real incentive for that to happen. It’s the same incentive that motivates me to publish a paper in the first place: I want researchers to be aware of what I’m doing so they can do more of it themselves (for methods research), or I want people to be aware of what I’ve found so they can explore it further (for substantive research). I’m aware that others’ motivations differ, but that’s where I’m coming from.

        • K? O'Rourke says:

          > others’ motivations differ

          Once when I made my academic bosses aware of a very valuable source of data for methodological research they were interested in – they _told_ me I should not tell anyone else about it. At the time I had an appointment at another university and replied that I would need to inform the REB there of this request to comply with it and they backed off.

          There turned out to be some difficulties and limitations with the data, so it was dropped (but who knows how this later affected my relationships with them).

          Someone should do an anonymous survey of academics to get a sense of prevalences.

  7. Completely agree with Jonathan Gilligan that releasing data and code is difficult, and there are many edge cases where it is made more difficult or impossible.

    However, I completely disagree that this should be a justification not to try at all.

    I believe part of a solution, certainly not the whole solution, is about lowering the ‘activation energy’ required to share these critical artifacts. Some examples of tools/workflows that people could use to decrease this activation energy include:

    * use a consistent directory structure that attempts to separate out code, data, and analysis artifacts. This keeps your code bundled up in a single convenient location.
    ** for R – there is https://github.com/johnmyleswhite/ProjectTemplate if you wanted to automate this

    * use git to manage your code portion of the analysis. Using a version control system gives you a lot of other benefits, but it really makes it a lot easier to share code.
    ** other benefits of git: http://vallandingham.me/Quick_Git.html
    ** have copyrighted code that you don’t want to share? add it to the .gitignore file and don’t share it. Something has to be better than nothing.

    * Use github. Right now, I think github is really THE main way to lower the cost of publishing code. It might not work for data as well, but it’s a start. If you have a github account for your personal/group/office/department’s code already set up, sharing additional code is a 3 minute process. If you already have git in your workflow, then uploading to github is really trivial.

    * Like Bill Mill says, gists are another great option for sharing code when it’s a small fragment. Plus it can be associated with your github account and you can reference it in other documentation/code as you see fit.

    * consider using something like Sweave, or better yet – knitr – http://yihui.name/knitr/ – as a way to add metadata, context, and explanation to your code. Certainly not appropriate for all code, but if it’s a one time analysis, it might be worth checking out.

    Certainly not a whole solution. However, with a few (what I would consider minor) tweaks to a person’s analysis workflow, the cost of sharing code (and relatively small amounts of data) can be significantly reduced… At least in theory.
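The first two suggestions in the list above (a consistent directory layout, plus a .gitignore for material that should not be shared) can be sketched as follows. The directory names are hypothetical and are not the layout that ProjectTemplate generates. The `private/` entry uses the standard gitignore pattern for excluding a directory.

```python
import tempfile
from pathlib import Path

# Hypothetical layout, chosen for illustration only.
LAYOUT = ["code", "data", "output", "private"]


def create_skeleton(root, ignored=("private/",)):
    """Create the directories, a .gitignore and a short README under root."""
    root = Path(root)
    for name in LAYOUT:
        (root / name).mkdir(parents=True, exist_ok=True)
    # Anything listed here stays out of the shared repository,
    # e.g. restricted data or copyrighted routines.
    (root / ".gitignore").write_text("\n".join(ignored) + "\n")
    (root / "README.txt").write_text(
        "code/    analysis scripts\n"
        "data/    shareable input data\n"
        "output/  generated figures and tables\n"
        "private/ not shared (listed in .gitignore)\n"
    )
    return sorted(p.name for p in root.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    print(create_skeleton(tmp))
```

After running something like this once, `git init` in the project root followed by the usual add and commit would leave `private/` out of the history, assuming nothing in it was committed earlier.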

  8. AP says:

    There is one more problem with having publicly accessible data: smearing.

    Imagine you’re a fairly new academic that has lots of (good) experiments promoting a new and exciting hypothesis.

    Now, someone who is hostile to this hypothesis comes along, downloads your data, runs another analysis on it (which may itself be flawed) and then publishes in either a lower journal or just on a blog or something like that arguing that you’re completely wrong and that your research is flawed.

    In the long run, you’ll be able to defend your reputation and this will go down as a pointless argument, but how much time and energy will you have to invest to rebut these kinds of attacks? You have an especially big problem if your hypothesis or theory runs counter to some existing (and, as you and your colleagues show, wrong) theories.

    On top of that, many people, especially those in related fields, just read headlines and summaries from your field (imagine law and psychology). How many of these people will remember the smear and not the carefully orchestrated response?

    I think this is a huge hurdle to overcome before we get to the point where all researchers make all their data public.

    I think until we have a way

    • AP says:

      Hmm, just cut me off…

      …until we have a way to deal with the smear problem, we’ll have a big incentive barrier to getting data made public.

      • David Karger says:

        The “smearing” you’re describing is the fundamental scientific process: other scientists need to be able to replicate/critique your experiments for science to progress. Scientists need to be willing to stand behind their conclusions and defend them against “smears”. Hiding your data to protect it from criticism is bad science.

        • John Mashey says:

          While this is a 3-sigma case at least, how do you deal with people whose goal is to waste your time to the point you get nothing done? Put another way, at what point do you stop answering questions?

        • AP says:

          Except that criticism isn’t always fair and isn’t always done for reasons of scientific integrity.

          Just like everyone else, scientists are people and love to play political games.

          Is sharing data a good idea? Yes.
          Should you post it publicly in a perfect world? Yes.
          Are individual incentives aligned with incentives of the scientific community? No.

          • Sorry, but this makes very little sense. “Smearing” has been going on without public data for as long as the academy has been around. With your data publicly out there, you have a much better chance of defending yourself. But anyway, how do we know what that person published in the first place isn’t a smear, itself? I have no sympathy for this motivation, at all. Peer review (both anonymous and public) is a stressful business. But if you’re not willing to submit to it (whether fair or foul), you’re failing the fundamental premise of science. You cannot have your pie chart and eat it!

            Actually, I have a lot more sympathy for someone who says, I won’t have my work peer reviewed anonymously. When I was a journal editor, I had one such request and acceded to it. A lot of anonymous peer review is much more tendentious and insidious than any public smearing. So I’d say it’s in the interest of new researchers to go that way. Anonymous peer review may be good for catching silly mistakes and finding out about embarrassing references you may have missed. But any more fundamental mistakes or disagreements should be public! And for that public data is essential. And if it shows your messy process, that should be a matter of public record, as well. And I say that as a messy person, myself.

  9. John Mashey says:

    AP has some good comments.
    I’d add that it is useful to calibrate the balance between:
    1) getting results and insights.
    A) where credibility crucially depends on access to the data.
    B) Where results may seem odd enough that one really wants to look at the code, whether one needs to actually run it or not.
    C) Where complete access to the code and data is still not really convincing, even if done fine, ie, results are deemed significant at some level, but there may be confounders, or they might have been unlucky, and few will give much weight to some hypotheses until they see multiple other studies and meta analyses.

    AND
    2) Efforts to make data available, efforts to make code available, further efforts to do serious software engineering to make code robust, well documented, version-controlled, with good make files, test cases, proper numerical analysis if need be, attention to portability issues including floating point precision, etc.

    3) As it happens, much of modern software engineering has some of its roots in Bell Labs in the 1970s/1980s.
    R’s ancestor S was there, as was make. Most version control systems can be backtracked to SCCS, written by the guy in the next office and my office-mate. Of course, UNIX portability was done to help the problems of moving software easily among different OS environments. We tried that first. Of course inside BTL, there was in effect open-source, as people traded code and there was a superb internal system for promulgating results.

    4) But still, different groups varied radically in the amount of software engineering that made sense.
    At one extreme, pure researchers might share data and give the code, but it might be unportable and poorly documented, and they hadn’t learned to use make. At the other extreme, whole software systems could be rebuilt by one command, not just the current version, but as of any specified release date, with extensive automated test suites.
    Obviously, at that extreme, we’re talking about software as a long-lived product.

    Anyway, savvy management (including some who were very good computing people, including a boss of mine who wrote awk scripts for fun while managing a 150-person lab) were perfectly happy with the differences in fraction of effort devoted to software engineering.

  10. David Karger says:

    Key point first: as my research is specifically aimed at figuring out how to encourage data sharing, I would be very happy to speak to any other scientists who have the time to tell me about their incentives, deterrents, and experiences sharing their data. Please get in touch.

    Turning to Andrew’s list of reasons, I am particularly focused on the observation that it is too much work for the “lazy” scientist to prepare their data (and code) for distribution. Andrew says: “I suppose I could dump the whole directory on the web but that would be a mess.”

    It seems to me that the “logical” thing to do in this case would be to just put up all the data in whatever forms, leaving it up to the consumer to perform whatever understanding/alignment/integration they needed. This would be an improvement on the current situation, as aligning data is surely less work than finding all the data from scratch. Presumably, the deterrent here is the feeling that it would be embarrassing/unprofessional to put up data in that form.

    So one way to pursue this, instead of getting everyone to “behave better” in preparing their data, would be to convince everyone to tolerate “worse” behavior (less-nicely curated data). This might be an easier task, as it doesn’t actually require people to improve their behavior….

    Could we convince people to just sloppily *describe* the data in the data files, to a point where someone else could do the integration? As opposed to doing the integration yourself?

    The issue of packaging code reflects kind of the same problem—running up against the unwillingness to put up less-than-perfect material. Such a discussion arises at http://techblog.netflix.com/2012/07/open-source-at-netflix-by-ruslan.html : ‘we’ve observed that the peer pressure from “Social Coding” has driven engineers to make sure code is clean and well structured, documentation is useful and up to date. What we’ve learned is that a component may be “Good enough for running in production, but not good enough for Github”.’

    Again, from a logical perspective putting up mediocre code is more beneficial than putting up nothing. How do we overcome the self-censorship?

    Andrew adds “when people ask me for data or code, I almost always post it or send it to them.” This demonstrates that there are no insurmountable obstacles here. And the problem with this approach is that there’s probably someone on the other side saying “if the data were available, I’d do something with it, but I’m not going to be so selfish as to make him produce it for me”. I’d like to figure out how to close that gap.

    Here’s an idea for a little experiment: when you post, put in a link saying “data for this post can be found here”. Take them to a page saying “ok, I lied—it isn’t here, but just type your email address here and I’ll send it to you”. How many people would click? How many would type their email address?

    I speculate that there are people whose curiosity would lead them to look at the data and then do something interesting with it, but who don’t quite have the energy to seek out the data if it isn’t in front of them.

    • MarkJ says:

      I think publishing data *and* the code that generated it is very important — if you look on my web site you’ll see I put a lot of energy into doing just this. I’m saying this because I’m about to defend the people that don’t do that much.

      In an ideal world with unlimited resources of course all our data and code would be publicly available on-line. But academic research is far from an ideal world; it’s a world with extremely limited resources. Most of my academic research projects consist of lots of little experiments, most of which go nowhere. It’s a waste of resources to engineer the ones that go nowhere to a standard that can be distributed. The problem is that I don’t know ahead of time which experiments will turn out to be important; if I did, I’d engineer them better to start with.

      So the question is: is it worth our time to “clean up” the data and code for public distribution, or are we better off spending our time doing something new, which might lead to another publication? Another way to put it: what’s better for the field, for Andrew to write another great paper, or for Andrew to distribute his data?

  11. konrad says:

    This is an area where Bioinformatics as a discipline seems to be ahead of most other disciplines – it is the norm to make data and software available with journal publication. E.g. see the author guidelines (specifically, “Software” and “Supporting Data” under “General Policies”) here:

    http://www.oxfordjournals.org/our_journals/bioinformatics/for_authors/general.html

  12. Gizmo says:

    Reproducible science (data+code) here: http://www.runmycode.org/CompanionSite/

  13. Tom Moertel says:

    I’ve been experimenting with publishing my analyses and data sets the same way I publish my open-source software, as repositories on GitHub. It seems to work well. (Examples: an analysis of municipal property taxes after a reassessment and a data set for the analysis of a county-wide property reassessment).

    In the software world, people who publish high-quality work on code-sharing sites like GitHub are starting to gain real-world benefits, including improved social status and higher-paying job offers. At many tech companies, especially at start-ups, one of the first things you’ll be asked for when applying for a software job is a link to your GitHub account. In effect, there’s now a penalty for not being able to point to good work on code-sharing sites.

    I suspect that something similar will happen in the data-analysis world. As soon as enough researchers start publishing high-quality work in open, repeatable, and reusable formats, and once they start gaining real-world benefits because of it, other researchers will be penalized if they can’t do likewise and will be motivated to follow suit.

    Thus starting a virtuous cycle :-)

    • FMark says:

      Tom, sadly the incentives are different for academic researchers. In contrast to closed source software developers, academic researchers already share their work and gain reputation through a (pseudo-) public network, the peer-reviewed journal system. Thus, they won’t start getting real world benefits of the type you’ve described by publishing reproducible data + code. I hope I’m wrong about this, but as someone with a foothold in both the software development and research worlds that has been my experience.

      And as others have pointed out, a whole range of factors militate against sharing. The primary ones for me in social science are the time it takes to prepare data for sharing, respondent privacy (and concomitant human research ethics committee enforced obligations) and commercial/proprietary data restrictions.

      I think the best way to solve the first problem is for major funding agencies (e.g. the Australian Research Council) to require the public deposit of data collected in projects they fund. The ARC has already started doing that, and it enables/enforces otherwise willing researchers to make time in their busy schedules to curate data for redistribution. The Australian Data Archive ( http://www.ada.edu.au/ ) makes dealing with the privacy concerns easy too. So these challenges aren’t insurmountable.

      If you accept the argument that the incentives for open data sharing aren’t there for individual researchers, the case for institutional change becomes clear.

  14. John Mashey says:

    When did open source software start?

  15. Tom Moertel says:

    John, the open-source and Free software “movements” have been around for decades, but social code-sharing sites like GitHub and Bitbucket are recent. GitHub was founded in 2008.

  16. David Karger says:

    This is timely—a web site for sharing your data:
    http://figshare.com/

  17. Jared says:

    I want to second Jim’s recommendation of knitr and add that instead of LaTeX it can be used with markdown to very simply create HTML suitable for the blog.

    I am writing an entire book in knitr/LaTeX and using markdown is supposed to be even easier.

    Here’s his short example on using markdown to embed R code and results in HTML: https://github.com/yihui/knitr/blob/master/inst/examples/knitr-minimal.md

  18. John Mashey says:

    One of these days I’ll have to write a paper on this, but I’d suggest that:

    a) Open source software started ~same time software did, specifically back when David Wheeler invented subroutines for EDSAC, in the later 1940s:
    ‘The subroutine concept led to the availability of a substantial subroutine library. By 1951, 87 subroutines in the following categories were available for general use: floating point arithmetic; arithmetic operations on complex numbers; checking; division; exponentiation; routines relating to functions; differential equations; special functions; power series; logarithms; miscellaneous; print and layout; quadrature; read (input); nth root; trigonometric functions; counting operations (simulating repeat until loops, while loops and for loops); vectors; and matrices.’ I knew his advisor Maurice Wilkes and was lucky to spend time with David when he came out to Mountain View in 2003 to get an award at the Computer History Museum (I’m a trustee), the year before he died. He was still sharp and active.

    b) This came up in a discussion here, where I posed some questions and here where I gave my answers. (Among the old programs mentioned there, ASSIST (and the related XMACRO package) were open source software for IBM S/360 that I wrote ~1970 and that were used by hundreds of universities. That was in the days when the only way to share code was to put it on a mag tape and send it. In the Digital Equipment Corporation world, there was a regular “DECUS Tape” from the users’ group and it was limited in size, so it was generally considered a kudo to get your software accepted to be included.)

    Note comment on Linux:
    ‘In some sense, the biggest innovation is in the Internet-adapted development process and social structure, although some of that mirrors the exact way that the HASP project worked, or to some extent, UNIX inside Bell Labs.’

    The point of all this is:
    1) As long as there has been software, people, especially in technical computing, have been sharing software and contributing to common pools of it.

    3) What’s changed over time are
    a) technologies for sharing, starting with paper tape libraries … SHARE and DECUS tapes … USENET … Internet … Github, etc.

    b) the hardware/software environments that make this easier (with a major upswing from the portability efforts of the late 1960s, and then especially, UNIX V7, which actually got people to think you could have the same OS on wide varieties of hardware, considered a bizarre idea before that.)

    c) The number of people involved.

    4) What has *not* changed is:
    a) Some software is written with the implicit hope that other people may find it useful, either pedagogically or directly. It has always been a source of prestige in this turf to have written or contributed code that gets wide use. People may put in great efforts to document, portabilize, provide test cases, fix bugs, etc. I certainly did so with ASSIST in the early 1970s.

    b) Other software is written to do research on something else.
    The software may prove more generally useful (and hence worth some more software engineering) but it may also be that:
    – No one cares about the software (although they may care about the data), because they want to analyze the data their own way or run different experiments anyway.
    – No one ever cares to actually run the software, but they might want to look at it because they think there are bugs.

    It is always good to improve sharing; it is so nice not to have to make mag tapes and mail them.
    But while a) and b) overlap, they are *different*. Back at Bell Labs, lots of people wrote lots of code. There was a great internal library system for writing memos, tagging them and disseminating their existence to those who selected the tags.
    But, while we built lots of tools in the hopes that, say, {physics, semiconductor, etc} researchers might use them, we did *not* want them to spend a lot of time doing software engineering. We wanted them to do research. If somebody wrote something that might be more useful, at some point, we’d try to get them software engineering help, or maybe the software would get moved to some support organization.

    Anyway, a) and b) are *not* identical and it is especially important not to over-generalize in either direction.

  19. Bill Harris says:

    Emacs org-mode and Babel are another way to share code and data in a “literate analysis” sense. I can use org-mode to organize and run the analysis, I can publish it to HTML, and I can publish it to PDF or ODT. I can tangle or weave the document. When I’ve published it to PDF, I’ve also included the data and the raw .org files as PDF attachments.

    Still, Andrew’s second point remains true for me, too. There is a bit of overhead (usually tiny) in setting up and running a script in org-mode compared to creating and running a script otherwise (mostly when I forget some of the Babel conventions that make things work the way I want). I just started a small analysis that I wanted to see quickly, and so I just wrote scripts. Now it would be handy to publish the results as if I had taken the org-mode approach.

  20. Rick G says:

    Suppose that sharing data required no effort. I think the obstacle would still be this: people are rightfully afraid that some of their results are wrong, and that people will see this clearly with access to the data. Assume that only 10% of the results are wrong (a very conservative estimate). Even then, for every 10 data sets you put online, on average 1 of them will be wrong, and if someone identifies that it is wrong and contacts you about it, you are faced with a problem. Having to retract even one paper could irreparably damage your reputation. Even having a controversy associated with your name could irreparably damage your reputation. At that point, claiming adverse selection, and that the un-discussed 90% of your data-sets were perfectly good, will not save you.

    So even if you could publish data and code in a way which was both easy to contribute and easy to understand (and there is a tradeoff between the two), the adverse selection of your worst contributions by the community, and the corresponding impact on your reputation, would be disincentive enough not to share. That being said, refusing to share with someone who asks directly is probably even worse in expectation, so one might still rationally choose to share with people who email you.

    The only way to avoid this equilibrium is to compel sharing, which will force people to make fewer mistakes (be less lazy) and also for the community at large to recognize that there are a lot of mistakes being made, so if you make one every now and then it is a forgivable offense.

  21. Yoni says:

    +1 on Erik’s comments about best practices in software development.

    I think the key insight there is that what you’re doing *is* software development, regardless of the application (i.e. research). Some key aspects of the software development discipline are well-understood, and any individual working on software should probably pay attention to best practices within the discipline.

    I’d argue that most of the issues you’re running into may be a direct result of poor software development practices, which make your work not only not-reproducible, but also unmaintainable. Respectfully, I’d recommend that you read The Pragmatic Programmer (http://pragprog.com/the-pragmatic-programmer/) and reconsider the way you go about developing software.

  22. John Mashey says:

    Yoni:
    You might want to review my post that ends:
    “Anyway, a) and b) are *not* identical and it is especially important not to over-generalize in either direction.”

    Do you reject what I wrote in 4 a) and b)?
    Which is fine: feel free to argue, citing relevant experience.

    It has long been quite common for software engineers to be appalled at the quality of code written by people who weren’t …

    Bell Labs was one of the all-time great R&D labs, and many of the ideas in the Pragmatic Programmer could be found around Bell Labs in the 1970s, as in this talk from 1977, used in a software project management course and ACM lectures. Level of effort spent on software engineering varied from ~0 to far beyond what most programmers ever experience.

    It was cost-effective to try to build better tools for researchers to use, but it was *never* thought cost-effective to try to get *all* the people in our research division (Area 10) to do their research codes with the sorts of software engineering that other divisions used when building software systems. When I was lecturing managers in the project management course, we always talked about the level of software engineering appropriate to different kinds of projects, since one size did not fit all, even in product-oriented work, much less research.

    • Yoni says:

      Hello John,

      Thanks for your response. I’m certain that any decent software shop stumbles upon principles similar to those found in The Pragmatic Programmer, just as every religion finds compassion to be a useful principle. I think it’s fortunate these books exist nowadays, so that we needn’t waste too much time finding our way. :)

      I agree that there are appropriate degrees of software development rigor depending on the application, and that it’s up to the developer or team to set the bar according to their good judgement. For example, I might agree that rigorous documentation and testing needn’t be necessary for all software development projects. However, I think that rejecting enabling tools such as version control and continuous integration only serves to limit the developer’s (e.g. researcher’s) ability to get their work done. Can you honestly tell me that folks reduce costs by not using version control?

      I cannot claim to have your breadth of experience, but I will speak to mine. I’ve worked on software mainly in these spheres: as a Mathematics and Engineering student at UT Austin, as a Software Engineer at Opower, as a Data Scientist at Opower, and as a research collaborator with UT Austin. Every successful project (small and large) I have worked on has utilized version control and continuous integration at a minimum. Production-worthy products I’ve worked on have, of course, held a much higher bar, but I’d count these as less relevant to the discussion. I have learned many of these best practices on-the-job and been involved in training others, including researchers, developers, and operational folks. In all cases, I’ve seen tremendous increases in people’s productivity and utter delight in their newfound skills.

      Cheers,
      Yoni

  23. I also agree with Erik. I am fully sympathetic with Andrew’s point about the messiness of analysis. I work with language and often, the first pass at something is just exploratory doodling that develops into something more and by the time I get there, I’m just happy to get the ideas and any thought of making the process tidy goes out of the window. But ultimately, I work with corpora which are public data sets and relatively simple queries. Still, I did share my data from previous research until my website redesign – may give it another go with GitHub.

    The problem with Andrew’s reasons is that they go even deeper into the problem with science. They expose the big lie of science as an orderly process. Which I think should mean that all data and code in science should be shared no matter what state it is in. Not just for replication but for credibility checks.

    The analogy with code is apt. Not being able to replicate one’s own analysis is like not being able to debug your own code because you didn’t put any comments in. Not commenting code leads to bad software! You could be the most brilliant programmer in the world but if you produce tight uncommented code, it’s going to be worse in the long run than loose commented code. There are best code practices and tools to support them. Labs have similar practices. It is time that all academic disciplines had them, to take away the last excuse for not sharing data.