The gradual transition to replicable science

Somebody emailed me:

I am a researcher at ** University and I have recently read your article on average predictive comparisons for statistical models, published in 2007 in the journal “Sociological Methodology”.

Gelman, Andrew, and Iain Pardoe. 2007. “Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components.” Sociological Methodology 37: 23–51.

Currently I am working with multilevel models and find your approach very interesting and useful.

May I ask you whether replication materials (e.g. R Code) for this article are available?

I had to reply:

Hi—I’m embarrassed to say that our R files are a mess! I had ideas of programming the approach more generally as an R package, but that has not happened yet.

30 thoughts on “The gradual transition to replicable science”

  1. Messy R files are better than nothing.

    What I find tends to happen with file-based data is that a researcher will gradually munge the data into some form other than the original, either by hand or by script. They tend to alter the scripts as they go, so that later scripts operate on some intermediate output generated by an earlier one. I’ve always tried to work in such a way that there’s a complete script going from the raw data to whatever input is needed for analysis.

    What I find tends to happen with interpreted languages like R is that a researcher will do things in the interpreter rather than in a program file they save. Then, there’s no hope of reproducing what they did.

    Daniel Lee and I are working with Andrew on an applied project now and we’re being very careful that everything is replicable through a makefile, from raw input data through statistical analysis to plots and table output. We’re only at the simulated-data phase right now, but we set things up so that the makefile has a simulate command, a fit command, and a plot-fit command. Running them in succession will replicate our entire set of results. (A rough sketch of the idea, in R terms, appears at the end of this comment.)

    We’ll maintain this approach throughout the project. Of course, everything’s in version control so we can always go back and replicate what we did at any stage of the project. Version control is critical both for collaboration and for being able to back up if you dig yourself into a hole at some point. Using GitHub, you even get a download button that zips up the most recent snapshot of the repository for distribution.

    One might argue our files could use more documentation. I’d argue that the code trail itself, being runnable from a top-level command, is actually a better form of doc because it doesn’t lie. The problem with doc is that it quickly gets out of synch with the code, so reading doc is always dangerous compared to reading the actual code that generated the results. I’ve elaborated on this in both the Stan manual and in a blog post, The Futility of Commenting Code.
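
    A rough sketch of that run-everything idea, in R terms (the real project drives it from a makefile, and the script names below are invented for illustration):

        # running the three steps in order regenerates every result from scratch
        source("simulate.R")   # write the simulated raw data to disk
        source("fit.R")        # fit the model to whatever data is on disk
        source("plot_fit.R")   # produce the plots and table output from the fit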

    • +1

      Not sure about R, but when using gnuplot or Octave my solution used to be a bash file with a heredoc inside it. That way I could also do some sed/awk preprocessing and have it all in one location.

      +1 for versioning. I used Bazaar, and with a bit of tweaking it was very easy to just bzr add and bzr commit files every so often. In case I forgot to commit changes manually (very often!), I also had a daily bzr commit run automatically via cron. Worked quite well.

    • + 1, also!

      I agree that R and other interpreted languages encourage an approach in which data munging is lost in the interpreter, but it doesn’t have to be that way. For the last several years, I have made sure to report data only from Rmd-based scripts (and more recently, web pages built from knitr output and served via shiny), which start with the original data set; a stripped-down example appears at the end of this comment. The final analyses and the code that generated them are posted to a (hidden) permanent web location. That way, anyone with the link could reconstruct what I did and would have a chance to catch errors. This includes myself, years later! At the moment, this approach seems to strike a nice balance: it’s easy to do analyses, and it’s also “easy” for a stranger to understand how to run and view them. With shiny, of course, the stranger can even manipulate them to get multiple views on the data.

      I also agree that the code trail itself forms better documentation than a separate document, but it’s still not very satisfying. This is especially so when variable names are sloppy, out of date, or designed for the writer of the code rather than a reader. This too often characterizes my own code!
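
      A stripped-down sketch of the kind of Rmd file I mean (file and variable names are invented); the point is that the first chunk reads the untouched original data, so rendering the file re-runs everything:

          ---
          title: "Analysis, regenerated from the original data"
          output: html_document
          ---

          ```{r read-raw}
          # start from the untouched original data file
          raw <- read.csv("data/original.csv")
          ```

          ```{r model}
          # every transformation and analysis happens here, in code
          fit <- lm(outcome ~ predictor, data = raw)
          summary(fit)
          ```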

      • I often document broad sections, their goals, and so on, but refrain from documenting the finer bits because I’m certain I’ll change variables, function calls, etc. without making the effort to update the docs.

    • Bob: Very interesting.

      I have been thinking loosely about how close programming comes to being math.

      In math the representations are self-referential and hence cannot be wrong (though they can be misunderstood). Computer programs represent what the computer will compute and hence cannot be wrong about that (though they are possibly much more easily misunderstood)…

  2. Usually this blog just makes me feel bad about the state of statistical analysis in published research, but today it is making me feel bad that I haven’t taken the 8 or so .do files that generate my new paper (involving a raw-data cleaning program, a .do file that runs from two other .do files to compute p-values, global macros that go from one file to the next, and a simulation program that takes about 16 hours to run) and made them in any way coherent for anyone else or adaptable to people who don’t want to tie up their computer for a day. So thanks for the guilt on that.

    As a side thread on this: do any social sciences actually teach students how to write clean, clear, simple analysis/executable files that other people can understand? I’ve gotten better at it, but it is only by reading other people’s files, working with co-authors, and getting mad at myself when I can’t read my own work a month later. I guess no one wants to waste a grad lecture in methods on tips for clear program writing, which is understandable, but probably not optimal for the field.

    • Maybe not even optimal for a course. A few months ago some blogs mentioned a beginner’s guide to code and data written by the economists Gentzkow and Shapiro (faculty.chicagobooth.edu/jesse.shapiro/research/CodeAndData.pdf), a document which seemed to have grown out of a set of guidelines for their RAs.

    • I wanted to share this resource I stumbled across from Kieran Healy, who teaches sociology at Duke: http://kieranhealy.org/files/misc/workflow-apps.pdf I am not in the social sciences field, and I commend Dr. Healy (and the Chicago group mentioned in another post) for teaching their students about the importance of version control and reproducible results. I earned a statistics master’s degree about 20 years ago and I learned lots of stat theory and models. But no instructor ever mentioned that when supporting researchers I would be re-running my code 20 times over with variations for different exposures, different outcomes, different study subpopulations, etc., and would need to track combinations of raw datasets, interim datasets, analysis code, interim tables, etc. – all leading to the final few tables for publication.

  3. I always have a master file that calls up all the other scripts. All the files that transform data begin with ‘cr’ for create. All analysis files begin with ‘an’. Raw data is write-protected in a separate file.
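
    A skeletal version of that setup, with invented script names following the same convention:

        # master.R -- the one file that re-creates everything
        source("cr_clean_raw.R")   # 'cr' scripts create/transform data
        source("cr_merge.R")
        source("an_models.R")      # 'an' scripts run the analyses
        source("an_tables.R")
        # the raw data file itself stays write-protected and is never edited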

  4. That seems like a pretty good method. I like the naming convention. Keeping the raw data intact is really important to me too.

    But let me ask a follow-up. How flexible should we be making these programs? I think it is really easy to make a program that lets people exactly replicate the tables in a paper, and there is some legitimate value in that – it lets people go step-by-step through the logic and see any changes to the raw data that are made.

    But isn’t there a sense in which it is better to provide programs that allow people to manipulate the analyses to some degree? I can think of two ways that is important in the broader replication scheme. The smaller point is one of convenience – do we make it easy to, say, adjust the number of bootstrap or Monte Carlo repetitions, or run only part of the analysis? (A small sketch of that kind of flexibility appears at the end of this comment.) The more substantive point is about what “replication” really means – do we make it easy to change things like the inclusion/exclusion rules on the observations or to change the set of covariates?

    The sort of economic trade-off I’m thinking of for “optimal” program-writing flexibility balances the time and effort required to write these kinds of programs against the value that other researchers, and science as a whole, derive from the exercise. I’m sure other people would include other trade-offs (say, that this would make it really easy for other people to write take-down papers), but I think the idealized trade-off is between time and effort and value added to the field, with the objective function being one that maximizes scientific progress.
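
    One cheap way to get the convenience kind of flexibility is to pull the knobs up to the top of the master script; a hedged sketch (every name here is hypothetical):

        # settings a re-analyst might reasonably want to change
        n_boot   <- 1000    # bootstrap / Monte Carlo repetitions
        run_sims <- FALSE   # skip the slow simulation step by default
        min_age  <- 18      # an inclusion rule, exposed rather than buried in the code
        rhs      <- c("age", "income", "education")   # covariate set

        form <- reformulate(rhs, response = "outcome")
        # ...the rest of the analysis reads only these settings...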

    • There can’t be a single answer here. The ideal of reproducible research is exactly that: your research should be reproducible, so the aim is a beginning-to-end documentation, via code, of everything you did. That serves purposes ranging from making your life easier if and when reviewers want something different to protecting others from claims that can’t be checked.

      But the more your scripts wire in details that are specific to your data and your project, the less useful they are for almost any other project….

      • The “being useful to others” does require additional documentation compared with just allowing somebody to press a button to execute the analyses. And this would include things like saying what your functions are not tested for, what the user may need to change to individualize the function for them (like data names), and probably why some things were done. More work, and as jrc says there is a trade-off.

        • Dan Wright “what the user may need to change to individualize the function for them”:

          I disagree here. Naturally, being helpful is … helpful, but the point of replicable research is to make it easier (ideally, possible) to see exactly what a researcher did, not to understand why.

          There’s a lot of “in principle” there. If somebody used SAS for their statistical analysis, and I don’t have access to SAS, I can’t get past go in repeating results (although the code might still mean something to me). The larger point is that there should be some others who can check.

          But there’s no obligation to provide a manual for others to learn how to do their own research on similar topics.

      • “…the aim is a beginning-to-end documentation… of everything you did.”


        Hence the major inherent problem — it’s a lot more time & effort to produce that careful documentation (probably twice the work).

        Few researchers want to schedule that heavy documentation workload into their project plan, and even fewer have the personal discipline to keep it up in their daily work; it seems peripheral to their primary goal.

        As noted researcher/inventor Thomas Edison remarked:

        “… there are no rules here– we’re trying to accomplish something”

  5. I have sent many of these “would you mind sharing the code and data” emails to researchers in very different disciplines, and am glad to report that they usually get answered, with attachments.

    Bob is right, messy code is better than nothing.

  6. I actually think that the solution to this problem is to develop programming notebooks like the IPython notebook. These allow users to write comments and code within the same file. There are so many great IPython notebooks out there nowadays with replicable code and the ability to test code permutations. This is one reason why I am migrating a lot of my statistical work over to Python (sorry for the partisan comment).

  7. I think you should send the person your code even if — maybe especially if — it sucks. If we all wait until code is beautiful and packaged to share, then only software developers share. Scientists write code to get things done; if it was useful to you, it is useful to others. And, generally, sharing even crappy code leads to new insights, citations, acknowledgements, goodwill, mistake and bug catching, and novel research. Share!

    Code does not have to be packaged to be valuably shared. Also, some would say that scientific ethics requires sharing. That’s a bigger question, on which I would be happy to collaborate for your ethics column!

  8. For some time, I’ve standardized on Emacs’ Org mode with embedded source blocks. That gives me much of what literate programming offers, it lets me document the evolution of my thinking, stream-of-consciousness style, with the associated code and results, it lets me use Bazaar with ease, and it lets me munge the development of the analysis into a final report without too much work. I use the Org mode JavaScript code that lets me create foldable HTML, which is nice to share with the colleagues I’m supporting. (A toy example appears at the end of this comment.)

    I had tried standardized workflows (4 or so R scripts that would read the data, transform the data, do the analysis, and produce a report, I think it was), Sweave or ODFweave, and the like, but Org mode has turned out to be the one that’s worked the longest for me.

    As for the history of commented code, I think one can trace it back to the Plankalkül (1943-45). I don’t know of any high-level languages that predate that one.
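
    For anyone who hasn’t seen it, a toy version of such a file (names invented); the source block is R, and Org can evaluate it in place and carry the results into the exported, foldable HTML:

        * Analysis notes
        Stream-of-consciousness prose lives right next to the code and its results.

        #+BEGIN_SRC R :results output
        raw <- read.csv("original_data.csv")   # hypothetical raw data file
        summary(raw)
        #+END_SRC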

  9. It is good that the article itself is clear enough that it is possible to replicate the method from the description. It would be useful to have the reference implementation available, but it seems that several people have succeeded in implementing the method anyway. GPstuff also has a function for average predictive comparisons.

  10. Journals could require that the data and code underlying a paper be published. If the data is from a commercial vendor, the vendor should be named and the symbol for the data series should be listed, and a few sample lines of data should be shown. The journal editor and/or referees should be able to replicate the paper before publishing it.
