Reproducibility in Practice

In light of the recent article about drug-target research and replication (Andrew blogged it here) and l’affaire Potti, I have mentioned the “Forensic Bioinformatics” paper (Baggerly & Coombes 2009) to several colleagues in passing this week. I have concluded that it has not gotten the attention it deserves, though it has been discussed on this blog before too.

Figure 1 from Baggerly & Coombes 2009

Figure 1 from Baggerly & Coombes 2009


The authors try to reproduce published data, and end up “reverse engineering” what the original authors had to have done. Some examples:

  • §2.2: “Training data sensitive/resistant labels are reversed.”
  • §2.4: “Only 84/122 test samples are distinct; some samples are labeled both sensitive and resistant.”
  • §2.7: Almost half of the data is incorrectly labeled resistant.
  • §3.2: “This offset involves a single row shift: for example, … [data from] row 98 were used instead of those from row 97.”
  • §5.4: “Poor documentation led a report on drug A to include a heatmap for drug B and a gene list for drug C. These results are based on simple visual inspection and counting, and are not documented further.”
  • There are also gross statistical errors.
Continuing in the usual theme of my occasional posts, I’ll share what reproducible research means for me in practice.
  • I use Sweave for almost everything. Here is my template. Here is my xetex template if you care about typography.
  • Eventually I save objects that took a long time to compute, set their evaluation to false, and then load the saved object immediately below, but crucially I still have their generative code right there. Here’s what I mean:
    <<computation1, eval=TRUE, echo=FALSE>>=
    ## long MCMC run or whatever here
    object.1 <- thing_that_takes_forever()
    save(object.1, file="computation1.Rdata")
    @
    <<process-comp1, eval=TRUE, echo=FALSE>>=
    load("computation1.Rdata")
    @
    

    So once I was satisfied that computation1 produced the object.1 of my dreams, I could just flip eval=FALSE on the first code chunk and save myself the hassle. But if I found a problem with it later on, I now know where to go to fix it.

  • It is generally not painful to leave any pre-processing / data loading and joining, and recoding in the first code chunk. This will prevent you from having a stylized data file that you don’t know what you did to it, because you actually redo it from scratch every time. It sometimes makes sense to separate this out into a file that you source().
  • For presentations or other destinations, I can just copy the paper.Rnw file to pres-gfx.Rnw, make any necessary changes (to the size in the code chunk argument, for example, to make Beamer-friendly images). Running R CMD Sweave pres-gfx.Rnw on this will ensure that my names are consistent (“pres-gfx-codechunkname.pdf”) and I don’t do something completely different or accidentally use the wrong model on the wrong graphic.
  • I could, if I had truly mastered ediff, easily merge any changes I made for presentation back to the paper.

6 thoughts on “Reproducibility in Practice

  1. My solution to the big computations is to use
    file <- "computation1.RData"
    if(file.exists(file)) {
    load(file)
    } else {
    # big computation here
    save(object.1, file=file)
    }

    (I hope that’s readable.) If I want to re-run things, I just delete the computation1.RData file.

  2. There’s also cacheSweave (and pgfSweave, which uses cacheSweave) which implement a caching mechanism for Sweave documents.

  3. Pingback: Reproducible research | Jarad Niemi

Comments are closed.