In light of the recent article about drug-target research and replication (Andrew blogged it here) and l’affaire Potti, I have mentioned the “Forensic Bioinformatics” paper (Baggerly & Coombes 2009) to several colleagues in passing this week. I have concluded that it has not gotten the attention it deserves, though it has been discussed on this blog before too.
The authors try to reproduce published data, and end up “reverse engineering” what the original authors had to have done. Some examples:
- §2.2: “Training data sensitive/resistant labels are reversed.”
- §2.4: “Only 84/122 test samples are distinct; some samples are labeled both sensitive and resistant.”
- §2.7: Almost half of the data is incorrectly labeled resistant.
- §3.2: “This offset involves a single row shift: for example, … [data from] row 98 were used instead of those from row 97.”
- §5.4: “Poor documentation led a report on drug A to include a heatmap for drug B and a gene list for drug C. These results are based on simple visual inspection and counting, and are not documented further.”
- There are also gross statistical errors.
- I use Sweave for almost everything. Here is my template. Here is my xetex template if you care about typography.
- Eventually I save objects that took a long time to compute, set their evaluation to false, and then load the saved object immediately below, but crucially I still have their generative code right there. Here’s what I mean:
<<computation1, eval=TRUE, echo=FALSE>>= ## long MCMC run or whatever here object.1 <- thing_that_takes_forever() save(object.1, file="computation1.Rdata") @ <<process-comp1, eval=TRUE, echo=FALSE>>= load("computation1.Rdata") @
So once I was satisfied that computation1 produced the object.1 of my dreams, I could just flip eval=FALSE on the first code chunk and save myself the hassle. But if I found a problem with it later on, I now know where to go to fix it.
- It is generally not painful to leave any pre-processing / data loading and joining, and recoding in the first code chunk. This will prevent you from having a stylized data file that you don’t know what you did to it, because you actually redo it from scratch every time. It sometimes makes sense to separate this out into a file that you
- For presentations or other destinations, I can just copy the paper.Rnw file to pres-gfx.Rnw, make any necessary changes (to the size in the code chunk argument, for example, to make Beamer-friendly images). Running
R CMD Sweave pres-gfx.Rnwon this will ensure that my names are consistent (“pres-gfx-codechunkname.pdf”) and I don’t do something completely different or accidentally use the wrong model on the wrong graphic.
- I could, if I had truly mastered ediff, easily merge any changes I made for presentation back to the paper.