Data problems, coding errors…what can be done?

This post is by Phil

A recent post on this blog discusses a prominent case of an Excel error leading to substantially wrong results from a statistical analysis. Excel is notorious for this because it is easy to add a row or column of data (or intermediate results) but forget to update equations so that they correctly use the new data. That particular error is less common in a language like R because R programmers usually refer to data by variable name (or by applying functions to a named variable), so the same code works even if you add or remove data.

Still, there is plenty of opportunity for errors no matter what language one uses. Andrew ran into problems fairly recently, and also blogged about another instance. I’ve never had to retract a paper, but that’s partly because I haven’t published a whole lot of papers. Certainly I have found plenty of substantial errors pretty late in some of my data analyses, and I obviously don’t have sufficient mechanisms in place to be sure that errors can’t persist all the way through to the end.

I have adopted some practices to try to reduce the kinds of errors that I tend to make. Below, I list some of the things I have found to be risky, and some of the things I do to try to reduce the risk. Where I give a coding example, it's in R.

  1. I sometimes used to refer to the wrong column (or, occasionally, row) of data. The solution here is easy: assign column names, and refer to columns by name instead of by number. Ideally, the column headers are already in the data file; if not, and if it doesn’t make sense to edit the data file to put them in there, then read the data file normally and assign column names with the very next command (there’s a short sketch of this after the list).
  2. I sometimes make a change or fix an error in one place in my code but not another. Suppose I am comparing statistical distributions from two datasets and I decide to summarize them by the 10th, 25th, 50th, 75th, and 90th percentiles, in addition to calculating a lot of other stuff. I might have a block of code that analyzes the first dataset, and an almost-duplicate block that analyzes the second. If I decide to use the 5th and 95th percentiles instead of the 10th and 90th, I might implement that change in the first block but forget to do it in the second. I try to make myself use two mechanisms to guard against this: (a) Label the outputs and look at the labels. In this example, the quantile() command automatically labels the results: quantile(data, probs=c(0.05, 0.25, 0.5, 0.75, 0.95)) returns a named vector whose names tell you which percentiles you’ve selected. Unfortunately, the way I usually work, I often don’t end up looking at those names when it comes time to do something with the results; for instance, I might make an output matrix with something like rbind(results1, results2), where results1 and results2 are outputs of the quantile function. (b) Do as much as I reasonably can with user-defined functions rather than repeated blocks of code. In this example, rather than calling the quantile() function in two different places, I could write a really simple function like myquantile = function(datavec) { return(quantile(datavec, probs=c(0.05, 0.25, 0.5, 0.75, 0.95))) } and call it for both datasets. If I decide to use different quantiles, I edit the function. Simple, but for some reason it’s hard to make myself do it, especially for one-liners like this one.
  3. I sometimes use the wrong variable name in a function, and this problem can be hard to find. Suppose I’m taking my own advice (from item 2, above). I’ve got a bunch of R code to do something, and I realize I’m going to want to repeat the same steps with other data or with a subset of the data. Being a good boy for once, I take the code I’ve written and wrap a function around it. Great, except that I sometimes make an error like this: inside the function I define a variable like std = sqrt(var(dat)), and then later in the function I write something like x = (dat - mean(dat))/stdev. See how I said std in one place and stdev someplace else? Oops. Often this isn’t a problem, because the call will fail with an error and I’ll find the mistake and fix it. But sometimes it is a big problem, because I have defined stdev in the global environment, and R will happily use that value. I don’t have a real solution to this other than trying to be careful (though see the sketch after this list for a partial check). I wish R had a “strict” option or something that would make a function throw an error, or at least a warning, if I use a variable that isn’t local.
  4. I tend to run into problems when I do relational operations across data frames. Say I have data on a bunch of buildings: frame1 has a list of buildings that have certain characteristics, and frame2 has a big list of lots of buildings and some data on each one. Then frame2[match(frame1$buildingname, frame2$buildingname),] gives me just the rows of frame2 that match the buildings in frame1. Right? Well, no: it should be frame2[match(frame2$buildingname, frame1$buildingname),]. Or was I right the first time? Anyway, this is a pitfall for me, but at least I know it and I know to be careful and make sure I get it right. Even so, I can run into trouble if I want to do further subsetting, like looking only at the rows of frame2 for which the name occurs in frame1 AND some other column of frame1 has a certain characteristic, or for which the name occurs in frame1 and some other condition of frame2 is also true. I don’t have a full solution here, but I do a few things to try to reduce the number of problems. The main one is that I try to put everything into a single data frame if at all possible. In this example, I would probably create a new column in frame2 that indicates, for each building, whether it is listed in frame1 or not; and if there are additional columns in frame1, I would propagate them into frame2 the same way (there’s a sketch of this after the list too). This seems simple and obvious, and in the example I’ve given here it is indeed simple and obvious, but sometimes, when I have multiple data sources and maybe data structures that are more complicated than data frames, it’s not so easy.
  5. And finally, two solutions that help detect many kinds of problems. (a) Look at your tabulated data, and any tables of results, in tabular form. What I mean is, don’t just call read.table() and then go ahead and start referring to various rows and columns (by name or otherwise); open the data file, or write out the results from your work, and take a look. This is one thing Excel is really great for: you can read in a giant table and scroll around in it. In R, a wide data frame or matrix gets split up across lines when you print it, so you can’t really see what is going on. (b) Make lots of plots. Many problems will show up as suspicious patterns: you’ll find that a whole column has the same value, or that something that should always be positive has some negative values, or that a variable you thought you had normalized to 1 has some values over 1, or whatever. (A few one-liners along these lines are in the sketches after this list.)
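A few short sketches to go with the items above. For item 1, assigning names immediately after reading the data might look like this; the file name and column names here are made up just to illustrate the pattern:

    # hypothetical file and column names, just to show the pattern
    d <- read.table("buildings.csv", sep = ",", header = FALSE)
    names(d) <- c("buildingname", "city", "floorarea", "energyuse")
    mean(d$floorarea)   # refer to columns by name, never by position like d[, 3]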
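For item 3, R has no “strict” mode, but the codetools package (included with standard R installations) can catch some of these slips: checkUsage() does a static scan of a function and warns about names that have no local definition and no visible binding. It won’t catch every case (if stdev happens to already exist in the workspace it stays quiet), but it flags many of these mistakes before the code is ever run.

    library(codetools)

    f <- function(dat) {
      std <- sqrt(var(dat))
      (dat - mean(dat)) / stdev    # oops: should be std
    }

    checkUsage(f)
    # prints a warning along the lines of:
    #   <anonymous>: no visible binding for global variable 'stdev'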
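For item 4, one way to sidestep the match() ordering question is to build the indicator column with %in% and let merge() do the relational work. This is just a sketch using the hypothetical column names from the item; other_characteristic is a made-up logical column from frame1.

    # flag, within frame2, the buildings that also appear in frame1
    frame2$in_frame1 <- frame2$buildingname %in% frame1$buildingname

    # propagate frame1's extra columns into frame2 (left join on building name)
    combined <- merge(frame2, frame1, by = "buildingname", all.x = TRUE)

    # further subsetting then stays inside one data frame
    subset(combined, in_frame1 & other_characteristic)   # other_characteristic: made-up column from frame1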
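For item 5, the eyeball checks can be backed up with one-liners; View() opens a scrollable, spreadsheet-style window in RStudio or the R GUI, which is about as close as R gets to Excel’s scrolling. (This continues with the made-up data frame d from the first sketch.)

    summary(d)          # ranges, NAs, and suspicious constants show up quickly
    View(d)             # scrollable table, Excel-style (RStudio / R GUI)
    hist(d$floorarea)   # should be strictly positive
    stopifnot(all(d$floorarea > 0, na.rm = TRUE))   # turn the eyeball check into code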

So, there’s a list of some of the common ways I screw up, and some of the things I do to try (sometimes unsuccessfully) to avoid those problems. I’d be interested in hearing other common pitfalls and solutions.

23 thoughts on “Data problems, coding errors…what can be done?”

  1. Great topic. My feeling is that there is a trade-off between productivity and accuracy. The central theme of the solutions you describe above is the addition of quality-control code, whether via more careful upfront labeling, in-code checking, or post-hoc analysis. The commonality is that these steps take time and most often yield no tangible benefit. The benefits are felt when the checks fail; when they pass (which is almost all the time), our brains unfortunately don’t experience much pleasure. Then the temptation arises to cut corners, usually because of some deadline. You’re thinking: this check almost never fails, you can’t afford the time to be so careful, let’s move on. Or it’s a supplementary analysis after the main event – you’re just doing a little bit extra, and your mind is already on some other project. Murphy’s Law sets in: the one time you omit the check is the one time the check would have failed. I think I’m a very meticulous person, but I fall victim to this all the time.

    Adding to your list of common problems… duplicate rows when joining tables are a constant thorn in the side. Typically they are caused by duplicates in tables that supposedly should not have any. My solution is to check for duplicates at each step of the way (a one-line R version of that check is sketched at the end of this comment). This can get very cumbersome.

    Another practice of mine is to keep all code and never overwrite it. If I’m running the same code and saving to a new table, I duplicate the code instead of overwriting the table name. I find that this allows me to retrace the steps of my work, and backtrack if necessary.
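    In R, using Phil’s building example, the duplicate check I mean is basically a one-liner, run before and after each join (a sketch only; the key column name is his made-up one):

        # stop immediately if the join key is unexpectedly duplicated
        stopifnot(!anyDuplicated(frame1$buildingname))

        merged <- merge(frame1, frame2, by = "buildingname")
        stopifnot(nrow(merged) <= nrow(frame2))   # a fan-out here means hidden duplicates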

  2. That match function can be really useful but I always find myself staring at the code for an inordinate amount of time trying to figure out the correct ordering.

      • Yeah, merge is excellent, but typically I use match to keep track of data coordinates across objects with different dimensions. For example #4, though, merge would certainly work better.

  3. 1. Have two people code the same analysis independently, from scratch, with two different tools if at all possible. To me this is the best way. Resource-intensive, but invaluable on critical projects. It is very rare that they make the same mistake and reach identical wrong answers.

    2. Try corner cases. What’s your mean if all individuals had the same wealth, etc.? Write test-data cases before you code the analysis.

    3. Think of some “invariants” for the data and keep checking them at various stages.

    4. Add noise to the data and see if it drastically alters your conclusions. If it does, you have to exercise even more caution.

      • That reminds me—way back when, Gary King and I used to joke that our results were so strong that they were robust to coding errors. I remember one time we accidentally had our standard errors off by a factor of 2 . . .

        • I really did have a model that was robust to one particular coding error I made. It was a robust (no joke) errors-in-variables model. Without getting into too many details: it was a Gibbs sampler, the error left most of a certain set of nuisance parameters unchanged from their initial values, and my model check (posterior calibration and empirical prediction error on a hold-out set) looked good enough for publication. The model worked even better when I found and fixed the error.

  4. First, get a source code control (SCC) system. I think all the cool kids use GitHub these days, but you can use whatever is available. Using a public one is a good way to share your code and data automatically. Commit your code often. Start by committing every change, until you get a sense of which code changes don’t need to be committed. In a language like R, your data is really part of your code, so you need to have it in your SCC system as well.

    The solution to the problem that Kaiser describes about positive feedback on test code is to have a system that generates that feedback for you. Take Phil’s function example in item 3. Phil should write a test that runs that function against a known dataset and checks the result. If the test gets the expected result, it gives Phil a pat on the back. Ok, actually it just sets a value to “passed” for that test. Gradually, you build up a test suite that ensures that your common functions are all working the way you expect. And you run them every time any of your code changes. It takes some discipline, but it will be faster in the long run.

    • I love Bazaar. One side benefit is that I’ve become a lot braver about making code changes, knowing that the worst I’ll ever lose is only so much. ‘bzr revert’ is only a keystroke away.

    • Exactly – Version Control and Unit Testing. I haven’t made the effort to learn GitHub – I’ve found Subversion to be quite adequate, and Google Code to be a decent (and free) host. For unit testing, JUnit for Java, NUnit for C# [Is there an equivalent for R?]. Write the unit test before you write the method… a lot of people skip that in a rush for time, then regret it later.
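      For R, the testthat package plays the JUnit/NUnit role. A minimal sketch of the kind of test described above, using Phil’s myquantile function (the expected names below are just what quantile() produces for these probabilities):

        library(testthat)

        myquantile <- function(datavec) {
          quantile(datavec, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
        }

        test_that("myquantile returns the five agreed-on percentiles", {
          result <- myquantile(1:100)
          expect_equal(length(result), 5)
          expect_equal(names(result), c("5%", "25%", "50%", "75%", "95%"))
        })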

  5. Phil, forget using R commands like match to do anything complicated; sqldf and SQLite are your friends. Do your table munging and combining in SQL and your analysis in R (there’s a quick sketch at the end of this comment).

    Also I second the advice to use git, not necessarily with Github or a public repo, but do use it. I personally use monotone because I’ve been using it for a long time and it was one of the major inspirations for git, but I plan to convert to git when I’m done with my current set of projects.
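    For Phil’s building example from item 4, the sqldf version of the table combining might look like this (a sketch only; sqldf runs the SQL against ordinary data frames in the workspace):

        library(sqldf)

        # rows of frame2 whose building also appears in frame1
        sqldf("SELECT frame2.*
               FROM frame2
               JOIN frame1 ON frame2.buildingname = frame1.buildingname")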

  6. Pingback: An argument for why careful coding in R is better than an Excel spreadsheet | Sam Clifford

  7. I think you nailed it on the head, pointing out that R & R’s expertise is in economics, not statistics or programming. I took a quick look at the Acknowledgments page of “This Time is Different” & noted no mention of anyone helping them with their data processing. So it does look like a garden-variety case of carelessness & hubris.

    I agree with the various comments around the Web on this case, that anyone doing serious primary data analysis in “front-end” Excel (rather than back-end Visual Basic) is probably making a terrible mistake. Proper source code files should be written that can be proofread & stepped-through in proper fashion. Not cell-by-cell formulas that make errors like R & R’s easy to miss.

    As a personal note, I think you should have included a disclaimer linking to Krugman’s critique, since (a) he’s a noted professional rival of Rogoff’s & (b) as anyone even casually familiar with Krugman knows, he looks at the world through his own rose-colored glasses & so isn’t a great model for unbiased commentary.

    • 1. Yes, it’s too bad they didn’t make their data public right away, then it seems likely the problems would’ve been discovered much earlier.

      2. Yes, Krugman’s got an attitude, that’s for sure, and I’ve commented on this on the blog from time to time. I also linked to Tyler Cowen, who’s got his own set of attitudes. I usually learn a lot from Cowen and from Krugman. Both are thoughtful, while coming from much different positions politically.

  8. More specifically to the titular question “what can be done?”: as soon as one identifies & acknowledges that what’s actually being done is writing software, the question can be constructively reframed & researched as “how can software quality be improved?” Fortunately, there are any number of angles & approaches. I found the following book extremely & immediately helpful, though it’s terribly edited & cheaply printed: “Software Engineering Best Practices” by Capers Jones.

  9. Nice post!

    At the company I work for, we had our quota of errors last year, and one of our goals this year is to decrease the number of errors. One thing we implemented as a team is that every analysis has to be validated by an independent person on the team, who has to write their own code (they can’t reuse the code that produced the original analysis). And we have a checklist document that we have to fill out before saying the analysis is done. No analysis is done until we’ve filled out the checklist.

    Best,
    Manoel

  10. For me, an important change was to do all the data processing at the beginning of the script. If I’m halfway through an analysis and I decide that I need to create a transformed variable, or more importantly if I need to change an existing variable, I go back to the data-processing block at the top of the file and create the variable there. You can get into a lot of trouble if you’re performing an analysis and your variables change halfway through.

    Also, with match() I’m fairly sure that the correct order of the arguments is always the second way you try it. They must have some kind of internal counter for it.

  11. Pingback: When economic professors making simple (MS Excel) coding mistakes … | Pairach Piboonrungroj

  12. I used to be in academia, but more recently have done research work at a state agency that prides itself on the accuracy of its reports. One simple way we do that: no number appears in a report (anywhere, whether in a table, chart, graph, or simply in the text) until a second person who did not do the original analysis starts from the raw data and replicates the number. We catch a lot of stuff this way — mostly minor (36.8% should really be 37.2%, etc.), but at other times saving us from considerable embarrassment.

    One of the more interesting catches we made was when an analyst using Stata to fact-check caught an error made by another analyst who had used SPSS, because Stata spit out an error message where SPSS had not.

  13. Test data is particularly useful. Use a small set of data in exactly the same format and know what answer you are expecting to get. If at all possible, automate running this test data at the start of any run and alert somehow if the results are wrong.
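    In R this can be as simple as a few lines at the top of the script; run_analysis and the expected value here are made up for illustration:

        # tiny dataset with a hand-checked answer, run before the real data
        test_input <- data.frame(x = c(1, 2, 3), y = c(2, 4, 6))
        expected   <- 2                      # the slope of y on x, worked out by hand

        got <- run_analysis(test_input)      # hypothetical function wrapping the real analysis
        if (!isTRUE(all.equal(got, expected))) stop("test data gives the wrong answer")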
