Lessons learned in Hell

This post is by Phil. It is not by Andrew.

I’m halfway through my third year as a consultant, after 25 years at a government research lab, and I just had a miserable five weeks finishing a project. The end product was fine — actually really good — but the process was horrible and I put in much more time than I had anticipated. I definitely do not want to experience anything like that again, so I’ve been thinking about what went wrong and what I should do differently in the future. It occurred to me that other people might also learn from my mistakes, so here’s my story.

The job was a ‘program review’ for a large, heavily regulated company that has an incentive program to try to get customers to behave in certain ways that both the company and its regulators feel are desirable. (Sorry, I have to remain pretty vague about this). The company is required to have an independent contractor review the program annually, to quantify the effectiveness of it and to predict how the program would perform under a few hypothetical futures. The review had been performed by the same consulting company for several years in a row, and it looked like they had done a competent job, so my partner and I felt that we were at a substantial disadvantage bidding against them: the other company already had experience with the data and the process, they had computer code already written and personnel who knew how to run it, and they would presumably do most of the work with junior staff whereas my partner and I are both quite experienced and not willing to work for the low wages a junior person would be happy with. Still, we saw several places where we were sure we could improve on the previous work, and we had been told that the company likes to switch consultants every now and then, so we put in a bid, and we got the job.

An initial stage of statistical processing was needed to turn the raw data into the numbers that feed into the statistical model that produces the outputs, and it’s in that initial stage that we made our improvements. For the model itself, which was in my bailiwick, I was going to use pretty much the same model that had been used in the past. The previous reports noted several potential problems with the model, and said they had checked and found the effect of those problems to be small. There was one thing about the model I really didn’t like: the use of two very highly correlated explanatory variables in a linear regression model. That’s OK if you are making predictions for regions of parameter space that are well sampled in your training data, but it can be a big problem when extrapolating, which is what we needed to do. So I changed the variables in the model to use a less highly correlated set, checked that the model still fit about the same, and kept plugging along.
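
To make that concern concrete, here is a toy illustration in R (simulated data, far simpler than the real model): two nearly collinear predictors fit the training data about as well as a less correlated pair, but they can give a very different answer when you ask for a prediction outside the region the data cover.

# Toy illustration only: simulated data, nothing like the real model or data.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)      # nearly a copy of x1: highly correlated
x3 <- rnorm(n)                      # a less correlated alternative
y  <- x1 + rnorm(n)                 # the truth involves only x1

fit_correlated <- lm(y ~ x1 + x2)   # the sort of model I inherited
fit_cleaner    <- lm(y ~ x1 + x3)   # the sort of substitution I made

# In-sample, the two fits look about the same...
c(summary(fit_correlated)$r.squared, summary(fit_cleaner)$r.squared)

# ...but at a point where x1 and x2 do not move together the way they did in
# the training data, the correlated-predictor model can be far off while
# reporting nothing obviously amiss.
newpt <- data.frame(x1 = 0, x2 = 2, x3 = 0)
c(predict(fit_correlated, newdata = newpt), predict(fit_cleaner, newdata = newpt))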

So everything was going fine…right up until the first major ‘deliverable’, about five weeks ago. All the client needed was a few hundred numbers in a table, representing the estimated performance of the program under those hypothetical futures I mentioned earlier. The numbers were due on a Monday, and I had them ready on Friday: I was looking forward to a weekend getaway with my wife and some friends, and didn’t want to have any nagging worry about not being able to submit the numbers on time. So, Friday I sent the numbers to our project manager at the company…and got back an email that afternoon that said: why are the numbers so low? Last year they predicted such-and-such, and you guys are 20% under!

My first thought was: uh oh, where did I screw up? Each of the final numbers is the sum of several other numbers, maybe I forgot to add one of them? Or maybe the problem is in the pre-processing pipeline, which would cause a problem with the inputs to the model? Or maybe…

I worked through all of Friday night, checking one thing at a time, but I didn’t find anything wrong. When my wife woke up Saturday morning I had to send her off with our friends on their getaway; I kept working. I worked all Saturday and still couldn’t find the problem; ditto Sunday, until Sunday afternoon, when I finally roped my partner in and we spent the rest of the day cleaning up my analysis code and checking as much as we could check. And we did find some things wrong! I had failed to adequately specify the merge parameters in R’s ‘merge’ function, which was causing some numbers to end up in the wrong places; the client had sent us some corrections to their data but those hadn’t made it into our database; and a few other things. None of these turned out to make much difference, but the fact that we were finding little errors made it seem possible that there was one (or more!) big one out there. And it was frustratingly hard to check or test things, because I continue to use some sloppy programming practices that I swore a year ago I would improve upon.

Finally Monday afternoon rolled around. We had checked every module and every bit of code, and we could not find any problems at all anymore…and we were still generating numbers that were 20% under what previous years had reported. We had no choice but to submit our numbers just before the 5 pm deadline, with a nagging worry that there was still a major problem. When we submitted, we asked: “obviously nobody is going to start using these tonight, and there are still things we would like to check out; when do people actually start working with these?” The answer: start of business Tuesday; that’s why they were due Monday. So we kept working Monday evening after submitting the numbers.
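
An aside about that merge mistake, with a made-up example (nothing like the actual data): if you don’t tell R’s merge() exactly which columns to join on, it joins on every column name the two data frames happen to share, and rows can silently vanish.

# Made-up illustration of how an under-specified merge() can silently drop rows.
accounts <- data.frame(id = c(1, 2, 3),
                       region = c("N", "S", "S"),
                       usage  = c(10, 20, 30))
adjust   <- data.frame(id = c(1, 2, 3),
                       region = c("N", "S", "W"),   # 'region' coded differently here
                       adj    = c(0.1, 0.2, 0.3))

# Default behavior: merge on ALL shared column names (id AND region), so the
# row whose region codes disagree quietly disappears.
merge(accounts, adjust)

# Safer: name the key explicitly, keep every row from the left table, and
# check that nothing went missing.
m <- merge(accounts, adjust[, c("id", "adj")], by = "id", all.x = TRUE)
stopifnot(nrow(m) == nrow(accounts), !any(is.na(m$adj)))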

Through all of this, my assessment of the situation had changed. On Friday, when first told that our numbers were different by 20%, I would have said there was a 90% chance the problem was my fault. Saturday night, maybe I would have said 60%. Sunday night, 30%. Monday evening, maybe 10%. So I was 90% sure we were OK, but that still meant a 10% chance that we weren’t!

Then, around midnight Monday night, my partner and I were looking everything over once again. I was looking at the previous year’s report, and I said “I just don’t understand how they could possibly get the numbers in the second column of such-and-such a table, these completely disagree with us and they don’t even seem plausible.” And then I noticed a footnote to the table: “Numbers in the second column have been adjusted under the assumption that…” And suddenly it all fell into place. The assumption was wrong, indeed easily demonstrated to be wrong, and it was wrong enough that replacing it with the right answer made about a 20% difference. My partner and I went to bed. I got a full night’s sleep for the first time in days.

So were those three days the visit to Hell I alluded to? Oh no, that was just the start. First, we faced some pushback from the client. We were telling them (and ultimately the regulators) that the program was less effective than they thought. They were reluctant to believe it, although (to my relief) they were not inclined to shoot the messenger. Still, for the next couple of days I did little but try to document the evidence that the previous report had gotten it wrong. So all in all this little episode cost about a week. When you’ve only got five weeks to write the final report, and you spend one of them on something you didn’t anticipate, the pressure mounts. Still not Hell, though.

The real problem was: having realized that the previous work had involved this major misjudgment, I no longer trusted that work in other ways. Specifically, I no longer took it for granted that the statistical model that had been used in the past was adequate. I finally did what I should have done months (!) earlier: rather than make a few plots and tabulate a few things to confirm that the model was behaving OK, I started looking for evidence that the model had significant problems. And sure enough: the model had significant problems. I spent several solid days coming up with a final model that I was willing to stand behind…which doesn’t sound so bad, but it significantly expanded the workload: now, instead of just saying “we used the same model they used last year, except for such-and-such a minor modification”, we needed to explain why we did what we did, quantify how much difference this made, and so on. Of course this meant that the numbers we had submitted earlier — a week and a half ago, at this point — needed to be changed, because we now had a new model, so that was a slightly embarrassing issue with the client. And we needed more tables, more plots, a section comparing the models and explaining the differences that were due to the model (in addition to a section explaining what had changed about the program; these two now had to be untangled).

All of this with the clock running. Having lost a week diagnosing the problem with the previous model and convincing the client that it was indeed a big problem, and then another coming up with an improved model and figuring out the implications, we now had about three weeks to write the report. Five had already seemed pretty tight for writing what we expected to be a simpler report. Having 3/5 of the time to write a report that was 5/3 as complicated (or something)…well, we got it done, and I am pretty proud of it, but the five weeks from the start of this narrative to the moment we finished were as unpleasant a period of work as I’ve experienced in 25 years.

What are the lessons I am taking away from this?

The biggest lesson is: You know your model is not perfect, which means you know there are ways in which the answers it generates are wrong. You need to know what those ways are, and see whether the results are good enough for your purposes. If they aren’t, you need to modify the model. My initial mindset was more like “this is the model that was used in the past, let me check a few things and see if it makes sense”, and that’s completely wrong. My goal should have been to try to demonstrate that the model was inadequate, not just to check whether it seemed adequate.

The second lesson is: I really do need to code better. I did not find any really major errors in the work that I had done to generate the first set of numbers, but I could have; indeed, it seemed so plausible that I had made a major mistake that I assumed that must be the problem. Right now I am very sure that my analysis, and the code that implements it, does not have any mistakes of practical significance, but that was definitely not true at the time we submitted our first results, and it should have been. This is partly a matter of simply allowing more time, but there’s a big component of better practices: create modules that can be tested independently, and then create tests for them; those are two of the big things I didn’t do initially.
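
To make that concrete, here is the sort of thing I have in mind (the function and column names are invented for illustration, not taken from the actual project): pull a processing step out into a small function with explicit inputs and outputs, and give it a test that can be run on its own, with the testthat package or even plain stopifnot().

# Invented example of a testable module, not code from the project.
library(testthat)

# A small, self-contained step: aggregate per-customer usage to group totals.
total_by_group <- function(df) {
  stopifnot(all(c("group", "usage") %in% names(df)))
  aggregate(usage ~ group, data = df, FUN = sum)
}

test_that("total_by_group sums usage within each group", {
  toy <- data.frame(group = c("a", "a", "b"), usage = c(1, 2, 10))
  out <- total_by_group(toy)
  expect_equal(out$usage[out$group == "a"], 3)
  expect_equal(out$usage[out$group == "b"], 10)
})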

Finally — and this is a minor one compared to the others — I should have been a lot less trusting of the previous work. In fact there were some major red flags. I already mentioned the use of two highly correlated variables in a regression, with no attempt to ensure that they could reasonably be jointly used to extrapolate beyond the range of the data, nor even any discussion of the issue. Also (perhaps related to that one), I had of course noticed that the previous analysts had not tabulated the regression coefficients of their model. I assumed that was just because the intended audience didn’t care about, or wouldn’t understand, such a table…but shouldn’t they have put it in anyway? (In fact, the regulators require it, as anyone who checked would have found.) In essence, I had made the mistake of judging the book by its cover: I assumed that since the report appeared to have been done reasonably well, it actually had been.

I may not be able to eliminate these mistakes completely — especially the coding one — but even half-measures there are better than no measures at all.

Do as I say, don’t do as I did: (1) Always try to find all of the ways your model is significantly wrong, and to understand the magnitude of its deficiencies. (2) Implement the model in code that can be tested one module at a time. And (3) if someone tells you some specific model is good enough, don’t believe them until you’ve tried to prove them wrong…really this is just a corollary of #1.

Good luck out there!

30 thoughts on “Lessons learned in Hell”

  1. Of all of your posts that I read, I liked this one best. Recognizable.

    I tell my students that if their code and my code provide the same results, we perhaps made the same error, and if our results are different we likely made different errors.

    • I feel the need to clarify that this post is by me, Phil Price, not by Andrew. If this is your favorite of -my- posts, that’s great, thanks!, but since I only post a few times per year it’s not that surprising: logically _some_ post has to be your favorite, and why not this one? But if you thought I’m Andrew and this is your favorite out of all of his zillions of posts, that’s incredibly flattering. Also hard to take seriously — this one wouldn’t be in my top 50 of all the posts on this blog! But thanks either way.

  2. Clueless PhD student here. I’m wondering about the following point: “[Multicollinearity is] OK if you are making predictions for regions of parameter space that are well sampled in your training data, but can be a big problem when extrapolating…”

    Surely correlated predictors are bad regardless of whether or not you’re extrapolating? The standard errors of the slope estimators must have been massive. I understand that, in a lot of practical applications of statistics, maybe most people don’t care about theoretical issues as long as the model *works*, but what could possibly be the advantage of including correlated variables?

    • When the goal of a regression model is interpreting the coefficients, then having highly correlated features can be bad, as the coefficients have high standard errors. When the only goal is fitting the target variable (which I’m guessing it was in this case, as they had to deliver a table of estimated program performance), then highly correlated features are okay, as the standard error of the forecasts is not inflated by the correlated features.

      Consider the following. I set up a linear model that is y = 1 * x1 + 0 * x2 + e, but x2 and x1 are nearly identical. Here’s some R code:
      x1 <- rnorm(100)
      x2 <- x1 + rnorm(100, 0, 1e-10)   # x2 is x1 plus tiny noise: almost perfectly collinear
      e  <- rnorm(100)
      y  <- x1 + e                      # true coefficients: 1 on x1, 0 on x2
      lm_fit <- lm(y ~ 0 + x1 + x2)     # no intercept, both collinear predictors included

      If I run this code, I get coefficients of about 8268 and -8267 (these are way off), but if I look at the fitted values, they are nearly identical to those I would get if the coefficients were perfect (1 and 0). However, if I add a strange new point, say x1 = 0, x2 = 1, I would get a fitted value of about -8267 when the appropriate fitted value is 0.

      • Bryan, thank you for that excellent example. But I think you meant for the coefficient of each of the variables to be 1 in the initial model, or perhaps 1 and -1.

        • Ah yes, I did mean for them to be 1 and 0, but now that I look at the example, it may have been better to have both coefficients be nonzero. The example I provided does show how the high standard error of the coefficients yields good fitted values for normal points and bad fitted values for abnormal points, but it doesn’t show the value of including both variables (which would only be apparent if both of the variables actually mattered).

    • The advantage is that your error on CV or a test set could be lower when you include them. There’s no theorem that says using correlated variables in a regression will result in lower expected predictive accuracy. Also, I’m pretty sure there’s nothing that links variable significance to variable importance for prediction.

      As Phil mentioned, the risk is that you can be far outside well-sampled areas of your training data, but a regression prediction interval will reflect this uncertainty.

      • Your regression prediction interval may reflect it, but you get a high risk of misestimating the coefficients so that the point prediction may be terrible. Given how much people obsess about point predictions… Of course, assuming that some effect that is approximately linear within a small data range continues to be linear outside the data range is also pretty scary (um, Challenger example).
        set.seed(123)
        x1 <- rnorm(100)
        x2 <- x1 + rnorm(100, 0, 1e-6)   # nearly identical to x1
        e  <- rnorm(100)
        y  <- (x1 + x2)/2 + e            # true coefficients: 0.5 and 0.5
        lm_fit <- lm(y ~ 0 + x1 + x2)
        summary(lm_fit)                  # wildly unstable coefficient estimates, huge standard errors
        predict(lm_fit, newdata=data.frame(x1=0, x2=10), interval="prediction")   # far outside the data: wide interval, unreliable point prediction

        • I think we are all in agreement. In Bjorn’s example, he is trying to predict the value at (0,10) but has no data anywhere near there.

          As Kevin says (below), it’s often the case that several parameters are highly correlated and also predictive; you don’t want to leave any of them out, but you have extrapolation issues if you include them all. A Bayesian model can help (a lot) in applying some structure to avoid overfitting — that’s really what we’re talking about, you end up fitting noise in addition to signal — and this falls under the category “sometimes the easiest thing is to do it the hard way”. Indeed, in this project I did create a Stan model and start experimenting with it, and maybe in the long run it would have been better to go ahead and finish that and use it. If I were the audience, I would have, but there is still a big burden to using a Bayesian model in terms of the extra explaining you have to do. It just isn’t a familiar tool to most people. It is kind of sad and amazing to say that, considering it has been 25 years since I published my first paper involving a Bayesian hierarchical model (with Andrew as an author) and I certainly would have thought this stuff would be utterly mainstream by now.
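
          To make the regularization idea concrete without dragging in Stan: a normal(0, tau) prior on the coefficients is, for this purpose, just ridge regression, and even a little of it keeps two nearly collinear coefficients from flying apart. A toy sketch in base R, borrowing Bjorn’s simulated data from above (an illustration of the idea, not the model we actually used):

          set.seed(123)
          x1 <- rnorm(100)
          x2 <- x1 + rnorm(100, 0, 1e-6)        # nearly collinear, as in Bjorn's example
          y  <- (x1 + x2)/2 + rnorm(100)        # true coefficients: 0.5 and 0.5
          X  <- cbind(x1, x2)
          lambda <- 1                           # plays the role of sigma^2 / tau^2
          coef(lm(y ~ 0 + x1 + x2))                         # unpenalized: typically huge, opposite-signed estimates
          solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)  # ridge / posterior mode: both near 0.5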

    • Many times when building predictive models, there is an unmeasurable intrinsic quantity (or potentially several distinct intrinsic quantities) that’s causally related to your response variable. You have to make do by using other variables correlated to this intrinsic quantity. All of these variables will be correlated with each other, but including more independent measurements will usually improve your model. You do have to be careful about extrapolation though, as mentioned.

    • The book “The Pragmatic Programmer” is good. I read it last year, although, as my story illustrates, I am not walking the talk. But I am doing better than I used to, and that’s something. “Clean Code” is also supposedly pretty good, haven’t read it.

      Another thing that helps is to look at good code. Actually maybe this is better than any book. My partner, Sam, is a good programmer, so all I have to do is look at the functions that he writes and try to figure out what it is that makes them more compact, readable, and understandable than mine.

      And when I had Sam helping me on the Sunday and Monday that I described above, one of the things he did was to clean up some of my code: break chunks of code into functional blocks and turn them into functions…simple refactoring, really quick to do. Of course I am used to doing this for any code that needs to be used more than once, but in this case it would be ten or twenty or thirty lines of code that just have to be run through in order, so why bother moving it out of the main program? The answers are: (1) it makes it much easier to see the dependencies — this block of code depends on these specific variables, and its entire purpose is to generate these other variables, (2) it gives you a function that you can test independently if you want to, (3) it reduces the cognitive burden of keeping track of what happens where in the program. Any one of these would be reason enough to do it.
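
      A schematic example of what that refactoring looks like, with made-up variable names rather than anything from the actual project:

      # Before: a run of lines sitting inline in the main script, e.g.
      #   d <- raw[raw$year == 2023, ]
      #   annual_kwh <- sum(d$kwh) * site_factor
      # After: the same lines as a function whose inputs and outputs are explicit.
      annual_kwh <- function(raw, year, site_factor) {
        d <- raw[raw$year == year, ]
        sum(d$kwh) * site_factor
      }

      # The dependencies are visible at the call site, and the function can be
      # tested on a tiny handmade data frame:
      toy <- data.frame(year = c(2022, 2023, 2023), kwh = c(5, 10, 20))
      stopifnot(annual_kwh(toy, 2023, site_factor = 2) == 60)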

  3. If it took you a whole three years to run into a badly-done report I’m really impressed with the profession. After college it took me exactly one project (for an environmental non-profit) to run into a huge Excel spreadsheet that was meant to calculate cancer risk and was off by an order of magnitude. Is it really this good out there? :)

  4. Maybe I’m misunderstanding something here, but if the prior consultants had gotten the job, was it assumed that they’d re-do the analysis from scratch? If it wasn’t, how is your discovery of major errors not a case for renegotiation?
    (Or put another way: if you follow through with your resolutions, doesn’t this lead to budgeting an amount of time that will put you out of competition for similar projects, because everyone else offers ‘good enough’ (which, to be fair, it often is) and you offer ‘pristine’, and the company would prefer to pay for ‘good enough’ and lie to itself about having bought ‘pristine’?)

    In similar cases I’ve had to tell clients that, sorry, with the errors that were discovered the original time frame is no longer realistic. This is highly unfortunate and they were right to expect this wouldn’t happen, but nonetheless it has happened and it certainly isn’t my fault either. They can either have the slightly improved version on time, which I won’t vouch for (but which still leaves them better off than if they had continued to hire the previous consultants), or wait for a from-scratch solution.

    • My sense was similar to Markus’s – somehow the scope of responsibility needs to be limited rather than left so apparently open-ended.

      I also think it was a really nice post for this blog (perhaps because I have a hard time thinking of what to post on that Andrew doesn’t repeatedly post on).

      I used to warn new clients (academic clinicians) that whenever and whoever discovers an incorrect data value, they have to pay for the analysis to be completely redone. That seemed to work well, but how to limit responsibility for software to continue to function well – after initial installation and testing – might be tricky.

      I know it’s done https://en.wikipedia.org/wiki/Phoenix_Pay_System#Causes
      (When it’s likely a billion dollar problem – the client is really holding the wrong end of the stick.)

    • My partner has a lot more experience with client relations than I do, so I let him make the call on this. We did discuss the fact that doing forensic analysis to determine what was wrong with models used in the past was not part of our contract, so we could have at least had that discussion with the client. On the other hand, the contract did call for discussing year-over-year changes in the program, so we certainly owed them something. And we thought that if we delivered a really good product on time and on budget this year, we should have the inside track for the next several years: now we’re the ones with the models and the code and the experience, so we can bid lower in future years and perhaps bring in some junior people to do the routine stuff while we try to move the ball forward on some other stuff (we have discovered several ways in which we think the analysis can be improved further). Perhaps that would have been the case even if we had tried to get paid extra for the increased scope, but perhaps not. In fact, although we aren’t privy to the other consultant’s bids of course, we think we were more expensive this year than they would have been, so even if we could have gotten more for the increased scope there’s the worry that it would have made us appear too expensive in the future. And finally, my partner and I feel about the same way about our work, which may not be the best way for consultants to think: we judge our work based on our own standards, and we keep working on stuff until we think it’s good. It might be smarter to stop as soon as it’s good enough to be acceptable to the client — in this case the client was satisfied with the model that had been used in the past, so we could have used it too — but that just feels wrong to us.

      To Mike’s question (below) of whether it was worth it…well, I hope so. I suppose it’s too soon to say. One thing that is already happening is that those horrible five weeks already seem less horrible…and of course there are people who work two jobs trying to make ends meet, while also trying to make it to PTA meetings and take care of their aging parents and meet all kinds of other obligations, so how horrible was it really? Perhaps, as with the final stages of finishing my dissertation, in a few years I will feel almost nostalgic for those days and nights: working all the time, ordering in for food so I could stay at my computer…ah, that’s living? If the client is really happy and sends more work our way, then it was probably worth it. If that doesn’t happen then…well, no, the dollars per hour that I earned were not worth it. But that doesn’t mean I didn’t have to do the work: once I accepted the contract, I had to do the work at the agreed price. Yes, we could have argued that the scope had increased so we needed more money, but it’s not obvious we would have won that argument since we were on the hook for comparing this year to last year. Ask me in a year, I’ll tell you if it was worth it.

      • > If the client is really happy and sends more work our way, then it was probably worth it.
        Hopefully.

        And hopefully you avoid an early experience of mine. A fellow MBA graduate and I were working on programming data analyses for some faculty – given we were unable to find full time employment right after graduating. One of them asked if we were up to some consulting for a very large well known firm. We said yes.

        Now, for the faculty we were using Lotus 123 and were told to make the reproducibility of our code our first priority. We were given adequate time to do that (and it was later tested by someone who was not permitted to contact us). The firm, on the other hand, wanted software for some task they needed, and its reproducibility, beyond having comments in the code, was not discussed. After completing the code and their initial testing of it, they were happy and took the both of us out to lunch.

        While chatting over lunch they mentioned that they often use their size to encourage consultants to bid low on initial contracts in hopes of getting future business. They then said they use this to their advantage and always try to get bids from new consultants.

        Later, with some embarrassment over what they just told us, they asked if we wanted to do more consulting. We both said no and stressed that we had to focus on finding full time employment.

        When we told the faculty member about the conversation and our decision he said – “Good, you learn quickly. Now if and when they call you to ask about anything you did, you tell them that you will do your best to answer their question in 15 minutes. Anything more needs to be paid for. And after a couple of those calls, just say that you’re too busy.” I think he either knew or suspected this company did that sort of thing – he told us he did not consult for them anymore.

        Of course, the ending was sweet. They hired a recent comp. sci. grad full time and assigned our program to him. After the first 15 minute answer, he sought more input directly from Lotus 123. His second call was to report that they did not think what we had done in their software was actually possible (we wrote macros to write macros and extended the capabilities by using file saves and retrieves). We said that would take more than 15 minutes to explain and apologized that we were too busy.

  5. What I get from this story is that you ended up spending a lot of time and effort simply because you set the bar higher than you needed to. Now my question is: was it worth it?

  6. This might be old news, but I find Rmarkdown to be a complete game changer for large scale applied analyses.

    If possible, I like to have one directory for raw data, one for cleaned/altered/derived data, one for outputs and one for Rmarkdown scripts for all parts of the analysis (i.e. data cleaning, going from cleaned to summary stats, etc.). This makes it extremely easy to figure out at any point of the process how you got from the raw data to your conclusions. No digging around trying to figure out how table X or figure Y was generated: the code that generated X or Y is the code you see right above X or Y. Really helpful for explicitly seeing *why* you may have done some data cleaning steps too, which are easy to forget and really important! Also really helpful if there ends up being small changes to the raw data itself: just update the raw data directory and rerun all the scripts.
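
    To make re-running painless, a little driver script helps (the directory and file names here are made up for illustration):

    # Hypothetical driver: re-knit every analysis script in order, so derived data
    # and outputs are regenerated from the raw directory whenever the raw data change.
    library(rmarkdown)
    scripts <- sort(list.files("rmd", pattern = "\\.Rmd$", full.names = TRUE))
    for (f in scripts) {
      render(f, output_dir = "output")    # e.g. rmd/01_clean_raw.Rmd, rmd/02_summaries.Rmd, ...
    }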

    The only issue is when you have a collaborator who wants to manually alter data in Excel. But this also serves as a great tool for passive-aggressively putting a stop to that bad practice…

    • We used Rmarkdown extensively in this project, to document intermediate work for each other and for ourselves and for the project manager. Also, all of the final figures and tables are generated in a couple of markdowns. So, yes, agree, this is a great tool.

  7. @a reader: I think you can generalize a bit from that: literate programming can help, and you can do that multiple ways–R Markdown, Emacs Org Mode (my preference), Jupyter Notebooks, ….

    I also agree with the importance of encapsulating code in functions. In R at least, it’s seemed helpful to encapsulate those into tested (e.g., with testthat) packages. Do I do it as well as I’d like? Not yet, of course.

  8. I’ll be perfectly honest, Phil: this is going to happen again. When you own a business, every so often stuff goes really, really, really wrong and you need to put in the work of several people to keep your business from going off the rails. Don’t get me wrong, I want to say congrats on encountering a major hurdle and coming through the other side! I know what that’s like and it’s tough; many fail, you didn’t, awesome job. But expect this again down the road in different clothing!

    I can relate to the post in several ways but of them the most intriguing to me is that the prime motto we teach our construction supervisors during training is incredibly similar to what you describe at the end of your post about previous work. We teach them: “never trust the work that was done previously, including your own.” The idea is to always assume all work is wrong before you proceed. If you can’t figure out how it’s wrong then you’re not looking hard enough. The best bet is to figure out how wrong the work before you is and if you can live with it.

    The other thing I wanted to mention, based on my experience in construction, is that qualification of your bid is important. Knowing how to qualify so as not to be disqualified at tender, but also to cover your rear end down the road when haggling for extras, is an important skill. If you can, take some time after this experience to see whether there are tactful ways to qualify your bids in the future for this type of problem, and others! Though sometimes qualification renders your bid automatically disqualified, in which case you need to appraise the situation and put in a number you’re comfortable with given the risks.

    Congrats again though on getting it done.

  9. Your story reminds me of my time working for state government in between my masters and PhD programs. I was given a project to update annual numbers that had to be reported to the feds and that tied to about a million dollars in grant pass-through money per year. I discovered that my predecessor had been reporting the number of incidents rather than the number of people that the program required, and that error had quadrupled the actual number. The agency I worked for would have gotten zero dollars each prior year had the real numbers been reported. Needless to say, it didn’t go over well with management, but when I showed them what had happened they accepted that we had to inform the feds of the issue. The funny thing was that the feds, upon hearing our update, said they never believed the prior numbers and that we weren’t the only state to conflate incidents with people. They also never asked for the money back.

    • Funny story.

      Maybe one of the issues is: the people we are working with at the big company don’t have a huge personal stake in the answers coming out one way or another. They hire consultants and try to make sure they get good work out of them, but they don’t have the bandwidth or the expertise to carefully track the work of every consultant (if they did, they could do the work themselves), so at some level they have to assume the consultants are doing decent work unless demonstrated otherwise. So when we point out that some of the previous work had major mistakes, they aren’t happy about it, but they also don’t have a strong incentive to bury it or to lie about it. People make mistakes; sometimes you just have to try to correct them and move on.

  10. A few years ago I realized while litigating a mass tort that the numbers that a certain jurisdiction had been reporting to the CDC about testing children for a certain contaminant in their blood were all wrong — I forget the details, but the people reporting it were clearly just typing the wrong number from the test results in the box. But the reported numbers were slightly biased toward us (the plaintiffs), so we didn’t want to get into it, and the case went away before trial, so I still don’t know if anyone knows this but me. I don’t think the blood testing is required anymore, so I don’t think this is a current issue, but the faulty numbers provide a false history of the exposure to the contaminant in this jurisdiction — and an object lesson in incompetent data collection — so I still wonder if I should have leaked this to a newspaper reporter.
