A silly little error, of the sort that I make every day

Ummmm, running Stan, testing out a new method we have that applies EP-like ideas to perform inference with aggregate data—it’s really cool, I’ll post more on it once we’ve tried everything out and have a paper that’s in better shape—anyway, I’m starting with a normal example: a varying-intercept, varying-slope model where the intercepts have population mean 50 and sd 10, the slopes have population mean -2 and sd 0.5 (for simplicity I’ve set up the model with intercepts and slopes independent), and the data standard deviation is 5. I fit the model in Stan (along with other stuff; the real action here is in the generated quantities block, but that’s a story for another day), and here’s what we get:

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu_a[1]    49.19    0.01 0.52 48.14 48.85 49.20 49.53 50.20  2000    1
mu_a[2]    -2.03    0.00 0.11 -2.23 -2.10 -2.03 -1.96 -1.82  1060    1
sigma_a[1]  2.64    0.02 0.50  1.70  2.31  2.62  2.96  3.73   927    1
sigma_a[2]  0.67    0.00 0.08  0.52  0.61  0.66  0.72  0.85   890    1
sigma_y     4.97    0.00 0.15  4.69  4.86  4.96  5.06  5.27  2000    1

We’re gonna clean up this output—all these quantities are ridiculous, and I’m also starting to think we shouldn’t be foregrounding the mean and sd, as these can be unstable; median and IQR would be better, maybe—but that’s another story too.

Here’s the point. I looked at the above output and noticed that the sigma_a parameters are off: the sd of the intercept is too low (it’s around 2 and it should be 10) and the sd of the slopes is too high (it’s around 0.6 and it should be 0.5). The correct values aren’t even in the 95% intervals.

OK, it could just be this one bad simulation, so I re-ran the code a few times. Same results. Not exactly, but the parameter for the intercepts was consistently underestimated and the parameter for the slopes was consistently overestimated.

What up? OK, I do have a flat prior on all these hypers, so this must be what’s going on: there’s something about the data where intercepts and slopes trade off, and somehow the flat prior allows inferences to go deep into some zone of parameter space where this is possible.

Interesting, maybe ultimately not too surprising. We do know that flat priors cause problems, and here we are again.

What to do? I’d like something weakly informative, this prior shouldn’t boss the inferences around but it should keep them away from bad places.

Hmmm . . . I like that analogy: the weakly informative prior (or, more generally, model) as a permissive but safe parent who lets the kids run around in the neighborhood but sets up a large potential-energy barrier to keep them away from the freeway.

Anyway, to return to our story . . . I needed to figure out what was going on. So I decided to start with a strong prior focused on the true parameter values. I just hard-coded it into the Stan program, setting normal priors for mu_a[1] and mu_a[2]. But then I realized, no, that’s not right, the problem is with sigma_a[1] and sigma_a[2]. Maybe put in lognormals?

And then it hit me: in my R simulation, I’d used sd rather than variance. Here’s the offending code:

a <- mvrnorm(J, mu_a, diag(sigma_a))

That should've been diag(sigma_a^2). Damn! Going from univariate to multivariate normal, the parameterization changed: rnorm() takes a standard deviation, but mvrnorm() takes a covariance matrix.
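
For concreteness, here's a minimal sketch of what a corrected fake-data simulation along these lines could look like; the mvrnorm() call is the fixed line, while the sample sizes, the single predictor x, and the variable names are just illustrative scaffolding rather than the actual script:

library(MASS)   # for mvrnorm()

# Illustrative fake-data setup; J, n_per_group, and the predictor x are
# made up for this sketch, not taken from the real code.
J <- 100                  # number of groups
n_per_group <- 10         # observations per group
mu_a <- c(50, -2)         # population means of the intercepts and slopes
sigma_a <- c(10, 0.5)     # population sds of the intercepts and slopes
sigma_y <- 5              # residual sd

# mvrnorm() expects a covariance matrix, hence the squared sds:
a <- mvrnorm(J, mu_a, diag(sigma_a^2))

group <- rep(1:J, each = n_per_group)
x <- rnorm(J * n_per_group)
y <- rnorm(J * n_per_group, a[group, 1] + a[group, 2] * x, sigma_y)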

On the plus side, there was nothing wrong with my Stan code. Here's what happens after I fixed the testing code in R:

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu_a[1]    48.17    0.11 1.62 45.08 47.07 48.12 49.23 51.38   211 1.02
mu_a[2]    -2.03    0.00 0.10 -2.22 -2.09 -2.02 -1.97 -1.82  1017 1.00
sigma_a[1] 10.98    0.05 1.18  8.95 10.17 10.87 11.68 13.55   496 1.01
sigma_a[2]  0.57    0.00 0.09  0.42  0.51  0.56  0.63  0.75   826 1.00
sigma_y     5.06    0.00 0.15  4.78  4.95  5.05  5.16  5.35  2000 1.00

Fake-data checking. That's what it's all about.

<rant>

And that's why I get so angry at bottom-feeders like Richard Tol, David Brooks, Mark Hauser, Karl Weick, and the like. Every damn day I'm out here working, making mistakes, and tracking them down. I'm not complaining; I like my job. I like it a lot. But it really is work, it's hard work some time. So to encounter people who just don't seem to care, who just don't give a poop whether the things they say are right or wrong, ooohhhhh, that just burns me up.

There's nothing I hate more than those head-in-the-clouds bastards who feel deep in their bones that they're right. Whether it's an economist fudging his numbers, or a newspaper columnist lying about the price of a meal at Red Lobster, or a primatologist who won't share his videotapes, or a b-school professor who twists his stories to suit his audience—I just can't stand it, and what I really can't stand is that it doesn't even seem to matter to them when people point out their errors. Especially horrible when they're scientists or journalists, people who are paid to home in on the truth and have the public trust to do that.

A standard slam against profs like me is that we live in an ivory tower, and indeed my day-to-day life is far removed from the sort of Mametian reality, that give-and-take of fleshy wants and needs, that we associate with "real life." But, y'know, a true scholar cares about the details. Take care of the pennies and all that.

</rant>

44 thoughts on “A silly little error, of the sort that I make every day”

  1. I don’t think your modeling situation was analogous to the situations from your rant. You encountered a mistake that seemed to contradict your preconceived notions (that your method should work) and corrected it, thus confirming your preconceived notions. The people in your rant all failed to correct mistakes that seemed to confirm their preconceived notions. Your experience would be analogous to David Brooks seeing that a small town Red Lobster was really expensive, talking to the manager, and discovering that he had accidentally been given a menu from a big city liberal Red Lobster.

    • Big city liberal Red Lobster is full of hipsters wearing thrift-store trucker hats, ironically overpaying for crappy food so they can act so so superior to the common folk. Completely different from big suburb David Brooks who has a deep respect for people in Red America, he just doesn’t want to live there.

      Where Richard Tol and Mark Hauser fit into all of this, I have no idea. But I bet they don’t eat at Red Lobster.

      Karl Weick, though, who knows? When he hangs out with his Wall Street friends he probably eats at expensive Manhattan restaurants. But when he’s back home in Michigan, maybe he goes for the comfort food. And, at his salary, he’d be able to afford all the Red Lobster he can eat!

  2. > Fake-data checking. That’s what it’s all about.

    Cheers to that. I gave a brief today summarizing work where we’d done about three months’ worth of fake-data checking prior to having any real data. We coded up our algorithm, fed it well-behaved data to confirm we hadn’t screwed up anything obvious, and then fed it increasingly challenging fake data to 1) determine whether we’d made any subtle mistakes and 2) see how much we could violate the basic model assumptions and still get meaningful results. It was productive. We now have real data and it appears its characteristics fall within the range we tested. The fake-data testing gives us confidence in our results with the real stuff.
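
    Schematically, the loop was something like the sketch below; the model and the noise levels here are a toy stand-in to illustrate the pattern, not our actual code:

    # Toy illustration of a fake-data checking loop (not the real analysis):
    # simulate data with known parameters, fit, and see whether the truth is
    # recovered as the fake data get progressively nastier.
    set.seed(123)
    true_beta <- c(2, -1)
    for (noise_sd in c(1, 5, 20)) {   # increasingly challenging fake data
      x <- rnorm(200)
      y <- true_beta[1] + true_beta[2] * x + rnorm(200, 0, noise_sd)
      fit <- lm(y ~ x)
      print(round(cbind(true = true_beta, est = coef(fit),
                        se = sqrt(diag(vcov(fit)))), 2))
    }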

  3. This post is also a great example of why, even in qualitative scholarship or narrative journalism, it’s silly to say, “It’s one little mistake! What’s the big deal?” It’s much easier to see in mathematical calculations and statistical simulations. One “little” mistake early on often ultimately undermines the result. When a journalist or scholar gets a fact wrong, it affects the entire account. It becomes a different narrative.

  4. > those head-in-the-clouds bastards who feel deep in their bones that they’re right
    Like Neyman? (see comment to Basbøll)

    > But it really is work, it’s hard work some time.
    Yup.

    I once stated to a funding agency that my biggest contribution to clinical research was finding and correcting data errors – I doubt if they were impressed.

    • > I once stated to a funding agency that my biggest contribution to clinical research was finding and correcting data errors – I doubt if they were impressed.

      Sexy sells. Spending one’s time in the weeds to ensure that the answer’s right is rarely sexy. In my experience, a compelling narrative – whether fact or fiction – trumps a detail-oriented analysis most of the time. If you can find a sponsor for your work who’s as interested in getting the details right as they are in being able to tell a good story then consider yourself fortunate.

  5. I used to do computational chemistry and once spent a few weeks in grad school comparing results of test cases using the relatively new code we were using against a more widely used code.

    When my advisor found out he was raving mad. He claimed I was wasting valuable CPU time, and that benchmarking didn’t befit a research group of his caliber.

  6. I’ve done that countless times when simulating in R.

    One thing I usually do when testing out an MCMC algorithm on simulated data is to estimate each block of parameters conditional on the others set to their true values. This has worked pretty well for debugging, because each conditional block is usually a pretty simple application and the results should be quite precise.
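
    For instance, a toy version of the idea (a one-parameter conjugate normal example, not any particular model): hold the scale at its true value and check that the location block comes back tight around the truth.

    # Check one block conditional on the rest held at their true values:
    # with sigma fixed at the truth and a flat prior, the posterior for mu
    # is N(ybar, sigma^2 / n), so draws should sit tightly around true_mu.
    set.seed(1)
    true_mu <- 50; true_sigma <- 10; n <- 1000
    y <- rnorm(n, true_mu, true_sigma)
    mu_draws <- rnorm(4000, mean(y), true_sigma / sqrt(n))  # sigma held at truth
    quantile(mu_draws, c(0.025, 0.5, 0.975))                # should bracket 50 snugly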

  7. Doesn’t deal with stats, but I fuss at grad students who don’t pay attention to details like citations in a reference list. If you can’t pay attention to rather mundane details like getting citations correct, what else in your analysis did you not pay attention to?

    • “If you can’t pay attention to rather mundane details like getting citations correct, what else in your analysis did you not pay attention to?”

      Uh, pretty much everything? I once read a medical association’s position paper on a subject that I know a fair bit about and noticed that the very first citation was wrong. Further investigation showed that something like 35 or 40 percent of the references were incorrect: everything from an occasional misspelled author’s name to wrong dates, screwed-up page numbers, and completely wrong journal names. I could only identify some journals by googling whatever part of a reference might have been accurate.

      Further review of the paper and the associated cited papers led me to believe that the authors had not read most (any?) of the papers beyond, possibly, reading some of the abstracts. An inability to distinguish between 15 km/hr and 15 miles/hr did not make me confident about the conclusions.

      If it had been a first year university paper it would have received a failing grade accompanied by some strong counselling as in “READ the papers!”.

  8. > Especially horrible when they’re scientists or journalists, people who are paid to home in on the truth and have the public trust to do that.

    I’ve already pointed out on your blog that tenure protects the academics who do these sorts of things. (You said it happens everywhere, but journalists do get fired for fabricating; not at Rolling Stone, apparently, though the claim there is that they simply published a fabrication. So I guess the academic analogy would be firing the editors of a journal. Digression.)

    Anyway, for scientists in the physical sciences, gross errors of the type you describe in the social sciences get the funding cut off, and then your university tells you to retire or enjoy sitting in an office adorned with buckets and mops next to the restroom in the basement for the next twenty years. There is another way to proceed, however. Many of these violators are members of prestigious organizations such as the National Academy of Sciences, which do have members who actually care about these things. Have you thought of preparing a case against them, putting it on the internet for people to sign, and then presenting it to NAS as a request for expulsion? Shaming (and that would be shameful) might help reduce behavior which you (and I! and any scientist) would find objectionable.

      • Well, yes, but since I don’t want to give away my location, I can’t give them to you. Fifty-five is often the magic age (they give you a buy-out to get rid of you), which seems like a long coast, but consider that it takes 6-7 years to get a Ph.D. (in the physical sciences), a postdoc (3 years), maybe even a second postdoc (another 3 years), and then 5 years to get tenure, so you’re already nearly at 40 (or past it), and then you’re just an associate for 5-10 years; so if you quit bringing in the money, your dead time to the university is 5, maybe 10 years. In the physical sciences you have a lab with 5-10 graduate students and 3-5 postdocs (on average; some are huge), so it takes a lot of money coming in to support this. This is at research institutions; at state schools in the Midwest it may be different.

    • Numeric:

      You ask, “Have you thought of preparing a case against them, putting it on the internet for people to sign, and then presenting it to NAS as a request for expulsion?”

      I’ve never done this but I’ve no problem with others doing such things. It never seems to accomplish much, but I agree that it’s worth trying.

      Here’s what I’ve tried along these lines:

      – Contacted the American Statistical Association to request they retract Edward Wegman’s Founders Award (“to recognize members who have rendered distinguished service to the association”).

      – Contacted David Brooks directly and also my contacts at the NYT to request that he post a correction to one of his false claims.

      – Contacted the editor of the journal where Richard Tol published his paper that had multiple waves of errors.

      I’ve also had correspondents who’ve notified the employers of Matthew Whitaker, Frank Fischer, etc., and some others that I can’t remember.

      None of these efforts have succeeded.

      Even that Canadian medical researcher, Ranjit Chandra, the guy whose extreme frauds were uncovered by my friend Seth, was not quite fired, he was just forced into early retirement. And of course Dr. Anil Potti got that job in North Dakota.

      On the plus side, Mark Hauser was kicked out (not through any efforts of mine), and perhaps the ruined reputation of various sloppy scholars will be enough to deter some future would-be cheaters. And Gregg Easterbrook is no longer writing for Reuters.

      So I’m not really sure what should be done. Mockery and outrage on the blog should help a bit, I hope. But I’m discouraged when I hear about people who have gone to the effort of organizing formal proceedings against fraudsters.

      It seems pretty consistent that institutions will attack the whistleblowers and defend the cheaters, I assume under the theory that the negative publicity of any strong action by the institution would be worse than the negative publicity associated with the original (and, often, ongoing) offense.

      I hope that the exposure that we and others (for example, Retraction Watch) give to crap science will, in some small way, change the cost-benefit calculation for scientists who are thinking of doing sloppy or fraudulent work. We’re trying to reduce the upside to cheating (by reducing the prestige of what Dan Kahan calls “WTF research”), and we’re trying to increase the downside by making it easier to publicize retractions, should-be retractions, and failed replications. I’d like to reduce the incentive for people to publish, and publicize, crap.

      • Your last comment, and your earlier statement about “Take care of the pennies and all that” make me think of another point. Part of the blame, I believe, lies with us – the “good” and “honest” researchers. We often fail to distinguish between technical errors that don’t have much of a material impact and those errors that undermine (or at least call into question) the important conclusions of a study. In fact, I suspect some would say that any error calls into question the conclusions of a study. I don’t agree with that and believe it is part of the problem. Some errors will be made and some don’t really matter much. Others are quite important. It is our job to know the difference and express that. When researchers fail to distinguish between the two, they invite the world to downplay errors and not hold their perpetrators responsible.

        • At least in the research conferences I attended (mostly chemistry-related), there was far too little disagreement or criticism expressed by the audience.

          Even the format of our weekly departmental seminar seemed designed to preclude criticism. Typically 50 to 55 minutes of uninterrupted speaker-time followed by approximately three questions.

        • This is very important. We need to be much better at presenting the results concisely and then just fielding questions. Even if “How did you control for…?” simply elicits the perfectly standard answer you would have given as part of your presentation, there’s a better chance that someone will get a chance to spot something you hadn’t thought of. Imagine the 5-minute seminar version of “We found that female hurricanes are significantly deadlier than male ones” followed by an hour of questions, instead of the 55-minute version with three polite questions vaguely acknowledged at the end.

      • Thanks for the description of what hasn’t worked. I’m glad you’ve tried, but the outcome is about what I would have expected. I seem to recall some NIH researchers about 30 years ago putting academic history books through a computer comparison program (apparently NIH allows a lot of latitude to its researchers, or at least they did) and finding a great deal of plagiarism and/or extensive paraphrasing. They eventually were told not to work on anything like this (I haven’t been able to find this on Google; anyone have any recollections?).

        Obviously, academic research would benefit tremendously if there was something like those NIH researchers looking into scientific misconduct/shoddiness, but institutionalized. These would be random audits (not the Scientology kind) like the IRS does. In fact, the IRS every year picks a (small) number of returns at random and does a complete check on them, making the unfortunate individual justify every line. This is done so they can refine their auditing techniques and get a baseline of compliance. The same could be done with academic research.

        • numeric:

          I agree (and have suggested) that some sort of random auditing would have a real impact – perhaps mostly in preventing sloppiness as researchers would be aware that they could be audited.

          Exactly how it should be done, and especially by whom (probably not the funding agencies, as they have a conflict of interest), would be interesting to figure out.

  9. I believe there are (too) few examples where justice has been served. I can tell you, however, that as an expert witness, failure to cite properly, having errors exposed, not owning up to errors, etc. can be fatal to your career. Sure there are horrible cases of abuse and witnesses that survive despite such issues, but I would speculate they are far rarer than in academia. Once you’ve been burned on the witness stand, it is hard to get another shot. It takes a good lawyer to expose such errors and I’ve seen plenty of witnesses worm their way out when cross examined by less proficient attorneys. But overall, the stakes in academia are just too low so the level of abuse is much higher.

    • PK:

      That’s a really weird interview. Impressive of the interviewer to extract such damning quotes from Brooks, not so impressive not to push him on it (e.g., “Mr. Brooks, if all this bothers you so much, why do you not correct your errors in print?”). I suppose that sort of question would’ve led him to shut down the interview as he did when Sasha Issenberg asked him about the price of a meal at Red Lobster.

  10. Peer review is increasingly becoming a question of whether the reviewer can spot the bugs in the authors’ software without having seen it. Any usage of SPSS or SAS beyond basic point-and-click (let alone R and its derivatives) means that you are now a programmer, and your analysis obeys all the laws of software, just as much as an Excel sheet does from the moment you type a single formula.

    I’m currently working on a reanalysis of a complex article in which the first week was spent working out exactly which bugs I needed to introduce into my own software to match the ones that were causing the original authors to get their (erroneous) results.

    This way of working is not, I suggest, going to produce the kind of science that split the atom or took us to the moon.

    • > I’m currently working on a reanalysis of a complex article in which the first week was spent working out exactly which bugs I needed to introduce into my own software to match the ones that were causing the original authors to get their (erroneous) results.

      That raises the question of how far into the weeds the reviewer is obligated to delve. Is it shirking your duty as a reviewer to say “No obvious errors. Okay to publish.” and then let interested readers find the flaws? (With respect to the article you mention, are you a reviewer or an interested reader?)

        • I don’t check code. I don’t ever get it.

          But even if I did, I think checking things like data management, “cleaning” (all the “set-up” work), or all the matrix algebra in the code itself, would always be outside the scope of the academic referee’s job. I could imagine it becoming a part of the formal review process, but conducted by the journal staff and not the academic referees – at least checking to make sure the code runs and generates the tables/figures seen in the submitted document. Maybe that is just self-serving (shifting the workload), but I think there are more qualified non-academics to do that kind of work, and let the academics worry about whether the reasoning, interpretation, implementation, and research design make sense.

        • I think we ought to take a stand & outright refuse to review papers whose authors won’t send the code along.

          Agreed that it might not be practical to do a thorough code review but something is better than nothing.

        • I think this is exactly the policy that is leading to a proliferation of crap. We need more conscientious reviewing.

          And no one can or should be reviewing zillions of papers. It does not make any sense.

        • Rahul:

          This has come up before on the blog. In short, we contribute where we can be most effective. In 5 minutes I can do a useful review. I think I contribute more from 100 five-minute reviews than from five 100-minute reviews.

          Regarding your point about proliferation of crap: I can often detect crap in 5 minutes. It’s not about the correctness of the R code, it’s about things that are claimed that aren’t possible at all. Beyond this, the crap will get published somewhere (if not in Nature, in the Journal of Theoretical Biology or PPNAS), so I think the real solution is post-publication review and the removal of the positive incentive for publishing crap.

          In short, if a researcher gets X career points for publishing crap, he or she should get something like -5X when the crap is discovered. Rather than the current standard which might be -0.5X.

          Just to be clear, when I say “crap” I mean something that was prospectively wrong, or something where the author aggressively ignored contrary indications (as in that notorious Oster paper). I don’t think people should be discouraged from publishing speculation; they should be discouraged from presenting speculation as if it were certainty.

        • Andrew:

          I’m not doubting that you can often detect crap in 5 minutes. That’s great. It’s the papers where you cannot detect any obvious crap in 5 minutes that I worry about.

          I’m not saying you ought to spend 100 minutes on a paper where you detect crap within 5 minutes of reading. That’s an obvious reject. But before you say “accept” I think you ought to devote significant time to a paper, much more than 5 minutes. Otherwise the whole “review” process becomes a joke.

        • Doesn’t that depend on the amount of 5-minute rejections? If Andrew reviews a hundred papers and 99 are revealed to be crap in the first five minutes and rejected, then letting one go through to post-publication review after five minutes too seems okay. If he instead spends fifty minutes on that paper, and ends up accepting it, he is, on our model, letting 10 crappy papers slip through his filter. It’s a question of what we want Andrew to be spending his time doing, I guess.

        • Cull the herd with 5-minute reviews and then 100-minute reviews for manuscripts which pass the 5-minute filter. Let post-publication review take care of the ones which pass the 100-minute test.

          An unpaid (NB: unpaid) reviewer shouldn’t be obligated to spend a day, let alone multiple days, validating someone else’s code.

          PS Another question out of curiosity: How much time do you spend on manuscripts which aren’t obviously crap? I’m usually into them for 3-6 hrs. The not-very-good-but-possess-some-redeeming-features ones are the biggest time sinks. (For what it’s worth, 90% of the manuscripts I’ve reviewed have been for Optical Society of America journals. I don’t encounter many statistical analyses.)

        • Also, I don’t get this nihilistic attitude about “the crap will get published somewhere”

          Do we treat students by the same yardstick? Why fail him at Columbia, since if not here he will go to the University of Phoenix & get that damn degree anyways?

          Why not pass them all in the exams & let the post-graduation employment process weed out the junk?
