
The Ben Geen case: Did a naive interpretation of a cluster of cases send an innocent nurse to prison until 2035?

In a paper called “Rarity of Respiratory Arrest,” Richard Gill writes:

Statistical analysis of monthly rates of events in around 20 hospitals and over a period of about 10 years shows that respiratory arrest, though about five times less frequent than cardio-respiratory arrest, is a common occurrence in the Emergency Department of a typical smaller UK hospital.

He has lots of detailed and commonsensical (but hardly routine) data analysis. Those of you who read my recent posts (here and here) on the World Cup might be interested in this, because it has lots of practical details of statistical analysis but of a different sort (and it’s a topic that’s more important than soccer). The analysis is interesting and clearly written.

But the subject seems pretty specialized, no? Why am I sharing with you an analysis of respiratory arrests (whatever that is) in emergency departments? The background is a possibly huge miscarriage of justice.

Here’s how Gill told it to me in an email:

I’m wondering if you can do anything to help Ben Geen – a British nurse sentenced to 30 years for a heap of crimes that were not committed by anyone. Show the world that in good hands, statistics can actually be useful!

There are important statistical issues and they need to be made known to the public.

There is in fact an international epidemic of falsely accused health care serial killers.

In Netherlands: Lucia de Berk
In the UK: Colin Norris, Ben Geen
In Canada: Susan Nelles
In the US:

So I [Gill] got involved in the Ben Geen case (asked by defence lawyer to write an expert statistical report). This is it

I became convinced that most of what the media repeated (snippets of over the top info from the prosecution, out of context, misinterpreted) was lies, and actually the real evidence was overwhelmingly strong that Ben was completely, totally innocent.

Here’s the scientific side of the story. It’s connected to the law of small numbers (Poisson and super Poisson variation) and to the Baader-Meinhof effect (observer bias). And to the psycho-social dynamics in a present-day, dysfunctional (financially threatened, badly run) hospital.

In the UK, Colin Norris and Ben Geen are serving 30-year sentences and, absolutely clearly, they are completely innocent. Since no one was murdered there never will be a confession by the true murderer. Because there were no murders there won’t ever be new evidence pointing in a different direction. There will never be a new fact, so the system will never allow the cases to be reviewed. Since the medical profession was complicit in putting those guys away, no medical doctor will ever say a word to compromise his esteemed colleagues.

What is going on? Why this international epidemic of falsely accused “health care serial killers”?

Answer: in the UK: the scare which followed Shipman triggered increased paranoia in the National Health Service. Already stressed, overburdened, underfunded … managers, nurses, specialists all with different interests, under one roof in a hospital … different social classes, lack of communication

So here are the ingredients for a Lucia / Ben / Colin:

(1) a dysfunctional hospital (chaos, stress, short-cuts being taken)
(2) a nurse who is different from the other nurses. Stands out in the crowd. Different sex or age or class. More than average intelligence. Big mouth, critical.
(3) something goes wrong. Someone dies and everyone is surprised. (Why surprised: because of wrong diagnosis, disinformation, ….)
(4) Something clicks in someone’s mind (a paranoid doctor) and the link is made between the scary nurse and the event
(5) Something else clicks in … we had a lot more cases like that recently (eg. the seasonal bump in respiratory arrests. 7 this month but usually 0, 1 or 2)
(6) The spectre of a serial killer has now taken possession of the mind of the first doctor who got alarmed, and he or she rapidly spreads the virus to close colleagues. They start looking at the other recent cases and letting their minds fall back to other odd things which happened in recent months and stuck in their minds. The scary nurse also stuck in their minds, and they connect the two. They go trawling and soon they have 20 or 30 “incidents” which are now bothering them. They check each one for any sign of involvement of the scary nurse, and if he’s involved the incident quickly takes on a very sinister look. On the other hand, if he was on a week’s vacation then obviously everything must have been OK and the case is forgotten.
(7) Another conference, gather some dossiers – half a dozen very suspicious cases to report to the police to begin with. The process of “retelling” the medical history of these “star cases” has already started. Everyone who was involved and does know something about the screw-ups and mistakes says nothing about them but confirms the fears of the others. That’s a relief – there was a killer around; it wasn’t my prescription mistake or oversight of some complicating condition. The dossiers which will go to the police (and importantly, the layman’s summary, written by the coordinating doctor) do contain “truth” but not the *whole truth*. And there is lots of truth which is not even in hospital dossiers (culture of lying, of covering up for mistakes).
(8) The police are called in, the arrest is made, there is of course an announcement inside the hospital, and there has to be an announcement to the press. Now of course the director of the hospital is in control – probably misinformed by his doctors, obviously having to show his “damage control” capacities and to minimize any bad PR for his hospital. The whole thing explodes out of control and the media feeding frenzy starts. Witch hunt, and then witch trial.

Then of course there is also the bad luck. The *syringe*, in Ben’s case, which clinches his guilt to anyone who nowadays does a quick Google search.

This is what Wendy Hesketh (a lawyer who is writing a book on the topic) wrote to me:

“I agree with your view on the “politics” behind incidences of death in the medical arena; that there is a culture endorsing collective lying”

“Inquiries into medico-crime or medical malpractice in the UK seem to have been commandeered for political purposes too: rather than investigate the scale of the actual problem at hand, or learn lessons on how to avoid it in future, the inquiries seem designed only to push through current health policy”

“The “Establishment” want the public to believe that, since the Shipman case, it is now easier to detect when a health professional kills (or sexually assaults) a patient. It’s good if the public think there will never be “another Shipman” and Ben Geen and Colin Norris being jailed for 30 years apiece sent out that message, as has the string of doctors convicted of sexual assault; but statistics have shown that a GP would have to have a killing rate to rival Shipman’s in order to have any chance of coming to the attention of the criminal justice system. In fact, the case of Northumberland GP Dr. David Moor, who openly admitted in the media to killing (sorry, “helping to die”) around 300 patients (he wasn’t “caught”), reflects this. I argue in my book that it is not easier to detect a medico-killer now since Shipman, but it is much more difficult for an innocent person to defend themselves once accused of medico-murder.”

Indeed, the rate of serial killers in the UK’s National Health Service must be tiny, and if there are good ones around they won’t even be noticed.
Yet it is so easy in a failing health care organization for the suspicion to arise that there is one around. And once the chances are aligned and the triggering event has happened there is no going back. The thing snowballs. The “victim” has no chance.

Chance events are clustered!!! Pure chance gives little bunches of tightly clustered events with big gaps between them. When chances are changing (e.g. seasonal variation, changes in hospital policies, staffing, new personnel with new habits when filling in woefully inadequate diagnosis forms) then the phenomenon is stronger still!

eg three airliners crashed within a couple of days this week!!!

How odd is a cluster of cases? Well by the law of *small* numbers (Poisson and even super-Poisson variation – Poisson means pure chance .. super-Poisson means pure chance but with the “chance per day” slowly varying in time) “short intervals between crashes are more likely than long ones”. (actually – very short, and very long, intervals, are both common. Pure chance means that accidents are *not* uniformly spread out in time. They are clustered. Big gap, cluster, biggish gap, smallish cluster… that’s pure randomness!!!)
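Gill's clustering point is easy to check by simulation. The sketch below simulates a homogeneous Poisson process and bins events by month; the rate of 2 events per month and the 10-year window are illustrative assumptions, not figures from Gill's report. Even with a constant rate, the busiest month routinely holds several times the average number of events.

```python
import random

random.seed(1)

RATE_PER_MONTH = 2.0   # hypothetical average rate, for illustration only
MONTHS = 120           # ten years, matching the time scale Gill analyzes

def monthly_counts(rate, months):
    """Simulate a homogeneous Poisson process and bin events by month."""
    counts = [0] * months
    t = random.expovariate(rate)  # exponential gaps <=> pure-chance process
    while t < months:
        counts[int(t)] += 1
        t += random.expovariate(rate)
    return counts

counts = monthly_counts(RATE_PER_MONTH, MONTHS)
print("mean per month:", sum(counts) / MONTHS)
print("busiest month :", max(counts))
print("empty months  :", counts.count(0))
```

Running this a few times shows months with 6, 7, or more events alongside stretches of empty months, all from pure chance: exactly the sort of "bump" that can look sinister in hindsight.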

Then there is the Baader-Meinhof phenomenon

[I replaced an earlier link for this which pointed to a flaky news site -- AG]
“Baader-Meinhof is the phenomenon where one happens upon some obscure piece of information– often an unfamiliar word or name– and soon afterwards encounters the same subject again, often repeatedly. Anytime the phrase “That’s so weird, I just heard about that the other day” would be appropriate, the utterer is hip-deep in Baader-Meinhof.”

Another name for this is *observer bias*. You (a medical doctor having to fill in a diagnosis for a patient on a standard form, which is totally inadequate for the complexity of medicine) saw one case which you had to give a rather unusual label to, and in the next weeks that “unusual diagnosis” will suddenly come up several times.

Well, Professor Jane Hutton (Warwick University, UK) wrote all these things in her expert report for the appeal 6 years ago, but the judge said that this kind of statistical evidence “is barely more than common sense” and so refused the request for her to tell this common sense out loud in court.

OK, it’s me again. I haven’t looked at this case in any detail and so can’t add anything of substance to Gill’s analysis. But what I will say is, if Gill is correct, this example demonstrates both the dangers and the potential of statistics. The danger because it is statistical analysis that has been used to convict Geen (both in court and in the “court of public opinion” as measured by what can be found on Google). The potential because a careful statistical analysis reveals the problems with the case (again, I’m relying on Gill’s report here; that is, my comments here are conditional on Gill’s report being reasonable).

Just to be clear, I’m not saying that statistical arguments cannot or should not be used to make public health decisions. Indeed, I was involved last year in a case in which the local public health department made a recommendation based on statistical evidence, and this recommendation was questioned, and I (at the request of the public health department, and for no compensation) wrote a brief concurrence saying why I did not agree with the critics. So I am not saying that any statistical argument can be shot down, or that the (inevitable) reliance of any argument on assumptions makes that argument suspect. What I am doing is passing along Richard Gill’s analysis in this particular case, where he has found it possible for people to draw conclusions from noise, to the extent of, in his view, sending an innocent person to jail for 30 years.

SciLua 2 includes NUTS

The most recent release of SciLua includes an implementation of Matt’s sampler, NUTS (link is to the final JMLR paper, which is a revision of the earlier arXiv version).

According to the author of SciLua, Stefano Peluchetti:

Should be quite similar to your [Stan's] implementation with some differences in the adaptation strategy.

If you have time to have a look and give it a try I would appreciate your feedback. My website includes the LuaJIT binaries for 3 arches [Windows, Mac OS X, Linux], and the Xsys and Sci libraries needed to run NUTS.

I’ve also been working on a higher-level interface with some test models and datasets (the user can specify a simple script to describe the model, but it’s always Lua syntax, not a custom DSL).

Please notice that multi-pass forward automatic differentiation is used (I experimented with a single-pass, tape based, version but the speed-up was not big and for high-dimensional problems one wants to use reverse diff in any case, on my TODO).

For complex models you might want to increase loopunroll a little, and maybe callunroll as well. Feel free to ask any question.

I haven’t had time to try it myself. Looks like it’s implemented using a just-in-time compiler (JIT), like Julia, with models specified in Lua rather than a custom modeling language.
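Peluchetti's remark about forward versus reverse automatic differentiation can be illustrated with dual numbers, where each value carries its derivative along. The toy Python class below is only a sketch of the idea (SciLua's actual implementation is in Lua and supports far more operations); one forward pass yields the derivative with respect to one input, which is why forward mode needs multiple passes for multi-dimensional gradients while reverse mode needs one.

```python
import math

class Dual:
    """Toy forward-mode autodiff: carry (value, derivative) together."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def dsin(x):
    # chain rule through sin
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# d/dx [x * sin(x) + x] at x = 2.0; analytic answer: sin(x) + x*cos(x) + 1
x = Dual(2.0, 1.0)          # seed dx/dx = 1
y = x * dsin(x) + x
print(y.val, y.dot)
```

For a posterior with thousands of parameters, repeating this seeded pass once per parameter is what makes forward mode expensive, and reverse mode the eventual goal Peluchetti mentions.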

Yummy Mr. P!


Chris Skovron writes:

A colleague sent the attached image from Indonesia.

For whatever reason, it seems appropriate that Mr. P is a delicious salty snack with the tagline “good times.”

Indeed. MRP has made the New York Times and Indonesian snack food. What more can we ask for?

A linguist has a question about sampling when the goal is causal inference from observational data

Nate Delaney-Busch writes:

I’m a PhD student of cognitive neuroscience at Tufts, and a question came recently with my colleagues about the difficulty of random sampling in cases of highly controlled stimulus sets, and I thought I would drop a line to see if you had any reading suggestions for us.

Let’s say I wanted to disentangle the effects of word length from word frequency on the speed at which people can discriminate words from pseudowords, controlling for nuisance factors (say, part of speech, age of acquisition, and orthographic neighborhood size – the number of other words in the language that differ from the given word by only one letter). My sample can be a couple hundred words from the English language.

What’s the best way to handle the nuisance factors without compromising random sampling? There are programs that can automatically find the most closely-matched subsets of larger databases (if I bin frequency and word length into categories for a factorial experimental design), but what are the consequences of having experimental item essentially be a fixed factor? Would it be preferable to just take a random sample of the English language, then use a hierarchical regression to deal with the nuisance factors first? Are there measures I can use to determine the quantified extent to which chosen sampling rules (e.g. “nuisance factors must not significantly differ between conditions”) constrain random sampling? How would I know when my constraints start to really become a problem for later generalization?

Another way to ask the same question would be how to handle correlated variables of interest like word length and frequency during sampling. Would it be appropriate to find a sample in which word length and frequency are orthogonal (e.g. if I wrote a script to take a large number of random samples of words and use the one where the two variables of interest are the least correlated)? Or would it be preferable to just take a random sample and try to deal with the collinearity after the fact?

I replied:

I don’t have any great answers except to say that in this case I don’t know that it makes sense to think of word length or word frequency as a “treatment” in the statistical sense of the word. To see this, consider the potential-outcomes formulation (or, as Don Rubin would put it, “the RCM”). Suppose you want the treatment to be “increase word length by one letter.” How do you do this? You need to switch in a new word. But the effect will depend on which word you choose. I guess what I’m saying is, you can see how the speed of discrimination varies by word length and word frequency, and you might find a model that predicts well, in which case maybe the sample of words you use might not matter much. But if you don’t have a model with high predictive power, then I doubt there’s a unique right way to define your sample and your population; it will probably depend on what questions you are asking.

Delaney-Busch then followed up:

For clarification, this isn’t actually an experiment I was planning to run – I thought it would be a simple example that would help illustrate my general dilemma when it comes to psycholinguistics.

Your point on treatments is well-taken, though perhaps hard to avoid in research on language processing. It’s actually one of the reasons I’m concerned with the tension between potential collinearity and random sampling in cases where two or more variables correlate in a population. Theoretically, with a large random sample, I should be able to model the random effects of item in the same way I could model the random effects of subject in a between-subjects experiment. But I feel caught between a rock and a hard place when on the one hand, a random sample of words would almost certainly be collinear in the variables of interest, but on the other hand, sampling rules (such as “generate a large number of potential samples and keep the one that is the least collinear”) undermine the ability to treat item as an actual random effect.

If you’d like, I would find it quite helpful to hear how you address this issue in the sampling of participants for your own research. Let’s say you were interested in teasing apart the effects of two correlated variables – education and median income – on some sort of political attitude. Would you prefer to sample randomly and just deal with the collinearity, or constrain your sample such that recruited participants had orthogonal education and median income factors? How much constraint would you accept on your sample before you start to worry about generalization (i.e. worry that you are simply measuring the fixed effect of different specific individuals), and is there any way to measure what effect your constraints have on your statistical inferences/tests?
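The "generate many candidate samples and keep the least collinear one" rule from this exchange is easy to prototype. The lexicon below is simulated (lengths and log-frequencies with a built-in negative correlation, loosely mimicking real word lists), so all the numbers are purely illustrative:

```python
import random

random.seed(0)

# Hypothetical lexicon: (word_length, log_frequency) pairs with a built-in
# negative correlation (longer words tend to be rarer).
lexicon = []
for _ in range(2000):
    length = random.randint(3, 12)
    logfreq = 8.0 - 0.5 * length + random.gauss(0, 1.5)
    lexicon.append((length, logfreq))

def correlation(pairs):
    """Pearson correlation between the two coordinates of a list of pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / (sxx * syy) ** 0.5

# Draw many candidate samples of 200 words; keep the least collinear one.
best = min((random.sample(lexicon, 200) for _ in range(500)),
           key=lambda s: abs(correlation(s)))

print("population |r|:", round(abs(correlation(lexicon)), 3))
print("selected   |r|:", round(abs(correlation(best)), 3))
```

Notably, when the population correlation is strong, every random subsample inherits most of it, so even the best of 500 candidates is only modestly decorrelated, and the selected sample is no longer a simple random sample, which is exactly the rock-and-a-hard-place tension Delaney-Busch describes.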

Stan NYC Meetup – Thurs, July 31

The next Stan NYC meetup is happening on Thursday, July 31, at 7 pm. If you’re interested, registration is required and closes on Wednesday night:


The third session will focus on using the Stan language. If you’re bringing a laptop, please come with RStan, PyStan, or CmdStan already installed.


We’re going to focus less on the math and more on the usage of the Stan language. We’ll cover:

• Stan language blocks

• Data types

• Sampling statements

• Vectorization


On deck this week

Mon: A linguist has a question about sampling when the goal is causal inference from observational data

Tues: The Ben Geen case: Did a naive interpretation of a cluster of cases send an innocent nurse to prison until 2035?

Wed: Statistics and data science, again

Thurs: The health policy innovation center: how best to move from pilot studies to large-scale practice?

Fri: The “scientific surprise” two-step

Sat, Sun: As Chris Hedges would say: It’s too hot for words

Stan 2.4, New and Improved

We’re happy to announce that all three interfaces (CmdStan, PyStan, and RStan) are up and ready to go for Stan 2.4. As usual, you can find full instructions for installation on the

Here are the release notes with a list of what’s new and improved:

New Features
* L-BFGS optimization (now the default)
* completed higher-order autodiff (added all probability functions,
  matrix functions, and matrix operators);  tested up to 3rd order
* enhanced effective range of normal_cdf to prevent underflow/overflow
* added von Mises RNG
* added ability to use scalars in all element-wise operations
* allow matrix division for mixing scalars and matrices 
* vectorization of outcome variates in multivariate normal with efficiency boosts
* generalization of multivariate normal to allow row vectors as means

* move bin/print and bin/stanc to CmdStan;  no longer generating main
  when compiling model from Stan C++

New Developer
* Added Jeffrey Arnold as core Stan developer
* Added Mitzi Morris as core Stan developer

Bug Fixes
* modified error messages so that they're all 1-indexed instead of 0-indexed
* fixed double print out of times in commands
* const added to iterators to allow VS2008 compiles
* fix boundary conditions on ordered tests
* fix for pow as ^ syntax to catch illegal use of vectors (which
  aren't supported)
* allow zero-length inputs to multi_normal and multi_student_t
  with appropriate log prob (i.e., 0)
* fixed bug in inverse-Wishart RNG to match MCMCPack results
  with slightly asymmetric inputs
* fixed problem with compiling user-defined function twice
* fixed problem with int-only parameters for user-defined functions
* fixed NaN init problems for user-defined functions
* added check that user variable doesn't conflict with user function + doc
* disallow void argument types in user-defined functions

Code Cleanup and Efficiency Improvements
* removed main() from models generated from C++ Stan (they are
  now available only in CmdStan); removed no_main command options
* reserve vector sizes for saving for sample recorder
* removing many instances of std::cout from API (more still to go)
* removed non-functional Nesterov optimization option
* optimization code refactoring for testing ease
* better constant handling in von Mises distribution
* removed tabs from all source files
* massive re-org of testing to remove redundant files and allow
  traits-based specializations, plus fixes for 1-indexing

* added tests for log_softmax, multiply_lower_tri_self_transpose, tcrossprod
* break out function signature tests into individual files, add many
* enhanced cholesky factor tests for round trip transforms and
* extensive unit testing added for optimization 
* remove use of std::cout in all tests

Example Models
* lots of cleanup in links and models in ARM examples
* added BUGS litter example with more stable priors than in the 
  BUGS version (the original model doesn't fit well in BUGS as is)

* add infix operators to manual
* categorical_logit sampling statement	
* Cholesky factor with unit diagonal transform example
* example of using linear regression for prediction/forecasting with
* clarified some relations of naive Bayes to clustering
  vs. classification and relation to non-identifiability
* new advice on multivariate priors for quad_form_diag
* fix typo in multiply_lower_self_transpose (thanks to Alexey Stukalov)
* fix formatting of reserved names in manual
* fixed typo and clarified effective sample size doc

Stan found using directed search


X and I did some “Sampling Through Adaptive Neighborhoods” ourselves the other day and checked out the nearby grave of Stanislaw Ulam, who is buried with his wife, Françoise Aron, and others of her family.


The above image of Stanislaw and Françoise Ulam comes from this charming mini-biography from Roland Brasseur, which I found here.

P.S. The Cimetière Montparnasse is full of famous people, including the great political scientist Raymond Aron (who perhaps has no close relation to Françoise Aron, given that his grave is listed as being in a different block of the cemetery), but the only other one we saw that day was Henri Poincaré.

NYC workshop 22 Aug on open source machine learning systems

The workshop is organized by John Langford (Microsoft Research NYC), along with Alekh Agarwal and Alina Beygelzimer, and it features Liblinear, Vowpal Wabbit, Torch, Theano, and . . . you guessed it . . . Stan!

Here’s the current program:

8:55am: Introduction
9:00am: Liblinear by CJ Lin.
9:30am: Vowpal Wabbit and Learning to Search (John Langford and Hal Daumé III)
10:00am: Break w/ refreshments
10:30am: Torch
11:00am: Theano and PyLearn (Bart van Merrienboer and Frederic Bastien)
11:30am: Stan (Bob Carpenter)
Noon: Lunch on your own
1:00pm: Summary: The systems as they exist
1:20pm: Panel: Can projects coordinate?
2:00pm: Break w/ refreshments
2:30pm: Discussion time for individual projects
4:00pm: Liblinear future plans
4:10pm: Vowpal Wabbit future plans
4:20pm: Torch future plans
4:30pm: Theano and PyLearn future plans
4:40pm: Stan future plans
4:50pm: Conclusion

It’s open to all (I think) as long as you let them know you’ll be coming; just email

“An Experience with a Registered Replication Project”

Anne Pier Salverda writes:

I came across this blog entry, “An Experience with a Registered Replication Project,” and thought that you would find this interesting.

It’s written by Simone Schnall, a social psychologist who is the first author of an oft-cited Psych Science(!) paper (“Cleanliness reduces the severity of moral judgments”) that a group of researchers from Michigan State failed to replicate as part of a big replication project.

Schnall writes about her experience as the subject of a “failed replication”. I like the tone of her blog entry. She discusses some issues she has with how the replication of her work, and the publication of that replication in a special issue of a journal, were handled. A lot of what she writes is very reasonable.

Schnall believes that her finding did not replicate because there were ceiling effects in a lot of the items in the (direct) replication of her study. So far so good; people can make mistakes in analyzing data from replication studies too. But then she writes the following:

It took me considerable time and effort to discover the ceiling effect in the replication data because it required working with the item-level data, rather than the average scores across all moral dilemmas. Even somebody familiar with a research area will have to spend quite some time trying to understand everything that was done in a specific study, what the variables mean, etc. I doubt many people will go through the trouble of running all these analyses and indeed it’s not feasible to do so for all papers that are published.

She may be talking about the peer-review process here, it’s not entirely clear to me from the context. But what am I to think of her own data-analysis skills and strategies if it takes her “considerable time and effort” to discover the ceiling effects in the data? Looking at the distribution of variables is the very first thing that any researcher should do when they start analyzing their data. Step 1: exploratory data analysis! If she’s right about how this aspect of the data accounts for the failure to replicate her original finding, of course that makes for an interesting story—we need to be critical of failed replications too. But the thought that there may be a substantial number of scientists who don’t look in detail at basic aspects of their data before they start working with aggregated data makes me shake my head in disbelief.

My reply:

On the details of the ceiling effect, all I can say is that I’ve made a lot of mistakes in data analysis myself, so if it really took Schnall longer than it should’ve to discover this aspect of her data, I wouldn’t be so hard on her. Exploratory analysis is always a good idea but it’s still easy to miss things.

But speaking more generally about the issues of scientific communication, I disagree with Schnall’s implication that authors of published papers should have some special privileges regarding the discussion of their work. Think about all the researchers who did studies where they made no dramatic claims, found no statistical significance, and then didn’t get published. Why don’t they get special treatment too? I think a big problem with the current system is that it puts published work on a plateau where it is difficult to dislodge. For example, “That is the assumption behind peer-review: You trust that somebody with the relevant expertise has scrutinized a paper regarding its results and conclusions, so you don’t have to.” My response to this is that, in many areas of research, peer reviewers do not seem to be deserving of that trust. Peer review often seems to be a matter of researchers approving other papers in their subfield, and accepting claims of statistical significance despite all the problems of multiple comparisons and implausible effect size estimates.

Schnall quotes Daniel Kahneman, who writes that if replicators make no attempts to work with authors of the original work, “this behavior should be prohibited, not only because it is uncollegial but because it is bad science. A good-faith effort to consult with the original author should be viewed as essential to a valid replication.” I have mixed feelings about this. Maybe Kahneman is correct about replication per se, where there is a well-defined goal of reproducing the exact original result, but I don’t think it should apply more generally to criticism.

Regarding the specifics of her blog post, she has a lot of discussion about how the replication differed from her study. This is all fine, but the flip side of this is: why should her original study be considered representative of the general population? Once you accept that effect sizes vary, this calls into question the paradigm of making quite general claims based on a study of a particular sample. Finally, I was disappointed that nowhere in her blog does she consider the possibility that maybe, just maybe, her findings were spurious. She writes, “I have worked in this area for almost 20 years and am confident that my results are robust.” The problem is that such confidence can be self-reinforcing: once you think you’ve found something, you can keep finding it. The rules of statistical significance give a researcher enough “outs” that he or she can persist in a dead-end research paradigm for a long time. Just to be clear, I’m not saying that’s what’s happening here—I know nothing about Schnall’s work—but it can happen. Consider, for example, the work of Satoshi Kanazawa.

That said, I understand and sympathize with Schnall’s annoyance at some of the criticisms she’s received. I’ve had similar feelings about criticisms of my own work: sometimes people point out actual and serious errors, sometimes mistakes of over-generalization I’ve made, other times ways in which I’ve misconstrued the literature. Just recently my colleagues and I made some changes in this paper because someone pointed out some small ways in which it was misleading. Luckily he told us while the paper was still being revised for publication. Other times I learn about problems later and need to issue a correction. But sometimes I do get misinformed criticisms, even published papers where someone misunderstands my work and slams it. And I don’t like it. So I see where she’s coming from.

Again, from a statistical perspective, my biggest problem with what Schnall writes is that she doesn’t seem to consider, at all, the possibility that her original results could be spurious and arise from some combination of measurement error and capitalization on noise which, looked at a certain way, might include some statistically significant comparisons. I worry that she is thinking in a binary way: her earlier paper was a success, the replication was a failure, so she seeks to explain the differences. There are certainly differences between any two experiments on the same general phenomenon—that’s why we do meta-analysis—but, as Kahneman and Tversky demonstrated four decades ago, researchers tend to systematically underestimate the importance of variation in interpreting experimental results.

Following up, Salverda writes:

In the meantime, I came across two papers. One is by Schnall (in press, I believe), a response to the paper that failed to replicate her work.

And the people who failed to replicate her work [David Johnson, Felix Cheung, and Brent Donnellan] wrote a paper that addresses her criticism.
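Coming back to Salverda's "step 1: exploratory data analysis" point: an item-level ceiling check takes only a few lines. Everything below is simulated for illustration (a hypothetical 1-7 moral-judgment scale; none of it is Schnall's or the replicators' actual data):

```python
import random

random.seed(42)

SCALE_MAX = 7          # hypothetical 1-7 rating scale
N_SUBJECTS = 100

# Simulated item-level ratings: item B is pushed up against the ceiling.
items = {
    "item_A": [min(SCALE_MAX, max(1, round(random.gauss(4.0, 1.5))))
               for _ in range(N_SUBJECTS)],
    "item_B": [min(SCALE_MAX, max(1, round(random.gauss(6.8, 1.0))))
               for _ in range(N_SUBJECTS)],
}

# Per-item ceiling report: mean and share of responses at the scale maximum.
for name, ratings in items.items():
    at_ceiling = sum(r == SCALE_MAX for r in ratings) / len(ratings)
    print(f"{name}: mean={sum(ratings)/len(ratings):.2f}, "
          f"at ceiling={at_ceiling:.0%}")
```

Averaging across items blends the two together; the per-item ceiling fraction makes the problem visible immediately, which is why working from item-level rather than aggregated scores matters here.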