Too good to be true: when overwhelming mathematics fails to convince

Gordon Danning points me to this news article by Lisa Zyga, “Why too much evidence can be a bad thing,” reporting on a paper by Lachlan Gunn and others. Their conclusions mostly seem reasonable, if a bit exaggerated. For example, I can’t believe this:

The researchers demonstrated the paradox in the case of a modern-day police line-up, in which witnesses try to identify the suspect out of a line-up of several people. The researchers showed that, as the group of unanimously agreeing witnesses increases, the chance of them being correct decreases until it is no better than a random guess.

This doesn’t make sense. I have a feeling their conclusion is leaning heavily on some independence assumption in their model.

I clicked through to see the paper, and I don’t see any actual data on police lineups. So I see no reason to trust them on that. The math is interesting, though, and I’ll agree there’s some relevance to real problems. I’m just disturbed by everyone’s willingness to assume the particular mathematical results apply to particular real scenarios.

1. Jim says:

I don’t know much about lineups, but the intuition is sensible. I think the analogy in the political science context is that I’ll generally take p = 0.001 as strong evidence but p = 0.000001 as evidence the model is probably misspecified.

2. Jonathan says:

I have to object first to the article: it isn’t true that a unanimous guilty verdict resulted in acquittal. It was, rather, that the Talmud contains this statement attributed to Rabbi Kahana and in the context that a case always has two sides, that a unanimous verdict means there is no further argument. This is generally taken to mean that the legal process should always be open to hearing new evidence, not that a unanimous verdict is wrong. We don’t even know what Kahana was referring to as a verdict, which leads some to say rationally that he meant until you’ve exhausted appeals because otherwise there can be no justice.

There is no paradox of unanimity, just that unanimous today may be overturned by new evidence tomorrow. The odds of unanimous error vary by context, which can be seen in basic police statistics. Decades ago, we did a review of case outcomes in our prosecutor’s office (in a large Midwestern city) and found that 93-94% of those arrested pleaded guilty, and of the remaining 6% or so we won about half the trials. That’s a pretty good real life multi-level model: most of those arrested were caught red-handed (like in a car with a gun or the equivalent) and others had the kind of records that meant they risked more by going to trial than the offered bargain – like 10 years hard time versus 2 with time served, meaning out in a year, maybe not even in real prison, maybe avoiding 3x offender sentencing laws. The ones who went to trial generally had proof issues and that included identification and whether the elements of the crime were met. Example: people can ID someone but it was dark and at a distance or people can ID you because they know you and only an idiot thinks those are the same, which isn’t a paradox of unanimity at all. As to elements, if you need intent for a crime, you can argue you didn’t intend and those tools were because you were helping a friend or were scrapping in alleys. And every once in a while someone would go to trial because they had nothing to lose; they were looking at big numbers, like 20 or 30 years, so why not roll the dice?

• Curious says:

And you feel the assumption of ‘guilt’ emerging from negotiation tactics which include the threat of long term incarceration set next to a lesser penalty is free from error?

• Jonathan says:

How is that close to what I said? We checked the results of what happened in our office over a period of a few years. This is what we found. I’m trying to read your comment as something other than an aspersion on the justice system but that’s difficult because you don’t understand how it works.

In our office, we indicted generally by information, though some states and localities use only grand juries. This meant prosecutors reviewed files and on more difficult or important cases talked to officers and witnesses to form a set of charges plus the listing of justifications that met the legal standard for each charge. That then led to a preliminary hearing at which we had to demonstrate probable cause on 3 elements: that a crime was committed by this person within the jurisdiction of this court. When the person was bound over for trial, that would start pre-trial negotiation. All stages after indictment always involved defense counsel but more to the point I only saw a handful of cases where the defendant didn’t make his (occasionally her) own decisions because, bluntly, they can figure odds too. You might as a prosecutor also approach a defendant and his counsel before the preliminary exam to discuss a deal and I would say all of those cases involved guilty people, almost always charged with multiple felonies, who used the overloading of the courts and jails to get deals for lesser time. The negotiations were actually tilted heavily in the defendants’ favor in terms of punishment because our city had lots of crime and we couldn’t punish everyone.

Suburban courts were much tougher and there was conflict between their belief that strictness suppressed crime and our belief – rather knowledge – that they lived in worlds with less crime and were inappropriately thinking their approach would work in our world. You saw this from bail to sentencing because they would demand cash, which defendants don’t have, while we’d be forced by volume to take bonds with sureties to offload “risk” to the bondsmen.

Reality is contained in our numbers: a) our office couldn’t handle more trials, most of which were bench not jury btw – given that we and the judges were behind running the system as it was, b) any “tough on crime” policy would have specific consequences, such as i) need for more prosecutors and judges and thus courtrooms, which won’t happen, so more backlog, longer pre-trial waits, ii) more backlog equals more detention means freeing more people to clear space, means more actual criminals on the streets than before the “get tough” policy, iii) prosecution negotiation leverage is lowered because they know you don’t have the time/resources to try a guy for multiple B&Es (except when the detention system has the guy and he’s looking at sitting in jail indefinitely, given that jail is generally worse than prison for comfort and physical safety), iv) that judges will clear dockets by dismissing cases – this happens out of the public eye – thus reducing trial success rates, v) that cases will get stale as witnesses move, forget, get intimidated, etc. and so on.

The point, other than to get a handle on results, was to try to understand what changes here when you push on things over there. All approaches have error.

• Curious says:

I am not dismissing most of what you say and agree with the differences you observe between the beliefs of judges in high crime areas versus low crime areas. But I am challenging what I believe is a common error in the way people directly involved in the legal justice system think about what they ‘know to be true’.

What I have both observed and experienced in my interactions with police and the legal justice system is that there exist levels of assumed certainty about situations, defendants, and about witnesses that simply are not possible given what we know about the nature of social reality, human perception and cognition.

Perhaps you are correct that in your city the only cases that made it that far into the system were for guilty defendants, but from what I know about social reality, human perception and cognition I find that assumption difficult to believe. I have experienced police using illegal tactics and pretending they didn’t do it. I have observed police trying to intimidate teenagers because they did not like the way the teenagers were looking at them. I have experienced fines for very minor violations that for someone without my means would have meant not being able to pay rent or buy food for the month.

Narratives often give the appearance of certainty which is unwarranted. Simple statistics can give the same appearance when not properly described. Appropriate modeling/handling of error and uncertainty is fundamentally important for getting things right in statistics. I think that is also true in other areas of life.

3. mark k says:

The “when a party wins by a small margin” example is especially bad. 48% of the people voting for an egregiously “bad” candidate is not a good sign; you’d much rather it be 38%.

Now, if only 52% of the people disagree with you but the returns show that your side won with a 25-point margin, I’d agree there’s a very serious issue. But they seem to be confusing way too many things here.

4. David Condon says:

“I’m just disturbed by everyone’s willingness to assume the particular mathematical results apply to particular real scenarios.”

Reminds me of computationalism

5. Jacob says:

>Their conclusions mostly seem reasonable, if a bit exaggerated

Agreed, and apparently that’s best-case scenario for science reporting these days.

There are a few ideas they mention.

* If errors are truly random and independent, that means in a large sample we should expect a few of them to happen. For the witness misidentification example, if we (generously) estimate the probability of a witness picking the wrong suspect at 10%, that means with 10 witnesses we expect about one person to get it wrong. As the number of witnesses W goes to infinity, the fraction getting it wrong should converge to about 10%. If unanimous agreement still persists with very large W, it means one of the assumptions was wrong. I’m not sure if this is what you meant by having an incorrect independence assumption in their model, but that seems to be the point they’re trying to make as well. None of this should be that controversial; it’s just a restatement of “too good to be true”.

* Possible error modes which get ignored in some circumstances can dominate in others. The authors use the example of labs testing pots to determine ancient country of origin. The tests have a random error rate of 30%, so they do a bunch and go by majority rule. Except 1% of ancient pots were contaminated in such a way that causes these tests to systematically fail. 30% >> 1% so for one test we can just ignore that possibility. But if 10 labs each do their own test and we get (7,3), that matches the random error. If instead we get (10,0), observe that this result only has probability 3% under the random model. (100,0) is virtually impossible if errors are random.
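The arithmetic in the two bullets above can be checked with a short Python sketch. It assumes errors are independent and that each witness or test is right with a fixed probability; the function name `prob_unanimous_correct` is made up for illustration, and the 10% and 30% error rates are simply the ones quoted above.

```python
def prob_unanimous_correct(error_rate: float, n: int) -> float:
    """Probability that all n independent witnesses (or tests) are correct."""
    return (1.0 - error_rate) ** n


# Witness example: assumed 10% error rate, 10 witnesses.
expected_wrong = 0.10 * 10                       # about one witness expected to be wrong
p_witnesses = prob_unanimous_correct(0.10, 10)   # chance that nobody errs

# Pot example: assumed 30% random error rate per lab test.
p_pots_10 = prob_unanimous_correct(0.30, 10)     # the (10,0) outcome, about 3%
p_pots_100 = prob_unanimous_correct(0.30, 100)   # the (100,0) outcome, vanishingly small

print(f"expected wrong witnesses out of 10: {expected_wrong:.1f}")
print(f"P(all 10 witnesses right): {p_witnesses:.3f}")
print(f"P(10 of 10 tests agree, random errors only): {p_pots_10:.3f}")
print(f"P(100 of 100 tests agree, random errors only): {p_pots_100:.2e}")
```

The point of the comparison is that a 1% systematic failure mode, negligible next to a 30% random error on a single test, is no longer negligible once the purely random explanation for unanimity has itself dropped to a few percent or less.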

Regarding police lineups, I doubt there are ever enough witnesses for this result to ever apply.

Somewhat related in spirit is this paper on unanimous voting in juries: http://www.kellogg.northwestern.edu/faculty/fedderse/homepage/papers/jury.pdf. Again, pure math, no empirical work, so make of that what you will. The crux of the paper is:

>The incentive to vote strategically arises because a juror’s vote only matters when a vote is pivotal and because the information possessed by other jurors is relevant for a juror’s decision. For example, under the unanimity rule a vote is pivotal only when all the other jurors have voted to convict. The fact that all the other jurors have voted to convict reveals additional information about the guilt of the defendant: it reveals, at least in part, the other juror’s private assessment of the case and cause a juror otherwise inclined to vote for acquittal to vote for conviction instead.

• Thanks Jacob. Came here to mention that unanimous jury situation, or the situation where people are less inclined to help someone when there are more bystanders. Then, I just second the last part of what you said.

6. Alex Tabarrok says:

The paper makes sense. The intuition (similar to Jim at 9:57 am) is that if you flip a coin ten times and it comes up heads ten times then you may want to conclude that the coin is biased. Similarly, we would expect some differences in witness statements just due to chance and timing alone. But if all the witnesses say the same thing we may expect bias.

7. Maetenloch says:

I read (okay skimmed in detail) the paper shortly after it came out and I think they are on to something. As I recall they relied on previous studies showing that eye witnesses identified the wrong person in simulated police line ups about 50% of the time. This is with assumed independent, non-biased witness selections. But if you allow for the possibility of something ‘hinky’ (bias, contamination, deliberate placement of the suspect, etc) in police lineups, then the probability of identifying the ‘correct’ suspect would be expected to go up to 90% or higher.

So if you have a unanimity of choice among witnesses, you can treat it as a test of two hypotheses – whether p=.50 or p=.90 where p is the probability of correctly identifying the guilty suspect. What they show is that with a ‘hinkyness’ rate of even 1% as the number of unanimous selections goes above a small number, the probability of hypothesis I being true falls (and correspondingly the probability of some kind of hinkyness going on rises). Once you get to around 10 to 15 unanimous choices the probability of hypothesis I (nothing hinky going on) being true falls to only 50%.
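The two-hypothesis comparison described here can be written out with Bayes' rule. What follows is only a sketch of the simplified version in this comment (p = 0.5 if nothing hinky is going on, p = 0.9 if something is, and a 1% prior on hinkiness), not the model in the paper itself, so its numbers need not match the paper's. The function name and parameter names are hypothetical.

```python
def posterior_nothing_hinky(n: int,
                            p_clean: float = 0.5,
                            p_hinky: float = 0.9,
                            prior_hinky: float = 0.01) -> float:
    """Posterior probability of hypothesis I (nothing hinky) after
    n out of n witnesses all pick the suspect."""
    # Prior times likelihood of n unanimous picks under each hypothesis.
    weight_clean = (1.0 - prior_hinky) * p_clean ** n
    weight_hinky = prior_hinky * p_hinky ** n
    return weight_clean / (weight_clean + weight_hinky)


# Print how the posterior for hypothesis I changes as unanimity grows.
for n in range(0, 16):
    print(n, round(posterior_nothing_hinky(n), 3))
```

Each additional unanimous witness multiplies the odds in favor of hinkiness by p_hinky / p_clean, so the posterior for hypothesis I falls geometrically; where exactly it crosses 50% depends strongly on the three assumed numbers.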

• The only thing hinky there is the assumption of independent binomial sampling. All you need is that some feature of the person in the lineup is salient and easy to pick up on in the standard facial-recognition machinery hard-wired into our brains, and you have a common non-independent cause for confusion. It doesn’t require anything done wrong.

• Jonathan (another one) says:

Exactly. Imagine there are 100 eyewitnesses, and there are two suspects, one of whom is blond and the other has black hair. More (independent) unanimity gives you more confidence…. Any counter findings are indicative of error. No counter findings are simply an indication that everyone saw blond hair.

• Clyde Schechter says:

Speculating far from my expertise, but it would seem to me that the implication of this is that if there are two suspects, one of whom has blond hair and the other black, then a line-up, to be probative, needs to include several blonds and several black-haired subjects, enough so that the “identification” does not reduce to a single salient feature. That is, the lineup needs to be constructed so that there are N people in the lineup, and the probability of a correct identification by a person who does not actually know who the culprit is will be low. Perhaps due to highly salient features 1/N cannot be achieved, but something in that ballpark is required. In that situation, the salient feature effects are reduced and the calculations proposed by those researchers, with suitably modified parameters, carry through. In fact, if it were actually known in advance that the culprit has blond hair, then the lineup should contain only blonds.

• Jonathan (another one) says:

What you say is of course correct if, as is generally the case, the goal is to maximize the possibility of misidentification. But in my example we have a small set of actually potentially guilty parties, and maximizing misidentification is a bad idea. Maybe that’s a good way to state the mathematical idea: if you have set up a test to maximize misidentification, aiming at a probability p per assessment, ending up with a large unanimous set is an indication that you have failed.

• Sure, if you’re trying to determine whether some particular person is the one eyewitnesses saw, you should try to pick 5 or 6 other people who have similar general appearance, but at some point that stops. For example, do you need to match Tattoos? Scars? Deviated nose? Piercings? Distinctive Haircuts / hair dye? Cheekbone structure? Facial hair and color?

There’s a lot that can be said for doing maybe a 2 stage lineup. The first one would have people of all different body types, and NOT have the suspect. Ask the identifier to identify the person who looks “most like” the suspect. If they identify someone wildly different then say thanks, and go no further; if they identify the “prototype” for the person you suspect, then bring in a matched group that looks very similar to the suspect and include the suspect. This of course is more complicated, but if you jump straight to people who look near-identical to the suspect it already is a lot of information being given to the identifier. You couldn’t identify a witness who was totally clueless that way.

8. Jameson Burtr says:

There are arenas where near perfect agreement strongly hints at mischief (lack of independence or mucking with the data).
For example,
a. 100% of employees contribute to a company sponsored non-profit.
One suspects coercion.
b. Over 75% vote for one person in an election.
One suspects a stuffed ballot box, or finagled electronic election machines.
c. Ten percent of graduates in nursing are male, but 50% of some hospital employees are male.
One suspects inappropriate attempts at equality, although this might be appropriate at the Veterans Administration if 90% of patients are male.
d. Everyone at a table chooses lasagna.
One imagines people who can’t read, or people with little self motivation.

In a lineup of 3 people, the statistician hasn’t a lot to say.