Calibration in chess

Daniel Kahneman posted the following on the Judgment and Decision Making site:

Have there been studies of the calibration of expert players in judgments of chess situations — e.g., probability that white will win?

In terms of the amount and quality of experience and feedback, chess players are at least as privileged as weather forecasters and racetrack bettors — but they don’t have the experience of expressing their judgments in probabilities. I [Kahneman] am guessing that the distinction between a game that is “certainly lost” and “probably lost” is one that very good players can make reliably, but I know of no evidence.

Despite knowing much less about decision making and (likely) less about chess than Kahneman, I have three conjectures:

1. Players would show superadditivity, in the sense of overstating their own chances of winning. To put it another way, suppose that both players in a game give you Pr(I win), Pr(I tie), Pr(I lose). Call these W1, W2, W3 (for white) and B1, B2, B3 (for black). My conjecture is that W1 + B1 > W3 + B3; that is, the total “I win” probability exceeds the total “I lose” probability. It would be interesting to see this on average, and also for individual games and at different points within a game.

2. Players would show the usual overconfidence in probability statements: for example, events stated to happen 90% of the time actually happening only 75% of the time, and so forth. (A sketch of how this and conjecture 1 could be checked against elicited data appears after conjecture 3.)

3. Aspects of both points above might be explained by the idea that chess players, like the rest of us, tend to make their probability statements about the ideal, rather than the actual, game outcome. For example, suppose you were to do a study to measure probability judgments and find the generically expected overconfidence: when players predict a 99% chance of victory, it happens only 90% of the time, or whatever. On the 10% of occasions when the prediction is wrong, I could imagine the player explaining the loss away as some blunder that “wasn’t supposed to happen” and so shouldn’t count.
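
To make conjectures 1 and 2 concrete, here is a minimal sketch (in Python, on made-up numbers) of how they could be checked once players' elicited probabilities and the game results were recorded; the data, field names, and bin width are purely illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical elicited data: each record holds one game's stated probabilities,
# (W1, W2, W3) = white's Pr(win, draw, loss) and (B1, B2, B3) = black's
# Pr(win, draw, loss), each from the player's own side, plus the actual result.
games = [
    {"white": (0.55, 0.30, 0.15), "black": (0.35, 0.40, 0.25), "result": "draw"},
    {"white": (0.70, 0.20, 0.10), "black": (0.25, 0.35, 0.40), "result": "white"},
    {"white": (0.40, 0.35, 0.25), "black": (0.45, 0.35, 0.20), "result": "black"},
]

# Conjecture 1 (superadditivity): total "I win" mass vs. total "I lose" mass.
for g in games:
    w1, _, w3 = g["white"]
    b1, _, b3 = g["black"]
    print(f"I-win total = {w1 + b1:.2f}, I-lose total = {w3 + b3:.2f}")

# Conjecture 2 (calibration): bin the stated win probabilities and compare the
# average stated probability in each bin with the observed win frequency.
bins = defaultdict(list)
for g in games:
    stated = g["white"][0]                        # white's stated Pr(I win)
    won = 1.0 if g["result"] == "white" else 0.0
    bins[round(stated, 1)].append((stated, won))

for b in sorted(bins):
    pairs = bins[b]
    mean_stated = sum(p for p, _ in pairs) / len(pairs)
    observed = sum(w for _, w in pairs) / len(pairs)
    print(f"bin {b:.1f}: stated {mean_stated:.2f} vs. observed {observed:.2f} (n={len(pairs)})")
```

With real data you would want many games per bin, and you could run the same checks separately by rating class or by stage of the game; the point is only that both conjectures need nothing more than the elicited triples and the outcomes.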

Similarly, each player’s probability of winning can be calculated before the game even starts, based on who is playing white, who is playing black, and the two players’ ratings (see here); but I would imagine that each player overestimates his or her own winning probability, thinking “this time I’ll play harder” or something similar.
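
For reference, the standard Elo formula gives an expected score rather than a pure win probability (a draw counts as half a point), so it conflates winning and drawing chances. A minimal sketch, treating the first-move advantage as a fixed rating bonus for white (the 35-point figure below is an assumption, not a standard constant), might look like this:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B.

    Returns a value in (0, 1); a win counts 1, a draw 0.5, a loss 0,
    so winning and drawing chances are folded together.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Illustrative assumption: fold the advantage of moving first into a fixed
# rating bonus for white.
WHITE_BONUS = 35

def expected_score_as_white(white_rating: float, black_rating: float) -> float:
    return elo_expected_score(white_rating + WHITE_BONUS, black_rating)

print(expected_score_as_white(2400, 2300))  # roughly 0.68
```

Comparing players' pre-game estimates against this kind of rating-based baseline would be one way to quantify the "this time I'll play harder" effect.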

This ties in a bit with the “is vs. should,” or descriptive vs. normative, distinction in decision analysis. I think it would be natural to assess the chances of winning in the well-fought game of the player’s imagination rather than in the calibrated empirical world of all realistic possibilities.

Anyway, it would be fun to see the data. And I’m probably being overconfident about my own conjectures above.

Update from the comments

In the comments, Smiley and James suggest that chess players evaluate the position rather than the players. This would lead to classical overconfidence (bias #2 above), because evaluation of “the position” would tend to imply near-optimal play and would discount the possibility of blunders, or simply of aspects of the position that are not noticed. It would also lead to superadditivity through some version of the endowment effect (overvaluing my position because it’s mine). Koray points out that chess programs perform these evaluations automatically, so maybe these could be compared to players’ personal evaluations.
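
One practical wrinkle in Koray's suggestion: engines report evaluations in pawns or centipawns rather than probabilities, so comparing them with players' stated chances requires some mapping between the two scales. A logistic curve is one natural choice; the functional form and the scale constant in this sketch are illustrative assumptions, not values taken from any particular engine, and a real comparison would want to fit them to game outcomes.

```python
def eval_to_expected_score(centipawns: float, scale: float = 400.0) -> float:
    """Map an engine evaluation (centipawns, from white's point of view)
    to an expected score for white on a 0-1 scale.

    Both the logistic form and the scale constant are assumptions made
    for illustration, not properties of any specific engine.
    """
    return 1.0 / (1.0 + 10.0 ** (-centipawns / scale))

for cp in (0, 100, 300, -200):
    print(f"{cp:+5d} cp -> expected score {eval_to_expected_score(cp):.2f}")
```

Players' stated probabilities and the engine's mapped evaluations could then be plotted on the same 0-1 scale over the course of a game.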

And Lemmus points out that you could have observers other than the players make the probability evaluations as well: some observers who are watching the games, others who know something about the players but aren’t watching live, and others who see only the position (and possibly how the players got there).

10 thoughts on “Calibration in chess”

  1. Interesting post. I agree with Andrew's intuition that good chess players are likely to overestimate the probability of a win in a given situation. As a tournament chess player (and someone who follows games among top players), my sense is that an overestimation of winning probability has two sources: an inherent miscalibration in evaluating a position, and the potential for making errors in playing the game. The latter source is, I believe, what Andrew cites as blunders that weren't supposed to happen. The former source is consistent with the notion that when a position looks strong, it is sometimes difficult to appreciate that defensive resources exist.

    I also agree (apparently I'm very agreeable) with Kahneman's assertion that strong players have developed the tools to distinguish between certainly and probably won positions. I seem to recall (though I cannot cite specifically) work pointing out that the major difference between strong and not-so-strong tournament players lies in their ability to evaluate positions, not in their ability to calculate sequences of moves. So I would conjecture that top players would be better able than "ordinary" tournament players to provide accurate probability estimates of winning, particularly in positions that did not require extensive calculation, and would not necessarily provide more accurate estimates for tactical positions (i.e., of the sort that did require calculation).

    – Mark

  2. I don't follow the question. What does "probability that white will win" mean? White played by Kasparov, or by me? Do we assume that the chess expert also knows the two players?

    Chess programs already assign a score to the entire board at every step. The score may not be perfect, but they seem to do very well against humans.

  3. IMO and IME chess players generally evaluate the position rather than the players: their judgement about a position being "a win for white" is made with the assumption of best play on both sides, rather than based on the skills of the players (indeed the concept of evaluating a position requires this approach). Therefore this claim does not directly translate into a prediction of the result, and a blunder certainly does nothing to invalidate the judgement.

    Of course this doesn't necessarily invalidate your comments, on the occasions when players actually do predict outcomes.

  4. There is an interesting discussion of the psychology of chess that touches on your points in the book "Chess for Zebras" by Jonathan Rowson. For instance, a lot of good players don't think at all about the actual evaluation of the position during a game, or think about it only very vaguely in the background. Also, some players come to the game just wanting to play and for the game to go on; for some (though not all), a strong will to win can even be overwhelming and damaging. A more common question that chess players ask themselves, however, is 'how many results am I playing for?' That is: a draw; a win, but maybe a draw; or is a loss a possibility too? Sometimes this helps composure and orientation during a game.

    I think people only occasionally put concrete probabilities on game results, and this happened mostly in the past, when we still had adjournments at the top level. So most of what you refer to doesn't happen.

    Hope this helps.

  5. Interesting post. I'm curious about something. There seems to be a general impression that people are overconfident in their probabilistic estimates of lots of things. There is something else I know of that is overconfident in its probabilistic estimates of lots of things: a naive Bayes classifier. I wonder if (psychologically) there is any connection. Are humans just bad at realizing things like "yes, factor A and factor B both make a positive outcome more likely, but A and B are far from independent"? Has anyone tried to look at this formally? (I know this is getting away from stats, but…) A small numerical sketch of this double-counting effect appears after the comments.

  6. 'Are humans just bad at realizing things like "yes, factor A and factor B both make a positive outcome more likely, but A and B are far from independent"? Has anyone tried to look at this formally?' I'm betting someone has looked at this (possibly Kahneman himself). I've certainly seen it advanced as an explanation of human judgements. For example, Steve Payne published a Human Factors paper on naive judgements of usability that proposed that people make usability judgements (e.g., A > B) that treat different features as additive. This causes them problems because consistency (both internal and external) has a huge impact on usability.

  7. "In the comments, Smiley and James suggest that chess players evaluate the position rather than the players. This would lead to classical overconfidence (bias #2 above) because evaluation of the position would tend to imply near-optimal play and would discount the possibility of blunders or simply of aspects of the position that are not noticed."

    But chess players realize that evaluating the position and evaluating their probability of winning are different. Chess players try to play accurately by evaluating possible positions that could arise. They rarely think about the probability that they will win.

  8. I agree with what was said above. You need to distinguish 3 types of evaluations of positions:
    1. Theoretical or problem-solving evaluation: assuming optimal play. Here probabilities are pretty difficult to conceptualize really.
    2. Evaluation while playing: I can't really comment on that, as I do not play, but I can see overconfidence being an issue here, as well as the idea that blunders invalidate the prediction; you often hear people say 'I had the win' when they ended up losing.
    3. Observing an actual game/position as a third party (which I believe was the original question): here, in addition to the position, an observer would also take into account:
    – the two players' ratings, which can also be seen as the baseline for the calculation of the probabilities
    – knowledge of the players' strengths, weaknesses, playing styles, and familiarity with certain opening positions and structures
    – the time on the clock, which hasn't been mentioned yet: even if someone is in a good position, they might not have time to execute their plan or to figure out a winning manoeuvre.
    I believe that good players would be able to make good estimates of the outcomes given all this information. What might help get them to express their expectations accurately, though (seeing as humans tend to be bad at expressing probabilities as percentages), would be to make them bet money at specific odds. If anyone is planning on doing this experiment, I would say this would be a more accurate measure of what their expectation really is. And now, come to think of it, this is nothing more than a question about sports betting!

    ps. ditto on being overconfident about my conjectures ;)
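
On the naive Bayes point raised in comments 5 and 6, here is a small numerical sketch (illustrative only) of how treating correlated evidence as independent pushes posterior probabilities toward the extremes: the "second feature" below is just a copy of the first, so it carries no new information, yet the independence assumption counts it twice.

```python
# Two equally likely classes A and B, and one binary feature F with
# P(F = 1 | A) = 0.8 and P(F = 1 | B) = 0.4. The "second feature" is an
# exact duplicate of F, so it adds nothing; the correct posterior uses F
# once, while naive Bayes multiplies its likelihood in twice.
p_f_a, p_f_b = 0.8, 0.4
prior_a = prior_b = 0.5

# Correct posterior P(A | F = 1), ignoring the redundant duplicate.
correct = (p_f_a * prior_a) / (p_f_a * prior_a + p_f_b * prior_b)

# Naive Bayes posterior, treating the duplicate as independent evidence.
naive = (p_f_a**2 * prior_a) / (p_f_a**2 * prior_a + p_f_b**2 * prior_b)

print(f"correct posterior:     {correct:.3f}")  # 0.667
print(f"naive Bayes posterior: {naive:.3f}")    # 0.800
```

Applied to chess, tallying several highly correlated features of a position as if they were independent pieces of evidence would produce the same kind of overconfidence discussed above.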
