Yesterday we had a spirited discussion of the following conditional probability puzzle:
“I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?”
This reminded me of the principle, familiar from statistics instruction and the cognitive psychology literature, that the best way to teach these sorts of examples is through integers rather than fractions.
For example, consider this classic problem:
“10% of persons have disease X. You are tested for the disease and test positive, and the test has 80% accuracy. What is the probability that you have the disease?”
This can be solved directly using conditional probability but it appears to be clearer to do it using integers:
Start with 100 people. 10 will have the disease and 90 will not. Of the 10 with the disease, 8 will test positive and 2 will test negative. Of the 90 without the disease, 18 will test positive and 72 will test negative. (72 = 0.8*90.) So, out of the original 100 people, 26 have tested positive, and 8 of these actually have the disease. The probability is thus 8/26, or about 31%.
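Here is a minimal Python sketch of the same head count, in case it helps to see the bookkeeping spelled out. The function name and its parameters are my own illustration, and it assumes (as the worked example does) that "80% accuracy" means the test is right 80% of the time for both the people with the disease and the people without it.

```python
from fractions import Fraction


def test_counts(population=100, base_rate=Fraction(1, 10), accuracy=Fraction(8, 10)):
    """Split a population into the four disease/test cells, as head counts.

    Assumes everyone in the population is tested and that the test is
    correct with the same probability for diseased and healthy people.
    """
    diseased = population * base_rate
    healthy = population - diseased
    true_positives = diseased * accuracy        # diseased, test positive
    false_negatives = diseased - true_positives  # diseased, test negative
    true_negatives = healthy * accuracy          # healthy, test negative
    false_positives = healthy - true_negatives   # healthy, test positive
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "true_negatives": true_negatives,
    }


counts = test_counts()
all_positives = counts["true_positives"] + counts["false_positives"]
prob_disease_given_positive = Fraction(counts["true_positives"], all_positives)

print(counts)
print(prob_disease_given_positive)  # Fraction reduces 8/26 to lowest terms
```

Using `Fraction` keeps every cell an exact count, which is the whole point of working with integers rather than decimals.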
OK, fine. But here’s my new (to me) point. Expressing the problem using a population distribution rather than a probability distribution has an additional advantage: it forces us to be explicit about the data-generating process.
Consider the disease-test example. The key assumption is that everybody (or, equivalently, a random sample of people) is tested. Or, to put it another way, we’re assuming that the 10% base rate applies to the population of people who get tested. If, for example, you get tested only if you think it’s likely you have the disease, then the above simplified model won’t work.
This condition is a bit hidden in the probability model, but it jumps out (at least, to me) in the “population distribution” formulation. The key phrases above: “Of the 10 with the disease . . . Of the 90 without the disease . . . ” We’re explicitly assuming that all 100 people will get tested.
Similarly, consider the two-boys example that got our discussion started. The crucial unstated assumption was that, every time someone had exactly two children, at least one of them a boy born on a Tuesday, he would give you this information. It’s hard to keep this straight, given the artificial nature of the problem and the strange bit of linguistics (“I have two children” = “exactly two,” but “One is a boy” = “at least one”). But if you do it with a population distribution (start with 4×49 = 196 families and go from there), then it’s clear that you’re assuming that everyone in this situation is telling you this particular information. It becomes less of a vague question of “what are we conditioning on?” and more clearly an assumption about where the data came from.
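To make the 4×49 population concrete, here is a rough Python sketch that lists all 196 families and counts them. The modeling choices are assumptions on my part, not part of the puzzle as stated: each child is equally likely to be a boy or a girl and equally likely to be born on any of the seven weekdays, the two children are independent, and every parent with at least one Tuesday-born boy makes the statement. The variable names are just labels I picked.

```python
from fractions import Fraction
from itertools import product

SEXES = ("boy", "girl")
DAYS = range(7)   # seven weekdays, labeled 0..6
TUESDAY = 1       # arbitrary label; any single day gives the same counts

# One child = (sex, weekday): 2 * 7 = 14 equally likely types.
children = list(product(SEXES, DAYS))

# One family = (older child, younger child): 14 * 14 = 4 * 49 = 196 families.
families = list(product(children, repeat=2))


def is_tuesday_boy(child):
    return child == ("boy", TUESDAY)


# Assumption: every family with at least one Tuesday-born boy reports it.
reporting = [f for f in families if any(is_tuesday_boy(c) for c in f)]

# Among the reporting families, those where both children are boys.
two_boys = [f for f in reporting if all(sex == "boy" for sex, _ in f)]

prob_two_boys = Fraction(len(two_boys), len(reporting))
print(len(families), len(reporting), len(two_boys), prob_two_boys)
```

The line that builds `reporting` is where the data-generating assumption lives. Change the rule for which parents speak up and you change the list, and with it the answer.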