
Another argument in favor of expressing conditional probability statements using the population distribution

Yesterday we had a spirited discussion of the following conditional probability puzzle:

“I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?”

This reminded me of the principle, familiar from statistics instruction and the cognitive psychology literature, that the best way to teach these sorts of examples is through integers rather than fractions.

For example, consider this classic problem:

“10% of persons have disease X. You are tested for the disease and test positive, and the test has 80% accuracy. What is the probability that you have the disease?”

This can be solved directly using conditional probability but it appears to be clearer to do it using integers:

Start with 100 people. 10 will have the disease and 90 will not. Of the 10 with the disease, 8 will test positive and 2 will test negative. Of the 90 without the disease, 18 will test positive and 72 will test negative. (72 = 0.8*90.) So, out of the original 100 people, 26 have tested positive, and 8 of these actually have the disease. The probability is thus 8/26, or about 31%.
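The same tally can be written out as a few lines of code. This is only a sketch of the counting argument above: the function name and dictionary keys are made up for illustration, and "80% accuracy" is read, as in the text, to mean that 80% of diseased people test positive and 80% of healthy people test negative.

```python
from fractions import Fraction


def tally_everyone_tested(population, base_rate, accuracy):
    """Split a population into the four test outcomes, assuming everyone is tested."""
    sick = population * base_rate
    healthy = population - sick
    return {
        "true_pos": sick * accuracy,            # sick, test positive
        "false_neg": sick * (1 - accuracy),     # sick, test negative
        "false_pos": healthy * (1 - accuracy),  # healthy, test positive
        "true_neg": healthy * accuracy,         # healthy, test negative
    }


# The numbers from the example: 100 people, 10% base rate, 80% accuracy.
counts = tally_everyone_tested(100, Fraction(1, 10), Fraction(4, 5))
positives = counts["true_pos"] + counts["false_pos"]
p_disease_given_positive = counts["true_pos"] / positives

print(positives)                 # number of people who test positive
print(p_disease_given_positive)  # 8/26 in lowest terms
```

Using `Fraction` keeps every count an exact whole number, which matches the spirit of working with integers rather than decimals.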

OK, fine. But here’s my new (to me) point. Expressing the problem using a population distribution rather than a probability distribution has an additional advantage: it forces us to be explicit about the data-generating process.

Consider the disease-test example. The key assumption is that everybody (or, equivalently, a random sample of people) is tested. Or, to put it another way, we’re assuming that the 10% base rate applies to the population of people who get tested. If, for example, you get tested only if you think it’s likely you have the disease, then the above simplified model won’t work.

This condition is a bit hidden in the probability model, but it jumps out (at least, to me) in the “population distribution” formulation. The key phrases above: “Of the 10 with the disease . . . Of the 90 without the disease . . . ” We’re explicitly assuming that all 100 people will get tested.
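To see how much the "everyone gets tested" assumption matters, here is a rough sketch that compares it with a selective-testing scenario. The selection rates (90% of diseased people and 10% of healthy people seek a test) are invented purely for illustration and are not from the example above; the function name is likewise hypothetical.

```python
from fractions import Fraction

ACCURACY = Fraction(4, 5)  # 80% accuracy, applied symmetrically as in the example


def p_disease_given_positive(sick_tested, healthy_tested, accuracy=ACCURACY):
    """Share of positive tests that are true positives, among those tested."""
    true_pos = sick_tested * accuracy
    false_pos = healthy_tested * (1 - accuracy)
    return true_pos / (true_pos + false_pos)


# A population of 1000 with the 10% base rate: 100 sick, 900 healthy.
sick, healthy = 100, 900

# Scenario 1: everyone is tested, so the base rate among the tested is 10%.
everyone = p_disease_given_positive(sick, healthy)

# Scenario 2 (ASSUMED rates, for illustration only): people mostly get
# tested when they suspect they are sick.
sick_tested = sick * Fraction(9, 10)        # 90 of the 100 sick people
healthy_tested = healthy * Fraction(1, 10)  # 90 of the 900 healthy people
selective = p_disease_given_positive(sick_tested, healthy_tested)

print(everyone, selective)
```

Under these made-up selection rates, half of the people who get tested have the disease, so a positive result means something quite different from what it means when all 1000 are tested.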

Similarly, consider the two-boys example that got our discussion started. The crucial unstated assumption was that, every time someone had exactly two children with at least one of them a boy born on a Tuesday, he would give you this information. It’s hard to keep this straight, given the artificial nature of the problem and the strange bit of linguistics (“I have two children” = “exactly two,” but “One is a boy” = “at least one”). But if you do it with a population distribution (start with 4×49 families and go from there), then it’s clear that you’re assuming that everyone in this situation is telling you this particular information. It becomes less of a vague question of “what are we conditioning on?” and more clearly an assumption about where the data came from.
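The 4×49 population can be enumerated directly. The sketch below assumes what the population formulation makes explicit: each sex and each day of the week is equally likely, the two children are independent, and every parent with at least one Tuesday-born boy reports that fact. The helper names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

SEXES = ("boy", "girl")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Each child is a (sex, day) pair: 2 * 7 = 14 equally likely types.
children = list(product(SEXES, DAYS))

# Each family is an ordered pair of children: 14 * 14 = 4 * 49 = 196 families.
families = list(product(children, repeat=2))


def has_tuesday_boy(family):
    """True if at least one child is a boy born on a Tuesday."""
    return any(child == ("boy", "Tue") for child in family)


def two_boys(family):
    """True if both children are boys."""
    return all(sex == "boy" for sex, _day in family)


# ASSUMPTION: every family with a Tuesday boy makes the statement.
reporting = [f for f in families if has_tuesday_boy(f)]
both_boys = [f for f in reporting if two_boys(f)]

p_two_boys = Fraction(len(both_boys), len(reporting))
print(len(families), len(reporting), len(both_boys), p_two_boys)
```

Changing the `reporting` filter (for example, to parents who pick one child at random and describe it) changes the answer, which is the point: the list comprehension is where the data-generating assumption lives.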

16 Comments

  1. marcel says:

    Why does "80% accurate" mean that 80% of the positive results are correct, and 80% of the negative results are correct? Why the symmetry? I would think that overall, 80% of the results are correct, but not that the percentages of correct positive and negative results are equal. Also, is there something about the test that says that if incidence in the population is 10%, then the test will return a positive result 10% of the time? Wouldn't it be likely that people who are tested are more likely to have the disease than the population in general?

    Is the answer to all these questions something along the lines of, "Listen wiseguy, this is a simplified version of reality so that we can illustrate some reasoning, so just shut up and go away."?

  2. K? O'Rourke says:

    Key assumption better put as independence?

    From recent experience presenting similar material to a group of about 30, mostly without statistical degrees my impression is

    The _Model_ as a representation of joint outcomes (D,T) in a possible population is better highlighted

    Possibly reasonable – but fallible, questionable and especially “rejectable”

    This joint model is more encouraged to be broken into two pieces (or relevant models)

    The model before the test and the model after the test (one relevant before, the other after) with the after test model visible without a formula

    On the log scale the “model after” minus “model before” gives a model from “just” the test result

    Noting the implied addition – “model after” = “model before” + “model test” – stresses the independence that was specified in the joint model – another aspect to question and possibly reject

    Examine separately and contrast “model before” and “model test” – rejecting either or possible both!

    But given the joint model “as is” there are no errors in the calculations of “model after”!

    As for distinguishing between a population distribution versus a probability distribution – isn’t the population distribution just counterfactual, as it’s just a representation of results you would get if the assumption were exactly true, everyone was tested and the least likely event happened (percentages exactly equalled the probabilities)?

    But in any case, the table of numbers is just one of many valid representations of the probability distribution whose joint probabilities equal the table percentages?

    K?

  3. David says:

    Population has 2^N people, who are described by their attributes:

    Attributes including Sex, Day of week of birth, hair color etc… Attributes are all binary: tall or short, blue or green eyes, brunette or blonde… etc. There are N attributes. All possible combinations are represented.

    All individuals have one and only one sibling. There is no correlation between attributes [I don't actually know if it is possible to construct such a population…]

    Q1: For a given person what is the probability their sibling is male?
    A1: 1/2

    Q2: For a given man what is the probability his sibling is male?
    A2: 1/3

    The choice of "man" was arbitrary. If I asked "for a given woman what is the probability her sibling is male" I would get 2/3. The average of these two choices gets us back to Q1.

    So far things seem sensible. Then we add additional attributes:

    Q3: For a given man with blue eyes what is the probability his sibling is male?
    A3: 3/7

    Q4: For a given tall man with blue eyes what is the probability his sibling is male?
    A4: 7/15

    Q5: For a given tall man with blue eyes… who wears bow ties what is the probability his sibling is male?
    A5: This approaches 1/2. The formula seems to be [2^n-1]/[2^(n+1)-1], where n is the number of attributes (including gender) we are conditioning upon.

    But, for all intents and purposes "you" can be described by your attributes, so can we not rephrase the statement as:

    Q5*: Mr So-and-so, what is the probability your sibling is male?
    A5*: Isn't this just question 2 again? Shouldn't it be 1/3?

  4. Andrew Gelman says:

    Marcel: You're taking the discussion exactly where I like to have it go in class, toward a discussion of the assumptions of the model and how they relate to the conclusions.

  5. K? O'Rourke says:

    One scenario that worked well was a not high risk individual who has positive test result but afterwards finds out that a past partner was an IV drug user – his non high risk population model (prior) gets rejected and possibly replaced with high risk population model based on information (arguably independent of the test) received after the test result.

    Getting this model tentativeness and constant possible need for revision early in Bayes education – may do more good than harm ;-)

    K?

  6. bxg says:

    I suggest a simple rule-of-thumb with similar if not greater power: *never* blindly assume you should condition on some objective fact-of-the-world X, but always on the fact that _you_ learned X. E.g. "I decided [why?] to take the test and then did so then [was my decision enough or could anything interfere with the report that?] it showed positive". "The parent told me [why?] that one child was a tuesday boy."

    Take this as your default approach to any question. It forces you to spell out the assumptions about how/whether you simplify from "|I learned X" to "|X". And if you just can't extend your model or add assumptions enough to cleanly get from "I learned X" to "X", be very fearful that you don't understand the problem well enough to continue.

    In my experience, very few of the standard "non-obvious/paradoxical" probability questions remain at all interesting if this single rule is kept in mind. I'd love to hear your readers views of any exceptions to this claim.

  7. bxg says:

    K?,
    I'm aware you've responded to criticism of your blog style before. Add me (yes, an anonymous commenter so, yeah, WTF) to the chorus of critics. But seriously, why do you (obviously) spend time on this, with so little apparent care as to whether what you are writing is something people can follow? I would suggest, for example, that even trying for a complete sentence _most_ of the time would help. Yes, I know you've talked about the time challenges this poses.

    But honestly, it seems to me like you are writing for some particular and intimately known audience, who have a ton of idiosyncratic context, but if so why do it through the Professor's forum? I want to read everything you write here because there is the tantalizing hint of deep knowledge I'd enjoy, but I'm seriously puzzled as to whether a random reader of this blog is even part of your desired audience. Even a statement "Ignore me unless you are in group X or know all about Y" would provide some clarity and aid us (anonymous parasitic blog-readers) to greater efficiency!

  8. dggoldst.myopenid.co says:

    See a visual representation of Andrew's solution at:

    http://www.decisionsciencenews.com/2010/05/28/tue

  9. K? O'Rourke says:

    bxg I am reminded of John Dewey's response to a student who complained he was difficult to understand – young man you could make a good career trying to understand me.

    But seriously – my primary audience is "me" – I always feel less deep and intelligent when the prose on the screen stares back at me – even disregarding the poor grammar/spelling/sentence structure.

    As for that – most of my posts end with either I post what I have or erase

    Like right now as laundry is literally being thrown around me….

    K?

  10. carlitos says:

    Q2: For a given man what is the probability his sibling is male?
    A2: 1/2 (look at all the men in your population, and you'll find half of them will have a brother)

  11. Sean Matthews says:

    It was about time for another Monty Hall to come down the pike. And don't forget that even Paul Erdős got that one wrong.

    The clue is that 'boy born on day X' is not a partition function so you cannot directly marginalise over it. If you were to do the obvious experiment, and repeat it, you will find that in the (rapidly approached) limit, two thirds of the time, the other child is a girl. Bit like if you do the experiment with Monty Hall, you will find that two thirds of the time, the gift is behind the door you can switch to.

  12. sean matthews says:

    Addendum to last post (worked out on the train this morning) dunno if somebody has already posted this in the thread but I didn't see it

    P(X) = 197/197

    P(girl(X)) = 147/197

    P(boy(X)) = 147/197

    P(boy(X), girl(X)) = 98/197

    P(girl(X) | boy(X)) = (98/197)/(147/197) = 2/3

    P(boy(tu, X)) = 27/197

    P(girl(X), boy(tu, X)) = 14/197

    P(girl(X) | boy(tu, X)) = (14/197)/(27/197) = 2/3

  13. Any Mouse says:

    "I have two children. One is a boy born on a Tuesday. What is the probability I have two boys?"

    Um. The answer to this particular question is easy. It's either zero or one, we just don't know which. There's no probability about whether some individual actually has two existing sons or not; the realized outcome is already yes or no. There's only our uncertainty about the realized outcome.

    This is like asking what the probability of a specific coin that I've already flipped being heads is, or what the probability is that George Washington was killed by a stroke.

    Which just makes it more obvious that the question should be asked in population terms.

  14. carlitos says:

    I don't know where the 197 is coming from (maybe you meant 196?), but I'm pretty sure that (14/197)/(27/197) is not 2/3. In fact it's much closer to 1/2.

  15. sean Matthews says:

    Oops. Hangs head in shame. The 197 was a typo, the other bit was proof that you should not fiddle with your pencil on the train.

  16. lemmy caution says:

    I agree with David. To me I think you can ignore the "on Tuesday" information. The boy has to be born on a specific day of the week. The man knows the day; we don't.

    "I have two children. One is a boy born on June 14th 1999 at 12:02AM; he loves Star Wars and basketball, but is in the 37th percentile for height. What is the probability I have two boys?"

    Is it very close to 1/2? or is it 1/3?

    I say 1/3. Parents will tell you anything about their kids.