Sanjay Srivastava reports:

In a typical study, half of the targets are gay/lesbian and half are straight, so a purely random guesser (i.e., someone with no gaydar) would be around 50%. The reported accuracy rates in the articles . . . say that people guess correctly about 65% of the time. . . . Let’s assume that the 65% accuracy rate is symmetric — that guessers are just as good at correctly identifying gays/lesbians as they are in identifying straight people. Let’s also assume that 5% of people are actually gay/lesbian. From those numbers, a quick calculation tells us that for a randomly-selected member of the population, if your gaydar says “GAY” there is a 9% chance that you are right. Eerily accurate? Not so much. If you rely too much on your gaydar, you are going to make a lot of dumb mistakes. . . .
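Srivastava's "quick calculation" is just Bayes' rule. A minimal sketch in Python, using only the two numbers stated in the quote (65% symmetric accuracy, 5% base rate):

```python
# Inputs taken from the quoted passage.
base_rate = 0.05   # assumed proportion of gay/lesbian people
accuracy = 0.65    # P(judge says "GAY" | gay) = P(judge says "STRAIGHT" | straight)

# Overall rate of "GAY" judgments: true positives plus false positives.
p_say_gay = accuracy * base_rate + (1 - accuracy) * (1 - base_rate)

# Bayes' rule: P(actually gay | judge says "GAY").
p_correct = accuracy * base_rate / p_say_gay
print(round(p_correct, 3))  # prints 0.089, i.e. about 9%
```

The posterior is dominated by the false positives coming from the much larger straight majority, which is the whole point of the base-rate argument.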

It’s the classic problem of combining direct evidence with base rates.

I’ve always used things like breast cancer and random unknown diseases when I’m teaching this sort of thing, but from now on I’m using this.

This report looks like a great example of drawing strong conclusions based on an analysis with a highly questionable assumption (the symmetry assumption).

This frequentist calculation is relevant for situations where, for instance, you are concerned with betting rates (how often you might erroneously conjecture gay/straight in some long run). The flaw in many appeals to base rates is the presumption that we should calculate the prior probability of a hypothesis H, say that prion disease is transmitted through protein folding (the “prion only” hypothesis of Prusiner), by imagining an urn of hypotheses from which we randomly sample!

So, for example, suppose there is an urn of hypotheses about various diseases, within which 95% have been wrong. We randomly select and pick out one particular hypothesis H*; say it is a currently entertained hypothesis about the cause of Alzheimer’s disease. Would anyone really want to say that the prior degree of belief accorded to this particular H* is .05? I don’t care about the numbers, pick your favorites. It’s an equivocation.

There’s a perfectly well-defined experiment of the following sort: randomly select from the urn of hypotheses and pick one out. If 95% of the hypotheses in the urn are false, then the probability of the generic event (any hypothesis I draw at random is false) is .95. This is an ordinary frequency calculation.

The danger is then to combine this with an evaluation of evidence regarding this particular Alzheimer’s hypothesis H*. Imagine there’s highly impressive evidence x (maybe huge improvements of memory in patients, maybe a vaccine is found to reverse severe dementia, etc.). Say it is deemed vastly unlikely that x could show such benefits if H* were not correct. Is the Bayesian saying we should multiply this “likelihood” by .05? If so, you could always find a different urn with a different rate of successful hypotheses. Which should we use in evaluating the evidence for the particular theory of Alzheimer’s disease, the H* we actually selected?

The Bayesian who advocates this (I hope there aren’t many) is switching the meaning of H* through the example. THIS hypothesis H* gives a correct or an incorrect understanding of whatever aspects of the disease it describes. Even if one wants to consider the truth of H* a kind of random variable, its probability would not be given by the rate of true hypotheses in your urn. Let me be clear: it is crucial to be able to evaluate which aspects of the disease H* has seemed to get right so far, and where the unknowns remain (it’s not merely a true/false claim). There are quantitative assessments, but they have nothing to do with the rate of true hypotheses in your favorite urn. The “event” of selecting a true hypothesis in a series of selections from an urn of hypotheses (and what should we put in the urn, anyway?) has nothing to do with the inquiry into how to appraise and develop H*. Something to keep firmly in mind as you evaluate “base-rate” arguments. Sorry to be rushing; hope this isn’t too garbled. (I’ve written about this elsewhere.)


Mayo:

I agree with the general point: Just as some non-Bayesians are all too ready to reject a Bayesian argument because of its perceived impurity (while accepting arbitrarily chosen likelihood functions without question), from the other direction some Bayesians are all too ready to accept any base rate as a prior distribution without considering the steps involved in selecting the data to be analyzed.

Andrew: You know of course that I’d deny that frequentists accept arbitrarily chosen likelihoods without question. I think an important job for which many frequentist tools are purposely designed is testing assumptions of statistical models. You wouldn’t want to have to trot out all of the ways a statistical model can fail in order to ask about specific mistakes, e.g., does independence hold for this data? Non-Bayesian checks are apt.

In general, I want to suggest that, before wondering “hey, why do you rely on one kind of claim but raise hackles about another?”, one should consider what type of claim it is, and whether one can articulate and check whether it is flawed (for the case at hand).

Mayo:

I did write “some non-Bayesians.” Over the years I’ve seen a lot of statistics professors express skepticism about priors while accepting arbitrary likelihoods. Just consider standard classical statistics textbooks, which will spend lots and lots of space on maximum likelihood estimation etc. with no discussion of prior information (which could, at the very least, impose soft constraints on the inferences) and a complete acceptance of whatever likelihood function happens to be assumed.

Andrew:

I don’t have any similar experience with “classical textbooks” that show “a complete acceptance of whatever likelihood function” is assumed. I wonder if you mean that texts on theoretical statistics are not focused on how one arrives at and checks statistical models of experiments? The idea that background information isn’t relied on in order to implement the methods is mistaken. Formal tests of assumptions are well known. Anyway, which “classical” texts do you have in mind? (I wish we could overthrow the term “classical statistics,” which seems largely used in a derogatory fashion, and is redolent of classical definitions of probability and much else that is associated with old-fashioned and outdated ideas. It is a rhetorical move that irks me, and perhaps other frequentists “in exile.”)

Mayo:

From volume 2 of Feller’s classic text on probability:

And this:

This is the kind of thing I’m talking about. Probability models and independence assumptions for data (Poisson, normal, etc.) are accepted as a matter of course. But if someone suggests applying a probability distribution to parameters, then, hey, the scruples come out! All sorts of worries about philosophy, randomization, etc.

I’m a huge fan of Feller and his books, but the quotes above represent to me an unthinking anti-Bayesianism, as disturbing in its own way as the reverse tendency of some Bayesians to describe their own procedures as rational, coherent, optimal, etc. On both sides you have people accepting their own assumptions without question and then applying a much higher standard to evaluating the assumptions of others.

P.S. I just think it’s hilarious that each time I contribute to this discussion I’m scrolling past that picture of Tom Selleck!

Mayo, the distinction you’re making reminds me of the debate in psychology over clinical versus actuarial prediction (esp. the work of Paul Meehl). Say you are trying to predict whether a patient is likely to relapse if you release him from a hospital. The actuarial side said, let’s take a model (like a regression equation) that we have built on previous patients’ data, enter the current patient’s input variables into the model (MMPI scores, presence/absence of certain key symptoms), and use the resulting prediction as our best judgment. The clinical side said, each patient is a unique human being with a unique psychological makeup and circumstances, we cannot use a model built on other patients – we need to judge each one individually. The former is similar to a long-run betting scenario, the latter to the “you’d always find a different urn” scenario. This was a heated debate, but ultimately the field came down on the side of the actuarial approach (in practice, it made better predictions than letting clinicians evaluate each case’s merits individually).
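The actuarial procedure described above can be sketched in a few lines. The model form, coefficient values, and predictor names below are entirely hypothetical, stand-ins for whatever equation was actually fit to previous patients’ data:

```python
import math

# Hypothetical pre-fit logistic regression; all coefficients and
# predictor names here are invented purely for illustration.
INTERCEPT = -2.0
COEFS = {"mmpi_scale_4": 0.03, "prior_episodes": 0.6, "key_symptom": 1.1}

def relapse_probability(patient):
    """Enter the current patient's inputs into the fixed equation; return the prediction."""
    z = INTERCEPT + sum(coef * patient[name] for name, coef in COEFS.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

# The actuarial judgment for a new patient is just the model's output.
patient = {"mmpi_scale_4": 70, "prior_episodes": 2, "key_symptom": 1}
p = relapse_probability(patient)
```

Meehl’s point was that this mechanical plug-in, ignoring the clinician’s case-by-case impressions, tended to predict better in practice than individual judgment.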

So my question is, how do you know when you’re facing a long-run betting scenario where you should draw a prediction from some previous class of similar events, and how do you know when you’re facing a “you’d always find a different urn” kind of scenario? Or in short, how do we know when to act as though there’s an urn and when not to? In my gaydar post I equivocate on this – I present the long-run calculation in the paragraph that Andrew quoted, but in the next paragraph I hedge it. (In the post I gave reasons to suspect that lab predictions are artificially better than real-world judgments; but there are also reasons to argue that predictions in the lab are artificially low – e.g., that lab stimuli are informationally impoverished.)

I have had some long discussions with Meehl about this in Minny. (As you know he was an ardent Popperian, and non-Bayesian.) I think that what’s going on with Meehl’s formal expert or actuarial approach is really rather different. If you wanted me to try an analogy, it would be more like imagining an urn of experimental cases, with hypotheses making similar claims@, and evidence sets similar to the evidence we have for this one H*, in relevant respects, and then trying to ask how often the evidence correctly indicated the truth of H*, or, more likely, some aspect of what H* asserts. My point was just to highlight a certain equivocation that is commonly made.

On your other query, I think it is rather clear cut when you are trying to evaluate the merits of this hypothesis, and when you are playing a guessing or betting game. But I am not ruling out that, even in the former, scientific, context, if you are imaginative enough, you could usefully identify relative-frequency information. That, in effect, would get you to assess error probabilities (of an overall method).

@The way to classify by similarity, I would venture to say, would be groupings in terms of types of erroneous construals of evidence, but this will take me too far afield, and more care is needed to address this.

“if your gaydar says “GAY” there is a 9% chance that you are right.”

A worldly observer can do vastly better than being right 9% of the time. Vastly.

Professor Gelman, are you only right 9% of the time about this?

Steve:

A lot depends on how much information is available. I hope I’m right more than 9% of the time, but I will generally only bother to guess that someone is gay after having more information than the “thin slices” in that study.

Right. So, a study of, say, college students just listening to taped voices is a very artificially restricted dataset compared to, say, an urbane middle-aged person having a conversation with somebody.

Here’s a reality check: Think about famous actors. Do you turn out to be wrong about them 91% of the time? I sure don’t. And yet, they are extremely talented at acting, so you would think you’d be wrong even more often than 91% of the time.
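Steve’s point can be put in the same Bayes-rule terms as the calculation quoted at the top: holding the 5% base rate fixed, the posterior climbs steeply as the judge’s (symmetric) accuracy improves, so a well-informed observer is in a very different position from a lab subject hearing a taped voice. A quick sketch:

```python
# Posterior P(gay | judge says "GAY") at a 5% base rate, for several
# symmetric accuracy levels (0.65 is the figure from the quoted study).
base_rate = 0.05
for accuracy in (0.65, 0.80, 0.95, 0.99):
    posterior = (accuracy * base_rate
                 / (accuracy * base_rate + (1 - accuracy) * (1 - base_rate)))
    print(f"accuracy {accuracy:.2f} -> posterior {posterior:.2f}")
# prints roughly 0.09, 0.17, 0.50, 0.84
```

At this base rate, a judge needs 95% accuracy just to make their “GAY” call a coin flip, which is why the 9% figure is so unintuitive.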

A gay guy is pictured underneath the article and sorry boys mags aint gay

Huh? I never heard that Tversky was gay.