No on Yes/No decisions

Just to elaborate on our post from last month (“I’m negative on the expression ‘false positives’”), here’s a recent exchange we had regarding the relevance of yes/no decisions in summarizing statistical inferences about scientific questions.

Shravan wrote:

Isn’t it true that I am already done if P(theta>0) is much larger than P(theta<0)? I don't need to compute any loss function if the former is 0.99 and the latter 0.01. In most studies of the type that people like me do [Shravan is a linguist], we set up experiments where we have a decisive test like this for theory A and against theory B.

To which I replied:

In some way the problem is with the focus on “theta.” Effects (and, more generally, comparisons) vary: they can be positive for some people in some settings and negative for other people in other settings. If you’re talking about a single “theta,” you have to define what population and what scenario you are thinking about. And it’s probably not the population of Mechanical Turk participants and the scenario of an online survey. If an effect is very small and positive in one population in one scenario, there’s no real reason to be confident that it will be positive in a different population in a different scenario.
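
To be concrete about the kind of summary Shravan is describing: once you have posterior simulations of theta, P(theta>0) and P(theta<0) are just the proportions of draws on either side of zero. Here’s a minimal sketch in R, with made-up draws standing in for output from a real fitted model:

    # Minimal sketch: simulated draws standing in for posterior draws of
    # theta from a fitted model (the numbers here are made up).
    set.seed(123)
    theta_draws <- rnorm(4000, mean = 0.5, sd = 0.25)

    # The two summaries in question:
    p_pos <- mean(theta_draws > 0)  # P(theta > 0)
    p_neg <- mean(theta_draws < 0)  # P(theta < 0)
    round(c(p_pos = p_pos, p_neg = p_neg), 3)

Even then, as I wrote above, that number is tied to whatever population and scenario the fitted model actually describes.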

9 thoughts on “No on Yes/No decisions”

  1. My PLoS One paper is a good example of what Andrew is talking about. I looked at about a dozen studies on Chinese relative clauses. The theoretical claim was that $\theta<0$, but across these studies the posterior probability P($\theta>0$) was about 78-80%. One point I wanted to make is that even if the results are not stable (presumably driven by different populations and different settings, as Andrew suggests), we can still conclude something from these unstable and noisy results. If someone told me that there is an 80% chance of rain, I would step out with an umbrella.

    Another interesting issue with the Chinese relative clause data is that the instability of the results is partly due to the way people do their data analysis. The typical way is to read in the data frame, type library(lme4), and then run the lmer command or the aov command. That’s it (I exaggerate only a little). Sometimes “trimming” is done, but not always, and how it’s done varies from paper to paper (even for the same author). A minimal sketch of this recipe appears at the end of this comment.

    When I reanalyzed three of the Chinese studies that I could get the raw data for, the published results were a consequence of blind and mechanical “trimming” of the most extreme 2.5% of the values, or a consequence of heteroscedasticity. A fourth paper (I didn’t try to get its raw data) actually claims in its title that $\theta<0$ (I paraphrase, of course), but in their data the key effect is in the opposite direction and statistically significant (i.e., even by conventional standards, it goes against the title of their paper). So, when we look at the alleged instability of results, one of the problems is the way people misuse statistical tools (or, as in one case, just make up the title to fit the story they think should be true).

    My paper is here in case someone wants to critique it:
    http://www.plosone.org/article/authors/info%3Adoi%2F10.1371%2Fjournal.pone.0077006

    As I've mentioned before, on the plus side, nobody is actually going to die if Chinese subject relatives turn out to be harder to process than object relatives (which is what $\theta<0$ means in this context). So it's a nice safe environment to test out the issues Andrew's talking about. I'm still going after this $\theta>0$ or $\theta<0$ issue in Chinese, partly because I just want to know: how bad is the situation? Is it just that we are all running confounded studies, or is there no point in even talking about $\theta>0$ versus $\theta<0$ at all, as Andrew suggests?
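
    To make the “mechanical” recipe above concrete, here is roughly what such an analysis looks like in R. This is only a sketch: the file name, the column names (rt, cond, subj, item), and the 2.5%-per-tail trimming are invented for illustration, not taken from any of the papers discussed.

        # Sketch of the "read it in, load lme4, maybe trim, fit" recipe.
        # File and variable names are hypothetical.
        library(lme4)

        dat <- read.table("chineseRC.txt", header = TRUE)

        # Ad hoc "trimming": drop the most extreme reading times
        # (here 2.5% in each tail; papers differ on how, or whether, this is done).
        lo <- quantile(dat$rt, 0.025)
        hi <- quantile(dat$rt, 0.975)
        dat_trim <- subset(dat, rt > lo & rt < hi)

        # Default mixed-effects fit on the raw (often heteroscedastic)
        # reading-time scale, with no further model checking.
        m <- lmer(rt ~ cond + (1 | subj) + (1 | item), data = dat_trim)
        summary(m)

    As I found in the reanalysis, seemingly minor choices at the trimming step, or ignoring the heteroscedasticity, can be what drives the published conclusion.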

  2. I would have thought the population and scenario to which an analysis applies are pretty clearly understood in most applications? (Is this not true in social science?) In HIV research, for instance, the population might be “all biologically feasible HIV-1 viruses of subtype C” or “all biologically feasible HIV-1 viruses regardless of subtype” (or “viruses in blood” vs “viruses in any tissue”, etc etc). As a rule, researchers (despite sometimes being clueless about many other statistical issues) are well aware that, if the data set contains only viruses of subtype C, conclusions may not be valid for the more general population – attention is always focussed on how broadly representative the data set is. And I think this is true throughout pretty much all of biology, at least.

    • Konrad:

      Things are different in psychology research. There it is standard to do experiments on convenience samples of volunteers who aren’t representative of any clearly defined population, and with the resulting inferences taken to hold in great generality. This sort of generalization has become controversial in recent years, but it still seems to be taught to students, to the extent that a well-known professor of psychology can claim that concerns about such generalizations are “absurd” (see story here). Also, I believe it is standard for generalization to be a concern when extrapolating from medical trials.

      • That correspondent seemed to be saying that concerns about generalization are absurd because the researchers in question (and the general public) know better than to try to generalize in the first place (i.e. everyone understands that in psychology the claims are extremely weak and should always be taken with a grain of salt). It makes one wonder if there are ever strong results in psychology, and how such results would be communicated.

        As for medical trials, my point was that generalization is an _explicit_ concern. Researchers tend to be aware of what the study population was and conscious of questions about how broadly the result is valid – even if they don’t care about other aspects of statistics.

    • Konrad, medical researchers using rodents really have no basis for generalizing from one study to the next. They take the equivalent of small convenience samples (whatever the animal breeder gives them). It is not clear to me what the population would even be in this case (rats of strain x of this age from this company).

      And no, you cannot assume the animals are mostly the same, even from the same strain. They can differ drastically over time, or from different companies, or even from different rooms at the same facility. Some quick examples that can have huge effects on study outcome are the arrangement of the cerebral arteries [1] and drug metabolism [2]. There are plenty more.

      [1] http://www.ncbi.nlm.nih.gov/pubmed/9183296
      [2] http://www.ncbi.nlm.nih.gov/pubmed/7835229

      • The target population for medical research is humans – so when working with animal models, researchers are well aware that they are extrapolating severely, and that conclusions based on data from animal models are much weaker than conclusions based on data from humans. Domain specialists seem to have strong and divergent opinions on just _how_ strong or weak the conclusions should be, and papers such as the ones you cite are relevant to that debate – but everyone agrees that animal studies are just a preliminary step, and that final conclusions should be based on human studies.

        My point is that these are not issues that medical researchers are unaware of.

        • Konrad, yes. I agree that everyone is aware of the problem of generalizing from rodent to human. But that does not address my point: there is reason to expect that a preclinical result will not replicate even from one rodent study to the next.

  3. Great answer, Andrew! It really demonstrates the contrast between your Bayesian statistics and the way the fruitless philosophical frequentist-vs.-Bayesian debates try to define things.

  4. Konrad: What kind of psychology are you talking about? Certainly not experimental psychology: psychophysics, animal learning, physiological psychology, etc!
