Fighting a losing battle

Following a recent email exchange regarding path sampling and thermodynamic integration (sadly, I’ve gotten rusty and haven’t thought seriously about these challenges for many years), a correspondent referred to the marginal distribution of the data under a model as “the evidence.”

I hate that expression! As we discuss in chapter 6 of BDA, for models with continuous parameters, this quantity can be extremely sensitive to aspects of the prior that have essentially no impact on the posterior. In the examples I’ve seen, this marginal probability is not “evidence” in any useful sense of the term.
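
To see the problem concretely, here is a minimal sketch, with a toy normal model of my own (not an example from BDA): a single observation y ~ N(theta, 1) and prior theta ~ N(0, tau^2). Widening tau from 10 to 1000 leaves the posterior for theta essentially unchanged but shrinks the marginal likelihood, the so-called evidence, by roughly a factor of 100.

```python
# Toy illustration (my own example, not from BDA): for a wide prior the
# marginal likelihood scales like 1/tau while the posterior barely moves.
from scipy import stats

y = 2.0  # a single observation, y ~ N(theta, 1)
for tau in (10.0, 1000.0):
    post_mean = y * tau**2 / (1 + tau**2)    # posterior theta | y is normal
    post_sd = (tau**2 / (1 + tau**2)) ** 0.5
    evidence = stats.norm(0, (1 + tau**2) ** 0.5).pdf(y)  # p(y) = N(y; 0, 1 + tau^2)
    print(f"tau={tau:6.0f}  posterior N({post_mean:.4f}, {post_sd:.4f})"
          f"  marginal likelihood {evidence:.2e}")
```

Any Bayes factor built from this number inherits that arbitrary factor of 100, even though the two priors lead to essentially the same inferences.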

When I told this to my correspondent, he replied,

I actually don’t find “the evidence” too bothersome. I don’t have BDA at home where I’m working from at the moment, so I’ll read up on chapter 6 later, but I assume you refer to the problem of the marginal likelihood being strongly sensitive to the prior in a way that the posterior typically isn’t, thereby diminishing the value of the marginal likelihood as a model selection statistic for problems in which there is no well-motivated way to choose the prior. If so, I understand, but I think you might be fighting a losing battle, as “the evidence” is seemingly now popular in the stats literature as well as the physics …

I replied that I’ll fight that battle forever. I really really hate the use of linguistically-loaded terms such as “bias,” “evidence,” “empirical Bayes,” etc.

22 thoughts on “Fighting a losing battle”

  1. “I really really hate the use of linguistically-loaded terms such as “bias,” “evidence,” “empirical Bayes,” etc.”

    Is this about these terms themselves, or about certain ways of using them? I do not really see why using these terms would be bad. Suppose we do a Monte Carlo experiment and want to compare two methods, (a) and (b). Say we are interested in a certain parameter whose true value we know; we may find that method (a) leads to greater bias than method (b), in terms of differences between the true and estimated parameter values. I would also say that such a result would provide evidence that method (b) is better than method (a).

    Or is this about saying THE evidence as some form of reification? For example, a sentence like “we refer to a marginal distribution of the data under a model as THE evidence” would make no sense to me, because deciding whether something provides evidence relies on interpretation and is not part of the ‘objective’ world. Even without reification, saying “THE evidence” does not make much sense, as evidence is not a clearly marked entity. One can find evidence that supports something, but you cannot find THE evidence, as the ways of producing it are infinite and always depend on the angle from which a problem is approached.

    J.

    • Joerg:

      I don’t like the term “bias” because it sounds like a bad thing to be “biased.” Actually, though, there are lots of settings where it is undesirable for E(theta.hat|theta) to equal theta.

      I don’t like the term “evidence” as used above because it sounds like it is some sort of summary of the information in the data in favor or against some hypothesis. Actually, this so-called evidence is strongly dependent on aspects of the model that are untestable and are often assigned without much if any thought.

      I don’t like the term “empirical Bayes” because it implies that regular Bayesian inference is not empirical. Bayesian inference is as empirical as any other method of statistics.
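
      A hedged sketch of the point about “bias” (my toy example, not Andrew’s): for normal data, the unbiased variance estimator (divide by n-1) has higher mean squared error than the biased divide-by-(n+1) estimator, so making E(theta.hat|theta) equal theta is not automatically the right goal.

      ```python
      # Monte Carlo check (toy example): the biased variance estimator with
      # divisor n+1 beats the unbiased n-1 version on mean squared error.
      import numpy as np

      rng = np.random.default_rng(0)
      n, sigma2, reps = 10, 1.0, 100_000
      y = rng.normal(0.0, sigma2**0.5, size=(reps, n))
      ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
      for divisor, label in [(n - 1, "unbiased (n-1)"), (n + 1, "biased (n+1)")]:
          est = ss / divisor
          print(f"{label:>14}: bias={est.mean() - sigma2:+.4f}"
                f"  MSE={((est - sigma2) ** 2).mean():.4f}")
      ```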

      • AG: “I don’t like the term “bias” because it sounds like a bad thing to be “biased.” Actually, though, there are lots of settings where it is undesirable for E(theta.hat|theta) to equal theta.”

        Isn’t this just a matter of mixing scientific and everyday meanings? In the game of science, we could simply define what we mean by “bias” and go from there. There are probably a lot of examples where the everyday connotation of a term is negative but the term is used without problems in scientific language. An example I can think of right now would be “regression”.

        AG: “I don’t like the term “evidence” as used above because it sounds like it is some sort of summary of the information in the data in favor or against some hypothesis. Actually, this so-called evidence is strongly dependent on aspects of the model that are untestable and are often assigned without much if any thought.”

        My point basically was that results which are regarded as “evidence” are largely dependent on a lot of things, and interpreting something _as_ “evidence” goes way beyond the scope of choosing a certain model. However, I still think it can be a useful term when properly used. After all, isn’t the pursuit of evidence the whole point of empirical research?

        J.

        • I think “regression” fails to convey any understanding, false or true, to many people, before or after they learn about statistical regression…

          It was (mostly) there in the Freedman et al. intro stats book, but it was very hard to get across to students, and some of those teaching that course just chose to mostly skip over it.

          And Stephen Stigler seems to suggest that almost no one really appreciated its importance: http://biomet.oxfordjournals.org/content/99/1/1.short

        • Could you expand? I for one have never understood why statisticians chose such a strange term as “regression” to refer to such a simple concept as “parameter estimation”. And I don’t see where in the Stigler article this is discussed?

        • Perhaps I should have included this excerpt:

          “as in all scientifically interesting problems, there is variation, and the regression phenomenon [regression to the mean] Galton discovered renders this approach inappropriate”

          Konrad: My point was to speculate that even among statisticians, few truly understand why such a strange term as “regression” was chosen (Freedman, I think, did).

        • Ah, so it’s not just that the statistical methodology of regression analysis was proposed to correct for the empirical phenomenon of regression toward the mean; it was actually proposed at the same time that the phenomenon was first pointed out, and by the same author. That makes the conflation of these two concepts more understandable. But I still think it’s a terrible term, especially in light of our more modern understanding.

    • Re “bias”: Knowing that method (a) has greater bias than method (b) does not provide evidence that (b) is better, because in most cases the reduced bias is obtained at the cost of increased variance (see the sketch after this comment). But if you had said (a) has greater bias _with all other considerations being equal_, then I would agree that (b) is better. Andrew: is this what you meant, or are you really thinking of cases where larger bias is desirable for its own sake?

      I agree that the concept of “bias” is highly problematic in statistics, but I don’t think this is because of the choice of term. Rather, it is because of history: the specific way that unbiased estimators (rather than efficient estimators) were emphasized when they were first introduced, which led to their becoming standard in textbooks and in the way statistics is taught (even today).

      Re “evidence”: I again don’t think the term is problematic. What is problematic is the notion (apparently common) that any notion of evidence can meaningfully be said to exist in a model-independent way. This is Fisher’s idea of “let the data speak for themselves”, and it is what needs to be fought.

      Finally: colourful and memorable terminology is good; we need to cherish it to keep science exciting.
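
      Here is the promised sketch of the bias/variance point, with a toy setup and arbitrary numbers of my own: a shrunken sample mean is more biased than the plain sample mean, yet wins on mean squared error when theta is small and loses when theta is large, so “less bias” alone settles nothing.

      ```python
      # Toy bias/variance tradeoff: neither estimator dominates; which one
      # wins on MSE depends on the unknown theta.
      import numpy as np

      rng = np.random.default_rng(1)
      n, reps = 5, 100_000
      for theta in (0.5, 3.0):
          ybar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
          for est, label in [(ybar, "sample mean"), (0.7 * ybar, "shrunk 0.7x")]:
              print(f"theta={theta}: {label:>11}  bias={est.mean() - theta:+.3f}"
                    f"  MSE={((est - theta) ** 2).mean():.3f}")
      ```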

  2. You can add “statistically significant” to the list, because many people interpret it to mean “significant” in the plain-language sense; I’ve even heard statisticians do this. In fact, I’d call this one the Mother of All Misleading Statistics Terms.

    Adopting a common word to do double duty as a word with a well-defined scientific meaning is risky, especially if you choose a word that is, as Andrew says, “linguistically loaded.” “Bias” in its common sense is essentially always a bad thing, and a “significant” result is an important one. But neither is necessarily the case in statistical analysis. These are really bad terms.

    Where I think it’s OK is when the term is emotionally neutral, or when its scientific and common meanings are so different that there is no risk of confusion. “Leverage” is OK as far as I’m concerned, for the first reason. And the physics terms charm, strangeness, and color are fine in the context of elementary particles for the second reason.

  3. I agree with Phil: it is a weak form of linguistic relativity (the Sapir–Whorf hypothesis): http://en.wikipedia.org/wiki/Linguistic_relativity

    So it most affects those with the least understanding, which, unfortunately for learning or drawing conclusions from data on important matters, most often involves such people, including presidential candidates ;-)

    Notably, Tukey was very concerned about this and even read semiotics (the theory of signs) seriously (which is what our first conversation was about). Not sure if that was pre- or post-EDA.

  4. It’s amazing how many decades it takes to correct a wrong impression left by an emotionally loaded term in Statistics.

    It’ll probably take centuries to fix the damage done by “Unbeatable Kittens and Sunshine Bayes” as described in Chapter 23 of Bayesian Data Analysis (Gelman); even though it’s a valid replacement for “Pearson’s Exact Puppy Kicking Test”.

  5. By “evidence” (although you don’t like the name), do you mean the marginal likelihood, not the marginal distribution? In the first paragraph you said it is the marginal distribution (of which variable?), but the quoted paragraph mostly talks about the marginal likelihood.

    • “Marginal distribution of the data” is just a synonym for “marginal likelihood”, because the likelihood _is_ the distribution of the data.
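
      In symbols (the standard definition, stated here for clarity): for data y, parameter theta, prior p(theta), and sampling distribution p(y | theta),

      p(y) = \int p(y \mid \theta) \, p(\theta) \, d\theta

      so the two phrases name the same object, the likelihood averaged over the prior, which is exactly why it inherits whatever arbitrariness the prior carries.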

  6. Also, are you opposed to maximizing the marginal likelihood _even_ when the family of distributions used for the prior is fixed and we are merely searching for effective hyperparameters? Do you also hate the idea of ’empirical Bayes’, or is it just that you don’t like how it is named?

    • Hyokun:

      1. We discuss this in chapter 6 of BDA (an example where Bayes factors are useful, an example where Bayes factors are a distraction). In general, I don’t like the idea of “searching for effective hyperparameters.” I prefer continuous model expansion, in which there is a hyperprior for the hyperparameters and we do full Bayes (see the sketch after this comment).

      2. “Empirical Bayes” is not clearly defined. I like hierarchical Bayes, as discussed in Chapter 5 of Bayesian Data Analysis.
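
      To make the contrast concrete, here is a hedged sketch with a made-up toy model and numbers (mine, not Andrew’s): group estimates y_j ~ N(theta_j, sigma_j^2) with known sigma_j and theta_j ~ N(0, tau^2). “Empirical Bayes” plugs in the single tau that maximizes the marginal likelihood; full Bayes keeps the whole posterior over tau (here under a flat hyperprior on a grid) and averages over it.

      ```python
      # Toy contrast between empirical Bayes (plug-in tau) and full Bayes
      # (average over tau's posterior). All data here are invented.
      import numpy as np
      from scipy import stats

      y = np.array([2.8, 0.8, -0.3, 1.2, -1.1, 0.2])    # made-up group estimates
      sigma = np.array([1.0, 0.8, 1.2, 0.9, 1.1, 1.0])  # known standard errors

      taus = np.linspace(0.01, 10, 2000)                # grid over the hyperparameter
      loglik = np.array([stats.norm(0, np.sqrt(sigma**2 + t**2)).logpdf(y).sum()
                         for t in taus])                # log marginal likelihood of tau

      tau_eb = taus[loglik.argmax()]                    # empirical Bayes: one tau
      post = np.exp(loglik - loglik.max())              # full Bayes: flat hyperprior,
      post /= post.sum()                                # posterior over the tau grid

      def shrink(tau):  # posterior mean of theta_1 given tau
          return y[0] * tau**2 / (tau**2 + sigma[0]**2)

      print(f"EB tau-hat = {tau_eb:.2f}; full-Bayes E(tau|y) = {(post * taus).sum():.2f}")
      print(f"theta_1: EB {shrink(tau_eb):.2f}, "
            f"full Bayes {(post * shrink(taus)).sum():.2f} (raw {y[0]:.2f})")
      ```

      The practical difference is that full Bayes propagates the uncertainty about tau into the estimates of the theta_j, which the plug-in hyperparameter search throws away.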

    • And “marginal likelihood” itself has numerous meanings: marginal over all or just some of the other parameters; marginal over some aspects of the data (i.e., the likelihood for f(x) rather than x); or marginal over some aspects of the data such that the likelihood does not involve the nuisance parameter while maintaining all the information (in some sense) about the interest parameter (e.g., http://en.wikipedia.org/wiki/Restricted_maximum_likelihood).

      (This was so annoying to one of my two thesis examiners that they repeatedly, and at increasing volume, almost to the point of yelling, asked me exactly what I meant by marginal likelihood. For some reason they did not like my picking marginal over some of the other parameters (which was how I explicitly defined it and what I was addressing) rather than one of the other meanings. Or they were just trying to shake me up; it’s in private with just the two examiners.)
