
Incorporating Bayes factor into my understanding of scientific information and the replication crisis

I was having this discussion with Dan Kahan, who was arguing that my ideas about type M and type S error, while mathematically correct, represent a bit of a dead end in that, if you want to evaluate statistically-based scientific claims, you’re better off simply using likelihood ratios or Bayes factors. Kahan would like to use the likelihood ratio to summarize the information from a study and then go from there. The problem with type M and type S errors is that, to determine these, you need some prior values for the unknown parameters in the problem.
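To make type S (sign) and type M (magnitude) errors concrete, here is a minimal sketch of how they could be computed for a normally distributed estimate. It assumes a true difference of 0.001 and a standard error of 0.03, numbers borrowed from the beauty-and-sex-ratio example discussed below; the function name `type_s_m` is hypothetical, and the assumed true effect is exactly the kind of prior value the calculation depends on.

```python
from statistics import NormalDist


def type_s_m(true_effect, se, alpha=0.05):
    """Power, type S rate, and exaggeration ratio for an estimate
    distributed as normal(true_effect, se^2). Assumes true_effect > 0."""
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)        # two-sided critical value
    shift = true_effect / se              # true effect in standard-error units
    p_hi = 1 - std.cdf(z - shift)         # significant with the correct (positive) sign
    p_lo = std.cdf(-z - shift)            # significant with the wrong (negative) sign
    power = p_hi + p_lo
    type_s = p_lo / power                 # Pr(wrong sign | significant)
    # E|estimate| restricted to the significant region, in standard-error units
    mean_abs = shift * (p_hi - p_lo) + std.pdf(z - shift) + std.pdf(-z - shift)
    exaggeration = se * mean_abs / power / true_effect
    return power, type_s, exaggeration


power, type_s, exaggeration = type_s_m(true_effect=0.001, se=0.03)
print(f"power={power:.4f}  type S={type_s:.2f}  exaggeration ratio={exaggeration:.0f}")
```

Under these assumed inputs the power should come out barely above 0.05, the type S rate not far below one half, and the exaggeration ratio in the dozens. Changing `true_effect` changes all three, which is the dependence on prior values at issue here.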

I have a lot of problems with how Bayes factors are presented in textbooks and articles by various leading Bayesians, but I have nothing against Bayes factors in theory.

So I thought it might help for me to explain, using an example, how I’d use Bayes factors in a scenario where one could also use type M and type S errors.

The example is the beauty-and-sex-ratio study described here, and the point is that the data are really weak (not a power=.06 study but a power=.0501 study or something like that). The likelihood for the parameter is something like normal(.08, .03^2); that is, there’s a point estimate of 0.08 (an 8 percentage point difference in Pr(girl birth), comparing children of beautiful parents to others) with a standard error of 0.03 (that is, 3 percentage points). From the literature and some math reasoning (not shown here) having to do with measurement error in the predictor, reasonable effect sizes are anywhere between 0 and, say, +/- 0.001 (one-tenth of a percentage point); see the above-linked paper.

The relevant Bayes factor here is not theta=0 vs. theta!=0. Rather, it’s theta=-0.001 (say) vs. theta=0 vs. theta=+0.001. The result will be Bayes factors very close to 1 (i.e., essentially zero evidence); also relevant is the frequentist calculation of how variable the Bayes factors might be under the null hypothesis that theta=0.
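The three-way comparison can be sketched numerically as follows. This assumes the likelihood is exactly normal(.08, .03^2) and treats each hypothesis as a single point value of theta, so each Bayes factor reduces to a likelihood ratio; the helper name `normal_loglik` is made up for illustration.

```python
import math


def normal_loglik(theta, estimate, se):
    """Log-likelihood of theta when the estimate is normal(theta, se^2)."""
    return -0.5 * ((estimate - theta) / se) ** 2 - math.log(se * math.sqrt(2 * math.pi))


estimate, se = 0.08, 0.03            # point estimate and standard error from the post
hypotheses = [-0.001, 0.0, 0.001]    # the three point hypotheses for theta

# Bayes factor of each point hypothesis relative to theta = 0
bf_vs_zero = {
    theta: math.exp(normal_loglik(theta, estimate, se) - normal_loglik(0.0, estimate, se))
    for theta in hypotheses
}
# Bayes factor of theta = +0.001 relative to theta = -0.001
bf_pos_vs_neg = bf_vs_zero[0.001] / bf_vs_zero[-0.001]

for theta, bf in bf_vs_zero.items():
    print(f"theta={theta:+.3f}: BF vs. theta=0 is {bf:.3f}")
print(f"BF of +0.001 vs. -0.001 is {bf_pos_vs_neg:.3f}")
```

With these inputs the factors relative to theta=0 should land within roughly 10% of 1 in either direction, which is the sense in which the data carry essentially no evidence about the sign or size of theta.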

I’d better clarify that last point: The null hypothesis is not scientifically interesting, nor do I learn anything useful about sex ratios from learning that the p-value of the data relative to the null hypothesis is 0.20, or 0.02, or 0.002, or whatever. However, the null hypothesis can be useful as a device for approximating the sampling distribution of a statistical procedure.
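As a rough illustration of using the null as a device, the sketch below approximates the sampling distribution of the Bayes factor for theta=+0.001 vs. theta=0 when the estimate is drawn from normal(0, .03^2). The closed form follows from the normal likelihood; the simulation, its seed, and the number of draws are arbitrary choices made here, not part of the original analysis.

```python
import math
import random
from statistics import NormalDist

se = 0.03        # standard error of the estimate, from the post
theta1 = 0.001   # alternative point hypothesis


def bayes_factor(y, theta=theta1, se=se):
    """Likelihood ratio of theta vs. 0 for an estimate y ~ normal(theta, se^2)."""
    return math.exp((theta * y - 0.5 * theta ** 2) / se ** 2)


# Under the null, y ~ normal(0, se^2), so log BF is normal with this mean and sd.
log_bf_mean = -0.5 * (theta1 / se) ** 2
log_bf_sd = theta1 / se
z = NormalDist().inv_cdf(0.975)
bf_lo = math.exp(log_bf_mean - z * log_bf_sd)   # 2.5% quantile of BF under the null
bf_hi = math.exp(log_bf_mean + z * log_bf_sd)   # 97.5% quantile of BF under the null

# Simulation check of the same interval
rng = random.Random(0)
draws = sorted(bayes_factor(rng.gauss(0.0, se)) for _ in range(100_000))
sim_lo, sim_hi = draws[2_500], draws[97_500]

print(f"analytic 95% range of BF under the null: {bf_lo:.3f} to {bf_hi:.3f}")
print(f"simulated 95% range: {sim_lo:.3f} to {sim_hi:.3f}")
```

The point of the calculation is that, under the null, the Bayes factor stays within a few percent of 1 for about 95% of datasets, so even a Bayes factor at the edge of that range is weak evidence.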

P.S. See here for more from Kahan.

21 Comments

  1. Dieter Menne says:

    Interesting point, but the text reads like it was truncated at the point where “what to write” should have started. How would you formulate this in a publication?

  2. But what should readers do when there is no credible independent evidence that can produce a reasonable prior for effect sizes?

  3. Brad Stiritz says:

    Hi Andrew, thanks for this example. Along with your recent post on the “80% Power Lie” (http://andrewgelman.com/2017/12/04/80-power-lie/), these types of calculation-based discussions are extremely helpful.

    I’m wondering if any other readers might be interested in working with me on a public GitHub repository, dedicated to Andrew’s technical posts? I have already done a lot of work with a tutor on the “80% Power Lie” post. We worked up numerous graphics, and additional code, to make Andrew’s points more understandable at a basic undergrad level (i.e. where I’m at). When I’ve completely worked through the “80% Power Lie”, I will post a GH link in the comments to that post.

    > From the literature and some math reasoning (not shown here) having to do with measurement error in the predictor, reasonable effect sizes are..

    Andrew, would you please consider elaborating your math reasoning..? Or, can anyone guess and explicitly spell out, please?

    • A GitHub repository is a great idea. I find myself writing little bits of code to illustrate things like type M/S errors, hypothesis testing in low-power studies, etc. all the time, so having a central database to pull from would be convenient. A lot of it could easily be assembled into a sort of “tutorial R package” to let students/researchers get a sense of how the techniques they’re using actually behave in noisy settings.

  4. Jacob says:

    I haven’t deeply reasoned about this and may not have the (mathematical) training to do so, but I feel like the point null has a particular benefit that relates to the idea of Type S errors. I think it is not so uncommon that it will be believed that something has an effect, but opinions will differ on the direction of the effect. One example of interest to me is the effect of political disagreement on engagement in politics. There were some studies (see Diana Mutz’s book) suggesting that there was an ironic effect of disagreement, which is supposed to be necessary for making good decisions in a democracy, in that those who encountered disagreement in discussions of politics were less likely to engage in politics.

    There have been several follow-ups to support this as well as follow-ups that suggest both zero and opposite effects. I’ve done an (unpublished) meta-analysis and found many p < .05 studies, but they are split about 50/50 positive/negative. Some of this is statistical (the inclusion/exclusion of certain control variables seems to be influential) and there are problems with the predominantly cross-sectional data used to think about this problem. But if I wanted to make a strong statement about whether the effect is positive or negative, I think the point null comes in handy — with due consideration of the Type S error rate given the design and presumed effect size.

    • Andrew says:

      Jacob:

      I disagree, for the following reason.

      Consider your statements: “I think it is not so uncommon that it will be believed that something has an effect, but opinions will differ on the direction of the effect. . . . if I wanted to make a strong statement about whether the effect is positive or negative . . .”

      I don’t think “the effect” will be positive or negative. I think it will be positive in some settings and negative in others. As I put it in yesterday’s post, “having an effect that varies by context and is sometimes positive and sometimes negative.”

  5. Carlos Ungil says:

    > (not a power=.06 study but a power=.0500001 study or something like that)

    What definition of power is consistent with power=.0500001 or something like that?

    Let’s say that I have a dataset of N=284 births from very attractive parents and I want to test if the percentage of female births is different from 50% (to keep it simple).

    My two-tailed test will reject the null hypothesis if the number of girls is 125 (or lower) or 159 (or higher).
    If the null hypothesis P(girl)=50% is true, it will be rejected with probability 0.0500 (as it should).

    I calculate the power for a few alternative hypotheses, based on the remark “Given that we only expect to see effects in the range of ±1 percent”:
    If the alternative hypothesis P(girl)=51% is true, the null will be rejected with probability (i.e. the power is) 0.0631.
    If the alternative hypothesis P(girl)=50.3% is true, the null will be rejected with probability (i.e. the power is) 0.0512.
    If the alternative hypothesis P(girl)=50.1% is true, the null will be rejected with probability (i.e. the power is) 0.0501.
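Rejection probabilities of this kind can be checked with an exact binomial calculation. The sketch below assumes the setup stated in the comment (N=284 births, reject when the number of girls is at most 125 or at least 159); the function name `rejection_probability` is hypothetical, and the results may differ from the quoted figures in the last digit depending on how the test is defined.

```python
from math import comb


def rejection_probability(p, n=284, lower=125, upper=159):
    """Pr(X <= lower or X >= upper) for X ~ Binomial(n, p)."""
    def pmf(k):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    lower_tail = sum(pmf(k) for k in range(0, lower + 1))
    upper_tail = sum(pmf(k) for k in range(upper, n + 1))
    return lower_tail + upper_tail


size = rejection_probability(0.50)   # rejection probability when the null is true
print(f"P(girl)=0.5000: rejection probability {size:.4f}")
for p in (0.51, 0.503, 0.501, 0.5001):
    print(f"P(girl)={p:.4f}: power {rejection_probability(p):.4f}")
```

The last value, P(girl)=50.01%, corresponds to the much smaller difference raised later in this thread; under it the power is nearly indistinguishable from the rejection probability under the null.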

    • Andrew says:

      Carlos:

      In the beauty and sex ratio example, I’d expect the true difference in the population to be of order of magnitude 0.01 percentage points, which I’d write as 0.0001 except that it’s hard to keep track with all these zeroes.

      • Carlos Ungil says:

        Ok, so you think that the proper alternative hypothesis to calculate the power of the study is P(girl)=50.01% vs P(girl)=50.00%.

        This seems a bit extreme, but now you’re indeed just one zero away from your power=.0500001 statement.

        But then, why do you bother discussing that “based on the scientific literature it is just possible that beautiful parents are 1 percent more likely than others to have a girl baby” in that paper?

        Just say that it is impossible that there is any effect, that the power has to be calculated against the alternative hypothesis which is equal to the null hypothesis and therefore power=0.05 by definition and that there is no need to do any study because you know the answer anyway.

        • Andrew says:

          Carlos:

          1. The probability of a girl birth is something like 0.485 or 0.488.

          2. I haven’t always been so precise on this myself, but I try to use the term “comparison” rather than “effect” here because what’s being studied is a comparison between two groups, not a causal effect.

          3. I think the difference in sex ratios between the two groups is likely to be very small, in part because there’s no clear reason to expect any systematic difference, and in part because the measurement of attractiveness in this particular study is itself so noisy, so we’re not even really comparing two distinct groups.

          4. I don’t “know the answer anyway.” As I wrote, I expect the true difference in the population to be of order of magnitude 0.01 percentage points. In evaluating the Kanazawa paper, it was enough to point out that the analysis would be hopeless, even if the true population difference were as high as a (scientifically implausible) 1 percentage point. If someone had asked me ahead of time whether this study was worth doing, I’d’ve said no, even if I’d thought the underlying population difference were 1 percentage point. I actually expect the underlying difference to be much less, but it was not really necessary to develop that reasoning to make that point, so I didn’t bother.

          5. If I really wrote, “based on the scientific literature it is just possible that beautiful parents are 1 percent more likely than others to have a girl baby,” then I guess I was being generous with the phrase “just possible.” I should’ve written that sentence more clearly.

          6. It appears that my “power = .0500001” statement was an exaggeration! I’ll fix it in the above post.

          • Carlos Ungil says:

            I use 0.500 for convenience; the results wouldn’t change much with another baseline. Comparing the means of the “very attractive” group with the “not very attractive” group (which is ten times as large) wouldn’t change my analysis much either. I was just trying to get an idea of how close the alternative hypothesis had to be to the null hypothesis to claim that the power is that close to 0.05, using a very simple model. I would be curious to see if another power analysis yields a very different answer.

            I take back the “you know the answer” bit, but I really don’t understand what that power calculation is supposed to mean. All I can see is a circular argument: “The study is useless because the power is ridiculously low, but the power is ridiculously low because I calculate it for an alternative hypothesis which is very close to the null because I think the study is useless.”

            • As I understand it, there are very large studies on births, and so the prior information is very comprehensive about birth rates and their variability in various circumstances.

              If you know ahead of time, due to studies involving literally millions of births, that any given situation is unlikely to move the needle by more than the 3rd or 4th decimal place, then when someone comes along proposing to study 300 attractive people or whatever you can say ahead of time “this is most likely worthless.”

              If you made that claim based on just a gut feeling or whatever, sure, you could find fault with it, but when there are 7 billion people living and birth records in various countries are comprehensive and hence you might be able to get access to summaries of a half a billion birth records or something… it’s worth it to consider that information pretty seriously.

  6. Andrew says:

    Carlos:

    I’m not saying “the study is useless because the power is ridiculously low.” I’m saying the study is useless because the measurements are too noisy given any plausible underlying difference between the groups. This problem can be expressed as “low power,” and I talk about power because that’s a scale that many people are familiar with, but the fundamental problem here is not “low power,” it’s that the measurements are very noisy relative to the size of any plausible underlying differences.

    This is not circular reasoning. There is no circle here. From the scientific literature and our understanding of statistics we can get a sense of a plausible range of underlying differences. Then, from statistical analysis, we can see that this particular study will be hopeless. This is direct reasoning, no circles involved.

    • Carlos Ungil says:

      Thanks for your answer. Maybe more than circular reasoning I was thinking of begging the question (but of course you won’t agree with that either).

      Given how people misunderstand power, I’m not sure stating your issues with this study in terms of power helps. Especially if you do it in an exaggerated fashion for higher dramatic effect.

      You say that there is a “plausible range of underlying differences” and whether it is +/- one tenth of a percentage point or +/- one percentage point, clearly it is quite narrow.

      If the measured effect is two orders of magnitude larger than what is considered possible, I don’t think a very fine statistical analysis is needed to suggest that the result may be a fluke. By the way, I don’t know if you’ve commented somewhere on the (according to Kanazawa) replication based on British data published in 2011.

      • Andrew says:

        Carlos:

        I’ve seen Kanazawa’s claimed replication. Big forking paths problem, or I guess we could say p-hacking. In the first paper, he had data with attractiveness on a 1-5 scale, and he compares 5 (“very attractive”) to 1,2,3,4. (I don’t think any other comparison would yield statistical significance.) In the second paper the data are coded differently, and it ends up that he labels 84% as attractive, 12% as unattractive, and the rest as neither. This is completely different than the first paper where most of the people are not characterized as “very attractive.” So, basically, enough degrees of freedom to find statistical significance.

        But I didn’t bother even commenting on the paper (except, when Kanazawa sent this paper to me, I replied that I thought his results could entirely be explained by noise; he thanked me but did not take my advice to heart), because, for reasons discussed above, the study had no chance of finding anything useful.

        And, yes, I agree that a very fine statistical analysis was not needed here. (If you read the literature on sex ratios, you’ll see that any difference of more than one percentage point would be extremely hard to imagine.) And it’s not that “the result may be a fluke”; it’s that the data from that survey provide essentially zero evidence on the topic of beauty and sex ratios.

        But that’s what’s so amazing! The beauty-and-sex-ratio paper was published in a reputable biology journal! For real! And it was featured on the Freakonomics website! Even though it was the statistical equivalent of a perpetual motion machine.

        Science is (or, until recently, was) really screwed up. Anything could get published: this paper, that ESP paper, all sorts of things; all that was needed was statistical significance. In retrospect, it’s stunning that so much statistical firepower has been needed to reveal these problems.

        And nothing in that above post was an exaggeration, except for that “power=.0500001” thing, which I’ve now fixed.

        • Chris Wilson says:

          Misunderstanding of NHST/p-values abounds- I would say *is* still screwed up ;)

        • Carlos Ungil says:

          > 84% as attractive, 12% as unattractive, and the rest as neither

          The beauty of the British people is legendary :-)

          At age 7 (the measure used in that study), the children were 85% attractive, 12% unattractive and 4% worse than unattractive (“Looks underfed”, “Abnormal feature”, “Scruffy and dirty”).

          The rating at age 11 (also available) was a bit less enthusiastic: 80% attractive, 15% unattractive and 5% worse than unattractive (“Undernourished”, “Abnormal feature”, “Slovenly, dirty”).

          The question of why that particular measure was used is very pertinent, especially if one considers that another study by the same author focused on people rated as attractive at both age 7 and age 11 by two different teachers, 62% of the population. By the way, this number seems very low: I don’t know if that indicates that the correlation between both ratings is low, that in many cases either of the ratings is missing, that there are many cases where the same teacher filled out the form at ages 7 and 11 and therefore was ignored (why?)…

          His latest paper may be interesting: “Why are there more same-sex than opposite-sex dizygotic twins?”
          Unfortunately he hasn’t posted the pdf on his webpage yet.
          https://academic.oup.com/humrep/advance-article-abstract/doi/10.1093/humrep/dey046/4925331?redirectedFrom=fulltext

  7. Jason Farnon says:

    May I ask where the linked discussion is drawn from? Whenever I google quotes I just get the transcript-thing hosted on this site.

  8. Marcel van Assen says:

    Here we use a Bayesian approach where we compare the likelihoods of zero, small, medium, and large effects, taking into account the statistical significance of the original effect size, when jointly evaluating the effect of an original and a replication study:

    http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175302

    ‘Bayes factors’ and posterior model probabilities are also calculated.

    I believe selecting effect sizes of zero, small, medium, and large is more meaningful than -.001, 0, .001.

  9. Jim Hatton says:

    My idea for a new elementary statistics textbook. There are millions of statistics textbooks and the students thereof that rely solely on p-values. And I know that purely Bayesian texts have not caught on. So how about writing a text with exactly the same problems as addressed by the classic texts but solved from a modeling perspective. Beginning students treat any computer use as a black box so why not use modeling software rather than, say, standard regression programs. What would happen, I think, is that the few students that become practitioners will use the better models and the rest of the students see best practice.
