“What is a good, convincing example in which p-values are useful?”

A correspondent writes:

I came across this discussion of p-values, and I’d be very interested in your thoughts on it, especially on the evaluation in that thread of “two major arguments against the usefulness of the p-value:”

1. With large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis.

2. Almost no null hypotheses are true in the real world, so performing a significance test on them is absurd and bizarre.

P.S. If you happen to share this, I’d prefer not to be identified. Thanks.

This seems like a pretty innocuous question, so it’s not clear why he wants to keep his identity secret, but whatever.

I followed the link above and it was titled, “What is a good, convincing example in which p-values are useful?”

I happen to have written on the topic! See page 70 of this article, the section entitled “Good, Mediocre, and Bad P-values.”

I also blogged on this article last year; see here and here.

52 thoughts on "“What is a good, convincing example in which p-values are useful?”"

  1. This blogpost from yesterday is relevant here: “So you banned p-values: how’s that working out for you?”
    https://errorstatistics.com/2016/06/08/so-you-banned-p-values-hows-that-working-out-for-you-d-lakens-exposes-the-consequences-of-a-puzzling-ban-on-statistical-inference/

    Significance tests are a key tool in testing assumptions of statistical models. They usually (but not always) require supplementation so as to determine the magnitude of discrepancies that are and are not indicated. That is what my severity assessment tries to do. Model validation is one of the few cases where significance tests may be used without the Neyman-Pearson introduction of an alternative. (For some strange reason, many fields teach a distorted animal, NHST, that permits moving from rejecting a null hypothesis to a causal research claim. Such claims pass with very poor severity. Moreover, people who stick with NHST are at a loss in construing nonsignificant results.)

    The person says almost no null hypotheses are true in the real world. Fine, but try telling that to Ioannidis, Berger, and many others who are fond of putting a spiked prior of at least .5 on the null hypothesis. That is the main basis of criticizing the p-value as “exaggerating” evidence! The argument falls apart otherwise. On the other hand, we error statisticians ARE interested in discerning “how false” nulls are by estimating discrepancies that are or are not warranted.

    P-values are used to show failure of replication, for instance that only ~40% of the 100 OSC replications led to small p-values. If you’ve arrived at a low p-value via cherry-picking, p-hacking and an assortment of QRPs (i.e., when you have failed Fisher’s requirements about (a) needing more than an isolated p-value and (b) prohibiting biasing selection effects), then the p-value will give you a hard time when it comes to a preregistered replication.
    “The paradox of replication and the vindication of the p-value”
    https://errorstatistics.com/2015/08/31/the-paradox-of-replication-and-the-vindication-of-the-p-value-but-she-can-go-deeper-i/

    Fisher intended significance tests to go hand in hand with randomization. Given randomization, you can determine the p-value with no other assumptions. Personally, I was very thankful for the randomized controlled trials and reported p-values regarding hormone replacement therapy (HRT). Literally overnight, doctors went from foisting HRT on women of a certain age to warning against it (except in certain cases). The medical world had been so convinced of HRT’s benefits to women for age-related diseases that it was said to be unethical to run these controlled trials. Women in Congress and elsewhere had to fight for them. But the effects are small, and without significance levels, they wouldn’t have realized (even before the end of the trial) that HRT wasn’t helping, but rather hurting.

    Then there’s fraud-busting and statistical forensics:
    “Who ya gonna call for statistical fraudbusting” https://errorstatistics.com/2014/05/10/who-ya-gonna-call-for-statistical-fraudbusting-r-a-fisher-p-values-and-error-statistics-again/
    “P-values can’t be trusted except when used to argue that p-values can’t be trusted”

    https://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/

    I’ve written this very quickly; see my blog for more, e.g., 5-sigma in discovering the Higgs particle.

  2. I’ll comment here on Amoeba’s and CliffAB’s defense of P-values in response to the two statements above. First, Amoeba states that he’s never seen papers not report effect sizes. In physiology/ecology/evolution it is quite common to report only asterisks or N.S. More importantly, in the vast majority of cases when effect sizes are reported, they are not interpreted in any meaningful way. Should we be excited about this effect or not? Second, Amoeba doubts researchers increase N ad infinitum to achieve significant P-values. I’d say that in the human biosciences this is exactly what is done, especially with observational nutritional studies and genomic studies like GWAS (which seem to think that large sample size solves all the problems of observational studies). A great example is this paper from last week’s issue of Nature (“Genome-wide association study identifies 74 loci associated with educational attainment”), which used N=293,723 to identify 74 genes with an effect on educational attainment with an R-squared of 0.005. Third, CliffAB says that they aren’t often interested in effect sizes in genomics since these vary wildly among genes, but only IF there is an effect. Huh? CliffAB himself stated “for some genes, having 2x higher expression doesn’t mean anything, while on other tightly regulated genes, 1.2x higher expression is fatal,” so interpreting effect sizes is important! There is just no universal “large” or “consequential” effect size.

    • My point with that example is that there is no way, a priori, to define what an “interesting” effect size is. Really, the only thing we care about in that type of study (and at that stage of the study) is finding genes that we are sure are correlated with disease. Whether the difference is 2x or 20x, we are much, much more concerned about whether a difference is systematic, rather than a random fluctuation, than about whether it’s a large enough effect to be interesting. In fact, we don’t even know what “large enough to be interesting” is yet! If I tell you “the average log fold change in expression levels between diseased and controls was 0.5 with n = 15 for gene X,” you have *no idea* if that’s even interesting yet. But if I tell you “the p-value (after multiple comparison correction) is 0.002,” you should want to follow up on that. (See the sketch at the end of this comment.)

      For the record, it’s my belief that there are plenty of problems caused by p-values within the scientific community. But my point in that post is that the problem is not that p-values are a tool fundamentally flawed in its theoretical grounding. Rather, the real issue is the human aspect of the tool: p-values are misused and abused constantly. With that in mind, if one wants to help the scientific community by replacing the p-value, one should look for methods that are easier for non-statistical scientists to use and more transparent, rather than methods with more bells, whistles, and caveats.

      But I cannot say I know what such a method is.
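      Here is a minimal sketch, with simulated data, of the kind of screening described above: test each gene, adjust the p-values for multiple comparisons (Benjamini-Hochberg, written out by hand for transparency), and flag genes on the adjusted p-value rather than on the size of the fold change. Everything here (gene counts, group sizes, effect sizes) is invented for illustration.

        # Minimal sketch (invented data): screen genes on adjusted p-values,
        # not on fold change, as in the expression example discussed above.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n_genes, n_per_group = 1000, 15

        # Simulated log-expression: most genes null, 20 with a real group difference.
        diseased = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
        controls = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
        diseased[:20] += 1.5

        t_stat, p = stats.ttest_ind(diseased, controls, axis=1)

        # Benjamini-Hochberg step-up adjustment, done by hand.
        order = np.argsort(p)
        ranked = p[order] * n_genes / np.arange(1, n_genes + 1)
        adj = np.minimum.accumulate(ranked[::-1])[::-1]
        p_adj = np.empty_like(p)
        p_adj[order] = np.clip(adj, 0, 1)

        hits = np.where(p_adj < 0.05)[0]
        print(f"{len(hits)} genes flagged for follow-up (FDR < 0.05)")

      Note that the decision of which genes to follow up never consults the fold change itself; at this stage the p-value is doing the filtering.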

      • But this is precisely the single use of p-values that makes sense: as a filter/trigger/detection threshold that lets you spend less time looking at things that are “typical.”

        I honestly can’t see any reason to use p-values other than as a detector for unusual events.

  3. In the structural equation modeling literature, the model chi-square (and its associated p-value) reflects the discrepancy between the variance-covariance matrix that is reproduced/implied by the model and the observed covariance matrix. As such, p-values provide information relating to possible misspecification of the model; that is, a small p-value indicates that something is wrong. Now, of course, a high p-value or low chi-square does not necessarily mean that you have the correct model (Hayes has written about this a lot), but at least a high chi-square and low p-value provide some clue that the model is imperfect and that you may want to try to figure out the source of the misspecification.
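    For readers who want to see the mechanics, here is a rough sketch of that model chi-square using the standard maximum-likelihood fit function F = log|Σ| + tr(SΣ⁻¹) − log|S| − p, with T = (N − 1)F referred to a chi-square distribution on the model’s degrees of freedom. The covariance matrices, sample size, and degrees of freedom below are all invented for illustration, not taken from any real model.

      # Rough sketch of the SEM model chi-square described above.
      # S is the observed covariance matrix, Sigma the model-implied one;
      # all numbers here are invented for illustration.
      import numpy as np
      from scipy.stats import chi2

      S = np.array([[1.00, 0.45, 0.40],
                    [0.45, 1.00, 0.35],
                    [0.40, 0.35, 1.00]])
      Sigma = np.array([[1.00, 0.42, 0.42],
                        [0.42, 1.00, 0.42],
                        [0.42, 0.42, 1.00]])
      N, p_vars, df = 300, 3, 1   # sample size, number of variables, model df (assumed)

      F = (np.log(np.linalg.det(Sigma)) + np.trace(S @ np.linalg.inv(Sigma))
           - np.log(np.linalg.det(S)) - p_vars)
      T = (N - 1) * F
      print(f"chi-square = {T:.2f}, df = {df}, p = {chi2.sf(T, df):.3f}")

    A small p-value here plays exactly the role described above: a clue that the model-implied and observed covariance matrices disagree by more than sampling error alone would explain.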

  4. I don’t understand. If there had been a larger sample, wouldn’t the election case have rejected the null? What would you have concluded then?

    If your conclusion would have changed just by having more data, how can you justify your conclusion with the given data?

    “Ladies and gentlemen of the jury — it only appears that there was no fraud because we don’t have enough data: with the right amount, the fraud would appear plain as day.”

    • Lauren:

      In that example, I think it was helpful to be able to report that the data were consistent with the null hypothesis. I’m not quite sure what you mean when you ask “If there had been a larger sample.” We had all the votes for all the candidates. By “a larger sample,” do you mean “intermediate vote totals at more time points”? If I had lots more of these intermediate vote totals, I suppose I would’ve done a different analysis. The real point was that there was no evidence of anything going on, so it was convenient to say that the variation was entirely explainable by chance. The real variation is of course not zero—different sorts of people vote at different times of the day—so the non-rejection should not be taken as acceptance of the null hypothesis. But sometimes non-rejection can be valuable too: it’s a statement that any signal is swamped by the noise.

      • Is that like saying if you had enough data to reject the null with your first test, you would’ve picked a different test so you didn’t reject it?

        Your model is wrong — why are you “testing” if it holds in real life? You know from prior information that it’s wrong. “The data are consistent with the model” isn’t true: you can think of innumerable ways in which it’s inconsistent with the assumed model, based on prior information. I don’t understand what you’re doing when you do this test. Certainly not making sure that the data are “consistent” with the model.

        • Xor:

          Just to be clear: I’m not saying that p-values are my recommended approach in this or any other problem. I’m just saying that p-values can be a convenient practical tool and in this case they happened to be useful.

        • Xor: The data can only ever be “consistent” with the null model regarding the specific tested aspect (or “in the tested direction”). This particularly means that if you can’t reject the null model in this direction, surely the data don’t give evidence that something special/non-null that you’d like to interpret as a “discovery” is going on regarding this specific aspect (e.g., a mean difference between two groups).

        • You can test if the method’s reported error probabilities are approximately the same as the actual ones. That’s the basis for valid statistical inference (at least in the school within which p-values occur). The idea that we don’t care about adequately capturing systematic information within imperfect models, because they are imperfect, misunderstands scientific inference.

        • “You can test if the method’s reported error probabilities are approximately the same as the actual ones.”

          What are the studies/papers/books you like that actually test error probabilities using real-world data (which I take to mean something like how often a confidence interval covers the real parameter value, or how often we reject a known true null hypothesis, or something like that)?

          I’m looking for empirical evaluations of the quality of statistically-based social science research.

        • At least for non-randomized studies, the reported error probabilities won’t be anywhere close to the actual ones, though with some clever work they can be estimated, as in clinical research https://scholar.google.com/citations?view_op=view_citation&hl=en&user=XDajlywAAAAJ&sortby=pubdate&citation_for_view=XDajlywAAAAJ:PVjk1bu6vJQC

          Now even in randomized studies in social science research, the same issues are likely to apply—non-random dropout, non-compliance, loss of blinding—though the reported and actual error probabilities presumably won’t be as discrepant.

          So for most statistically-based social science research (and clinical research), an H0 that the reported error rates are equal to the actual ones would not make sense.
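          Here is a toy simulation of that point (mine, not from the cited papers): the treatment effect is exactly zero, but an unadjusted comparison in the presence of a confounder rejects far more often than the nominal 5%, so the reported and actual error probabilities are nowhere near each other.

            # Toy simulation: true treatment effect is zero, but a confounder
            # drives both treatment and outcome, so the nominal 5% test
            # rejects at a much higher actual rate. All numbers invented.
            import numpy as np
            from scipy import stats

            rng = np.random.default_rng(1)
            n, n_sims, rejections = 200, 2000, 0

            for _ in range(n_sims):
                confounder = rng.normal(size=n)
                treated = (confounder + rng.normal(size=n)) > 0   # sicker people get treated
                outcome = confounder + rng.normal(size=n)         # depends on confounder only
                _, p = stats.ttest_ind(outcome[treated], outcome[~treated])
                rejections += (p < 0.05)

            print(f"actual rejection rate under a true null: {rejections / n_sims:.2f} (nominal 0.05)")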

        • so prob(p < 0.05 | Nothing Going on) ~ 54% in drug trials. Awesome. Think of how much value we've generated by boosting the GDP through sales of ineffective drugs!

          (Ouch, I may need surgery to remove my tongue from my cheek… hopefully the anesthetics are effective)

        • Daniel, if “54% of findings with p < 0.05 are not actually statistically significant” we’re getting significant results twice as often as we should.
          I think you meant prob(p < 0.05 | Nothing Going on) ~ 11%. Even using their worst assumption, it's "only" 36%.

        • Carlos, I must have misread the abstract. It’s pretty complex to figure out what “54% of findings with p < 0.05 are not actually statistically significant” means. What does it mean to be “actually statistically significant”?

        • @Daniel

          The anesthetics are most likely effective (this is one case where lack of effectiveness is so obvious that drug companies can’t cheat on it), but watch out for the side effects!

        • Lauren: the paper was “Interpreting observational studies: why empirical calibration is needed to correct p‐values.”

        • Keith,

          Sorry for the late response, this deserved better (apparently “service” translates to “missing all the fun”).

          This is really helpful – just the kind of thing I was fishing for. And those are pretty close to my prior in terms of “rejection rates” on observational studies that get published.

          But I also think your point is important to both social science and medical research – rejection rates at this level are likely not just a problem for observational studies but for RCTs too. I think the way the authors write about RCTs in this paper shows a bit more faith than I would have. And I don’t like the terminology of “systematic” and “random” error – I am not sure I believe in either one when it comes to human beings, but I also suspect that this is partly just nomenclature and trying to speak in a way the field understands. That said (and to be all Andrew about it), I probably prefer more specific discussions of the types of biases in point estimates and the mis-specifications in precision estimates that lead to these over-rejections. I would guess that it is both a problem of making point estimates that are biased by selection and omitted variables AND a problem of estimating standard errors that are too small. Maybe I didn’t read carefully enough… but I’ll be going back to it.

          Thanks!

        • jrc, is Alwyn Young’s “Channeling Fisher” an example of what you’re looking for?

          http://personal.lse.ac.uk/YoungA/ChannellingFisher.pdf

          My paper “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique” gives an example of a calibration exercise (Section 7) to check the validity of confidence intervals derived from robust standard errors, using data from a paper by Angrist, Lang, and Oreopoulos. But it’s not an evaluation of ALO’s methods, just a simplified example to illustrate some of the kinds of checks that maybe we should all do more often in our own work.

          http://www.stat.berkeley.edu/~winston/agnostic.pdf

        • I truly feel sorry for the last author of that paper — to have a last name that sounds like “cookie city”!

        • Winston,

          That Alwyn Young paper is a really nice example of what I’m looking for – it is in my pile of “I’ve read this and want to read it again”.

          And I like your paper too, and will add it to my growing lit review on “experimental statistics”. Thanks! One thing I like about it is that it focuses the issue of getting proper standard errors on random assignment and not on properties of X and Y themselves (something people in the Biostats world obsess about – I wonder if it is something about ANOVA). Much of this literature can read like “look how horrible everything is because our models are wrong.” Yours says “Yeah, of course our models are wrong, so what? Can we still get good estimates?” And that is a good contribution to both our understanding of the properties of estimators AND to a kind of “pragmatic anti-realist” epistemology of statistical inference: you don’t have to believe the models in terms of functional forms or data generating processes or their relationship to the “true world”, you can believe them in terms of their empirical usefulness.

          I think that last point is super important. I hear lots of people complain that “but your model doesn’t include…” and then the falsity of the model becomes a barrier to them believing the results. But no model in the social sciences includes everything, and building up a convincing empirical body of literature showing when that does/doesn’t matter is, to me, an important and under-studied issue for the science.

          Also, it is nice to see things sorta work sometimes. Thanks!

  5. I am fascinated by the idea that one-sided p-values are equivalent to the posterior probability of having the effect in the wrong direction under non-informative priors. Perhaps this is because it allows me to be a “lazy Bayesian” as I have joked with some of my colleagues. My understanding of your critique of non-informative priors in that article is that you really don’t think our true priors ought to be non-informative in most (many?) cases, as in the case of both ESP and birth ratios. Is that correct?

    I also wonder how much using p-values as loose statements about posterior probabilities replicates the problem of establishing binary cutoffs for belief in a result. Logically it seems to me that, in the genuine absence of informative priors, your best guess as to the true effect is whatever you got in the data, but you may hold that belief with different degrees of certainty, which are given by the posterior distribution in its totality. Null hypothesis significance testing ignores this issue, and thus paradoxically converts a measure of uncertainty into a measure of certainty. We shouldn’t replicate that problem in Bayesian methods.
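    As a concrete check of the equivalence mentioned above: in the simplest case (a normal estimate with known standard error and a flat prior), the one-sided p-value and the posterior probability that the effect goes the other way really are the same number. The estimate and standard error below are made up.

      # Numerical check of the equivalence, simplest case: normal estimate,
      # known standard error, flat prior. Numbers are made up.
      from scipy.stats import norm

      estimate, se = 1.2, 0.8

      p_one_sided = norm.sf(estimate / se)                        # p-value for H0: theta <= 0
      posterior_wrong_sign = norm.cdf(0, loc=estimate, scale=se)  # flat-prior P(theta <= 0 | data)

      print(p_one_sided, posterior_wrong_sign)   # same number, different interpretations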

    • Aaron:

      Yes, as you say, I think we should be using much stronger priors in many cases. Lots of problems come from the routine use of noninformative priors (or the equivalent classical methods).

    • I wouldn’t say they are “equivalent,” since the interpretation is different under the Bayesian and frequentist models. It is just that you’d get the same numbers under each interpretation.

    • Imagining a degree of belief assignment to statistical hypotheses is not the only way to “qualify” a statistical inference. We error statisticians qualify it by measures of the precision and probativeness of the test or estimation method. We scarcely have or trust our degrees of belief in an exhaustive set of hypotheses when we set out to learn new things about the world, and so even if we may grant there’s a place for that type of qualified inference, it’s wrong to suppose that’s the ONLY, or even the preferable, way to take into account the incompleteness of knowledge & the inaccuracy of measurements and models. Even a statistician like Box, who is Bayesian when it comes to the “estimation” problem, insisted we need frequentist significance tests for the task of arriving at models. I can give quotes.

  6. “1. With large samples, significance tests pounce on tiny, unimportant departures from the null hypothesis.”

    a) One-sided tests at least only do this if the departure is in the right direction.
    b) You could choose the smallest departure from “literally nothing’s going on” you’d find “important” as H0 and use a one-sided test (this is quite similar to some of Mayo’s “severity” computations).
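    A minimal sketch of option (b), with an invented threshold and simulated data (the `alternative` argument requires scipy 1.6 or later): make the smallest departure you would call important the null value, so that rejection speaks to a non-trivial effect rather than to any departure at all.

      # Sketch of (b): test against the smallest departure deemed important
      # (delta_min, an invented threshold), one-sided.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(2)
      delta_min = 0.2                                    # smallest "important" effect (assumed)
      data = rng.normal(loc=0.5, scale=1.0, size=5000)   # large sample, true effect 0.5

      # H0: mu <= delta_min   vs   H1: mu > delta_min
      t, p = stats.ttest_1samp(data, popmean=delta_min, alternative='greater')
      print(f"p = {p:.4f}  (rejection now indicates an effect beyond {delta_min}, not just beyond zero)")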

    • Add to Christian’s sensible pair of notes the fact that a finding of a tiny, unimportant departure from the null hypothesis is only a concern for those who stupidly fail to look at the data! Any rational assessment of what the data say has to include an assessment of the effect size.

      Objection 1 from the original query carries with it an implied condition that is itself a far larger problem than any of the misinterpretations of P-values.

    • One-sided tests only make sense in totally ordered parameter spaces, and in all but the simplest cases there’s no natural ordering for departures from the model being tested.

      A bog-standard practical example: a 5-by-2 contingency table for which the null hypothesis is independence (i.e., cell probabilities are equal to the product of the row and column marginals) and any deviation from this model is of interest. There’s no way to set up a one-sided test because the full parameter space is 9-dimensional; even the null model space is 5-dimensional (and all of the dimensions are nuisance parameters).
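      For concreteness, the omnibus test on such a table looks like this (invented counts); any pattern of departure from independence inflates the statistic, which is exactly why there is no one-sided version to choose.

        # Chi-square test of independence on a 5-by-2 table (invented counts).
        import numpy as np
        from scipy.stats import chi2_contingency

        table = np.array([[30, 20],
                          [25, 25],
                          [40, 15],
                          [22, 28],
                          [35, 20]])

        chi2, p, dof, expected = chi2_contingency(table)
        print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")   # dof = (5-1)*(2-1) = 4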

      • “in all but the simplest cases” – these simplest cases account for quite a large share of the applications of p-values, though.

        Regarding 5*2 tables, people would still normally test this using a 1-d test statistic (a chi^2 test), although here one wouldn’t normally choose between one- and two-sided tests because the statistic’s minimum is zero, which would be a perfect fit. One could do something along the lines of what I have suggested before (specifying a minimum relevant magnitude of deviation from the null model – this would implicitly aggregate the 9-d deviations you’re talking about) here, too, but probably it’s not trivial.

        Anyway, in the original posting, statement 1 sounded like a general claim, and I didn’t mean to say that it *never* holds. Still, also in 5*2 tables, Michael’s general comment holds: if you reject H0 but the deviation from it is practically irrelevant, you can see that and act (or not act) on it. By the way, this issue in contingency tables is as relevant for the Bayesian as it is for the frequentist. When making inferences about independence, you need to think about how big a deviation from it is relevant; otherwise you’ll be in trouble.

  7. p-values are great for determining when a fitted default model is most likely being violated. For example, do you want a method for day-care teachers to detect whether a child is sick with a virus? Take 100 children who are not currently exhibiting any symptoms at all, and take their temperatures with one of those convenient IR ear thermometers. Find the 95th percentile of the temperature readings. Then, if you suspect that a child may be sick with a virus, take their ear thermometer reading and see if it’s elevated above that established 95th percentile for well children. (A sketch of this procedure appears at the end of this comment.)

    I can’t think of any case where p-values are being used effectively that isn’t pretty close to an isomorphism with this problem. The proper role of p-values is essentially to filter out “the usual case” from a data set.

    http://models.street-artists.org/2015/07/29/understanding-when-to-use-p-values/
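    A minimal sketch of the thermometer procedure, with made-up readings: the “p-value” of a new reading is just the fraction of well-child readings at least that high, and the 95th-percentile threshold is the corresponding 5% detection rule.

      # Sketch of the ear-thermometer example above (made-up readings).
      import numpy as np

      rng = np.random.default_rng(3)
      well_temps = rng.normal(36.8, 0.3, size=100)     # baseline: 100 well children
      threshold = np.percentile(well_temps, 95)

      new_reading = 37.9
      p_value = np.mean(well_temps >= new_reading)     # empirical tail probability

      print(f"threshold = {threshold:.1f} C, p = {p_value:.2f}, flag = {new_reading > threshold}")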

  8. Mayo’s post contains most of what is needed to answer the question. However, I would like to object to the zombie objections to P-values that are numbered 1 and 2 in the original post.

    1. P-values often say that negligibly small effects are ‘significant’. So what? Look at the data and take ownership of any scientific inferences. If the effect is trivial, say it’s trivial and move on. Failure to look at the data and assess the effect size using a rational scaling is the biggest possible sin of data analysis. Don’t blame P-values for stupid behaviour of analysts that is largely a consequence of misapplication of accept/reject rules for hypothesis tests.

    2. The idea that there are no true null hypotheses in the real world is irrelevant to the interpretation of a P-value, because P-values ‘live’ within a statistical model! The null hypothesis in the statistical model can certainly be true. It may not map directly onto a real-world hypothesis, but that mapping is the job of the rational mind making inferences, not the P-value.

  9. I’ve visited that link before, and I find it telling that the top answer doesn’t give an example of a useful p-value, and the second one starts off with the words “I take great offense”.

    • Well, that was supposed to be tongue-in-cheek.

      And part of that is that the question is so silly that it’s hard to take seriously, as I explained later. One example where p-values are helpful? When I observe a difference between two groups and I’m curious what’s the probability that could just be by chance rather than systematic differences. That’s a pretty common question to have.

      So if you don’t accept that very simple, clear answer, it’s just an odd question to even start with.

      • Cliff:

        Sorry, but the p-value does not give you “the probability that could just be by chance rather than systematic differences.” That’s a common misconception, way up there on the list of standard misunderstandings of p-values. Indeed, that’s a major problem with p-values, that they are interpreted in that way.

        Your answer is, as you put it, “very clear” and “simple.” It’s also wrong.

        • Yes, you are certainly right.

          I swear I did mean “the probability of seeing a difference the observed amount or greater conditional on there being no systematic difference”. But I have not needed to explain a p-value in so long I got lazy with the explanation.

          -1 for p-values; the ease with which one can slip up on the explanation.
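          For what it’s worth, here is a short simulation of that corrected definition (invented samples): shuffle the group labels, which enforces “no systematic difference,” and count how often the shuffled difference is at least as large as the observed one.

            # Permutation version of the definition above: probability, under no
            # systematic difference, of a difference at least as large as observed.
            import numpy as np

            rng = np.random.default_rng(4)
            group_a = rng.normal(0.0, 1.0, size=30)
            group_b = rng.normal(0.5, 1.0, size=30)
            observed = abs(group_a.mean() - group_b.mean())

            pooled = np.concatenate([group_a, group_b])
            n_perm, count = 10000, 0
            for _ in range(n_perm):
                rng.shuffle(pooled)
                count += abs(pooled[:30].mean() - pooled[30:].mean()) >= observed

            print(f"permutation p-value = {count / n_perm:.3f}")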

        • Cliff:

          I have no doubt that this is what you meant but didn’t fully write down.

          What I think is happening here is that folks are disregarding different grades of apprehending the p-value.

          There is the identification grade of being able to discern what is and is not a p-value in a study report (most get this correct).

          There is the nominalist grade of getting the formal definition of a p-value (which many get right when forced to be careful).

          And there is the pragmatic (purposeful) grade of knowing what to make of a given p-value (i.e., what future research should be done or even what conclusions should be drawn), and that one I don’t think anyone has really figured out very well.

          So the nominalist grade is clearer and simpler, but it’s not the important one.

          This my current favorite general discussion of the issues – http://link.springer.com/article/10.1007/s10654-016-0149-3

          And for those interested, here is the insight from C.S. Peirce on grades of conception:

          “Peirce discussed three grades of clearness of conception:

          Clearness of a conception familiar and readily used, even if unanalyzed and undeveloped.
          Clearness of a conception in virtue of clearness of its parts, in virtue of which logicians called an idea “distinct”, that is, clarified by analysis of just what makes it applicable. Elsewhere, echoing Kant, Peirce called a likewise distinct definition “nominal” (CP 5.553).
          Clearness in virtue of clearness of conceivable practical implications of the object’s conceived effects, such as fosters fruitful reasoning, especially on difficult problems. Here he introduced that which he later called the pragmatic maxim.”
          https://en.wikipedia.org/wiki/Charles_Sanders_Peirce

        • Cliff,

          And what you swear you did mean still misses the parts “contingent on the model assumptions being true” and “in comparison only with other suitable samples of the same size.” (Note: “suitable” refers here to the fact that model assumptions include restrictions on what samples are suitable — e.g., iid is one simple example.)

          I heartily agree that a big problem with p-values is that the definition is complex and easily oversimplified.

  10. @Keith, OJM, etc regarding this paper linked by Keith: http://onlinelibrary.wiley.com/doi/10.1002/sim.5925/full

    Reading through it, we find that when they use negative controls to look for adverse effects, using the techniques standard in the field, they find p < 0.05 between 50 and 72% of the time even though they know there is nothing going on for those drugs.

    “From Figure 1, it is clear that traditional significance testing fails to capture the diversity in estimates that exists when the null hypothesis is true. Despite the fact that all the featured drug–outcome pairs are negative controls, a large fraction of the null hypotheses are rejected. We would expect only 5% of negative controls to have p < 0.05. However, in Figure 1A (cohort method), 17 of the 34 negative controls (50%) are either significantly protective or harmful. In Figure 1B (case–control), 33 of 46 negative controls (72%) are significantly harmful. Similarly, in Figure 1C (SCCS), 33 of 46 negative controls (72%) are significantly harmful, although not the same 33 as in Figure 1B."

    And when they apply their adjustments to the literature, they find that 82% of the estimates they found in the literature were statistically significant at p < 0.05, whereas after adjusting the estimates to take into account what they found using the negative controls, only about 14 to 38% of these estimates should really be considered justified as “statistically significant” by the observed frequencies. (That’s a subtle point: it’s not that the rest are actually false… we don’t know what’s really true in the real literature, only in the negative controls.)

    “Figure 4 shows the number of estimates per publication year. The vast majority of these estimates (82% of all estimates) are statistically significant under the traditional assumption of no bias. But even with the most modest assumption of bias (mean = 0, SD = 0.25), this number dwindles to less than half (38% of all estimates). This suggests that at least 54% of significant findings would be deemed non-significant after calibration. With an assumption of medium size bias (mean = 0.25, SD = 0.25), the number of significant findings decreases further (33% of all estimates), and assuming a larger but still realistic level of bias leaves only a few estimates with p < 0.05 (14% of all estimates).”
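    For intuition only, here is a heavily simplified sketch of the calibration idea (mine, not the paper’s actual method or code; all numbers invented): fit an empirical null to the negative-control estimates and judge a new estimate against that, rather than against a bias-free null centered at zero.

      # Heavily simplified sketch of empirical calibration (invented numbers).
      import numpy as np
      from scipy.stats import norm

      rng = np.random.default_rng(5)

      # Log effect estimates for negative controls (true effect: none), showing bias.
      neg_controls = rng.normal(loc=0.15, scale=0.30, size=40)
      mu0, sd0 = neg_controls.mean(), neg_controls.std(ddof=1)

      new_estimate, new_se = 0.45, 0.12     # a new drug-outcome estimate (invented)

      p_naive = 2 * norm.sf(abs(new_estimate) / new_se)               # bias-free null at zero
      p_calibrated = 2 * norm.sf(abs(new_estimate - mu0) /
                                 np.sqrt(sd0**2 + new_se**2))         # empirical null

      print(f"naive p = {p_naive:.4f}, calibrated p = {p_calibrated:.3f}")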

    • Why me? I just suggested using p-values to construct ‘non-significance’ (confidence) intervals.

      A p-value just measures the ‘statistical position’ of the data relative to a model. Plot this for each parameter value. Also applicable for infinite-dimensional models – work with functionals of the model.
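      A small illustration of that suggestion, with an invented sample: compute the p-value at each hypothesized mean on a grid (the “p-value function”), and read off the 95% confidence interval as the values not rejected at the 5% level.

        # p-value function / test inversion on an invented sample.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(6)
        data = rng.normal(loc=2.0, scale=1.0, size=50)

        grid = np.linspace(0.0, 4.0, 401)
        pvals = np.array([stats.ttest_1samp(data, popmean=m).pvalue for m in grid])

        kept = grid[pvals > 0.05]    # means not rejected at the 5% level
        print(f"95% interval by test inversion: [{kept.min():.2f}, {kept.max():.2f}]")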

    • Daniel,

      Even for me, who is professionally skeptical, these estimates are at the high (or “scary”) end of my priors about coverage rates. But that said, there are lots of things going on here in the original observational studies: selection/omitted-variables biases; poor choice of precision (SE/p-val) estimators; and the Garden of Forking Paths. I am reminded of one of the points in Ioannidis’ paper about how as we compound individual elements of poor decision making/methods/processes, our p-values rapidly become completely uninformative.

      Maybe we should have a contest to see who can provide the most convincing statistical analysis of something known to be untrue, using the most reasonable- and transparent-seeming methods.

      • It doesn’t surprise me at all that NHST-based *observational* studies from medical data would do a bad job of estimating what is going on. I honestly don’t care about “coverage” or p-values, but I would love to see a re-analysis of this same dataset using the following Bayesian re-envisioning.

        Being under a doctor’s care involves having something wrong with you, which inevitably means potentially having more than one thing wrong with you so that you might experience symptoms that are of interest even if the drug is not what causes them. So, in terms of looking at the rates of things like gastric bleeding and liver injury (which are the outcomes of interest in this paper) we should be considering the dimensionless ratios of the incidence and severity when taking any given drug D vs the incidence and severity averaged over people taking various random other drugs known not to cause those side-effects.

        In that context, we’re discussing the ratio of two quantities, f(Drug)/f(ControlDrugs), both of which are numbers between 0 and 1 and both of which have uncertainties associated with them. A Bayesian posterior distribution for this ratio would be of interest and would take into account both uncertainties. How would that analysis compare to the p-value/confidence-interval analysis? My guess is they would be fairly different.
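        A quick sketch of that posterior, with invented event counts and simple Beta(1, 1) priors: Monte Carlo draws from the two Beta posteriors propagate both uncertainties into the ratio f(Drug)/f(ControlDrugs).

          # Posterior for the ratio of two event rates (invented counts).
          import numpy as np

          rng = np.random.default_rng(7)

          events_drug, n_drug = 18, 900      # e.g., bleeding events on drug D (invented)
          events_ctrl, n_ctrl = 10, 1100     # events on comparator drugs (invented)

          f_drug = rng.beta(1 + events_drug, 1 + n_drug - events_drug, size=100000)
          f_ctrl = rng.beta(1 + events_ctrl, 1 + n_ctrl - events_ctrl, size=100000)
          ratio = f_drug / f_ctrl

          lo, hi = np.percentile(ratio, [2.5, 97.5])
          print(f"median ratio = {np.median(ratio):.2f}, 95% interval = ({lo:.2f}, {hi:.2f}), "
                f"P(ratio > 1) = {np.mean(ratio > 1):.2f}")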
