Difficulties in making inferences about scientific truth from distributions of published p-values

Jeff Leek just posted the discussions of his paper (with Leah Jager), “An estimate of the science-wise false discovery rate and application to the top medical literature,” along with some further comments of his own.

Here are my original thoughts on an earlier version of their article. Keith O’Rourke and I expanded these thoughts into a formal comment for the journal. We’re pretty much in agreement with John Ioannidis (you can find his discussion in the top link above).

In quick summary, I agree with Jager and Leek that this is an important topic. I think there are two key places where Keith and I disagree with them:

1. They take published p-values at face value, whereas we consider them the result of a complicated process of selection. This is something I didn’t use to think much about, but now I’ve become increasingly convinced that the problem with published p-values is not a simple file-drawer effect or the case of a few p=0.051 values nudged toward p=0.049, but rather an ongoing process in which tests are performed contingent on data. As Keith and I discussed in our article, Jager and Leek’s model of p-values could make sense in a large genetics study in which many, many comparisons are performed and the entire set of comparisons is analyzed (in essence, a hierarchical model), but I don’t think it works so well when you’re analyzing many different single p-values published in many different studies.

2. Jager and Leek talk about things such as “the science-wise false discovery rate.” I don’t think such a concept is so well defined. To start with, I don’t think people are usually studying zero effects. I think what is happening is that there are many effects that are small, and these effects can vary (so that, for example, a treatment could have a small positive effect in one scenario and a small negative effect in another scenario). Errors can be defined in various ways, but as a start I like to think about Type S (sign) and Type M (magnitude) errors. I certainly believe that the Type S error rate is less than 50%: we’d expect to get the sign of any comparison correct more than half the time, if there’s any signal whatsoever! How high is the Type M error rate? That depends on how large an error has to be to count as a Type M error. Are true effects overestimated by more than a factor of 2, more than 50% of the time? Possibly. This could be worth studying.
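
As a rough illustration of these definitions, here is a minimal simulation sketch; the true effect and standard error are hypothetical numbers, chosen only to show how the quantities behave when a small effect is estimated noisily and results are filtered on statistical significance:

```python
# Minimal sketch: Type S (sign) and Type M (magnitude) errors among
# "statistically significant" estimates, assuming a normally distributed
# estimate and hypothetical values for the true effect and standard error.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1   # hypothetical small true effect
se = 0.5            # hypothetical standard error of the estimate
n_sims = 1_000_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates) > 1.96 * se          # what gets "published"
sig_est = estimates[significant]

power = significant.mean()
type_s = np.mean(np.sign(sig_est) != np.sign(true_effect))  # wrong sign, given significance
exaggeration = np.mean(np.abs(sig_est)) / true_effect       # typical overestimation factor

print(f"power = {power:.3f}, Type S rate = {type_s:.3f}, "
      f"exaggeration ratio = {exaggeration:.1f}")
```

With inputs like these, the significant estimates get the sign wrong a nontrivial fraction of the time and overstate the magnitude by a large factor, which is the kind of pattern the Type S / Type M framing is meant to capture.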

In short, I think there are a few more steps needed before their method maps to science as practiced. But it’s great to see all this discussion. Simple calculations have their place, as long as their limitations are understood, and I believe that this sort of discussion pushes the field forward.

A separate but related issue is an idea of Jager and Leek’s that I think underlies all this work: that scientists are generally pretty reasonable people and that science as a whole seems to be pretty sensible. I’ll buy that. The Stapels and Hausers and Wegmans and Dr. Anil Pottis of the world are the exceptions; that’s what makes their stories so striking. And even the routine finding of statistical significance amid noise is, I’m willing to believe, usually done in the service of some true underlying effects. This makes it hard to believe that most published findings are false, and so on.

I wonder whether this particular issue can be resolved by considering areas of research rather than single papers. Suppose it’s true (as John Ioannidis and Keith O’Rourke and I suspect) that most scientific papers have a lot more noise than is usually believed, that statistically significant results go in the wrong direction far more than 5% of the time, and that most published claims are overestimated, sometimes by a lot. This can be OK if these scientific subfields are lurching toward the truth in some way. I think this could be a useful way forward, to see if it’s possible to reconcile the feeling that science is basically OK with the evidence that individual claims are quite noisy.

32 thoughts on “Difficulties in making inferences about scientific truth from distributions of published p-values”

  1. > … I’ve become increasingly convinced that the problem with published p-values is not a simple file-drawer effect or the case of a few p=0.051 values nudged toward p=0.049, but rather an ongoing process in which tests are performed contingent on data.

    Full disclosure: I’ve never made a decision based on a p-value and I can’t recall the last time I calculated one. That noted, suppose I resample my data (bootstrap) and run my analysis. I end up with a distribution of p-values. What fraction of that distribution has to lie below p=0.05 in order for me to declare significance? Would I report “A bootstrap analysis suggests that there is an [X]% probability that p<0.05”?

    • Chris:

      In theory the idea is that you’d produce just one summary number and compute the probability of getting something that extreme under the null hypothesis. So you’d end up with one p-value, not a distribution. Or, to put it another way, your distribution would collapse to a single value.

      The key point, though, is that to get a sense of this uncertainty you don’t want to be bootstrapping the data (or otherwise using the data to estimate your effect size). You want an external (prior) estimate of effect size. This is discussed further in my paper with John Carlin on retrospective design analysis.

  2. “Jager and Leek’s model of p-values could make sense in a large genetics study…”

    Such models are often used in genetics, but they don’t necessarily make much sense there either. The issue of causal identifiability is dealt with really sloppily in genetics, and multiple-comparison adjustments and/or mixture-distribution modeling are often used as an inadequate band-aid for those deeper study-design issues. I don’t know of any other field that gets away with putting such little thought into causality. I’ve even heard practitioners say that nothing “causes” genes, so the correlation between genotype and phenotype must be a causal effect!

    • I’ve even heard practitioners say that nothing “causes” genes, so the correlation between genotype and phenotype must be a causal effect!

      In a well-designed and executed study, this is of course true, in a way that can never be true for studies of correlations between phenotypes. Genotypes (or at least nearly all the ones in these types of studies) are set at fertilization, so they cannot be caused by things that happen later in life. Maybe I’m misunderstanding what you mean though.

      • Outside purely descriptive studies, the definition of a well-designed and executed study is that it succeeds in isolating a causal effect, so adding that qualifier makes the claim circular.

        In general, the notion that genotype-phenotype correlations are somehow less susceptible to confounding is a common misconception. As you’re aware, human history introduces a correlation between genotype and everything else in the world. The pop gen field tends to focus on studying large scale migrations because the signal for these events spans large regions of the genome, but there’s plenty of correlation structure at other scales. Population substructure algorithms or PCA subtraction is never going to fully get you back to random assignment. This means that there’s _always_ a backdoor path (in the sense of a causal graph) between genotype and phenotype that is never fully blocked.

        The fact that the field is focused on achieving statistical significance and multiple comparison adjustments makes things worse, since weaknesses in causal identifiability become increasingly apparent (i.e. produce more “true” correlations which don’t correspond to “true” causes – something distinct from “spurious” correlations) as sample sizes and statistical power increase.

        • I believe that you are exactly correct, and it’s something I’ve been thinking about for quite a while. There are always unmeasured (unknown, really) “confounders” in this type of genetic study, as in any observational associational study. Hell, the very concept of “a confounder” cannot be well-defined (*except for* genetics).

        • Outside purely descriptive studies, the definition of a well-designed and executed study is that it succeeds in isolating a causal effect, so adding that qualifier makes the claim circular

          No, there is a fundamental difference in genetic studies. Suppose I have a perfectly designed study of phenotype X and phenotype Y (say, obesity and type 2 diabetes) with no confounders. If I see an effect in this perfectly designed study, I can conclude X->Y or Y->X (either obesity causes diabetes or diabetes causes obesity).

          Now take a perfectly designed study for a genetic variant G and phenotype X. If I see an effect in this perfectly designed study, I can conclude only that G->X. There really is no ambiguity about the direction of causality (in this perfect world), because G was determined at fertilization.

          That’s all people mean with that statement.

        • The fact that the field is focused on achieving statistical significance and multiple comparison adjustments makes things worse, since weaknesses in causal identifiability become increasingly apparent (i.e. produce more “true” correlations which don’t correspond to “true” causes – something distinct from “spurious” correlations) as sample sizes and statistical power increase.

          This is true. But people of course worry about this, and it’s an empirical question how well they’re doing at correcting these things. IMO the view that large, well-designed genetic association studies are picking up confounders is almost certainly wrong. For example, you can take a genetic variant predicted to influence cholesterol levels, put it in a mouse, and show that it indeed influences cholesterol levels (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3062476/). This shows causality in the way that makes sticklers happy. It just takes lots of time and money.

        • “If I see an effect in this perfectly designed study, I can conclude only that G->X. There really is no ambiguity about the direction of causality (in this perfect world), because G was determined at fertilization.”

          This is just not true. Genetic variability is caused by (or regulated by, if you’d rather) selection, to some extent. The same environmental factors related to selection could be related to X. So, observing an association between G and X is just that, an association.

        • Sorry to belabor this, but I’m making what I think is a totally trivial point: assume you have observed an association between X and Y, and there are no other variables in the entire world (including your suggestion about selection that I don’t totally understand but will think about). In general, you can then conclude X causes Y or Y causes X. If one of the variables is a genotype, however, except in pathological cases causality can go only in one direction, because genotype was randomly assigned when a random sperm met a random egg. This is the point people are making when they say nothing “causes” genes.

        • +1 for Joe’s comments. That your genotype is created, at conception, purely as a random mix of mom and dad’s genotypes (Mendelian Randomization) is a really strong assumption, but also one that happens to be accurate. Consequently, the association studies – done well – really can make causal inferences.

          Stratification (i.e. confounding by ancestry, and in particular the lifestyle factors correlated with ancestry) is certainly a potential problem for such inferences – but one which the large scale of the data allows us to assess the severity of, with some accuracy. There is often *very* good reason to believe stratification is far from the biggest issue in these analyses – usually lack of power is more of a problem, because the effects are small.

          If you don’t buy these theoretical/statistical arguments, go look at the empirical evidence; genetic association studies – done well – have led to all sorts of new biology.

          @Anonymous; “putting such little thought into causality” – you seem to be overlooking a long and still-growing methods literature based around Mendelian Randomization. Please also extend a little kindness to authors who don’t fill their applied papers with causal arguments for you to read, because reviewers and editors think such arguments are so obvious as to be a waste of space.
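
          To make the instrumental-variable logic behind Mendelian randomization concrete, here is a toy sketch with simulated data and made-up coefficients; it uses a single variant, and the core assumptions (no pleiotropy, no genotype-confounder association) hold by construction, so it illustrates the idea rather than the practical difficulties being debated in this thread:

```python
# Toy sketch of the Mendelian-randomization / instrumental-variable idea:
# genotype G shifts exposure X, an unmeasured confounder U distorts the
# naive X-Y association, but the ratio of the G-Y to G-X associations
# recovers the causal effect of X on Y. All numbers here are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
g = rng.binomial(2, 0.3, n)             # genotype, fixed at conception
u = rng.normal(0, 1, n)                 # unmeasured confounder
x = 0.2 * g + u + rng.normal(0, 1, n)   # exposure (say, cholesterol)
y = 0.5 * x - u + rng.normal(0, 1, n)   # outcome; true causal effect of x is 0.5

beta_xy = np.cov(x, y)[0, 1] / np.var(x)   # naive regression: badly confounded
beta_gx = np.cov(g, x)[0, 1] / np.var(g)
beta_gy = np.cov(g, y)[0, 1] / np.var(g)

print(f"naive X-Y slope = {beta_xy:.2f}")
print(f"Wald ratio (MR) = {beta_gy / beta_gx:.2f}")   # close to 0.5
```

          Whether the no-pleiotropy and random-mating assumptions hold in real data is exactly what the rest of this thread is arguing about.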

        • I get the point, and there are some minutiae here about the timescale of selection, etc. However, in my opinion focusing only on the question of forward or reverse causality is a red herring, because it’s just one small form of confounding and observational study designs almost never convincingly narrow things down to either forward or reverse causality. So to me this notion of “nothing causes genes” reflects a failure to consider the relevant level of “randomization” in question.

          @Fred below — this is a good example of the confusion. Biologists often seem to frame questions as “is something random or not?”. Randomization at the level of parental mixing has no bearing on randomization of the assignment of genotypes to individuals. Furthermore, as I say above, researchers are wrong to focus on power. An infinite sample size would expose the problems I mentioned, not solve them. Yes, there’s work on PCA and population substructure, but stratification adjustment does not capture _all_ the historical structure that determines the genotype assigned to an individual. Therefore, there’s always a backdoor path for genotype-phenotype correlations.

          Don’t get me wrong, I’m not saying these studies are worthless. I do think there are better frameworks that could be proposed than the hypothesis testing + multiple correction approach that’s used now. However, I’m perfectly fine with these kinds of genetic correlations being treated as the observational, hypothesis-generating tool that they are. As Joe says, one can further examine causality with more convincing experiments, and in some cases you’ll get lucky and find something useful. However, I’m skeptical when people expect to understand causality from genotype-phenotype correlations alone. And the particular idea of forward/reverse causality being the only form of confounding misses the real problem.

        • Re: Mendelian randomization – perhaps you can enlighten me because there seems to be a pretty large disconnect between the assumptions and reality.

          “…if we assume that choice of mate is not associated with genotype (panmixia)…” First, there are explicit preferences associated with genotype (e.g. http://www.aeaweb.org/assa/2006/0106_0800_0502.pdf); second, even if there’s not an explicit preference, there’s the inconvenient issue that people tend to mate with others who are nearby.

        • @Anonymous

          I do think there are better frameworks that could be proposed than the hypothesis testing + multiple correction approach that’s used now

          No matter the framework, you still want very stringent statistical thresholds because in general your prior on any genetic variant being associated with the phenotype is low. See e.g. this discussion from the first serious genome-wide association study (http://www.nature.com/nature/journal/v447/n7145/box/nature05911_BX1.html). A back-of-the-envelope version of this argument is sketched after this comment.

          The simple fact is that large, well-designed association studies work: they identify associations between genetic variants and phenotypes that replicate in different countries (and continents) with vastly different environments, exposures, etc. Only a handful have been examined in the experimental detail one would like to “prove” causality, but I’d bet that’s more a limitation of time and money than anything else.

          Improving on these studies is of course possible. But if you really think that the 1000s of genetic associations that have been replicated many times over in various parts of the world are somehow spurious, it would obviously be useful for the field if you show this empirically and figure out why!
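
          Here is a rough numerical sketch of that argument; the prior and power below are hypothetical placeholders, and 5e-8 is the conventional genome-wide significance threshold:

```python
# Back-of-the-envelope sketch of why genome-wide thresholds are stringent:
# with a low prior probability that any given variant is truly associated,
# posterior odds ~ prior odds * (power / alpha) for a result significant at alpha.
prior_odds = 1e-5   # assumed: roughly 10 truly associated variants per million tested
power = 0.5         # assumed power to detect a real effect at the chosen threshold

for alpha in (0.05, 5e-8):
    posterior_odds = prior_odds * power / alpha
    prob_real = posterior_odds / (1 + posterior_odds)
    print(f"alpha = {alpha:g}: P(association is real | significant) ~ {prob_real:.3f}")
```

          Under a prior like this, nearly every hit at alpha = 0.05 would be a false positive, while most hits at 5e-8 would be real, which is the usual justification for the stringent threshold.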

        • @Anonymous;

          > there’s the inconvenient issue that people tend to mate with others who are nearby.

          … well, it’s no fun to mate with someone not nearby!

          The randomization in question is the bit that happens at conception. If, in people who have parents drawn from roughly the same population, those with one genotype are more likely to e.g. have Alzheimer’s Disease than those with another, then, because genotypes are unchanged by lifestyle factors, the variant in question (or something “near” it, genetically) can reasonably be assumed to have some causal impact (somehow) on AD. That’s the level of claim made – it’s not a particularly strong form of causality, but it’s not just association.

          If population stratification were an issue (and it is, sometimes) we’d see association signals all the way along the genome, not just in a tiny proportion of areas. When we don’t (which is often), any confounding must be very localized, genetically (so the signal won’t replicate), or we were incredibly unlucky in the original data – this also won’t replicate. The fact that spurious signals due to stratification come from across the genome also enables PCA adjustment (which uses the whole genome) to do a good job of getting rid of these common-variant signals that are due to confounding. No one claims these methods are foolproof, but there is not a lot of room for backdoor paths here. (A toy simulation of this kind of PC adjustment is sketched after this comment.)

          NB the issues you discuss were raised years ago by those who didn’t want to fund the initial genome-wide studies, fearing they would churn out garbage in the way you outline. As Joe notes, those fears have now been quite comprehensively debunked by results.

          > Researchers are wrong to focus on power

          I respectfully disagree; more power is better. Forgetting about power sends us back to the bad old days of data-dredging from a few genetic variants in tiny samples. Some signals were found, but the rates of replicability were appalling.
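
          A toy simulation of the PC adjustment mentioned above: two subpopulations differ in allele frequencies and in phenotype mean, so every variant is spuriously associated with the phenotype, and including the top principal components of the genotype matrix as covariates absorbs most of that structure. This is only a sketch of the idea, not the machinery used in practice:

```python
# Toy sketch of PCA adjustment for population stratification: simulate two
# subpopulations with different allele frequencies and a shifted phenotype
# mean (no variant has a real effect), then compare per-SNP effect estimates
# with and without the top genotype PCs as covariates.
import numpy as np

rng = np.random.default_rng(2)
n_per_pop, n_snps, n_pcs = 1_000, 500, 5

freqs = np.vstack([rng.uniform(0.1, 0.5, n_snps),      # allele frequencies, population 1
                   rng.uniform(0.2, 0.6, n_snps)])     # allele frequencies, population 2
genotypes = np.vstack([rng.binomial(2, freqs[0], size=(n_per_pop, n_snps)),
                       rng.binomial(2, freqs[1], size=(n_per_pop, n_snps))]).astype(float)
phenotype = np.concatenate([rng.normal(0.0, 1, n_per_pop),
                            rng.normal(1.0, 1, n_per_pop)])   # mean shift, no genetic effect

g_centered = genotypes - genotypes.mean(axis=0)
_, _, vt = np.linalg.svd(g_centered, full_matrices=False)
pcs = g_centered @ vt[:n_pcs].T          # genome-wide ancestry axes

def snp_effect(j, adjust):
    """Least-squares effect estimate for SNP j, optionally adjusting for the top PCs."""
    covs = [np.ones(len(phenotype))] + ([pcs] if adjust else [])
    design = np.column_stack(covs + [genotypes[:, j]])
    beta, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return beta[-1]

unadj = np.mean([abs(snp_effect(j, False)) for j in range(50)])
adj = np.mean([abs(snp_effect(j, True)) for j in range(50)])
print(f"mean |effect| over 50 null SNPs: unadjusted {unadj:.3f}, PC-adjusted {adj:.3f}")
```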

        • Have to get back to my real work, but I’ll reiterate a few more points about this:

          1. I’m fine with association studies being used as one exploratory approach, so long as they’re regarded for what they are. However, things also look a lot rosier when we can pick and choose results retrospectively. I’ve seen the other side where people have run 30,000 person studies and come up with nothing, or hits don’t lead to anything useful beyond an entry in a database. There’s also the decline effect (in effect size) and various other replication issues such as in http://www.nature.com/nature/journal/v447/n7145/full/447655a.html

          2. @Fred – you say the issues get cleared through non-reproduction. That’s good, but by that point how much money has been spent on these x-thousand person observational studies? As a data person, this is good business for me, but at some point I’d like to see the money spent on a more scientific approach. Also, speaking of statistical power, there are two routes to statistical power – articulating a more specific model/hypothesis a priori, and having a large sample size. Focusing on the latter is a race towards smaller and smaller associations.

          3. Causal claims also show up in genotype-phenotype associations where the representations are more sensitive to confounding, such as aggregated representations of the genotype. A recent example is the controversy over the Galor paper. This is a harmful side effect of the intellectual laziness of treating genotype allocation as some sort of natural experiment (“genotype-phenotype associations must be causal because nothing ’causes’ genes”) rather than as the outcome of a complex historical process.

        • I’ve seen the other side where people have run 30,000 person studies and come up with nothing

          Wait, I thought your argument was that there is so much unmodeled confounding that a large study should get all sorts of spurious results!

        • @Anonymous

          Re point 1 and the costs: genotyping already-collected specimens from people for whom we already know lots of disease outcomes doesn’t add much to the overall costs of the study – a couple of grants at most. Genotyping chips are cheap – cheaper than getting good analyses of their data! – and collecting specimens and outcomes is expensive.

          Re “the decline effect” – a.k.a. regression to the mean – this is known before we get to replication. Folk doing association studies *well* will allow for RTTM when planning their replication studies; there are several statistical methods that help with this. NB the Nature 2007 paper was fine in its day, but the field’s moved a long way in 6 years.

          Being more specific a priori is not a crazy idea, but it doesn’t have a good track record – candidate-gene work prior to the current genome-wide technology didn’t work very well. While we know, a priori, that there are associations to find (reflecting causal variants), there is not good prior information about where on the genome to find them.

          Re “nothing causes genes” – I appreciate these are not your views, but it’s the genetic variants – not the genes themselves – that the causal argument revolves around. What you’re quoting sounds like non-expert press releases that discuss scientists “finding THE gene for [height, obesity, blood pressure]”, i.e. you’re objecting to a straw man. There is plenty of literature that is not intellectually lazy.

          Thanks for the discussion, I also have to get back to the day job.

        • Confounding has nothing to do with being spurious… In fact, it’s just the opposite. Confounding is a systematic (i.e., not spurious) bias. What we’re really talking about is association versus causation, and I think we should probably take a step back and identify what we mean when we say that something is a cause of some observed effect, especially regarding genetics. Take the most basic phenotype, gender. Is being female a “cause” of pregnancy? I don’t see it as such; I see it as a necessary, but not sufficient, condition. If other genetic events are sufficient to reliably produce an effect (I’m way beyond my comfort zone, don’t know of any offhand… sickle cell, maybe?), then they would clearly be labeled causes. But genetic traits that simply segregate the population into “higher risk” and “lower risk”? I think it’s a stretch to argue that any such traits are necessarily causal, no matter how large the sample that generated them.

        • Ok, I see. You’re using “cause” in the Koch’s postulates sense of the word. I’m using it in the smoking “causes” lung cancer sense of the word. In this sense, yes, being female “causes” pregnancy, smoking “causes” lung cancer, and genetic variants can “cause” disease. You can use another word there if it makes you happy. Maybe “cigarette smoking increases the probability of developing lung cancer, all other things being equal” would be more correct. Similarly, there are genetic variants that increase the probability of lung cancer, all other things being equal. You can call that “causation” or something else; I don’t care. The point is that finding the exact genetic variants that increase the probability of developing lung cancer tells me something potentially interesting about how the world works.

        • Just going with your example. Imagine you were a prehistoric man wondering exactly how this whole “pregnancy” thing works. Being clever, you do a well-designed case-control study, and find something striking–pregnancy happens only in women. Does this help you in understanding how pregnancy works? I’d say yes, absolutely, you’ve discovered something fundamentally true about human biology, though maybe at this point the mechanism isn’t obvious. Geneticists studying disease are like this prehistoric man understanding pregnancy :) We know essentially nothing, so the goal is to identify precise genes that increase or decrease risk, then figure out the biological mechanism later.

  3. One thing that struck me about the rejoinder is the rhetorical sleight of hand in sliding between “based on data” and “empirical”. The idea of empiricism is appealing to an academic, but more often than not we overestimate our ability to make empirical claims. A parameter estimate may be based on data, but if the relationship between the data and the claim is not defensible, then the analysis cannot be characterized as “empirical”. Analogously, the idea of “map-based directions” may be appealing, but relying on a map of Moscow to find my way through New York is potentially harmful.

  4. Are some fields within science better than others for less-than-obvious reasons? We know chemistry is more deterministic than biology, which is less fuzzy than sociology, etc., but are there fields at comparable levels of hardness of science where results are more reliable on average than in other fields at that level?

    Offhand, I’d guess that Big Money has an effect. But I could see it going both ways: there is so much money ready to reward a discovery in field X that discoveries are claimed too often. Or some fields where a lot of money is at stake could be extra careful because huge amounts of money could be wasted.

    So, it would be interesting to try to fit a model to the data.

    • Steve:

      Considering research in psychology, one general pattern is that between-subject studies are noisier than within-subject studies, so you’ll need a much larger sample size to get reliable results if you’re only measuring each person once.
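
      A quick sketch of why, with made-up variance components: when stable person-to-person differences are large relative to the effect of interest, letting each person serve as their own control removes that variance from the comparison.

```python
# Sketch: standard error of a treatment-effect estimate in a between-subject
# design vs. a within-subject (paired) design, with hypothetical variance components.
import numpy as np

rng = np.random.default_rng(3)
n, effect, sd_person, sd_noise = 100, 0.2, 1.0, 0.5

person = rng.normal(0, sd_person, n)                     # stable individual differences
control = person + rng.normal(0, sd_noise, n)            # each person measured untreated...
treated = person + effect + rng.normal(0, sd_noise, n)   # ...and treated

# Between-subject: compare the treated group to a *different* control group.
other_control = rng.normal(0, sd_person, n) + rng.normal(0, sd_noise, n)
se_between = np.sqrt(np.var(treated, ddof=1) / n + np.var(other_control, ddof=1) / n)

# Within-subject: each person serves as their own control.
se_within = np.std(treated - control, ddof=1) / np.sqrt(n)

print(f"between-subject SE ~ {se_between:.2f}, within-subject SE ~ {se_within:.2f}")
```

      With these numbers, the 0.2 effect is comfortably detectable in the paired design but lost in the noise of the between-subject comparison at the same n.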

    • This is an interesting point. Even across sub-fields of psychology (and among those at the “softer” end), there’s a pretty big difference in whether lab studies and field studies replicate one another: industrial/organizational psychology (full disclosure, my training is in IO) seems to replicate pretty well, whereas social psychology seems to have effects flip signs pretty often; see “Revisiting truth or triviality,” posted here.

      Of course, that conclusion is based on meta-analyses that all could have their own problems.

    • Steve: Interesting idea – let’s think about the prior before trying to do the modelling.

      It will be complex and likely have lots of dimensions.

      One is the ease and cost of replication. In math it’s very easy (for those who know that area of math) and very low cost. Think of math as experimental manipulation of symbols, with the claims being that if one takes these symbols and manipulates them exactly like this, one can note this. It’s very easy for others to redo that experimental manipulation and see if “this” replicates (you just need paper). Clinical research is very hard and extremely expensive to replicate. So we worry a lot in clinical research.

      Money (incentives) to get an answer. Likely non-linear, or at least a group of researchers I once worked with found that clinical researchers studying fatal cancers were less careful than those studying non-fatal cancers. They are so desperate to have something that works that they skip over being careful.

      Money (incentives) to avoid making false claims. Paradoxically, in clinical research – almost none. Researchers don’t seem to suffer when their research is discovered to be faulty or sloppy. In math, they are heavy – right, Andrew? Publish a proof that’s false – one that does not replicate when others redo it – and it’s not taken lightly or soon forgotten.

      • Having survived lymphatic cancer in 1997 probably due to being part of a clinical trial for Rituxan (now a multibillion dollar cancer drug), I have to say I feel pretty forgiving toward researchers who are a little over-optimistic about finding drugs for fatal cancers.
