Instead of worrying about multiple hypothesis correction, just fit a hierarchical model.

Pejman Mohammadi writes:

I’m concerned with a problem in multiple hypothesis correction and, despite having read your article [with Jennifer and Masanao] on not being concerned about it, I was hoping I could seek your advice.

Specifically, I’m interested in the multiple hypothesis testing problem in cases where the test is done with a discrete finite distribution, for example, when doing many tests using a binomial distribution. This is an important problem as it appears in more and more places in bioinformatics nowadays, such as differential gene expression testing, allele-specific expression testing, and pathway enrichment analysis.

What seems to be clear is that the current correction methods are too conservative for such tests, and it’s also straightforward to show that such finite test distributions produce fewer false positives than one would expect under the null distribution. My understanding is that there’s no clear way to correct for multiple hypotheses in this type of situation. I was wondering if I could have your advice on the issue.

My response:

Instead of picking one comparison and doing a multiple comparisons correction, I suggest you fit a hierarchical model including all the comparisons; then there will be no need for such corrections.

Mohammadi followed up:

I’m not sure if making a hierarchical model would be a possibility for all the cases, and anyway most of these methods are done in a frequentist way. At the moment I work around it by correcting for unique tests only, but that does not necessarily seem like a good idea.

To which I replied:

“Frequentist” is a word for the way in which inferences are evaluated. It is fine to do a hierarchical model from a frequentist perspective.
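
To make this concrete, here is a minimal sketch of partial pooling for many binomial comparisons, using entirely simulated data and an empirical-Bayes beta-binomial model (an illustration of the general idea, not Mohammadi’s actual pipeline): estimate a common Beta distribution for all the underlying proportions by marginal maximum likelihood, then report the shrunken posterior mean for each unit instead of thousands of separately corrected tests.

import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

rng = np.random.default_rng(1)

# Simulated data: 500 binomial "comparisons", each with its own success
# probability drawn from a common Beta(8, 12) population.
J = 500
n = rng.integers(10, 200, size=J)          # trials per comparison
theta_true = rng.beta(8.0, 12.0, size=J)   # per-comparison probabilities
y = rng.binomial(n, theta_true)            # successes per comparison

def neg_marginal_loglik(log_ab):
    # Beta-binomial marginal likelihood with each theta_j integrated out
    # (binomial coefficients dropped, since they do not involve a or b).
    a, b = np.exp(log_ab)
    return -np.sum(betaln(y + a, n - y + b) - betaln(a, b))

# Empirical-Bayes estimates of the population-level (hyper)parameters.
fit = minimize(neg_marginal_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)

# Partially pooled estimate for every unit: the posterior mean of theta_j
# under the fitted Beta(a_hat, b_hat) population distribution.
theta_raw = y / n
theta_pooled = (y + a_hat) / (n + a_hat + b_hat)

print(f"estimated hyperparameters: a = {a_hat:.1f}, b = {b_hat:.1f}")
print(f"rmse, raw proportions    : {np.sqrt(np.mean((theta_raw - theta_true) ** 2)):.4f}")
print(f"rmse, partially pooled   : {np.sqrt(np.mean((theta_pooled - theta_true) ** 2)):.4f}")

The shrinkage toward the common mean does the work that a multiplicity correction would otherwise do: proportions estimated from few trials get pulled in the most, so extreme-looking comparisons are no longer automatically flagged.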

31 thoughts on “Instead of worrying about multiple hypothesis correction, just fit a hierarchical model.”

  1. A coworker and I have talked about this a number of times since we’ve started getting interested in Bayesian statistics, but I haven’t seen an answer just spelled out. If I run an ANOVA (or an equivalent regression), I have lots of potential comparisons/beta values I could test, and thus need to make a correction. I get that. If I run the hierarchical (or Bayesian) equivalent, there is shrinkage in the betas; I get that. But don’t I still have a lot of comparisons/tests that I could run? Does the hierarchical model really eliminate the need for corrections? Or is there a disconnect I’m missing between the modeling and the decision-making stages?

    • Alex:

      If you only report one comparison then, yes, your inference needs to account for that selection process (or, as we call it in BDA, the inclusion process). In selection-type settings, the inclusion process itself contains information relevant to the parameters of interest. For example, if you report the interaction with the highest level of statistical significance, you should account for this in your statistical analysis; otherwise you’re “doing a Daryl Bem” or an “ovulation-and-clothing analysis” and you’ll get the wrong answer (wrong in the Bayesian sense of not reflecting the posterior distribution given all the data, and wrong in the frequentist sense of routinely reporting confidence in large effects that aren’t there, hence the difficulty of replication).

      But if you condition on all the data, you should be ok. Of course, your model won’t be correct and you’ll have to worry about that, as always, but no selection problems. The key is to perform your analysis conditional on all the data. Or, in your terms, run all the tests and put them in a hierarchical model; don’t select.

    • A related question for Andrew, for clarification on your article:

      I love better modeling and shrinkage, but they don’t seem to tackle the same problem as multiple comparisons, unless I’m really missing something.

      The NCES data (figs 5 vs 7) is a good example where pretty much *every* pairwise comparison has *someone* who cares about it, so it does make sense to report all possible comparisons at once. You write:
      “For the purpose of comparing to the classical approach, we set a .05 cutoff: for each pair of states j, k, we check whether 95% or more of the simulations show aj > ak. If so, that is, if aj > ak for at least 950 of the 1,000 simulations, we can claim with 95% confidence that state j outperforms state k.”
      But those sound like standard (uncorrected) separate comparisons, just under a new model.

      To really be analogous to classical multiple comparisons, shouldn’t you build a simultaneous 95% posterior credible set that covers *all* the differences |aj – ak|, for each pair of the 50 states at once, and see which of those differences are never 0 in your credible set?
      I imagine this would give you more “not significant” differences than what the article reports.

      Have you seen any researchers actually do this?
      (I get that you might not *want* to do this because you don’t *care* about multiple comparisons. But that’s different from saying that a good hierarchical model *fixes* the multiple comparisons problem.)

      Thanks!
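
      For what it’s worth, both calculations contrasted above are easy to write down once you have a matrix of posterior simulations. The sketch below uses fabricated draws rather than the NCES posterior, and the simultaneous set is built one particular way (a common multiplier calibrated on the maximum standardized pairwise deviation); it is meant only to show the mechanics, not to settle the disagreement.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)

# Fabricated posterior draws: S simulations of J state-level parameters a_j.
S, J = 1000, 50
a_center = rng.normal(0.0, 1.0, size=J)
sims = a_center + rng.normal(0.0, 0.5, size=(S, J))

pairs = list(combinations(range(J), 2))
d = np.stack([sims[:, j] - sims[:, k] for j, k in pairs], axis=1)  # shape (S, 1225)

# (1) The rule quoted from the article: flag pair (j, k) when a_j > a_k
#     (or a_k > a_j) in at least 95% of the simulations, pair by pair.
prob = (d > 0).mean(axis=0)
flag_pairwise = (prob >= 0.95) | (prob <= 0.05)

# (2) One way to build a simultaneous 95% credible set for all differences:
#     give every pairwise interval the same multiplier m, chosen so that the
#     maximum standardized deviation across all pairs is below m in 95% of draws.
center = d.mean(axis=0)
scale = d.std(axis=0, ddof=1)
max_dev = np.max(np.abs(d - center) / scale, axis=1)    # one value per draw
m = np.quantile(max_dev, 0.95)
flag_simultaneous = np.abs(center) > m * scale          # zero excluded from every band

print("pairs flagged by the separate 95% rule :", int(flag_pairwise.sum()))
print("pairs flagged by the simultaneous set  :", int(flag_simultaneous.sum()))
print("simultaneous multiplier m              :", round(float(m), 2))

      With draws like these the simultaneous rule typically flags considerably fewer pairs, which is the direction the comment anticipates.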

  2. When you say “multilevel” or “hierarchical” you mean data sets with, for example, kids within schools, counties within states, or people within organizations. Is that correct? If a data set does not have that structure, then it is not possible to do multilevel modeling. Or am I missing something?

    • Anon:

      “Multilevel” can be anything with variation. See my book with Jennifer for many examples and much discussion. For example, suppose you have several students, each of whom is given several test items. You can consider the students and items and responses as a multilevel structure. If you’d like, you can think of responses as being “within students” or “within items,” but I don’t know that this is so helpful.

  3. To do it in a frequentist way wouldn’t we want some information about the frequentist properties of the hierarchical model? Such as the combined size of tests etc. I know that for particular cases, this is available, but for someone working on a particular problem, it seems like they would at least want to run simulations to understand the coverage and size of the resulting tests (if they want to do frequentist inference).

    • Dean:

      For reasons we’ve discussed often on this blog, I don’t think that coverage and size of tests are so interesting. I’d be more interested in type S and type M errors, as Francis Tuerlinckx and I discuss, in the context of hierarchical models, in our 2000 paper. These are frequency calculations too, they’re just calculations that are more interesting to me.
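
      For readers who have not seen those terms, here is a rough simulation of type S (wrong sign) and type M (exaggerated magnitude) error rates for a single noisy comparison filtered on statistical significance; the effect size and standard error are invented for illustration and are not taken from the Gelman and Tuerlinckx paper.

import numpy as np

rng = np.random.default_rng(3)

# A small true effect measured with a lot of noise, reported only when
# "statistically significant" at the conventional threshold.
true_effect, se, n_sims = 0.2, 1.0, 200_000
est = rng.normal(true_effect, se, size=n_sims)
significant = np.abs(est) > 1.96 * se

# Type S error: among significant results, how often does the estimate
# have the wrong sign?
type_s = np.mean(est[significant] < 0)
# Type M error (exaggeration ratio): how much do significant estimates
# overstate the true magnitude on average?
type_m = np.mean(np.abs(est[significant])) / true_effect

print(f"share of results reaching significance: {significant.mean():.3f}")
print(f"type S error rate among those         : {type_s:.3f}")
print(f"type M exaggeration ratio             : {type_m:.1f}")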

  4. “differential gene expression testing…false positives”

    I doubt there is such a thing as a false positive in this case. Instead there are small, medium, and large differences. These may or may not be primarily due to your treatment; it is even possible that much of the difference is due to totally uninteresting reasons (e.g. batch effects). There are many hours of hard work ahead in figuring out how to extract something likely to be useful from the information.

    You are concerned with calculating the probability that two groups of people/animals/cells are from exactly the same hypothetical infinite population, then using this to distinguish “true” vs “false” positive results according to whether that value is less than a customary (arbitrary?) threshold that has been further “corrected” to punish you for collecting too much information. I just don’t get it.

    I really don’t see how you can fit a hierarchical model with most differential gene expression analysis problems. A typical example is autism studies, where RNA samples are obtained in a tissue from autistics and neuro-typicals (non-autistics). (This is done on deceased individuals, not through neurosurgery!) One then wants to examine the 10 to 20 thousand or so genes for differential expression. For most purposes, there is no natural “hierarchy”, or grouping, or selection, because genes do different things in different tissues and it isn’t clear how these groups should be defined.

    Even if there were a natural grouping, and one could subset the set of n genes into k groups (k relatively small), it’s a pretty standard problem in multivariate statistics to test whether m genes from one condition (autistics) are different from another condition (non-autistics) simultaneously. This certainly reduces the number of comparisons, and so any Bonferroni-type correction will not knock out many significantly different k groups (or, in your Bayesian adjustment, the means won’t be moved that much closer to one another), but the central problem is that a priori one does not really know the groups.

    Generally what one does is set some level of significance (corrected or not) or some FDR level and run the selected genes through a network analysis program (which you can think of as a huge prior, as researchers have culled the literature for connections). These programs allow for “gaps” in specification of the network (they have to–networks aren’t well understood) and so they attempt to match the genes provided with a network (or, on an even simpler level, gene ontology matching). The point is that there is no real method of “hierarchicalizing” the data, unlike, say, political science, where an individual might have state level, city level, and census tract level data associated with individual data, and where reducing the number of variables in the estimation makes sense.

    • tl;dr you don’t really need an explicit hierarchical structure to do what Andrew is suggesting.

      Maybe I’m misunderstanding, but I would assume that what you would do in this scenario is treat the genes as exchangeable (‘random effects grouping variables’ if you want to put it in those terms). Let’s say for simplicity that all you want to look at are fold changes between groups (not actual expression levels in each group). You make some assumption about the distribution of fold changes (e.g. that they are log-Normally distributed) and estimate the distribution, and the posterior/conditional mean for each gene. These are shrinkage estimators for the individual fold changes.
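
      A bare-bones version of that recipe, with invented numbers standing in for real RNA-seq summaries: the fold changes are treated as log-normal (so the log fold changes are normal), the between-gene spread is estimated by a simple method of moments rather than full maximum likelihood, and each gene gets its posterior mean.

import numpy as np

rng = np.random.default_rng(11)

# Fake per-gene summaries: an observed log2 fold change and its standard
# error, as if estimated from a handful of replicates per condition.
G = 20_000
true_lfc = rng.normal(0.0, 0.5, size=G)
se = rng.uniform(0.3, 1.5, size=G)
obs_lfc = rng.normal(true_lfc, se)

# Estimate the distribution of fold changes: common mean mu and between-gene
# variance tau^2 (observed variance minus average sampling variance,
# floored at zero).
mu = obs_lfc.mean()
tau2 = max(obs_lfc.var() - np.mean(se ** 2), 0.0)

# Posterior (shrunken) mean for each gene under the normal-normal model:
# noisier genes are pulled harder toward the common mean.
w = tau2 / (tau2 + se ** 2)
shrunk_lfc = mu + w * (obs_lfc - mu)

print(f"estimated tau^2      : {tau2:.3f}")
print(f"rmse, raw estimates  : {np.sqrt(np.mean((obs_lfc - true_lfc) ** 2)):.3f}")
print(f"rmse, shrunken means : {np.sqrt(np.mean((shrunk_lfc - true_lfc) ** 2)):.3f}")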

      • Ben: Yes, but with 20,000 shrunk estimates there’s still an overwhelming need to pick a much smaller (and still possibly noisy) set of “winners”, thus inducing regression-to-the-mean artifacts that the exchangeability assumption does not protect against. Some sort of statistical control of the degree of noise in that set is also desirable – to many – and this is also not given by the exchangeability assumption.

        Assuming exchangeability and/or fitting a hierarchical model is a great place to start, but it alone doesn’t give all the machinery one needs in practice.

      • This is already done; the standard packages for these analyses are usually based on some form of hierarchical model. Shrinkage estimation of gene variances is the norm, and DESeq2 switched to shrinking the fold changes. Frankly I don’t know why they didn’t shrink fold changes from the beginning, given that the typical study design consists of n=2 or n=3 replicates per condition.

        Although the method developers understand this, it’s unfortunate that results are still presented as lists of p-values and p < .05 is all that most users understand.

        If they shrunk the FC estimates across the board and based "significance" on the 95% CI excluding negligible effect sizes, you wouldn't need the adhocery of using p-value cutoffs in conjunction with fold change cutoffs as the latest 'innovation' in the field.

        • “we assume that genes of similar average expression strength have similar dispersion” (from http://dx.doi.org/10.1186/s13059-014-0550-8).

          I agree that this approach makes sense but how does this make DESeq2 a hierarchical model? And even a better estimate of variation does not preclude the fact that one is making multiple comparisons. DESeq2 provides FDR and posteriors but there are still a lot of comparisons–note the authors call for “testing of differential expression with respect to user-defined thresholds of biological significance.” The key is user-defined and using standard multivariate methods I can set a threshold and make a good comparison. It is a comparison coming from the uncertainty properties of the data that is difficult and I don’t see that in any multiple testing of genes.

        • The quote you give is the (an?) explicitly hierarchical part of the model. It amounts to an assumption that dispersions are conditionally exchangeable given the mean normalized gene read count (MNGC). By de Finetti’s exchangeability theorem, that makes dispersions at a specific MNGC independent and identically distributed given a common parameter. In equation (5) of the Materials and Methods section they make this explicit with the assumption (more restrictive than mere conditional exchangeability) that the mixing distribution is log-normal and has the common parameter sigma_d^2 for all MNGC values.

          To see how this is hierarchical, imagine simulating from the model. Given an MNGC and a sigma_d^2, one would first simulate an alpha_i value and then simulate a set of gene counts K_ij. At the higher level of the hierarchy are the random variates indexed by i; at the lower level are the random variates indexed by ij.
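
          A toy version of that simulation, with invented parameter values rather than anything estimated by DESeq2, and with the negative binomial generated through its gamma-Poisson construction so the two levels of the hierarchy are explicit:

import numpy as np

rng = np.random.default_rng(42)

G, n_samples = 1000, 6   # genes and samples (invented sizes)

# Higher level: one dispersion alpha_i per gene, drawn log-normally
# around a mean-dependent trend (all values here are illustrative).
mu_i = rng.lognormal(mean=5.0, sigma=1.5, size=G)   # mean normalized counts
alpha_trend = 4.0 / mu_i + 0.05                     # dispersion trend in the mean
sigma_d = 0.5
alpha_i = alpha_trend * rng.lognormal(mean=0.0, sigma=sigma_d, size=G)

# Lower level: given (mu_i, alpha_i), draw counts K_ij with
# Var(K_ij) = mu_i + alpha_i * mu_i^2, via the gamma-Poisson construction
# of the negative binomial.
shape = 1.0 / alpha_i
lam = rng.gamma(shape[:, None], (mu_i / shape)[:, None], size=(G, n_samples))
K = rng.poisson(lam)

# Sanity check on the intended mean-variance relationship.
model_var = mu_i + alpha_i * mu_i ** 2
print("median ratio of empirical to model variance:",
      round(float(np.median(K.var(axis=1, ddof=1) / model_var)), 2))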

        • numeric: I addressed what Corey put in de Finetti’s exchangeable language in a different language in my thesis (same concepts but less formalism).

          “Certainly, the concept of something being common in observations even when they vary is central in statistics – especially if the observations are taken under apparently identical conditions. The appropriate extension of this concept to something being common in slightly or even largely different situations or contexts is admittedly more challenging. Parameters being common in distribution is an even further extension to treating admittedly different things as though they are exchangeable or common in some probabilistic sense. Here the interest could be primarily on what is common in the probability distribution that generated the lower level probability distributions, or something about a particular lower level distribution.” http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf

          I went through it in extensive gory detail, but the essence is that in coming up with a statistical analysis you do much more than specify a data generating model (distribution): you also (implicitly, or better still pragmatically) discern which components or functions of the parameters _can be_ taken as common (fixed), arbitrary (fixed but differing), or common in distribution (the parameter’s values are _like_ random draws from a distribution – which may itself have common, arbitrary, or common-in-distribution parameters) and still provide a not too wrong representation of how the data _were_ generated.

          The “_can be_ taken as common in distribution and still provide a not too wrong representation” should make it clear that the representation does not have to be literal – so whether something was fixed or random physically is mostly a distraction.

          Why was it/is it in ANOVA that variances were/are assumed common?
          (Because Fisher, Yates, Cochran were clever enough to discern the variances _can be_ taken as common in distribution and still provide a not too wrong representation – Neyman-Scott discovered applications where it was way too wrong of a representation)

        • Oops (typo in above fixed below)

          (Because Fisher, Yates, Cochran were clever enough to discern the variances _can be_ taken as common (fixed) and still provide a not too wrong representation – Neyman-Scott discovered applications where it was way too wrong of a representation. A less wrong representation for some of these is variances _can be_ taken as common in distribution)

        • Look, I agree that exchangeability is essential to statistical analysis. I don’t agree that this makes a model “hierarchical”, since by this criterion all statistical models are hierarchical. To quote Gelman et al., “[H]ierarchical models…are used when information is available on several different levels of observational units” (BDA3, page 5). From Love et al.,

          Var(K_ij) = mu_ij + alpha_i * mu_ij^2

          which tells me there is an individual error component (as well as a systematic component) for each gene (K_ij being the counts for gene i). This then tells me that there will be a multiple comparison problem. The alpha_i is an _assumed_ multilevel parameter (bad terminology–a parameter at a higher level of aggregation), not “information which is available on several different levels of observational units”. There is no information here, only an assumption of the model. To see how this definition can’t really work in practice, assume there is no biological variation, only “shot” noise (bioinformatician speak for Poisson noise–this would occur with two technical replicates in different lanes of a flow cell, for example). Then alpha_i would be zero and the model would no longer be hierarchical. Or, completely differently, assume there are data x1, …, xn drawn from a presumed n(mu, 1) distribution. mu is an aggregate parameter at the level of the entire dataset, but is this model hierarchical? If it is, then every statistical model since Pearson has been.

        • “[H]ierarchical models…are used when information is available on several different levels of observational units” isn’t a definition; it’s advice on when to reach for a hierarchical model. My view is that what makes a model for a set of parameters ‘hierarchical’ is that it neither assumes they are all independent in the prior (or for non-Bayesians, that they are ‘fixed effects’; the key point is no pooling of information) nor that they are all identical (complete pooling). In such models, some… thing, some rule or principle or method, dictates how parameter estimates co-vary.

        • numeric:

          Vocabulary here can be challenging.

          To me

          1. n(mu, 1) distribution has one unknown parameter mu; it is fixed and common for all observations.
          2. n(mu.i, 1) distribution has unknown parameters mu.i; they are fixed and arbitrary for each observation.
          3. n(mu.i*, 1) with mu.i* drawn from n(mu, 1) has one unknown parameter mu, fixed and common for all observations, and a random unobserved parameter mu.i* for each observation. One might be interested in mu, or a particular mu.i*, or all of them.

          Andrew refers to 1 as complete pooling, 2 as no pooling and 3 as partial pooling.

          Some suggest special notation and names for these parameter characterizations depending which are of interest or just nuisance.

          Yes, it is a mess. The purpose is to get better inference in some sense and 3 often does that.
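
          A minimal simulation of exactly these three estimators, with one observation per unit and both variances taken as known and equal to 1 (so the partial-pooling weight is simply one half); the particular values of mu and the number of units are arbitrary:

import numpy as np

rng = np.random.default_rng(2015)

# Setting 3 above: mu.i* drawn from n(mu, 1), one observation y_i ~ n(mu.i*, 1).
J, mu = 200, 3.0
mu_star = rng.normal(mu, 1.0, size=J)
y = rng.normal(mu_star, 1.0)

est_complete = np.full(J, y.mean())      # 1: complete pooling (one common mu)
est_none = y                             # 2: no pooling (each mu.i on its own)
# 3: partial pooling; with both variances known and equal to 1, the posterior
#    mean of each mu.i* sits halfway between its observation and the grand mean.
est_partial = 0.5 * y + 0.5 * y.mean()

for label, est in [("1 complete pooling", est_complete),
                   ("2 no pooling      ", est_none),
                   ("3 partial pooling ", est_partial)]:
    rmse = float(np.sqrt(np.mean((est - mu_star) ** 2)))
    print(label, "rmse for the mu.i*:", round(rmse, 3))

          On repeated runs, 3 reliably beats 2 for recovering the individual mu.i*, and it beats 1 unless the mu.i* really are nearly identical.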

        • Running out of room, but thank you for the comments. Models are typically more powerful the more assumptions are made, but they are also more likely to be wrong. Or maybe restrictions is a better word. Tests of the assumptions are nice (model checking) but don’t seem to have achieved any sort of consensus. I do wonder, though, about Keith’s claim that, given

          1. n(mu, 1) distribution has one unknown parameter mu; it is fixed and common for all observations.
          2. n(mu.i, 1) distribution has unknown parameters mu.i; they are fixed and arbitrary for each observation.
          3. n(mu.i*, 1) with mu.i* drawn from n(mu, 1) has one unknown parameter mu, fixed and common for all observations, and a random unobserved parameter mu.i* for each observation. One might be interested in mu, or a particular mu.i*, or all of them.

          (3) is better. Under standard mle theory, if (1) is the correct model, it is best. Thus (3) is a claim of some sort of robustness under misspecification (and all models are misspecified). Is there an article going through these three types of models in this simple form (or an equivalent form) and demonstrating this?

        • numeric – I’d suggest educating yourself regarding Stein’s paradox. You do not need an underlying relationship between variables for a hierarchical model to give you better estimates. You do not need an underlying causal relationship or “correct” modeling assumptions in the sense of an underlying mechanism in nature.

          “standard mle theory” gives you worse answers (by very practical definitions of “worse”) if you’re estimating three or more parameters simultaneously, which is certainly the case here. Those estimates become a _lot_ worse when power is weak, which is also the case here.

        • Stein’s paradox deals with the admissibility of an estimator when multiple parameters (3 or more, I believe) are estimated. Here there is one parameter, and the mle achieves the Cramér-Rao lower bound (if the underlying model is true). I’m interested in why estimating (3) in Keith’s example is better than mle estimation. As the Wikipedia article on Stein’s example states, “On the other hand, if one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.”

        • numeric:

          In most applications it is obvious that 1 is too wrong so the choice really is between 2 & 3.

          3 is outside usual likelihood theory: http://www.jstor.org/discover/10.2307/2291674

          For the Neyman-Scott problems, with 2 the mle is highly inconsistent, whereas with 3 the _gmle_ is consistent.
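
          For anyone who has not seen the Neyman-Scott example: with two observations per unit and a separate fixed mean for each unit (setting 2), the joint mle of the common variance converges to half the true value no matter how many units are added. A quick simulated check, with arbitrary numbers:

import numpy as np

rng = np.random.default_rng(1948)

# Neyman-Scott setup: many units, two observations each, unit means mu_i
# treated as fixed and arbitrary (setting 2), one common variance sigma^2.
n_units, sigma2 = 100_000, 1.0
mu_i = rng.normal(0.0, 5.0, size=n_units)
y = rng.normal(mu_i[:, None], np.sqrt(sigma2), size=(n_units, 2))

# The joint mle profiles out each mu_i with the pair mean and then divides
# the summed squared deviations by 2 per pair instead of 2 - 1.
pair_mean = y.mean(axis=1, keepdims=True)
sigma2_mle = float(np.mean((y - pair_mean) ** 2))

print("true sigma^2          :", sigma2)
print("mle of sigma^2 (~1/2) :", round(sigma2_mle, 3))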

          Somewhat annoyingly, when I was working on this http://biostatistics.oxfordjournals.org/content/2/4/463.full.pdf the editor asked us to justify the advantages of 3, and my co-author sent back an angry response that this was so well known that having to justify it in 2000 was ridiculous. The editor agreed, and now there is no trace of that being the case in our paper :-( One or more references from Greenland might have been helpful (e.g. the simulation study).

          Probably some of Andrew’s papers and/or blog posts will cover this as well.

        • Keith/numeric:

          Speaking of vocabulary, in applied micro:

          2: a fixed effects model, or first-difference model, is how we would think about that, and we would think about the parameters being identified by using comparisons within an individual `i’, the point estimate being those individual differences aggregated across each `i’.

          3: a random effects model, which would be estimated using MLE or with OLS using “Moulton-type” standard errors. OLS is unbiased here, but doesn’t consistently estimate standard errors (if I am understanding your notation right).

          If the assumptions in 3 hold, it is more efficient to use a random-effects framework and the kind of hierarchical models I think you are suggesting. But we often think that mu.i is correlated with some sort of unobservables (some component of the error term), and then the assumptions behind estimating 3 as a random effects model are violated. But if that correlation is persistent over time, 2 will be consistently estimated in first-differences, fixed effects, or with individual dummy variables.

        • numeric – gene expression data requires estimating ~30K fold changes and 30K dispersions based on 3 or 4 replicates. In what sense is that not multiple parameter estimation?

          The worst sin of omics statistics is the approach of treating large-scale inference as single-parameter inference scaled up with a multiple comparison correction tacked on at the end.

        • jrc: The thread that goes on and on (yup, usually haphazard bias being modeled as random).

          When I was at Duke I came up with an example of Simpson’s paradox arising from 3.

          1 gave a dramatic positive effect, 2 gave negative effects, and 3 gave a slight (pooled) positive effect.

          (Simpson’s paradox usually arises from inappropriate complete pooling but it can happen with partial pooling. But without an actual example graduate students had difficulty understanding this.)

  6. Andrew – I agree and I myself give this advice all the time.

    However, there is a subtlety here which is that unless a study is highly powered for all parameters, there’s often not a clear answer as to how much shrinkage is too little and how much is too much.

    Granted, this isn’t any more subjective than “multiple-comparison-adjusted p-values < some arbitrary number” (which is why I don’t have any qualms about advocating shrinkage over corrections), but it still is a big degree of freedom that leaves the primary conclusions of an analysis undetermined.
