Interaction-based feature selection and classification for high-dimensional biological data

Ilya Esteban writes:

In your blog, your advice for performing regression in the presence of large numbers of correlated features has been to use composite scores and hierarchical modeling. Unfortunately, many problems don’t provide an obvious and unambiguous way of grouping features together (e.g., gene expression data). Are there any techniques you would recommend that automatically pool correlated features together based on the data, without requiring the researcher to manually define composite scores or feature hierarchies?

I don’t know the answer to this but I imagine something is possible . . . any ideas?

In the meantime I’m reminded of this recent article by Shaw-Hwa Lo, Haitian Wang, Tian Zheng, and Inchi Hu:

Recent high-throughput biological studies successfully identified thousands of risk factors associated with common human diseases. Most of these studies used single-variable method and each variable is analyzed individually. The risk factors so identified account for a small portion of disease heritability. Nowadays, there is a growing body of evidence suggesting gene–gene interactions as a possible reason for the missing heritability . . .

To address these challenges, the proposed method extracts different types of information from the data in several stages. In the first stage, we select variables with high potential to form influential variable modules when combining with other variables. In the second stage, we generate highly influential variable modules from variables selected in the first stage so that each variable interacts with others in the same module to produce a strong effect on the response Y. The third stage combines classifiers, each constructed from one module, to form the classification rule. . . .
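
Just to fix ideas, here is a caricature in R of that three-stage shape (screen, then group survivors into modules, then combine one classifier per module). The screening and grouping rules here are made up; this is not the authors’ actual method:

    # Caricature of the three-stage structure (not the authors' actual method).
    set.seed(1)
    n <- 100; p <- 500
    X <- matrix(rnorm(n * p), n, p)
    y <- rbinom(n, 1, plogis(X[, 1] * X[, 2]))       # an interaction drives Y
    # Stage 1: keep variables with high marginal potential (crude screen).
    keep <- order(abs(cor(X, y)), decreasing = TRUE)[1:20]
    # Stage 2: form modules by clustering the survivors on correlation.
    cl <- cutree(hclust(as.dist(1 - abs(cor(X[, keep])))), k = 4)
    modules <- split(keep, cl)
    # Stage 3: one small classifier per module, combined by averaging.
    fits <- lapply(modules, function(j) glm(y ~ X[, j], family = binomial))
    p_hat <- rowMeans(sapply(fits, fitted))          # combined classification rule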

I haven’t tried to follow all the details but it looks cool. These genetics problems are different from the social science and environmental health examples I work on. In genetics there seem to be many true zeros; that is, you really are trying to find a bunch of needles in a haystack. In my problems, nothing is really zero, and we only set things to zero for computational convenience or to make our models more understandable. Hence the appeal of methods such as BART and Gaussian processes. Shaw-Hwa’s paper is interesting in that it directly grapples with the problem of interactions.

11 thoughts on “Interaction-based feature selection and classification for high-dimensional biological data”

  1. While the causal structure that one would _like_ to find in these studies is indeed sparse, determining the causal structure with independent point null hypothesis tests runs into the same problems in gene association and gene expression as it does in other fields.

    A lot of the variation will be correlated with, say, ethnicity, geography, or demography, which tends to bias the entire distribution of correlation statistics. With gene expression data, there are even more issues with systematic/structural bias from one dataset to the next. Most studies attempt to subtract out this “structural bias” in blunt ways so that the associations are at all interpretable, such as subtracting out principal components. People like John Storey and the recently discussed Jeff Leek have developed more model-based approaches to this.
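
    To make the blunt version concrete, here is a toy sketch of PC subtraction (not Storey’s or Leek’s model-based approaches):

      # Crude "subtract out the top principal components" adjustment (toy data).
      set.seed(1)
      n <- 50; p <- 200
      batch <- rep(c(-1, 1), each = n / 2)              # hidden structural factor
      X <- matrix(rnorm(n * p), n, p) + outer(batch, rnorm(p))
      s <- svd(scale(X, scale = FALSE))                 # center columns, then SVD
      U <- s$u[, 1:2, drop = FALSE]                     # top two PCs
      X_adj <- X - U %*% crossprod(U, X)                # project them out before testing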

    To me, a lot of this literature reads like it’s putting band-aids on an epistemological approach that’s fundamentally flawed (that is, performing a large number of independent associations between genes and inherently multifactorial phenotypic outcomes). There are going to be a few cases where you get a clear signal from a few genes, but a lot of the time the insight doesn’t amount to much (especially considering the cost of running these studies).

    As for selecting variables for interactions – see Jun Liu’s work on BEAM.

    • From Wang et al.’s Intro:
      “[LASSO] works well when the number of variables is not very large. To detect gene–gene interaction, however, it must include additional variables defined by products of original variables and thus the number of variables p is exponentially larger than n, the number of observations.”

      I read that and think two things (without having read the paper all the way through):
      1. Their model is inherently nonlinear. Perhaps that’s a complicating factor? How does LASSO work with nonlinear models? Why couldn’t one implement a Gauss-Newton or Levenberg-Marquardt variation of LASSO? Is the system so nonlinear that that wouldn’t work?
      2. I can see p >> n as a practical issue, but I don’t immediately see why it should induce LASSO to fail (see the sketch below).
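
      For what it’s worth, here’s a quick toy version of that setup (assuming glmnet): with p = 50 original variables, adding all pairwise products gives 50 + choose(50, 2) = 1275 columns against n = 100 observations, and the lasso still runs and returns a sparse fit:

        # Pairwise interaction expansion: p blows up, but the lasso still runs.
        library(glmnet)
        set.seed(2)
        n <- 100; p <- 50
        X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
        y <- X[, 1] * X[, 2] + rnorm(n)                 # one true interaction
        Xint <- model.matrix(~ .^2 - 1, data.frame(X))  # 1275 columns >> n
        fit <- cv.glmnet(Xint, y)                       # cross-validated lasso
        b <- as.matrix(coef(fit, s = "lambda.min"))
        rownames(b)[b[, 1] != 0]                        # selected terms, e.g. "x1:x2"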

      • That’s not quite right. Lasso will “work”, in the sense of identifying the correct non-zero predictors, when the non-zero predictors aren’t too correlated with the ones that should be zero in the sparse model (i.e., with the complement of the true support). This includes the case where p >> n. If I recall correctly, p can grow up to exponentially in n, but that’s just an asymptotic analysis; your real data don’t “grow”, you take them as they are :)

        So lasso won’t fail in the sense of numerical failure; it just won’t recover 100% of the true non-zero predictors, and there will be false positives and false negatives. But it’s still likely to do better than a non-sparse method like ridge regression.
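
        (If I’m remembering it right, the “not too correlated” requirement is Zhao and Yu’s irrepresentable condition: writing $S$ for the true support and $S^c$ for its complement, sign-consistent selection essentially needs $\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1}\operatorname{sign}(\beta_S)\|_\infty \le 1-\eta$ for some $\eta > 0$; that is, the irrelevant columns can’t be too well explained by the relevant ones.)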

        Conceptually, lasso can be extended to any GLM, so you can have any nonlinear effects that a GLM can have. The question of implementation is another issue; glmnet in R has quite a few GLMs already.
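
        For instance, a lasso-penalized logistic regression is one call (toy data):

          # Lasso path for a binomial GLM via glmnet.
          library(glmnet)
          set.seed(3)
          x <- matrix(rnorm(100 * 20), 100, 20)
          y <- rbinom(100, 1, plogis(x[, 1] - x[, 2]))
          fit <- cv.glmnet(x, y, family = "binomial")   # cross-validated lasso-logistic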

        • I completely subscribe to the above. In fact, some tweaking of the lasso may work well on this problem. Robert Tibshirani and some of his students have been working on such methods recently. There’s also recent work by Fan and Ke on the topic. But I guess these authors wanted their own method.

  2. I have worked in the physical biochemistry and mathematics of gene expression, in predictive toxicology, and now in the mathematics and statistics of brain function. All of these systems are nonlinear, and if you linearize them, you eliminate the biology. In the brain system, the events appear to be intrinsically sparse in a nominally high-dimensional space (the dimension and even the metric of this space are difficult to define clearly; the biology, however, works fine). In the best case so far, the biology is bounded by a hypercone and has genetically selected semi-deterministic behavior inside this cone. If someone can help me get the statistics and mathematics on a better footing, please contact me.

    • Somewhat along those lines, lasso got me thinking about undergrad chemical kinetics problems. Imagine a reaction nA + mB -> C (ignore the back reaction). The reaction rate is r = k * [A]^x * [B]^y, where [A] and [B] are the concentrations of A and B, respectively. Your task is to determine x, y, and k given concentration data. Traditionally – at least how I remember having done it in the mid-80s – you determine the parameter values from plots of ln(r) vs. ln[A] and ln[B], reading the values off the slopes and intercepts.

      Although I don’t imagine it would be particularly efficient in practice, in principle I suppose you could set the parameter determination problem up as a lasso problem, i.e., r = c00 + c10*[A] + c01*[B] + c11*[A]*[B] + c20*[A]^2 + c02*[B]^2 + c12*[A]*[B]^2 + … Just write out the equation with enough terms to cover all realistic possibilities – and probably some unrealistic ones too – and let lasso determine which cij values are zero. I’m not saying it would be a good idea to solve the problem this way, but it does seem like a plausible – if inefficient – approach to finding the solution. Thoughts?
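
      Here’s a toy version of what I mean, with true rate law r = 0.7*[A]^2*[B] and multiplicative noise so the log–log route stays clean:

        # Let the lasso pick rate-law terms out of a polynomial basis (toy data).
        library(glmnet)
        set.seed(4)
        n <- 200
        A <- runif(n, 0.1, 2); B <- runif(n, 0.1, 2)
        r <- 0.7 * A^2 * B * exp(rnorm(n, sd = 0.05))   # k = 0.7, x = 2, y = 1
        basis <- cbind(A, B, AB = A * B, A2 = A^2, B2 = B^2,
                       A2B = A^2 * B, AB2 = A * B^2, A2B2 = A^2 * B^2)
        fit <- cv.glmnet(basis, r)
        b <- as.matrix(coef(fit, s = "lambda.min"))
        b[b[, 1] != 0, , drop = FALSE]   # ideally mostly A2B; collinearity may let others in
        # The classical route is just a regression on logs:
        coef(lm(log(r) ~ log(A) + log(B)))              # intercept = ln(k), slopes = x and y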

  3. Our proposal for how to do feasible Bayesian inference for modeling interactions in genome-wide data can be found in the following articles:

    Bayesian Variable Selection in Searching for Additive and Dominant Effects in Genome-Wide Data
    http://dx.doi.org/10.1371/journal.pone.0029115

    Finite Adaptation and Multistep Moves in the Metropolis-Hastings Algorithm for Variable Selection in Genome-Wide Association Analysis
    http://dx.doi.org/10.1371/journal.pone.0049445
