Xihong Lin on sparsity and density

I pointed Xihong Lin to this post from last month regarding Hastie and Tibshirani’s “bet on sparsity principle.” I argued that, in the worlds in which I work, in social and environmental science, every contrast is meaningful, even if not all of them can be distinguished from noise given a particular dataset. That is, I claim that effects are dense but data can be sparse—and any apparent sparsity of effects is typically just an artifact of sparsity of data.

But things might be different in other fields. Xihong had an interesting perspective in the application areas where she works:

Sparsity and density both appear in genetic studies too. For example, ethnicity has effects across millions of genetic variants across the genome (dense). Disease associated genetic variants are sparse.

4 thoughts on “Xihong Lin on sparsity and density”

  1. The quotation from Xihong might be a little too short to judge properly, but I would guess that effects really are dense in genetics too — i.e., every genetic variant will have *some* non-zero effect on *every* phenotype, albeit very small most of the time …

  2. In these kinds of discussions, there are often two definitions of sparsity: exact and approximate. Exact would be “there are exactly n nonzero coefficients” or, more generally, “there is an exactly n-dimensional basis”. This is probably true in very few cases. Approximate would be “there are few large coefficients and many small ones”, which I find much more plausible. I’ve seen this framing more from the compressive sensing side, such as this talk from Baraniuk (http://astrostatistics.psu.edu/su11scma5/lectures/Baraniuk_scmav.pdf, slides 9 & 10), than from the statistics literature.

    • Yes, that seems to be a much more sensible way to frame the question — “what’s the spectrum of effects and how can we most effectively take advantage of its structure?” rather than “is anything exactly zero?”

      • Agreed. In high-throughput genetics, at least for “complex” phenotypes, it’s plausible that there are no true zero associations – though summarizing what’s there by a sparse representation will very often be better than a more traditional estimate. Andrew has often discussed similar issues on this blog.

        Another issue that complicates Xihong’s comment is distinguishing between associations (in the strict statistical sense) and associations that we can reasonably infer reflect causal variants. Associations we can find are not that rare; causal variants we have a hope of detecting are much rarer – and more valuable.

        I should add that, like Ben, I’m sure Xihong is well aware of these distinctions; the quote is just a bit too short to convey them.
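
The exact-versus-approximate distinction raised in the thread can be made concrete with a small sketch. Everything here is an illustrative assumption: the three coefficient vectors are made up (they are not genetic data), and the function names `count_nonzero` and `top_k_energy_fraction` are hypothetical helpers, not part of any library. The idea is to compare how many coefficients are nonzero with how much of the total sum of squares the k largest ones capture.

```python
def count_nonzero(coefs, tol=0.0):
    """Number of coefficients whose magnitude exceeds tol."""
    return sum(1 for c in coefs if abs(c) > tol)


def top_k_energy_fraction(coefs, k):
    """Share of the total sum of squares held by the k largest-magnitude coefficients."""
    squares = sorted((c * c for c in coefs), reverse=True)
    total = sum(squares)
    return sum(squares[:k]) / total if total > 0 else 0.0


n = 1000  # number of coefficients (arbitrary, for illustration)

# Exact sparsity: exactly 10 nonzero coefficients, the rest identically zero.
exact_sparse = [1.0] * 10 + [0.0] * (n - 10)

# Approximate sparsity: every coefficient is nonzero, but magnitudes
# decay quickly (an assumed power law), so a few large ones dominate.
approx_sparse = [(i + 1) ** -1.5 for i in range(n)]

# Dense with no structure to exploit: all coefficients equally large.
flat_dense = [1.0] * n

for name, coefs in [("exact", exact_sparse),
                    ("approximate", approx_sparse),
                    ("flat dense", flat_dense)]:
    print(name, count_nonzero(coefs), round(top_k_energy_fraction(coefs, 10), 4))
```

Under these assumptions, the approximately sparse vector has no zero coefficients at all, yet its ten largest entries carry nearly all of the sum of squares, which is the sense in which a sparse summary can work well even when every effect is nonzero. The flat vector shows the contrasting case, where keeping ten terms captures only a small share.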
