The Night Riders

Gilbert Chin writes:

After reading this piece [“How One 19-Year-Old Illinois Man Is Distorting National Polling Averages,” by Nate Cohn] and this Nature news story [“Seeing deadly mutations in a new light,” by Erika Hayden], I wonder if you might consider blogging about how this appears to be the same issue in two different disciplines.

I said, sure, I’d blog on this. Actually I wrote about the Cohn article right after it appeared. Then I finally got around to reading the Hayden article, which is about an effort in epidemiology to set up and analyze a huge database on diseases and genetic mutations. The traditional approach is to look at one disease at a time and try to find genes associated with that disease, but by putting all the information together, more can be learned. This makes sense to me. The key idea is that the “cases” for one disease can be considered as the “controls” for lots of others, so all of a sudden you’re using your data a lot more efficiently.
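The cases-as-controls idea can be made concrete with a toy sketch. Everything here is hypothetical (the records, the disease names, and the `case_control_counts` helper are made up for illustration; real resources like ExAC are vastly larger):

```python
# Toy illustration of "cases for one disease are controls for the others."
# Each record: (person id, disease diagnosed with, carries variant V?)
records = [
    ("p1", "diseaseA", True),
    ("p2", "diseaseA", False),
    ("p3", "diseaseB", True),
    ("p4", "diseaseC", False),
    ("p5", "diseaseC", False),
]

def case_control_counts(records, disease):
    """For one disease, its patients are the cases and everyone else
    in the pooled database serves as the control group."""
    cases = [r for r in records if r[1] == disease]
    controls = [r for r in records if r[1] != disease]
    n_carriers = lambda group: sum(r[2] for r in group)
    return {
        "case_carriers": n_carriers(cases), "n_cases": len(cases),
        "control_carriers": n_carriers(controls), "n_controls": len(controls),
    }

print(case_control_counts(records, "diseaseA"))
```

The same five records get reused as controls for every disease in the database, which is the efficiency gain: no separate control cohort has to be recruited per disease.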

I think multilevel modeling is the way to analyze such data. When you have just one disease, there’s the whole challenge of choosing a prior distribution. But with thousands of diseases, you have internal replication: the prior can in effect be estimated from the data themselves, which makes the problem much more tractable.
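A minimal sketch of what partial pooling across diseases buys you, using empirical-Bayes-style shrinkage (the counts, the `partial_pool` function, and the fixed prior strength are all invented for illustration; a real multilevel analysis would fit the prior strength from the data rather than assume it):

```python
# Empirical-Bayes-style shrinkage of per-disease variant-carrier rates
# toward the pooled rate. All numbers here are made up for illustration.

def partial_pool(counts, totals, prior_strength=50.0):
    """Shrink each disease's raw rate counts[j]/totals[j] toward the
    grand mean, with prior_strength acting as a pseudo-sample size."""
    pooled = sum(counts) / sum(totals)  # grand mean plays the prior mean
    return [
        (x + prior_strength * pooled) / (n + prior_strength)
        for x, n in zip(counts, totals)
    ]

# Three diseases: tiny sample, large sample, tiny sample with zero carriers.
counts = [2, 40, 0]
totals = [10, 1000, 5]
print(partial_pool(counts, totals))
```

The disease with 10 cases gets pulled strongly from its noisy raw rate of 0.2 toward the pooled rate, while the disease with 1000 cases barely moves; that is the internal replication doing the work that a hand-chosen prior would otherwise have to do.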

As is often the case, though, I’m all talk here, never having done such an analysis on this sort of problem myself. I’d love to collaborate with an expert sometime and see how it goes.

Finally, I don’t really see the connection to that earlier polling story—except that multilevel regression and poststratification should be useful in both cases.


  1. awinter says:

    I am wrangling with a similar dataset at the moment. Instead of disease we have whole bacteria genomes with all their genes. I figured some sort of multilevel modeling would be the way to go.

  2. gwern says:

    That description of ExAC immediately makes me shiver because there is a *lot* of trait-level pleiotropy & genetic correlation in the human genome (incomplete list: ). Maybe it’s OK if you’re looking only at individual mutations, since you can probably assume that a single mutation is going to lead to only a few diseases, so all the other disease cases serve as valid controls. (What would this be, some sort of instrumental variable analysis where you have a lot of weak instruments which make up for the biased ones?)

  3. Rudy Banerjee says:

    Will Stein’s paradox work here?

  4. Anoneuoid says:

    From the Hayden 2016 article:

    D178N, for example, was strongly suspected of causing prion disease because it had been seen in several people with the condition and seldom elsewhere…no one really had the power to see just how rare it was.

    And by the time this type of info makes it to the patient:

    It was a veritable death sentence: the average age of onset is 50, and the disease progresses quickly.

    That’s scary stuff. Also, this seems wrong:

    Lurking in the genes of the average person are about 54 mutations that look as if they should sicken or even kill their bearer.

    Where is this number coming from?

    This sounds suspiciously close to the usual estimate of germline mutations per (human) generation:

    a human genome accumulates around 64 new mutations per generation

    The estimated rate of “harmful” mutations should be somewhat less than that. I guess it is supposed to be about one order of magnitude less:

    If, as has been suggested, each human baby has six new deleterious point mutations [4], then each human somatic cell could have dozens, even hundreds, of deleterious mutations, and mice would have even more.

    Also, I am glad to see this research on somatic mutation rates is becoming more common. It will soon be time for people to finally accept that you do not have a single genome. It is extremely unlikely that any cell in your body has the same sequence as more than a few others, and by adulthood you probably contain at least one cell carrying a mutation at any given base pair.

  5. David Duffy says:

    The significant strength of ExAC is the sample size (and exome coverage): half of the variants they report were only observed once in 61,000 individuals. The genomic PCA using all variant sites allows adjustment for some sampling problems. The “54” comes from the paper:

    “The average ExAC participant harbours ~54 variants reported as disease-causing in two widely used databases of disease-causing variants. Most (~41) of these are high-quality genotypes but with implausibly high (>1%) popmax allele frequencies. We therefore hypothesized that most of the supposed burden of Mendelian disease alleles per person is due not to genotyping error, but rather to misclassification in the literature and/or in databases.”

    A number of common (>1%) protein-changing variants are in fact disease-associated, e.g. the red-hair-associated MC1R alleles and skin cancer risk.
