Gilbert Chin writes:
After reading this piece [“How one 19-year-old Illinois man Is distorting national polling averages,” by Nate Cohn] and this Nature news story [“Seeing deadly mutations in a new light,” by Erika Hayden], I wonder if you might consider blogging about how this appears to be the same issue in two different disciplines.
I said, sure, I’d blog on this. Actually I wrote about the Cohn article right after it appeared. Then I finally got around to reading the Hayden article, which is about an effort in epidemiology to set up and analyze a huge database on diseases and genetic mutations. The traditional approach is to look at one disease at a time and try to find genes associated with that disease, but by putting all the information together, more can be learned. This makes sense to me. The key idea is that the “cases” for one disease can be considered as the “controls” for lots of others, so all of a sudden you’re using your data a lot more efficiently.
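To make the cases-as-controls idea concrete, here is a minimal sketch in Python. The records, the disease labels, and the helper `odds_ratio_vs_rest` are all invented for illustration; nothing here comes from the Hayden article or the actual database. It also assumes the variant is unrelated to the other diseases, which need not hold in practice.

```python
from collections import Counter

# Hypothetical toy records: (disease, carries_variant). Invented for illustration.
records = (
    [("A", True)] * 30 + [("A", False)] * 70
    + [("B", True)] * 10 + [("B", False)] * 90
    + [("C", True)] * 12 + [("C", False)] * 88
)

def odds_ratio_vs_rest(records, disease):
    """Odds ratio for the variant, using every other disease's cases as controls."""
    counts = Counter((label == disease, carrier) for label, carrier in records)
    # Add 0.5 to each cell (Haldane-Anscombe correction) so an empty cell
    # cannot cause a division by zero.
    case_with = counts[(True, True)] + 0.5
    case_without = counts[(True, False)] + 0.5
    control_with = counts[(False, True)] + 0.5
    control_without = counts[(False, False)] + 0.5
    return (case_with * control_without) / (case_without * control_with)

for name in ["A", "B", "C"]:
    print(name, round(odds_ratio_vs_rest(records, name), 2))
```

Every record gets used once per disease, as a case for its own disease and as a control for all the others, which is where the gain in efficiency comes from.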
I think multilevel modeling is the way to analyze such data. When you have just one disease, there’s the whole challenge of choosing a prior distribution. But with thousands of diseases, you have the internal replication that lets you estimate that distribution from the data, which makes the problem much more straightforward to address.
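As a rough illustration of why replication across many diseases helps, here is a small simulation in Python. It uses an empirical-Bayes shortcut (a method-of-moments estimate of the between-disease variance) rather than a full multilevel model, and every number in it is simulated; the variable names and settings are assumptions made for the sketch.

```python
import random
import statistics

random.seed(0)

# Simulated setup (all numbers invented): J diseases, each with a true effect
# drawn from a common distribution, plus a noisy estimate of that effect.
J = 2000
true_mu, true_tau, se = 0.0, 0.5, 1.0
theta = [random.gauss(true_mu, true_tau) for _ in range(J)]
y = [random.gauss(t, se) for t in theta]

# Estimate the population distribution from the noisy estimates themselves.
mu_hat = statistics.fmean(y)
tau2_hat = max(statistics.variance(y) - se**2, 0.0)  # method of moments

# Partial pooling: pull each raw estimate toward the overall mean.
shrink = tau2_hat / (tau2_hat + se**2)
pooled = [mu_hat + shrink * (yj - mu_hat) for yj in y]

def mse(estimates):
    """Mean squared error against the simulated true effects."""
    return statistics.fmean((e - t) ** 2 for e, t in zip(estimates, theta))

print("raw MSE:   ", round(mse(y), 3))
print("pooled MSE:", round(mse(pooled), 3))
```

With a single disease there would be nothing to estimate `tau2_hat` from; with thousands, the spread of the raw estimates tells you how much to shrink each one.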
As is often the case, though, I’m just talk here, having never done such an analysis myself on this sort of problem. I’d love to collaborate with an expert sometime and see how it goes.
Finally, I don’t really see the connection to that earlier polling story—except that multilevel regression and poststratification should be useful in both cases.
I am wrangling with a similar dataset at the moment. Instead of diseases we have whole bacterial genomes with all their genes. I figured some sort of multilevel modeling would be the way to go.
That description of ExAC immediately makes me shiver because there is a *lot* of trait-level pleiotropy & genetic correlation in the human genome (incomplete list: https://en.wikipedia.org/wiki/Genetic_correlation#Human_correlations ). Maybe it’s OK if you’re looking only at individual mutations, since you can probably assume that a single mutation is going to lead to only a few diseases so all the other disease-cases serve as valid controls. (What would this be, some sort of instrumental variable analysis where you have a lot of weak instruments which make up for the biased ones?)
Will Stein’s paradox work here?
From the Hayden 2016 article:
And by the time this type of info makes it to the patient:
That’s scary stuff. Also, this seems wrong:
Where is this number coming from?
This sounds suspiciously close to the usual estimate of germline mutations per (human) generation:
https://en.wikipedia.org/wiki/Mutation_rate#Consequences
The estimated rate of “harmful” mutations should be somewhat less than that. I guess it is supposed to be about one order of magnitude less:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5436103/
Also, I am glad to see this research on somatic mutation rates is becoming more common. It will soon be time for people to finally accept that you do not have a single genome. It is extremely unlikely any cell in your body has the same sequence as more than a few others, and by the time of adulthood you probably contain at least one cell with any possible basepair mutated.
The significant strength of ExAC is the sample size (and exome coverage) – half of the variants they report were only observed once in 61000 individuals. The genomic PCA using all variant sites allows adjustment for some sampling problems. The “54” comes from the paper:
“The average ExAC participant harbours ~54 variants reported as disease-causing in two widely used databases of disease-causing variants. Most (~41) of these are high-quality genotypes but with implausibly high (>1%) popmax allele frequencies. We therefore hypothesized that most of the supposed burden of Mendelian disease alleles per person is due not to genotyping error, but rather to misclassification in the literature and/or in databases.”
A number of common (>1%) protein-changing variants are in fact disease-associated, e.g., red-hair-associated MC1R alleles and skin cancer risk.