Noisy, heterogeneous data scoured from diverse sources make his meta-analyses stronger.

Kyle MacDonald writes:

I wondered if you’d heard of Purvesh Khatri’s work in computational immunology, profiled in this Q&A with Esther Landhuis at Quanta yesterday.

Elevator pitch is that he believes noisy, heterogeneous data scoured from diverse sources make his meta-analyses stronger. The thing that gave me the willies was this line:

“We start with dirty data,” he says. “If a signal sticks around despite the heterogeneity of the samples, you can bet you’ve actually found something.”

On the one hand, that seems like an almost verbatim restatement of your “what doesn’t kill my statistical significance makes it stronger” fallacy. On the other hand, he seems to use his methods purely to look for things to test empirically, rather than to draw conclusions based on the analysis, which is good, and might mean that the fallacy doesn’t apply. I also like his desire to look for connections that isolated groups might miss:

I realized that heart transplant surgeons, kidney transplant surgeons and lung transplant surgeons don’t really talk to each other!

I’d be interested in hearing your thoughts: worth the noise if he’s finding connections that no one would have thought to test?

My response:

I haven’t read Khatri’s research articles and I know next to nothing about this field of research so I can’t really say. Based on the above-quoted news article, the work looks great.

Regarding your question: On one hand, yes, it seems mistaken to have more confidence in one’s findings because the data were noisier. On the other hand, it’s not clear that by “dirty data,” he means “noisy data.” It seems that he just means “diverse data” from different settings. And there I agree that it should be better to include and model the variation (multilevel modeling!) than to study some narrow scenario. It also looks like good news that he uses training and holdout sets. That’s something we can’t always do in social science but should be possible in genetics where data are so plentiful.
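
A minimal sketch of the kind of multilevel (partial-pooling) analysis I have in mind, fit to simulated data pooled from several heterogeneous sources; the variable names, the simulated data, and the use of statsmodels here are illustrative assumptions, not a description of Khatri’s actual pipeline:

```python
# Sketch: pool diverse data sources and model the between-source variation
# with a random intercept, rather than restricting to one narrow setting.
# All names and numbers below are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate measurements from several heterogeneous "studies": each study
# has its own baseline, but shares a common treatment effect of 0.5.
pieces = []
for s in range(8):
    n = 40
    baseline = rng.normal(0, 1)              # study-level variation
    treatment = rng.integers(0, 2, size=n)   # 0/1 indicator
    outcome = baseline + 0.5 * treatment + rng.normal(0, 1, size=n)
    pieces.append(pd.DataFrame(
        {"study": s, "treatment": treatment, "outcome": outcome}))
df = pd.concat(pieces, ignore_index=True)

# Random intercept per study: the between-study heterogeneity is estimated
# explicitly instead of being averaged away or treated as a nuisance.
model = smf.mixedlm("outcome ~ treatment", df, groups=df["study"])
fit = model.fit()
print(fit.summary())
```

The point of the random intercept is that the variation across settings is part of the model, which is the sense in which pooling diverse sources can beat studying one narrow scenario.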

9 thoughts on “Noisy, heterogeneous data scoured from diverse sources make his meta-analyses stronger”

  1. “It seems that he just means “diverse data” from different settings”

    &

    “That’s something we can’t always do in social science but should be possible in genetics where data are so plentiful.”

    It is about to happen via “Psychological Science Accelerator”: https://psysciacc.wordpress.com/2017/11/08/the-psychological-science-accelerators-first-study/

    Unfortunately I couldn’t find the (pre-registered) proposal, so I am not sure what they will be investigating and, most importantly, how.

    I posted a comment concerning some questions I had/points I wanted to raise, but haven’t gotten a reply yet.

  2. It depends on whether the variation can become well enough understood.

    That variation will usually be a mixture of biases (due to confounding, selection of subjects, misreporting, data entry errors, computing errors, etc.), biological variation (meaningful differences among the populations being sampled), and random/haphazard noise.

    If the studies largely avoided bias, or if the bias can be credibly removed so that the observed variation is driven mostly by biological variation, then one does have a sample of varying effects that can be made sense of. For instance, the weighted average of them would be an estimate of a specific population’s fixed average (which might be of scientific relevance), and that average does not vary (or not as much). On the other hand, the differing effects themselves would be a sample from the distribution of such effects, and that distribution might be of interest; a small numerical sketch of this contrast appears at the end of this comment. These are the issues being worked through in this blog post: http://statmodeling.stat.columbia.edu/2017/11/01/missed-fixed-effects-plural/

    Now in some applications, such as gene expression, there is information (such as a lack of concordance in the correlation structure between markers) that can signal data entry errors, computing errors, and other misalignment noise.

    Also we have a relevant philosophy paper here http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf
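
    Here is the small numerical sketch mentioned above, contrasting a fixed inverse-variance weighted average with a random-effects view in which the effects themselves come from a distribution; the effect estimates and standard errors are invented, and the DerSimonian-Laird moment estimator of the between-study variance is just one illustrative choice:

    ```python
    # Sketch: a fixed (inverse-variance weighted) average of study effects
    # versus modeling the effects as draws from a distribution.
    # Effect estimates and standard errors below are invented.
    import numpy as np

    effects = np.array([0.30, 0.55, 0.10, 0.42, 0.25])  # per-study estimates
    se = np.array([0.10, 0.15, 0.12, 0.20, 0.08])       # their standard errors

    # Fixed-effect weighted average: a single number that does not vary.
    w = 1.0 / se**2
    fixed = np.sum(w * effects) / np.sum(w)

    # DerSimonian-Laird moment estimate of the between-study variance tau^2,
    # i.e. how much the true effects themselves vary across studies.
    k = len(effects)
    Q = np.sum(w * (effects - fixed) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)

    # Random-effects average: the mean of the distribution of effects,
    # acknowledging that the effects differ across studies.
    w_re = 1.0 / (se**2 + tau2)
    random_avg = np.sum(w_re * effects) / np.sum(w_re)

    print(f"fixed-effect average:          {fixed:.3f}")
    print(f"between-study variance tau^2:  {tau2:.3f}")
    print(f"random-effects average:        {random_avg:.3f}")
    ```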

  3. Thanks for posting this, Andrew, and to others for their informative replies! I definitely missed the distinction between “noisy” and “dirty” on first reading.

  4. “It also looks like good news that he uses training and holdout sets. That’s something we can’t always do in social science but should be possible in genetics where data are so plentiful.”

    I’ve recently had a fast-growing interest in trying to understand the distinctions between approaches to predictive modelling and explanatory modelling, i.e. causal analysis. A paper I have found useful is ‘To Explain or to Predict?’ by Shmueli, and also several papers by Sander Greenland (one from the ’80s with first author Jamie Robins, and one from 2015 with last author Neil Pearce). It seems from what I’ve read that methods for prediction modelling, such as using a training and holdout data set, aren’t that useful for causal analysis. Am I completely wrong? If so, I’d greatly appreciate papers/books that talk about these types of methods specifically in the causal analysis context. (A minimal sketch of the train/holdout split itself appears after this thread.)

    • I wouldn’t say that a train/validate/test set methodology isn’t useful for causal modeling. I mean, how would you feel about a statistical model supposedly encoding/capturing some important causal mechanism that doesn’t fit a test set better than a model you selected by throwing darts at the wall?

      The reason the test/train/validate partition is widely adopted by predictive modelers is that it’s a natural fit for the application; if you want to rely on the prediction of outcomes produced by your model for future decisions, it would be good to know how well it predicts data points it has not previously seen. But that’s not just a useful concept for predictive modelers! If you believe you’ve been able to encode an important causal mechanism into your model, then that model should produce better-than-arbitrary results on a test set; how well it does will depend on the complexity of the phenomenon, the quality of data (measurement error, etc.), the number of other confounding causal mechanisms, etc. You just don’t have to judge the model entirely by the fit (and probably don’t want to).

        • Not only do you not judge a model entirely by its fit; all the biases, other problems, and limitations of the training set are also going to be in the test set. That doesn’t really matter if you are going to continue to make predictions on the same source of data (assuming nothing in the data-generating process changes). If your goal is explanation rather than prediction, then what you want is replication in different data sets with different generation processes, data sets that (with social data) are “dirty” in the sense of coming from the real world.

        • Elin:

          > all the biases, other problems and limitations of the training set are also going to be in the test set

          Agree, but maybe it needs to be stressed that the different data sets with different generation processes should not also share the same biases. Differing biases will at least signal a problem. Also, sources of data that are informative about the biases can be key.
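
    As a concrete companion to the exchange above, here is a minimal sketch of the train/holdout workflow itself: fit on one portion of the data, then score predictions on data the model has not seen. The simulated data, the linear model, and the R^2 scoring are placeholder choices, not anyone’s actual method:

    ```python
    # Sketch of a train/holdout split: evaluate predictions on held-out data.
    # Data, model, and metric are placeholders for illustration.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=500)

    # Hold out 25% of the data; the model never sees it during fitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("holdout R^2:", r2_score(y_test, model.predict(X_test)))
    ```

    The caveat raised in the replies still holds: a good holdout score on data from the same source speaks to prediction within that source, not to whether the model has the right causal story.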
