The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

Drew Conway pointed me to this article by Chris Anderson talking about the changes in statistics and, by implication, in science, resulting from the ability of Google and others to sift through zillions of bits of information. Anderson writes, “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”

Conway is skeptical, pointing out that in some areas—for example, the study of terrorism—these databases don’t exist. I have a few more thoughts:

1. Anderson has a point—there is definitely a tradeoff between modeling and data. Statistical modeling is what you do to fill in the spaces between data, and as data become denser, modeling becomes less important.

2. That said, if you look at the end result of an analysis, it is often a simple comparison of the “treatment A is more effective than treatment B” variety. In that case, no matter how large your sample size, you’ll still have to worry about issues of balance between treatment groups, generalizability, and all the other reasons why people say things like, “correlation is not causation” and “the future is different from the past.”

3. Faster computing gives the potential for more modeling along with more data processing. Consider the story of “no pooling” and “complete pooling,” leading to “partial pooling” and multilevel modeling. Ideally our algorithms should become better at balancing different sources of information. I suspect this will always be needed.
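The no-pooling / complete-pooling / partial-pooling idea can be sketched in a few lines. Here is a minimal simulation (a hypothetical illustration with made-up group sizes and known variance parameters, not anyone's actual analysis): partial pooling shrinks each group's mean toward the overall mean, with the amount of shrinkage set by how noisy that group's estimate is relative to the between-group variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate J groups whose true means are drawn from a population distribution.
J = 8
mu, tau = 0.0, 1.0          # population mean and between-group sd (assumed known)
sigma = 2.0                 # within-group sd (assumed known)
n = rng.integers(5, 50, J)  # unequal group sizes
theta = rng.normal(mu, tau, J)  # true group means
ybar = np.array([rng.normal(t, sigma / np.sqrt(m)) for t, m in zip(theta, n)])

# No pooling: each group's own sample mean.
no_pool = ybar

# Complete pooling: a single precision-weighted grand mean for every group.
w = n / sigma**2
complete_pool = np.full(J, np.sum(w * ybar) / np.sum(w))

# Partial pooling (simplified, with the grand mean as the pooled center):
# shrinkage factor is the group's precision relative to total precision.
se2 = sigma**2 / n
shrink = (1 / se2) / (1 / se2 + 1 / tau**2)
partial_pool = shrink * ybar + (1 - shrink) * complete_pool

# Large groups (small standard error) keep most of their own mean;
# small groups are pulled strongly toward the overall mean.
```

Each partially pooled estimate lands between the group's raw mean and the grand mean, which is the sense in which the algorithm is "balancing different sources of information."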

13 thoughts on “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”

  1. People who do science will continue to be interested in theory, regardless of how much data is available. In fact, his primary example, that of gene sequencing, is only possible because the theory explaining relationships between genomes already exists. If Darwin hadn't come along to explain the nested hierarchy of species, someone would certainly be trying to use gene sequences to create such a theory. If you are just interested in selling products, correlation is great; models and theories might be unnecessary. If you are doing science, they remain the ultimate goal.

  2. Actually, the most important role of models in the infinite loop "data, model, new data, …" is not to describe old data but to show the way to obtaining new data. Models and theories point the direction of new search and research; data themselves do not provide any insight into the next step.
    So the model is the driving force of (hard) science, in the sense of obtaining quality data.

  3. Anderson forgets that "statistical tools" are models. What has changed is that with rich enough data, we can infer rich models. How do our tools select the right model? The scientific method, encoded into our algorithms.

    More data does not mean we can dispense with models. We do not seek to capture the noisy world. We seek to capture the patterns and structure in the world.

    As Kevin Korb and others have argued, machine learning is experimental philosophy of science.

  4. It's obvious that data drives out analysis. If you have an actual measurement, you don't have to make a prediction.

    That said, unbiased data is hard to come by. And the measurement of biases, and adjustment for biases, is not easy.

  5. This reminds me of Breiman 2001 ("Statistical Modeling: The Two Cultures") and Peters 1991 ("A Critique for Ecology").

  6. Ben,

    Breiman's paper is interesting, but he fell within the grand tradition of anthropologists finding differences between cultures via the simple expedient of not learning about the other culture; see here.

  8. "Anderson has a point—there is definitely a tradeoff between modeling and data. Statistical modeling is what you do to fill in the spaces between data, and as data become denser, modeling becomes less important."

    Somebody once said that in statistical analysis, one samples from a population to make inferences about a population, but usually the two populations are not the same. We've just seen an example of that with the subprime/[insert many finance acronyms here] meltdown.

  8. I gotta say it – this is 'Wired', which is deeply into the cheap-contrarianism school of pseudo-journalism. That's not to say they're never right, but they don't come with much credibility. They make the NYT science writers look good.

  9. I'm not quite sure what Chris means by a statistical model. His point that in some cases causal analysis is not needed is well taken. But does he realize that regression is based on correlations, and that most data mining techniques are also models?

    Take the old example of neural networks. NNs can produce extremely complex models using lots of data, and these models can indeed be very useful in fraud detection and other applications, but generally they are useless for advancing our understanding of why something happens (the so-called black box problem). With more data, these models become more and more complex to describe, and arguably produce less and less learning.

  10. Also want to react to Barry's statement at 4:01.

    If statistics is used for explanation rather than prediction, there is no problem of population drift.

    As for the subprime meltdown, if he is alleging that credit prediction models went awry, I haven't seen evidence to suggest that was the case. Models themselves do not make decisions; people decide where to place the credit cutoffs. Models cannot prevent banks from giving out loans without applications and other mischief.

    Now, models that try to predict individual house prices would have had much difficulty in the last few years. But I can't see how dispensing with models would lead to better forecasts either.

Comments are closed.