Drew Conway pointed me to this article by Chris Anderson talking about the changes in statistics and, by implication, in science, resulting from the ability of Google and others to sift through zillions of bits of information. Anderson writes, “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”
Conway is skeptical, pointing out that in some areas—for example, the study of terrorism—these databases don’t exist. I have a few more thoughts:
1. Anderson has a point: there is definitely a tradeoff between modeling and data. Statistical modeling is what you do to fill in the spaces between data, and as data become denser, modeling becomes less important. (A small simulation of this tradeoff is sketched below, after the list.)
2. That said, if you look at the end result of an analysis, it is often a simple comparison of the “treatment A is more effective than treatment B” variety. In that case, no matter how large your sample size, you’ll still have to worry about issues of balance between treatment groups, generalizability, and all the other reasons why people say things like, “correlation is not causation” and “the future is different from the past.” (The second sketch below shows how a confounded comparison stays biased no matter how big the sample gets.)
3. Faster computing gives the potential for more modeling along with more data processing. Consider the story of “no pooling” and “complete pooling,” leading to “partial pooling” and multilevel modeling. Ideally our algorithms should become better at balancing different sources of information, and I suspect that sort of balancing will always be needed. (The third sketch below gives a toy version of the pooling comparison.)
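To make point 1 concrete, here is a minimal simulation sketch; the true curve, the noise level, the window width, and the sample sizes are all invented for illustration. With sparse data, a fitted straight-line model interpolates at the point of interest better than simply averaging the few nearby observations; once the data are dense, the nearly model-free local average catches up and the model's misspecification bias starts to dominate.

```python
# Sketch of the modeling-vs-data tradeoff in point 1 (toy example,
# all specifics invented): estimate E[y | x = 0.5] either with a
# fitted straight line (a "model" that borrows strength from all the
# data) or by averaging the observations that happen to fall near 0.5.
import numpy as np

rng = np.random.default_rng(0)
truth = np.exp                  # the unknown curve generating the data
target = truth(0.5)             # quantity of interest: E[y | x = 0.5]

def mse_at_half(n, reps=2000):
    model_err, local_err = [], []
    for _ in range(reps):
        x = rng.uniform(0, 1, n)
        y = truth(x) + rng.normal(0, 0.5, n)
        # "model": straight-line fit, evaluated at x = 0.5
        slope, intercept = np.polyfit(x, y, deg=1)
        model_err.append((slope * 0.5 + intercept - target) ** 2)
        # "no model": average the observations within 0.05 of x = 0.5
        near = np.abs(x - 0.5) < 0.05
        if near.any():
            local_err.append((y[near].mean() - target) ** 2)
    return np.mean(model_err), np.mean(local_err)

for n in (20, 20000):
    m, l = mse_at_half(n)
    print(f"n = {n:>5}: model MSE = {m:.4f}, local-average MSE = {l:.4f}")
```

In runs like this the model should win at the small sample size and the local average at the large one, which is the tradeoff point 1 is describing.

For point 2, here is a sketch of why a huge sample does not fix imbalance between treatment groups. Everything in it is made up: a binary “severity” confounder drives both treatment assignment and outcomes, and the true treatment effect is zero. The raw A-versus-B comparison stays biased at a million observations just as it is at a thousand, while comparing within strata of the confounder recovers the null, assuming, of course, that the confounder was measured at all.

```python
# Sketch of point 2 (invented setup): with confounded treatment
# assignment, the naive treated-vs-control difference stays biased no
# matter how large n gets; stratifying on the confounder fixes it.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.0   # the treatment truly does nothing

for n in (1_000, 1_000_000):
    severity = rng.binomial(1, 0.5, n)              # confounder: sick vs not
    # sicker patients are much more likely to receive the treatment
    treated = rng.binomial(1, np.where(severity == 1, 0.8, 0.2))
    outcome = (true_effect * treated
               - 1.0 * severity                      # sickness lowers the outcome
               + rng.normal(0, 1, n))

    naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

    # adjust by comparing within severity strata, then averaging the strata
    adjusted = np.mean([
        outcome[(treated == 1) & (severity == s)].mean()
        - outcome[(treated == 0) & (severity == s)].mean()
        for s in (0, 1)
    ])
    print(f"n = {n:>9,}: naive diff = {naive:+.3f}, adjusted diff = {adjusted:+.3f}")
```

And for point 3, a toy version of the pooling story. “No pooling” uses each group’s own mean, “complete pooling” uses the overall mean, and “partial pooling” shrinks each group mean toward the overall mean by a weight based on a crude method-of-moments estimate of the between-group variance. A real multilevel model (fit with lme4 or Stan, say) would estimate this more carefully, and all the group sizes and variances below are invented.

```python
# Sketch of point 3 (toy numbers): no pooling, complete pooling, and
# partial pooling for estimating group means, compared by mean squared
# error against the true group means over repeated simulated datasets.
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_per_group, reps = 8, 5, 2000
errors = {"no pooling": [], "complete pooling": [], "partial pooling": []}

for _ in range(reps):
    true_means = rng.normal(0, 1, n_groups)                  # varying group effects
    data = true_means[:, None] + rng.normal(0, 2, (n_groups, n_per_group))

    group_means = data.mean(axis=1)          # no pooling: separate estimates
    grand_mean = data.mean()                 # complete pooling: one estimate

    sigma2 = data.var(axis=1, ddof=1).mean() / n_per_group   # sampling var of a group mean
    tau2 = max(group_means.var(ddof=1) - sigma2, 0.0)        # between-group variance
    weight = tau2 / (tau2 + sigma2)
    partial = grand_mean + weight * (group_means - grand_mean)  # shrink toward grand mean

    errors["no pooling"].append(np.mean((group_means - true_means) ** 2))
    errors["complete pooling"].append(np.mean((grand_mean - true_means) ** 2))
    errors["partial pooling"].append(np.mean((partial - true_means) ** 2))

for label, errs in errors.items():
    print(f"{label:>16}: mean squared error = {np.mean(errs):.3f}")
```

Averaged over many simulated datasets, the partially pooled estimates should land between the two extremes and beat both in mean squared error, which is the sense in which the algorithm is balancing different sources of information.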
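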
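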