Gary Marcus and Ernest Davis wrote this useful news article on the promise and limitations of “big data.”
And let me add this related point:
Big data are typically not random samples, hence the need for “big model” to map from sample to population. Here’s an example (with Wei Wang, David Rothschild, and Sharad Goel):
Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked for whom they intend to vote. While representative polling has historically proven to be quite effective, it comes at considerable financial and time costs. Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. In this paper, we show that with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and often faster and at less expense than traditional survey methods. We demonstrate this approach by creating forecasts from a novel and highly non-representative survey dataset: a series of daily voter intention polls for the 2012 presidential election conducted on the Xbox gaming platform. After adjusting the Xbox responses via multilevel regression and poststratification, we obtain estimates in line with forecasts from leading poll analysts, which were based on aggregating hundreds of traditional polls conducted during the election cycle. We conclude by arguing that non-representative polling shows promise not only for election forecasting, but also for measuring public opinion on a broad range of social, economic and cultural issues.
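To make the "map from sample to population" idea concrete, here is a minimal sketch of the poststratification step alone, in Python. Everything in it is an assumption made for illustration: the age cells, the respondent counts, the support shares, and the population shares are invented numbers, not figures from the Xbox data. It also skips the multilevel regression step. In the approach described in the abstract, the per-cell estimates would come from a fitted multilevel model rather than from raw cell means as they do here.

```python
# Hypothetical cells, invented for illustration only:
# (age group, sample respondents, share supporting candidate A, population share)
cells = [
    ("18-29", 600, 0.60, 0.20),
    ("30-44", 300, 0.50, 0.25),
    ("45-64", 80, 0.45, 0.35),
    ("65+", 20, 0.40, 0.20),
]


def raw_estimate(cells):
    """Unadjusted estimate: each cell is weighted by its share of the sample."""
    total_n = sum(n for _, n, _, _ in cells)
    return sum(n * support for _, n, support, _ in cells) / total_n


def poststratified_estimate(cells):
    """Adjusted estimate: each cell is weighted by its share of the population."""
    total_pop = sum(pop for _, _, _, pop in cells)
    return sum(pop * support for _, _, support, pop in cells) / total_pop


if __name__ == "__main__":
    print(f"raw sample estimate:     {raw_estimate(cells):.4f}")
    print(f"poststratified estimate: {poststratified_estimate(cells):.4f}")
```

In this made-up example the sample skews young, so the raw estimate overstates support relative to the population-weighted one. With many small cells the raw cell means become too noisy to use directly, which is where the multilevel regression comes in.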