Big Data…Big Deal? Maybe, if Used with Caution.

This post is by David K. Park

As we have witnessed, the term “big data” has been thrust into the zeitgeist in the past several years; however, when one pushes beyond the hype, there often seems to be little substance there. We’ve always had “data,” so what is so unique about it this time? Yes, we recognize it’s “big,” but is there anything genuinely different about data this time around?

I’ve spent some time thinking about this, and the answer seems to be yes. It falls along three dimensions:

  • Capturing Conversations & Relationships: Individuals have always communicated with one another, but now we can capture some of that conversation – email, blogs, social media (Facebook, Twitter, Pinterest) – and machines can do it too, via sensors, i.e., the “internet of things” we hear so much about;
  • Granularity: We can now understand individuals at a much finer level of analysis. No longer do we need to rely on a sample size of 500 people to “represent” the nation, but instead we can access millions to do it (see the sketch after this list); and
  • Real time: Because computing power has vastly increased (i.e., clustering, parallelism, etc.) and the costs to store data and access computing power (i.e., cloud computing) have fallen tremendously in the past 5+ years, we can analyze these volumes of data much closer to real time, which has a profound impact on businesses, government, and universities. With individuals (as well as hardware, i.e., the internet of things) able to continuously generate data that can be captured and analyzed, the question becomes how we can engage individuals on a real-time basis to purchase a product, change an opinion, modify a behavior, cure an illness, etc. As you can imagine, this is a tremendously important question for businesses, but just as important (if not more so) for policymakers and the universities that help shape those policies.
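
To make the granularity point concrete, here is a minimal sketch (not part of the original post; the sample sizes, the 52% support rate, and the function name are invented for illustration): with a national sample of 500 there are only a handful of respondents per state, so state-level estimates are very noisy, whereas millions of representative respondents make much finer subgroups estimable.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_state_error(n_total, n_states=50, p=0.52):
    """Average absolute error of per-state proportion estimates when a
    simple random sample of size n_total is split evenly across states."""
    n_per_state = max(n_total // n_states, 1)
    estimates = rng.binomial(n_per_state, p, size=n_states) / n_per_state
    return np.mean(np.abs(estimates - p))

for n in (500, 5_000_000):
    print(f"n = {n:>9,}: average state-level error ≈ {mean_state_error(n):.3f}")
# With n = 500 (about 10 respondents per state) the state-level estimates
# are very noisy; with millions they are tight -- provided, as the comments
# below stress, that the sample is actually representative.
```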

However, as I remind my friends in computer science, engineering, and mathematics, one thing all this disruption has not changed is the need to think really hard about a problem and to understand the underlying mechanisms that drive the processes that generate the data. Data by itself does not yield insights.

A classic (simplistic) example is the relationship between the number of fire trucks and the intensity of fires. If we collected the data and plotted that relationship, but didn’t understand the mechanism linking the two, we could imagine a situation where one would (incorrectly and dangerously) advocate reducing the number of fire trucks in order to reduce the intensity of fires. As we venture into this brave new “big data” world, we can imagine data scientists who lack the contextual grounding of deep experts in law, business, policy, arts, psychology, economics, sociology, political science, etc. making similarly bad choices with data and analysis.
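
The fire-truck example can be made concrete with a tiny simulation (a sketch with invented numbers, not from the post): severity drives the number of trucks dispatched, yet a naive regression of severity on trucks produces a positive slope that, read causally, would suggest cutting trucks to get smaller fires.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data-generating mechanism: fire severity causes trucks to be dispatched,
# not the other way around.
n = 1_000
severity = rng.gamma(shape=2.0, scale=1.0, size=n)           # fire intensity
trucks = np.round(1 + 2 * severity + rng.normal(0, 0.5, n))  # dispatch rule

# "Theory-free" analysis: regress severity on the number of trucks.
slope, intercept = np.polyfit(trucks, severity, 1)
print(f"estimated: severity ≈ {intercept:.2f} + {slope:.2f} × trucks")
# The slope is strongly positive, but the policy conclusion "send fewer
# trucks to reduce fire intensity" gets the causal mechanism backwards.
```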

Data and algorithms alone will not fulfill the promises of “big data.” Instead, it is creative humans who need to think very hard about a problem and the underlying mechanisms that drive those processes. It is this intersection of creative critical thinking coupled with data and algorithms that will ultimately fulfill the promise of “big data.”

28 Comments

  1. jonathan says:

    1. The fire truck example is like people arguing that low interest rates are causing low inflation. Flips causality on its head; can you imagine the same people saying that if inflation is high the way to lower it would be to lower interest rates? Or that the way to increase inflation would be to raise interest rates? In this light, the fire truck example isn’t as dumb as it looks.

    2. I was reading about Facebook scaling its searchable database up to many petabytes. And about Pinterest’s work on guided search: 750 million pin boards with over 30 billion pins as of now. And what is in some ways the grandfather of these things: Google filling in your search terms. Mostly basic predictive correlative algorithms drawn together out of basic associations.

    3. I think the actual question is not how you can engage a customer in a moment but whether you can better understand customers so you can deliver services and products that answer needs. There is, IMHO, too much focus on seeing whether you can influence a buying/behavioral decision now. That seems to me to miss much of the point of actual product development and frankly reminds me too much of test panels and test screenings, etc. that don’t work … because there is something inherently off with that process, which works to design a donkey not a thoroughbred, and not so much because we need more data (to build a more donkey-like donkey by mistake). Actual product development requires tossing out much feedback. I could go on about this but I don’t want to get too far off point.

  2. Fernando says:

    “No longer do we need to rely on a sample size of 500 people to “represent” the nation, but instead we can access millions to do it;”

    Except the millions may be more biased than the 500. To address a question you need the right kind of data, not just more data. The difference between being uncertain and precisely wrong.
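
    A few lines of simulation (a sketch with invented numbers, not part of the original comment) make the “uncertain vs. precisely wrong” contrast concrete: a huge convenience sample with a small selection bias can lose to a modest random sample.

```python
import numpy as np

rng = np.random.default_rng(2)
true_p = 0.50   # true population proportion
bias = 0.03     # assumed selection bias in the big convenience sample

def estimate(n, p):
    """Sample proportion from n respondents who answer 'yes' with probability p."""
    return rng.binomial(n, p) / n

small_random    = estimate(500, true_p)                # uncertain, unbiased
big_convenience = estimate(5_000_000, true_p + bias)   # precise, wrong target

print("n = 500       :", round(small_random, 3))
print("n = 5,000,000 :", round(big_convenience, 3))
# The big sample has almost no sampling error, so it reproduces its own
# selection bias every time: precisely wrong rather than merely uncertain.
```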

    • David says:

      I completely agree. I didn’t mean to say we can ignore the basic foundations of sampling because we have a larger N; instead, now that we have a much larger N, we can look at differences across smaller unit samples.

  3. Andrew says:

    David:

    Mister P will help also. Big data are typically not close to a random sample of the population of interest, and much can be gained by doing some adjustment. See here for my favorite recent example, our use of the highly nonrepresentative Xbox polls to learn about trends in public opinion during the 2012 election campaign.
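
    For readers unfamiliar with “Mister P” (multilevel regression and poststratification), here is a toy sketch of just the poststratification step, with invented cells and numbers; it is not the actual Xbox analysis, and it omits the multilevel regression that stabilizes estimates in sparse cells.

```python
import numpy as np

# Toy poststratification: reweight cell-level estimates from a
# nonrepresentative sample by the cells' known population shares.
# Hypothetical cells: young men, young women, older men, older women.
cell_support     = np.array([0.40, 0.45, 0.55, 0.60])  # estimated support per cell
sample_share     = np.array([0.55, 0.15, 0.20, 0.10])  # cell shares in the sample (skewed)
population_share = np.array([0.20, 0.20, 0.30, 0.30])  # cell shares in the population (census)

raw_estimate   = float(np.dot(cell_support, sample_share))
poststratified = float(np.dot(cell_support, population_share))

print(f"raw sample average (skewed toward young men): {raw_estimate:.3f}")
print(f"poststratified estimate                     : {poststratified:.3f}")
```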

    • David says:

      Yes, completely agree about Mister P helping, and the Xbox is a great example.

    • Fernando says:

      Mr P is great but still a heuristic no? What guarantees can you provide if you tell a campaign in a close race: Don’t bother with the $100K survey, do the Xbox for $10K and you’ll be fine. In retrospect this might seem OK. But going into it for the first time not so sure.

      In passing, one worry is that if someone tries it with a Nintendo or whatever and it doesn’t work, it will not be published so we only see confirmations.

  4. John Mashey says:

    What’s really different is that Big Data applications used to be available only to big organizations.
    In the 1970s at Bell Labs, there was a whole building (MH Building 5, for Andrew) devoted to analysis of telephone data, which would be considered small now, but was Big then, although it tended to be called data mining.

    Companies like Teradata had been doing this for decades, although those machines were expensive and very proprietary, so required a substantial investment in staff.

    As per “The Origins of ‘Big Data’: An Etymological Detective Story” in the NY Times, the modern phrase Big Data originated in the 1990s, and the costs dropped enough to make adequate technology available to a much larger group, although still not commodity. This needed:
    – the inflection in the cost/bit drop for disks, as cheap disks and RAID proliferated in the early 1990s
    – 64-bit microprocessors, running 64-bit UNIX, 64-bit journaled disk file systems
    – big increases in pervasiveness and bandwidth of networks at all levels
    – better database software, some built for data mining, data warehousing, etc, not just operational transactions.

    The real explosion of Big Data apps has occurred because of better networks and clusters of commodity x86 CPUs, as those became 64-bit as well, plus software to use them, which dropped the cost much more.

    This quite naturally leads to more silly things being done, because long ago, it was so expensive to do that anything needed serious justification. I did once help sell an SGI Origin supercomputer to a telephone company versus Teradata, and it paid back the cost in a month, but those machines were still multi-million$.

    Anyone interested in the history might watch my video Big Data – Yesterday, Today and Tomorrow.

  5. Chris G says:

    Tim Harford had a good post a few weeks ago, Big Data: Are we making a big mistake?
    link = http://timharford.com/2014/04/big-data-are-we-making-a-big-mistake/

    A few quotes of note:

    1. “Figuring out what causes what is hard (impossible, some say). Figuring out what is correlated with what is much cheaper and easier.”

    2. “Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.”

    3. “… a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down”

    4. “[B]ig data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.”

  6. question says:

    Andrew,

    “Granularity: We can now understand individuals at a much finer level of analysis. No longer do we need to rely on a sample size of 500 people to “represent” the nation, but instead we can access millions to do it”

    That seems contradictory to me. How does looking at even more aggregated data help to understand the individual? Can you explain what you mean further?

    • David says:

      Andrew didn’t write the post, but I’m more than happy for him to take the blame for any contradictory statements ; )

      You’re right, it should read groups of individuals at a finer level of analysis. An example of this was our Red State Blue State book. Since we had a larger (representative) sample, it let us tease out the different effects of income across different states. It would have been interesting to reduce the unit of analysis further if we had an even larger (representative) sample.

      Cheers,
      David

      • question says:

        David,

        Sorry about missing the name in the post. Thank you for clarifying. I have observed there is some confusion caused by whether, for example, we are analyzing data for policy makers, or clinicians, or scientists. It seems to me that different levels and types of analysis are useful to each role.

  7. Alex Gamma says:

    I agree very much with what Andrew and Tim Harford in Chris’s quote say about the importance of testing causal hypotheses, not just mining big data sets. I’ve made that point in a recent paper on personalized medicine (muse.jhu.edu/login?auth=0&type=summary&url=/journals/perspectives_in_biology_and_medicine/v056/56.4.gamma.html), where we see the same tendency of expecting that merely piling up enough data through “high-throughput phenotypic measurements” of patients will somehow reveal the mechanisms of their illness and how best to treat it. The field of “Personalized Medicine” is currently intoxicated with this idea, e.g. Leroi and Hood, 2012 (N Biotechnol 29, 613–624). That paper is full of excitement about collecting “enormous amounts of digitalized personal data”, but has very little to offer in terms of how that big data cloud is supposed to yield a causal understanding of health and disease.

  8. John Mashey says:

    Rarely is anything new.
    I noted that Bell Labs in the 1970s was doing Big Data (for the time).

    But rather influential over statistical thinking there was a fellow named John Tukey. I’d guess he would hardly be pleased with giant data mining setups that find *something* whether or not there is any sensible explanation. I’d guess he’d be happy with the far more powerful tools now available to explore data.

    The case I mentioned where a telco paid for an expensive machine in a month was by using enough call records to find some interesting patterns, which they then verified were real, and changes in pricing/marketing got a lot more business.

  9. John Mashey says:

    “Big Data evangelism” does not seem precise enough to evaluate.

    This is like anything else: there is a distribution of usefulness of the various applications that go under the label Big Data, from terrific to useless, just as there is for statistics or computer-based graphics or special effects. As above, I’m sure there are more marginal ones around today than in the 1990s when I was doing this and it was still not cheap. As always, marketeers can run wild, but that doesn’t mean there are not good solid applications, starting with things like fraud detection, which were Big via the Velocity element. Telcos have long had to do traffic analysis of various kinds on as much data as they could get. Anyway, in the middle of that video I mentioned, a bunch of 1990s Big Data customers are listed, and you might be able to guess what some were doing.

  10. Øystein says:

    A propos your fire example, I recall a MR post suggesting that reducing the number of firefighters may not be that dangerous: http://marginalrevolution.com/marginalrevolution/2012/07/firefighters-dont-fight-fires.html
    Firewise, that is.

  11. John Mashey says:

    Just as a reminder of how not-new some of this is, which also relates to certain events with the NSA of late:

    1) Conversations: telcos have long had to capture the call records (not the contents), ie which phone called which, when and for how long. Among other things, that’s needed if you expect to give people bills for items costing $.10.
    In addition, one needs that kind of data for the originating telco (who charges the customer) to eventually pay the other involved telcos. Finally, for a long time, telcos have been required by law to supply that data, following rules that (I think) were started in Europe. I know detailed analysis of such data for traffic engineering and marketing was already well in progress in the early 1970s. It certainly wasn’t in real time (they had to mail mag tapes).
    The real time parts were done in Network Operations Centers.

    2) Having helped get 2 wireless sensor net companies funded, I’d distinguish between the familiar Internet and the Internet of Things. Emails, blogs, Facebook, etc are part of the former.

    3) In the 1990s, the big boost in compute power and storage started to enable closer-to-real-time fraud detection at telcos and credit card companies. I helped telcos with systems for that in early 1990s.

    4) Likewise, airlines have long been sophisticated users of operations research codes for scheduling and then dynamic pricing. Things like Travelocity were Big Data problems when they started in the 1990s. That needed multiple 64-bit multiprocessors with well more than 4GB of memory each, back when that was a lot :-)

    5) For better or worse, intelligence agencies have done Big Data apps for a long time, at least as far back as WW II, likely earlier, even without computers.

    6) While not Internet of Things, sensors have long been used to generate serious data sets people wanted to explore, as in 10TB seismic data sets used by oil companies. I recall one who wanted to fly around in 3D interactive visualizations in the mid-1990s, and we figured out about 15GB/sec of sustained disk transfer was needed, back when about 7GB/s was the best we’d been able to do on graphics supercomputers.

  12. Phillip M. says:

    I would simply comment that Big Data is merely data whose size alone requires different thinking with respect to both its architecture and its throughput for analysis (if not the method of analysis itself). All of the attributes David and others have mentioned contribute. Of course, Big Data isn’t just about marketing; it’s about computational tractability of physical and biological systems and process phenomena as well. So storage, data architecture, algorithms, interpreters/compilers, networks, etc. all simply must scale (think of CERN’s data capture size and rate to even detect a Higgs-like particle, much less the needed efficiencies of the algorithms, and the methods behind them, which were the analytic workhorses for the staff).

    But some evangelists, like Vincent Granville of the Data Science Central blog, tend to use the term to market (or just self-promote) the ‘data scientist’ (sometimes at the expense of statisticians: http://www.datasciencecentral.com/profiles/blogs/the-best-kept-secret-about-linear-and-logistic-regression). Others still market it as something ‘novel’, so you must have it now or get left in the dust…yep, just more marketing…for now. But really, data are only ‘bigger’ as a result of data capture practices, technological advances, and perhaps a subconscious meme hinting that ‘life is complicated’, and our understanding of those complexities may be aided by wider-scoped, more robust, and more granular (state/space/time/whatever) data.

    Essentially, we’ve always had Big Data, but ‘Big’ is merely relative to our understanding of the universe and technology at any point in time. Yesterday’s card file….today’s data centers and architectures, tomorrow’s…………who knows?

  13. […] “Big Data…Big Deal? Maybe, if Used with Caution.” http://andrewgelman.com/2014/04/ … […]

  14. […] there are only data in more or less large volumes. For the statistician Andrew Gelman, there is indeed a phenomenon, characterized by the conversational nature of the data, the fineness of their granularity, and […]

  15. K? O'Rourke says:

    David:

    I just stole this from Jeff Leek, who got it from Terry Speed’s talk, who got it from …

    Big data is like teenage sex: everyone talks about it, nobody really knows how to do it,
    everyone thinks everyone else is doing it, so everyone claims they are doing it…

    Dan Ariely, 2013

    Hilarious because true?

    Jeff’s post http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/

    Terry’s paper http://www.chalmers.se/en/areas-of-advance/ict/events/Documents/Terry%20Speed_Data%20Science,%20Big%20Data%20and%20Statistics%20-%20Can%20We%20All%20Live%20Together.pdf

    (My reaction to Terry’s paper was to wonder what percent of statisticians would be ideal in a big data research group – my prior for that percentage is 5-10%)
