
Doing Data Science: What’s it all about?

Rachel Schutt and Cathy O’Neil just came out with a wonderfully readable book on doing data science, based on a course Rachel taught last year at Columbia. Rachel is a former Ph.D. student of mine and so I’m inclined to have a positive view of her work; on the other hand, I did actually look at the book and I did find it readable!

What do I claim is the least important part of data science?

Here’s what Schutt and O’Neil say regarding the title: “Data science is not just a rebranding of statistics or machine learning but rather a field unto itself.” I agree. There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science.

The question then arises: why do descriptions of data science focus so strongly on statistical tasks? (As Schutt and O’Neil write, “the media often describes data science in a way that makes it sound like as if it’s simply statistics or machine learning in the context of the tech industry.”) I think it’s because statistics is the fun part and the part that, in this context, is new. The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.

To put it another way: you can do tech without statistics but you can’t do it without coding and databases. But in recent years, lots of tech companies have made use of statistical methods (including various statistical ideas that have been developed in the computer science literature). So, from the industry perspective, the new part of data science is the statistics. Statistics is the least important part of data science, hence it is the part most recently added, hence it is the part that is getting the most attention right now.

Schutt and O’Neil also write:

People have said to us, “Anything that has to call itself a science isn’t.” Although there might be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what it represents may not be science but more of a craft.

Well put.

What is Hadoop, anyway?

OK, back to the book. I read and enjoyed the first couple of chapters and then went back to the table of contents to see where I could learn. Chapter 14 grabbed my eye: “Data Engineering: MapReduce, Pregel, and Hadoop.” I keep hearing about “map reduce” and “hadoop” but I’ve never known what they are about. Before checking out the chapter, I did a quick Wikipedia read. The wiki articles seem clear enough but after a 30-second read (hey, I’m impatient!) I still don’t really have a sense of what is going on here. So on to the chapter, which is coauthored with David Crawshaw and Josh Wills:

You’re dealing with Big Data when you’re working with data that doesn’t fit into your computer unit. Note that makes it an evolving definition: Big Data has been around for a long time. . . . Today, Big Data means working with data that doesn’t fit in one computer.

Then they get into the details. I still don’t understand map reduce and hadoop, but at this point I’m pretty sure it’s my fault, not theirs—or, to put it another way, to learn it I’d need to be able to have a q-and-a discussion with Bob or Daniel Lee or someone else who can explain it to me, or else I’d need to put in a bit more work. Fair enough, it’s not like someone could learn Bayesian data analysis by just reading a book and not doing any homework.

Numeracy

In that hadoop chapter, we get the following motivation for comprehensive integration of data sources, a story that is reminiscent of the parables we sometimes see in business books:

By some estimates, one or two patients died per week in a certain smallish town because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic. In other words, if the records had been easier to match, they’d have been able to save more lives. On the other hand, if it had been easy to match records, other breaches of confidence might also have occurred. Of course it’s hard to know exactly how many lives are at stake, but it’s nontrivial.

The moral:

We can assume we think privacy is a generally good thing. . . . But privacy takes lives too, as we see from this story of emergency room deaths.

But what about this story?

One or two patients per week? That works out to roughly 75 people a year, which is a lot! To calibrate, I’d like to get a denominator, the total number of deaths each year.

I’m not sure how large the “smallish town” is. Here’s Wikipedia: “A town is a human settlement larger than a village but smaller than a city. The size definition for what constitutes a ‘town’ varies considerably in different parts of the world. . . . In the United States of America, the term “town” refers to an area of population distinct from others in some meaningful dimension, typically population or type of government. . . . In some instances, the term “town” refers to a small incorporated municipality of less than 10,000 people, while in others a town can be significantly larger. Some states do not use the term ‘town’ at all, while in others the term has no official meaning and is used informally to refer to a populated place, of any size, whether incorporated or unincorporated. . . .” Wikipedia then goes state by state, for example, “In Alabama, the legal use of the terms ‘town’ and ‘city’ are based on population. A municipality with a population of 2,000 or more is a city, while less than 2,000 is a town.”

Just to go forward on this, I’ll assume the “smallish town” has 10,000 people. If approximately 1/70 of the population is dying every year, that’s 140 deaths a year. So that can’t be right—there’s no way that half the deaths in this town are caused by poor record-keeping in a hospital. If the town had 20,000 people (which would seem to be near the upper limit of the population of a town that one would call “smallish,” at least in the United States), then we’re talking 1/4 of the deaths, which still seems way too large a proportion. Even if it is a town with lots of old people, so that much more than 1/70 of the population is dropping off each year, the numbers just don’t seem to add up. Maybe the town happens to have a large regional hospital. But, 75 excess deaths a year caused by “lack of information flow” still seems like a lot, and if the patients are drawn from a large population, it seems a bit misleading to describe these deaths as being “in a certain smallish town.”
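
The back-of-the-envelope calculation above can be written out as a short Python sketch. The 1/70 annual death rate, the two candidate populations, and the figure of 75 deaths a year all come from the text; the function name is hypothetical and only there for illustration.

```python
def share_of_deaths(population, excess_deaths=75, annual_death_rate=1 / 70):
    """Fraction of a town's yearly deaths that the excess deaths would represent."""
    total_deaths = population * annual_death_rate
    return excess_deaths / total_deaths


# The two town sizes considered in the text.
for population in (10_000, 20_000):
    total = population / 70
    share = share_of_deaths(population)
    print(f"{population:>6} people: about {total:.0f} deaths a year, "
          f"{share:.0%} of them attributed to poor record-keeping")
```

With 10,000 people the share comes out at a little over one half, and with 20,000 people at a little over one quarter, which is the rough arithmetic used above.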

What I just did was statistical reasoning, or maybe I should call it mathematical reasoning or numeracy. Based on my calculations, I feel like there is something missing in the story that was told about the hospital records. I could be wrong, though. I might be missing something subtle or even something obvious. It’s hard for me to know, though, because the story is not sourced. This is a reminder that all data, big or small, is more easily used when its source is clear. From a statistical perspective, we want to know the data-generation process (also called the likelihood function, also called “where did the data come from”) as well as the numerical data (or, in this case, the story, or anecdote, or parable, itself).

Summary

I enjoyed Rachel and Cathy’s book: it’s readable, informative, and like no other book I’ve read on the topic of statistics or data science. It has a lot in common with the “365 stories” project that we started here (but that never got off the ground because far fewer than 365 people sent us their stories). I think/hope that lots of people will get a lot out of this book. It got me thinking about all sorts of things.

P.S. I wonder what Richard Stallman would think about the book. On one hand, it’s all about being a “data humanist,” which I think he’d like. On the other hand, Rachel Schutt is the Senior VP of Data Science at News Corp, which would surely be a turnoff to the Gnu-man. And I seem to recall he’s down on O’Reilly.

39 Comments

  1. Rahul says:

    Do they have a section on sed / awk? I couldn’t find one in a cursory search.

    I may be old school but I think if you are doing any serious amount of data analysis & not using sed / awk you are probably being inefficient.

    Of course, I’m prejudiced towards the Unix philosophy of any tool doing one small thing and doing it well.

  2. Rahul says:

    ” I still don’t understand map reduce and hadoop”

    Have you ever had a project where the size of the data or its data-crunching time was the limiting factor? If not, I don’t think you need to worry about Hadoop. I think Hadoop is and always will be a niche.

    • Ike says:

      Most of the time when someone brings me a problem where the size or the crunching time is the limiting factor, it’s because they are pretty myopic in their ability to reframe the problem or situation with the lost “black arts” of data manipulation, coding theory, or other old-school ways of doing things. When I review their analysis plan, they are doing very inefficient things to begin with.

      • Rahul says:

        Agree. Though there still may be some cases where the limitations are genuine and not amenable to conventional optimization. Especially in these data-heavy days where a day’s dump of log files can be terabytes at some firms.

        • Ike says:

          That’s my point: they are storing the data in a very inefficient manner to begin with. If you think about it, all you need is 3TB to store the minute-by-minute location of every living US citizen down to the nearest 100ft^2 area for a day. But you have to be smart in using your storage, something many people don’t do anymore.
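
A rough check of the 3TB figure, as a sketch under stated assumptions. The population (about 310 million) and the land area (about 3.8 million square miles) are round numbers supplied here, not taken from the comment, and the encoding is a plain fixed-width cell index with no compression.

```python
import math

US_POPULATION = 310_000_000      # assumption: rough figure, not from the comment
US_AREA_SQ_MILES = 3_800_000     # assumption: rough total area, not from the comment
SQ_FT_PER_SQ_MILE = 5280 ** 2
CELL_SQ_FT = 100                 # from the comment: nearest 100 ft^2
MINUTES_PER_DAY = 24 * 60

# Number of distinct 100 ft^2 cells needed to cover the country.
cells = US_AREA_SQ_MILES * SQ_FT_PER_SQ_MILE // CELL_SQ_FT

# Bits needed to name one cell, rounded up to whole bytes.
bits_per_location = math.ceil(math.log2(cells))
bytes_per_location = math.ceil(bits_per_location / 8)

# One location per person per minute for one day.
total_bytes = US_POPULATION * MINUTES_PER_DAY * bytes_per_location
total_tb = total_bytes / 1e12

print(f"{bits_per_location} bits per location, about {total_tb:.1f} TB per day")
```

Under these assumptions each location fits in 5 bytes and a day of data comes in under 3TB, so the figure in the comment is at least the right order of magnitude.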

  3. koala says:

    Hadoop processes files line by line. At each line, you construct a key+value pair from that line (map phase). Next (reduce phase), identical keys are sent to the same machine, to the same “reducer code,” so group operations can be applied to that group of data. Very simple, two-step process.

    In terms of architecture: Each machine contains a piece of the file (thanks to the HDFS file system), so when map starts, each machine does the mapping on its piece of the file. This is the key unit of distribution. Hadoop makes these pieces appear to the outside world as if they were one big file.
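
The two-step process described in this comment can be imitated in plain Python on one machine. This is a minimal sketch, not Hadoop's actual API: the input lines, the function names, and the choice of a sum as the group operation are all hypothetical, and the `shuffle` step only stands in for routing equal keys to the same reducer.

```python
from collections import defaultdict


def map_phase(line):
    """Turn one input line into a (key, value) pair."""
    user, amount = line.split(",")
    return user, int(amount)


def shuffle(pairs):
    """Group values by key, standing in for sending equal keys to one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Apply a group operation (here, a sum) to all values sharing a key."""
    return key, sum(values)


# Hypothetical input file, one record per line.
lines = ["alice,3", "bob,5", "alice,4", "bob,1", "carol,2"]

pairs = [map_phase(line) for line in lines]
totals = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(totals)
```

In a real cluster the list of lines would be split across machines and each machine would run `map_phase` on its own piece; here everything runs in one process purely to show the shape of the computation.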

  4. JSE says:

    Cathy has become a real New Yorker; when she says “smallish town” I’m pretty sure she means, like, Philadelphia.

    • K? O'Rourke says:

      I was guessing the Mayo Clinic in Rochester Minnesota.

      That “city” had a population estimate of 106,769 in 2010.

      I audited their pharmacy on their following proper randomisation instructions in an RCT (something that should always be done).

      It was a really large hospital and the weather was really, really cold (and I once worked outside in Yukon!)

      Also, Andrew, the likelihood is not actually a function but a (weird) equivalence class of functions ;-)

  5. Steve Laniel says:

    MapReduce: many algorithms can be framed as a map step (where you take an input list and transform it into a new list) followed by a reduce step (where you take an input list and reduce it down to a single number). For instance, the action of taking the sum of the squares of a set of data values is a map step (square every item in the list) followed by a reduce step (sum the list).

    Turns out that if you can frame your algorithm in this way, then it is trivial to parallelize the algorithm across many CPUs, or many different computers.
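
The sum-of-squares example from this comment, sketched in Python. The chunked version is an illustration of why the framing parallelizes: each chunk could be handled by a different CPU or machine, and because addition is associative the partial results can be combined in any grouping. The function names and the sample data are hypothetical.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


def sum_of_squares(values):
    """Map step (square every item) followed by a reduce step (sum the list)."""
    return reduce(add, map(square, values), 0)


def chunked_sum_of_squares(values, n_chunks):
    """Same result, computed as independent partial sums that are then combined."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [sum_of_squares(chunk) for chunk in chunks]  # each could run elsewhere
    return reduce(add, partials, 0)


data = [1, 2, 3, 4, 5, 6]
print(sum_of_squares(data), chunked_sum_of_squares(data, 3))
```

Both calls give the same answer; the only difference is that the second never needs the whole list in one place.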

  6. As far as the more reasonable defenses of the term “data science” go, Yann LeCun’s is interesting:
    https://plus.google.com/104362980539466846301/posts/i8Xs9AGgCQW

  7. Robin Morris says:

    Data Scientist, (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

    (Originally by Josh Wills, see http://blog.kaggle.com/2012/10/04/engineering-practices-in-data-science/ )

  8. Seems like another rebranding issue to me. In fact, Yann LeCun explicitly treats it this way in his blog post on data science linked by Brendan O’Connor above, the telling lines of which are

    What I like about “Data Science” is that everyone can feel at home. Also, it means exactly what it says.

    To me, it shares with “evidence-based medicine” the property of immediately raising the question of “what other kind of science is there?”

    So maybe it’s like the stone soup fable, the moral of which I think was positive. Is high-energy physics just data science by another name?

    Before tackling “data science,” what’s the difference between machine learning and computational statistics? I see two practical differences. First, and perhaps foremost, machine learning people tend to come from computer science backgrounds and computational stats people from either physics or statistics backgrounds. Second, and this distinction is fading, I think, is that machine learning research tends to focus on 0/1 loss on predictive tasks (not exclusively, of course), whereas computational stats people tend to focus on log loss and fitting the data at hand (not exclusively, of course).

    If stats people respected computation more, it seems unlikely there’d be terms “machine learning” and “data science.” It goes along with Yann’s story about CS splitting out from EE and math — if EE took software more seriously or math took computation more seriously, they might still be in those two departments.

    I have no idea what people mean by “data science” in contrast to “machine learning.” Isn’t it just stats being done by computer scientists? Maybe it’s less of a focus on the math? I hope the answer isn’t data munging!

    • Rahul says:

      I don’t think it is rebranding. Data science would involve skills beyond what might be considered a standard statistician’s toolkit.

      • Tools change over time, even within a field. A modern physicist or mathematician or biologist uses tools that go beyond what a standard physicist’s toolkit would’ve looked like even fifty years ago. And they didn’t create a subfield of “data science” in response.

        The evolution of new tools (or any other new ideas) can be a big force for change, in part because of the generational effect in a field — the older folks become professors based on work they did in the first five or ten years after grad school, then pass out the money and tenure decisions for another twenty or thirty years. I think this goes with Yann’s point about having separate funding opportunities through NSF.

        Unless we narrowly constrain what statistics is to proving theorems on a board, statisticians should be using new tools these days!

    • Andrew says:

      Bob:

      1. If you’re going to put a link on “stone soup,” please link here!

      2. As I wrote above, I don’t think data science is the same as statistics, nor do I think that data science is a subset or an application of statistics. In fact, I think data science does not need statistics at all, or, to be more precise, I think that statistics is the least important part of data science.

  9. Rahul says:

    Andrew’s “two patients died per week” example stresses the need for domain knowledge in a data scientist. Often such apparent inconsistencies or illogical results can most easily be detected if you have prior experience with the sector or application that is generating the data.

  10. Fernando says:

    P(y,x|&) does not tell us what model generated the data. Does & measure the effect of x on y, or y on x, or both, or a cause in common, or selection bias, etc. (inclusive or)?

    P(y,x|&) is _a_ model of the data. To make it _the_ model generating the data requires expressing additional intent not explicit in the likelihood.

    I prefer the longer notation in the early chapters of BDA: P(y,x|&,A) where A is assumptions. Therein hides the Devil, and he likes to hide.

  11. Mike says:

    A reviewer on Amazon completely trashes this book.

  12. Frank says:

    The (presumably terribly bright) Senior VP of Data Science at News Corp thinks that ‘… privacy takes lives too.’ Well, that’s just great isn’t it?

  13. As for the deaths, I took it to mean the deaths that occur in patients *at that hospital*, which would presumably be far fewer than all deaths in the town (which would be around 1/70 of the population). So if 75 patients are dying per year in this hospital alone, and maybe 50% of deaths in this town are in the hospital, then there are maybe 150 per year in the town. If this is 1/70 of the population, then the town is about 10.5k people, and that makes sense to me.
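
The arithmetic in this comment, run backwards from deaths to population, as a small Python sketch. The 75 deaths, the 50% hospital share, and the 1/70 death rate are the numbers given in the comment; the function name is hypothetical.

```python
def implied_population(hospital_deaths, hospital_share, annual_death_rate=1 / 70):
    """Town size implied by a count of hospital deaths.

    hospital_share is the fraction of all town deaths that happen in the hospital.
    """
    town_deaths = hospital_deaths / hospital_share
    return town_deaths / annual_death_rate


print(round(implied_population(75, 0.5)))
```

This reproduces the figure of about 10.5k people. Note that it takes the 75 deaths as the hospital's total, which is the reading the comment proposes, not a fact established by the original story.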

    Deaths that occur in the hospital might be due mainly to preventable causes like diseases, injuries, surgical complications, failure to take medicines, etc, whereas deaths outside the hospital might be largely things like fatal at the scene car accidents, the ravages of old age (strokes, heart attacks etc at home), and other things where the patients would be dead before anyone could even get them in an ambulance and get them to a hospital.

    • Rahul says:

      Nope. Doesn’t make sense. What you are saying is that more than 50% of the deaths that occurred in that hospital were “because of the lack of information flow”?

      That sounds implausible. Almost bizarre.

      • Not to me. Especially when you consider it was the flow of information between mental health clinic and E-room. Here’s the model that generates such results:

        1) If people who are not under the care of mental health clinic come into the hospital at all, they are generally not dead or about to die, and these people generally recover and leave the hospital. This may be especially true in small towns.

        2) Many deaths occur outside the hospital, such as DOA from car crash or massive heart attacks or hospice care.

        3) Mental health patients who come into the hospital are generally much much sicker and less stable, they are on a wide variety of medications, and they arrive only in a fragile and emergency state. Without quality knowledge of their medication and medical history they are at much greater risk of death. Yet, still many of them come in and recover and leave.

        4) Nevertheless, most of the deaths in the hospital occur in the E room, and most of them occur among the relatively fragile mental health population.

        5) Run the numbers and you get my result. It doesn’t sound at all implausible to me, but I’ve seen my wife struggle with trying to help someone she knows who is mentally ill and was frequently in and out of the hospital E room. It doesn’t seem implausible at all that the population similar to her friend would be responsible for a large fraction of hospital deaths, especially in a smallish town where gunshot wounds and freeway pile-ups and so forth are rare.

        • Other things might go into such a model: often there are state-level trauma centers where serious trauma patients are taken, so the local hospital wouldn’t be seeing the freeway pile-ups, gunshot wounds, or falls off the 2nd story ladder anyway, even if they are happening.

  15. I keep hearing about “map reduce” and “hadoop” but I’ve never known what they are about.

    Map is like a mapping in mathematics, from N dimensions to N dimensions. A standard data transform. Reduce smushes the dimensionality from N dimensions to 1 dimension. Like a volume integral or a count.

    When operations become too big to do with just one computer, Hadoop and its ilk have figured out how to automate breaking the operations up to be done on multiple standard-issue computers. (Rather than going to a supercomputer.)

    The connection is this: some of the computers in a Hadoop cluster will be doing maps and others will be reducing the dimensionality.
