## Statistics and data science, again

Phillip Middleton writes in with two questions:

(1) Is HTML, Markdown, or some other formatting script usable in comments? If so, what are the tags I may use?

(2) What are your views on the role of statistics in the evolution of the various folds of convergent science? For example, upon us there is this rather nebulously defined thing called data science which roughly equates to an amalgam of comp sci and stats. I tend to interpret it as somewhat computational stats, somewhat AI – or maybe it’s just an evolution of data mining, I don’t know yet.

The comp sci camp is marketing their brand in this area quite well. But is it just me, or is the statistical community perhaps taking a bit of a back seat? This feels more like a bit of competition than an invitation to converge ideas (so where do computational and graphical stats sit, then?). I had a brief conversation with Prof. Joe Blitzstein about this some time ago, and he expressed a number of concerns as well.

There are those who believe statisticians as a collective group have become less important in the work of translating data into meaning/knowledge than this (for now) unicorn, the ‘data scientist’. I believe part of the reason is that this new discipline has quite a bit of public ‘glitter’ and a few other shiny objects embedded (the term ‘big data’ for one, and a larger paycheck for another – quite the economically driven discipline), though it hasn’t in my opinion been fully defined. So what do you believe will become of it? Is it faddish, or the storming/norming stage of a particular convergence between comp sci and stats? Or more?

I don’t see much official discussion from either the ASA or any of the data science group/society outcroppings. What do you make of this? As much as I see the need for convergence between Bayesian and frequentist philosophies in statistics, I see the need for convergence between the statistical and comp sci camps to solve problems which each alone may not necessarily be good at. Is that even a reasonable expectation?

Physics has been doing this quite rapidly in recent years (convergence with medicine, biology, finance, economics, etc.), and it has integrated fairly seamlessly, I believe. I could be making the wrong comparison here, but it would seem to me statistics converges with, well… everything, and in no small way. Yet it seems to me there are forces ‘placing’ statistics as a whole into somewhat of a basement, or at least at the cusp of an identity crisis.

(1) You can use html in comments. I’m not sure which are allowed and which aren’t, but, as I tell everyone, if your comment gets caught in the spam filter, just drop me an email. It happens.

(2) As I’ve written before (see also here), I think statistics (and machine learning, which I take to be equivalent to statistical inference but done by different people) is the least important part of data science. But I do think statistics has a lot to contribute, in design of data collection, model building, inference, and a general approach to making assumptions and evaluating their implications. Beyond this, I don’t really know what to say. I agree these things are important but somehow I feel I lack the big picture—having only enough of this picture to recognize that I lack it!

Perhaps others have more useful thoughts to add.

1. There’s not a markdown language for WordPress (the way there is for GitHub). Instead, you can use a limited set of HTML. It’s not well documented, but there’s a blog post that explains it:

What you can do is use LaTeX, though. If you write “$latex e^x$” (without the quotes), it will be rendered as $e^x$.

The problem is that there’s no preview. WordPress is just awful at comments.

• Rahul says:

I think WordPress is just fine for comments. Who needs a gazillion tags and complex formatting for a comment? I’d rather err on the side of minimalistic than overdo it. e.g. I hate Disqus and other heavy duty, slow commenting add ons.

• Phillip M. says:

Thanks for clarifying that. There are cases in which my intent is to explain a thought in a comment, but it makes better sense in a formalism or some other more ‘picturesque’ way. Roughly a month ago, I was attempting to explain a method originated by Dr. Lee Hively at ORNL which used a polynomial (actually just a quadratic) as the filtering threshold for low-frequency artifactual events noticed in EEG readings. I had a complete brain fizzle, and in that moment I could only think spatially and formally (gives new meaning to ‘cat got tongue’ – actually ‘cat got keyboard’). When words evade me, I find a picture or a simplified mathematical formalism tells the story better than I can.

2. Anyone who can say “data scientist” with a straight face is a data scientist. Especially if you add “big data”.

Most of the serious textbooks in machine learning have a strong statistics bent (I’m thinking Murphy’s new book, Hastie et al., MacKay, and Bishop, or even the Witten et al. Weka book on data mining). Where they differ from most stats textbooks is that they also have a very strong algorithmic orientation. For instance, consider Rasmussen and Williams’s book on Gaussian processes — it’s relatively crunchy on both the stats and computation side, as is typical of serious machine learning publications (just browse a NIPS proceedings for more examples).

Then there are SVMs or perceptrons or other large-margin classifiers, which are statistical in some sense, but not probabilistic in their basis.

Most statistics professors seem to concentrate on theory, and as such, aren’t going to be making splashy mainstream news the way “a comp sci professor at university X cures cancer by data mining” is going to. I’m not saying theory isn’t important, just that it doesn’t grab the headlines.

In my experience, the major difference is that machine learning researchers tend to concentrate on predictive performance on either properly held out bakeoff test sets or under cross-validation, whereas statisticians tend to concentrate on understanding the parameters in their model by assessing data fit versus the training set. And they want to interpret their parameters, perhaps causally, in a way that’s still not mainstream in machine learning. I think we’re seeing convergence, though.
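To make that contrast concrete, here’s a toy numpy sketch (my own illustration, not from anything above): an overly flexible polynomial fit looks great when scored against its own training data, while held-out cross-validation exposes the overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)  # truly linear, plus noise

def mse(degree, x_tr, y_tr, x_te, y_te):
    """Fit a polynomial on (x_tr, y_tr); return mean squared error on (x_te, y_te)."""
    coefs = np.polyfit(x_tr, y_tr, degree)
    return np.mean((np.polyval(coefs, x_te) - y_te) ** 2)

# "Assess fit versus the training set": fit on all data, score on the same data.
train_err = mse(15, x, y, x, y)

# "Properly held out": 5-fold cross-validation, scoring only unseen points.
folds = np.array_split(rng.permutation(x.size), 5)
cv_err = np.mean([mse(15,
                      np.delete(x, idx), np.delete(y, idx),
                      x[idx], y[idx])
                  for idx in folds])

print(train_err, cv_err)  # in-sample error understates out-of-sample error
```

The degree-15 fit is deliberately too flexible for 40 points, so the two assessments disagree sharply; a statistician staring only at training fit would be misled in exactly the way machine-learning bakeoffs are designed to catch.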

Another difference is in scale and approximation. Most machine learning researchers are comfortable building approximate models at scale, whereas statistics tends to concentrate on getting theoretically correct results at more modest scales. Contrast the interest in variational methods (a kind of efficient and scalable approximate Bayes) across the two communities.

There’s also a third camp of engineers operating under the rubric of “quantifying uncertainty” or “compressed sensing”, which can be reduced to statistics. And there’s a fourth camp of mostly engineers operating under the “information theory” banner, which again has a statistical underpinning (Cover and Thomas’s book is good at pointing out the similarities).

I find the people most strongly positioned to make contributions are people like the authors of the textbooks above — those who have skill in both stats and in computation.

• Daniel Gotthardt says:

Bob:

All applied statisticians, regardless of which background they originally had (maths/statistics, computer science, or some specialist field), tend to have some capability in both stats and in computation, but they vary a lot in depth and ratio, of course. People really interested in applied statistics from sociology or political science, for example, tend to learn statistical computing or statistical theory mostly by themselves. That’s at least my impression.

• StatsPhDStudent says:

I worked at a tech company and had the title data scientist. I wasn’t crazy about the title, because most people don’t know what it is (not that they really understand what a statistician is either) and it’s associated with a lot of charlatanism. I have been caught using air quotes when mentioning my title and have told a police officer issuing me a citation that my profession is “programmer” rather than use the phrase. But it’s not clear that there’s a better title for these jobs. What would you suggest?

• Anon says:

Data Analyst: Someone who analyzes data.

• StatsPhDStudent says:

That’s one aspect of the job. It doesn’t really capture other requirements, e.g. automating aspects of the data analysis/processing and integrating it into a production system.

• Anon says:

Indeed, that flaw would make “data analyst” stand out from other job titles like “teacher”, “doctor”, “fighter pilot”, “senator”, “secretary”, whose names fully capture all the requirements of their day-to-day activities.

• zbicyclist says:

I wish the term was “data engineer”. I think this term more appropriately captures the relentlessly applied mentality that is at the heart of this interconnected set of terms (data science, big data, predictive analytics …)

You don’t want the bridge you are constructing (your model) to fall down when carrying traffic (actual application), but you aren’t terribly interested in, say, statistical theory for its own sake.

• Anon says:

Since “data scientist” and “data engineer” are trying to piggyback on the good names of “scientist” and “engineer”, then why stop there?

How about “data warrior”, “data celebrity”, or my personal favorite “data lama”.

• Corey says:

My favorite is “Data Blackbelt” — a suggestion of my wife’s.

• Anon says:

what about “data general” or the similar but completely different “data generalist”.

Some dishonorable mentions:

“data do-right”

“data czar/czarina”

“data masseuse”

“data tamer”

“data khan”

• K? O'Rourke says:

Or the Bruce Lee version which would be “Data Master” as in “Careful he knows data”.

• Corey says:

“Data-fu”, shurely?

• Corey says:

The problem with “data engineer” is that it carries a hint of a suggestion that one’s job is to engineer data…

I do agree that engineering is the heart of the matter, though.

• leeroy says:

I think people use the title “Data Scientist” to distinguish themselves from statisticians who can’t program and don’t understand algorithms. For example, I would say that 5-10% of the experienced statisticians I interview can write down the likelihood for a logistic regression. It’s even rarer to find someone who can do a join in SQL or explain what an object is.
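For reference, the likelihood leeroy has in mind is, on the log scale, ℓ(β) = Σᵢ [yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)] with pᵢ = 1/(1 + exp(−xᵢ·β)). A minimal numpy sketch, on made-up data of my own:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Logistic-regression log-likelihood:
    sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ],
    with p_i = 1 / (1 + exp(-x_i . beta)).
    Computed via logaddexp for numerical stability:
    log(p_i) = -logaddexp(0, -z_i) and log(1 - p_i) = -logaddexp(0, z_i)."""
    z = X @ beta
    return np.sum(-y * np.logaddexp(0.0, -z) - (1 - y) * np.logaddexp(0.0, z))

# Toy check: an intercept column plus one predictor, three observations.
X = np.array([[1.0, 0.5],
              [1.0, -1.0],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# At beta = 0 every p_i = 1/2, so the log-likelihood is 3 * log(1/2).
print(log_likelihood(np.zeros(2), X, y))  # -2.0794...
```

The beta-zero check is a handy sanity test: whatever the outcomes are, a model that predicts 1/2 for everything scores n·log(1/2).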

3. I really didn’t mean to sidetrack this thread into the name “data scientist”. The reason I think it’s funny is because there’s no contrast set — “non-data scientist”? But then I don’t like the “scientist” in “computer scientist” either, because the computer science I practice is a combination of applied math, design, and engineering. Nothing really empirical about it unless you count wrestling with compilers as mostly black boxes.

• Anon says:

Adding “science” to non-sciency things does seem cheap. We should replace these barbarisms with something plain spoken. For example, “political science” should become “civics”, while “Social Science” becomes “socialism”.

“Computer scientist”, “data scientist”, and “management scientist” would then become “coder”, “quant”, and “mountebank” respectively.

4. numeric says:

Worth a look:

http://www.moore.org/programs/science/data-driven-discovery

Data-Driven Discovery

We empower data scientists in academic research.

Today’s scientific instruments, sensors and computer simulations are producing complex data at exponential rates—creating a virtual data deluge. Although these data represent an unprecedented resource, their size and complexity are overwhelming scientists’ current practices to extract useful information.

Effectively harnessing these large and complex scientific datasets requires better tools, practices and new solutions. We need fundamentally different techniques for taking advantage of modern science data developed by an emerging, multidisciplinary type of research—data science.

Data scientists combine scientific expertise, computational knowledge and statistical skills to solve critical problems and make new discoveries. The research community recognizes the need for these skills, but the lack of academic incentives creates a critical shortage of practitioners. In other words, science may be data-rich, but will remain discovery-poor without the institutional commitment, people-power and technology needed to mine the data and reveal hidden breakthroughs.

That’s why we are supporting the people who innovate around data-driven discovery—through the creation of data science hubs at major research universities, and through investigator awards and data science projects.

5. Phillip M. says:

So within this concept of data science, I see facets such as data generation, data ecosystem engineering (for large and streaming data – aka big data, nothing new), machine learning, more strictly object-oriented programming (OOP), a bit of high-performance computing, and automation. Today’s statistical realm appears to take on at least the data generation and the consumption of ‘big data’ (static or streaming) parts of this, and OOP languages are far more commonplace (folks are seeing le$$ value in expen$ive proprietary $oftware whose egos and base languages are, well, rather confusing, not to mention dated).

Machine learning in my mind would be the analogue of what data mining has traditionally employed, though these algorithms have been applied to known data to predict outcomes, not just to unknown data to drum up new research questions. Still, many methods in prediction are hardly explanatory (yet) – the proverbial cognitive black box prevails – yet in accuracy they many times outperform their analogous statistical counterparts.

This is where the convergence/synthesis of disciplines needs to operate, as I think both provide complementary solutions. I’ve been doing a bit of work in the vein of evidence-based ‘best practices’ when it comes to, say, model development. From my lit review to date, from polling my current network, and from my own experience in the matter, I’m coming to the conclusion that, until the black box is able to converge with explanation, in a number of cases it may be appropriate to employ methods from both sides of the aisle – one side deployed to aid in explaining a set of outcomes, the other to predict them. To what degree one is likely to inform the other, however, is a complex problem (case in point: multivariate prediction – GLM vs NN – the former is far better at explaining contributions to a system that defines the likelihood of an outcome, the latter generally (of course not always) more precise/accurate at predicting it).
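As a concrete instance of the “explanation” side of that tradeoff, here’s a toy numpy sketch (my own example, with simulated data and a no-intercept model chosen purely for brevity): a one-predictor logistic GLM fit by gradient ascent yields a coefficient with a direct odds-ratio reading, the kind of interpretability a neural net’s weights don’t offer.

```python
import numpy as np

# Simulate one predictor with a known effect on the log-odds scale.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
true_beta = 1.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_beta * x)))

# Gradient ascent on the logistic log-likelihood.
beta = 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-beta * x))
    beta += 0.1 * np.mean((y - p) * x)  # score function of the logistic model

# The GLM payoff: exp(beta) is the multiplicative change in odds per unit of x,
# a statement about the system, not just a prediction.
odds_ratio = np.exp(beta)
print(beta, odds_ratio)  # beta lands near the true value of 1.5
```

A neural net fit to the same data might edge this out on predictive accuracy, but nothing in its weight matrices translates into a one-line substantive claim like the odds ratio does.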

I agree with the comment about charlatans running rampant through the aisles. In this camp, it has been thought that with ‘more, bigger data’ statistics becomes obsolete. This is quite a bastardization of information theory (more data = more information). However, this notion has been the marketing swing of various (quite successful) vendors, one that has been batting 1.000 in shaping both how data are consumed and how business information is developed from those data (basically just the same untested summary reports / data aggregations and graphics we’ve been throwing darts at for years). I’ve known this approach to produce some undesirable and even downright nasty outcomes (stemming from the mentality “it’s so easy, an 8th grader could use it”).