Hype can be irritating but sometimes it’s necessary to get people’s attention (as in the example pictured above). So I think it’s important to keep these two things separate: (a) reactions (positive or negative) to the hype, and (b) attitudes about the subject of the hype.
Overall, I like the idea of “data science” and I think it represents a useful change of focus. I’m on record as saying that statistics is the least important part of data science, and I’m happy if the phrase “data science” can open people up to new ideas and new approaches.
Data science, like just about any new idea you’ve heard of, gets hyped. Indeed, if it weren’t for the hype, you might not have heard of it!
So let me emphasize that, in my criticism of some recent hype, I’m not dissing data science; I’m just trying to help people out a bit by pointing out which of their directions might be more fruitful than others.
Yes, it’s hype, but I don’t mind
Phillip Middleton writes:
I don’t want to rehash the Data Science / Stats debate yet again. However, I find the following post quite interesting from Vincent Granville, a blogger and heavy promoter of Data Science.
I’m not quite sure if what he’s saying makes Data Science a ‘new paradigm’ or not. Perhaps it is reflective of something new apart from classical statistics, but then I would also say the same of Bayesian analysis as a paradigm (or at least a still-budding movement) itself. But what he alleges (that ‘Big Data’ by its very existence necessarily implies that the cause of a response/event/observation can be ascertained, seemingly without any measure of uncertainty) seems rather ‘over-promising’ and hypish.
I am a bit concerned with what I think he implies regarding ‘black box’ methods: that is, blind reliance upon them by those who are not technically proficient. The notion that one should always trust ‘the black box’ is not in alignment with reality.
He does appear to discuss dispensing with p-values. In a few cases, like SHT, I’m not totally inclined to disagree (for reasons you speak about frequently), but I don’t think we can be quite so universal about it. That would pretty much throw out almost every frequentist test with respect to comparison, goodness-of-fit, what have you.
Overall I get the feeling that he’s presenting the ‘new’ era as one of solving problems with certainty, which seems more the ideal than the reality.
What do you think?
OK, so I took a look at Granville’s post, where he characterizes data science as a new paradigm “very different, if not the opposite of old techniques that were designed to be implemented on abacus, rather than computers.”
I think he’s joking about the abacus but I agree with this general point. Let me rephrase it from a statistical perspective.
It’s been said that the most important thing in statistics is not what you do with the data, but, rather, what data you use. What makes new statistical methods great is that they open the door to the use of more data. Just for example:
- Lasso and other regularization approaches allow you to routinely throw in hundreds or thousands of predictors, whereas classical regression models blow up at that. Now, just to push this point a bit, back before there was lasso etc., statisticians could still handle large numbers of predictors; they’d just use other tools such as factor analysis for dimension reduction. But lasso, support vector machines, etc., were good because they allowed people to more easily and more automatically include lots of predictors.
- Multiple imputation allows you to routinely work with datasets with missingness, which in turn allows you to work with more variables at once. Before multiple imputation existed, statisticians could still handle missing data but they’d need to develop a customized approach for each problem, which is enough of a pain that it would often be easier to simply work with smaller, cleaner datasets.
- Multilevel modeling allows us to use more data without having that agonizing decision of whether to combine two datasets or keep them separate. Partial pooling allows this to be done smoothly and (relatively) automatically. This can be done in other ways but the point is that we want to be able to use more data without being tied up in the strong assumptions required to believe in a complete-pooling estimate.
And so on.
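To make the first bullet concrete, here’s a minimal sketch of my own (an illustration, not anything from Granville’s post): with 300 predictors and only 100 observations, classical least squares has no unique solution, but the lasso, implemented here as plain coordinate descent with soft-thresholding, happily returns a sparse fit.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via coordinate descent with soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                      # running residual, y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]    # add back predictor j's contribution
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(0)
n, p = 100, 300                       # far more predictors than observations
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]   # only 5 predictors matter
y = X @ true_beta + 0.5 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=40.0)
print("nonzero coefficients:", np.count_nonzero(beta_hat))
```

The penalty `lam` is set by hand here; in practice you’d pick it by cross-validation. The point stands regardless: the procedure zeroes out nearly all of the 300 coefficients automatically, which is exactly the “more predictors, handled routinely” story above.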
Similarly, the point of data science (as I see it) is to be able to grab the damn data. All the fancy statistics in the world won’t tell you where the data are. To move forward, you have to find the data, and you need to know how to scrape and grab data and move it from one format into another.
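For instance (a toy sketch of my own, with made-up data), just getting a messy comma-separated extract into a usable structure is the kind of unglamorous work I mean:

```python
import csv
import io
import json

# A small messy extract with stray whitespace, the kind you actually get handed.
raw = """name, enrolled ,score
Alice , yes,91
Bob,no , 78
Carol, yes,  85"""

rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    # Strip whitespace from both header names and values, then coerce types.
    clean = {k.strip(): v.strip() for k, v in rec.items()}
    rows.append({
        "name": clean["name"],
        "enrolled": clean["enrolled"] == "yes",
        "score": int(clean["score"]),
    })

print(json.dumps(rows, indent=2))
```

None of this is statistics, but until it’s done there is nothing to analyze.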
On the other hand, he’s wrong in all the details
But I have to admit that I’m disturbed by how much Granville gets wrong. His buzzwords include “Model-free confidence intervals” (huh?), “non-periodic high-quality random number generators” (??), “identify causes rather than correlations” (yeah, right), and “perform 20,000 A/B tests without having tons of false positives.” OK, sure, whatever you say, as I gradually back away from the door. At this point we’ve moved beyond hype into marketing.
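On that last one, the arithmetic alone is damning. Here’s a quick simulation (mine, not his): run 20,000 A/B tests in which the two arms are identical, apply a standard two-proportion z-test at the usual 0.05 level, and you get on the order of a thousand “significant” results, every one of them a false positive.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tests, n_per_arm = 20_000, 1_000

# Every experiment is truly null: both arms convert at exactly 10%.
a = rng.binomial(n_per_arm, 0.10, size=n_tests)
b = rng.binomial(n_per_arm, 0.10, size=n_tests)

# Two-proportion z-test for each experiment, vectorized.
p_pool = (a + b) / (2 * n_per_arm)
se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
z = (a / n_per_arm - b / n_per_arm) / se

false_positives = int(np.sum(np.abs(z) > 1.96))
print("false positives out of 20,000 null tests:", false_positives)
```

Multiple-comparisons corrections (Bonferroni, false discovery rate, or hierarchical modeling) exist precisely because of this; they don’t come for free from having lots of data.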
Can we put aside the cynicism, please?
Why some people don’t see the unfolding data revolution?
They might see it coming but are afraid: it means automating data analyses at a fraction of the current cost, replacing employees by robots, yet producing better insights based on approximate solutions. It is a threat to would-be data scientists.
Ugh. I hate that sort of thing, the idea that people who disagree with you do so for corrupt reasons. So tacky. Wake up, man! People who disagree with you aren’t “afraid of the truth”; they just have different experiences from yours, different perspectives. Your perspective may be closer to the truth—as noted above, I agree with much of what Granville writes—but you’re a fool if you so naively dismiss the perspectives of others.