Don’t, don’t, don’t, don’t . . . We’re brothers of the same mind, unblind

Hype can be irritating but sometimes it’s necessary to get people’s attention. So I think it’s important to keep these two things separate: (a) reactions (positive or negative) to the hype, and (b) attitudes about the subject of the hype.

Overall, I like the idea of “data science” and I think it represents a useful change of focus. I’m on record as saying that statistics is the least important part of data science, and I’m happy if the phrase “data science” can open people up to new ideas and new approaches.

Data science, like just about any new idea you’ve heard of, gets hyped. Indeed, if it weren’t for the hype, you might not have heard of it!

So let me emphasize that, in my criticism of some recent hype, I’m not dissing data science; I’m just trying to help people out a bit by pointing out which directions might be more fruitful than others.

Yes, it’s hype, but I don’t mind

Phillip Middleton writes:

I don’t want to rehash the Data Science / Stats debate yet again. However, I find the following post, from Vincent Granville, a blogger and heavy promoter of Data Science, quite interesting.

I’m not quite sure if what he’s saying makes Data Science a ‘new paradigm’ or not. Perhaps it is reflective of something new apart from classical statistics, but then I would say the same of Bayesian analysis as paradigmatic (or at least a still-budding movement) itself. But what he alleges – i.e., that ‘Big Data’ by its very existence necessarily implies that the cause of a response/event/observation can be ascertained, seemingly without any measure of uncertainty – seems rather ‘over-promising’ and hypish.

I am a bit concerned with what I think he implies regarding ‘black box’ methods – that is, the blind reliance upon them by those who are not technically proficient. I feel the notion that one should always trust ‘the black box’ is not in alignment with reality.

He does appear to discuss dispensing with p-values. In a few cases, like SHT, I’m not totally inclined to disagree (for reasons you speak about frequently), but I don’t think we can be quite so universal about it. That would pretty much throw out most every frequentist test with respect to comparison, goodness-of-fit, what have you.

Overall I get the feeling that he’s portraying the ‘new’ era as one of solving problems w/ certainty, which seems more the ideal than the reality.

What do you think?

OK, so I took a look at Granville’s post, where he characterizes data science as a new paradigm “very different, if not the opposite of old techniques that were designed to be implemented on abacus, rather than computers.”

I think he’s joking about the abacus but I agree with this general point. Let me rephrase it from a statistical perspective.

It’s been said that the most important thing in statistics is not what you do with the data, but, rather, what data you use. What makes new statistical methods great is that they open the door to the use of more data. Just for example:

– Lasso and other regularization approaches allow you to routinely throw in hundreds or thousands of predictors, whereas classical regression models blow up at that (see the sketch after this list). Now, just to push this point a bit: back before there was lasso etc., statisticians could still handle large numbers of predictors, they’d just use other tools such as factor analysis for dimension reduction. But lasso, support vector machines, etc., were good because they allowed people to more easily and more automatically include lots of predictors.

– Multiple imputation allows you to routinely work with datasets with missingness, which in turn allows you to work with more variables at once. Before multiple imputation existed, statisticians could still handle missing data but they’d need to develop a customized approach for each problem, which is enough of a pain that it would often be easier to simply work with smaller, cleaner datasets.

– Multilevel modeling allows us to use more data without having that agonizing decision of whether to combine two datasets or keep them separate. Partial pooling allows this to be done smoothly and (relatively) automatically. This can be done in other ways but the point is that we want to be able to use more data without being tied up in the strong assumptions required to believe in a complete-pooling estimate.

And so on.
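
To make the first of those bullets concrete, here is a minimal sketch (simulated data; the scikit-learn call and the penalty value are my illustrative choices, not anything from the post above) of the regime where the lasso shines: far more candidate predictors than observations, where classical least squares has no unique solution.

```python
# Lasso with p >> n: 1000 candidate predictors, 100 observations.
# Classical OLS has no unique solution here; the lasso fits anyway
# and shrinks most coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, -1, 0.5]    # only 5 predictors actually matter
y = X @ beta + rng.normal(size=n)

fit = Lasso(alpha=0.1).fit(X, y)    # alpha is an arbitrary illustrative value
print("nonzero coefficients:", np.sum(fit.coef_ != 0))
```

The point is not this particular penalty but the workflow it enables: you can throw in every predictor you can grab and let the regularization sort out the rest.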

Similarly, the point of data science (as I see it) is to be able to grab the damn data. All the fancy statistics in the world won’t tell you where the data are. To move forward, you have to find the data, you need to know how to scrape and grab and move data from one format into another.

On the other hand, he’s wrong in all the details

But I have to admit that I’m disturbed by how much Granville gets wrong. His buzzwords include “Model-free confidence intervals” (huh?), “non-periodic high-quality random number generators” (??), “identify causes rather than correlations” (yeah, right), and “perform 20,000 A/B tests without having tons of false positives.” OK, sure, whatever you say, as I gradually back away from the door. At this point we’ve moved beyond hype into marketing.
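
To see why that last claim invites skepticism, consider the arithmetic: 20,000 tests at the conventional 0.05 level yield about a thousand false positives even when every null hypothesis is true. A minimal simulation sketch (pure noise, nothing from Granville’s setup):

```python
# 20,000 A/B tests in which there is never a real effect:
# how many come out "significant" at the 0.05 level by chance alone?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
false_positives = 0
for _ in range(20_000):
    a = rng.normal(size=100)    # group A: noise only
    b = rng.normal(size=100)    # group B: same distribution
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1
print(false_positives)          # roughly 1,000 = 0.05 * 20,000
```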

Can we put aside the cynicism, please?

Granville writes:

Why some people don’t see the unfolding data revolution?
They might see it coming but are afraid: it means automating data analyses at a fraction of the current cost, replacing employees by robots, yet producing better insights based on approximate solutions. It is a threat to would-be data scientists.

Ugh. I hate that sort of thing, the idea that people who disagree with you do so for corrupt reasons. So tacky. Wake up, man! People who disagree with you aren’t “afraid of the truth,” they just have different experiences than yours, they have different perspectives. Your perspective may be closer to the truth—as noted above, I agree with much of what Granville writes—but you’re a fool if you so naively dismiss the perspectives of others.

P.S. I just noticed this post is coming up, and I was reading it—based on the title, I had no idea what it would be about and no recollection of having written it! But the name Vincent Granville rang a bell . . . it turns out that just a few days ago (i.e., a couple months after writing the above post), I happened to get a completely unrelated email from someone else asking about this guy, and the funny thing is, I replied that I’d never heard of Vincent Granville but I thought he had some interesting and some silly things to say. And this other correspondent and I had an email exchange which I decided I’d blog. I’d post that email here but I think it would dilute the points above. So it will appear in a couple of months. It’s funny how I completely forgot this whole thing. Good that I blog; it’s an excellent memory extender.

P.P.S. From comments, I learn that Granville seems to have a habit of propping up his reputation via paid reviews and sock puppets. So perhaps people are taking his writings too seriously: he seems to have had some success grabbing the “data science” label and getting a bunch of hits to his site, but that doesn’t mean that he knows what he’s talking about. Indeed, if he’s actually making it up as he goes along, that would explain why so much of what he writes makes no sense.

The best analogy, perhaps, is to various business-advice books, poker manuals, and fad diets that try to bully the reader into submission with emphatic advice, unpolluted by evidence beyond the apparent success or slimness of the author.

P.P.P.S. Granville seems to be making stuff up about me. I have no interest in dealing with this sort of person and I don’t plan to post anything more about him.

46 thoughts on “Don’t, don’t, don’t, don’t . . . We’re brothers of the same mind, unblind”

  1. “identify causes rather than correlations”

    If I were to name the biggest contribution of econometrics to the broader statistical literature, I might say it is the realization that statistical methods on their own can never identify causal effects. Only statistical methods married to critical thinking about the nature of the world can speak to causality. In fact, you can almost predict the quality of a paper by the extent to which an author touts their method as the reason they can identify causal effects, instead of how their method allows them to harness some sort of interesting variation in the world (some change/event/discontinuity/whatever that happened or exists in the world).

    The source of the variation in the data is the thing – where did the differences in X come from that allow you to estimate the effect of X on Y? That is why correlation does not imply causation; only correlation along with critical thinking about the relationship between the world, the data, and the method implies causation. I don’t think that is something that pure computation can ever do, and if you say you have a method that can identify causal effects in any situation, my first reaction is that you have no idea what you are talking about.

    • What exactly is the utility of Granger causality, and is it more refined than just your garden-variety correlation? What about the Judea Pearl framework; isn’t that claiming to “identify” causality? Or not?

      • My take on Pearl is that it’s more along the lines of “if you have the causal structure pre-identified, this is how you estimate it”.

        The IC algorithm stuff seems to be a dead end.

        • Naive question: If you have the causal structure pre-identified, what’s left to be estimated? The differential impact contribution of multiple causes?

        • OK, then is it fair to say that the Pearl formalism has nothing at all to do with identifying causality in the first place? The causal structure was entirely an assumption.

        • I don’t know. Suppose you have two causal models; Pearl’s formalism, I think, will allow you to identify in what ways they predict different outcomes, and then you could go and look in that corner of the world and see which is more accurate… thereby helping to identify which one is correct.

          It’s the old “make a model, test it against existing data, then make new predictions and see if they are accurate” – that last bit is often overlooked, and I think it’s central to Pearl’s contribution, but honestly I’m not very knowledgeable about Pearl’s stuff.

      • I have never used the concept/technique of Granger causality. I was really thinking of the development of “quasi-experimental econometrics” and the focus on identifying variation in making causal arguments (and on developing models that are intended mostly to latch on to particular variation in the data – IV, RD, various differencing estimators).

        And yes, I believe that Pearl’s framework is intended to identify causal effects. But as I say below, I generally see that contribution as an epistemological contribution, or a framework for doing causal inference, but I was trying to discuss the “practice” of causal inference. Partly I can’t judge Pearl’s influence (good or bad) on that because I don’t see a lot of really compelling work that uses that framework (this is not in any way related to the quality of Pearl’s work, just an observation about the methods and frameworks used by the practitioners I think do the most interesting and believable work. But like I said below, I readily admit my biases and my blindnesses here).

      • If X and Y have a hidden common cause Z, but the effect of Z on X is felt sooner than Z’s effect on Y, this will induce a time-lagged correlation between X and Y of the sort that a test for Granger-causality will detect. That is, in this scenario X is a Granger-cause of Y, but by construction it’s not a *cause* cause. So Granger-causality is actually Granger-usefulness-for-prediction, or maybe we could just call it time-lagged mutual information.
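
        That scenario is easy to simulate. A minimal sketch (the lags and noise levels here are my own toy choices) in which Z drives X after one step and Y after two, so that yesterday’s X predicts today’s Y almost perfectly even though X has no causal effect on Y:

        ```python
        # Hidden common cause Z: X feels Z's effect one step before Y does,
        # so lagged X predicts Y (X "Granger-causes" Y) with no X -> Y arrow.
        import numpy as np

        rng = np.random.default_rng(2)
        T = 5000
        z = rng.normal(size=T)
        x = np.zeros(T)
        y = np.zeros(T)
        for t in range(2, T):
            x[t] = z[t - 1] + 0.1 * rng.normal()   # Z hits X after 1 step
            y[t] = z[t - 2] + 0.1 * rng.normal()   # Z hits Y after 2 steps

        # Lag-1 cross-correlation: x[t] against y[t+1].
        r = np.corrcoef(x[2:-1], y[3:])[0, 1]
        print(round(r, 2))   # close to 1, despite no causal path from X to Y
        ```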

        Pearl’s causal inference calculus just straight up assumes that events can and do have direct causes. Given that axiom, it permits you to compute all kinds of interesting things, e.g., the expected consequences of arbitrary interventions in arbitrarily complex causal networks.

      • Rahul et al:

        The main contribution of Pearl’s framework is to provide an unambiguous language for speaking about causality. This is a major achievement.

        For the longest time statistics basically forbade talking about causality (cf Pearson). Indeed, it did not even include a formal notation for this. For example, there was no simple way to distinguish P(Y,X) — the passively observed distribution of two variables in Nature — from P(Y,do(X)) — the distribution you would observe had you directly manipulated X.

        Indeed, to this date most statistical programming languages — including Stan — do not have the semantics to differentiate Y = a + bX + u from Y := a + bX + u. The latter is a structural equation that says X causes Y. The former does not distinguish between causation and prediction. Thus, I can use the first equation to predict Y on the basis of X, but I cannot know from the equation alone what will happen if I make deliberate changes in X. The equation is ambiguous. This is a major gap.
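
        A minimal simulation sketch of that ambiguity (my own toy example, not from Pearl): with a hidden confounder U, the regression of Y on X is fine for prediction but says nothing about what happens under do(X).

        ```python
        # Toy structural model: U causes both X and Y; X has NO effect on Y.
        #   X := U + noise
        #   Y := 2U + noise
        # Passively observed, Y co-varies with X; under do(X = x0) it does not.
        import numpy as np

        rng = np.random.default_rng(3)
        n = 100_000
        u = rng.normal(size=n)                 # hidden common cause
        x = u + rng.normal(size=n)
        y = 2 * u + rng.normal(size=n)

        print(np.polyfit(x, y, 1)[0])          # observational slope: about 1

        # Intervention: set X by fiat, severing the U -> X arrow. Y's own
        # structural equation is unchanged, so E[Y | do(X = 5)] is still 0.
        y_do = 2 * u + rng.normal(size=n)      # the forced X never enters Y
        print(y_do.mean())                     # about 0
        ```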

        Neyman and Rubin tried to fill this gap using potential-outcome notation. On the face of it this is a great achievement. But when you dive deep on it you realize that it is trying to fit a square peg into a round hole. For one, the notation is incredibly clunky for all but the simplest of cases. For another, it sets up the problem of causal inference as one of missing data. This is not only contra natura, in that interventions supposedly _reveal_ potential outcomes without actually changing anything in Nature. It also has it all the wrong way round, in that problematic missing data is a causal problem, and not the other way around (e.g., see http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2302735).

        DAGs get a bad rep because, by providing a simple language with which to encode causal assumptions, people think you need to know the true model, etc… Or as Anonymous above puts it, “if you have the causal structure pre-identified, this is how you estimate it”. This is a mischaracterization. All empirical work in causality makes some identification assumption, so implicitly everyone is assuming they know the model. DAGs simply help communicate this more explicitly.

        But not only that, having an unambiguous language allows us to push the research envelope. Just as Arabic numerals facilitated modern accounting relative to unwieldy Roman numerals, so DAGs can give us traction in problems where potential outcomes become too complex. Examples include a better understanding of confounding, selection bias, Simpson’s paradox, mediation, missing data, external validity; the list goes on. But many of these are recent, and not on most people’s radar.

        • Ok, thanks! It’s interesting. I’d love for it to get on people’s radar more so that I can actually see that stuff in use more often in problems I care about.

          It’s been sort of a rough journey for me: As an undergrad I took causality for granted, then progressed to a cynical mindset where true causality was almost impossible to prove & all statistics gave you was more & more refined correlations.

          Then in a “correlation doesn’t imply causation” debate I was introduced to Granger causality & someone pointed me to Judea Pearl’s work, which seemed like a glimmer in the darkness: Here was someone actually trying to formally establish causation as distinct from mere correlation.

          But now I’m conflicted because I don’t see much persuasive use of those DAGs etc. in real world work that I care about.

    • I agree with the conclusion, but how is this an “econometrics” contribution? Angrist popularized the application of causal inference, but aren’t the foundations due to Rubin, Pearl, and others outside economics?

      • I agree that lots of people in lots of different fields have contributed mightily to the discourse on causal inference – from John Snow to Rubin and Pearl and Angrist. But I do think that it has been empirical economists who have been at the forefront of doing causal inference. People like David Card and Alan Krueger – applied people who push the methodological frontier and have shown how to do good, smart causal inference on really important empirical questions.

        Sure, there are Political Scientists, and probably even some Epidemiologists and Sociologists (dig!), who do incredibly good empirical work that make strong causal arguments. But setting aside who developed the epistemological framework, I think econometrics as a practice has been the statistical discipline that has shown how to DO causal inference.

        Whenever I write this kind of stuff on the blog, I like to remind people that applied micro is my field, and so I am obviously biased, and obviously see things from that perspective, and obviously know that literature best. I’d be interested if there were people here who thought, for instance, that the best and most believable causal inference in empirical social science research was in, say, Anthropology or Public Health.

  2. People who disagree with you aren’t “afraid of the truth,” they just have different experiences than yours, they have different perspectives.

    That’s a pretty naive statement. Evolution, global warming, supply-side economics are current areas where one of our two major political parties’ “perspective” is simply to ignore evidence (on the other hand, the “truthers” certainly aren’t afraid of the truth). As far as academics go, Krugman has been eviscerating fresh-water economists for their atheoretical pronouncements on the current economic situation. This is justified because these guys (and they are inevitably male) simply don’t have any theory/model for their policy prescriptions (for example, IS-LM has been predicting no inflation for six years, while the fresh-water crew has been warning of inflation for six years). And don’t forget the fiducial argument, if you want an example from statistics.

    • Numeric:

      I agree with you in some settings; indeed, I’ve seen people sometimes write that certain topics should not be studied because it would be a bad thing for people to learn the truth (an obvious example here is nuclear and biological weapons). But in the situation described by Granville, I don’t believe it. People who disagree with him on statistics may well be wrong, but I strongly doubt they are “afraid of the truth.”

      And, yes, I may be naive, but cynics can be naive too. The cynical explanation that “my opponents disagree with me from ulterior motives” can often be a naive view that does not recognize the existence of legitimate differences in opinion. I’m reminded of my frustrating blog discussion a couple years ago with some commenters who, along with Charles Murray, seemed not to believe that I legitimately had no problem with mothers who brought up children without a father. They seemed to have this view that I secretly agreed with them but simply refused to admit it.

      • You should have qualified it with “sometimes,” as in “Sometimes people who disagree with you aren’t ‘afraid of the truth’; they just have different experiences than yours, they have different perspectives.” This qualification is the essence of your reply above, and with it the statement is perfectly unobjectionable. Note Granville uses “some people” and “might” in his rather sweeping claims.

  3. I realize you write your blog with a lag of a couple months, but Granville posted something related to this in the past week. Perhaps that was what your correspondent had written you about? http://www.datasciencecentral.com/profiles/blogs/data-science-without-statistics-is-possible-even-desirable

    I’m new to him, but I’m shocked that he has a PhD in statistics and says the things he does in that more recent post. Those “model-free confidence intervals” you mention? His writing is very jargony and hard to follow, so maybe what he’s doing is more sensible than I give him credit for, but that procedure appears to involve partitioning data into bins at random and looking at the distribution of the means from each partition: http://www.datasciencecentral.com/profiles/blogs/black-box-confidence-intervals-excel-and-perl-implementations-det That’s basically a bootstrap-like approach without resampling that happens to estimate the distribution of the wrong quantity (the sampling variability of the mean for a smaller sample size than he actually has).
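
    For the curious, here is a minimal sketch of why that goes wrong (my reconstruction of the procedure as described above, not Granville’s actual code): the spread of the bin means reflects the sampling variability of a mean of n/k observations, which is wider than the variability of the full-sample mean by roughly a factor of sqrt(k).

    ```python
    # Random-binning "confidence interval" vs. an ordinary bootstrap.
    import numpy as np

    rng = np.random.default_rng(4)
    n, k = 10_000, 100
    data = rng.normal(size=n)

    # Binning: partition the data into k random bins, take each bin's mean.
    bin_means = rng.permutation(data).reshape(k, n // k).mean(axis=1)
    print(bin_means.std())          # about 1/sqrt(n/k) = 0.1

    # Bootstrap: resample all n points with replacement, take the mean.
    boot = [rng.choice(data, size=n, replace=True).mean() for _ in range(1000)]
    print(np.std(boot))             # about 1/sqrt(n) = 0.01, 10x narrower
    ```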

  4. Vincent Granville has a bit of a… reputation on the UK mailing list Allstat, where he posted links to his various blog posts multiple times per week and at one point solicited reviews for his book in exchange for payment, as described above. When the moderators then asked him to reduce his posting, he offered to buy(!) the non-profit mailing list:

    I would like to discuss an option to purchase Allstats – not the mailing list by itself, but rather, the platform. I have three scenarii in mind:

    1) Purchase with standard agreement spanning over three years. Nothing would change (same people managing Allstats), except that (1) I would be entitled to post up to 8 relevant announcements per week, with no more than 0 or 1 commercial announcement. I would significantly grow the mailing list by bringing in tons of data scientists, mostly from US, and might pay a monthly fee to people who manage spam (I expect their workload would increase).

    2) Licensing agreement. I pay a monthly fee for posting up to 8 announcements per week, with no more than 0 or 1 highly relevant commercial announcement (statistical training, Hadoop etc.) Unlike in the “purchase” agreement, I don’t grow the list.

    3) I create a separate data science mailing list similar to Allstats, but focusing on data science and mostly US traffic. I pay someone to manage this project – contact me if interested.

    Let me know your thoughts. Currently, I publish most of my great statistical articles elsewhere, for instance the most recent ones on continued fractions over linear regression for predictive modeling (http://bit.ly/1llLKNx) or the creation of the Data Science Reseach Center (http://bit.ly/1hOlNGM) which could be of interest to many statistical students, researchers, authors and academics. I think this is a sad fact for the European statistical community, and if you are fine not having access to this type of content, then maybe it means that Allstats is not the right platform to acquire. If instead you like the idea, just say it!

    For your information, I discovered Allstats when I was on a post-doc at Cambridge UK (stats lab), in 1995. I have been a member since 1995, subscribed through one of the oldest email addresses still in existence and used for its original purpose – [email protected].”

    He seems to have had a solid stats background back in the day, but his current credibility is somewhat questionable.

    • “He seems to have had a solid stats background back in the day”

      A lot of people claim a lot of things about themselves. I have met a lot of people who characterize themselves as “fluent in Japanese” (oddly, these are always Americans), where their actual on-the-ground fluency level is pretty laughable.

      I couldn’t find any clear statements about what his educational background is. He says on his LinkedIn page: “Facultés universitaires ‘Notre-Dame de la Paix’ Ph.D., Statistics, Mathematics, Science, 1983 – 1993”. Then he lists two courses he did there: “Stochastic Geometry, Markov Processes.” I find it odd that a guy does a PhD somewhere, over 10 years, and lists two courses under that PhD.

      Also, I searched for this mysterious uni I have never heard of: Facultés universitaires ‘Notre-Dame de la Paix’.

      I ended up at a weird Jesuits page in Belgium:

      http://www.jesuites.be/Facultes-universitaires-Notre-Dame.html

      They describe themselves as companions of Jesus. Further research reveals that the relevant university is the University of Namur, located in Namur, the capital of Wallonia. Huh? I have never heard of this university. Wikipedia enlightens:

      http://en.wikipedia.org/wiki/Universit%C3%A9_de_Namur

      So what is this “Facultés universitaires ‘Notre-Dame de la Paix’”? It is not listed under this university. Further study of the Wikipedia page shows that this refers to one of those interminable and inscrutably French-style mergers and demergers of a bunch of universities, and it will be dissolved in 2014 (I did say interminable).

      So he says he’s got a PhD in Statistics, Mathematics, but the Uni of Namur doesn’t even have a Statistics or Mathematics department. So how did this PhD come about? I want to see the PhD dissertation.

      Similarly, he says he did a “Post doctorate, Statistics 1995 – 1996, Cambridge University” (I have never heard the phrase “post doctorate” before; it sounds like a phrase someone who is not from academia would use), and then lists four courses he did there: “Time Series Modeling, Markov Chain Monte Carlo, Hierarchical Bayesian Models, Stochastic Point Processes”. Do postdocs list the courses they did while being a postdoc?

      Also, I doubt that Cambridge would hire a guy as a postdoc coming from whatever this FUNDP thing is. Isn’t there any Cambridge statistician reading this blog? Can someone check if he really did do a postdoc there?

      So what does this mean? He did six courses overall. Of course, it doesn’t really matter how many courses a person did, but as far as his claimed academic credentials are concerned, I’m very suspicious of such a “cv”. If I were looking at such a cv from the perspective of hiring a statistician with a PhD, it would go straight into the garbage heap.

      My bullshit filters are all blinking red.

      • Here are some more inconsistencies: He writes:

        http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey

        “When I attended college, I stopped showing up in the classroom altogether – afterall, you could just read the syllabus, memorize the material before the exam and regurgitate it at the exam. Moving fast forward, I ended up with a PhD summa cum laude in (computational) statistics, followed by a joint postdoc in Cambridge (UK) and the National Institute of Statistical Science (North Carolina).”

        This time he has a PhD in computational statistics (on LinkedIn it is Statistics, Mathematics, and something called “Science”), and his postdoc is a “joint” one with Cambridge and NISS (North Carolina). What the heck is a joint postdoc?

        His description of his time at Cambridge is surprisingly rushed:

        “When I moved to Cambridge university stats lab and then NISS to complete my post-doc (under the supervision of Professor Richard Smith)”

        So why is NC not mentioned on his LinkedIn page?

        If I were reading between the lines (and I am), it sounds like he got kicked out of Cambridge. Doesn’t Spiegelhalter read this blog? He would know what the facts are.

        • “When I moved to Cambridge university stats lab and then NISS to complete my post-doc (under the supervision of Professor Richard Smith)”

          Richard Smith was a professor at Cambridge (specializing in optimization methods) who left to become Chairman of the Statistics Department at the University of North Carolina. He was also affiliated with the National Institute of Statistical Sciences, which is in the Research Triangle region. NISS hosts a number of postdocs. Richard Smith currently is Director of the SAMSI Institute at Duke University, one of the major mathematics research institutes funded by the National Science Foundation. He is still on the faculty at UNC (as far as I know).

      • Sounds like BS, but the “course list” you mention sounds just like a topic list. Like I might say soil liquefaction, continuum mechanics, dynamics, Bayesian applied statistics to describe my PhD work, but I don’t think I took any courses with those names.

      • I think Granville’s education is represented accurately. He had a JRSS-B paper published in 1995: http://www.jstor.org/stable/2346153 His affiliation is listed as Namur, but his corresponding address is at Cambridge. Based on the submission date of 1993, I assume this was related to his thesis work and published after graduation while he was a postdoc at Cambridge. His postdoc advisor, Richard Smith, was on leave at Cambridge from 1994-1996 but returned to UNC in 1996 and could have taken Granville back with him: http://www.unc.edu/~rls/cv.html I do find it strange that Smith has no publications with Granville and there is no indication that he ever had him as a postdoc advisee, but he doesn’t list any postdocs on his CV. Granville does have a NISS tech report from 1996: http://www.niss.org/publications/technical-reports

  5. It’s worth mentioning that Granville’s LinkedIn discussion group is the biggest analytically-focused group on LI with over 150,000 members. In addition, 1/3 of those members have joined within the last 12 months. So, his influence over a large number of people is not trivial. Of course, it is LI…but that’s still an impressive number.

    Then, too, like any number of others who came before him and will come after him, he has leveraged that following into a for-profit enterprise, although it’s not clear exactly what he’s peddling beyond his claims of providing startup consulting advice and headhunting of quant talent.

    He has a considerable… reputation among LI statisticians and analytic practitioners for tactics intentionally designed to antagonize and demean the profession, sucking disputants into arguments that typically devolve into ad hominem shouting matches. There have been many such instances of this in the past year alone, as VG himself acknowledges.

    Finally, he has often stated that he is “disruptive” and a “heretic,” drawing analogies between himself and historic figures like Galileo.

  6. Vincent Granville has been the subject of several discussions on both his blogs and LinkedIn. See this for instance: http://goo.gl/W6tjnF

    His statements are puzzling and carry the aura of being dogmatic. Some examples of puzzling statements are the idea of self-diagnosis and treatment, or the setup of a free unused-drug exchange. He stresses that not publishing or presenting his methods at conferences contributes to a more rapid dissemination of his techniques. That may be true; however, who’s keeping him honest?

    Not the platforms where he mostly publishes, because he owns them (DSC, AnalyticBridge and the LinkedIn fora). Not the comments on those sites, because he moderates them. Not the commenters because, by his own admission, some are artificially created. And offering money in exchange for positive reviews of a book speaks volumes about his ethical standing. Along these lines, his company offers a certification in data science. Apparently, one only needs to submit a resume or a LinkedIn profile to be evaluated by a one-man organization. There is no code of professional conduct as far as I can tell, let alone an evaluation committee.

    His definition of Data Science is continuously changing, in what appears to be an effort to discredit statistics and statisticians in order to elevate his own flavour of data science. He does that by creating artificial boundaries between data science and statistics. Those boundaries are created not on what genuinely separates data science from statistics, like making an app out of a model, but by separating inferential methods that are “old”, “dated”, “cave men”, “antiquated”, “arcane”, “hard to interpret”, from “modern”, “assumption free”, “with few parameters”, “stable”, “easy to understand” ones. Again, when the person evangelizing modern methods is the one who developed them, it’s a conflict of interest.

    Sadly, his platforms have a staggering number of followers. Some, especially in business, have a genuine interest in learning about data science. His blogs and fora are likely the first places they will stumble upon. He appeals to them by creating a false sense of novelty, appealing to a person’s innumeracy by questioning the need for statistics, and proclaiming the automation of everything and the “death of the statistician”. We may shake our heads and move on. Others will buy into it before they realize they were sold a lemon.

    If these are his views, so be it, but it’s not how science works. I think it’s important that we publicly raise these issues. If not us, who will?

  7. I happened to find this blog randomly today. I am neither a statistician nor a data scientist; I’m more like a data engineer who has been working on learning statistics and data science for the past year. My initial discovery process of what I needed to know about this field led me to Analytic Bridge, and then he (or his bots) invited me to join his forums and websites about a year ago. By now I have put in more work, learned more about statistics at a basic level, and earned a certificate from the University of Washington in Data Science. Compared to the folks commenting here, my education in basic data analysis is bare bones.

    Even with so little understanding of the subject, it did not take me more than a few weeks to figure out that Vincent Granville’s websites are not that useful. His website and forums look big, but you might be misjudging how much influence they actually have. Numbers do not equal influence – unless people are blindly following, and in that case it does not matter; someone else will replace him.

  8. I’m glad you wrote this, at least so that I know I’m not alone in questioning Granville’s work. For a while he was so prolific and frequently referenced that I worried it was I who was mistaken in my criticisms, or that I was missing some nuance of his writing, but I think I just found myself down a dark corridor of misinformation. He seems to pop up frequently on Google Plus, and LinkedIn. My experiences have been similar to yours, and to the majority of the other commentators here, it seems.

  9. RE: I’m on record as saying that statistics is the least important part of data science, …
    RESP: Not so wise. I interpret this to mean that either data analysis is not in your definition of data science or data analysis is not very important. A definition of data science without data analysis is not sustainable. IT and the software industry are going to see to that.

  10. RE: Similarly, the point of data science (as I see it) is to be able to grab the damn data. All the fancy statistics in the world won’t tell you where the data are.
    RESP: Again, you see data science as only IT. This would be fantastic! … if it were sustainable. Market forces are going to push data analysis into data science and we should call it statistical data science to keep it separate.

    By the way, all the fancy data pulling in the world will not analyze the data. It is as if there are two distinct applications: data management and data analysis (statistics). Statistics is for analyzing partial information problems.

  11. You need to do due diligence if you want to understand Vincent Granville. For example, with a French colleague, Granville published some very important papers on statistics. He may call himself a Data Scientist, but that Data Scientist grew out of a Statistician. If we use the Granville Transformation Model, then Data Science grew out of Statistics. Therefore, it’s not remotely a new paradigm – it’s still manipulation of numbers to describe, predict, visualize, etc. I spent considerable time investigating Mr. Granville. I called all of his previous employers and learned interesting things. There is a 10-year gap in Granville’s history. Entering the gap, he’s a scholar in statistics. Exiting the gap, he’s a renegade, a maverick, who, not understanding his own development, actually says that Data Science can exist without statistics, then gives examples that rely on statistics. He is not a renegade. There is no paradigm shift. Keeping it simple, we are all playing with numbers. That’s the bottom line.

  12. I first disputed with Dr. Granville over whether causality can be inferred from correlation (myself holding the traditional view that it cannot) and became – with several others here – mystified that the methods he promotes against statistics are all statistically based. He lost me completely when he chose to foray into one of the few things I actually know something about – astrostatistics – when he insisted that two small asteroids making a near pass to Earth only a day apart (March 5/6 2014) must have the same source, despite the scientific evidence presented to the contrary.

    I, too, investigated his credentials, but must say Mark Birenbaum has been much more thorough – his work and that of others is greatly appreciated! Like others, I had found his connection to Smith and did not find his dissertation. I did get the distinct impression that he has a strong statistical background but is now more involved in promotion than science.

    For myself, at least, the final blow was his recently announced use of sock puppets to attract women and other under-represented minorities, instead of inviting actual women and minority scholars to contribute blogs. Perhaps this is a very sore spot with me: although I am neither female nor a minority, I am involved in a very small way with education and social-justice advocacy, in which the STEM fields (Science, Technology, Engineering, Mathematics) play an important role. Justice is everyone’s business and a labor in which all of us may participate.

    For all these reasons – most especially the last – I became inclined to simply let Dr. Granville destroy his own reputation, which he appears to be doing apace. I am, of course, wrong in this: no need to write a rebuttal, for Thomas Speidel has already done so (see above). As Speidel writes, “If these are his views, so be it, but it’s not how science works. I think it’s important that we publicly raise these issues. If not us, who will?”

    What, then, are we to do? In particular, I am wondering what discussions may take place at the American Statistical Association’s upcoming Conference on Statistical Practice in February. Leave it to a time-series guy like me to recommend taking the long view, but so be it. If Dr. Granville is selling himself by sowing false discord among the statistical and data science community, he is not the only one. This is a practice issue, it is broader than Vincent Granville, and it is one that will continue to affect us, our profession, and the good we would do in the future.

    • Eric:

      I read the linked post and am confused. Most of what Granville writes in this post is pretty vague (he talks about “die-hard statisticians” etc but without any direct quotes), but at one point he writes:

      Andrew Gelman himself claimed that I stole ideas from his research.

      I don’t recall claiming any such thing. Maybe there’s something I missed?

      The thing is, this really bothers me. Granville running sock puppets and writing general statements about data science is irritating, but, whatever, the guy’s got his shtick. But to say that I claimed something that I didn’t—that’s really annoying, especially as he gives no links whatsoever to the purported claim. The likely result is that his zillions of readers will believe it.

      That’s beyond tacky. It’s rude.

      Or maybe I’m missing something? If there’s somewhere I claimed that Granville stole ideas from my research, please let me know!

      • Andrew:

        I was just bringing it to your attention given the claims being made. I don’t remember you claiming that he stole from you.

        The final paragraph of the post seems a bit odd. I compared his post to Xie’s paper, and they seemed to be doing very different things.

  13. Data Science is really not new (I have been doing this stuff for over twenty years). Certainly, technology has advanced if not transformed it, but a rose by any other name is still a rose. It has existed for many years under other pseudonyms. Its very foundation is mathematical statistics, including most (if not all) machine learning algorithms to some extent. For instance, many ensemble methods boil down to averaging. I was teaching this stuff before Granville entered the playing field. I worked on one ML algorithm, which survived in the gas and oil industry only for a short time back in 1990. Once it got close to a solution it reverted to traditional convergence methods. I was “born and raised” as a military Operations Research Analyst, and we ate this stuff for breakfast.
