Last month I wrote:
Computer scientists are often brilliant but they can be unfamiliar with what is done in the worlds of data collection and analysis. This goes the other way too: statisticians such as myself can look pretty awkward, reinventing (or failing to reinvent) various wheels when we write computer programs or, even worse, try to design software.
Andrew MacNamara followed up with some thoughts:
I [MacNamara] had some basic statistics training through my MBA program, after having completed an undergrad degree in computer science. Since then I’ve been very interested in learning more about statistical techniques, including things like GLM and censored data analyses as well as machine learning topics like neural nets, SVMs, etc. I began following your blog after some research into Bayesian analysis topics and I am trying to dig deeper on that side of things.
One thing I have noticed is that there seems to be a distinction between data analysis as approached from a statistical perspective (e.g., generalized linear models) versus from a computer science perspective (e.g., SVM), even if—as I understand it—mathematically some of the results/algorithms are the same. Many of the computer scientists I work with approach a data analysis problem by throwing as many ‘features’ at the model as possible, letting the computer do the work, and trying to get the best-performing model as measured by some cross-validation technique. On the other hand, when I was taught basic regression, the philosophy of approaching a problem was to try to understand the model driving the data, to carefully choose explanatory variables by their real-world importance as well as their statistical significance (or lack thereof—one needs to consider why variables one thought would be significant are not!), and to test for statistical issues that are known to cause problems with models or diagnostics (e.g., outliers, leverage points, non-normal residuals).
To me, a symptom of this difference in philosophies is that the machine learning software packages I have tried do not seem to output any statistics showing the relative importance or errors of the input features like I would expect from a statistical regression package. Of course given my lack of experience I could very well just be missing something obvious.
I wonder if you’ve experienced anything similar or had any thoughts on this. It seems like in the world of “big data,” machine learning techniques and philosophies are coming to dominate some types of data analysis, and I’m concerned about the depth of understanding behind some of the problems I’ve seen these methods applied to—I hope the driverless car teams can predict how their models will react to new situations!
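To make the contrast MacNamara describes concrete, here's a minimal sketch in Python (using scikit-learn and statsmodels on simulated data; the particular models and numbers are purely illustrative, not anyone's actual analysis). The first block judges a black-box fit by cross-validated prediction alone; the second fits a plain regression and reads off the coefficient table, standard errors, and residuals that he says he misses in machine learning packages:

```python
# Purely illustrative: simulated data, arbitrary model choices.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
# Only the first two features actually matter in this simulation.
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# "CS-style": throw every feature at a black-box model and judge it
# by cross-validated predictive performance alone.
cv_scores = cross_val_score(SVR(kernel="rbf"), X, y, cv=5)
print("cross-validated R^2:", cv_scores.mean())

# "Statistics-style": fit an interpretable model and inspect coefficients,
# standard errors, and residuals to understand what drives the data.
ols_fit = sm.OLS(y, sm.add_constant(X)).fit()
print(ols_fit.summary())   # coefficient table with std. errors and p-values
residuals = ols_fit.resid  # raw material for outlier/leverage/normality checks
```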
The big difference I’ve noticed between the two fields is that statisticians like to demonstrate our methods on new examples whereas computer scientists seem to prefer to show better performance on benchmark problems. Both approaches to evaluation make sense in their own way; I just have the impression that stat and CS have evolved to have different priorities. To a statistician, a method is powerful when it generalizes to new situations. To a computer scientist, though, solving a new problem is no big deal—they can solve problems whenever they want, and it is through benchmarks that they can make fair comparisons.
Now to return to the original question: Yes, CS methods seem to focus on prediction while statistical methods focus on understanding. One might describe the basic approaches of different quantitative fields as follows:
Economics: identify the causal effect;
Psychology: model the underlying process;
Statistics: fit the data;
Computer science: predict.
The other issue is sample size. About ten years ago I had several meetings with a computer scientist here at Columbia who was working on interesting statistical methods. I was wondering if his methods could help on my problems, or if my methods could help on his. Unfortunately, we couldn’t see it. I was working with relatively small problems, maybe a survey with 10,000 data points, and he didn’t think his throw-everything-into-the-prediction approach would work well there. Conversely, it seemed impossible to apply my computationally intensive hierarchical modeling methods to his huge masses of information. I still felt (and feel) that some of our ideas were transferable to the other’s problems, but doing this transfer in either direction just seemed too difficult, so we gave up.
Finally, to return to your question about checking and understanding models: I’ve long thought that machine-learning-style approaches would benefit from predictive model checking. When you see where your model doesn’t fit the data, you get a sense of where improvements would be worth making. Then again, I’ve long thought that statistical model fits should be checked against data too, and a lot of statisticians (particularly Bayesians) have resisted this.
Generative, both conceptually and computationally
It’s particularly easy to check the fit of Bayesian models because they are generative, both conceptually and computationally: there is a probability model for new data, and you can (typically) just press a button to simulate from this generative model (conditional on draws from the posterior distribution of the fitted parameters).
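As a concrete illustration (a toy sketch, not a recipe: the normal model, flat prior, and test statistic are arbitrary choices for the example), here is what that button-press looks like for the simplest possible Bayesian model, using nothing but NumPy:

```python
# A small posterior predictive check. The model (normal with unknown mean
# and variance, flat prior) and the test statistic are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_t(df=3, size=200)   # "real" data: heavier-tailed than normal
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

n_sims = 1000
T_obs = np.abs(y).max()              # test statistic: largest absolute value
T_rep = np.empty(n_sims)
for s in range(n_sims):
    # Draw (sigma^2, mu) from the standard conjugate posterior for a
    # normal model with a flat prior, then simulate a replicated dataset.
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    y_rep = rng.normal(mu, np.sqrt(sigma2), size=n)
    T_rep[s] = np.abs(y_rep).max()

# If the observed statistic sits far in the tail of the replicated
# distribution, the normal model is missing something (here, heavy tails).
print("posterior predictive p-value:", (T_rep >= T_obs).mean())
```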
Machine learning methods are not always generative, in which case the first step to model checking is the construction of a generative model corresponding to (or approximating) the estimation procedure.
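Here's one way that construction might look in practice, again as a hedged sketch: wrap a black-box predictor in an assumed Gaussian noise model to get an approximate generative model, then simulate replicated data and compare it to what was observed. The random forest and the residual-based noise scale below are stand-ins for whatever procedure and error model you actually believe in:

```python
# A sketch only: the Gaussian residual model and the random forest are
# stand-ins for whatever estimation procedure and error model you trust.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n, p = 500, 5
X = rng.normal(size=(n, p))
y = np.exp(X[:, 0]) + rng.normal(scale=0.5, size=n)  # skewed "real" data

fit = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
mu_hat = fit.predict(X)
sigma_hat = np.std(y - mu_hat, ddof=1)  # crude in-sample residual scale

# Approximate generative model: y ~ Normal(mu_hat(x), sigma_hat^2).
# Simulate replicated datasets and compare a statistic of interest.
T_obs = y.max()
T_rep = np.array([
    (mu_hat + rng.normal(scale=sigma_hat, size=n)).max()
    for _ in range(200)
])
print("predictive p-value for max(y):", (T_rep >= T_obs).mean())
```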
I think some interesting work is being done in connecting these ideas, for example this paper by David Blei on posterior predictive checking for topic models.