Peter Huber’s most famous work is his paper on robust statistics, published nearly fifty years ago, in which he introduced the concept of M-estimation (a generalization of maximum likelihood) to unify ideas of Tukey and others on estimation procedures that are relatively insensitive to small departures from the assumed model.
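To make the idea concrete (a minimal sketch of my own, not Huber's code), here is an M-estimate of location using Huber's psi function, with the standard tuning constant k = 1.345, computed by iteratively reweighted averaging. The data are made up for illustration:

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """M-estimate of location with Huber's psi function,
    computed by iteratively reweighted averaging."""
    mu = np.median(x)  # robust starting point
    for _ in range(max_iter):
        r = x - mu
        # Huber weights: 1 inside [-k, k], k/|r| outside,
        # so extreme observations are downweighted, not discarded
        w = np.where(np.abs(r) <= k, 1.0, k / np.abs(r))
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

data = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 55.0])  # one gross outlier
print(round(huber_location(data), 2))  # close to 10; the sample mean is 17.5
```

The point of the weights is exactly the "insensitivity to small departures" above: a contaminating observation gets bounded influence rather than dragging the estimate wherever it likes.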

Huber has in many ways been ahead of his time. While remaining connected to the theoretical ideas from the early part of his career, his interests have shifted to computational and graphical statistics. I never took Huber’s class on data analysis–he left Harvard while I was still in graduate school–but fortunately I have an opportunity to learn his lessons now, as he has just released a book, “Data Analysis: What Can Be Learned from the Past 50 Years.”

The book puts together a few articles published in the past 15 years, along with some new material. Many of the examples are decades old, which is appropriate given that Huber is reviewing fifty years of the development of his ideas. (I used to be impatient with statistics books that were full of dead examples, but then I started to realize this was happening to me! The 8 schools experiments are almost 35 years old. The Electric Company is 40. The chicken brains are over 20 years old. The radon study is 15 years old, the data from the redistricting study are from the 1960s and 1970s, and so on. And of course even my more recent examples are getting older at the rate of one year per year and don’t keep so well once they’re out of the fridge. So at this point in my career I’d like to make a virtue of necessity and say that it’s *just fine* to work with old examples that we really understand.)

OK. As noted, Huber is modern–a follower of Tukey–in his treatment of computing and graphics as central to the statistical enterprise. His ISP software is R-like (as we would say now; of course ISP came first), and the principle of interactivity was important. He also has worked on various graphical methods for data exploration and dimension reduction; although I have not used these programs myself, I view them as close in spirit to the graphical tools that we now use to explore our data in the context of our fitted models.

Right now, data analysis seems dominated by three approaches:

– Machine learning

– Bayes

– Graphical exploratory data analysis

with some overlap, of course.

Many other statistical approaches/methods exist (e.g., time series/spatial, generalized estimating equations, nonparametrics, even some old-fashioned extensions of Fisher, Neyman, and Pearson), but they seem more along the lines of closed approaches to “inference” rather than open-ended tools for “data analysis.”

I like Huber’s pluralistic perspective, which ranges from contamination models to object-oriented programming, from geophysics to data cleaning. His is not a book to turn to for specific advice; rather, I enjoyed reading his thoughts on a variety of statistical issues and reflecting upon the connections between Huber’s strategies for data analysis and his better-known theoretical work.

Huber writes:

Too much emphasis is put on futile attempts to automate non-routine tasks, and not enough effort is spent on facilitating routine work.

I really like this quote and would take it a step further: If a statistical method can be routinized it can be used much more often and its limitations better understood.

Huber also writes:

The interpretation of the results of goodness-of-fit tests must rely on judgment of content rather than on P-values.

This perspective is commonplace today but, as Huber writes, “for a traditional mathematical statistician, the implied primacy of judgment over mathematical proof and over statistical significance clearly goes against the grain.” The next question is where the judgment comes from. One answer is that an experienced statistician might work on a few hundred applied problems during his or her career, and that will impart some judgment. But what advice can we give to people without such a personal history? My approach has been to build as many of the lessons I have learned as possible into the methods in my books, but Huber is surely right that any collection of specific instructions will miss something.

It is an occupational hazard of all scholars to have an incomplete perspective on work outside their own subfield. For example, section 5.2 of the book in question contains the following disturbing (to me) claim: “Bayesian statistics lacks a mechanism for assessing goodness-of-fit in absolute terms. . . . Within orthodox Bayesian statistics, we cannot even address the question whether a model Mi, under consideration at stage i of the investigation, is consonant with the data y.”

Huh? Huh? Also please see chapter 6 of Bayesian Data Analysis and my article, “A Bayesian formulation of exploratory data analysis and goodness-of-fit testing,” which appeared in the International Statistical Review in 2003. (Huber’s chapter 5 was written in 2000 so too soon for my 2003 paper, but the first edition of our book and our paper on posterior predictive checks had already appeared several years before.)

Just to be clear: I’m not faulting Huber for not citing my work. The statistics literature is huge and ever-expanding. It’s just unfortunate that such a basic misunderstanding–the idea that Bayesians can’t check their models–persists.

I like what Huber writes about approximately specified models, and I think he’d be very comfortable with our formulation of Bayesian data analysis, from the very first page of our book, as comprising three steps: (1) Model building, (2) Inference, (3) Model checking. Step 3 is crucial to making steps 1 and 2 work. Statisticians have written a lot about the problems with inference in a world in which models are tested–and that’s fine, such biases are a worthy topic of study–but consider the alternative, in which models were fit *without* ever being checked. This would be horrible indeed.
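To make step 3 concrete, here is a minimal sketch (mine, with a deliberately crude posterior approximation, on simulated data) of a posterior predictive check: fit a normal model to heavy-tailed data, then ask whether replicated data under the fitted model reproduce a tail-sensitive test statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_t(df=2, size=100)  # simulated data with heavy tails
n = len(y)

# Steps 1-2 (model building and inference), done crudely for brevity:
# approximate posterior draws of mu; sigma is fixed at the sample sd,
# which ignores posterior uncertainty in the scale.
sims = []
for _ in range(1000):
    mu = rng.normal(y.mean(), y.std(ddof=1) / np.sqrt(n))
    sigma = y.std(ddof=1)
    y_rep = rng.normal(mu, sigma, size=n)      # replicated dataset
    sims.append(np.max(np.abs(y_rep)))         # test statistic T(y_rep)

# Step 3 (model checking): posterior predictive p-value for T(y) = max|y|
T_obs = np.max(np.abs(y))
p = np.mean(np.array(sims) >= T_obs)
print(p)  # a p near 0 flags that the normal model misses the tails
```

Nothing here requires leaving the Bayesian framework: the model is fit, replications are drawn, and the discrepancy between data and model is assessed directly.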

Here’s a quote that is all too true (from section 5.7, following a long and interesting discussion of a decomposition of a time series in physics):

For some parts of the model (usually the less interesting ones) we may have an abundance of degrees of freedom, and a scarcity for the interesting parts.

This reminds me of a conversation I’ve had with Don Rubin in the context of several different examples. Like many (most?) statisticians, my tendency is to try to model the data. Don, in contrast, prefers to set up a model that matches what the scientists in the particular field of application are studying. He doesn’t worry so much about fit to the data and doesn’t do much graphing. For example, in the schizophrenics’ reaction-time example (featured in the mixture-modeling chapter of Bayesian Data Analysis), we used the model Don recommended: a mixture of normal distributions with a fixed lag between them. Looking at the data and thinking about the phenomenon, a fixed lag didn’t make sense to me, but Don emphasized that the psychology researchers were interested in an average difference, and so from his perspective it didn’t make sense to try to do any further modeling of these data. He said that if we wanted to model the variation of the lag, that would be fine, but it would make sense to gather more data rather than knocking ourselves out on this particular small dataset. In a field such as international relations, this get-more-data approach might not work, but in experimental psychology it seems like a good idea. (And I have to admit that I have not at all kept up with whatever research has been done on eye-tracking and schizophrenia in the past twenty years.)

This all reminds me of another story from when I was teaching at Berkeley. Phil Price and I had two students working with us on hierarchical modeling to estimate the distributions of home radon in U.S. counties. One day, one of the students simply quit. Why? He said he just wasn’t comfortable with the Bayesian approach. I was used to the old-style Berkeley environment, so I just accepted it: the kid had been indoctrinated, and I didn’t have the energy to try to unbrainwash him. But Phil was curious. Having just completed a cross-validation demonstrating how well Bayes was working in this example, he asked the student what he would do in our problem *instead* of a hierarchical model: if the student had a better idea, Phil said, we’d be happy to test it out. The student thought for a moment and said, well, I suppose Professor X (one of my colleagues from down the hall at the time) would say that the solution is to gather more data. At this point Phil blew up. Gather more data! We already have measurements from 80,000 houses! Could you tell us how many more measurements you think you’d need? The student had no answer to this but remained steadfast in his discomfort with the idea of performing statistical inference using conditional probability.

I think that student, and others like him, would benefit from reading Huber’s book and realizing that even a deep theoretician saw the need for using a diversity of statistical methods.

Also relevant to those who worship the supposed purity of likelihoods, or permutation tests, or whatever, is this line from Huber’s book:

A statistician rarely sees the raw data themselves–most large data collections in the sciences are being heavily preprocessed already in the collection stage, and the scientists not only tend to forget to mention it, but sometimes they also forget exactly what they had done.

We’re often torn between modeling the raw raw data and modeling the processed data. The latter choice can throw away important information but has the advantage not only of computational convenience but also, sometimes, of conceptual simplicity: processed data are typically closer to the form of the scientific concepts being modeled. For example, an economist might prefer to analyze some sort of preprocessed price data rather than data on individual transactions. Sure, there’s information in the transactions, but, depending on the context of the analysis, this behavioral story might distract from the more immediate goals of the economist. Other times, though, the only way to solve a problem is to go back to the raw data, and Huber provides several such examples in his book.

I will conclude with a discussion of a couple of Huber’s examples that overlap with my own applied research.

*Radon.* In section 3.8, Huber writes:

We found (through exploratory data analysis of a large environmental data set) that very high radon levels were tightly localized and occurred in houses sitting on the locations of old mine shafts. . . . The issue here is one of “data mining” in the sense of looking for a rare nugget, not one of looking, like a traditional statistician, “for a central tendency, a measure of variability, measures of pairwise association between a number of variables.” Random samples would have been useless, too: either one would have missed the exceptional values altogether, or one would have thrown them out as outliers.

I’m not so sure. Our radon research was based on two random samples, one of which, as noted above, included 80,000 houses. I agree that if you have a nonrandom sample of a million houses, it’s a good idea to use it for some exploratory analysis, so I’m not at all knocking what Huber has done, but I think he’s a bit too quick to dismiss random samples as “useless.” Also, I don’t buy his claim that extreme values, if found, would’ve been discarded as outliers. The point about outliers is that you look at them, you don’t just throw them out!

*Aggregation.* In chapter 6, Huber deplores that not enough attention is devoted to Simpson’s paradox. But then he demonstrates the idea with two fake-data examples. If a problem is important, I think it should be important enough to appear in real data. I recommend our Red State Blue State article for starters.

*Survey data.* In section 7.2, Huber analyzes data from a small survey of the opinions of jurors. When I looked at the list of survey items, I immediately thought of how I would reverse some of the scales to put everything in the same direction (this is basic textbook advice). Huber ends up doing this too, but only after performing a singular value decomposition. That’s fine, but in general I’d recommend doing all the easy scalings first so that the statistical method has a chance to discover something new. More generally, methods such as singular value decomposition and principal components analysis have their limitations–they can work fine for balanced data such as in this example, but in more complicated problems I’d go with item-response or ideal-point models. In general I prefer approaches based on models rather than algorithms: when a model goes wrong I can look for the assumption that was violated, whereas when an algorithm spits out a result that doesn’t make sense, I’m not always sure how to proceed. This may be a matter of taste or emphasis more than anything else; see my discussion of Tukey’s philosophy.
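Here is what I mean by doing the easy scalings first, on a toy juror-style survey of my own invention (not Huber's data): flip the reverse-coded item by hand, and only then hand the matrix to the SVD, so the algorithm isn't spending its leading component rediscovering the item wordings:

```python
import numpy as np

# Toy survey: 4 respondents x 3 items on a 1-5 scale.
# Item 2 (column index 1) is worded in the opposite direction.
X = np.array([[5, 1, 4],
              [4, 2, 5],
              [2, 4, 1],
              [1, 5, 2]], dtype=float)

# Easy scaling first: flip reverse-coded items so all point the same way.
reverse_coded = [1]                             # assumed known from the item wording
X[:, reverse_coded] = 6 - X[:, reverse_coded]   # 1<->5, 2<->4, 3 stays

# Then let the SVD look for structure beyond what we fixed by hand.
Xc = X - X.mean(axis=0)                         # center each item
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(s)  # a dominant first singular value = one main attitude dimension
```

With the flip done in advance, a dominant first singular value reflects a genuine common attitude dimension rather than an artifact of question direction.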

The next example in Huber’s book is the problem of reconstructing maps. I think he’d be interested in the work of Josh Tenenbaum and his collaborators on learning structured models such as maps and trees. Multidimensional scaling is fine–Huber gives a couple of references from 1970–but we can do a lot more now!
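For reference, the classical (Torgerson) version of multidimensional scaling that those 1970 references describe is only a few lines of linear algebra. This toy sketch (mine, with a made-up four-point "map") recovers coordinates from a matrix of pairwise distances:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) multidimensional scaling: recover point
    coordinates (up to rotation/reflection) from pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]       # keep the top `dim` eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Toy "map": four points, with their pairwise Euclidean distances as input.
pts = np.array([[0, 0], [3, 0], [0, 4], [3, 4]], dtype=float)
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = classical_mds(D)

# The recovered layout reproduces the original distances exactly:
D_rec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.allclose(D_rec, D))  # True
```

The structured-model work I mention goes well beyond this, but it helps to see what the 1970-era baseline actually computes.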

In conclusion, I found Huber’s book to be enjoyable and thought provoking. It’s good to have a sense of what a prominent theoretical statistician thinks about applied statistics.

Great review and comments. I will definitely have to check out that text. Too bad it's published by Wiley, which means I won't be able to find it for less than $100 + my first born.

Nice review. I've added Huber's book to my wish list…

Insofar as one is able to learn something from data, as I see it, one is interested in carrying out an inference. While some standard (frequentist) approaches to statistical inference are set out as "within" a statistical model (e.g., N-P tests), it is very misleading to maintain they are somehow "closed" while data analysis is "open". One does not set out in advance the possible claims that might be inferred. What renders the process appropriately "inferential" is being able to critically appraise any candidate for what has been learned (e.g., by scrutinizing the relevant error statistical credentials of the method at hand).

Regarding the Berkeley student who said 80,000 data points wasn't enough for our purposes: I remember it well. It was my first encounter with a phenomenon you had described to me but that I had trouble believing: statisticians (in this case, a student who was echoing his professors) who don't "believe" in Bayesian statistics. I couldn't, and still can't, understand where these people are coming from: Bayes' theorem is a theorem after all, not an idle speculation. If you want to say "I've got a better way to do things," fine, maybe you do, let's try it and see. Or if you want to argue about the prior distribution, or the form of the likelihood function, well, sure, "all models are wrong" and all that, there's plenty of scope for disagreement. But to refuse to accept the validity of a whole approach, even though it is on a firm foundation, still seems bizarre to me.

A few years after that, I went to a talk in the UC Berkeley stats department. I got there a bit early, and the only other people in the room were some UCB stats professors. They were talking about the talk we were about to hear, and disparaging it in various ways based on the abstract. I remember one remark in particular: "Well, of course the whole thing is suspect from the start because he has to assume exchangeability." I almost laughed.

But I think the statistics culture has changed a lot in the last ten or fifteen years. Bayesian methods now seem to be almost universally accepted…not that everyone thinks they're the best approach to a particular problem (or even to any problem) but you don't see the bizarre prejudice that was evident in the '90s.