I wrote the following for the occasion of his recent retirement party, but I thought these thoughts might be of general interest:
When Carl Morris came to our department in 1989, I and my fellow students were so excited. We all took his class. The funny thing is, though, the late 1980s might well have been the worst time to be Carl Morris, from the standpoint of what was being done in statistics at that time—not just at Harvard, but in the field in general. Carl has made great contributions to statistical theory and practice, developing ideas which have become particularly important in statistics in the last two decades. In 1989, though, Carl’s research was not in the mainstream of statistics, or even of Bayesian statistics.
When Carl arrived to teach us at Harvard, he was both a throwback and ahead of his time.
Let me explain. Two central aspects of Carl’s research are the choice of probability distribution for hierarchical models, and frequency evaluations in hierarchical settings where both Bayesian calibration (conditional on inferences) and classical bias and variance (conditional on unknown parameter values) are relevant. In Carl’s terms, these are “NEF-QVF” and “empirical Bayes.” My point is: both of these areas were hot at the beginning of Carl’s career and they are hot now, but somewhere in the 1980s they languished.
In the wake of Charles Stein’s work on admissibility in the late 1950s there was an interest, first theoretical but with clear practical motivations, to produce lower-risk estimates, to get the benefits of partial pooling while maintaining good statistical properties conditional on the true parameter values, to produce the Bayesian omelet without cracking the eggs, so to speak. In this work, the functional form of the hierarchical distribution plays an important role—and in a different way than had been considered in statistics up to that point. In classical distribution theory, distributions are typically motivated by convolution properties (for example, the sum of two independent gamma random variables with a common scale parameter is itself gamma-distributed), or by stable laws such as the central limit theorem, or by some combination or transformation of existing distributions. But in Carl’s work, the choice of distribution for a hierarchical model can be motivated based on the properties of the resulting partially pooled estimates. In this way, Carl’s ideas are truly non-Bayesian because he is considering the distribution of the parameters in a hierarchical model not as a representation of prior belief about the set of unknowns, and not as a model for a population of parameters, but as a device to obtain good estimates.
So, using a Bayesian structure to get good classical estimates. Or, Carl might say, using classical principles to get better Bayesian estimates. I don’t know that they used the term “robust” in the 1950s and 1960s, but that’s how we could think of it now.
The interesting thing is, if we take Carl’s work seriously (and we should), we now have two principles for choosing a hierarchical model. In the absence of prior information about the functional form of the distribution of group-level parameters, and in the absence of prior information about the values of the hyperparameters that would underlie such a model, we should use some form with good statistical properties. On the other hand, if we do have good prior information, we should of course use it—even R. A. Fisher accepted Bayesian methods in those settings where the prior distribution is known.
But, then, what do we do in those cases in between—the sorts of problems that arose in Carl’s applied work in health policy and other areas? I learned from Carl to use our prior information to structure the model, for example to pick regression coefficients, to decide which groups to pool together, to decide which parameters to model as varying, and then use robust hierarchical modeling to handle the remaining, unexplained variation. This general strategy wasn’t always so clear in the theoretical papers on empirical Bayes, but it came through in Carl’s applied work, as well as that of Art Dempster, Don Rubin, and others, much of which flowered in the late 1970s—not coincidentally, a few years after Carl’s classic articles with Brad Efron that put hierarchical modeling on a firm foundation that connected with the edifice of theoretical statistics, gradually transforming these ideas from a parlor trick into a way of life.
In a famous paper, Efron and Morris wrote of “Stein’s paradox in statistics,” but as a wise man once said, once something is understood, it is no longer a paradox. In un-paradoxing shrinkage estimation, Efron and Morris finished the job that Gauss, Laplace, and Galton had begun.
So far, so good. We’ve hit the 1950s, the 1960s, and the 1970s. But what happened next? Why do I say that, as of 1989, Carl’s work was “out of time”? The simplest answer would be that these ideas were a victim of their own success: once understood, no longer mysterious. But it was more than that. Carl’s specific research contribution was not just hierarchical modeling but the particular intricacies involved in the combination of data distribution and group-level model. His advice was not simply “do Bayes” or even “do empirical Bayes” but rather had to do with a subtle examination of this interaction. And, in the late 1980s and early 1990s, there wasn’t so much interest in this in the field of statistics. On one side, the anti-Bayesians were still riding high in their rejection of all things prior, even in some quarters a rejection of probability modeling itself. On the other side, a growing number of Bayesians—inspired by applied successes in fields as diverse as psychometrics, pharmacology, and political science—were content to just fit models and not worry about their statistical properties.
Similarly with empirical Bayes, a term which in the hands of Efron and Morris represented a careful, even precarious, theoretical structure intended to capture classical statistical criteria in a setting where the classical ideas did not quite apply, a setting that mixed estimation and prediction—but which had typically devolved into mere shorthand for “Bayesian inference, plugging in point estimates for the hyperparameters.” In an era where the purveyors of classical theory didn’t care to wrestle with the complexities of empirical Bayes, and where Bayesians had built the modeling and technical infrastructure needed to fit full Bayesian inference, hyperpriors and all, there was not much of a market for Carl’s hybrid ideas.
This is why I say that, at the time Carl Morris came to Harvard, his work was honored and recognized as pathbreaking, but his actual research agenda was outside the mainstream.
As noted above, though, I think things have changed. The first clue—although it was not at all clear to me at the time—was Trevor Hastie and Rob Tibshirani’s lasso regression, which was developed in the early 1990s and which has of course become increasingly popular in statistics, machine learning, and all sorts of applications. Lasso is important to me partly as the place where Bayesian ideas of shrinkage or partial pooling entered what might be called the Stanford school of statistics. But for the present discussion what is most relevant is the centrality of the functional form. The point of lasso is not just partial pooling, it’s partial pooling with a Laplace (double-exponential) prior. As I said, I did not notice the connection with Carl’s work and other Stein-inspired work back when lasso was introduced—at that time, much was made of the shrinkage of certain coefficients all the way to zero, which indeed is important (especially in practical problems with large numbers of predictors), but my point here is that the ideas of the late 1950s and early 1960s again became relevant. It’s not enough just to say you’re partial pooling—it matters _how_ this is being done.
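The dependence on functional form can be made concrete with a small sketch (the function name and numbers here are my own illustration, not taken from any particular paper): for a single coefficient with an orthonormal design, the lasso estimate is a soft-thresholded version of the least-squares estimate, which is also the posterior mode under a Laplace (double-exponential) prior. It is precisely this choice of prior that shrinks some coefficients exactly to zero, rather than merely toward it.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution for one coefficient under an orthonormal design,
    equivalently the posterior mode under a Laplace (double-exponential)
    prior with scale controlled by lam: shrink the least-squares estimate
    z toward zero by lam, and set it exactly to zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical least-squares estimates: the two large ones are shrunk
# partway toward zero; the two small ones are set exactly to zero.
beta_ls = np.array([3.0, 0.4, -2.0, -0.1])
print(soft_threshold(beta_ls, 0.5))
```

A normal prior, by contrast, would shrink every coefficient proportionally and never land exactly on zero—the same “partial pooling,” but with very different behavior.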
In recent years there’s been a flood of research on prior distributions for hierarchical models, for example the work by Nick Polson and others on the horseshoe distribution, and the issues raised by Carl in his classic work are all returning. I can illustrate with a story from my own work. A few years ago some colleagues and I published a paper on penalized marginal maximum likelihood estimation for hierarchical models using, for the group-level standard deviation, a gamma prior with shape parameter 2, which has the pleasant feature of keeping the point estimate off of zero while allowing it to be arbitrarily close to zero if demanded by the data (a pair of properties that is not satisfied by the uniform, lognormal, or inverse-gamma distributions, all of which had been proposed as classes of priors for this model). I was (and am) proud of this result, and I linked it to the increasingly popular idea of weakly informative priors. After talking with Carl, I learned that these ideas were not new: they are closely related to questions that Carl has been wrestling with for decades in his research, as they relate both to the technical issue of the combination of prior and data distributions and to the larger concerns about default Bayesian (or Bayesian-like) inferences.
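A rough numerical illustration of that boundary-avoiding property (a minimal sketch with made-up data; I place the gamma(2) prior on the group-level standard deviation tau, and the function names and parameter values are mine): the gamma(2, rate) prior contributes log(tau) - rate*tau to the log posterior, and the log(tau) term sends the penalized objective to minus infinity as tau approaches zero, so the penalized mode is always strictly positive even when the marginal maximum likelihood estimate sits on the boundary at zero.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data: group estimates y_j with known sampling sd sigma = 1.  The groups
# are so similar that the marginal MLE of the group-level sd tau is zero.
y = np.array([0.1, -0.1, 0.05, -0.05])
sigma = 1.0

def neg_marginal_loglik(tau):
    # Marginally, y_j ~ N(mu, sigma^2 + tau^2); profile out mu with y-bar.
    v = sigma**2 + tau**2
    return 0.5 * np.sum((y - y.mean())**2 / v + np.log(v))

def neg_penalized_loglik(tau, rate=0.1):
    # Gamma(shape=2, rate) prior on tau adds log(tau) - rate*tau to the log
    # likelihood.  The log(tau) term keeps the mode strictly positive, while
    # the penalty still lets tau sit arbitrarily close to zero when the data
    # demand it.
    return neg_marginal_loglik(tau) - (np.log(tau) - rate * tau)

tau_mle = minimize_scalar(neg_marginal_loglik, bounds=(1e-8, 10), method="bounded").x
tau_pen = minimize_scalar(neg_penalized_loglik, bounds=(1e-8, 10), method="bounded").x
print(tau_mle)  # essentially zero: the estimate collapses to the boundary
print(tau_pen)  # strictly positive
```

With a uniform or inverse-gamma prior, by contrast, either the mode can still land on zero or the estimate is kept artificially far from it, which is the pair of failures the gamma(2) choice avoids.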
In short: in the late 1980s, it was enough to be Bayesian. Or, perhaps I should say, Bayesian data analysis was in its artisanal period, and we tended to be blissfully ignorant about the dependence of our inferences on subtleties of the functional forms of our models. Or, to put a more positive spin on things: when our inferences didn’t make sense, we changed our models, hence the methods we used (in concert with the prior information implicitly encoded in that innocent-sounding phrase, “make sense”) had better statistical properties than one would think based on theoretical analysis alone. Real-world inferences can be superefficient, as Xiao-Li Meng might say, because they make use of tacit knowledge.
In recent years, however, Bayesian methods (or, more generally, regularization, thus including lasso and other methods that are only partly in the Bayesian fold) have become routine, to the extent that we need to think of them as defaults, which means we need to be concerned about . . . their frequency properties. Hence the re-emergence of truly empirical Bayesian ideas such as weakly informative priors, and the re-emergence of research on the systematic properties of inferences based on different classes of priors or regularization. Again, this all represents a big step beyond the traditional classification of distributions: in the robust or empirical Bayesian perspective, the relevant properties of a prior distribution depend crucially on the data model to which it is linked.
So, over 25 years after taking Carl’s class, I’m continuing to see the centrality of his work to modern statistics: ideas from the early 1960s that were in many ways ahead of their time.
Let me conclude with the observation that Carl seemed to us to be a “man out of time” on the personal level as well. In 1989 he seemed ageless to us both physically and in his personal qualities, and indeed I still view him that way. When he came to Harvard he was not young (I suppose he was about the same age as I am now!) but he had, as the saying goes, the enthusiasm of youth, which indeed continues to stay with him. At the same time, he has always been even-tempered, and I expect that, in his youth, people remarked upon his maturity. It has been nearly fifty years since Carl completed his education, and his ideas remain fresh, and I continue to enjoy his warmth, humor, and insights.