Kevin Gray sent me a bunch of questions on Bayesian statistics and I responded. The interview is here at KDnuggets news. For some reason the KDnuggets editors gave it the horrible, horrible title, “Bayesian Basics, Explained.” I guess they don’t waste their data mining and analytics skills on writing blog post titles!
That said, I like a lot of the things I wrote, so I’ll repeat the material (with some slight reorganization) here:
What is Bayesian statistics?
Bayesian statistics uses the mathematical rules of probability to combine data with prior information to yield inferences which (if the model being used is correct) are more precise than would be obtained by either source of information alone.
In contrast, classical statistical methods avoid prior distributions. In classical statistics, you might include in your model a predictor (for example), or you might exclude it, or you might pool it as part of some larger set of predictors in order to get a more stable estimate. These are pretty much your only choices. In Bayesian inference you can—OK, you must—assign a prior distribution representing the set of values the coefficient can take. You can reproduce the classical methods using Bayesian inference: In a regression prediction context, setting the prior of a coefficient to uniform or “noninformative” is mathematically equivalent to including the corresponding predictor in a least squares or maximum likelihood estimate; setting the prior to a spike at zero is the same as excluding the predictor; and you can reproduce a pooling of predictors through a joint deterministic prior on their coefficients. But in Bayesian inference you can do much more: by setting what is called an “informative prior,” you can partially constrain a coefficient, striking a compromise between noisy least-squares estimation and setting the coefficient exactly to zero. It turns out this is a powerful tool in many problems—especially because in problems with structure, we can fit so-called hierarchical models which allow us to estimate aspects of the prior distribution from data.
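To make this continuum of priors concrete, here is a minimal sketch in Python/NumPy (my own illustration, not part of the interview) with a single coefficient, simulated data, and a noise variance assumed known, so the posterior mean has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2.0 * x + noise, with known noise scale sigma.
sigma = 1.0
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=sigma, size=50)

b_ols = np.sum(x * y) / np.sum(x**2)  # least-squares estimate

def posterior_mean(tau):
    """Posterior mean of the coefficient under the prior beta ~ Normal(0, tau^2)."""
    precision_data = np.sum(x**2) / sigma**2  # information from the likelihood
    precision_prior = 1.0 / tau**2            # information from the prior
    return (precision_data * b_ols) / (precision_data + precision_prior)

print(posterior_mean(1e6))   # near-flat prior: essentially the OLS estimate
print(posterior_mean(1e-6))  # near-spike at zero: coefficient shrunk to ~0
print(posterior_mean(0.5))   # informative prior: a compromise in between
```

As the prior scale grows the estimate approaches least squares; as it shrinks toward zero the coefficient is pinned at zero; intermediate values give exactly the compromise described above.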
The theory of Bayesian inference originates with its namesake, Thomas Bayes, an 18th-century English cleric, but it really took off in the late 18th century with the work of the French mathematician and physicist Pierre-Simon Laplace. Bayesian methods were used for a long time after that to solve specific problems in science, but it was in the mid-20th century that they were proposed as a general statistical tool. Some key figures include John Maynard Keynes and Frank Ramsey who in the 1920s developed an axiomatic theory of probability; Harold Jeffreys and Edwin Jaynes, who from the 1930s through the 1970s developed Bayesian methods for a variety of problems in the physical sciences; Jimmie Savage and Dennis Lindley, mathematicians who in research from the 1950s through the 1970s connected and contrasted Bayesian methods with classical statistics; and, not least, Alan Turing, who used Bayesian probability methods to crack the Enigma code in the second world war, and his colleague I. J. Good, who explored and wrote prolifically about these ideas over the succeeding decades.
Within statistics, Bayesian and related methods have become gradually more popular over the past several decades, often developed in different applied fields, such as animal breeding in the 1950s, educational measurement in the 1960s and 1970s, spatial statistics in the 1980s, and marketing and political science in the 1990s. Eventually a sort of critical mass developed in which Bayesian models and methods that had been developed in different applied fields became recognized as more broadly useful.
Another factor that has fostered the spread of Bayesian methods is progress in computing speed and improved computing algorithms. Except in simple problems, Bayesian inference requires difficult mathematical calculations—high-dimensional integrals—which are often most practically computed using stochastic simulation, that is, computation using random numbers. This is the so-called Monte Carlo method, which was developed systematically by the mathematician Stanislaw Ulam and others when trying out designs for the hydrogen bomb in the 1940s and then rapidly picked up in the worlds of physics and chemistry. The potential for these methods to solve otherwise intractable statistics problems became apparent in the 1980s, and since then each decade has seen big jumps in the sophistication of algorithms, the capacity of computers to run these algorithms in real time, and the complexity of the statistical models that practitioners are now fitting to data.
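The Monte Carlo idea itself fits in a few lines. This toy example (mine, not from the interview) estimates pi by random sampling, the standard classroom illustration of computing an integral with random numbers:

```python
import numpy as np

rng = np.random.default_rng(42)

# Monte Carlo estimate of pi: the fraction of uniform points in the unit
# square that land inside the quarter circle of radius 1 estimates pi/4.
n = 1_000_000
pts = rng.uniform(size=(n, 2))
inside = (pts**2).sum(axis=1) < 1.0
pi_hat = 4.0 * inside.mean()
print(pi_hat)  # close to 3.14159; the error shrinks like 1/sqrt(n)
```

The same logic—replace a hard integral with an average over random draws—is what makes high-dimensional Bayesian computation feasible, though modern algorithms draw the samples far more cleverly than this.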
Now, don’t get me wrong—computational and algorithmic advances have become hugely important in non-Bayesian statistical and machine learning methods as well. Bayesian inference has moved, along with statistics more generally, away from simple formulas toward simulation-based algorithms.
Comparisons to other statistical methods
I wouldn’t say there’s anything that only Bayesian statistics can provide. When Bayesian methods work best, it’s by providing a clear set of paths connecting data, mathematical/statistical models, and the substantive theory of the variation and comparison of interest. From this perspective, the greatest benefits of the Bayesian approach come not from default implementations, valuable as they can be in practice, but in the active process of model building, checking, and improvement. In classical statistics, improvements in methods often seem distressingly indirect: you try a new test that’s supposed to capture some subtle aspect of your data, or you restrict your parameters or smooth your weights, in some attempt to balance bias and variance. Under a Bayesian approach, all the tuning parameters are supposed to be interpretable in real-world terms, which implies—or should imply—that improvements in a Bayesian model come from, or supply, improvements in understanding of the problem under study.
The drawback of this Bayesian approach is that it can require a bit of a commitment to the construction of a model that might be complicated, and you can end up putting effort into modeling aspects of the data that maybe aren’t so relevant for your particular inquiry.
Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression), the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.
There’s room for lots of methods. What’s important in any case is what problems they can solve. We use the methods we already know and then learn something new when we need to go further. Bayesian methods offer a clarity that comes from the explicit specification of a so-called “generative model”: a probability model of the data-collection process and a probability model of the underlying parameters. But construction of these models can take work, and it makes sense to me that for problems where you have a simpler model that does the job, you just go with that.
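As one concrete illustration of a generative model (a hypothetical example of my own, not one from the interview): first draw the parameter from its probability model (the prior), then draw the data given the parameter. With a conjugate prior, the posterior update is a single line of arithmetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# A generative model has two parts: a probability model for the parameters
# (the prior) and a probability model for the data given the parameters.
# Hypothetical example: theta ~ Beta(2, 2), then y | theta ~ Binomial(n, theta).
n_trials = 100
theta_true = rng.beta(2.0, 2.0)          # draw a parameter from the prior
y = rng.binomial(n_trials, theta_true)   # draw data given that parameter

# The Beta prior is conjugate to the binomial likelihood, so the posterior
# is Beta(2 + y, 2 + n - y), whose mean is:
post_mean = (2.0 + y) / (4.0 + n_trials)
print(theta_true, y, post_mean)
```

Writing the model this way makes explicit what you are assuming about how the data arose, which is exactly the clarity referred to above; of course most real models require simulation rather than a one-line conjugate update.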
Looking at the comparison from the other direction, when it comes to big problems with streaming data, Bayesian methods are useful but the Bayesian computation can in practice only be approximate. And once you enter the zone of approximation, you can’t cleanly specify where the modeling approximation ends and the computing approximation begins. At that point, you need to evaluate any method, Bayesian or otherwise, by looking at what it does to the data, and the best available method for any particular problem might well be set up in a non-Bayesian way.
Bayesian inference and big data
The essence of Bayesian statistics is the combination of information from multiple sources. We call this data and prior information, or hierarchical modeling, or dynamic updating, or partial pooling, but in any case it’s all about putting together data to understand a larger structure. Big data, or data coming from the so-called internet of things, are inherently messy: scraped data not random samples, observational data not randomized experiments, available data not constructed measurements. So statistical modeling is needed to put data from these different sources on a common footing. I see this in the analysis of internet surveys where we use multilevel Bayesian models to use non-random samples to make inferences about the general population, and the same ideas occur over and over again in modern messy-data settings.
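Partial pooling can be shown in miniature. The sketch below (illustrative numbers and a fixed group-level scale, both chosen by me for the example) shrinks noisy group-level estimates toward a precision-weighted grand mean; complete pooling and no pooling fall out as limiting cases:

```python
import numpy as np

# Group-level estimates (say, survey means by group) with known standard errors.
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
se = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

def partial_pool(tau):
    """Posterior means when group effects ~ Normal(mu, tau^2), mu estimated from the data."""
    w = 1.0 / (se**2 + tau**2)
    mu = np.sum(w * y) / np.sum(w)       # precision-weighted grand mean
    shrink = se**2 / (se**2 + tau**2)    # how far each group moves toward mu
    return shrink * mu + (1.0 - shrink) * y

print(partial_pool(0.0))   # complete pooling: every group gets the grand mean
print(partial_pool(1e6))   # no pooling: each group keeps its own noisy estimate
print(partial_pool(5.0))   # partial pooling: a data-weighted compromise
```

In a full hierarchical analysis the group-level scale would itself be estimated from the data rather than fixed, which is what lets the data decide how much pooling is warranted.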
Using Bayesian methods yourself
You have to learn by doing, and one place to start is to look at some particular problem. One example that interested me recently was a website constructed by the sociologist Pierre-Antoine Kremp, who used the open-source statistics language R and the open-source Bayesian inference engine Stan (named after Stanislaw Ulam, the inventor of the Monte Carlo method mentioned earlier) to combine U.S. national and state polls to make daily forecasts of the U.S. presidential election. In an article for Slate, I called this “the open-source poll aggregator that will put all other poll aggregators out of business” because ultimately you can’t beat the positive network effects of free and open-source: the more people who see this model, play with it, and probe its weaknesses, the better it can become. The Bayesian formalism allows a direct integration of data from different sorts of polls in the context of a time-series prediction model.
Are there any warnings? As a famous cartoon character once said, “With great power comes great responsibility.” Bayesian inference is powerful in the sense that it allows the sophisticated combination of information from multiple sources via partial pooling (that is, local inferences are constructed in part from local information and in part from models fit to non-local data), but the flip side is that when assumptions are very wrong, conclusions can be far off too. That’s why Bayesian methods need to be continually evaluated with calibration checks, comparisons of observed data to simulated replications under the model, and other exercises that give the model an opportunity to fail. Statistical model building, but maybe especially in its Bayesian form, is an ongoing process of feedback and quality control.
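Here is a bare-bones version of the “simulated replications” check mentioned above, using a plug-in point estimate rather than full posterior draws (a simplification I am making for brevity; the test statistic and data are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(7)

# Observed data (illustrative): suppose we model these as iid Normal.
y = rng.normal(loc=0.0, scale=1.0, size=100)
y[0] = 6.0   # plant one outlier the normal model will struggle to reproduce

# Crude "fit": plug-in estimates (a full analysis would use posterior draws).
mu_hat, sigma_hat = y.mean(), y.std()

# Predictive check: simulate replicated datasets under the fitted model and
# ask how often the replicated maximum is as extreme as the observed one.
n_rep = 2000
reps = rng.normal(mu_hat, sigma_hat, size=(n_rep, y.size))
p_value = np.mean(reps.max(axis=1) >= y.max())
print(p_value)  # a value near 0 flags the outlier as a model failure
```

The point of such checks is not to “accept” or “reject” the model but to see in what ways it fails to reproduce features of the data you care about.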
A statistical procedure is a sort of machine that can run for a while on its own, but eventually needs maintenance and adaptation to new conditions. That’s what we’ve seen in the recent replication crisis in psychology and other social sciences: methods of null hypothesis significance testing and p-values, which had been developed for analysis of certain designed experiments in the 1930s, were no longer working in modern settings of noisy data and uncontrolled studies. Savvy observers had realized this for a while—psychologist Paul Meehl was writing acerbically about statistically-driven pseudoscience as early as the 1960s—but it took a while for researchers in many professions to catch on. I’m hoping that Bayesian modelers will be quicker to recognize their dead ends, and in my own research I’ve put a lot of effort into developing methods for checking model fit and evaluating predictions.
Different software will serve different needs. Many users will not know a lot of statistics and will want to choose among some menu of models or analyses, and I respect that. We have written wrappers in Stan with pre-coded versions of various standard choices such as linear and logistic regression, ordered regression, multilevel models with varying intercepts and slopes, and so forth, and we’re working on tutorials that will allow the new user to fit these models in R or Stata or other familiar software.
Other users come to Stan because they want to build their own models, or, better still, want to explore their data by fitting multiple models, comparing them, and evaluating their fit. Indeed, our motivation in developing Stan was to solve problems in my own applied research, to fit models that I could not easily fit any other way.
Statistics is sometimes divided between graphical or “exploratory” data analysis, and formal or “confirmatory” inference. But I think that division is naive: in my experience, data exploration is most effectively done using models, and, conversely, our most successful models are constructed as the result of an intensive period of exploration and feedback. So, for me, I want model-fitting software that is:
– Flexible (so I can fit the models I want and expand them in often unanticipated ways);
– Fast (so I can fit many models);
– Connected to other software (so I can prepare my datasets before entering them in the model, and I can graphically and otherwise explore the fitted model relative to the data);
– Open (so I can engage my collaborators and the larger scientific community in my work, and conversely so I can contribute by sharing my modeling expertise in a common language);
– Readable and transparent (both so I can communicate my models with others and so I can actually understand what my models are doing).
Our efforts on Stan move us toward these goals.
Lots of directions here. From the modeling direction, we have problems such as polling where our samples are getting worse and worse, less and less representative, and we need to do more and more modeling to make reasonable inferences from sample to population. For decision making we need causal inference, which typically requires modeling to adjust for differences between so-called treatment and control groups in observational studies. And just about any treatment effect we care about will vary depending on scenario. The challenge here is to estimate this variation, while accepting that in practice we will have a large residue of uncertainty. We’re no longer in the situation where “p less than .05” can be taken as a sign of a discovery. We need to accept uncertainty and embrace variation. And that’s true no matter how “big” our data are.
In practice, much of my thought goes into computing. We know our data are messy, we know we want to fit big models, but the challenge is to do so stably and in reasonable time—in the current jargon, we want “scalable” inference. Efficiency, stability, and speed of computing are essential. And we want more speed than you might think, because, as discussed earlier, when I’m learning from data I want to fit lots and lots of models. Of course then you have to be concerned about overfitting, but that’s another story. For most of the problems I’ve worked on, there are potential big gains from exploration, especially if that exploration is done through substantively-based models and controlled with real prior information. That is, Bayesian data analysis.