## The Future of Data Analysis

### Introduction

A few days ago I was trying to explain the benefits of the Bayesian approach to a physicist who didn’t care about the religion of truth and inference but primarily about solving a particular detection problem in particle physics. The probabilistic approach is rather standard and requires little persuasion, but the Bayesian aspect goes a level further than the probabilistic approach. So what is the benefit of the Bayesian approach? This post will attempt to provide several reasons, from the most obvious to the least.

### Frequentist Probability

Probability is easily justified as a very elegant way of dealing with uncertainty in cases and variables. But probability is not observed directly; it is inferred, just as parameters are, in contrast to observable predictors and outcomes. Frequentists state that probability should be measured through the gold standard of an infinite sequence of observations. They question the benefit of the Bayesian approach, pointing out that Bayesian inference of a parameter can yield worse accuracy than their favored method of “estimators”, and that a bad prior can totally mess up inference. So why not use estimators if their asymptotic properties are good and the methodology is often simpler than Bayes?

### Overfitting

Dividing the number of positive outcomes by the number of all outcomes to estimate the probability of the positive outcome is a very simple estimator: it’s easy to have enough data to calculate this. But most interesting questions are not as simple: the overall probability of getting cancer is not very interesting, and estimating the probability of getting cancer given smoking also requires removing the obvious effect of age. All these additional variables make a model more complicated and increase the number of parameters. Without care and attention the model can start hallucinating properties that aren’t there. The problem is shown in the following picture:

If your modeling problem is in the green area, you can happily use estimators or maximum likelihood. If you’re entering the yellow area and want to retain some generalization power, you need some sort of regularization, epitomized by L1 and L2 regularization, AIC, feature selection or support vector machines. So why shouldn’t we just regularize?

### Priors

Priors are how a Bayesian would perform regularization. After seeing a large number of regression problems from medical domains, we can safely assign a prior distribution to the size of a regression coefficient, as we have done in our paper. But then, what is the advantage over regularization? A prior is just a distribution of what the parameters should be over a particular category of problems! Isn’t this a nice way to formulate regularization?
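The link between priors and regularization can be checked on a toy problem. The sketch below assumes a single-coefficient regression with Gaussian noise and a zero-mean Gaussian prior on the coefficient; the function names, the data, and the choice of a Gaussian prior are illustrative assumptions and not the prior used in the paper mentioned above. Under these assumptions the posterior mode coincides with the L2-regularized (ridge) estimate when the penalty is set to `noise_sd**2 / prior_sd**2`.

```python
def ridge_slope(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 for a single slope w (closed form)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def log_posterior(w, xs, ys, noise_sd, prior_sd):
    """Unnormalized log posterior: Gaussian likelihood plus a N(0, prior_sd^2) prior on w."""
    log_lik = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_sd ** 2)
    log_prior = -(w ** 2) / (2 * prior_sd ** 2)
    return log_lik + log_prior


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0]
ys = [1.2, 1.9, 3.4]
noise_sd, prior_sd = 1.0, 0.5

lam = noise_sd ** 2 / prior_sd ** 2      # penalty implied by the prior
w_ml = ridge_slope(xs, ys, 0.0)          # no prior: maximum likelihood
w_map = ridge_slope(xs, ys, lam)         # ridge estimate with the implied penalty

# Independent check: locate the posterior mode by brute-force grid search.
grid = [i / 1000 for i in range(0, 2001)]
w_grid = max(grid, key=lambda w: log_posterior(w, xs, ys, noise_sd, prior_sd))

print("ML slope:        ", round(w_ml, 4))
print("ridge/MAP slope: ", round(w_map, 4))
print("grid-search mode:", w_grid)
```

A tighter prior (smaller `prior_sd`) implies a larger penalty and pulls the coefficient further toward zero, which is the sense in which the prior encodes experience from earlier problems.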

### Model Uncertainty

The crux of Bayes is in using probability to represent the uncertainty about the Platonic: the model, its parameters, the probability. The Bayesian approach truly starts paying a dividend when there is uncertainty in models and parameters, when we have insufficient data to accurately fit the model. Even if an estimator could rather accurately match the predictions obtained by a posterior, the variance of the posterior tells us when the data are insufficient to pin the model down. To the best of my knowledge, no other methodology can automatically detect such problems.
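A small illustration of how posterior spread flags insufficient data, assuming a binomial outcome with a Beta prior (uniform by default). The helper name `beta_posterior` and the counts are hypothetical. Both data sets give the same point estimate of 0.5, but the posterior standard deviation differs by a large factor.

```python
import math


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of the Beta posterior for a binomial probability."""
    a = prior_a + successes
    b = prior_b + trials - successes
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


small = beta_posterior(2, 4)        # 2 successes in 4 trials
large = beta_posterior(200, 400)    # 200 successes in 400 trials

print("n=4:   mean %.3f, sd %.3f" % small)
print("n=400: mean %.3f, sd %.3f" % large)
```

With four observations the standard deviation is roughly 0.19, against roughly 0.025 with four hundred, so the posterior itself signals which estimate should be trusted.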

Another problem that Andrew identified is that there might be situations where the data doesn’t match the model very well: even though there might be lots of data and a relatively simple model, it just doesn’t fit, and the posterior will be vague.

### Language of Modeling

WinBUGS is an example of a higher-level modeling language. Programming languages have been celebrated for improving programmers’ productivity: they do not require the programmer to think in terms of individual instructions such as SET or JMP but in terms of functions, procedures, and loops. Similarly, with Bayesian models we no longer have to think in terms of derivatives and fitting algorithms, but in terms of parameters that have distributions and are tied together in models. The Gibbs sampler is a general-purpose fitter and proto-compiler. Of course, it’s not nearly as efficient as a hand-written optimizer, but in the future tools like the Hierarchical Bayes Compiler (HBC) will create custom fitters given a higher-level specification of the model.
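To make the "general-purpose fitter" idea concrete, here is a hand-written Gibbs sampler for a very small model. The sketch assumes a normal likelihood with unknown mean and precision and conjugate priors; the function name `gibbs_normal`, the prior settings, and the simulated data are all illustrative assumptions, not anything taken from WinBUGS or HBC. The point of a modeling language is that the user states only the model, and the conditional updates written out by hand below are derived by the tool.

```python
import random
import statistics


def gibbs_normal(ys, n_iter=5000, burn_in=500,
                 m0=0.0, p0=1e-3, a0=1e-3, b0=1e-3, seed=0):
    """Gibbs sampler for y_i ~ Normal(mu, 1/tau) with priors
    mu ~ Normal(m0, 1/p0) and tau ~ Gamma(shape=a0, rate=b0)."""
    rng = random.Random(seed)
    n = len(ys)
    total = sum(ys)
    mu, tau = statistics.fmean(ys), 1.0   # starting values
    mu_draws, tau_draws = [], []
    for it in range(n_iter):
        # Draw mu from its full conditional given tau (normal).
        prec = p0 + n * tau
        mean = (p0 * m0 + tau * total) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # Draw tau from its full conditional given mu (gamma).
        shape = a0 + n / 2
        rate = b0 + 0.5 * sum((y - mu) ** 2 for y in ys)
        tau = rng.gammavariate(shape, 1.0 / rate)   # second argument is a scale
        if it >= burn_in:
            mu_draws.append(mu)
            tau_draws.append(tau)
    return mu_draws, tau_draws


# Simulated data with true mean 5 and true standard deviation 2.
data_rng = random.Random(42)
ys = [data_rng.gauss(5.0, 2.0) for _ in range(50)]

mu_draws, tau_draws = gibbs_normal(ys)
post_mu = statistics.fmean(mu_draws)
post_sd = statistics.fmean([t ** -0.5 for t in tau_draws])

print("posterior mean of mu:   ", round(post_mu, 3))
print("posterior mean of sigma:", round(post_sd, 3))
```

Changing the model (say, adding a group-level mean) would mean deriving and coding new conditionals by hand, which is exactly the work a model compiler is meant to remove.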

### Summary

The primary value of the Bayesian paradigm is its formal elegance, which allows automation of key problems: probability takes care of unpredictability in phenomena, priors help prevent overfitting by providing outside experience (AI practitioners would refer to it as background knowledge), the use of model uncertainty helps determine the reliability of predictions, and applied Bayesians are beginning to develop model compilers!

### Future

The theory and practice of data analysis are currently all mixed up among a number of overlapping disciplines: (applied/mathematical/geo/medical/…) statistics, machine learning, data mining, (econo/psycho/bio)metrics, bioinformatics. All of them pursue the same problems with different but qualitatively similar tools, lacking the scale to build tools that would help them get to the next level. It is important to disentangle them. The future of data analysis should lie on these four fronts:

1. reliable compilers and samplers that work with large databases and provide reliable sampling (see BUGS and HBC, empowered by the new generation of programming languages such as Haskell)
2. internet databases intended to manage background knowledge and related data sets, where the same variable and the same phenomenon appear in multiple tables, allowing priors to be based on more than a single data set. Research should be presented as raw data in a standardized form, not as reports and aggregates that prevent others from building on top of the finished work. Too many people are working on the same problems without sharing their data, because of the unsolved issue of the rights of data collectors, who can only gain credit through publications (see FreeBase, Machine Learning Repository, Trendrr, Swivel, OECD.Stat)
3. visualization & modeling environments that make it easier to clean and transform data, experiment with models, and present insights, reducing the amount of time needed to turn data into a model that can be communicated (see R Project, Processing, Gapminder)
4. interpretable modeling, which is important for bringing formal models closer to human intuition. It is still not clear how to quantify the importance of a predictor for the outcome: the regression coefficient is close, yet often confusing. With more powerful modeling frameworks, it is going to be possible to focus on this, worrying not about what one can fit but about model choice, model selection, model language, and visual language.

What do you think? What links did we miss?

1. Chris says:

We have a pretty flexible package for building hierarchical models using Python, called PyMC:

Version 2 is still a beta, but will be released soon.

2. Aki Vehtari says:

You say "Priors are how a Bayesian would perform regularization." Even with a vague prior, integrating over the posterior "regularizes", too. For example, consider the binomial example with the uniform prior distribution (e.g., Bayesian Data Analysis, Ch. 2). Now, the MAP estimate is the same as the ML estimate, and at the extreme observations y=0 and y=n the predictive probabilities are 0 and 1, and one might say that these are "overfitted". Even if we do not change the prior, but integrate over the posterior, the predictive probabilities are then 1/(n+2) and (n+1)/(n+2), a result that could be said to be more "regularized". I think that this point should be emphasized more (not downplaying the usefulness of informative priors, when available).

On the other hand, in the case of complex models, we are not able to integrate over the posterior exactly, and due to imperfect integration we may get more overfitted predictions.
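The numbers in the binomial example above can be checked with exact arithmetic. This is a minimal sketch assuming a Uniform(0,1) prior, under which the posterior is Beta(y+1, n-y+1) and the predictive probability of a further success is its mean; the function names and the choice n=10 are illustrative.

```python
from fractions import Fraction


def ml_estimate(y, n):
    """Plug-in estimate: observed proportion of successes (also the MAP under a uniform prior)."""
    return Fraction(y, n)


def posterior_predictive(y, n):
    """P(next outcome is a success | y successes in n trials) under a Uniform(0,1) prior,
    i.e. the mean of the Beta(y+1, n-y+1) posterior."""
    return Fraction(y + 1, n + 2)


n = 10
for y in (0, n):
    print("y=%d: plug-in %s, integrated %s"
          % (y, ml_estimate(y, n), posterior_predictive(y, n)))
```

For n=10 the plug-in predictions at the extremes are 0 and 1, while the integrated ones are 1/12 and 11/12, matching 1/(n+2) and (n+1)/(n+2).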

3. Aleks says:

Aki, good and important points – one should never forget that.

Chris, didn't know! Will check it out! Python is great.

4. Andrew says:

Aki: The Uniform(0,1) prior for the binomial probability _does_ contain information. I would call it a weakly informative prior. I agree that it does well (Agresti and Coull have written about the good frequency properties of the point estimate that results from it). But I would say, yes, this regularization is coming from the prior information. Weakly informative != vague.

Aleks: Regarding your comparison with frequentist probability, I recommend reading Larry's comment on my Bayesian Analysis article and, better still, my rejoinder (in particular, pages 473-476).