Bayesian inference for A/B testing: Lauren Kennedy and I speak at the NYC Women in Machine Learning and Data Science meetup tomorrow (Tues 27 Mar) 7pm

Here it is:

Bayesian inference for A/B testing

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University
Lauren Kennedy, Columbia Population Research Center, Columbia University

Suppose we want to use empirical data to compare two or more decisions or treatment options. Classical statistical methods based on statistical significance and p-values break down in the context of incremental improvement: that is, when there is a stream of innovations, each only slightly better (or possibly slightly worse) than what came before. In contrast, a Bayesian approach is ideally suited to decision making under uncertainty. We discuss the implications for applied statistics and code up some of these models in R and Stan, based on a case study by Bob Carpenter.
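The core Bayesian comparison described above — asking how probable it is that one treatment beats another, rather than thresholding a p-value — can be sketched in a few lines. This is a minimal illustration in Python (the talk itself uses R and Stan); the function name and the flat Beta(1,1) priors are assumptions for illustration, not the speakers' actual model:

```python
import random

def prob_b_beats_a(a_succ, a_fail, b_succ, b_fail, n_draws=10000, seed=0):
    """Monte Carlo estimate of P(theta_B > theta_A) for two Bernoulli
    (conversion-rate) arms under independent flat Beta(1,1) priors.

    The Beta posterior for arm A is Beta(a_succ + 1, a_fail + 1), and
    similarly for B; we draw from both and count how often B's rate
    exceeds A's.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        theta_a = rng.betavariate(a_succ + 1, a_fail + 1)
        theta_b = rng.betavariate(b_succ + 1, b_fail + 1)
        wins += theta_b > theta_a
    return wins / n_draws
```

A decision rule can then act directly on this posterior probability (or, better, on expected gains under a loss function), which degrades gracefully when improvements are incremental — exactly the setting where significance testing breaks down.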

Comments

  1. The paper by Benavoli, Corani, Demšar, and Zaffalon, “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis”, JMLR 2017, should also be interesting. Yes, NHST is discussed as well.

    • It’s great to see this kind of thing making its way into the ML literature. It starts by trying to convert classical tests into Bayes, but then changes gears and explains why you want to follow Andrew, Jennifer, and Masanao’s advice in using hierarchical model estimates as the basis of comparisons (and yes, they do a good job citing the literature).

  2. Thanks for the credit, but that case study was a draft and I had misunderstood Thompson sampling. Luckily, Lauren caught it before the event and is going to help me finish off the case study with actual comparisons with Thompson sampling. And maybe some more interesting cases than Bernoulli returns.

    For the record, probability matching chooses a bandit arm to pull with probabilities proportional to the current estimate that the bandit provides the highest expected return. Thompson sampling just estimates that expected return with a single draw, so it’s using an unbiased estimator, but one that’s very noisy.
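The single-draw scheme described in the comment above can be sketched quickly. This is an illustrative Python sketch, not the case study's actual code; the function names, the Beta(1,1) prior, and the simulation setup are all assumptions:

```python
import random

def thompson_step(successes, failures):
    """One round of Thompson sampling for Bernoulli bandits.

    Draw a single sample from each arm's Beta posterior and pull the
    arm whose draw is largest -- the noisy single-draw estimate of
    'which arm has the highest expected return' described above.
    """
    draws = [random.betavariate(s + 1, f + 1)  # flat Beta(1,1) prior
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda k: draws[k])

def simulate(true_probs, n_rounds=2000, seed=1):
    """Run Thompson sampling against arms with known Bernoulli rates,
    returning per-arm success and failure counts."""
    random.seed(seed)
    k = len(true_probs)
    succ, fail = [0] * k, [0] * k
    for _ in range(n_rounds):
        arm = thompson_step(succ, fail)
        if random.random() < true_probs[arm]:
            succ[arm] += 1
        else:
            fail[arm] += 1
    return succ, fail
```

Because each arm is chosen via one posterior draw, the probability of pulling an arm equals the posterior probability that its draw is the largest — which is exactly probability matching, implemented without ever computing those probabilities explicitly.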
