We recently had an email discussion among the Stan team regarding the use of predictive accuracy in evaluating computing algorithms. I thought this could be of general interest so I’m sharing it here.

It started when Bob said he’d been at a meeting on probabilistic programming where there was confusion on evaluation. In particular, some of the people at the meeting had the naive view that you could just compare everything on cross-validated proportion-predicted-correct for binary data.

But this won’t work, for three reasons:

1. With binary data, cross-validation is noisy. Model B can be much better than model A but the difference might barely show up in the empirical cross-validation, even for a large data set. Wei Wang and I discuss that point in our article, Difficulty of selecting among multilevel models using predictive accuracy.

2. 0-1 loss is not in general a good measure. You can see this by supposing you’re predicting a rare disease. Upping the estimated probability from 1 in a million to 1 in a thousand will have zero effect on your 0-1 loss (your best point prediction is 0 in either case) but it can be a big real-world improvement.

3. And, of course, a corpus is just a corpus. What predicts well in one corpus might not generalize. That’s one reason we like to understand our predictive models if possible.
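To make the rare-disease example in point 2 concrete, here is a minimal sketch. The population size and case count are illustrative assumptions, not data from the discussion; the only thing taken from the text is the pair of estimates (1 in a million versus 1 in a thousand).

```python
import math

def zero_one_loss(p, y):
    """0-1 loss of the best point prediction (predict 1 only if p > 0.5)."""
    prediction = 1 if p > 0.5 else 0
    return 0 if prediction == y else 1

def log_loss(p, y):
    """Negative log probability assigned to the observed outcome y."""
    return -math.log(p if y == 1 else 1.0 - p)

# Hypothetical population: one million people, one thousand true cases.
n_people, n_cases = 1_000_000, 1_000

def total_losses(p):
    """Total 0-1 loss and total log loss when everyone gets probability p."""
    n_healthy = n_people - n_cases
    zero_one = n_cases * zero_one_loss(p, 1) + n_healthy * zero_one_loss(p, 0)
    log = n_cases * log_loss(p, 1) + n_healthy * log_loss(p, 0)
    return zero_one, log

zero_one_a, log_a = total_losses(1e-6)  # estimate: 1 in a million
zero_one_b, log_b = total_losses(1e-3)  # estimate: 1 in a thousand

print("0-1 loss:", zero_one_a, zero_one_b)  # the same in both cases
print("log loss:", round(log_a, 1), round(log_b, 1))  # lower for 1e-3
```

Both estimates lead to the point prediction "no disease" for everyone, so the 0-1 loss cannot tell them apart, while the log loss rewards the better-calibrated estimate.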

Bob in particular felt strongly about point 2 above. He wrote:

Given that everyone (except maybe those SVM folks) is doing *probabilistic* programming, why not use log loss? That’s the metric that most of the Kaggle competitions moved to. It tests how well calibrated the probability statements of a model are in a way that 0/1 loss, squared error, and ROC curve metrics like mean precision don’t.

My own story dealing with this involved a machine learning researcher trying to predict industrial failures who built a logistic regression where the highest likelihood of a component failure was 0.2 or so. They were confused because the model didn’t seem to predict any failures at all, which seemed wrong. That’s just a failure to think in terms of expectations (20 parts with a 20% chance of failure each would lead to 4 expected failures). I also tried explaining that the model may be well calibrated and there may not be a part that has more than a 20% chance of failure. But they wound up doing what PPAML’s about to do for the image tagging task, namely compute some kind of ROC curve evaluation based on varying thresholds, which of course, doesn’t measure how well calibrated the probabilities are, because it’s only sensitive to ranking.
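The two claims in that story (expected failures add up, and ROC-style metrics only see ranking) can be checked with a small simulation. This is a sketch under assumptions: the simulated parts, the seed, and the fourth-root distortion are all made up for illustration.

```python
import math
import random

random.seed(0)

# Expected number of failures is the sum of the failure probabilities.
failure_probs = [0.2] * 20
expected_failures = sum(failure_probs)  # 20 parts at 20% each

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_log_loss(probs, labels):
    """Average negative log probability of the observed outcomes."""
    total = sum(math.log(p if y == 1 else 1.0 - p) for p, y in zip(probs, labels))
    return -total / len(labels)

# Simulated parts whose true failure probabilities never exceed 0.2.
probs = [random.uniform(0.0, 0.2) for _ in range(500)]
labels = [1 if random.random() < p else 0 for p in probs]

# A monotone distortion keeps the ranking but ruins the calibration.
distorted = [p ** 0.25 for p in probs]

print("expected failures:", expected_failures)
print("AUC:", auc(probs, labels), auc(distorted, labels))
print("log loss:", mean_log_loss(probs, labels), mean_log_loss(distorted, labels))
```

The AUC is identical for the calibrated and the distorted probabilities because it depends only on their order, whereas the log loss gets worse once the probabilities stop meaning what they say.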

Tom Dietterich concurred:

Regarding holdout likelihood, yes, this is an excellent suggestion. We have evaluated on hold-out likelihood on some of our previous challenge problems. In CP6, we focused on the other metrics (mAP and balanced error rate) because that is what the competing “machine learning” methods employed.

Within the machine learning/computer vision/natural language processing communities, there is a wide-spread belief that fitting to optimize metrics related to the specific decision problem in the application is a superior approach. It would be interesting to study that question more deeply.

To which Bob elaborated:

I completely agree, which is why I don’t like things like mean average precision (MAP), balanced 0/1 loss, and balanced F measure, none of which relate to any relevant decision problem.

It’s also why I don’t like 0/1 loss (either straight up, through balanced F measures, through macro-averaged F measure, etc.), because that’s never the operating point anyone wants. At least in 10 years working in industrial machine learning, it was never the decision problem anyone wanted. Customers almost always had asymmetric utility for false positives and false negatives (think epidemiology, suggesting search spelling corrections, speech recognition in an online dialogue system for airplane reservations, etc.) and wanted to operate at either very high precision (positive predictive accuracy) or very high recall (sensitivity). No customer or application I’ve ever seen other than writing NIPS or Computational Linguistics papers ever cared about balanced F measure in a large data set in an application.

Log loss is a better measure for generic decision making than area under the curve because it measures how well calibrated the probabilistic inferences are. Well-calibrated inferences are optimal for all decision operating points assuming you want to make Bayes-optimal decisions to maximize expected utility while minimizing risk. There’s a ton of theory around this, starting with Berger’s influential book on Bayesian decision theory from the 1980s. And it doesn’t just apply to Bayesian models, though almost everything in the machine learning world can be viewed as an approximate Bayesian technique.
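As a rough illustration of why one calibrated probability serves every operating point, here is a sketch of an expected-cost decision rule. The cost values and the function names are hypothetical; the rule itself is the standard one of acting when the expected cost of acting is lower.

```python
def expected_cost(p, act, cost_fp, cost_fn):
    """Expected cost of acting (True) or not acting (False) when P(event) = p."""
    # Acting on a non-event is a false positive; not acting on an event is a false negative.
    return (1.0 - p) * cost_fp if act else p * cost_fn

def bayes_decision(p, cost_fp, cost_fn):
    """Act when acting has the lower expected cost.

    This is equivalent to thresholding p at cost_fp / (cost_fp + cost_fn).
    """
    return expected_cost(p, True, cost_fp, cost_fn) < expected_cost(p, False, cost_fp, cost_fn)

p = 0.2  # one calibrated probability, reused under two different utilities

# A customer who hates misses (high recall): false negatives cost 10x more.
acts_when_misses_are_costly = bayes_decision(p, cost_fp=1.0, cost_fn=10.0)

# A customer who hates false alarms (high precision): false positives cost 10x more.
acts_when_alarms_are_costly = bayes_decision(p, cost_fp=10.0, cost_fn=1.0)

print(acts_when_misses_are_costly, acts_when_alarms_are_costly)
```

The same 20% probability leads to opposite actions under the two cost structures, which is only sensible if the 20% can be trusted as a probability.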

Being Bayesian, the log loss isn’t a simple log likelihood with point estimated parameters plugged in (a popular approximate technique in the machine learning world), but a true posterior predictive estimate as I described in my paper. Of course, if your computing power isn’t up to it, you can approximate with point estimates and log loss by treating your posterior as a delta function around its mean (or even mode if you can’t even do variational inference).

Sometimes ranking is enough of a proxy for decision making, which is why mean average precision (truncated to high precision, say average precision at 5) is relevant for some search apps, such as Google’s, and mean average precision (truncated to high recall) is relevant to other search apps, such as that of a biology post-doc or an intelligence analyst. I used to do a lot of work with DoD and DARPA and they were quite keen to have very very high recall; the intelligence analysts really didn’t like systems that had 90% recall so that 10% of the data were missed! At some points, I think they kept us in the evaluations because we provided an exact boolean search that had 100% recall, so they could look at the data, type in a phrase, and be guaranteed to find it. That doesn’t work with first-pass first-best analyses.
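The difference between a posterior predictive probability and a plug-in estimate, as drawn above, can be sketched for a one-coefficient logistic regression. Everything here is an assumption for illustration: the "posterior draws" are simulated from a normal distribution as a stand-in for MCMC output, and the predictor value is arbitrary.

```python
import math
import random

random.seed(1)

def inv_logit(u):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-u))

# Stand-in for posterior draws of a single coefficient (hypothetical values).
beta_draws = [random.gauss(1.0, 2.0) for _ in range(4000)]
beta_mean = sum(beta_draws) / len(beta_draws)

def posterior_predictive_prob(x):
    """Average the predicted probability over the posterior draws."""
    return sum(inv_logit(b * x) for b in beta_draws) / len(beta_draws)

def plug_in_prob(x):
    """Treat the posterior as a point mass at its mean."""
    return inv_logit(beta_mean * x)

x_new = 3.0
p_pp = posterior_predictive_prob(x_new)
p_plug = plug_in_prob(x_new)

# Log loss each approach would incur if the outcome turned out to be 0.
loss_pp = -math.log(1.0 - p_pp)
loss_plug = -math.log(1.0 - p_plug)

print("posterior predictive:", p_pp, "plug-in:", p_plug)
print("log loss if y = 0:", loss_pp, loss_plug)
```

With a wide posterior, the plug-in probability is more extreme than the posterior predictive one, so it pays a larger log loss when the less likely outcome occurs.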

I suggested to Bob that he blog this but then we decided it would be more time-efficient for me to do it. The only thing is, then it won’t appear till October.

**P.S.** Here are Bob’s slides from that conference. He spoke on Stan.

Great discussion. I was excited to see Bob mention Kaggle’s trend toward mostly using log loss. (I was a small part of pushing things that direction at Kaggle.)

I’d love to see people go beyond log loss and use machine learning algorithms and evaluation metrics that involve the correlation structure. If we think of log likelihood as log(P(observed outcomes | predicted probabilities, assuming independence)), then it’d be great to ask algorithms to include the correlation structure in their predictions, and remove the independence assumption from the evaluation metric. Anyone know if something like that is happening already?

“They were confused because the model didn’t seem to predict any failures at all, which seemed wrong. That’s just a failure to think in terms of expectations (20 parts with a 20% chance of failure each would lead to 4 expected failures).”

I come across this mindset numerous times. I think it stems from not thinking probabilistically (which is ironic), but rather in terms of yes/no classification (single machine either fails or doesn’t).

Likewise. Although thinking in terms of generative probabilistic models seems completely natural (and correct) to those who do it, it’s very learned. And it’s especially not the sort of way you’d think about the world if you spend your time learning more classical machine learning (tree/nnet/SVM)-based techniques.

I agree with Bob on basically everything he said. I’d just like to highlight an important distinction he made between academic, leaderboard-style evaluation of models and actual applications of probability models in the real world. In “model in a vacuum” evaluations, we assume total ignorance about what downstream decision an end-user could make using the outputs of the model. In real world applications, we usually know what decision process the model is going to be fed into. Bob highlighted a few examples of the latter.

As Bob said, model calibration (and log-loss as a measure for assessing this) is the “right” principle to follow if you have no knowledge of what downstream decisions are going to be made with the output. However, if you do have knowledge of the decision procedure, then calibrating the overall decision rule is more important than calibrating the model. In some of the industrial examples Bob gave where precision or recall had high priority, the end user would likely be completely comfortable with a model that misclassified certain rare cases with probability 100%, resulting in infinite log-loss, if this meant that misclassifications in the _other_ direction were minimized. If the model is really meant to make a single decision, then wasting degrees of freedom calibrating parts of the model that aren’t relevant to the decision-maker’s implied loss function is counterproductive. Of course, if the decision-maker wants to make many distinct decisions on the basis of the model, we approach the “generic decision-making” context that Bob mentioned, and log-loss becomes a more attractive option. And if we zoom all the way out to scientific users who want to add their model to the corpus of human knowledge, and thus think it should be used in _all_ downstream decisions going forward, then log-loss is clearly the right metric.

There’s a problem here that sort of mimics the structure of cross-validation. Ideally, we would think about methodological inquiries in academic settings as being scale-models of how the methods would be applied and evaluated in the real world, in the same way we’d like to have our held-out samples serve as proxies for truly out-of-sample data. But it seems that the evaluation mechanisms we’ve set up for Statistics and ML methods, where a few scalars in a table quantify “performance”, are in most cases fundamentally incongruent with real-world applications. It might be too much to ask for researchers to go out and implement a real-world case study where their new method helped somebody make better decisions (although this is absolutely the ideal), but a better incorporation of (perhaps hypothetical) use-cases in model evaluation might be nice.

>”Within the machine learning/computer vision/natural language processing communities, there is a wide-spread belief that fitting to optimize metrics related to the specific decision problem in the application is a superior approach. It would be interesting to study that question more deeply.”

Can we get some context for this? You want your loss function to correspond to the evaluation metric so that is what your algorithm optimizes. Otherwise it will optimize something else you may or may not care about. I would think this is the first thing people figure out.

Excellent post! I’m working through _Statistical Rethinking_ and was just thinking about the relationship between posterior predictive checks, target shuffling (a resampling statistical technique), and decision problems in the real world.

Bob is spot-on in talking about precision: I’ve done a lot of work with investigators (fraud, etc) and all of them ultimately care about precision.

An important point got lost in the shuffle.

2 + 2 = 5?

What surprised me most of all about my discussions at the PPAML PI meeting was the lack of concern for getting the right answer. By that I mean that if I write down a Bayesian model (joint density) and give you some data, I want the right inferences for the posterior. When you’re focused on an end-to-end measure, even something like root-mean-square-error or log loss, in some sense it doesn’t matter if the model you wrote down is being estimated properly.

You see this a lot in machine learning (or at least used to) with things like “early stopping” rules (there’s even a Wikipedia page for early stopping!). You’d build a model, write in the paper that you were doing an MLE, then run a hill climbing algorithm for only a few iterations. The result is a kind of ad-hoc regularization, which can help with inference. But rather than saying, “hey, the model’s wrong, I need a prior or regularizer”, the approach of just using cross-validation to pick a number of iterations to run an iterative method was often used.
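Here is a toy sketch of why stopping a hill climber early acts like regularization. The one-parameter quadratic objective, the step size, and the starting point are all assumptions chosen to keep the example small; they are not taken from any particular paper.

```python
def gradient_descent(mle, curvature, learning_rate, n_steps, start=0.0):
    """Minimize 0.5 * curvature * (theta - mle)**2 by plain gradient descent."""
    theta = start
    for _ in range(n_steps):
        gradient = curvature * (theta - mle)
        theta -= learning_rate * gradient
    return theta

mle = 2.0  # the estimate a fully converged optimizer would report

# Stopping after a handful of iterations leaves the estimate shrunk toward the start.
early = gradient_descent(mle, curvature=1.0, learning_rate=0.1, n_steps=5)

# Running to convergence recovers the MLE.
late = gradient_descent(mle, curvature=1.0, learning_rate=0.1, n_steps=500)

print("after 5 steps:", early)
print("after 500 steps:", late)
```

In this toy problem the early-stopped estimate sits between the starting value and the MLE, much as a penalized estimate would, but nothing in the written-down model says so.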

We ran into the same issue with our own variational inference tool. It recovered means on fairly complex models with a lot of data (the ones used in the NIPS paper eval), but when Ben tried to run it on models from the Gelman and Hill book or the BUGS examples, it got the wrong answer. That’s why it didn’t show up in RStan right away and is still labeled experimental. Ben’s solved some of this problem by using a QR decomposition of predictors under the hood in RStanARM.

But Alp and Dustin (or their advisor, Dave Blei) didn’t seem surprised and weren’t even particularly concerned. They said (and I paraphrase), “That problem’s too easy.” Hence Andrew’s comment, “We put in 2+2 and it spit out 5.” (Though it was more like we put in 2+2 and it spit out 0.05 or 50; we were OK with putting in 2000 + 2000 and getting 4100; we expected that order of bias.)

Log loss is really flat; decisions are where it’s at

I’ve done more thinking on log loss and it’s very flat through most of the response, which is what Wei and Andrew were getting at in the linked paper above. It heavily penalizes tail mistakes (estimating 0.001 chance of something that happened), but the difference between a 0.1 and a 0.15 prediction is negligible. So indeed this all needs to be put into some kind of more realistic decision problem, as Alex D says above and as I was getting at in the applications.
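The flatness is easy to put numbers on. A small sketch, assuming for illustration that the true probability is 0.1 and comparing a forecast of 0.1 with a forecast of 0.15:

```python
import math

def expected_log_loss(forecast, true_prob):
    """Expected log loss of a forecast when the event occurs with true_prob."""
    return -(true_prob * math.log(forecast)
             + (1.0 - true_prob) * math.log(1.0 - forecast))

# Penalty for one tail mistake: giving 0.001 to something that happened.
tail_penalty = -math.log(0.001)

# Expected per-observation cost of saying 0.15 when the truth is 0.1.
expected_gap = expected_log_loss(0.15, 0.1) - expected_log_loss(0.1, 0.1)

print("tail mistake:", tail_penalty)
print("0.15 instead of 0.1:", expected_gap)
```

The single tail mistake costs several nats, while the 0.1 versus 0.15 difference costs on the order of a hundredth of a nat per observation, so it takes a lot of data for the second kind of error to show up.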

Sharpness and Calibration

I’m writing up a tutorial for Stan on hierarchical modeling with a focus on posterior predictive inference and model checking. I’m using a binomial model for repeated binary trial data. Now the prediction is a number, and in that situation, it’s hard to imagine doing 0/1 loss. You could do something like root-mean-square-error, but it seems more natural to set it up probabilistically as an interval prediction problem that can be calibrated.

Michael Betancourt sent me references to the following when I asked what the concept of “sharpness” was called:

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007) Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2), 243–268.

You also can’t go wrong reading Michael’s own words on the subject,

Michael Betancourt (2015) A Unified Treatment of Predictive Model Comparison. arXiv:1506.02273.

There, I wound up writing a whole blog post as a comment.

This whole topic seems to be yet another instance where people first try some intuitive ad-hoc procedure, discover after long trial and error it fails often, only to discover after an even longer theoretical investigation that they should have just stuck with the simple Bayesian result which can be written down in minutes.

If you consider P(Model|d1,…,dn) it naturally factors so the model is judged as a product of out of sample predictions. But it does so in a way that clears up a host of problems. Consider one example inspired by Gelman’s article.

Suppose you’re trying to model the proportion of the vote total a candidate gets in an election. The true answer is .4. Model_1 estimates .42 while Model_2 estimates .44. The article claims the predictive log loss of these two is fairly similar even though Model_1 is often practically significantly more ‘accurate’ than Model_2. The solution evidently is to use some other comparison which magnifies the difference and makes Model_1 look much better.

But is Model_1 really better than Model_2? The point estimate is closer to the true value, but what if we had:

Model_1: .42 +/- .000001

Model_2: .44 +/- .05

Now which model is better? Imagine for example, the other candidate gets .41 fraction of the vote. Model_1 would convince you the first candidate is the winner when actually they lose. Model_2 would correctly warn you there’s too much uncertainty involved to make a definite prediction. The more “accurate” model is the one which fools you into making a mistake!
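One way to see the point numerically is to score each model by the log density it assigns to the true value. This is only a sketch: reading the “+/-” figures as standard deviations of a normal predictive distribution is an assumption made here, not something stated above.

```python
import math

def normal_log_density(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

truth = 0.40

# Assumed reading: point estimate as the mean, "+/-" as the standard deviation.
model_1 = normal_log_density(truth, mean=0.42, sd=0.000001)
model_2 = normal_log_density(truth, mean=0.44, sd=0.05)

print("Model_1 point error:", abs(0.42 - truth), "log density:", model_1)
print("Model_2 point error:", abs(0.44 - truth), "log density:", model_2)
```

Model_1 has the smaller point error but an enormously negative log density at the truth, because it is twenty thousand of its own standard deviations away; Model_2 is further off but honest about it.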

What’s happening here is that the simple equations of probability theory naturally “calibrate” the models from their predictive performance, but they do so in a very slick way. They don’t simply calibrate the point estimates or any variation of them the way Frequentists would want, they in essence calibrate a combination of point estimates and the uncertainties. The only question remaining is how long everyone plans on fiddle farting around with log loss or whatever until this sinks in and they just do it right.

That’s what the Gneiting et al. paper’s getting at with its notion of calibration (N% of the actual values are expected to be in the N% posterior intervals), and sharpness (assuming calibration, narrower intervals are better).
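Those two notions can be checked with a few lines of simulation. This is a sketch with made-up forecasters, in the spirit of the paper’s definitions rather than a reproduction of anything in it: one forecaster knows each case’s mean, one only knows the overall distribution, and one is overconfident.

```python
import random
from statistics import NormalDist

random.seed(2)

# Half-width multiplier for a central 50% interval of a normal distribution.
z = NormalDist().inv_cdf(0.75)

n = 20000
case_means = [random.gauss(0.0, 2.0) for _ in range(n)]
outcomes = [random.gauss(m, 1.0) for m in case_means]

def coverage_and_width(centers, sd):
    """Coverage and width of central 50% intervals centered at `centers`."""
    hits = sum(abs(y - c) <= z * sd for y, c in zip(outcomes, centers))
    return hits / n, 2.0 * z * sd

# Knows each case's mean and the right spread: calibrated and sharp.
informed = coverage_and_width(case_means, 1.0)

# Ignores the case means and uses the marginal spread: calibrated but wide.
marginal = coverage_and_width([0.0] * n, 5.0 ** 0.5)

# Knows the means but claims half the spread: narrow but not calibrated.
overconfident = coverage_and_width(case_means, 0.5)

print("informed:", informed)
print("marginal:", marginal)
print("overconfident:", overconfident)
```

The first two forecasters both cover about half the outcomes with their 50% intervals, so calibration alone cannot separate them; sharpness (interval width) does. The third has the narrowest intervals but fails calibration, so its sharpness does not count in its favor.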

The reason you can’t do this in a machine learning bakeoff is that not everyone’s going to give you a Bayesian posterior CDF. Some people still use SVMs, or heuristics, or completely uncalibrated approaches like naive Bayes.

Naive question: Other than what you call an end-to-end measure what other ways do we have to verify that “the model you wrote down is being estimated properly”?

I can get the part about reasoning out which is the right end to end measure (e.g. log loss vs 0/1 vs AUC etc.). But the part that confuses me is not having a measure.

You can simulate parameters from the prior, then generate data from the sampling distribution and check that you recover the parameters from the data with whatever your fitting procedure is. That’s what we encourage everyone to do with Stan.
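The recipe above can be sketched on a model small enough to fit exactly. The beta-binomial setup, the prior values, and the `fit` function are hypothetical stand-ins; in practice `fit` would be whatever inference procedure is being tested.

```python
import random

random.seed(3)

prior_a, prior_b = 2.0, 2.0   # hypothetical Beta prior on the success probability
n_trials = 30                 # binary trials per simulated data set
n_sims = 1000                 # number of simulated data sets
n_draws = 400                 # posterior draws per fit

def fit(successes):
    """Stand-in fitting procedure: sorted draws from the exact conjugate posterior."""
    a = prior_a + successes
    b = prior_b + n_trials - successes
    return sorted(random.betavariate(a, b) for _ in range(n_draws))

covered = 0
for _ in range(n_sims):
    # 1. Simulate a parameter from the prior.
    theta = random.betavariate(prior_a, prior_b)
    # 2. Generate data from the sampling distribution given that parameter.
    successes = sum(random.random() < theta for _ in range(n_trials))
    # 3. Fit, then check whether the central 50% interval contains the parameter.
    draws = fit(successes)
    lower, upper = draws[n_draws // 4], draws[3 * n_draws // 4]
    covered += lower <= theta <= upper

coverage = covered / n_sims
print("coverage of 50% intervals:", coverage)
```

If the fitting procedure is correct, the 50% intervals should contain the simulated parameter about half the time; coverage far from that suggests the algorithm is not recovering the posterior it claims to.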

For many models, you can run something relatively inefficient like MCMC and assuming you did it right, get the correct answer to within some arithmetic precision. That’s what Andrew did in BDA to test variational inference, for example.

When things get too hairy, usually due to multimodality, you can’t really do either. For example, nothing that fits LDA is ever going to recover the parameters from a simulation. Same for deep belief networks. Everything’s just too underlyingly unidentified.

But this can only be done if one is using a Bayesian Model in the first place, right? Or do you mean one could use your approach to compare two generic models agnostic to the underlying modeling technique used?

Is the goal deciding whether one model is better than another or whether one Bayesian model is better than another Bayesian model?

Again, sorry if I’m asking stupid questions.

Yes, you need a Bayesian model to generate from the prior. But that’s not the point I was trying to make. There are really two points floating around related to computation and model evaluation.

1. Assessing whether your inference algorithm is correct — that is, it samples from the posterior or recovers the correct MLE parameters. Using simulated data gives you a case where you know the right answer. It’s unit testing of a sort for an algorithm.

2. Evaluating a model’s inferences for data — that is, do they make sense and are they useful. This is something like a predictive evaluation and my point is just that 0/1 loss isn’t a very good one. Given that the goal is downstream decision making when we talk about predictions, you really want to concentrate on calibration and sharpness of the model’s predictions from data for quantities of interest. To the extent that one model’s better than another, it’s that it gives you more useful predictions or more useful insight into data that you already have. The latter is the point Wei and Andrew were trying to make — log loss is a blunt tool (response is relatively flat in most of the predictive range) — you need to look at the structure of the predictions.

You can follow essentially the same methodology with non-probabilistic methods. This is basically the only way to debug a machine learning algorithm. Generate data for which you know the right answer and compare the predictions of the fitted “model” to the right answer. One can look at a variety of side predictions as well, a bit like posterior predictive checks.

Yes, but that’s still predictive accuracy but we just argue about what constitutes the right measure of predictive accuracy. Right?

But Andrew talks about the “difficulty of selecting among multilevel models using predictive accuracy”.

Now that seems different & bigger: It seems to circumvent / sidestep predictive accuracy itself. And then we are not just arguing about the right metric.

Just to be fair, because it is simplifying to glomp “machine learners” on one side and “statisticians” on another, much literature on machine learning does do model checking. Implicitly they are using a lot of the tools that we statisticians have formalized; often, however, they do not specifically put them in a quantifiable/graphical and principled manner. Even in deep learning papers for example, they generate images from their fitted models and compare them visually to real data as a benchmark. Here’s even a trending paper at a deep learning conference on model checking (http://arxiv.org/abs/1511.01844); and certainly Bayesian machine learners, due to culture, are also very much for held-out log-likelihood and PPC’s more than other subfields of machine learning; see e.g., Hannah’s paper (http://homepages.inf.ed.ac.uk/imurray2/pub/09etm/etm.pdf) and David Mimno’s paper (http://www.cs.columbia.edu/~blei/papers/MimnoBlei2011.pdf) concerning evaluation on topic models which is the canonical graphical model example in Bayesian machine learning.

Also to slightly play contrarian, I dislike AUCs as well, but log loss is not an end-all solution. There’s no single proposal that statisticians can hand out and which everyone can understand without devoting much effort and research to studying the topic. Assessing model fit to particular components of the data using PPCs is ideal, but it does not promise as much generality as a single comprehensive quantity (I personally would argue against that to begin with but that’s another story). If one really wanted to go that way, it still seems that we statisticians are stuck in debates on this matter: LOO-CV following, say, Aki, Andrew, and Jonah’s paper makes progress toward this direction but this is a 2015 paper, not some 1981 paper we can point to and which has been tested through time. Moreover, these principles significantly hinge on importance sampling, and it’s unknown how practical this is for many of the problems machine learners are concerned about.

Early stopping, ad hoc regularization rules, and so on are a different story. These are done because many do not mind coupling model+inference. The two are tied together in many people’s philosophy, so long as the fitted model at the end “performs well”. I think this can in principle be wrong too, because it’s difficult to know what made the fitted model succeed (the approximate inference or the posited model?), and thus it is difficult to iterate over the data analysis procedure. However, in practice, it can be too restrictive to assume this separation when the fitted model is all that matters at the end, and post hoc procedures are done to improve its success. I think this is what often drives much of the practical success of machine learning on “complex models” and “massive data”, with which many applied statistics papers do not concern themselves.

> But Alp and Dustin (or their advisor, Dave Blei) didn’t seem surprised and weren’t even particularly concerned. They said (and I paraphrase), “That problem’s too easy.”

I can’t speak exactly for Alp or Dave, but I know they haven’t said that… The problem is very nuanced and I don’t know why you’re bringing it up here: if a linear model is initialized with parameters at one million and the true set of parameters is at 10, then there’s absolutely no way any stochastic optimization algorithm would practically converge to the true set of parameters. The numerical precision of the step-size decay would cause the algorithm to eventually “converge” to the middle. We’ve been repeating this. Of course there’s no way we could “tune” ADVI to get that to work, without additional innovations on convergence diagnostics (multiple chains of stochastic optimization requires something analogous but not the same as R-hat, and principled ways need to be done to do inference checking).

To summarize, I think ideologically we can come up with as many proposed (and clearly not end-all) solutions as we want; however, there really is a difference between what we can state in principle and what we can state and do in practice. Some things should certainly be done immediately, but the lack of knowledge about such practices prevents it; we should make a concerted effort to change these. Other things are more debatable, and we don’t have a clear idea how to separate these; telling everyone to clearly separate their models from inference just is not practical.

I didn’t mean to imply that log loss is the ultimate solution — that’s why I cited the papers with more nuanced discussions of calibration, which is where the statistical (and I believe inferential) action is at. And sorry to overgeneralize. I know there’s a whole lot going on under the heading of “machine learning” and “statistics” and I’m in part classifying for my own convenience.

But let’s just compare and contrast for a second.

1. In Alp, Rajesh, Andrew and Dave’s arXiv paper (classify the authors as you like) on variational inference in Stan, the only evaluation is on predictive performance, i.e. log loss [section 3; empirical study].

2. In BDA (the first author of which is the third author of the above paper), the only evaluation for variational inference is in terms of the similarity of the marginal posteriors for the parameters in the true posterior (calculated via MCMC) and the mean-field variational approximation (by design, mean-field ignores the covariance structure, though there are steps that can be taken, as in Tamara Broderick’s papers).

The approach in Alp et al.’s paper is similar to what I see again and again in NIPS papers, is the official evaluation protocol for the DARPA PPAML project, and it’s how all the Kaggle competitions are run. The approach in Andrew et al.’s book is what I often see in stats journals. That’s all I’m trying to say: there’s a correlation in how things are evaluated among papers published in certain venues.

And I could’ve sworn that you guys said the BUGS problems just weren’t big enough. Maybe I misheard or it was just a flippant remark, but I thought you guys really meant it. For example, I say that sort of thing in comparisons of HMC to Gibbs or random-walk Metropolis on very simple conjugate models; we designed Stan to be relatively efficient on hard problems, and the overhead on simple problems can tank its relative performance (more on this in future posts!).

Bob: are you also familiar with “target shuffling”? It’s a resampling technique where you take your data, shuffle the targets, then refit and score the model; do this many times and you get a distribution of scores and your score on the real (unshuffled-target) data had better be to the right, otherwise your model isn’t actually doing what you think it’s doing. It’s not for model comparison, so it’s tangential to this discussion, but it came to mind and I was wondering what you thought.
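For readers who haven’t met target shuffling, here is a minimal sketch of the procedure as described above. The simulated data, the one-predictor least-squares “model”, and the R-squared score are all assumptions chosen to keep the example self-contained.

```python
import random

random.seed(4)

def r_squared(xs, ys):
    """R-squared of a one-predictor least-squares fit (the squared correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Simulated data with a real relationship between predictor and target.
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [0.5 * x + random.gauss(0.0, 1.0) for x in xs]

real_score = r_squared(xs, ys)

# Shuffle the targets, refit, and score; repeat to build a reference distribution.
shuffled_scores = []
for _ in range(500):
    shuffled = ys[:]
    random.shuffle(shuffled)
    shuffled_scores.append(r_squared(xs, shuffled))

# Fraction of shuffled fits that score at least as well as the real fit.
frac_as_good = sum(s >= real_score for s in shuffled_scores) / len(shuffled_scores)

print("real score:", real_score)
print("fraction of shuffles at least as good:", frac_as_good)
```

If the score on the real targets does not stand clear of the shuffled distribution, the model is finding about as much structure in noise as in the data.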

Hey, nobody commented on my awesome clickbait title! I guess you’re all getting inured to these.

I did notice that January is the new October…

What happened to the article on “Null hypothesis” = “A specific random number generator”? It has disappeared from your blog.

Roger:

It got bumped for some of that horrible power-pose stuff, is scheduled to appear in May. I hope it’ll be worth the wait!

I found the post confusing. I thought that the null hypothesis is chosen before any random numbers are chosen. Do you mean that rejecting the null hypothesis with a p-value means rejecting a set of random numbers?

Oh well, maybe you will clarify it sometime between now and May. Or maybe I will forget about it. This must be the only blog with such a long lag time.

http://andrewgelman.com/2015/07/21/a-bad-definition-of-statistical-significance-from-the-u-s-department-of-health-and-human-services-effective-health-care-program/#comment-229229

Why use a single performance criterion? What about using multiple metrics? Along with log-loss, include Brier’s Score, parameter stability, BIC, complexity, etc.

Okay, I see. You are countering the phrase “occurred by chance”, which is even more confusing, as all the experiments have chance built in.

I think Andrew’s original point #1 primarily makes sense when you want to make inferences from the fitted model (as in the paper he cites). In many engineering applications, there is no interest in statistical inference, and hence, optimizing the ultimate decision making objective is all that matters. If that objective is 0/1 loss, then cross-validating on 0/1 loss makes sense. Similarly, choosing the amount of regularization by early stopping is a perfectly acceptable method (and computationally attractive). I agree with Bob, of course, that the objective is rarely 0/1 loss.

A closely related point is that in many engineering applications, the model is known to be very far from reality. It is often chosen for computational convenience. For example, Kalman filters and HMMs are applied everywhere with full knowledge that the underlying dynamics are not linear and not Markovian. In such cases, isn’t it meaningless to even talk about “getting the right answer”? The model is a “curve fitting sponge”, and we are adjusting parameters to get it to “behave well”. (Speaking of sponges, look at the success of the “deep neural networks” for signal interpretation tasks!)

The ML community knows very well that 0/1 loss is rarely the right loss. We have a whole subfield of “cost-sensitive learning” that looks at other families of losses (e.g., where a false negative is much more expensive than a false positive).

Returning to the world of probabilistic models, there has been interest in the ML community in studying the properties of proper scoring rules. As with log loss, these are guaranteed to converge to a properly calibrated distribution. I mention them only to remind folks that log loss is not the only sensible choice.

Does anyone know of work on Bayesian or robust approaches to decision problems when I have uncertainty about the true application loss function? For example, maybe I’m interested in precision@K, but I don’t know the value of K. That has been the (very shaky) justification for optimizing AUC in the machine learning community. For very big data problems, it might make sense to define a prior over the loss function, fit a complex model, and then optimize the decision rule against the uncertainty over the loss function.

Similarly, I wonder if it is the case that by using a carefully-constructed probabilistic model one could obtain additional guarantees of various kinds of robustness? For example, obviously a big advantage of a good Bayesian probabilistic model is that I can first fit (and validate) the model, and then I can pose a wide variety of queries (each with its own query-specific loss) against the model without having to re-fit the model. Perhaps I’m more likely to detect errors in the data and/or errors in my model family if I pursue the BDA methodology? Are there any formal results along these lines, or do people find that in practice they need to tweak the model for each query?

Tom:

I think the key is when you write, “the objective is rarely 0/1 loss.” Or even log loss or squared loss. But it’s standard to use these for evaluation. Sometimes I think this works well, other times not so much. Research is needed to better understand this, but often people take very broad principles such as “Bayes” or “cross-validation” and assume that these solve all their problems, as if the Bayesian solution or the minimum cross-validation solution is necessarily best for their application.

Perhaps of interest to this discussion: Murphy diagrams for comparing competing forecasts under any consistent scoring function (http://arxiv.org/pdf/1503.08195v2.pdf). Though it is talking about point forecasts.

You can think of proper scoring rules such as log-loss, Brier score, etc. as characterizing your expected loss when you have a probability distribution over the decision problems that you might face. Roughly speaking, log-loss is appropriate when you expect to face high-stakes decisions that require accurate judgments around 0 and 1, spherical loss is appropriate when you particularly need good accuracy around .5, and Brier score is appropriate when you regard the accuracy of your judgments to be equally important over the whole 0-1 interval. These results give you a way to think about cases that are intermediate between a very specific decision-theoretic context and no particular decision-theoretic context at all.
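What makes these rules “proper” can be checked directly: under each one, the forecast that minimizes expected loss is the true probability. A small sketch for binary outcomes, assuming a true probability of 0.3 and a grid of candidate forecasts (both choices are illustrative):

```python
import math

def log_loss(q, y):
    """Negative log of the probability given to the observed outcome."""
    return -math.log(q if y == 1 else 1.0 - q)

def brier_loss(q, y):
    """Squared difference between the forecast and the 0/1 outcome."""
    return (y - q) ** 2

def spherical_loss(q, y):
    """Negative spherical score: probability of the outcome over the forecast's norm."""
    return -(q if y == 1 else 1.0 - q) / math.sqrt(q * q + (1.0 - q) ** 2)

def expected_loss(loss, q, p):
    """Expected loss of forecast q when the event occurs with probability p."""
    return p * loss(q, 1) + (1.0 - p) * loss(q, 0)

p_true = 0.3
grid = [i / 100 for i in range(1, 100)]  # candidate forecasts 0.01 to 0.99

best_forecast = {
    loss.__name__: min(grid, key=lambda q: expected_loss(loss, q, p_true))
    for loss in (log_loss, brier_loss, spherical_loss)
}
print(best_forecast)
```

All three rules agree on which forecast is best; where they differ is in how steeply they punish forecasts that are off in different parts of the 0-1 interval, which is the distinction drawn above.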

See Schervish, M. (1989). A general method for comparing probability assessors. The Annals of Statistics 17, 1856–1879. (https://projecteuclid.org/euclid.aos/1176347398)

Ben Levinstein has a draft paper that aims to explain this result to philosophers, which may be more accessible: https://www.dropbox.com/s/544yr9374bsyvop/Schervish%20draft%201.0.pdf?dl=0

I’ll actually take the stance of defending 0-1 loss in certain problems.

In many problems, such as the extremely popular recommender-system problem, we are very concerned about the number of 0’s and 1’s we correctly label (assuming a binary recommendation). In such problems, we really don’t care if we estimate p = 0, yet observe y = 1, but rather just the correct number of labels. As with almost every machine learning problem, we always have the overfitting/underfitting issue. This really rears an ugly head with the recommender problem; it’s my experience you are either guaranteed to miss clear strong trends OR occasionally fit some p = 1 with y = 0 across a wide range of models.

If you use out-of-sample logistic loss, you’ve chosen to miss clear strong trends. I cannot emphasize enough how silly these results get; my experience is that you will find yourself choosing a model with 75% accuracy over a model with 99% accuracy just to avoid one or two p = 1 cases (of millions).

Of course, you can avoid the infinite out-of-sample loss issue by penalties/priors…but these penalized models typically end up giving the exact same suggestions as the unpenalized methods, and still don’t seem to favor the 99% accuracy model over the 75% accuracy model.