Comparing prediction errors

Someone named James writes:

I’m working on a classification task, sentence segmentation. The classifier algorithm we use (BoosTexter, a boosted learning algorithm) classifies each word independently conditional on its features, i.e. a bag-of-words model, so any contextual clues need to be encoded into the features. The feature extraction system I am proposing in my thesis uses a heteroscedastic LDA to transform data to produce the features the classifier runs on. The HLDA system has a couple parameters I’m testing, and I’m running a 3×2 full factorial experiment. That’s the background which may or may not be relevant to the question.

The output of each trial is a class (there are only 2 classes, right now) for every word in the dataset. Because of the nature of the task, one class strongly predominates, say 90-95% of the data. My question is this: in terms of overall performance (we use F1 score), many of these trials are pretty close together, which leads me to ask whether the parameter settings don’t matter, or whether they do matter but the performance of the trials just happened to be very similar. Is there a statistical test to see whether two trials (a vector of classes calculated from the same ordered list of data) are significantly different, especially given that they will both pick the majority class very often?

My reply:

I too have found that error rates and even log-likelihoods can be noisy measures of prediction accuracy for discrete-data models. One way you could compare two methods would be a head-to-head comparison where you see which cases were predicted correctly by both, incorrectly by both, correctly by A and incorrectly by B, or the reverse. If you’re considering multiple conditions, you could fit a logistic regression with varying intercepts for cases (or, to keep it simple, exclude from your analysis all the easy cases that are predicted correctly under all conditions and also exclude those tricky cases that are always mispredicted). What’s left should be more informative about the differences; essentially you’re doing something closer to a matched-pairs or two-way comparison, which will gain you information if the prediction errors are correlated (which I expect they will be).
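To make the head-to-head comparison concrete, here is a minimal Python sketch, not anything from the original exchange: the arrays y_true, pred_a, and pred_b are placeholders for the gold labels and two trials’ predictions, and the example data are simulated. The tabulation it produces is what McNemar’s test formalizes: cases both trials get right or both get wrong drop out, and only the discordant cases carry information about whether the trials differ.

```python
"""Head-to-head comparison of two classification trials on the same cases.

Tabulates which cases each trial gets right or wrong and runs an exact
McNemar test on the discordant cells.  The data at the bottom are
simulated purely for illustration.
"""
import numpy as np
from scipy.stats import binomtest  # exact binomial test (SciPy >= 1.7)


def head_to_head(y_true, pred_a, pred_b):
    """Return the 2x2 agreement counts and an exact McNemar p-value."""
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true

    both_right = int(np.sum(a_ok & b_ok))    # easy cases: no information
    both_wrong = int(np.sum(~a_ok & ~b_ok))  # hard cases: no information
    only_a = int(np.sum(a_ok & ~b_ok))       # A right, B wrong
    only_b = int(np.sum(~a_ok & b_ok))       # B right, A wrong

    # Under the null of equal accuracy, each discordant case is equally
    # likely to favor A or B, so only_a ~ Binomial(only_a + only_b, 1/2).
    n_disc = only_a + only_b
    p_value = binomtest(only_a, n_disc, 0.5).pvalue if n_disc > 0 else 1.0

    return {"both_right": both_right, "both_wrong": both_wrong,
            "only_a_right": only_a, "only_b_right": only_b,
            "p_value": p_value}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    y = (rng.random(n) < 0.08).astype(int)        # ~92% majority class
    a = np.where(rng.random(n) < 0.95, y, 1 - y)  # trial A, ~5% error rate
    b = np.where(rng.random(n) < 0.94, y, 1 - y)  # trial B, ~6% error rate
    print(head_to_head(y, a, b))
```

With the full 3×2 design, the same idea extends to the varying-intercept logistic regression mentioned above, fit to case-level correctness across all six conditions; the two-trial version here is just the simplest starting point.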

3 thoughts on “Comparing prediction errors”

  1. To compare two probabilistic binary classifiers, how about two scatterplots, one for positive and one for negative cases (according to the gold standard), with axes for the prediction of classifier 1 and of classifier 2? Or put them both together with color-coded true categories; a rough sketch along these lines appears after this comment thread. (I vaguely recall David Chiang of ISI making a plot like this at a talk at Columbia fairly recently.)

    If you have more than two classifiers, but still relatively few (say six or eight), then you can just make a grid of the pairwise comparisons.

    Out of curiosity, what are you trying to segment?

    If you use a classifier like logistic regression, you could build a joint model of LDA feature extraction and classification. Matt Hoffman and I were just discussing this setup and its relationship to what the folks up north are calling “deep neural nets”.

    • The task is sentence segmentation. We operate on the output of an automatic speech recognizer, so the speech is already split into word segments. Sentence segmentation is then a classification problem of whether a word-final boundary is a sentence boundary or not.
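For readers who want to try the scatterplot idea from the first comment, here is a minimal matplotlib sketch. Everything in it is assumed for illustration: prob_1 and prob_2 stand in for two classifiers’ predicted probabilities of the positive class on the same cases, y_true for the gold labels, and the simulated numbers have nothing to do with the sentence-segmentation data discussed above.

```python
"""Scatterplots comparing two probabilistic classifiers, one panel per gold class.

prob_1 and prob_2 stand in for the classifiers' predicted probabilities of the
positive class on the same cases; the numbers are simulated for illustration.
"""
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 2000
y_true = (rng.random(n) < 0.1).astype(int)                   # ~90% negative class
prob_1 = np.clip(0.7 * y_true + 0.2 * rng.random(n), 0, 1)   # classifier 1
prob_2 = np.clip(0.55 * y_true + 0.3 * rng.random(n), 0, 1)  # classifier 2

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharex=True, sharey=True)
for ax, cls, title in zip(axes, (1, 0), ("positive cases", "negative cases")):
    mask = y_true == cls
    ax.scatter(prob_1[mask], prob_2[mask], s=8, alpha=0.4)
    ax.plot([0, 1], [0, 1], linestyle="--", linewidth=1)  # line of agreement
    ax.set_title(title)
    ax.set_xlabel("classifier 1: P(positive)")
axes[0].set_ylabel("classifier 2: P(positive)")
fig.tight_layout()
plt.show()
```

Points near the dashed diagonal are cases where the two classifiers roughly agree; points far from it are the ones worth inspecting. With more than two classifiers (but still relatively few), the same panels can be tiled into a grid of pairwise comparisons, as the comment suggests.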
