A Generalization of ROC AUC for Binary Classifiers
Suppose you have a binary classifier. It looks at things and tries to guess whether they’re Dogs or Not Dogs.
More precisely, the classifier outputs a numeric score, which is higher for things it thinks are more likely to be Dogs.
There are a bunch of ways to assess how good the classifier is. Many of them, like false-positive rate and false-negative rate, start by forcing your classifier to output discrete predictions instead of scores:
1. Fix some threshold. Anything higher is a “predicted Dog”, anything lower is a “predicted Not Dog”.
2. See how often the classifier correctly predicts that Dogs are Dogs, and how often it correctly predicts that Not Dogs are Not Dogs.
3. Calculate some function of those numbers.
A lot of metrics — like F1 score — also assume a population with a particular ratio of Dogs to Not Dogs, which can be problematic in some applications.
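To make that concrete, here’s a tiny Python sketch of the recipe above. The scores, labels, and the 0.5 threshold are all made up; the point is just the shape of the computation.

```python
import numpy as np

# Made-up scores and labels: 1 = Dog, 0 = Not Dog.
scores = np.array([0.9, 0.8, 0.35, 0.6, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])

threshold = 0.5                      # step 1: pick an arbitrary cutoff
predicted_dog = scores > threshold

# Step 2: how often are Dogs called Dogs, and Not Dogs called Not Dogs?
true_positive_rate = predicted_dog[labels == 1].mean()
true_negative_rate = (~predicted_dog)[labels == 0].mean()

# Step 3: combine those counts into some metric, e.g. F1 (which also depends
# on the Dog-to-Not-Dog ratio in this particular sample):
tp = (predicted_dog & (labels == 1)).sum()
fp = (predicted_dog & (labels == 0)).sum()
fn = (~predicted_dog & (labels == 1)).sum()
f1 = 2 * tp / (2 * tp + fp + fn)
print(true_positive_rate, true_negative_rate, f1)
```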
The AUC metric doesn’t require a fixed threshold. Instead, it works as follows:
1. Select a random Dog and a random Not Dog.
2. Compare the score for the Dog to the score for the Not Dog.
3. Repeat steps 1-2 many times. AUC is the fraction of times the Dog scored higher.
Or rather, that’s one way to define it. The other way is to draw the ROC curve, which plots the relationship between true-positive rate (sensitivity) and false-positive rate (1-specificity) as the classification threshold is varied. AUC is the Area Under this Curve. That means it’s also the average sensitivity (averaged across every possible specificity), and the average specificity (averaged across sensitivities). If this is confusing, google [ROC AUC] for lots of explanations with more detail and nice pictures.
AUC is nice because of its threshold-independence, and because it’s invariant under strictly monotonic rescalings of the classifier score. It also tells you about (an average of) the classifier’s performance in different threshold regimes.
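If you want to see both definitions side by side, here’s a quick sketch with made-up Gaussian scores: it estimates AUC from random pairs, gets the same number from scikit-learn’s curve-based `roc_auc_score`, and checks that a monotonic rescaling of the scores changes nothing. The sample sizes and distributions are arbitrary.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Made-up scores: Dogs tend to score higher than Not Dogs.
dog_scores = rng.normal(loc=1.0, scale=1.0, size=5000)
not_dog_scores = rng.normal(loc=0.0, scale=1.0, size=5000)

# Definition 1: sample random (Dog, Not Dog) pairs and see how often the Dog wins.
pairs = 200_000
auc_pairs = np.mean(rng.choice(dog_scores, size=pairs)
                    > rng.choice(not_dog_scores, size=pairs))

# Definition 2: area under the ROC curve.
labels = np.concatenate([np.ones(len(dog_scores), dtype=int),
                         np.zeros(len(not_dog_scores), dtype=int)])
scores = np.concatenate([dog_scores, not_dog_scores])
auc_curve = roc_auc_score(labels, scores)

# A strictly monotonic rescaling of the scores leaves AUC unchanged.
auc_rescaled = roc_auc_score(labels, np.tanh(scores / 3.0))

print(auc_pairs, auc_curve, auc_rescaled)  # all roughly 0.76 for these made-up scores
```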
Sometimes, though, you care more about some regimes than others. For example, maybe you’re okay with misclassifying 25% of Not Dogs as Dogs, but if you classify even 1% of Dogs as Not Dogs then it’s a total disaster. Equivalently, suppose you care more about low thresholds for Dogness score, or the high-sensitivity / low-specificity corner of the ROC curve.
As I recently figured out, you can generalize AUC to this case! Let’s call it N-AUC.
There are two ways to define N-AUC, just as with AUC. First way:
1. Select N random Dogs and one random Not Dog.
2. Compare the score for the Not Dog to the scores for all of the Dogs.
3. Repeat steps 1-2 many times. N-AUC is the fraction of times that every Dog scored higher than the single Not Dog.
Second way:
N-AUC is the integral of the function N·sensitivity^(N−1) over the region under the ROC curve in the (sensitivity, 1−specificity) plane. (The factor of N just normalizes things so that a perfect classifier scores 1, matching the first definition.)
Fun exercise: These are equivalent.
Of course, 1-AUC is just the usual AUC.
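Here’s a sketch of N-AUC computed both ways, again on made-up Gaussian scores (the variable names are mine, not anything standard). The two estimates should agree up to Monte Carlo noise, and setting N = 1 recovers ordinary AUC.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
dog_scores = rng.normal(1.0, 1.0, size=5000)       # made-up scores again
not_dog_scores = rng.normal(0.0, 1.0, size=5000)

N = 4

# Definition 1 (Monte Carlo): do N random Dogs all outscore one random Not Dog?
trials = 200_000
dogs = rng.choice(dog_scores, size=(trials, N))
not_dogs = rng.choice(not_dog_scores, size=trials)
n_auc_sampled = np.mean(dogs.min(axis=1) > not_dogs)

# Definition 2: N times the integral of sensitivity^(N-1) over the region under
# the ROC curve. The inner integral over sensitivity (from 0 up to TPR) gives
# TPR^N / N, so the whole thing reduces to integrating TPR(x)^N along the
# false-positive-rate axis.
labels = np.concatenate([np.ones(len(dog_scores), dtype=int),
                         np.zeros(len(not_dog_scores), dtype=int)])
scores = np.concatenate([dog_scores, not_dog_scores])
fpr, tpr, _ = roc_curve(labels, scores)
n_auc_integral = np.trapz(tpr**N, fpr)

print(n_auc_sampled, n_auc_integral)   # agree up to noise; N = 1 gives ordinary AUC
```

Taking the minimum of the N Dog scores is just a vectorized way of asking whether every Dog beat the Not Dog.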
You can also emphasize the opposite, high-threshold regime by comparing one Dog to N Not Dogs, or integrating N·specificity^(N−1).
In fact, you can generalize further, to (N,M)-AUC:
Compute P(all N Dogs score higher than all M Not Dogs), or integrate N·M × sensitivity^(N−1) × specificity^(M−1) under the curve. For large, comparable values of M and N, this weighting emphasizes the middle of the ROC curve, favoring classifiers that do well in that regime.
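And a quick Monte Carlo sketch of the (N, M) version, with made-up scores and a helper name (`nm_auc`) I just invented; the event being counted is “every one of the N Dogs outscores every one of the M Not Dogs”.

```python
import numpy as np

rng = np.random.default_rng(0)
dog_scores = rng.normal(1.0, 1.0, size=5000)       # made-up scores once more
not_dog_scores = rng.normal(0.0, 1.0, size=5000)

def nm_auc(dogs, not_dogs, n, m, trials=200_000):
    """Monte Carlo estimate of P(all n Dogs score higher than all m Not Dogs)."""
    worst_dog = rng.choice(dogs, size=(trials, n)).min(axis=1)
    best_not_dog = rng.choice(not_dogs, size=(trials, m)).max(axis=1)
    return np.mean(worst_dog > best_not_dog)

print(nm_auc(dog_scores, not_dog_scores, 1, 1))   # ordinary AUC, about 0.76 here
print(nm_auc(dog_scores, not_dog_scores, 5, 5))   # emphasizes the middle of the curve
```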
I thought of this generalization while working on Redwood’s adversarial training project, which involves creating a classifier with very low false-negative rate and moderate false-positive rate. In that context, “Dogs” are snippets of text that describe somebody being injured, and “Not Dogs” are snippets that don’t. We’re happy to discard quite a lot of innocuous text as long as we can catch nearly every injury in the process. Regular old AUC turned out to be good enough for our purposes, so we haven’t tried this version, but I thought it was interesting enough to make for a good blog post.