It depends on how much signal there is in your data. If the base rate is 60%, but there’s so little signal in the data that the Bayes-optimal predicted probabilities only vary between 55% and 65%, then even a perfect model isn’t going to beat the majority-class baseline on accuracy: every prediction is above 50%, so thresholding at 0.5 just labels everything positive.
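A minimal sketch of that point, with made-up numbers (assuming calibrated probabilities drawn uniformly on [0.55, 0.65], so the base rate is about 60%):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Bayes-optimal probabilities confined to a narrow band above 0.5
# (hypothetical: uniform on [0.55, 0.65]).
p = rng.uniform(0.55, 0.65, size=n)
y = rng.binomial(1, p)  # true labels drawn from those probabilities

# The "perfect" model knows p exactly, but thresholding at 0.5
# still predicts positive for every example.
perfect_preds = (p > 0.5).astype(int)
baseline_preds = np.ones(n, dtype=int)  # always predict the majority class

print("perfect-model accuracy :", (perfect_preds == y).mean())
print("majority-class accuracy:", (baseline_preds == y).mean())
# Both come out around 0.60 (the base rate), even though the model is Bayes-optimal.
```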
Makes sense.
I’m not really sure what you mean by this. There’s no such thing as an objectively “bad parameter” for sensitivity (well, unless your ROC curve is non-convex); it depends on the relative cost of type I and type II errors.
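To make “it depends on the relative costs” concrete: with calibrated probabilities, predicting positive has lower expected cost exactly when p exceeds cost_FP / (cost_FP + cost_FN), so the right operating point moves with the cost ratio. A small sketch with hypothetical costs (the function name and numbers are just for illustration):

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimises expected cost, assuming
    calibrated probabilities and per-error costs cost_fp / cost_fn.

    Predict positive when (1 - p) * cost_fp < p * cost_fn,
    i.e. when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical cost ratios: the "correct" operating point shifts a lot.
print(cost_optimal_threshold(1, 1))    # 0.5   -- symmetric costs
print(cost_optimal_threshold(1, 10))   # ~0.09 -- missing a positive is 10x worse
print(cost_optimal_threshold(10, 1))   # ~0.91 -- false alarms are 10x worse
```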
I think they both have their strengths and weaknesses. When you hand your model to a non-statistician to use, you’ll set a decision threshold. If the ROC curve is non-convex, then yes, some regions are strictly dominated by others. In that case area under the curve is a broken metric, because it gives weight to completely useless regions. You could replace the dominated regions with the segments that dominate them (i.e. convexify the curve), but that’s inelegant. And even where the curve is only barely convex, with second derivative near zero, AUC still cares too much about regions that would only ever be used under an extreme utility function.
So in a way it’s better to take the balanced F1 score and maximise it over decision thresholds (see the sketch after this comment). Then you’re ignoring the model’s performance at implausible decision thresholds, and if you’re implicitly using a very wrong utility function, at least people can easily call you out on it.
For example, here the two models have similar AUC, but over the range of decision thresholds you would plausibly set, the blue model is better; at least it’s clearly good at something.
Obviously, ROC has its advantages too and may be better overall; I’m just pointing out a couple of overlooked strengths of the simpler metric.
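A minimal sketch of “maximise F1 over thresholds” alongside AUC, using scikit-learn on synthetic data (the dataset and model here are hypothetical, purely to show the computation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data -- only here to have some scores to evaluate.
X, y = make_classification(n_samples=5000, weights=[0.6], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# AUC integrates over *every* threshold, including ones nobody would deploy.
auc = roc_auc_score(y_te, scores)

# Max-F1 only rewards the model at its best single operating point.
precision, recall, thresholds = precision_recall_curve(y_te, scores)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best = f1.argmax()
print(f"AUC = {auc:.3f}")
print(f"max F1 = {f1[best]:.3f} at threshold ~{thresholds[min(best, len(thresholds) - 1)]:.3f}")
```

The point isn’t that max-F1 is the right metric, only that it scores the model at an operating point someone might actually deploy, which is what the comment above is arguing.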
Because different utility functions can rank models differently, the problem “find a utility-function-independent statistic that is good at ranking classifiers” is ill-posed. A lot of debates over model scoring statistics seem to cash out as debates over which statistics produce model selection that works robustly across common real-world utility functions.
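A toy illustration of why the problem is ill-posed: two hypothetical classifiers whose ranking flips as the cost ratio changes (the operating points are made up for the example):

```python
# Expected cost per example for a classifier with a fixed operating point,
# given the positive-class base rate and per-error costs.
def expected_cost(tpr, fpr, base_rate, cost_fn, cost_fp):
    return base_rate * (1 - tpr) * cost_fn + (1 - base_rate) * fpr * cost_fp

# Hypothetical operating points: A is sensitive but noisy, B is conservative.
A = dict(tpr=0.90, fpr=0.30)
B = dict(tpr=0.60, fpr=0.05)

for cost_fp in (1, 10):  # vary how much a false positive hurts
    costs = {name: expected_cost(m["tpr"], m["fpr"], 0.5, 1, cost_fp)
             for name, m in (("A", A), ("B", B))}
    print(f"cost_fp={cost_fp}: {costs} -> better model:", min(costs, key=costs.get))
# With symmetric costs A wins; when false positives cost 10x more, B wins.
```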
Makes sense.
Yes.