When you say ‘system A works better than system B’, this implies that system A should be used, and that this is clear-cut. But the notion of ‘works better’ lacks a rigorous definition.
What? These are generally binary decisions, with a known cost to false positives and false negatives, and known rates of false positives and false negatives. It should be trivial to go from that to a utility-valued error score.
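As a minimal sketch of the ‘utility-valued error score’ being described here (the rates and costs below are made-up numbers, not figures from the discussion):

```python
# Minimal sketch: combining known error rates with assumed per-error costs
# into a single utility-valued score. All numbers are hypothetical.

def expected_cost(fp_rate, fn_rate, fp_cost, fn_cost):
    """Expected cost per decision, given error rates and per-error costs."""
    return fp_rate * fp_cost + fn_rate * fn_cost

# Hypothetical system: 5% false positives, 2% false negatives, with a false
# negative assumed to be ten times as costly as a false positive.
print(round(expected_cost(fp_rate=0.05, fn_rate=0.02, fp_cost=1.0, fn_cost=10.0), 3))
# 0.25 cost units per decision; lower is better.
```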
You just presumed away my argument. I claim specifically that the relationship between various classes of errors is not well-defined. This can lead to abuse of the term ‘better’.
Please tell me why I should take that as a presumption.
Because those are the class of problems this post discusses.
From the top of the post:
A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?
The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do.
A parole board considers the release of a prisoner: Will he be violent again?
I think this is the kind of question that Miller is talking about. Just because a system is correct more often doesn’t necessarily mean it’s better.
For example, if the human experts allowed more people out who went on to commit relatively minor violent offences, while the SPRs did this less often but were more likely to release prisoners who went on to commit murder, then there would be legitimate debate over whether the SPR is actually better.
I think this is exactly what he is talking about when he says
Where AIs compete well, they generally beat trained humans fairly marginally on easy (or even most) cases, and then fail miserably at borderline or novel cases. This can make it dangerous to use them if the extreme failures are dangerous.
I don’t know whether there is evidence that this is a real effect, but to address it, what you really need to measure is the total utility of outcomes rather than accuracy.
Yes. You got it, exactly.
No. I’m talking about classes of errors.
As in, which is better?
1. A test that reports 100 false positives for every 100 false negatives for disease X
2. A test that reports 110 false positives for every 90 false negatives for disease X
The cost of false positives vs. false negatives is not defined automatically. If humans are closer to #1 than to #2, and I develop a system like #2, I might define #2 to be better. Then, later on down the line, I stop talking about how I defined ‘better’ and just use the word, and no one questions it because hey… better is better, right?
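To make the ‘which is better?’ question concrete, here is a minimal sketch (the error costs are assumptions, not figures from the discussion) showing that the ranking of the two tests above flips depending on the cost assigned to each class of error:

```python
# Minimal sketch: the ranking of tests #1 and #2 above depends entirely on
# the assumed cost ratio between false positives and false negatives.
# All costs are hypothetical.

test_1 = {"fp": 100, "fn": 100}  # test #1: 100 false positives per 100 false negatives
test_2 = {"fp": 110, "fn": 90}   # test #2: 110 false positives per 90 false negatives

def total_cost(test, fp_cost, fn_cost):
    return test["fp"] * fp_cost + test["fn"] * fn_cost

for fp_cost, fn_cost in [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]:
    c1 = total_cost(test_1, fp_cost, fn_cost)
    c2 = total_cost(test_2, fp_cost, fn_cost)
    better = "#1" if c1 < c2 else "#2" if c2 < c1 else "tie"
    print(f"fp_cost={fp_cost}, fn_cost={fn_cost}: #1 costs {c1}, #2 costs {c2} -> {better}")

# Equal costs: the two tests tie. Weighting false negatives more heavily
# favours #2; weighting false positives more heavily favours #1.
```

Neither ranking is wrong on its own; each simply encodes a different answer to the question being argued here, namely which class of error matters more.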
Which is more costly, false positives or false negatives? This is an easy question to answer.
If false positives, #1 is better. If false negatives, #2. I really do not see what your point is. These problems you bring up are easily solved.
Which is better: Releasing a violent prisoner, or keeping a harmless one incarcerated? If you can find an answer that 90% of the population agrees on, then I think you’ve done better than every politician in history.
That people do NOT agree suggests to me that it’s hardly a trivial question...
Releasing a violent prisoner, or keeping a harmless one incarcerated?
How violent, how preventably violent, how harmless, how incarcerated, how long incarcerated? For any specific case with these details agreed upon, I am confident a supermajority would agree.
That people do NOT agree suggests to me that it’s hardly a trivial question...
That people don’t agree suggests one side is comparing releasing a serial killer to incarcerating a drifter in jail for a short while, and the other side is comparing releasing a middle-aged man who in a fit of passion struck his adulterous wife to incarcerating Gandhi for the term of his natural life. More generally, they are deciding based on one specific example that is strongly available to them.
As you phrased it, that question is about as answerable as “how long is a piece of string?”.
Yes. Thank you. Since at least one person understood me, I’m gonna jump off the merry-go-round at this point.
(For reference, I realize an expert runs into the same issue; I just think it’s unfair to say that the issue is “easily solved”.)
Many tests have a continuous, adjustable sensitivity parameter, letting you set the trade-off however you want. In that case, we can refrain from judging the relative badness of false positives and false negatives and use the area under the ROC curve (ROC AUC), which is basically the integral over all such trade-offs. Tests that are going to be combined into a larger predictor are usually measured this way.
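For concreteness, a minimal sketch of that kind of measurement, assuming scikit-learn is available (the labels and scores below are made-up example data):

```python
# Minimal sketch: area under the ROC curve, i.e. performance integrated over
# all sensitivity thresholds, so no single FP-vs-FN trade-off has to be chosen.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                      # actual outcomes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]   # the test's continuous output
print(roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = chance level
```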
Machine learning packages generally let you specify a “cost matrix”: the cost of each possible confusion. For a 2-valued test, it would be a 2x2 matrix with zeroes on the diagonal and the costs of A->B and B->A errors in the other two spots. For a test with N possible results, the matrix is NxN, with zeroes on the diagonal, and each (row, col) position is the cost of mistaking the result corresponding to that row for the result corresponding to that column.
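A minimal sketch of how such a cost matrix scores a 2-valued test, assuming NumPy (the counts and costs are hypothetical):

```python
# Minimal sketch: scoring a 2-valued test with a cost matrix instead of raw
# accuracy. Rows are the true result, columns the predicted result; the
# diagonal (correct predictions) costs nothing. All numbers are hypothetical.
import numpy as np

confusion = np.array([[900, 100],    # true A: 900 correct, 100 predicted as B
                      [ 90, 910]])   # true B:  90 predicted as A, 910 correct

cost = np.array([[0.0, 1.0],   # an A->B confusion costs 1
                 [5.0, 0.0]])  # a B->A confusion costs 5

print((confusion * cost).sum())  # element-wise weighting, then sum: 550.0 cost units
```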