Because that is the class of problems this post discusses.
From the top of the post:
A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?
The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do.
A parole board considers the release of a prisoner: Will he be violent again?
I think this is the kind of question that Miller is talking about. Just because a system is correct more often doesn't necessarily mean it's better.
For example, if the human experts allowed more people out who went on to commit relatively minor violent offences, and the SPRs did this less often but were more likely to release prisoners who went on to commit murder, then there would be legitimate discussion over whether the SPR is actually better.
I think this is exactly what he is talking about when he says:
Where AI’s compete well generally they beat trained humans fairly marginally on easy (or even most) cases, and then fail miserably at border or novel cases. This can make it dangerous to use them if the extreme failures are dangerous.
Whether or not there is evidence that this is a real effect I don't know, but to address it, what you really need to measure is the total utility of outcomes rather than accuracy.
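To make that concrete, here is a minimal sketch with entirely made-up numbers, showing how a predictor that is correct more often can still come out worse on total utility of outcomes when its errors fall in the worst class:

```python
# Made-up numbers: the SPR makes fewer mistakes overall, but more of its
# mistakes fall into the worst outcome class, so its total utility of
# outcomes is lower even though its accuracy is higher.
outcome_utility = {"no_reoffence": 0, "minor_reoffence": -1, "murder": -100}

# Hypothetical outcome counts over 1000 release decisions each.
expert = {"no_reoffence": 900, "minor_reoffence": 90, "murder": 10}  # 100 errors
spr    = {"no_reoffence": 920, "minor_reoffence": 60, "murder": 20}  #  80 errors

def total_utility(counts):
    return sum(outcome_utility[k] * n for k, n in counts.items())

print("expert:", total_utility(expert))  # -1090
print("SPR:   ", total_utility(spr))     # -2060  (more accurate, lower utility)
```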
Yes. You got it, exactly.
No. I’m talking about classes of errors.
As in, which is better?
1. A test that reports 100 false positives for every 100 false negatives for disease X
2. A test that reports 110 false positives for every 90 false negatives for disease X
The cost of fp vs. fn is not defined automatically. If humans are closer to #1 than #2, and I develop a system like #2, I might define #2 to be better. Then later on down the line I stop talking about how I defined better, and I just use the word better, and no one questions it because hey… better is better, right?
Which is more costly, false positives or false negatives? This is an easy question to answer.
If false positives, #1 is better. If false negatives, #2. I really do not see what your point is. These problems you bring up are easily solved.
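As an illustration of why the answer hinges on that pricing, here is a minimal sketch (with made-up per-error costs) showing that which of the two tests comes out "better" flips with the assumed costs of false positives and false negatives:

```python
# Made-up per-error costs: the "better" test depends entirely on how you
# price the two classes of error.
def expected_cost(n_fp, n_fn, cost_fp, cost_fn):
    return n_fp * cost_fp + n_fn * cost_fn

tests = {"#1": (100, 100), "#2": (110, 90)}  # (false positives, false negatives)

for cost_fp, cost_fn in [(2.0, 1.0), (1.0, 2.0)]:
    best = min(tests, key=lambda t: expected_cost(*tests[t], cost_fp, cost_fn))
    print(f"cost_fp={cost_fp}, cost_fn={cost_fn} -> {best} is better")
# cost_fp=2.0, cost_fn=1.0 -> #1 is better
# cost_fp=1.0, cost_fn=2.0 -> #2 is better
```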
Which is better: Releasing a violent prisoner, or keeping a harmless one incarcerated? If you can find an answer that 90% of the population agrees on, then I think you’ve done better than every politician in history.
That people do NOT agree suggests to me that it's hardly a trivial question...
How violent, how preventably violent, how harmless, how incarcerated, how long incarcerated? For any specific case with these agreed upon, I am confident a supermajority would agree.
That people don't agree suggests one side is comparing releasing a serial killer to incarcerating a drifter in jail a short while, and the other side is comparing releasing a middle-aged man who in a fit of passion struck his adulterous wife to incarcerating Gandhi for the term of his natural life. More generally, they are deciding based on one specific example they have strongly available to them.
As you phrased it, that question is about as answerable as "how long is a piece of string?".
Yes. Thank you. Since at least one person understood me, I’m gonna jump off the merry-go-round at this point.
(For reference, I realize an expert runs into the same issue; I just think it's unfair to say that the issue is "easily solved".)
Many tests have a continuous, adjustable parameter for sensitivity, letting you set the trade-off however you want. In that case, we can refrain from judging the relative badness of false positives and false negatives and use the area under the ROC curve (AUC), which is basically the integral over all such trade-offs. Tests that are going to be combined into a larger predictor are usually measured this way.
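For what it's worth, here is a minimal sketch (on made-up scores and labels) of that integral over trade-offs: sweep the decision threshold, record each (false positive rate, true positive rate) pair, and integrate the resulting curve:

```python
import numpy as np

# Made-up scores and labels; 1 = has disease X.
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])

# Sweep the threshold from high to low; each setting is one fp/fn trade-off.
fpr, tpr = [0.0], [0.0]
for t in np.sort(np.unique(y_score))[::-1]:
    pred = y_score >= t
    tpr.append(np.sum(pred & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(pred & (y_true == 0)) / np.sum(y_true == 0))

# Area under the ROC curve: trapezoidal integral of TPR over FPR.
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i] + tpr[i + 1]) / 2
          for i in range(len(fpr) - 1))
print(f"AUC = {auc:.3f}")  # 0.750 for this made-up data
```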
Machine learning packages generally let you specify a "cost matrix", which is the cost of each possible confusion. For a 2-valued test, it would be a 2x2 matrix with zeroes on the diagonal, and the cost of A->B and B->A errors in the other two spots. For a test with N possible results, the matrix is NxN, with zeroes on the diagonal, and each (row, col) position is the cost of a mistake that confuses the result corresponding to that row with the result corresponding to that column.
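As a sketch of what such a cost matrix does at prediction time (the class names and costs below are invented for illustration), the usual decision rule is to pick the label with the lowest expected cost rather than the most probable one:

```python
import numpy as np

# cost[i][j] = cost of predicting class j when the true class is i; the
# diagonal is zero because correct predictions cost nothing. The classes
# and costs are made up: 0 = "won't be violent", 1 = "will be violent".
cost = np.array([[0.0,  1.0],   # truly harmless: wrongly predicting "violent" costs 1
                 [10.0, 0.0]])  # truly violent: wrongly predicting "harmless" costs 10

p = np.array([0.8, 0.2])        # model's probabilities for each true class

expected_cost = p @ cost        # expected cost of each possible prediction
print(expected_cost)            # [2.  0.8]
print("predict class", np.argmin(expected_cost))  # 1, despite being less probable
```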