I don’t get it. You gave some people the drug and some people you didn’t. It seems pretty straightforward to estimate how likely someone is to die if you give them medicine.
Certainly it’s straightforward. Here’s how one can apply your logic. You gave some people [the ones whose disease has progressed the most] the drug and some people you didn’t [because their disease isn’t so bad you’re willing to risk it]; the percentage of deaths in the first, drugged group is much higher than in the second, non-drugged group; therefore, this drug is poison and you’re a mass murderer.
See the problem?
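A toy numeric sketch of the sleight of hand, with all counts invented purely for illustration: within each severity stratum the drug is helpful or harmless, yet the pooled comparison makes it look lethal.

```python
# Invented counts illustrating confounding by indication.
# Format: (treated deaths, treated total, untreated deaths, untreated total)
strata = {
    "severe": (40, 100, 16, 20),   # the drug goes mostly to severe patients
    "mild":   (2,  20,  10, 100),  # and is mostly withheld from mild ones
}

for name, (td, tn, ud, un) in strata.items():
    print(f"{name}: treated {td/tn:.0%} die vs. untreated {ud/un:.0%} die")
# severe: treated 40% die vs. untreated 80% die  -> the drug helps
# mild:   treated 10% die vs. untreated 10% die  -> the drug does no harm

# The pooled comparison, i.e. what the naive analysis sees:
td, tn = (sum(s[i] for s in strata.values()) for i in (0, 1))
ud, un = (sum(s[i] for s in strata.values()) for i in (2, 3))
print(f"pooled: treated {td/tn:.0%} die vs. untreated {ud/un:.0%} die")
# pooled: treated 35% die vs. untreated 22% die -> "poison"
```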
Of course people say “but this is silly, obviously we need to condition on health status.”
The point is: what if we can’t? Or what if there are other causally relevant factors here? In fact, what is “causally relevant” anyways… We need a system! ML people don’t think about these questions very hard, generally, because culturally they are more interested in “algorithmic approaches” to prediction problems.
(This is a clarification of gwern’s response to the grandparent, not a reply to gwern.)
The problem is the data is biased. The ML algorithm doesn’t know whether the bias is a natural part of the data or artificially induced. Garbage In—Garbage Out.
However, it can still be done if the algorithm has more information. Maybe some healthy patients ended up getting the medicine anyways and were far more likely to live, or some unhealthy ones didn’t and were even more likely to die. Now it’s straightforward prediction again: how likely is a patient to live based on their current health and whether or not they take the drug?
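A minimal sketch of that claim, assuming a synthetic setup where each patient has a recorded health score (the numbers and model are invented for illustration): once health status is observed, an ordinary classifier answers the conditional question directly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
health = rng.normal(size=n)  # higher = healthier
# Sicker patients are more likely to receive the drug:
treated = (rng.random(n) < 1 / (1 + np.exp(2 * health))).astype(int)
# Assumed ground truth: sickness raises the odds of death, the drug lowers them.
logit_death = -1.0 - 1.5 * health - 0.7 * treated
died = (rng.random(n) < 1 / (1 + np.exp(-logit_death))).astype(int)

model = LogisticRegression().fit(np.column_stack([health, treated]), died)
# Survival probability for one fairly sick patient, with and without the drug:
patient = np.array([[-1.0, 1], [-1.0, 0]])
print(1 - model.predict_proba(patient)[:, 1])  # the drug raises P(survive)
```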
You’re making up excuses. The data is not ‘biased’, it just is, nor is it garbage—it’s not made up, no one is lying or falsifying data or anything like that. If your theory cannot handle clean data from a real-world problem, that’s a big problem (especially if there are more sophisticated alternatives which can handle it).
Biased data is a real thing and this is a great example. No method can solve the problem you’ve given without additional information.
This is not biased data. No one tampered with it. No one preferentially left out some data. There is no Cartesian daemon tampering with you. It’s a perfectly ordinary causal problem for which one has all the available data. If you run a regression on the data, you will get accurate predictions of future similar data—just not what happens when you intervene and realize the counterfactual. You can’t throw your hands up and disdainfully refuse to solve the problem, proclaiming, ‘oh, that’s biased’. It may be hard, and the best available solution may be weak or require strong assumptions, but if that is the case, the correct method should say as much and specify what additional data or interventions would allow stronger conclusions.
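To make that concrete, here is a sketch (using the same invented data-generating process as the example above) in which a regression on treatment alone is an accurate description of the observational data, and simultaneously the opposite of what happens under intervention:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate(n, rng, intervene=False):
    health = rng.normal(size=n)
    if intervene:
        treated = rng.integers(0, 2, size=n)  # do(): assign the drug by coin flip
    else:  # observational: sicker patients get the drug
        treated = (rng.random(n) < 1 / (1 + np.exp(2 * health))).astype(int)
    p_death = 1 / (1 + np.exp(1.0 + 1.5 * health + 0.7 * treated))
    return treated, (rng.random(n) < p_death).astype(int)

rng = np.random.default_rng(1)
treated, died = simulate(100_000, rng)
m = LogisticRegression().fit(treated.reshape(-1, 1), died)
print(m.coef_[0, 0])  # positive: treatment really does predict death in this data

treated_i, died_i = simulate(100_000, rng, intervene=True)
print(died_i[treated_i == 1].mean() - died_i[treated_i == 0].mean())
# negative: under intervention, the drug reduces deaths
```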
I’m not certain why I used the word “bias”. I think what I was getting at is that the data isn’t representative of the population of interest.
Regardless, no other method can solve the problem as specified without additional information, contrary to what you claimed. And with additional information, it’s straightforward prediction again.
That is, condition on their prior health status, not just on the fact that they’ve been given the drug, and weight by the prior probabilities of each health status.
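A sketch of that recipe on the toy counts from the first example, under the strong assumed condition that severity is the only confounder: average the severity-stratified death rates, weighted by each stratum’s share of the population, instead of taking the raw pooled rates.

```python
# Toy counts from above: (treated deaths, treated total, untreated deaths, untreated total)
strata = {"severe": (40, 100, 16, 20), "mild": (2, 20, 10, 100)}
total = sum(tn + un for (_, tn, _, un) in strata.values())  # 240 patients

# P(death | do(drug))    = sum_h P(death | drug, h)    * P(h)
# P(death | do(no drug)) = sum_h P(death | no drug, h) * P(h)
p_do_drug = sum((td / tn) * ((tn + un) / total) for (td, tn, _, un) in strata.values())
p_do_none = sum((ud / un) * ((tn + un) / total) for (_, tn, ud, un) in strata.values())
print(p_do_drug, p_do_none)  # 0.25 vs. 0.45: the drug helps after all
```

This is one version of what the causal-inference literature calls backdoor adjustment, and it is only as good as the assumption that no other confounder is lurking.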
What do you call “solving the problem”?
Any method will output some estimates. Some methods will output better estimates, some worse. As people have pointed out, this was an example of a real problem, and yes, real-life data is usually pretty messy. We need methods which can handle messy data, not ones that work only on spherical cows in a vacuum.