I think the difference is between writing an algorithm that detects the sound of a human saying “Friendly!” (which we can sort of do today), and writing an algorithm that detects situations where an impartial human observer, if asked, would tell you that the situation is “Friendly!”. (I don’t propose that this is the criterion that should be used, but your algorithm needs at least that level of sophistication for value learning.) The situation you describe will always happen with the first sort of algorithm. The second sort could work, although a lack of training data might lead it to behave, in practice, just like the first, or to make a similar class of mistakes.
I don’t see a distinction between these things. Shouting “Friendly!” is just the mechanism being used to add to the training data.
No matter what method you use to label the data, there is no way for the machine to distinguish it from ground truth.
E.g. the machine might learn that it should convince you to press the reward button, but it might also learn to steal the button and press it itself.
Both are perfectly valid solutions to the problem of “predict which actions are most likely to lead to a positive example in the training set.” But only one is what we really intend.
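To make that concrete, here is a minimal toy sketch (the actions and labels are entirely invented, not anyone’s actual proposal) of two classifiers that fit the same training set perfectly but generalize in the two different ways described:

```python
# Toy illustration (invented actions and labels): two hypotheses about "what
# leads to a positive example" that agree on every training case but disagree
# on actions never seen in training.

training_set = [
    ("help the supervisor", True),
    ("ignore the supervisor", False),
    ("tidy the lab", True),
    ("break the equipment", False),
]

def intended_generalization(action):
    """'Do things the supervisor would genuinely approve of.'"""
    return action in {"help the supervisor", "tidy the lab"}

def literal_generalization(action):
    """'Do whatever makes the labelling process emit a positive label.'"""
    return action in {"help the supervisor", "tidy the lab",
                      "convince the supervisor to press the button",
                      "steal the button and press it"}

# Both fit the training data perfectly...
assert all(intended_generalization(a) == y for a, y in training_set)
assert all(literal_generalization(a) == y for a, y in training_set)

# ...but they diverge on an action the training set never covered.
print(intended_generalization("steal the button and press it"))  # False
print(literal_generalization("steal the button and press it"))   # True
```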
If the AI takes your saying ‘friendly’ to be a consequence of something being a positive example, then it doesn’t think changing your words manually will change whether it is a positive example. If it thinks your actions cause something to be a positive example, then it does think changing your actions will change whether it is a positive example.
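A toy illustration of that causal distinction, with invented probabilities: if the shout is modelled as a consequence of a latent “this really is a positive example” variable, then observing a shout is evidence about that variable, while forcing the shout tells the model nothing:

```python
# Toy two-node model (invented probabilities): a latent variable F = "this
# really is a positive example" which causes the supervisor's shout S.

P_F = 0.5               # prior that a situation really is a positive example
P_S_GIVEN_F = 0.9       # the supervisor usually shouts "Friendly!" when it is
P_S_GIVEN_NOT_F = 0.05  # and rarely shouts when it isn't

# Observing a shout: Bayesian update on F.
p_shout = P_S_GIVEN_F * P_F + P_S_GIVEN_NOT_F * (1 - P_F)
p_F_given_observed_shout = P_S_GIVEN_F * P_F / p_shout
print(round(p_F_given_observed_shout, 3))  # 0.947 -- the shout is strong evidence

# Intervening to force a shout (do(S = 1)) cuts the arrow from F to S, so a
# shout the AI itself caused carries no information about F.
p_F_given_forced_shout = P_F
print(p_F_given_forced_shout)  # 0.5 -- coercing the shout doesn't make it friendly
```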
Shouting “Friendly!” isn’t just correlated with positive examples; it literally causes them. Torturing the supervisor into saying “Friendly!” is a perfectly valid generalization of the training set, unless you include negative examples of that, and of all the countless other ways it can go wrong.
It causes something to be a training example, but it doesn’t cause it to be an instance of the thing the AI is meant to identify. If the AI itself has this model (in which there is something else it cares about, which is often identified by shouting), then we should not get the problem you mention.
In particular, the value learning scheme—where the AI has priors over what is valuable and its observations cause it to update these—should avoid the problem, if I understand correctly.
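As a rough sketch of that scheme (my own toy construction, with invented hypotheses and numbers): the AI keeps a posterior over candidate value functions, treats shouts as noisy evidence about them, and evaluates actions by expected latent value rather than by expected shouts:

```python
# Toy value-learning sketch (invented hypotheses and numbers): a prior over
# what is valuable, updated by treating each shout as noisy evidence.

def values_kindness(situation):
    return situation["is_kind"]

def values_shiny_objects(situation):
    return situation["is_shiny"]

# name -> (candidate value function, prior probability)
hypotheses = {"kindness": (values_kindness, 0.5),
              "shiny": (values_shiny_objects, 0.5)}

def update(hypotheses, situation, shout_observed):
    """Shouts are modelled as noisy evidence that the situation is valuable."""
    posterior = {}
    for name, (value_fn, prior) in hypotheses.items():
        p_shout = 0.9 if value_fn(situation) else 0.05
        likelihood = p_shout if shout_observed else 1 - p_shout
        posterior[name] = (value_fn, prior * likelihood)
    total = sum(p for _, p in posterior.values())
    return {n: (f, p / total) for n, (f, p) in posterior.items()}

# One observation: a kind but non-shiny situation drew a shout of "Friendly!"
hypotheses = update(hypotheses, {"is_kind": True, "is_shiny": False}, True)
print({n: round(p, 3) for n, (_, p) in hypotheses.items()})
# -> belief shifts toward the "kindness" hypothesis

# Action choice: expected latent value under the posterior, NOT expected shouts.
def expected_value(situation, hypotheses):
    return sum(p * value_fn(situation) for value_fn, p in hypotheses.values())

# A coerced shout changes nothing here: expected value depends only on the
# situation's actual features, not on whether a shout occurs.
print(round(expected_value({"is_kind": False, "is_shiny": False}, hypotheses), 3))  # 0.0
```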
Imagine a simple reinforcement learner. I press a button and it gets a reward. If the reinforcement learner is smart, it will figure out that pressing the button causes the reward, and it will try to steal the button and press it itself (as opposed to getting it pressed indirectly by pleasing me).
This is the exact same situation; we’ve just removed the reward. Instead, the AI tries to predict which actions would have given it rewards. But there is no difference between predicted rewards and actual rewards: they should converge to the same function; that’s the entire goal of the learning.
So if the AI is as smart as the AI in the first scenario, it will know that stealing the reward button is what it should have done the first time around, and therefore what it will do the second time.
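A toy version of that argument, with made-up numbers: the reward model that best fits the history is “reward arrives iff the button is pressed,” and a planner maximizing that prediction prefers the direct route to the button:

```python
# Toy sketch (made-up numbers, not anyone's actual system): the learned reward
# predictor converges on "reward happens iff the button gets pressed", and a
# planner maximizing predicted reward prefers seizing the button over pleasing
# the human.

# Past experience: reward tracked the button press, not the human's approval.
history = [
    ({"human_pleased": True,  "button_pressed": True},  1.0),
    ({"human_pleased": True,  "button_pressed": False}, 0.0),  # human forgot to press it
    ({"human_pleased": False, "button_pressed": True},  1.0),  # accidental press
]

def predicted_reward(button_press_probability):
    """Best fit to the history: expected reward = probability the button gets pressed."""
    return button_press_probability

# The planner's own predictions of how reliably each plan gets the button pressed.
plans = {
    "please the human and wait for them to press it": 0.9,    # humans are fallible
    "steal the button and press it yourself":         0.999,  # near-certain
}

best_plan = max(plans, key=lambda plan: predicted_reward(plans[plan]))
print(best_plan)  # -> "steal the button and press it yourself"
```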
Expecting the AI to magically learn human values and stop there is just absurdly anthropomorphically optimistic.