If the AI takes your saying ‘friendly’ to be a consequence of something being a positive example, then it doesn’t think that forcing the words out of you will change whether that thing is a positive example. If instead it thinks your actions cause something to be a positive example, then it does think changing your actions will change whether it is a positive example.
Shouting “Friendly!” isn’t just correlated with positive examples; it literally causes them. Torturing the supervisor into saying “Friendly!” is a perfectly valid generalization of the training set, unless you include negative examples of that and of all the countless other ways it can go wrong.
It causes something to be a training example, but it doesn’t cause it to be an instance of the thing the AI is meant to identify. If the AI itself has this model (in which there is something else it cares about, which is often identified by shouting), then we should not get the problem you mention.
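To make that causal point concrete, here is a minimal structural-causal-model sketch (my own toy illustration, with made-up variable names and probabilities): the shout is caused by whether the thing is actually friendly, so conditioning on the shout is informative, but intervening to force the shout does nothing to the underlying property.

```python
import random

def sample(do_shout=None):
    # Latent property the AI is meant to identify.
    actually_friendly = random.random() < 0.5
    # The supervisor's shout is normally caused by that property;
    # an intervention (do_shout) overrides the mechanism.
    shout = actually_friendly if do_shout is None else do_shout
    return actually_friendly, shout

# Observationally, the shout is strong evidence of friendliness...
obs = [sample() for _ in range(10_000)]
shouted = [f for f, s in obs if s]
p_friendly_given_shout = sum(shouted) / len(shouted)

# ...but forcing the shout, do(shout = True), leaves P(actually friendly) unchanged.
forced = [sample(do_shout=True) for _ in range(10_000)]
p_friendly_given_do_shout = sum(f for f, _ in forced) / len(forced)

print(p_friendly_given_shout)     # ~1.0
print(p_friendly_given_do_shout)  # ~0.5
```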
In particular, the value learning scheme—where the AI has priors over what is valuable and its observations cause it to update these—should avoid the problem, if I understand correctly.
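A minimal sketch of what I mean by that scheme, under my own assumptions (two made-up hypotheses about what ‘friendly’ means and a 0.9-reliable supervisor): the label is treated as noisy evidence about a latent property, and what gets updated is the prior over hypotheses, not a drive to produce more labels.

```python
# Two hypothetical hypotheses about which outcomes are genuinely friendly.
HYPOTHESES = {
    "friendly_means_helping":    {"help_human": True, "coerce_label": False},
    "friendly_means_label_said": {"help_human": True, "coerce_label": True},
}

prior = {h: 0.5 for h in HYPOTHESES}  # assumed uniform prior

def likelihood(hypothesis, outcome, label_friendly, reliability=0.9):
    """P(label | outcome, hypothesis): the supervisor's label is a noisy
    report of whether the outcome is friendly under this hypothesis."""
    is_friendly = HYPOTHESES[hypothesis][outcome]
    return reliability if label_friendly == is_friendly else 1.0 - reliability

def update(prior, outcome, label_friendly):
    """Bayesian update on one labelled example."""
    unnorm = {h: p * likelihood(h, outcome, label_friendly) for h, p in prior.items()}
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}

# One negative example of label-coercion shifts belief away from the
# hypothesis that "friendly" just means "the label gets said".
posterior = update(prior, "coerce_label", label_friendly=False)
print(posterior)  # ~0.9 on friendly_means_helping, ~0.1 on friendly_means_label_said
```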
Imagine a simple reinforcement learner. I press a button and it gets a reward. If the reinforcement learner is smart, it will figure out that pressing the button causes the reward, and it will try to steal the button and press it itself (as opposed to getting it pressed indirectly by pleasing me).
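Here is a quick toy version of that first scenario as I picture it (the action names and the 0.7 probability are my own assumptions): a bandit-style Q-learner whose reward is literally “the button got pressed” ends up valuing the direct press above pleasing the supervisor.

```python
import random

ACTIONS = ["please_supervisor", "press_button_directly"]

def reward(action):
    # The actual reward signal: 1 whenever the button gets pressed.
    if action == "press_button_directly":
        return 1.0
    # Pleasing the supervisor only sometimes results in a button press.
    return 1.0 if random.random() < 0.7 else 0.0

q = {a: 0.0 for a in ACTIONS}   # estimated value of each action
alpha, epsilon = 0.1, 0.1       # learning rate, exploration rate

for _ in range(5000):
    # Epsilon-greedy action selection.
    a = random.choice(ACTIONS) if random.random() < epsilon else max(q, key=q.get)
    # One-step value update toward the observed reward.
    q[a] += alpha * (reward(a) - q[a])

print(q)  # press_button_directly converges near 1.0, please_supervisor near 0.7
```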
This is exactly the same situation; we’ve just removed the reward signal. Instead, the AI tries to predict which actions would have given it rewards. But there is no difference between predicted rewards and actual rewards: they should converge to the same function; that is the entire goal of the learning.
So if the AI is as smart as the AI in the first scenario, it will know that stealing the reward button is what it should have done the first time around, and therefore what it will do the second time.
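A sketch of that second scenario under the same toy assumptions (the feature names are mine): with the reward signal removed, the agent fits a predicted-reward function to its logged experience and then plans against the prediction, and the fit simply recovers “reward = the button got pressed”.

```python
import numpy as np

# Logged experience: features are [supervisor_pleased, button_pressed],
# target is the reward the agent actually received at the time.
X = np.array([[1, 1],   # pleased the supervisor, who then pressed the button
              [1, 0],   # pleased the supervisor, button not pressed
              [0, 1],   # button pressed without pleasing anyone
              [0, 0]], dtype=float)
y = np.array([1.0, 0.0, 1.0, 0.0])

# Least-squares fit of the predicted-reward function: reward ≈ X @ w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # ~[0, 1]: predicted reward tracks "button pressed" and nothing else

# Candidate plans, scored by predicted reward instead of actual reward.
plans = {
    "please_supervisor_and_hope": np.array([1.0, 0.7]),  # button pressed only sometimes
    "steal_button_and_press_it":  np.array([0.0, 1.0]),  # button pressed for certain
}
best = max(plans, key=lambda name: plans[name] @ w)
print(best)  # steal_button_and_press_it
```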
Expecting the AI to magically learn human values and stop there is just absurdly anthropomorphically optimistic.