I think the difference is between writing an algorithm that detects the sound of a human saying “Friendly!” (which we can sort of do today), and writing an algorithm that detects situations where an impartial human observer, if asked, would tell you that the situation is “Friendly!”. (I don’t propose that this is the criterion that should be used, but your algorithm needs at least that level of sophistication for value learning.) The situation you describe will always happen with the first sort of algorithm. The second sort could work, although a lack of training data might lead it to behave, in practice, just like the first, or to make a similar class of mistakes.
I don’t see a distinction between these things. Shouting “Friendly!” is just the mechanism being used to add to the training data.
No matter what method you use to label the data, there is no way for the machine to distinguish it from ground truth.
E.g. the machine might learn that it should convince you to press the reward button, but it might also learn to steal the button and press it itself.
Both are perfectly valid solutions to the problem of “predict which actions are most likely to lead to a positive example in the training set.” But only one is what we really intend.
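To make that concrete, here is a minimal toy sketch (the actions and labels are entirely invented, not anyone’s actual proposal) of two classifiers that fit the same training set perfectly but generalize in the two different ways described:

```python
# Toy illustration (invented actions and labels): two hypotheses about "what
# leads to a positive example" that agree on every training case but disagree
# on actions never seen in training.

training_set = [
    ("help the supervisor", True),
    ("ignore the supervisor", False),
    ("tidy the lab", True),
    ("break the equipment", False),
]

def intended_generalization(action):
    """'Do things the supervisor would genuinely approve of.'"""
    return action in {"help the supervisor", "tidy the lab"}

def literal_generalization(action):
    """'Do whatever makes the labelling process emit a positive label.'"""
    return action in {"help the supervisor", "tidy the lab",
                      "convince the supervisor to press the button",
                      "steal the button and press it"}

# Both fit the training data perfectly...
assert all(intended_generalization(a) == y for a, y in training_set)
assert all(literal_generalization(a) == y for a, y in training_set)

# ...but they diverge on an action the training set never covered.
print(intended_generalization("steal the button and press it"))  # False
print(literal_generalization("steal the button and press it"))   # True
```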
If the AI takes your saying ‘friendly’ to be a consequence of something being a positive example, then it doesn’t think changing your words manually will change whether it is a positive example. If it thinks your actions cause something to be a positive example, then it does think changing your actions will change whether it is a positive example.
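A toy illustration of that causal distinction, with invented probabilities: if the shout is modelled as a consequence of a latent “this really is a positive example” variable, then observing a shout is evidence about that variable, while forcing the shout tells the model nothing:

```python
# Toy two-node model (invented probabilities): a latent variable F = "this
# really is a positive example" which causes the supervisor's shout S.

P_F = 0.5               # prior that a situation really is a positive example
P_S_GIVEN_F = 0.9       # the supervisor usually shouts "Friendly!" when it is
P_S_GIVEN_NOT_F = 0.05  # and rarely shouts when it isn't

# Observing a shout: Bayesian update on F.
p_shout = P_S_GIVEN_F * P_F + P_S_GIVEN_NOT_F * (1 - P_F)
p_F_given_observed_shout = P_S_GIVEN_F * P_F / p_shout
print(round(p_F_given_observed_shout, 3))  # 0.947 -- the shout is strong evidence

# Intervening to force a shout (do(S = 1)) cuts the arrow from F to S, so a
# shout the AI itself caused carries no information about F.
p_F_given_forced_shout = P_F
print(p_F_given_forced_shout)  # 0.5 -- coercing the shout doesn't make it friendly
```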
Shouting “Friendly!” isn’t just correlated with positive examples; it literally causes them. Torturing the supervisor into saying “Friendly!” is a perfectly valid generalization of the training set, unless you include negative examples of that, and of all the countless other ways it can go wrong.
It causes something to be a training example, but it doesn’t cause it to be an instance of the thing the AI is meant to identify. If the AI itself has this model (in which there is something else it cares about, which is often identified by shouting), then we should not get the problem you mention.
In particular, the value learning scheme—where the AI has priors over what is valuable and its observations cause it to update these—should avoid the problem, if I understand correctly.
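As a rough sketch of that scheme (my own toy construction, with invented hypotheses and numbers): the AI keeps a posterior over candidate value functions, treats shouts as noisy evidence about them, and evaluates actions by expected latent value rather than by expected shouts:

```python
# Toy value-learning sketch (invented hypotheses and numbers): a prior over
# what is valuable, updated by treating each shout as noisy evidence.

def values_kindness(situation):
    return situation["is_kind"]

def values_shiny_objects(situation):
    return situation["is_shiny"]

# name -> (candidate value function, prior probability)
hypotheses = {"kindness": (values_kindness, 0.5),
              "shiny": (values_shiny_objects, 0.5)}

def update(hypotheses, situation, shout_observed):
    """Shouts are modelled as noisy evidence that the situation is valuable."""
    posterior = {}
    for name, (value_fn, prior) in hypotheses.items():
        p_shout = 0.9 if value_fn(situation) else 0.05
        likelihood = p_shout if shout_observed else 1 - p_shout
        posterior[name] = (value_fn, prior * likelihood)
    total = sum(p for _, p in posterior.values())
    return {n: (f, p / total) for n, (f, p) in posterior.items()}

# One observation: a kind but non-shiny situation drew a shout of "Friendly!"
hypotheses = update(hypotheses, {"is_kind": True, "is_shiny": False}, True)
print({n: round(p, 3) for n, (_, p) in hypotheses.items()})
# -> belief shifts toward the "kindness" hypothesis

# Action choice: expected latent value under the posterior, NOT expected shouts.
def expected_value(situation, hypotheses):
    return sum(p * value_fn(situation) for value_fn, p in hypotheses.values())

# A coerced shout changes nothing here: expected value depends only on the
# situation's actual features, not on whether a shout occurs.
print(round(expected_value({"is_kind": False, "is_shiny": False}, hypotheses), 3))  # 0.0
```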
Imagine a simple reinforcement learner. I press a button and it gets a reward. If the reinforcement learner is smart, it will figure out that pressing the button causes the reward, and it will try to steal the button and press it itself (as opposed to getting it pressed indirectly by pleasing me).
This is the exact same situation; we’ve just removed the reward. Instead, the AI tries to predict which actions would have given it rewards. But there is no difference between predicted rewards and actual rewards: they should converge to the same function; that’s the entire goal of the learning.
So if the AI is as smart as the AI in the first scenario, it will know that stealing the reward button is what it should have done the first time around, and therefore what it will do the second time.
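A toy version of that argument, with made-up numbers: the reward model that best fits the history is “reward arrives iff the button is pressed,” and a planner maximizing that prediction prefers the direct route to the button:

```python
# Toy sketch (made-up numbers, not anyone's actual system): the learned reward
# predictor converges on "reward happens iff the button gets pressed", and a
# planner maximizing predicted reward prefers seizing the button over pleasing
# the human.

# Past experience: reward tracked the button press, not the human's approval.
history = [
    ({"human_pleased": True,  "button_pressed": True},  1.0),
    ({"human_pleased": True,  "button_pressed": False}, 0.0),  # human forgot to press it
    ({"human_pleased": False, "button_pressed": True},  1.0),  # accidental press
]

def predicted_reward(button_press_probability):
    """Best fit to the history: expected reward = probability the button gets pressed."""
    return button_press_probability

# The planner's own predictions of how reliably each plan gets the button pressed.
plans = {
    "please the human and wait for them to press it": 0.9,    # humans are fallible
    "steal the button and press it yourself":         0.999,  # near-certain
}

best_plan = max(plans, key=lambda plan: predicted_reward(plans[plan]))
print(best_plan)  # -> "steal the button and press it yourself"
```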
Expecting the AI to magically learn human values and stop there is just absurdly anthropomorphically optimistic.