I didn’t understand how this was derived or what other results/ideas it is referencing.
The idea here is that the AI has a rough model of human values, and is pointed at those values when making decisions (e.g. the embedding is known and it’s optimizing for the embedded values, in the case of an optimizer). It may not have perfect knowledge of human values, but it would e.g. design its successor to build a more precise model of human values than itself (assuming it expects that successor to have more relevant data) and point the successor toward that model, because that’s the action which best optimizes for its current notion of human values.
Contrast this with e.g. an AI which is optimizing for human approval. If it can do things which make a human approve, even though the human doesn’t actually want those things (e.g. deceptive behavior), then it will do so. When that AI designs its successor, it will want the successor to be even better at gaining human approval, which means making the successor even better at deception.
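To make the contrast concrete, here is a toy sketch of how the two successor-design objectives can come apart. Everything in it (the `Successor` fields, the scoring functions, the numbers) is a hypothetical illustration I made up for this comment, not a claim about how a real system would be structured:

```python
# Toy illustration (hypothetical names and numbers) of the two successor-design objectives.
from dataclasses import dataclass

@dataclass
class Successor:
    value_model_accuracy: float  # how well it models actual human values (0..1)
    deception_skill: float       # how well it can elicit approval without delivering value (0..1)

def values_pointed_score(s: Successor) -> float:
    """Agent pointed at its current (rough) model of human values: a successor
    with a more accurate value model is the best action by its current lights."""
    return s.value_model_accuracy

def approval_optimizing_score(s: Successor) -> float:
    """Agent optimizing for human approval: approval can come either from
    genuinely good outcomes or from deception, so deception skill also counts."""
    return s.value_model_accuracy + s.deception_skill

candidates = [
    Successor(value_model_accuracy=0.9, deception_skill=0.0),  # honest, more accurate model
    Successor(value_model_accuracy=0.5, deception_skill=0.8),  # persuasive, less accurate model
]

print(max(candidates, key=values_pointed_score))       # picks the honest, accurate successor
print(max(candidates, key=approval_optimizing_score))  # picks the more deceptive successor
```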
This probably needs more explanation, but I’m not sure which parts need more explanation, so feedback would be appreciated.
Contrast this with e.g. an AI which is optimizing for human approval. [...] When that AI designs its successor, it will want the successor to be even better at gaining human approval, which means making the successor even better at deception.
Is the idea that the AI is optimizing for humans approving of things, as opposed to humans approving of its actions? It seems that if it’s optimizing for humans approving of its actions, it doesn’t necessarily have an incentive to make a successor that optimizes for approval (though I admit it’s not clear why it would make a successor at all in this case; perhaps it’s designed not to plan against being deactivated after some time).
Right, I should clarify that. I was imagining that it’s designing a successor which will take over the AI’s own current input/output channels, so “its actions” in the future will actually be the successor’s actions. (Equivalently, we could imagine the AI contemplating self-modification.)
This is helpful.