Contrast to e.g. an AI which is optimizing for human approval. [...] When that AI designs its successor, it will want the successor to be even better at gaining human approval, which means making the successor even better at deception.
Is the idea that the AI is optimizing for humans approving of things in general, as opposed to humans approving of its actions? It seems that if it's optimizing for humans approving of its actions, it doesn't necessarily have an incentive to make a successor that optimizes for approval (though I admit it's not clear why it would make a successor at all in this case; perhaps it's designed not to plan against being deactivated after some time).
Right, I should clarify that. I was imagining that it’s designing a successor which will take over the AI’s own current input/output channels, so “its actions” in the future will actually be the successor’s actions. (Equivalently, we could imagine the AI contemplating self-modification.)