Fully agree that this is a problem. My intuition is that the self-deception part is much easier to solve than the “how do we make AIs honest in the first place” part.
If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can’t rely on the selection mechanisms because the AI games them.