In the above comic, the AI is trained by having the human look at its behavior, judge what method the AI is using to solve the desired task and how much progress it’s making along that method, and then select the variants that make more progress.
If we see the AI start studying the buttons to decide what to do, we can simply select against that. Its capabilities come entirely from our judgement of the likely consequences of its behavior. That can lead to deception in simple cases where the AI accidentally stumbles into confusing us (e.g. by going in front of the sugar), but it doesn’t imply selection in favor of the kind of complex misalignment or deception that is so unlikely to happen by chance that you would only reach it through many steps of intentional selection.
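To make the selection scheme concrete, here is a minimal sketch under stated assumptions: a population of candidate policies, a human-supplied judgement of apparent progress, and a way to produce varied copies. The names `human_judged_progress`, `mutate`, and `select_step` are illustrative placeholders, not from the post or any existing library.

```python
import random

def human_judged_progress(policy) -> float:
    """Stand-in for the human's judgement of how much progress the
    policy's visible behavior makes on the intended task (and by
    what method). Supplied by the human overseer."""
    raise NotImplementedError

def mutate(policy):
    """Stand-in for producing a slightly varied copy of a policy."""
    raise NotImplementedError

def select_step(population, keep_fraction=0.5):
    """One round of selection: keep the candidates the human judges
    to be making the most progress, refill with their variants."""
    # Rank candidates by the human's judgement of their visible behavior.
    ranked = sorted(population, key=human_judged_progress, reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep_fraction))]
    # Refill the population with variations of the survivors.
    children = [mutate(random.choice(survivors))
                for _ in range(len(population) - len(survivors))]
    return survivors + children
```

The point the sketch illustrates: the only signal entering the loop is the human's judgement of behavior they can see, so anything the human notices and disapproves of (like studying the buttons) is selected against directly rather than rewarded.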
See also: reward is not the optimization target.