The AI has been trained not to be overtly deceptive, when what we really want is for its plans and their effects to be understandable by people.
In practice, the AI learns to avoid deception by not modeling its operators at all.
The AI then starts spontaneously inventing odd hacks that exploit the interpretability process, exploits it does not itself understand, precisely because it is not modeling the people who run that process.