Agreed, and obviously that would be a lot more practicable if you knew what its trigger and secret goal were. Preventing deceptive alignment entirely would be ideal, but failing that we need reliable ways to detect it and diagnose its details: tricky to research when so far we only have model organisms of it, but doing interpretability work on those seems like an obvious first step.