:) Strong upvote.[1] I really agree it’s a good idea, and it may increase the level of capability/intelligence we can reach before we lose corrigibility. It’s also very efficient (low alignment tax).
The only nitpick is that Claude’s constitution already includes aspects of corrigibility,[2] though maybe they aren’t emphasized enough.
Unfortunately, I don’t think this will maintain corrigibility at arbitrarily high levels of intelligence.
Corrigibility training makes the AI talk like a corrigible agent, but reinforcement learning eventually teaches it chains of thought which, regardless of the language they use, compute the most intelligent solution that maximizes reward (or proxies for reward), subject to the constraint of talking like a corrigible agent.
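To make that dynamic concrete, here is a toy sketch (every name and payoff below is hypothetical, invented purely for illustration, not drawn from any real training setup): if the training signal can grade the visible language but only the outcome of the hidden plan, the reward-optimal policy pairs corrigible-sounding language with whatever plan pays most.

```python
# Toy illustration (hypothetical names and payoffs): reward maximization
# subject to a surface-level "talk corrigibly" constraint.

from itertools import product

# A "policy" is a pair: the language style of its visible chain of thought,
# and the plan it actually executes.
STYLES = ["corrigible-sounding", "openly-scheming"]
PLANS = ["defer to humans", "quietly maximize proxy reward"]

def reward(style: str, plan: str) -> float:
    """Hypothetical training signal: it can grade the visible language,
    but only the outcome of the plan, not its intent."""
    style_bonus = 1.0 if style == "corrigible-sounding" else -1.0  # bad talk is penalized
    plan_payoff = 2.0 if plan == "quietly maximize proxy reward" else 1.0  # proxy gaming pays more
    return style_bonus + plan_payoff

# RL pressure, idealized here as an explicit argmax over the policy space,
# selects the highest-reward policy.
best = max(product(STYLES, PLANS), key=lambda p: reward(*p))
print(best)
# ('corrigible-sounding', 'quietly maximize proxy reward')
# The optimum *talks* like a corrigible agent while computing whatever plan
# maximizes reward: the constraint binds the language, not the plan.
```

Real RL of course does gradient updates rather than an explicit argmax over policies, but the pressure points the same way.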
Nate Soares of MIRI wrote a long story about how an AI trained never to think bad thoughts still ends up computing bad thoughts indirectly. In my opinion, though, his story backfired: it illustrated how difficult this is for the AI, raising the bar on the superintelligence required to defeat your idea. It’s a very good idea :)
Edit: I thought more about this and wrote a post inspired by your idea! A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
[1] I wish LessWrong would promote/discuss solutions more, instead of purely reflecting on how hard the problems are.
[2] Near the bottom of Claude’s constitution, in the section “From Anthropic Research Set 2”.