Yes. It’s not a sure-fire safeguard, and it doesn’t work against all UFAIs, but corrigibility, if done correctly, can be thought of as granting a saving throw. Note, though, that while this paper is a huge step forward, “how to do corrigibility correctly” is far from a solved problem.
(Corrigibility was a topic at the second MIRIx Boston workshop, and we have results building on this paper that we are currently writing up.)