Is a “corrigibility module” a plausible safeguard against some (significant) classes of UFAIs?
Yes. It’s not a sure-fire safeguard and it doesn’t work against all UFAIs, but if done correctly, you can think of corrigibility as granting a saving throw. But note that while this paper is a huge step forward, “how to do corrigibility correctly” is nowhere near a solved problem yet.
(Corrigibility was a topic at the second MIRIx Boston workshop, and we have results building on this paper that we are now writing up.)
Could such a module be added to an arbitrary AI that wasn’t designed for it?

No, at least not anything like the corrigibility we’re currently considering. Everything we’ve written about so far relies on being able to specify the utility function in detail; on the utility function being reflectively stable; and on the utility function being able to contain references to external objects like ‘the shutdown button’, with the corresponding problems of adapting to new ontologies as the surrounding system shifts representations (see the notion of an ‘ontological crisis’); etcetera. It’s a precaution for a Friendly AI in the process of being built; you couldn’t tack it onto super-Eurisko.
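For concreteness, here is a toy Python sketch of the kind of utility-function surgery under discussion, loosely modeled on the utility-indifference idea the paper analyzes. Everything in it (the `Outcome` type, `u_normal`, `u_shutdown`, the tiny five-outcome search space) is a made-up stand-in for illustration, not the paper’s formalism. Note how the combined utility has to contain an explicit reference to the external shutdown button, which is exactly where the ontology-shift problem bites:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    button_pressed: bool  # the external "shutdown button" the utility must refer to
    paperclips: int       # stand-in for whatever the normal utility rewards
    shut_down: bool       # whether the agent actually halted

def u_normal(o: Outcome) -> float:
    """Utility intended for normal operation (toy: count paperclips)."""
    return float(o.paperclips)

def u_shutdown(o: Outcome) -> float:
    """Utility that rewards halting once the button is pressed."""
    return 1.0 if o.shut_down else 0.0

def combined_utility(o: Outcome, correction: float) -> float:
    """Score unpressed outcomes by u_normal, and pressed outcomes by
    u_shutdown plus a correction chosen so the agent's best attainable
    score is the same either way -- removing its incentive to cause or
    prevent the button press."""
    if not o.button_pressed:
        return u_normal(o)
    return u_shutdown(o) + correction

# Indifference condition (toy version): equalize the best attainable
# score with and without the button press, over a tiny outcome space.
best_unpressed = max(u_normal(Outcome(False, n, False)) for n in range(5))
best_pressed = max(u_shutdown(Outcome(True, n, True)) for n in range(5))
correction = best_unpressed - best_pressed

print(combined_utility(Outcome(False, 4, False), correction))  # 4.0
print(combined_utility(Outcome(True, 0, True), correction))    # 4.0
```

Even in this toy version, the correction term exists only to cancel the agent’s incentive to manipulate the button; it does nothing to make the utility function reflectively stable or robust to representation changes, which is where the harder problems live.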