I’ve historically strongly preferred the type of corrigibility that comes from pointing to the goal and letting the agent be corrigible for instrumental reasons. I think this is largely because it seems very elegant, and because when it works many good properties seem to pop out ‘for free’. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly and possibly even improve its corrigibility, as long as the pointer really is correct.
The ‘type of corrigibility’ you are referring to there isn’t corrigibility at all; rather, it’s alignment. Indeed, the term corrigibility was coined to contrast with this, motivated by how fragile this approach is to getting the pointer right.
I’m still very pessimistic about indifference corrigibility though, in that it still seems extremely fragile/low-measure-in-agent-space.
I’m not sure it’s the same thing as alignment… it seems there are at least 3 concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st:
1. “classic notion of alignment”: The AI has the correct goal (represented internally, e.g. as a reward function)
2. “CIRL notion of alignment”: AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner’s mind)
The ‘type of corrigibility’ you are referring to there isn’t corrigibility at all; rather, it’s alignment. Indeed, the term corrigibility was coined to contrast with this, motivated by how fragile this approach is to getting the pointer right.
I tend to agree. I’m hoping that thinking about myopia and related issues could help me understand more natural notions of corrigibility.
I’m not sure it’s the same thing as alignment… it seems there are at least 3 concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st:
1. “classic notion of alignment”: The AI has the correct goal (represented internally, e.g. as a reward function)
2. “CIRL notion of alignment”: AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner’s mind)
3. “corrigibility”: something else