1) If we have an AGI that is corrigible, it will not randomly drift into being non-corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish out’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
Good point on distinguishing these two arguments. It sounds like we agree on 1. I also thought the OP was talking about 1.
For 2, I don’t think we can make a dimensionality argument (as in the OP), because we’re talking about edits that the AI chooses for itself. You can’t apply dimensionality arguments to choices made by intelligent agents (e.g. presumably you wouldn’t argue that every glass in my house must be broken, because the vast majority of ways of interacting with glasses breaks them). Or put another way, the structural similarity across the failure modes is just “the AI wouldn’t choose to do <bad thing #N>”, and in every case that’s because it’s intelligent and understands what it’s doing.
Now the question of “how right do we need to get the initial definition of corrigibility” is much less obvious. If you told me we got the definition wrong in a million different ways, I would indeed be worried and probably wouldn’t expect it to self-correct (depending on the meaning of “different”). But like… really? We get it wrong a million different ways? I don’t see why we’d expect that.