The claims “If X drifts away from corrigibility along dimension {N}, it will get pulled back” are clearly structurally similar, and the broad basin of corrigibility argument is meant to be a single argument that covers all of them.
To be clear, I think there are two very different arguments here:
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish up’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
The first is “corrigibility is a stable attractor”, and I think there’s structural similarity between arguments that different deviations will be corrected. The second is the “broad basin of corrigibility”, where for any barely acceptable initial definition of “do what we want”, it will figure out that “help us find the right definition of corrigibility and implement it” will score highly on its initial metric of “do what we want.”
Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
I find it less plausible that missing pieces in our definition of “do what we want” will be fixed in structurally similar ways, and I think there are probably a lot of traps where a plausible sketch definition doesn’t automatically repair itself. One can lean here on “barely acceptable”, but I don’t find that very satisfying. [In particular, it would be nice if we had a definition of corrigibility where we could look at it and say “yep, that’s the real deal or grows up to be the real deal,” tho that likely requires knowing what the “real deal” is; the “broad basin” argument seems to me to be meaningful only in that it claims “something that grows into the real deal is easy to find instead of hard to find,” and when I reword that claim as “there aren’t any dead ends near the real deal” it seems less plausible.]
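To make the “nearby attractors” picture concrete, here’s a toy sketch (nothing to do with actual AGI training; the landscapes, step size, and starting points are made up purely for illustration): a single-well landscape pulls every nearby starting point back to the same minimum, while a double-well landscape has a “dead end” right next door, so an almost-right starting point can converge to the wrong place.

```python
# Toy illustration only: gradient dynamics on two hand-picked 1-D landscapes.
# "Broad basin, no nearby attractors": V1(x) = (x - 1)^2, single minimum at x = 1.
# "Dead end nearby":                   V2(x) = (x^2 - 1)^2, minima at x = 1 and x = -1.

def descend(grad, x, lr=0.05, steps=500):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

grad_single_well = lambda x: 2 * (x - 1)          # V1'(x)
grad_double_well = lambda x: 4 * x * (x**2 - 1)   # V2'(x)

for x0 in (0.2, -0.2):
    print(f"start {x0:+.1f}: single well -> {descend(grad_single_well, x0):+.2f}, "
          f"double well -> {descend(grad_double_well, x0):+.2f}")
# Single well: both starts end at +1.00 (everything nearby gets pulled back).
# Double well: the -0.2 start ends at -1.00, the nearby "dead end".
```

In the single-well picture, any “barely acceptable” starting point flows to the same place; the worry in the bracketed aside is that the real landscape looks more like the second one.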
1. Why aren’t the dimensions symmetric?
In physical space, things are generally symmetric under swapping the dimensions around; in algorithm-space, that isn’t true. (Like, permute the weights in a layer and you get different functional behavior.) Thus while it’s sort of wacky in a physical environment to say “oh yeah, df/dx, df/dy, and df/dz are all independently sampled from a distribution,” it’s less wacky to say that of neural network weights (or the appropriate medium-sized analog).
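As a concrete (if toy) numpy sketch of that asymmetry: a “physical” quantity like squared distance is unchanged if you swap two coordinates, whereas swapping two weights inside a layer of a tiny network generally changes the function it computes. (Caveat: some weight permutations, like consistently relabeling whole hidden units across layers, are symmetries; arbitrary ones are not.)

```python
import numpy as np

rng = np.random.default_rng(0)

# A "physical" function: squared distance is unchanged if you swap coordinates.
def radius_squared(p):
    return float(np.sum(p ** 2))

p = np.array([1.0, 2.0, 3.0])
print(radius_squared(p), radius_squared(p[[1, 0, 2]]))   # same value either way

# A tiny two-layer net: swapping two weights inside a layer is not a symmetry.
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 1))

def net(x, W1, W2):
    return (np.tanh(x @ W1) @ W2).item()

W1_swapped = W1.copy()
W1_swapped[0, 0], W1_swapped[0, 1] = W1[0, 1], W1[0, 0]  # swap two first-layer weights

x = np.array([1.0, 2.0, 3.0])
print(net(x, W1, W2), net(x, W1_swapped, W2))            # generally different outputs
```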
1) If we have an AGI that is corrigible, it will not randomly drift to be not corrigible, because it will proactively notice and correct potential errors or loss of corrigibility.
2) If we have an AGI that is partly corrigible, it will help us ‘finish up’ the definition of corrigibility / edit itself to be more corrigible, because we want it to be more corrigible and it’s trying to do what we want.
Good point on distinguishing these two arguments. It sounds like we agree on 1. I also thought the OP was talking about 1.
For 2, I don’t think we can make a dimensionality argument (as in the OP), because we’re talking about edits that the AI chooses for itself. You can’t apply dimensionality arguments to choices made by intelligent agents (e.g. presumably you wouldn’t argue that every glass in my house must be broken because the vast majority of ways of interacting with glasses breaks them). Or put another way, the structural similarity is just “the AI wouldn’t choose to do <bad thing #N>”, in all cases because it’s intelligent and understands what it’s doing.
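One way to make the glass example concrete (a toy sketch with made-up numbers, not a claim about real agents): even if the vast majority of possible actions break the glass, a counting argument over actions tells you almost nothing about what an agent that evaluates its actions will actually pick.

```python
import random

# Hypothetical setup: 1000 ways to interact with a glass, only one leaves it intact.
actions = list(range(1000))

def glass_survives(action):
    return action == 0   # only the "careful" action preserves the glass

# Counting / dimensionality-style argument: a random policy almost always breaks it.
trials = [glass_survives(random.choice(actions)) for _ in range(10_000)]
print(sum(trials) / len(trials))    # roughly 0.001

# An agent that evaluates its options and picks the best one never breaks it.
best_action = max(actions, key=glass_survives)
print(glass_survives(best_action))  # True
```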
Now the question of “how right do we need to get the initial definition of corrigibility” is much less obvious. If you told me we got the definition wrong in a million different ways, I would indeed be worried and probably wouldn’t expect it to self-correct (depending on the meaning of “different”). But like… really? We get it wrong a million different ways? I don’t see why we’d expect that.
Like, it’s not the argument that corrigibility is a stable attractor; it’s an argument that corrigibility is a stable attractor with no nearby attractors. (At least in the dimensions that it’s ‘broad’ in.)
Just want to echo Rohin in saying that this is a very helpful distinction, thanks!
I was actually making the stronger argument that it’s not a stable attractor at all—at least not until someone solves the problem of how to maintain stable goals / motivations under learning / reflecting / ontological crises.
(The “someone” who solves the problem could be the AI, but it seems to be a hard problem even for human-level intelligence; cf. my comment here.)