I now think that corrigibility is a single, intuitive property
My intuitive notion of corrigibility can be straightforwardly leveraged to build a formal, mathematical measure.
This formal measure is still lacking, and almost certainly doesn’t actually capture what I mean by “corrigibility.”
I don’t know, maybe it’s partially or mostly my fault for reading too much optimism into these passages… But I think it would have managed my expectations better to say something like “my notion of corrigibility heavily depends on a subnotion of ‘don’t manipulate the principals’ values’ which is still far from being well-understood or formalizable.”
Switching topics a little, I think I’m personally pretty confused about what human values are and therefore what it means to not manipulate someone’s values. Since you’re suggesting relying less on formalization and more on “examples of corrigibility collected in a carefully-selected dataset”, how would you go about collecting such examples?
(One concern is that you could easily end up with a dataset that embodies a hodgepodge of different ideas of what “don’t manipulate” means and then it’s up to luck whether the AI generalizes from that in a correct or reasonable way.)
I think you’re right to point to this issue. It’s a loose end. I’m not at all sure it’s a dealbreaker for corrigibility.
The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to figure out what balance their principal wants, and do that.
I was going to say that my instruction-following variant of corrigibility might be better for working out that balance, but it actually seems pretty straightforward in Max’s pure corrigibility version, now that I’ve written out the above.
I don’t think “a corrigible agent wants to do what the principal wants, at all times” matches my proposal. The issue we’re talking about here shows up in the math, above: the agent needs to consider the principal’s values in the future, but those values are themselves dependent on the agent’s action. If the principal gave a previous command to optimize for having a certain set of values in the future, sure, the corrigible agent can follow that command, but proactively optimizing for the principal having a certain set of values doesn’t seem necessarily corrigible, even if it matches the agent’s sense of the present principal’s values.
For instance, suppose Monday-Max wants Tuesday-Max to want to want to exercise, but Monday-Max also feels a bunch of caution around self-modification, such that he doesn’t trust having the AI rearrange his neurons to make this change. It seems to me that the corrigible thing for the AI to do is ignore Monday-Max’s preferences and simply follow his instructions (and take other actions related to being correctable), even if Monday-Max’s mistrust is unjustified. It seems plausible to me that your “do what the principal wants” agent might manipulate Tuesday-Max into wanting to want to exercise, since that’s what Monday-Max wants at the base level.
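To make the action-dependence point concrete, here is a toy sketch; the names and numbers are invented for illustration and are not the formalism from the post. It shows how an objective that scores actions by the values the principal would hold afterwards can favor the value-changing action even when the principal’s present values disfavor it.

```python
# Toy sketch of the action-dependence problem. Names and numbers are made up
# for illustration; this is not the formalism from the post.

from typing import Callable, Dict, List

Action = str
Values = Callable[[str], float]   # maps an outcome to how much the principal values it

OUTCOMES: Dict[Action, str] = {
    "do_nothing": "Max unchanged",
    "rewire": "Max who wants to want to exercise",
}

def current_values(outcome: str) -> float:
    # Monday-Max's present values: he distrusts having his neurons rearranged.
    return {"Max unchanged": 0.6, "Max who wants to want to exercise": 0.4}[outcome]

def values_after(action: Action) -> Values:
    # The principal's *future* values depend on the agent's action.
    if action == "rewire":
        return lambda outcome: 1.0   # the rewired principal endorses the result
    return current_values

def best_by_current_values(actions: List[Action]) -> Action:
    # "Do what the principal wants now."
    return max(actions, key=lambda a: current_values(OUTCOMES[a]))

def best_by_future_values(actions: List[Action]) -> Action:
    # The worrying objective: score each action by the values the principal
    # would hold afterwards, values which the action itself can shift.
    return max(actions, key=lambda a: values_after(a)(OUTCOMES[a]))

print(best_by_current_values(["do_nothing", "rewire"]))  # -> do_nothing
print(best_by_future_values(["do_nothing", "rewire"]))   # -> rewire
```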
This sounds like we’re saying the same thing? My “at all times” is implied and maybe confusing. I’m saying it doesn’t guess what the principal will want in the future; it just does what they want now. That probably includes not manipulating their future values. Their commands are particularly strong evidence of what they want, but at core, it’s just having the agent’s goals be a pointer to the principal’s goals.
This formulation has occurred to me since talking to you, and it seems like a compact and intuitive statement of why your notion of corrigibility seems coherent and simple.
Edit: to address your example, I both want and don’t-want to be manipulated into wanting to exercise next week. It’s confusing for me, so it should be confusing for my corrigible AGI. It should ask me to clarify when and how I want to be manipulated, rather than taking a guess when I don’t know the answer. I probably haven’t thought about it deeply, and overall it’s pretty important for accurately doing what I want, so a good corrigible helper will suggest I spend some time clarifying, for it and for myself. This is a point where things could go wrong if it takes bad guesses instead of getting clarification, but there are lots of points like that.
It sounds like you’re proposing a system that is vulnerable to the Fully Updated Deference problem, and where if it has a flaw in how it models your preferences, it can very plausibly go against your words. I don’t think that’s corrigible.
In the specific example, just because one is confused about what they want doesn’t mean the AI will be (or should be). It seems like you think the AGI should not “take a guess” at the preferences of the principal, but it should listen to what the principal says. Where is the qualitative line between the two? In your system, if I write in my diary that I want the AI to do something, should it not listen to that? Certainly the diary entry is strong evidence about what I want, which it seems is how you’re thinking about commands. Suppose the AGI can read my innermost desires using nanomachines, and set up the world according to those desires. Is it corrigible? Notably, if that machine is confident that it knows better than me (which is plausible), it won’t stop if I tell it to shut down, because shutting down is a bad way to produce MaxUtility. (See the point in my document, above, where I discuss Queen Alice being totally disempowered by sufficiently good “servants”.)
My model of Seth says “It’s fine if the AGI does what I want and not what I say, as long as it’s correct about what I want.” But regardless of whether that’s true, I think it’s important not to confuse that system with one that’s corrigible.
I don’t understand your proposal if it doesn’t boil down to “do what the principal wants” or “do what the principal says” (correctly interpreted and/or carefully verified). This makes me worried that what you have in mind is not that simple and coherent, and therefore not relatively easy to define or train into an AGI.
This understanding (or maybe misunderstanding) of your corrigibility as “figure out what I want” is why I currently prefer the instruction-following route to corrigibility. I don’t want the AGI to guess at what I want any more than necessary. This has downsides, too; back to those at the end.
I do think what your model of me says, but I think it’s only narrowly true and probably not very useful that “it’s fine if the AGI does what I want and not what I say, as long as it’s correct about what I want.”
I think this is true for exactly the right definition of “what I want”, but conveying that to an AGI is nontrivial, and re-introduces the difficulty of value learning. That’s mixed with the danger that it’s incorrect about what I want. That is, it could be right about what I want in one sense, but not the sense I wanted to convey to it (e.g., it decides I’d really rather be put into an experience machine where I’m the celebrated hero of the world, rather than make the real world good for everyone, as I’d hoped).
Maybe I’ve misunderstood your thesis, but I did read it pretty carefully, so there might be something to learn from how I’ve misunderstood. All of your examples that I remember correspond to “doing what the principal wants” by a pretty common interpretation of that phrase.
Instruction-following puts a lot of the difficulty back on the human(s) in charge. This is potentially very bad, but I think humans will probably choose this route anyway. You’ve pointed out some ways that following instructions could be a danger (although I think your genie examples aren’t the most relevant for a modest takeoff speed). But I think unless something changes, humans are likely to prefer keeping the power and the responsibility over trying to put more of the project into the AGI’s alignment. That’s another reason I’m spending my time thinking through this route to corrigibility instead of the one you propose.
Although again, I might be missing something about your scheme.
I just went back and reread 2. Corrigibility Intuition (after writing the above, which I won’t try to revise). Everything there still looks like a flavor of “do what I want”. My model of Max says “corrigibility is more like ‘do your best to be correctable’”. It seems like correctable means correctable toward what the principal wants. So I wonder if your formulation reduces to “do what I want, with an emphasis on following instructions and being aware that you might be wrong about what I want”. That sounds very much like the Do What I Mean And Check formulation of my instruction-following approach to corrigibility.
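For concreteness, here is a minimal sketch of a check-before-acting loop in the “Do What I Mean And Check” spirit; the helper functions are hypothetical stand-ins rather than anyone’s actual proposal.

```python
# Minimal "Do What I Mean And Check"-style loop. All helpers are hypothetical
# stand-ins used only to illustrate the interpret-then-verify pattern.

def interpret(instruction: str) -> str:
    """Stand-in for the agent's best guess at the intended meaning."""
    return f"plan for: {instruction}"

def needs_check(instruction: str, plan: str) -> bool:
    """Stand-in for 'this guess is ambiguous or high-impact enough to verify first'."""
    return True   # err toward checking by default

def confirm_with_principal(plan: str) -> bool:
    """Stand-in for asking the human to approve the interpreted plan."""
    return input(f"Proceed with '{plan}'? [y/n] ").strip().lower() == "y"

def follow_instruction(instruction: str) -> None:
    plan = interpret(instruction)
    if needs_check(instruction, plan) and not confirm_with_principal(plan):
        return  # wait for a restated instruction instead of guessing
    print(f"executing: {plan}")
```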
Thanks for engaging. I think this is productive.
Just to pop back to the top level briefly, I’m focusing on instruction-following because I think it will work well and be the more likely pick for a nascent language-model agent AGI, from below human level to somewhat above it. If RL is heavily involved in creating that agent, that might shift the balance and make your form of corrigibility more attractive (and still vastly more attractive than attempting value alignment in any broader way). I think working through both of these is worthwhile, because those are the two most likely forms of first AGI, and the two most likely actual alignment targets.
I definitely haven’t wrapped my head around all of the pitfalls with either method, but I continue to think that this type of alignment target makes good outcomes much more likely, at least as far as we’ve gotten with the analysis so far.
I think this type of alignment target is also important because the strongest and most-used arguments for alignment difficulty don’t apply to it. So when we’re debating slowing down AGI, proponents of going forward will be talking about these approaches. If the alignment community hasn’t thought through them carefully, there will be no valid counterargument. I’d still prefer that we slow AGI, even though I think these methods give us a decent chance of succeeding at technical alignment. So that’s one more reason I find this topic worthwhile.
This has gotten pretty discursive, so don’t worry about responding to all of it.
Thanks. Picking out those excerpts is very helpful.
I’ve jotted down my current (confused) thoughts about human values.
But yeah, I basically think one needs to start with a hodgepodge of examples that are selected for being conservative and uncontroversial. I’d collect them by first identifying a robust set of very in-distribution tasks and contexts and trying to exhaustively identify what manipulation would look like in that small domain, then aggressively training on passivity outside of that known distribution. The early pseudo-agent will almost certainly be misgeneralizing in a bunch of ways, but if it’s set up cautiously we can expect it to err on the side of caution, and that caution can be gradually peeled back, whitelist-style, as the experimentation phase proceeds and attempts to nail down true corrigibility.
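As a rough illustration of that whitelist-style rollout (the domain names and task structure below are invented for the example), the intended runtime behavior might look something like this:

```python
# Rough illustration of "act inside a whitelist, stay passive outside it".
# Domain names and the Task structure are hypothetical.

from dataclasses import dataclass

@dataclass
class Task:
    domain: str        # e.g. "calendar_scheduling"
    description: str

# Start with a small set of well-understood, in-distribution domains where
# "what counts as manipulation" has been exhaustively labeled in the dataset.
WHITELISTED_DOMAINS = {"calendar_scheduling", "file_cleanup"}

def choose_behavior(task: Task) -> str:
    if task.domain in WHITELISTED_DOMAINS:
        return "act, drawing on the labeled non-manipulation examples for this domain"
    # Outside the known distribution, the trained-in default is passivity:
    # flag the task for the principal rather than generalizing on its own.
    return "defer to the principal / do nothing"

# As experimentation proceeds, domains get added to the whitelist one at a time,
# each with its own exhaustive pass at labeling what manipulation looks like there.
```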