For instance, after reading some of Eliezer and Nate’s recent writing, I now think it’s probably a better idea to use corrigibility as an AI alignment target (assuming corrigibility turns out to be a coherent thing at all), as opposed to directly targeting human values
I’d be interested in you elaborating on this update. Specifically, how do you expect these approaches to differ in terms of mechanical interventions in the AGI’s structure/training story, and what advantages one has over the other?
I’ve actually updated toward a target which could arguably be called either corrigibility or human values, and is different from previous framings I’ve seen of either. But I didn’t want to get into that in the post; probably I’ll have another post on it at some point. Some brief points to gesture in the general direction:
The basic argument for corrigibility over values is some combination of “wider basin of attraction” and “we’ll get to iterate”.
I had previously mostly ignored corrigibility as a target because (1) MIRI had various results showing that some simple desiderata of corrigibility are incompatible with EU maximization, and (2) I expected some of the benefits of corrigibility to come naturally from targeting values: insofar as humans want their AGI to be corrigible, a value-aligned AI will be corrigible, and therefore presumably there’s a basin of attraction in which we get values “right enough” along the corrigibility-relevant axes.
The most interesting update for me was when Eliezer reframed point (2) from the opposite direction (in an in-person discussion): insofar as humans value corrigibility (or particular aspects of corrigibility), the same challenges of expressing corrigibility mathematically also need to be solved in order to target values. In other words, the key mathematical challenges of corrigibility are themselves robust subproblems. (Note: this is my takeaway from that discussion, not necessarily the point Eliezer intended.)
That argument convinced me to think some more about the old corrigibility results. And… they’re not very impressive? Like, people tried a few hacks, and the hacks didn’t work. Fully Updated Deference is the only real barrier they found, and I don’t think it’s that much of a barrier—it mostly just shows that something is wrong with the assumed type-signature of the child agent, which isn’t exactly shocking.
The main reason to expect that corrigibility is a natural concept at all, in Eliezer’s words:
The “hard problem of corrigibility” is interesting because of the possibility that it has a relatively simple core or central principle—rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI.
The angle that suggests to me is: an AI will build a child AI such that the two of them together implement a single optimization algorithm, optimizing the parent’s objective. Whatever protocol the two use to act as a single joint distributed optimizer should probably have some corrigibility properties.
Corresponding alignment target: human + AI should together act as a joint distributed optimizer of the human’s values. My current best guess is that the child agent needs to not be an EU maximizer wrt the interface between itself and the human, or wrt the human’s Markov blanket assumed by that interface, in order for the joint optimization condition to robustly hold. (That’s not to say it won’t be an EU maximizer over some other variables, of course, but when it comes to agent type signatures maybe 30% of the action is in “EU maximizer over which variables, and with respect to what measuring stick?”.)
(There’s still various other strange things about the type signature of values/corrigibility which need to be worked out, but that’s the main point which pertains to MIRI’s hardness results.)
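To make the “joint distributed optimizer” condition slightly more concrete, here is one very rough way it could be written down (purely a sketch; the notation below is illustrative, not something stated in the post):

```latex
% Sketch only; illustrative notation, not from the post.
%   W               : world state
%   V(W)            : the human's values, as a utility over world states
%   X               : interface variables (the human's Markov blanket as seen through the interface)
%   \pi_H, \pi_{AI} : the human's and the AI's policies

% Joint-optimization condition: the human-AI pair, taken together, behaves
% (in the idealized limit) like a single optimizer of V:
\[
  (\pi_H, \pi_{AI}) \;\approx\; \operatorname*{arg\,max}_{(\pi_H',\, \pi_{AI}')} \; \mathbb{E}\!\left[ V(W) \mid \pi_H', \pi_{AI}' \right]
\]

% ...while the child agent is *not* an EU maximizer over the interface variables alone:
% there need be no fixed utility u over X with \pi_{AI} \in \arg\max_{\pi} \mathbb{E}[\, u(X) \mid \pi \,].
```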
In terms of concrete strategies: the general path of “figure out interpretability, then see what alignment targets are easily expressible in an AI’s internal language” is still probably the step where most of the action happens. If the corrigibility thing works out, it will be a mathematically simple thing (whereas human values aren’t), so it should be much easier to check that we’ve identified the right internal concepts in the AI; that’s the main mechanical difference (at least that I’ve noticed so far).
are you familiar with mutual information maximizing interfaces?

the key trick is simply to maximize the sum of mutualinfo(user command, agent action) + mutualinfo(agent action, environment change) as the only reward function for a small model.
in my current view, [spoilers for my judgement so you can form your own if you like,]
it’s one of the most promising corrigibility papers I’ve seen. It still has some problems from the perspective of corrigibility, but it seems to have the interesting effect of making a reinforcement learner desperately want to be instructed. there are probably still very severe catastrophic failures hiding in slivers of plan space that would make a mimi superplanner dangerous (eg, at approximately human levels of optimization, it might try to force you to spend time with it and give it instructions?), and I don’t think it works to train a model any other way than from scratch, and it doesn’t solve multi-agent, and it doesn’t solve interpretability (though it might combine really well with runtime interpretability visualizations, doubly so if the visualizations are mechanistically exact), so over-optimization would still break it, and it only works when a user is actively controlling it—but it seems much less prone to failure than vanilla rlhf, because it produces an agent that, if I understand correctly, stops moving when the user stops moving (to first approximation).
it seems to satisfy the mathematical simplicity you’re asking for. I’m likely going to attempt follow-up research—I want to figure out if there’s a way to do something similar with much bigger models, ie 1m to 6b parameter range. and I want to see how weird the behavior is when you try to steer a pretrained model with it. a friend is advising me casually on the project.
It seems to me that there are some key desiderata for corrigibility that it doesn’t satisfy, in particular it isn’t terribly desperate to explain itself to you, it just wants your commands to seem to be at sensible times that cause you to have control of its action timings in order to control the environment. but it makes your feedback much denser through the training process and produces a model that, if I understand correctly, gets bored without instruction. also it seems like with some tweaking it might also be a good model of what makes human relationships satisfying, which is a key tell I look for.

very curious to hear your thoughts.
That would be a very poetic way to die: an AI desperately pulling every bit of info it can out of a human, and dumping that info into the environment. They do say that humanity’s death becomes more gruesome and dystopian the closer the proposal is to working, and that does sound decidedly gruesome and dystopian.
Anyway, more concretely, the problem which jumps out to me is that maximizing mutualinfo(user command, agent action) + mutualinfo(agent action, environment change) just means that all the info from the command routes through the action and into the environment in some way; the semantics or intent of the command need not have anything at all to do with the resulting environmental change. Like, maybe there’s a prompt on my screen which says “would you like the lightswitch on (1) or off (0)?”, and I enter “1”, and then the AI responds by placing a coin heads-side-up (and would have placed it tails-side-up had I entered “0”). There’s no requirement that my one bit actually be encoded into the environment in a way which has anything to do with the lightswitch.
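A tiny numeric illustration of the point (my own toy setup, not something from the thread or the paper; the agent’s action and the resulting environment change are collapsed into one variable for simplicity): both the “lightswitch” encoding and the “coin” encoding achieve the maximal 1 bit of mutual information, so an MI-only reward of this shape cannot tell them apart.

```python
# Toy check: a deterministic 1-bit command copied into the environment two different ways.
# Both mappings give the same (maximal) mutual information, so the MI terms alone
# cannot distinguish the "intended" lightswitch mapping from the coin mapping.
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

commands = [0, 1, 0, 1, 1, 0, 1, 0]  # "would you like the lightswitch on (1) or off (0)?"

env_lightswitch = ["light_on" if c else "light_off" for c in commands]  # intended mapping
env_coin        = ["heads" if c else "tails" for c in commands]         # coin mapping

print(mutual_information(list(zip(commands, env_lightswitch))))  # 1.0 bit
print(mutual_information(list(zip(commands, env_coin))))         # also 1.0 bit
```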
When I sent him the link to this comment, he replied:
my imprecision may have misled you :)

ah i think you forgot the first term in the MIMI objective, I(s_t, x_t), which makes the mapping intuitive by maximizing information flow from the environment into the user. what you proposed was similar to optimizing only the second term, I(x_t, s_{t+1} | s_t), which would indeed suffer from the problems that john mentions in his reply
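For reference, the full objective as described across the two comments above seems to be the following (a reconstruction of the thread’s description, not a quotation from the paper):

```latex
% Reconstructed from the comments above; not quoted from the MIMI paper.
%   s_t     : environment state at time t
%   x_t     : user command at time t
%   s_{t+1} : next environment state, after the agent acts
\[
  R \;=\; \underbrace{I(s_t;\, x_t)}_{\text{environment informs the user's commands}}
  \;+\; \underbrace{I(x_t;\, s_{t+1} \mid s_t)}_{\text{commands drive the resulting state change}}
\]
```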
I’ve actually updated toward a target which could arguably be called either corrigibility or human values, and is different from previous framings I’ve seen of either
Right, in my view the line between them is blurry as well. One distinction that makes sense to me is:
“Value learning” means making the AGI care about human values directly — as in, putting them in place of its utility function.
“Corrigibility” means making the AGI care about human values through the intermediary of humans — making it terminally care about “what agents with the designation ‘human’ care about”. (Or maybe “what my creator cares about”, “what a highly-specific input channel tells me to do”, etc.)
Put like this, yup, “corrigibility” seems like a better target to aim for. In particular, it’s compact and, as you point out, convergent — “an agent” and “this agent’s goals” are likely much easier to express than the whole suite of human values, and would be easier for us to locate in the AGI’s ontology (e. g., we should be able to side-step a lot of philosophical headaches by outsourcing them to the AI itself).
In that sense, “strawberry alignment”, in MIRI’s parlance, is indeed easier than “eudaimonia alignment”.
However...
insofar as humans value corrigibility (or particular aspects of corrigibility), the same challenges of expressing corrigibility mathematically also need to be solved in order to target values
I’ve been pretty confused about why MIRI thought that corrigibility is easier, and this is exactly why. Imparting corrigibility still requires making the AI care about some very specific conceptions humans have about how their commands should be executed, e. g. “don’t optimize for this too hard” and other DWIMs. But if we can do that, if it understands us well enough to figure out all the subtle implications in our orders, then why can’t we just tell it to “build a utopia” and expect that to go well? It seems like a strawberry-aligned AI should interpret that order faithfully as well… Which is a view that Nate/Eliezer seem not to outright rule out; they talk about “short reflection” sometimes.
But other times “corrigibility” seems to mean a grab-bag of tricks for essentially upper-bounding the damage an AI can inflict, presumably followed by a pivotal act (with a large amount of collateral damage) via this system and then long reflection. On that model, there’s also a meaningful presumed difficulty difference between strawberry alignment and eudaimonia alignment: the former doesn’t require us to be very good at retargeting the AGI at all. But it also seems obviously doomed to me, and not necessarily easier (inasmuch as this flavor of “corrigibility” doesn’t seem like a natural concept at all).
My reading is that you endorse the former type of corrigibility as well, not the latter?
My reading is that you endorse the former type of corrigibility as well, not the latter?
Yes. I also had the “grab-bag of tricks” impression from MIRI’s previous work on the topic, since it was mostly just trying various hacks, and that was also part of why I mostly ignored it. The notion that there’s a True Name to be found here, we’re not just trying hacks, is a big part of why I now have hope for corrigibility.
“Corrigibility” means making the AGI care about human values through the intermediary of humans — making it terminally care about “what agents with the designation ‘human’ care about”. (Or maybe “what my creator cares about”,
Interesting—that is actually what I’ve considered to be proper ‘value learning’: correctly locating and pointing to humans and their values in the agent’s learned world model, in a way that naturally survives/updates correctly with world model ontology updates. The agent then has a natural intrinsic motivation to further improve its own understanding of human values (and thus its utility function) simply through the normal curiosity drive for value-of-information improvements to its world model.
I wasn’t making a definitive statement on what I think people mean when they say “corrigibility”, to be clear. The point I was making is that any implementation of corrigibility that I think is worth trying for necessarily has the “faithfulness” component — i. e., the AI would have to interpret its values/tasks/orders the way they were intended by the order-giver, instead of some other way. Which, in turn, likely requires somehow making it locate humans in its world-model (though likely implemented as “locate the model of [whoever is giving me the order]” in the AI’s utility function, not necessarily referring to [humans] specifically).
And building off that definition, if “value learning” is supposed to mean something different, then I’d define it as pointing at human values not through humans, but directly. I. e., making the AI value the same things that humans value not because it knows that it’s what humans value, but just because.
Again, I don’t necessarily think that’s what most people mean by these terms most of the time — I would natively view both approaches to this as something like “value learning” as well. But this discussion started from John (1) differentiating between them, and (2) viewing both approaches as viable. This is just how I’d carve it under these two constraints.