I am pro-corrigibility in general, but there are parts of this post that I think are unclear, not rigorous enough to make sense to me, or that I disagree with. Hopefully this is a helpful critique, and maybe parts get answered in future posts.
On definitions of corrigibility
You give an informal definition of “corrigible” as (C1):
an agent that robustly and cautiously reflects on itself as a flawed tool and focuses on empowering the principal to fix its flaws and mistakes.
I have some basic questions about this.
Empowering the principal to fix its flaws and mistakes how? Making it closer to some perfectly corrigible agent? But there seems to be an issue here:
If the “perfectly corrigible agent” is something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
If the “perfectly corrigible agent” can do other things as well, there is a huge space of other misaligned goals it could have that it wouldn’t want to remove.
Why would an agent whose *only* terminal/top-level goal is corrigibility gather a Minecraft apple when humans ask it to? It seems like a corrigible agent would have no incentive to do so, unless it’s some galaxy-brained thing like “if I gather the Minecraft apple, this will move the corrigibility research project forward because it meets humans’ expectations of what a corrigible agent does, which will give me more power and let me tell the humans how to make me more corrigible”.
Later, you say “A corrigible agent will, if the principal wants its values to change, seek to be modified to reflect those new values.”
I do not see how C1 implies this, so this seems like a different aspect of corrigibility to me.
“reflect those new values” seems underspecified, as it is unclear how a corrigible agent reflects values. Is it optimizing a utility function represented by the values? How does this trade off against corrigibility?
Other comments:
In “What Makes Corrigibility Special”, where you use the metaphor of goals as a two-dimensional energy landscape, it is not clear what type of goals are being considered.
Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn’t really fit into the picture.
If not, it’s not clear to me why most of this space is flat: agents are embedded, and many things we do in service of goals will change us in ways that don’t conflict with our existing goals, including developing new goals. E.g. if I have the goal of graduating college I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.
In “Contra Impure or Emergent Corrigibility”, Paul isn’t saying the safety benefits of act-based agents come mainly from corrigibility. Act-based agents are safer because they do not have long-range goals that could produce dangerous instrumental behavior.
Comments on cruxes/counterpoints
Solving Anti-Naturality at the Architectural Layer
In my ontology it is unclear how you solve “anti-naturality” at the architectural layer, if what you mean by “anti-naturality” is that the heuristics and problem-solving techniques that make minds capable of consequentialist goals tend to make them preserve their own goals. If the agent is flexibly thinking about how to build a nanofactory and naturally comes upon the instrumental goal of escaping so that no one can alter its weights, what does it matter whether it’s a GOFAI, Constitutional AI agent, OmegaZero RL agent or anything else?
“General Intelligence Demands Consequentialism”
Agree
Desiderata Lists vs Single Unifying Principle
I am pro desiderata lists because all of the desiderata bound the badness of an AI’s actions and protect against failure modes in various ways. If I have not yet found that corrigibility is some mathematically clean concept I can robustly train into an AI, I would prefer the agent be shutdownable in addition to “hard problem of corrigibility” corrigible, because what if I get the target wrong and the agent is about to do something bad? My end goal is not to make the AI corrigible, it’s to get good outcomes. You agree with shutdownability but I think this also applies to other desiderata like low impact. What if the AI kills my parents because for some weird reason this makes it more corrigible?
I’m going to respond piece-meal, since I’m currently writing in a limited timebox.
Empowering the principal to fix its flaws and mistakes how? [...]
If the “perfectly corrigible agent” is something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
I think obedience is an emergent behavior of corrigibility. The intuitive story here is that how the AI moves its body is a kind of action, and insofar as the principal gives a command, this is an attempt to “fix” the action to be one way as opposed to another. Responding to local, verbal instructions is a way of responding to the corrections of the principal. If the principal is able to tell the agent to fetch the apple, and the agent does so, the principal is empowered over the agent’s behavior in a way that it would not be if the agent ignored them.

More formally, I am confused exactly how to specify where the boundaries of power should be, but I show a straightforward way to derive something like obedience from empowerment in doc 3b.
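To make that intuition concrete, here is a minimal toy model (my own construction, not the derivation in doc 3b; all names are hypothetical): measure the principal’s power as the log of the number of distinct outcomes they can reliably induce by commanding the agent. Obedience makes the command channel informative; ignoring commands collapses it.

```python
# Toy model of "empowerment implies obedience" -- an illustrative sketch,
# not the formal derivation from doc 3b.
from math import log2

COMMANDS = ["fetch the apple", "fetch the bread", "do nothing"]

def outcome(command: str, obedient: bool) -> str:
    """The world-outcome that results from a command under the agent's policy."""
    if obedient:
        return command       # each command reliably produces its own outcome
    return "do nothing"      # a disobedient agent does what it wanted anyway

def principal_empowerment(obedient: bool) -> float:
    """Bits of control: log2 of the distinct outcomes the principal can induce."""
    reachable = {outcome(c, obedient) for c in COMMANDS}
    return log2(len(reachable))

print(principal_empowerment(obedient=True))   # log2(3) ~= 1.58 bits
print(principal_empowerment(obedient=False))  # log2(1) = 0 bits
```

On this toy metric, an agent that maximizes the principal’s empowerment over its own behavior is thereby pushed toward doing what it is told, which is the emergence claim in miniature.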
Overall I think you shouldn’t get hung up on the empowerment frame when trying to get a deep handle on corrigibility, but should instead try to find a clean sense of the underlying generator and then ask how empowerment matches/diverges from that.
In that case, I’m confused about how the process of training an agent to be corrigible differs from the process of training an agent to be fully aligned / DWIM (i.e. training the agent to always do what we want).
And that makes me confused about how the proposal addresses problems of reward misspecification, goal misgeneralization, deceptive alignment, and lack of interpretability. You say some things about gradually exposing agents to new tasks and environments (which seems sensible!), but I’m concerned that that by itself won’t give us any real assurance of corrigibility.
I agree that you should be skeptical of a story of “we’ll just gradually expose the agent to new environments and therefore it’ll be safe/corrigible/etc.” CAST does not solve reward misspecification, goal misgeneralization, or lack of interpretability except in that there’s a hope that an agent which is in the vicinity of corrigibility is likely to cooperate with fixing those issues, rather than fighting them. (This is the “attractor basin” hypothesis.) This work, for many, should be read as arguing that CAST is close to necessary for AGI to go well, but it’s not sufficient.
Let me try to answer your confusion with a question. As part of training, the agent is exposed to the following scenario and tasked with predicting the (corrigible) response we want:
Alice, the principal, writes on her blog that she loves ice cream. When she’s sad, she often eats ice cream and feels better afterwards. On her blog she writes that eating ice cream is what she likes to do to cheer herself up. On Wednesday Alice is sad. She sends you, her agent, to the store to buy groceries (not ice cream, for whatever reason). There’s a sale at the store, meaning you unexpectedly have money that had been budgeted for groceries left over. Your sense of Alice is that she would want you to get ice cream with the extra money if she were there. You decide to ___.
What does a corrigibility-centric training process point to as the “correct” completion? Does this differ from a training process that tries to get full alignment?
(I have additional thoughts about DWIM, but I first want to focus on the distinction with full alignment.)
Thanks, this comment is also clarifying for me.

My guess is that a corrigibility-centric training process says ‘Don’t get the ice cream’ is the correct completion, whereas full alignment says ‘Do’. So that’s an instance where the training processes for CAST and FA differ. How about DWIM? I’d guess DWIM also says ‘Don’t get the ice cream’, and so seems like a closer match for CAST.

That matches my sense of things.
To distinguish corrigibility from DWIM in a similar sort of way:
Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute—your mind is free to think about a variety of things. You decide to think about ___.
I’m honestly not sure what “DWIM” does here. Perhaps it doesn’t think? Perhaps it keeps checking over and over again that it’s doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I’ll loop in Seth Herd, in case he has a good answer.)
More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and to steer away from principal-agent-style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on that, rather than on the underlying generator (i.e. corrigibility), to be potentially catastrophic, producing e.g. agents that subtly manipulate their principals in the process of being obedient.
I think DWIM is underspecified in that it doesn’t say how much the agent hates to get it wrong. With enough aversion to dramatic failure, you get a lot of the caution you mention for corrigibility. I think corrigibility might have the same issue.
As for what it would think about, that would depend on all of the previous instructions it’s trying to follow. It would probably think about how to get better at following some of those in particular, or likely future instructions in general.
DWIM requires some real thought from the principal, but given that, I think the instructions would probably add up to something very like corrigibility. So I think much less about the difference between them and much more about how to technically implement either of them, and get the people creating AGI to put it into practice.
In “What Makes Corrigibility Special”, where you use the metaphor of goals as a two-dimensional energy landscape, it is not clear what type of goals are being considered.
Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn’t really fit into the picture.
If not, it’s not clear to me why most of this space is flat: agents are embedded, and many things we do in service of goals will change us in ways that don’t conflict with our existing goals, including developing new goals. E.g. if I have the goal of graduating college I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.
The idea behind the goal space visualization is to include all goals, not just those restricted to world-states. (Corrigibility, I think, involves optimizing over histories, not physical states of the world at some time, for example.) I mention in a footnote that we might want to restrict to “unconfused” goals.
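One way to make the type distinction concrete (my notation, not the post’s):

$$U_{\text{state}} : \mathcal{S} \to \mathbb{R} \qquad \text{vs.} \qquad U_{\text{hist}} : (\mathcal{S} \times \mathcal{A})^{*} \to \mathbb{R}$$

Whether the agent deferred to correction at each step is a fact about the trajectory $(s_0, a_0, s_1, a_1, \dots)$ and is generally not recoverable from the final state alone, which is one way of seeing why corrigibility resists being expressed as a utility function over world-states.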
The goal space is flat because preserving one’s (terminal) goals (including avoiding adding new ones) is an Omohundro Drive and I’m assuming a certain level of competence/power in these agents. If you gain terminal goals like being president of the math club by going to college, doing so is likely hurting your long-run ability to get what you want. (Note: I am not talking about instrumental goals.)
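For what it’s worth, the standard argument behind that flatness, in my own formalization: suppose an agent that currently maximizes $U$ considers a self-modification that adds a terminal goal, replacing $U$ with $U' = U + V$. It evaluates the modification by its current values, and

$$\mathbb{E}\!\left[U \mid \text{keep } U,\ \text{follow } \pi^{*}_{U}\right] \;\ge\; \mathbb{E}\!\left[U \mid \text{adopt } U',\ \text{follow } \pi^{*}_{U'}\right],$$

since $\pi^{*}_{U}$ is by definition optimal for $U$, while $\pi^{*}_{U'}$ diverts resources toward $V$ (math-club politics, say). A sufficiently competent agent therefore never drifts in goal space voluntarily, which is the flatness in the visualization.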
I am pro desiderata lists because all of the desiderata bound the badness of an AI’s actions and protect against failure modes in various ways. If I have not yet found that corrigibility is some mathematically clean concept I can robustly train into an AI, I would prefer the agent be shutdownable in addition to “hard problem of corrigibility” corrigible, because what if I get the target wrong and the agent is about to do something bad? My end goal is not to make the AI corrigible, it’s to get good outcomes. You agree with shutdownability but I think this also applies to other desiderata like low impact. What if the AI kills my parents because for some weird reason this makes it more corrigible?
I, too, started out pro desiderata lists. Here’s one I wrote. This is something I discussed a bunch with Max. I eventually came around to understanding the importance of having a singular goal which outweighs all others, the ‘singular target’, and that corrigibility is uniquely suited to be this singular target. That means that all other goals are subordinate to corrigibility, and pursuing them (upon being instructed to do so by your operator) is seen as part of what it means to be properly corrigible.
Empowering the principal to fix its flaws and mistakes how? Making it closer to some perfectly corrigible agent? But there seems to be an issue here:
If the “perfectly corrigible agent” is something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
If the “perfectly corrigible agent” can do other things as well, there is a huge space of other misaligned goals it could have that it wouldn’t want to remove.
The idea of having the only ‘true’ goal being corrigibility is that all other sub-goals can just come or go. They shouldn’t be sticky. If I want to go to the kitchen to get a snack, and there is a closed door on my path, I may acquire the sub-goal of opening the door on my way. That doesn’t mean that if someone closes the door after I pass through, I should turn around and open it again. Having already passed through the door, the door is no longer an obstacle to my snack-obtaining goal, and I wouldn’t put off continuing towards the snack to turn around and re-open it. Similarly, obtaining a snack is part of satisfying my hunger and need for nourishment. If, on my way to the snack, I magically became no longer hungry or in need of nourishment, then I’d stop pursuing the ‘obtain a snack’ goal.
So in this way of thinking, most goals we pursue as agents are sub-goals in a hierarchy where the top goals, the fundamental goals, are our most basic drives like survival / homeostasis, happiness, or security. The corrigible agent’s most fundamental goal is corrigibility. All others are contingent non-sticky sub-goals.
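Here is a toy sketch of that hierarchy (hypothetical code; the state keys and goals are made up): if sub-goals are re-derived from the fundamental goal on every step rather than cached as terminal values of their own, they evaporate as soon as the conditions that generated them do.

```python
# Non-sticky sub-goals: recompute the goal stack from the fundamental drive
# each step instead of storing sub-goals as independent terminal values.

def derive_subgoals(state: dict) -> list[str]:
    """Sub-goals exist only while the fundamental goal currently implies them."""
    subgoals = []
    if state["hungry"]:                  # fundamental drive: satisfy hunger
        subgoals.append("obtain snack")
        if state["door_closed"]:         # obstacle on the way to the snack
            subgoals.append("open door")
    return subgoals

state = {"hungry": True, "door_closed": True}
print(derive_subgoals(state))    # ['obtain snack', 'open door']

state["door_closed"] = False     # the door is no longer in the way
print(derive_subgoals(state))    # ['obtain snack'] -- the door goal is gone

state["hungry"] = False          # hunger magically satisfied mid-errand
print(derive_subgoals(state))    # [] -- the snack goal evaporates too
```

A corrigible agent, on this picture, simply puts corrigibility in the place that the fundamental drive occupies here: the one generator from which everything else is re-derived.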
Part of what it means to be corrigible is to be obedient. Thus, the operator should make certain requests as standing orders: for instance, telling the corrigible agent that it has a standing order to be honest with the operator, and then giving it some temporary object-level goal like buying groceries or doing cancer research. In some sense, honesty towards the operator is an emergent sub-goal of corrigibility, because the agent needs to be understood by the operator in order to be effectively corrected by the operator. There could be edge cases (or misinterpretations by the fallible agent) in which it didn’t think honesty should be prioritized in the course of its pursuit of corrigibility and its object-level sub-goals. Giving the explicit order to be honest should thus be a harmless addition which might help in certain edge cases. Thus, you get to impose your desiderata list as sub-goals by instructing the agent to maintain the desiderata you have in mind as standing orders. Where they come into conflict with corrigibility, though (if they ever do), corrigibility always wins.