I’ve read through your sequence, and I’m leaving my comment here because it feels like the most relevant page. Thanks for taking the time to write this up; it seems like a novel take on corrigibility. I also found the Existing Writing section to be very helpful.
Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?
This discussion question captures my biggest critique, which is that while this post does a good job of capturing the intuition for why the described properties are helpful, it doesn’t convey the intuition that they are parts of the same overarching concept. If we take the CAST approach seriously and say that corrigibility as anything other than the single target is dangerous, then it becomes really important to put tight bounds on corrigibility so that no additional desiderata are added as secondary targets.
If I’m right that the sub-properties of corrigibility are mutually dependent, attempting to achieve corrigibility by addressing sub-properties in isolation is comparable to trying to create an animal by separately crafting each organ and then piecing them together. If any given half-animal keeps being obviously dead, this doesn’t imply anything about whether a full-animal will be likewise obviously dead.
This analogy, from Part 3a, captures a stark difference in our approaches. I would try to build an MVP, starting with only the most core desiderata (e.g., shutting down when the shutdown button is pushed), noticing the holes they leave uncovered, and adding additional desiderata to patch them. This seems to me to be a much more practical approach than top-down design, while also being less likely to result in excess targets.
Separately, related to what concepts an alien civilization might have, I still find the idea of corrigibility as a modifier more natural. I find it easy to imagine a paperclip/human values/diamond maximizer that is nonetheless corrigible. In fact, I find the idea of corrigibility as a modifier to arbitrary goals so natural that I’m worried that what you’re describing as CAST is equivalent to some primary goal with the corrigibility modifier. I’m looking suspiciously at the obedience desideratum in particular. That said, while I share your concern about the naive implementation of systems with goals of both corrigibility and something else, I think there may be ways to combine the dual goals that alleviate the danger.
I’m glad you benefitted from reading it. I honestly wasn’t sure anyone would actually read the Existing Writing doc. 😅
I agree that if one trains on a holistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you’re lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and it’s the central point of the scientific refinement step I talk about in the Strategy doc.)
I think you also get this if you’re trying to get a purely shutdownable AI through prosaic methods. In one sense you have an advantage there: a simpler target, and thus one that’s easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you try to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI to also be non-manipulative, it’ll learn to imitate non-manipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it’s not risky to do so.
How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don’t understand what it means for corrigibility to be a modifier.
When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says “if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down”. Alternatively, it could be an optimization constraint that takes a utility function from “Maximize X” to something like “Maximize X s.t. you always shut down when the shutdown button is pushed”. While I’m not advocating for those specific changes, I hope they illustrate what I’m trying to point at as a modifier that is distinct from the optimization goal.
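To make those two variants a bit more concrete, here’s a rough sketch in symbols (my own notation, not the exact construction from the paper). Write $U$ for the base utility function, $B$ for the event that the shutdown button is pressed, and $S$ for the event that the agent shuts down. The indifference-style modifier replaces $U$ with

$$U'(a) = \begin{cases} \mathbb{E}[U \mid \neg S] & \text{if } B \text{ occurs and the agent shuts down,} \\ U(a) & \text{otherwise,} \end{cases}$$

while the constraint-style modifier leaves $U$ alone and changes the problem to

$$\max_a \ U(a) \quad \text{subject to} \quad B \Rightarrow S.$$

In both cases the base goal (paperclips, diamonds, human values) is left intact; the modifier only constrains behavior around the button.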
Right. That’s helpful. Thank you.

“Corrigibility as modifier,” if I understand right, says:
There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they’ll behave differently. In other words, corrigibility is more like a property/constraint than a goal/holistic-way-of-being. Saying “my agent is corrigible” doesn’t fully specify what the agent cares about; it only describes how the agent will behave in a subset of situations.
Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it’s a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal?
(Because opportunities for me to write are kinda scarce right now, I’ll pre-empt three possible responses.)
“Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible” → It seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot, and I don’t think it makes sense to say that corrigibility is modifying the agent so much as overwriting it.
“Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they’re satisfied, the Paperclip-Bot and Diamond-Bot natures will differentiate them.” → I think true corrigibility can never be fully satisfied. Any degrees of freedom (time, money, energy, compute, etc.) which could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and can’t put those resources to work being marginally more corrigible.
“Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn’t mean it’s incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc.” → I think you’re describing an agent which is semi-corrigible, and which could be more corrigible if it spent its time doing things like researching ways it could be flawed instead of making diamonds. I agree that there are many possible semi-corrigible agents which are still reasonably safe, but there’s an open question of how such agents should trade off between corrigibility and making paperclips (or whatever).
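To make that open question concrete (this is just my framing of it): if $C(a)$ measures how corrigible a course of action is and $P(a)$ measures paperclips, is corrigible Paperclip-Bot maximizing some weighted mixture

$$U(a) = \lambda\, C(a) + (1-\lambda)\, P(a),$$

or is it lexically prioritizing $C$ and only spending leftover slack on $P$? A mixture agent will sometimes trade away a little corrigibility for enough paperclips, so where that trade-off gets set is doing real safety work.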
Thanks for pre-empting the responses, that makes it easy to reply!
I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and treat “writes useful self-critiques” as a separate property we would like the AI to have. I’m writing a post about this that should be up shortly; I’ll notify you when it’s out.
Excellent.

To adopt your language, then, I’ll restate my CAST thesis: “There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets.”
I recognize that you don’t see the examples in this doc as unified by an underlying throughline, but I guess I’m now curious about what sort of behaviors fall under the umbrella of “corrigibility” for you vs being more like “writes useful self-critiques”. Perhaps your upcoming post will clarify. :)
Hi Max,

I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.