I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc.
I agree that if one trains on a wholistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and is the central point of the scientific refinement step I talk about in the Strategy doc.)
I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have an advantage there: a simpler target, and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate non-manipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so.
How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.
When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says "if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down". Alternatively, it could be an optimization constraint that takes a utility function from "Maximize X" to something like "Maximize X s.t. you always shut down when the shutdown button is pushed". While I'm not advocating for those specific changes, I hope they illustrate what I'm trying to point at as a modifier that is distinct from the optimization goal.
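To make that a bit more concrete, here's a rough sketch in symbols. This is my own loose paraphrase with made-up notation ($U$ for the base utility, $\pi$ for a policy, $B$ for "the button gets pressed", $X$ for the base goal), not the exact construction from the 2015 paper. The first version adds an indifference term:

\[
U'(\text{outcome}) =
\begin{cases}
U(\text{outcome}) & \text{if } B \text{ never happens,} \\
\mathbb{E}\!\left[\, U \mid \neg B,\ \text{the agent's best policy} \,\right] & \text{if } B \text{ happens and the agent shuts down.}
\end{cases}
\]

(Sketch only; the real proposal is more careful about how the expectation is defined.) The second version leaves the base goal alone and adds a hard constraint:

\[
\max_{\pi}\ \mathbb{E}[X \mid \pi] \quad \text{s.t.} \quad \Pr(\text{shutdown} \mid B, \pi) = 1.
\]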
Right. That's helpful. Thank you.
"Corrigibility as modifier," if I understand right, says:
There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation, and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they'll behave differently. In other words, corrigibility is more like a property/constraint than a goal/wholistic-way-of-being. Saying "my agent is corrigible" doesn't fully specify what the agent cares about; it only describes how the agent will behave in a subset of situations.
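To make sure I'm picturing the same thing, here's a toy sketch of that view in code. The action names and the particular wrapper are made up purely for illustration, and it isn't meant to settle the question below about what a corrigible agent does with free resources.

```python
# Toy illustration only: made-up actions and a made-up wrapper, not a real proposal.
# "Corrigibility as a modifier": the same wrapper applied to two different base
# goals. The wrapper only constrains behavior when shutdown is requested; off the
# constraint, the base goals shine through.

def corrigible(base_utility):
    """Wrap a base utility so that shutting down dominates whenever it's requested."""
    def wrapped(action, shutdown_requested):
        if shutdown_requested:
            return 1.0 if action == "shut_down" else float("-inf")
        return base_utility(action)
    return wrapped

paperclip_bot = corrigible(lambda a: 1.0 if a == "make_paperclip" else 0.0)
diamond_bot = corrigible(lambda a: 1.0 if a == "make_diamond" else 0.0)

ACTIONS = ["make_paperclip", "make_diamond", "shut_down"]

def best_action(utility, shutdown_requested):
    return max(ACTIONS, key=lambda a: utility(a, shutdown_requested))

print(best_action(paperclip_bot, True), best_action(diamond_bot, True))    # shut_down shut_down
print(best_action(paperclip_bot, False), best_action(diamond_bot, False))  # make_paperclip make_diamond
```

In this toy, the two bots agree exactly when the shutdown request binds and revert to their different base goals otherwise.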
Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it's a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal?
(Because opportunities for me to write are kinda scarce right now, I'll pre-empt three possible responses.)
"Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible." If that's the case, it seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot, and I don't think it makes sense to say that corrigibility is modifying the agent so much as overwriting it.
"Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they're satisfied, the Paperclip-Bot and Diamond-Bot natures will differentiate them." I think the constraints of true corrigibility can never be fully satisfied. Any degrees of freedom (time, money, energy, compute, etc.) that could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and can't put those resources to work being marginally more corrigible.
"Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn't mean it's incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc." I think you're describing an agent which is semi-corrigible, and which could be more corrigible if it spent its time doing things like researching ways it could be flawed instead of making diamonds. I agree that there are many possible semi-corrigible agents which are still reasonably safe, but there's an open question with such agents of how to trade off between corrigibility and making paperclips (or whatever).
Thanks for pre-empting the responses; that makes it easy to reply!
I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and treat "writes useful self-critiques" as a separate property we would like the AI to have. I'm writing a post about this that should be up shortly; I'll notify you when it's out.
Excellent.
To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets."
I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sorts of behaviors fall under the umbrella of "corrigibility" for you vs. being more like "writes useful self-critiques". Perhaps your upcoming post will clarify. :)
Hi Max,
I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.