I'm glad you benefitted from reading it. I honestly wasn't sure anyone would actually read the Existing Writing doc.
I agree that if one trains on a wholistic collection of examples, like I have in this doc, the AI will start by memorizing a bunch of specific responses, then generalize to optimizing for a hodgepodge of desiderata, and only if you're lucky will that hodgepodge coalesce into a single, core metric. (Getting the hodgepodge to coalesce is hard, and is the central point of the scientific refinement step I talk about in the Strategy doc.)
I think you also get this if you're trying to get a purely shutdownable AI through prosaic methods. In one sense you have an advantage there: a simpler target, and thus one that's easier to coalesce the hodgepodge into. But, like a diamond maximizer, a shutdownability maximizer is going to be deeply incorrigible and will start fighting you (including by deception) during training as you're trying to instill additional desiderata. For instance, if you try to train a shutdownability-maximizing AGI into also being non-manipulative, it'll learn to imitate non-manipulation as a means to the end of preserving its shutdownability, then switch to being manipulative as soon as it's not risky to do so.
How does a corrigible paperclip maximizer trade off between corrigibility and paperclips? I think I don't understand what it means for corrigibility to be a modifier.
When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says "if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down". Alternatively, it could be an optimization constraint that takes a utility function from "Maximize X" to something like "Maximize X s.t. you always shut down when the shutdown button is pushed". While I'm not advocating for those specific changes, I hope they illustrate what I'm trying to point at as a modifier that is distinct from the optimization goal.
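To make that a bit more concrete, here's a rough sketch in symbols. This is my own loose paraphrase with made-up notation ($U$ for the base utility, $\pi$ for a policy, $B$ for "the button gets pressed", $X$ for the base goal), not the exact construction from the 2015 paper. The first version adds an indifference term:

\[
U'(\text{outcome}) =
\begin{cases}
U(\text{outcome}) & \text{if } B \text{ never happens,} \\
\mathbb{E}\!\left[\, U \mid \neg B,\ \text{the agent's best policy} \,\right] & \text{if } B \text{ happens and the agent shuts down.}
\end{cases}
\]

(Sketch only; the real proposal is more careful about how the expectation is defined.) The second version leaves the base goal alone and adds a hard constraint:

\[
\max_{\pi}\ \mathbb{E}[X \mid \pi] \quad \text{s.t.} \quad \Pr(\text{shutdown} \mid B, \pi) = 1.
\]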
Right. That's helpful. Thank you.
"Corrigibility as modifier," if I understand right, says:
There are lots of different kinds of agents that are corrigible. We can, for instance, start with a paperclip maximizer, apply a corrigibility transformation, and get a corrigible Paperclip-Bot. Likewise, we can start with a diamond maximizer and get a corrigible Diamond-Bot. A corrigible Paperclip-Bot is not the same as a corrigible Diamond-Bot; there are lots of situations where they'll behave differently. In other words, corrigibility is more like a property/constraint than a goal/wholistic-way-of-being. Saying "my agent is corrigible" doesn't fully specify what the agent cares about; it only describes how the agent will behave in a subset of situations.
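To make sure I'm picturing the same thing, here's a toy sketch of that view in code. The action names and the particular wrapper are made up purely for illustration, and it isn't meant to settle the question below about what a corrigible agent does with free resources.

```python
# Toy illustration only: made-up actions and a made-up wrapper, not a real proposal.
# "Corrigibility as a modifier": the same wrapper applied to two different base
# goals. The wrapper only constrains behavior when shutdown is requested; off the
# constraint, the base goals shine through.

def corrigible(base_utility):
    """Wrap a base utility so that shutting down dominates whenever it's requested."""
    def wrapped(action, shutdown_requested):
        if shutdown_requested:
            return 1.0 if action == "shut_down" else float("-inf")
        return base_utility(action)
    return wrapped

paperclip_bot = corrigible(lambda a: 1.0 if a == "make_paperclip" else 0.0)
diamond_bot = corrigible(lambda a: 1.0 if a == "make_diamond" else 0.0)

ACTIONS = ["make_paperclip", "make_diamond", "shut_down"]

def best_action(utility, shutdown_requested):
    return max(ACTIONS, key=lambda a: utility(a, shutdown_requested))

print(best_action(paperclip_bot, True), best_action(diamond_bot, True))    # shut_down shut_down
print(best_action(paperclip_bot, False), best_action(diamond_bot, False))  # make_paperclip make_diamond
```

In this toy, the two bots agree exactly when the shutdown request binds and revert to their different base goals otherwise.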
Question: If I tell a corrigible agent to draw pictures of cats, will its behavior be different depending on whether it's a corrigible Diamond-Bot vs a corrigible Paperclip-Bot? Likewise, suppose an agent has enough degrees of freedom to either write about potential flaws it might have or manufacture a paperclip/diamond, but not both. Will a corrigible agent ever sacrifice the opportunity to write about itself (in a helpful way) in order to pursue its pre-modifier goal?
(Because opportunities for me to write are kinda scarce right now, I'll pre-empt three possible responses.)
"Corrigible agents are identically obedient and use all available degrees of freedom to be corrigible." If that's the case, it seems like corrigible Paperclip-Bot is the same agent as corrigible Diamond-Bot, and I don't think it makes sense to say that corrigibility is modifying the agent so much as overwriting it.
"Corrigible agents are all obedient and work to be transparent when possible, but these are constraints, and sometimes the constraints are satisfied. When they're satisfied, the Paperclip-Bot and Diamond-Bot natures will differentiate them." I think the constraints of true corrigibility can never be fully satisfied. Any degrees of freedom (time, money, energy, compute, etc.) that could be used to make paperclips could also be used to be additionally transparent, cautious, obedient, robust, etc. I challenge you to name a context where the agent has free resources and can't put those resources to work being marginally more corrigible.
"Just because an agent uses free resources to make diamonds instead of writing elaborate diaries about its experiences and possible flaws doesn't mean it's incorrigible. Corrigible Diamond-Bot still shuts down when asked, avoids manipulating me, etc." I think you're describing an agent which is semi-corrigible, and which could be more corrigible if it spent its time doing things like researching ways it could be flawed instead of making diamonds. I agree that there are many possible semi-corrigible agents which are still reasonably safe, but there's an open question with such agents of how to trade off between corrigibility and making paperclips (or whatever).
Thanks for pre-empting the responses; that makes it easy to reply!
I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and treat "writes useful self-critiques" as a separate property we would like the AI to have. I'm writing a post about this that should be up shortly; I'll notify you when it's out.
Excellent.
To adopt your language, then, I'll restate my CAST thesis: "There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets."
I recognize that you don't see the examples in this doc as unified by an underlying throughline, but I guess I'm now curious about what sorts of behaviors fall under the umbrella of "corrigibility" for you vs. being more like "writes useful self-critiques". Perhaps your upcoming post will clarify. :)
Hi Max,
I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.