Rubi J. Hudson comments on 2. Corrigibility Intuition

Rubi J. Hudson 24 Jun 2024 20:34 UTC
LW: 2 AF: 2
Thanks for pre-empting the responses; that makes it easy to reply!
I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and treat “writes useful self critiques” as a separate property we would like the AI to have. I’m writing a post about this that should be up shortly; I’ll notify you when it’s out.
- Max Harms 26 Jun 2024 16:40 UTC
  LW: 1 AF: 1
  Excellent.
  To adopt your language, then, I’ll restate my CAST thesis: “There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets.”
  I recognize that you don’t see the examples in this doc as unified by an underlying throughline, but I guess I’m now curious about what sort of behaviors fall under the umbrella of “corrigibility” for you vs being more like “writes useful self critiques”. Perhaps your upcoming post will clarify. :)
  - Rubi J. Hudson 16 Jul 2024 22:47 UTC
    LW: 5 AF: 4
    Hi Max,
    I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.