To adopt your language, then, I’ll restate my CAST thesis: “There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets.”
I recognize that you don’t see the examples in this doc as unified by an underlying throughline, but I guess I’m now curious about what sort of behaviors fall under the umbrella of “corrigibility” for you vs being more like “writes useful self critiques”. Perhaps your upcoming post will clarify. :)
I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.
Excellent.
To adopt your language, then, I’ll restate my CAST thesis: “There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets.”
I recognize that you don’t see the examples in this doc as unified by an underlying throughline, but I guess I’m now curious about what sort of behaviors fall under the umbrella of “corrigibility” for you vs being more like “writes useful self critiques”. Perhaps your upcoming post will clarify. :)
Hi Max,
I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.