Max Harms comments on 2. Corrigibility Intuition

Max Harms 26 Jun 2024 16:40 UTC
Excellent.
To adopt your language, then, I’ll restate my CAST thesis: “There is a relatively simple goal that an agent might have which emergently generates nice properties like corrigibility and obedience, and I see training an agent to have this goal (and no others) as being both possible and significantly safer than other possible targets.”
I recognize that you don’t see the examples in this doc as unified by an underlying throughline, but I’m now curious which behaviors fall under the umbrella of “corrigibility” for you, as opposed to being more like “writes useful self critiques”. Perhaps your upcoming post will clarify. :)
- Rubi J. Hudson 16 Jul 2024 22:47 UTC
  Hi Max,
  I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.