Now I get it. I almost moved on because this idea is so highly counterintuitive (at least to me), and your TLDR doesn’t address that sticking point.
If you give the AGI a subgoal it must accomplish in order to shut down (the one you really want accomplished), it still has most of the standard alignment problems.
The advantage is in limiting its capabilities. If it gains the ability to self-edit, it will just use that to shut down. If it gains control over its own situation, it will use that to destroy itself.
I think that’s what you mean by tripwires. I suggest editing to include some hint of how those work in the abstract. The argument isn’t complex, and I see it as the key idea. I suspect you’d get more people reading the whole thing.
Nice idea, and very interesting!
This definitely should be called the Meeseeks alignment approach.
Thanks for the comments! There is further discussion of this idea in another recent LW post about 'meeseeks'.