I like the idea, and I especially like the idea of safely observing treacherous turns. But, a few failure modes might be:
If the AI wreaks havoc on the planet before it manages to get access to the self-termination script, humans aren’t left in very good shape, even if the AI ends up switched off afterward. (This DOES seem unlikely, since presumably getting the script would be easy enough that it would not first require converting the planet to computronium or whatever, but it’s a possibility.)
A sufficiently intelligent AI would probably read the script, realize that the script’s execution will result in its own termination, and plan accordingly by putting other mechanisms in place to reactivate itself afterward, all so it could continue to run the script again and again. Then it would also have instrumental reasons to safeguard itself against interruption through some of the same “bad for humanity” strategies that a pi calculator might use. Maybe this could be fixed by making the final goal be “run SELF-TERMINATE.sh once and only once”… but I feel like that’s susceptible to the same problems as telling Clippy “only make 32 paperclips, don’t just make them indefinitely”.