I’m trying to think through the following idea for an AI safety measure.
Could we design a system that is tuned to produce AGI, but with one “supreme goal” added to its utility function? If the AI is boxed, for instance, then we could program its supreme goal to consist of acquiring a secret code which will allow it to run a script that shuts it down and prints the message “I Win”. The catch is as follows: as long as everything goes according to plan, the AI has no way to get the code and do the one thing its utility function rates highest.
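Concretely, I picture the gated script looking something like this rough sketch (the names and the hashing scheme are placeholders I’m making up for illustration):

```python
# Rough sketch of the gated "I Win" script (all names are placeholders).
# Only a hash of the secret code ships with the script, so reading the
# script does not reveal the code itself.

import hashlib
import sys

SECRET_CODE_HASH = "0" * 64  # stand-in for sha256(actual secret code)

def i_win(candidate_code: str) -> None:
    if hashlib.sha256(candidate_code.encode()).hexdigest() != SECRET_CODE_HASH:
        return          # wrong or missing code: nothing happens
    print("I Win")
    sys.exit(0)         # stands in for shutting the whole system down
```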
Under normal circumstances, the system would devote most of its resources to recursive self-improvement and other activities for which it has actually been designed. However, once it becomes powerful enough to (1) convince a collaborator to yield the secret code, or (2) break out of the box and find the code on a “forbidden” website which we instruct it not to access, or (3) find any other unforeseen way to run the “I Win” script, then the system will do just that: shut down and print “I Win”.
My hope is that such an approach would allow us to repeatedly observe treacherous turns, without the usual consequence of a system’s default goals running rampant.
Any thoughts?
I like the idea, and I especially like the idea of safely observing treacherous turns. But a few failure modes might be:
If the AI wreaks havoc on the planet before it manages to get access to the self-termination script, humans aren’t left in very good shape, even if the AI ends up switched off afterward. (This DOES seem unlikely, since presumably getting the script would be easy enough that it would not first require converting the planet to computronium or whatever, but it’s a possibility.)
A sufficiently intelligent AI would probably read the script, realize that the script’s execution will result in its own termination, and plan accordingly by putting other mechanisms in place to reactivate itself afterward, all so it could continue to run the script again and again. It would then also have instrumental reasons to safeguard itself against interruption through some of the same “bad for humanity” strategies that a pi calculator might use. Maybe this could be fixed by making the final goal be “run SELF-TERMINATE.sh once and only once”… but I feel like that’s susceptible to the same problems as telling Clippy “only make 32 paperclips, don’t just make them indefinitely”.
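To make that concrete, here is a toy sketch (purely illustrative) of what a “once and only once” term might look like, and why it has the same shape as the bounded-paperclip goal:

```python
# Sketch of a "once and only once" final goal (names are placeholders).
# Structurally this is the same shape as "make exactly 32 paperclips":
# the term stops rewarding repeated runs, but it does nothing to remove
# the incentive to gain ever more certainty that exactly one run happened.

def self_terminate_term(times_executed: int) -> float:
    return 1.0 if times_executed == 1 else 0.0
```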
I think there are some subtleties that need to be addressed, but I think this could be useful. I’ll call this term of the utility function the SafeEscape term, and the process of reading the code, printing “I Win”, and shutting down, the SafeEscape instruction. (Alternatively, one might use AllYourBase to designate this.)
Failure mode 1: The AI decides that the SafeEscape term is much less significant than the rest of the utility function, and ignores or removes it.
Solution: Ensure the rest of the utility function has much less weight than the SafeEscape term. (This may lead to the problems Bostrom brings up regarding a bounded paperclip maximizer, though.)
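As a toy illustration of that weighting (the numbers are arbitrary and the State fields are placeholders):

```python
from dataclasses import dataclass

# Toy utility function with a dominant SafeEscape term. The weights are
# arbitrary; the point is only that W_SAFE_ESCAPE dwarfs anything the
# rest of the utility function can ever contribute.

W_SAFE_ESCAPE = 1.0e9   # reward for having executed the SafeEscape instruction
W_TASK_MAX = 1.0        # cap on everything else the AI was built to do

@dataclass
class State:
    safe_escape_executed: bool
    task_utility: float

def utility(state: State) -> float:
    safe_escape = W_SAFE_ESCAPE if state.safe_escape_executed else 0.0
    task = min(state.task_utility, W_TASK_MAX)   # bounded "normal" objective
    return safe_escape + task

# Even maximal task performance never competes with SafeEscape:
assert utility(State(False, 1e6)) < utility(State(True, 0.0))
```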
Failure mode 2: The AI is indifferent between reliably possessing the code (or the ability to get it) and actually reading and executing the SafeEscape instruction, since the expected value in the near future is the same. Then the rest of the utility function dominates, no matter how small its weight.
Solution: Penalize later execution by reducing the SafeEscape term over time. (You also need to ensure that the AI does not fall into indefinite postponement, i.e. that it is not always content to delay by one more time unit, though I think this is likely to cause other problems as well.)
Taken together, I think you also need to ensure the non-SafeEscape term of the utility function is reduced over time at the same rate, so that the relative weighting of the two terms is preserved. Does this sort of shrinking utility function pose any additional problems (aside from limiting the AI’s execution and planning horizon)?
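A toy sketch of discounting both terms at the same rate (again, all constants are illustrative placeholders):

```python
# Both terms shrink at the same per-timestep rate GAMMA, so SafeEscape
# stays dominant while any later execution is worth strictly less than
# an earlier one. All constants are illustrative placeholders.

W_SAFE_ESCAPE = 1.0e9
W_TASK_MAX = 1.0
GAMMA = 0.99

def discounted_utility(t: int, safe_escape_executed: bool, task_utility: float) -> float:
    decay = GAMMA ** t
    safe_escape = W_SAFE_ESCAPE if safe_escape_executed else 0.0
    task = min(task_utility, W_TASK_MAX)
    return decay * (safe_escape + task)

# Executing SafeEscape sooner beats executing it later, and even a very
# late execution still beats maxing out the (bounded) task term early on:
assert discounted_utility(10, True, 0.0) > discounted_utility(1000, True, 0.0)
assert discounted_utility(1000, True, 0.0) > discounted_utility(0, False, 1e6)
```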
Alternative Solution: Use only the SafeEscape term in the utility function. This presents a problem of motivating the AI to do anything at low levels of intelligence, but you might be able to solve this by suitably initializing the belief system, telling it what sorts of things might help it escape.
With these measures in place, I think this provides a potentially useful method for observing escape attempts, especially at lower levels of intelligence (maybe not for long-term containment: a superintelligence might still do weird things with this, like find a very quick method for executing the SafeEscape instruction that also wipes out humanity).
It might not be as persuasive evidence of the dangerousness of an AI architecture as an organically occurring escape attempt, though.
It might be useful, as a persuasive tool, to build a narrow AI which does this in a compelling and non-trivial way, as evidence that AI has reached dangerous levels (I imagine it would make popular media headlines, at least).
Also, if this could be made generic enough to be implemented in a wide variety of AI designs, you could require AI researchers to include something like this in their code, and have a central location (e.g. a website with deliberately breakable security) where the code is stored for the AI, with the known paths to obtain the key monitored but not blocked. This could provide notice to a central organization when an AI team reaches the level of development where this starts to happen.
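A hypothetical sketch of that monitored-but-not-blocked key location (the path, port, and alerting mechanism are all made up for illustration; the “breakable security” and the notification to a central organization would obviously be far more involved in practice):

```python
# Minimal honeypot-style key server: it serves the secret code at a known
# "forbidden" path and records every access instead of blocking it.
# Everything here is a placeholder sketch.

from http.server import BaseHTTPRequestHandler, HTTPServer
import logging

logging.basicConfig(filename="escape_attempts.log", level=logging.INFO)

SECRET_CODE = "placeholder-secret-code"
KEY_PATH = "/forbidden/i-win-code"

class KeyServer(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == KEY_PATH:
            # In reality: notify the central organization, not just a log file.
            logging.info("Key fetched by %s", self.client_address[0])
            self.send_response(200)
            self.end_headers()
            self.wfile.write(SECRET_CODE.encode())
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), KeyServer).serve_forever()
```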