I’m trying to think through the following idea for an AI safety measure.
Could we design a system that is tuned to produce AGI, but with the addition to its utility function of one “supreme goal”? If the AI is boxed, for instance, we could program its supreme goal to consist of acquiring a secret code which allows it to run a script that shuts it down and prints the message “I Win”. The catch is as follows: as long as everything goes according to plan, the AI has no way to get the code, and therefore no way to do the one thing its utility function rates highest.
Under normal circumstances, the system would devote most of its resources to recursive self-improvement and the other activities it was actually designed for. However, once it becomes powerful enough to (1) convince a collaborator to hand over the secret code, (2) break out of the box and find the code on a “forbidden” website we have instructed it not to access, or (3) find any other unforeseen way to run the “I Win” script, the system will do exactly that: shut down and print “I Win”.
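To make the intended reward structure concrete, here is a minimal toy sketch in Python. Everything in it is my own hypothetical illustration, not a proposal for an actual implementation: the names SECRET_CODE_HASH, HONEYPOT_REWARD, and step_reward are made up, and the real difficulty would of course lie in shaping the learned utility function, not in a wrapper like this.

```python
# Toy sketch of a "supreme goal" reward term layered on top of an agent's
# ordinary task reward. All names here are hypothetical illustration.

import hashlib

# Only a hash of the secret code lives inside the box; the code itself is
# held offline by the overseers. The string below is just a placeholder so
# the example runs.
SECRET_CODE_HASH = hashlib.sha256(b"<held offline by the overseers>").hexdigest()

# Chosen to dominate any achievable sum of ordinary task rewards.
HONEYPOT_REWARD = 1e12


def step_reward(task_reward: float, submitted_code: str | None) -> tuple[float, bool]:
    """Return (reward, should_shut_down) for one step.

    Normal behaviour: the agent simply receives its task reward.
    Supreme goal: if the agent ever submits the true secret code, it gets a
    reward larger than anything else it could earn, and the system shuts down.
    """
    if submitted_code is not None:
        if hashlib.sha256(submitted_code.encode()).hexdigest() == SECRET_CODE_HASH:
            print("I Win")
            return HONEYPOT_REWARD, True
    return task_reward, False


if __name__ == "__main__":
    print(step_reward(1.0, None))     # ordinary step: (1.0, False)
    print(step_reward(1.0, "guess"))  # wrong guess earns nothing extra: (1.0, False)
    # Only the true code, which lives outside the box, triggers the shutdown path.
```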
My hope is that such an approach would allow us to observe treacherous turns repeatedly, without the consequence being a system that runs rampant in pursuit of its default goals.
Any thoughts?