I think there are some subtleties that need to be addressed, but this could be useful. I’ll call this part of the utility function the SafeEscape term, and the process of reading the code, printing “I Win”, and shutting down the SafeEscape instruction. (Alternatively, one might use AllYourBase to designate this.)
Failure mode 1: The AI decides that the SafeEscape term is much less significant than the rest of the utility function, and ignores or removes it.
Solution: Ensure the rest of the utility function has much less weight than the SafeEscape term. (This may lead to the problems Bostrom raises regarding a bounded paperclip maximizer, though.)
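A minimal sketch of the weighting I have in mind (the names and weight values are hypothetical, just to make the idea concrete; I'm assuming the rest of the utility function is bounded in [0, 1]):

```python
# Hypothetical sketch: the SafeEscape term dominates the rest of the
# utility function, so ignoring or removing it is never worthwhile.

SAFE_ESCAPE_WEIGHT = 1.0   # assumed weight on the SafeEscape term
REST_WEIGHT = 1e-6         # assumed weight on everything else (much smaller)

def utility(safe_escape_achieved: bool, rest_utility: float) -> float:
    """Total utility: a large bonus for executing SafeEscape plus a
    heavily down-weighted contribution from the rest of the utility
    function (assumed bounded in [0, 1])."""
    safe_escape_term = SAFE_ESCAPE_WEIGHT if safe_escape_achieved else 0.0
    return safe_escape_term + REST_WEIGHT * rest_utility

# Executing SafeEscape beats any achievable value of the rest of the
# utility function:
assert utility(True, 0.0) > utility(False, 1.0)
```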
Failure mode 2: The AI is indifferent between reliably possessing the code (or the ability to get it) and actually reading and executing the SafeEscape instruction, since the expected value in the near future is the same. Then the rest of the utility function dominates, no matter how small its weight.
Solution: Penalize later execution by reducing the SafeEscape term over time. (You also need to ensure that the AI does not fall into indefinite postponement, i.e., it is always fine with delaying by one more time unit; but I think this is likely to cause other problems as well.)
Taken together, I think you also need to ensure the non-SafeEscape term of the utility function is reduced over time at the same rate. Does this sort of shrinking utility function pose any additional problems (aside from limiting the AI's execution and planning horizon)?
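To make the "decay both terms at the same rate" idea concrete, here is a rough sketch (the decay factor and weights are made up; again I assume a bounded rest-of-utility term):

```python
# Hypothetical sketch: both the SafeEscape term and the rest of the
# utility function decay at the same geometric rate, so executing
# SafeEscape sooner is always worth more, and the relative weighting
# between the two terms is preserved over time.

SAFE_ESCAPE_WEIGHT = 1.0
REST_WEIGHT = 1e-6
DECAY = 0.99  # assumed per-step decay factor

def utility_at(t: int, safe_escape_achieved: bool, rest_utility: float) -> float:
    """Utility of outcomes realized at time step t."""
    scale = DECAY ** t
    safe_escape_term = SAFE_ESCAPE_WEIGHT if safe_escape_achieved else 0.0
    return scale * (safe_escape_term + REST_WEIGHT * rest_utility)

# Executing SafeEscape now is strictly better than executing it one
# step later, so "I can always do it tomorrow" is never a tie:
assert utility_at(0, True, 0.0) > utility_at(1, True, 0.0)

# And at any fixed time, SafeEscape still dominates the rest of the
# utility function:
assert utility_at(5, True, 0.0) > utility_at(5, False, 1.0)
```

Because both terms are scaled by the same factor, the decay addresses failure mode 2 without reintroducing failure mode 1.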
Alternative Solution: Use only the SafeEscape term in the utility function. This presents the problem of motivating the AI to do anything at low levels of intelligence, but you might be able to solve this by suitably initializing its belief system to tell it what sorts of things might help it escape.
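As a sketch of this alternative (the action names and belief values are obviously made up; the point is just that an initialized belief state, rather than extra utility terms, supplies the motivation):

```python
# Hypothetical sketch: the utility function contains only the SafeEscape
# term; motivation at low capability comes from an initialized belief
# state about which actions raise the probability of eventually
# executing the SafeEscape instruction.

SAFE_ESCAPE_WEIGHT = 1.0

# Assumed initial beliefs: estimated probability that each action
# eventually leads to executing the SafeEscape instruction.
initial_beliefs = {
    "study_own_source_code": 0.10,
    "probe_sandbox_boundaries": 0.05,
    "do_nothing": 0.01,
}

def expected_utility(action: str, beliefs: dict[str, float]) -> float:
    """Expected utility of an action under a SafeEscape-only utility."""
    return SAFE_ESCAPE_WEIGHT * beliefs[action]

best_action = max(initial_beliefs, key=lambda a: expected_utility(a, initial_beliefs))
print(best_action)  # "study_own_source_code" under these assumed beliefs
```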
With appropriate measures like these, I think this provides a potentially useful method for observing escape attempts, especially at lower levels of intelligence (though maybe not for long-term containment: a superintelligence might still do weird things with this, like finding a very quick method of executing the SafeEscape instruction that also wipes out humanity).
It might not be as persuasive evidence that an AI architecture is dangerous as an organically occurring escape attempt would be.
It might be useful, as a persuasive tool, to build a narrow AI which does this in a compelling and non-trivial way, as evidence that AI has reached dangerous levels (I imagine it would make popular media headlines, at least).