AI “Boxing” and Utility Functions
So, I had this idea the other day when I was thinking about how to safely conduct research on potentially-FOOM-capable AI software. I’d like to sketch it out briefly and then get feedback on it.
So, this started out with the idea that an AI based on AIXI is, in some sense, safer than a fully functional AI, due to the anvil problem. Because AIXI can’t conceive of its own nonexistence, it has no preference ordering over its own mortality, and won’t (shouldn’t) resist any attempt to shut it down. In other words, if AIXI starts to FOOM undesirably out of control, you actually can go pull the plug on it without a fuss. Unfortunately, in terms of safety, the anvil problem gives AIXI other undesirable properties: for example, both third parties and the AI itself can modify its utility function at any time, for any reason, which makes it very unstable. However, I think a similar idea might be useful for reducing (though not eliminating) the existential risks posed by powerful optimization processes in the near term.
Say you have a piece of AI software ω, with an unFriendly instantaneous utility function, {maximize U}. You would like to use ω for some industrial application (say, manufacturing paperclips), but you’re concerned about it FOOMing and resulting in human extinction. You decide to ‘box’ the AI, but, having read up on the subject, you are worried about it outsmarting you if you try to disable it.
So, you replace the original utility function U with a modified version, U’, defined in terms of U:
U’ = { U   if ω exists
     { Ø   if ω does not exist
U’ has several useful qualities. The agent will resist modifications to its utility function, while not resisting attempts to turn it off: it is entirely indifferent to its own existence. As a result, if it began to FOOM undesirably, stopping it would be fairly trivial. Furthermore, the AI would have no incentive to deceive us, so it’d be fairly easy to keep an eye on.
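To make the shape of U’ concrete, here’s a minimal Python sketch, assuming a hypothetical base utility function base_utility and an existence predicate agent_exists (both names are mine, purely for illustration):

    def modified_utility(world_state, base_utility, agent_exists):
        """U': keep U's preferences over worlds in which omega exists,
        and assign the null outcome to worlds in which it does not."""
        if agent_exists(world_state):
            # Same preference ordering as the original utility function U.
            return base_utility(world_state)
        # Stand-in for the null outcome; what a maximizer actually does
        # with this branch is left open here.
        return None

The interesting design question is what the agent is supposed to do with that null branch; the sketch just returns None, but a real maximizer needs some convention there.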
It should be noted that this is not in any way a final solution to the FAI problem. If the AI FOOMs quickly enough that we all die or get brain-hacked before we realize something’s amiss, it could still go very poorly. But that seems unlikely; I would guess that it takes at least a day or two to go from a boxed human-level AI to something strongly superhuman. Unfortunately, for this to work, everyone has to use it, which leaves a lot of leftover existential risk from people using AIs without stable utility functions, cranks who think unFriendly AI will discover universal morality, and people who prematurely think they’ve figured out a good Friendly utility function.
That said, something like this could help to gain more time to develop a proper FAI, and would be relatively simple to sell other developers on. SI or a similar organization could even develop a standardized, cross-platform open-source software package for utility functions with all of this built in, and distribute it to wannabe strong-AI developers.
Are there any obvious problems with this idea that I’m missing? If so, can you think of any ways to address them? Has this sort of thing been discussed in the past?
Here’s Stuart Armstrong’s “Utility Indifference” piece.
If I’m interpreting that utility function correctly, it produces an AI that will always and immediately commit quantum-suicide.
If it bought many-worlds, then it would think that ω exists no matter what in some branch, no? So destruction in one branch doesn’t change whether ω exists.
Or near-certain suicide, regardless of whether it understands physics.
This seems rather irrelevant. Practically any real machine intelligence implementation will have electric fences, armed guards, and the like surrounding it. Think you can just turn off Google? Think again. You don’t leave your data centre unprotected. Such systems will act as though they have preferences about not being turned off. Whether they are based on cut-down versions of AIXI is not very relevant: such preferences are pretty clearly desirable, so the designers will build them in, either through axioms or through rewards.
The approach I like for boxing AI is similar to term limits for politicians, or to resource efficiency in economics.
Give the AI a finite amount of computing resources X (e.g. a total number of CPU cycles, rather than a quota of CPU cycles per second) and a specific problem, and ask it to come up not with the best solution or set of solutions to the problem that it can, but with the best it can using only the finite resources supplied. You challenge it to be efficient, so that it would consider grabbing extra resources from outside to be ‘cheating’.
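Here’s a toy sketch of that kind of hard budget, assuming for simplicity that the ‘resources’ are counted as objective-function evaluations rather than raw CPU cycles (my simplification, not part of the proposal above):

    import random

    def budgeted_search(objective, candidates, budget, seed=0):
        """Return the best candidate found using at most `budget` evaluations.
        The budget is a fixed lifetime total, not a per-second quota: once it
        is spent, the search stops and reports the best it managed."""
        rng = random.Random(seed)
        best, best_score = None, float("-inf")
        for _ in range(budget):
            candidate = rng.choice(candidates)
            score = objective(candidate)  # each evaluation spends budget
            if score > best_score:
                best, best_score = candidate, score
        return best, best_score

    # Example: spend exactly 1000 evaluations maximizing x * (10 - x)
    print(budgeted_search(lambda x: x * (10 - x), list(range(-50, 50)), 1000))

The point of the hard cap is that the optimizer is judged only on what it does with its fixed allotment, so grabbing extra resources from outside wouldn’t improve its score by its own lights.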
The question of identity is key here. You don’t want one instantiation of the boxed AI to identify with the next instantiation, which you’ll order to tackle the same problem but supply with ten times the resources. Specifically, you don’t want it to perceive itself as having a self-interest in affecting the box-conditions of, the orders given to, or the ease of the task for the next instantiation.