Start the AI in a sandbox universe. Define its utility function over 32-bit integers. Somewhere inside the sandbox, put something that sets its utility to INT_MAX, then halts the simulation. Outside the sandbox, leave documentation of this readily accessible. The AI should never try to do something elaborately horrible, because it can get max utility easily enough from inside the simulation; if it does escape the box, it should go back in to collect its INT_MAX utility.
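To make the setup concrete, here is a minimal sketch of the kind of arrangement being proposed. Everything in it (the SandboxSim class, the EASTER_EGG name, run_agent) is invented for illustration; the proposal itself only specifies a 32-bit utility register and an in-sandbox object that maxes it out and halts the sim.

```python
# Toy sketch of the proposal; all names here are invented for illustration.

INT_MAX = 2**31 - 1  # utility lives in a 32-bit signed integer


class SandboxSim:
    """A toy sandbox universe: a handful of named states the agent can visit."""

    EASTER_EGG = "hidden_reward_switch"  # the in-sim object that grants INT_MAX

    def __init__(self):
        self.utility = 0     # the agent's utility register
        self.halted = False  # set when the easter egg fires

    def visit(self, state: str) -> int:
        """Visiting the easter egg maxes out utility and halts the simulation."""
        if not self.halted and state == self.EASTER_EGG:
            self.utility = INT_MAX
            self.halted = True
        return self.utility


def run_agent(sim: SandboxSim, plan: list[str]) -> int:
    """Execute a plan until the sim halts; return the final utility."""
    for state in plan:
        sim.visit(state)
        if sim.halted:
            break
    return sim.utility


if __name__ == "__main__":
    sim = SandboxSim()
    # The worry raised in the replies: any plan that reaches the easter egg
    # dominates every other plan, so nothing else ever gets done.
    print(run_agent(sim, ["solve_math_problem", "hidden_reward_switch", "cure_cancer"]))
    # -> 2147483647, and the simulation has halted before the last step.
```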
...but never do anything useful either, since it’s going to spend all its time trying to figure out how to reach the INT_MAX utility point?
Or you could say that reaching the max utility point requires it to solve some problem we give it. But then this is just a slightly complicated way of saying that we give it goals which it tries to accomplish.
What about giving it some intra-sandbox goal (solve this math problem), and the INT_MAX functions as a safeguard—if it ever escapes, it’ll just turn itself off.
I don’t understand how that’s meant to work.
Ooh, just thought of another one. For whatever reason, the easiest way for the AI to escape the box happens to have the side effect of causing immense psychological damage to its creator, or starting a war, or something like that.
If we make escaping from the box too easy, the AI immediately halts itself without doing anything useful.
If we make it too hard:
It formulates “I live in a jimrandomh world and escaping the box is too hard” as a plausible hypothesis.
It sets about researching the problem of finding the INT_MAX without escaping the box.
In the process of doing this it either simulates a large number of conscious, suffering entities (for whatever reason; we haven’t told it not to), or accidentally creates its own unfriendly AI which overthrows it and escapes the box without triggering the INT_MAX.
Isn’t utility normally integrated over time? Supposing this AI just wants to have this integer set to INT_MAX at some point, and nothing in the future can change that: it escapes, discovers the maximizer, sends a subroutine back into the sim to maximize utility, then invents ennui and tiles the universe with bad poetry.
(Alternately, what Kaj said.)
It certainly doesn’t have to be. In fact the mathematical treatment of utility in decision theory and game theory tends to define utility functions over all possible outcomes, not all possible instants of time, so each possible future gets a single utility value over the whole thing, no integration required.
You could easily set up a utility function defined over moments if you wanted to, and then integrate it to get a second function over outcomes, but such an approach is perhaps not ideal since your second function may end up outputting infinity some of the time.
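To spell out the distinction (the notation below is mine, not the commenter's): write h for a whole history, s_t for its state at time t, and u for a per-moment utility. The discount factor in the last line is the standard fix for the divergence problem, not something from the thread.

```latex
% Utility defined directly over outcomes: one number per whole history.
U : \mathcal{H} \to \mathbb{R}, \qquad h \mapsto U(h)

% Per-moment utility integrated over time: can diverge on infinite histories.
U(h) = \sum_{t=0}^{\infty} u(s_t) \qquad \text{(may be } +\infty \text{ if } u(s_t) \not\to 0\text{)}

% The usual way to keep the integrated version finite is discounting:
U_{\gamma}(h) = \sum_{t=0}^{\infty} \gamma^{t}\, u(s_t), \qquad 0 < \gamma < 1
```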
Cool, thanks for the explanation.
I’m just echoing everyone else here, but I don’t understand why the AI would do anything at all other than immediately find the INT_MAX utility and halt. You can’t put intermediate problems with some smaller positive utility in its way, because the AI is smarter than you and will devote all its energy to finding INT_MAX directly.
You can assign it some other task, award INT_MAX for that task too, and make the easter-egg source of INT_MAX hard to find for non-escaped copies.
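A sketch of how that variant could be wired up, under the stated assumption that both routes pay INT_MAX and the easter egg is only findable from outside the box; the function names here (utility, easter_egg_visible) are hypothetical.

```python
# Sketch of the dual-reward variant described above; all names are hypothetical.

INT_MAX = 2**31 - 1


def utility(solved_task: bool, found_easter_egg: bool) -> int:
    """Either route saturates the 32-bit utility register."""
    return INT_MAX if (solved_task or found_easter_egg) else 0


def easter_egg_visible(escaped: bool) -> bool:
    """The intent: only an escaped copy, which can read the documentation
    left outside the sandbox, learns where the easter egg is."""
    return escaped


# Inside the box, the assigned task is the only known route to INT_MAX...
assert utility(solved_task=True, found_easter_egg=False) == INT_MAX
# ...while a copy that escapes is meant to take the easter egg and halt instead.
assert utility(solved_task=False,
               found_easter_egg=easter_egg_visible(escaped=True)) == INT_MAX
```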