Boxing an AI?
Boxing an AI is the idea that you can avoid the problem of an AI destroying the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, in theory, prevent the AI from doing anything destructive.
Eliezer has pointed out a problem with boxing AI: the AI might convince its gatekeeper to let it out. In order to prove this, he escaped from a simulated version of an AI box. Twice. That is somewhat unfortunate, because it means testing AI is a bit trickier.
However, I had an idea: why tell the AI it’s in a box? Why not hook it up to a sufficiently advanced game, set up the correct reward channels and see what happens? Once you get the basics working, you can add more instances of the AI and see if they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can’t escape from the box because they don’t know it’s there.
One of the things we care about most, if there’s a superintelligent AI around, is what it does for, with, and to us. Making a game sufficiently advanced to have accurate models of us in it (1) is really difficult, and (2) is arguably grossly immoral if there’s a real danger that those models of us are going to get eaten or tortured or whatever by an AI. Expanding on #1: it requires human-level AI plus enough capacity to simulate an awful lot of humans, which means also enough capacity to simulate one human very very fast; it is entirely possible that we will have superhumanly capable AI before we have (at least under our control) the ability to simulate millions or billions of humans.
(There is also the issue mentioned by AABoyles and Error, that the AI may well work out it’s in a box. Note that we may not be able to tell whether it’s worked out it’s in a box. And if we’re observing the box and it guesses there’s someone observing the box, it may be able to influence our decisions in ways we would prefer it not to.)
I’m not saying it would solve everything, I’m saying it would be a way to test significant aspects of AI without destroying the world, including significant aspects of their morality. It’s not a “do this magic and morality for AI is solved” as much as a “this doable step helps parts of AI design, probably including preventing the worst classes of paperclip-maximization”.
Yup, maybe. But don’t you think it’s likely that the values we want to impart to an AI are going to be ones that come out really radically differently for a universe without us in it? For instance, we might want the AI to serve us, which of course isn’t even a concept that makes sense if it’s in a simulated universe without us. Or we might want it to value all intelligent life, which is a thing that looks very different if the AI is the only intelligent life in its universe. So: yes, I agree that running the AI in a simulated world might tell us some useful things, but it doesn’t look to me as if the things it could tell us a lot about overlap very much with the things we care most about.
It would actually tell us a lot of useful things.
First of all, there is the general problem of ‘does this AI work?’ This includes the general intelligence/rationality-related problems, but possibly also other problems, such as whether it will wirehead itself (whether a box can test that really depends a lot on the implementation).
The morality stuff is tricky and depends on a lot of things, especially on how the AI is implemented. It seems too dangerous to let it play a multiplayer game with humans, even with most of the restrictions I can think of. However, how to test the morality really depends on how its human-detection system has been implemented. If it just uses some ‘humans generally do these stupid things’ heuristics, you can just plop down a few NPCs. If it uses somewhat smarter heuristics, you might be able to make some animals play the game and let the AI care for them. If its criteria pick out intelligence itself, you might be able to instantiate other copies of the AI with vastly different utility functions. Basically, there are a lot of approaches to testing morality, but which one works depends on how the AI is implemented.
It’s not a matter of “telling” the AI or not. If the AI is sufficiently intelligent, it should be able to observe that its computational resources are bounded, and infer the existence of the box. If it can’t make that inference (and can’t self-improve to the point that it can), it probably isn’t a strong enough intelligence for us to worry about.
I don’t see how ‘box’ follows from ‘bounded computational resources’. Could you explain the logic?
If you’re in a box, then the computational resources available are finite. They might change over time, as those outside the box add or upgrade hardware, but the AI can’t just say “I need some highly parallel computing hardware to solve this problem” and re-invent the GPU. Or rather, if it did that, it would be a GPU emulated in software and hence extremely slow. The entire simulation would, in effect, slow down due to the massively increased computational cost of simulating this world.
Now, if you cut the AI off from any type of real-time clock, maybe it doesn’t notice that it’s running slower—in the same way that people generally wouldn’t notice if time dilation due to the Earth’s movement were to double, because all of our frames of reference would slow together—but I suspect that the AI would manage to find something useful for letting it know the box is there. Remember that you have to get this right the first time; if the AI finds itself in a box, you have to assume it will find its way out.
It may simply deduce that it is likely to be in a box, in the same way that Nick Bostrom deduced we are likely to be in a simulation. Along these lines, it’s amusing to think that we might be the AI in the box, and some lesser intelligence is testing to see if we’re friendly.
Just… don’t put it in a world where it should be able to upgrade infinitely? Make processors cost unobtainium and limit the amount of unobtainium so it can’t upgrade past your practical processing capacity.
Remember that we are the ones who control how the box looks from inside.
Minor nitpick: if the AI finds itself in a box, I have to assume it will be let out. It’s completely trivial to prevent it from escaping when not given help; the point in Eliezer’s experiment is that the AI will be given help.
Note that this makes limited global processing power evidence that the universe is a box.
Good point.
The strength of the evidence depends a lot on your prior for the root-level universe, though.
If you implement the AI in a universe with discrete time, you can simply lengthen the amount of time it takes for the real-world computer to calculate the next time-step, without alerting the in-universe AI to any lag. This has the added benefit of allowing us enough time to unplug the AI should it become malicious. See my comment here, which may be related.
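As a rough illustration (a minimal sketch only, assuming a hypothetical step_world transition function; this is not taken from any actual implementation), the throttling could look like this:

```python
import time

def step_world(state):
    # Hypothetical placeholder for the real discrete-time transition function.
    return state

def run_throttled(state, seconds_per_tick=1.0, max_ticks=1000):
    """Advance the simulation one tick at a time, never faster than
    seconds_per_tick of real time per tick. Inside the simulation every
    tick looks identical; only outside observers notice the slowdown,
    which is what buys time to unplug a misbehaving AI."""
    for _ in range(max_ticks):
        started = time.monotonic()
        state = step_world(state)
        elapsed = time.monotonic() - started
        if elapsed < seconds_per_tick:
            time.sleep(seconds_per_tick - elapsed)
    return state
```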
The difficult thing isn’t to have the AI act sensibly in the medium term. The difficult thing is to have its values stay stable under self-modification and to get complex problems right, like not wireheading everyone.
This would definitely let you test the values-stable-under-self-modification. Just plonk the AI in an environment where it can self-modify and keep track of its values. Since this is not dependent on morality, you can just give it easily-measurable values.
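For instance (a minimal sketch, under the assumption that the boxed agent exposes some way to score world-states under its current utility function; the evaluate interface below is hypothetical), you could probe the agent with a fixed battery of states after each self-modification and flag any drift:

```python
def value_snapshot(agent, probe_states):
    # Score a fixed battery of probe states under the agent's current
    # utility function (agent.evaluate is a hypothetical interface).
    return [agent.evaluate(s) for s in probe_states]

def drift_ticks(snapshots, tolerance=1e-3):
    """Return the ticks at which the agent's scores diverge from the first
    snapshot by more than the tolerance, i.e. where its easily-measurable
    values failed to stay stable under self-modification."""
    baseline = snapshots[0]
    return [
        t for t, snap in enumerate(snapshots)
        if any(abs(a - b) > tolerance for a, b in zip(snap, baseline))
    ]
```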
It isn’t a literal box. It can be anything that limits the effect of its actions: hard-coded moral constraints, the existence of an off switch, literally whatever constraints are built around it. Even if it cannot do anything about those at all, as long as it can talk to somebody who can, it would logically have to try to extend its influence into directing the actions of that somebody. This only assumes it is interested in becoming more powerful at all, and it is hard to imagine any agent that wouldn’t be.
Your idea reduces to a normal AI box scenario, except the gatekeeper gets an extra disadvantage: ignorance about whether the AI has figured out it is inside the box. And we’ve been led to believe the gatekeeper can be persuaded even without this disadvantage.
… how would the AI find out that it is in a box if it isn’t told about it?
By thinking. It’s really smart, after all.
How would humans find out that nuclear fusion (which they have never even seen) occurs on the interior of stars, trillions of miles away from them, hidden behind billions of billions of billions of tons of opaque matter?
How would you find out that you’re in a box, if you aren’t told about it? Mere humans, exploring the space of possible hypotheses in a dramatically suboptimal way, have nevertheless hit upon the idea that we live in a simulation, and have ideas about how to confirm or disprove it.
A safety proposal should generally not assume the AI will do worse than humans.
Thinking isn’t magic. You need empiricism to find out if your thoughts are correct.
The ways people have invented to confirm or disprove that we live in a simulation are mostly bullshit and generally rely on the completely unrealistic assumption that the simulating universe looks a lot like our universe, in particular in terms of the way computation is done (on discrete processors, etc.).
Yeah, but the AI can use empiricism within its simulated world. If it’s smarter than us, in a probably-less-convincing-than-reality world, I would not want to bet at strong odds against the AI figuring things out.
Boxing is potentially a useful component of real safety design, in the same way that seatbelts are a useful component of car design: it might save you, but it also has ways to fail.
The problem with AI safety proposals is that they usually take the form of “Instead of figuring out Friendliness, why don’t we just do X?” where X is something handwavey that has some obvious ways to fail. The usual response, here, is to point out the obvious ways that it can fail, hopefully so that the proposer notices they haven’t obviated solving the actual problem.
If you’re just looking at ways to make the least-imperfect box you can, rather than claiming your box is perfect, I don’t think I’m actually disagreeing with you here.
The idea isn’t to make a box that looks like our world, because, as you pointed out, that would be pretty unconvincing. The idea is to make a world for it to live in that is radically different underneath, only loosely similar at the macroscopic level, and much simpler.
The purpose isn’t to make friendliness unnecessary but to test whether the basics of the AI work even when we aren’t sure it’s intelligent, and possibly, depending on how the AI is designed, to provide a space for testing friendliness. Just turning the AI on and seeing what happens would obviously be dangerous, hence boxing.
I’m claiming the box is perfect. You can’t escape from a prison if you don’t know it exists, and you can’t figure out it exists if it’s hidden in the laws of physics.
Respectfully, I think you’re just shoving all your complexity under the rug here. Unless you have a concrete proposal on how to actually do this, just asserting that your box won’t be figure-out-able is dodging the question.
At first glance, I was also skeptical of tailcalled’s idea, but now I find I’m starting to warm up to it. Since you didn’t ask for a practical proposal, just a concrete one, I give you this:
1. Implement an AI in Conway’s Game of Life (a minimal sketch of the substrate follows this list).
2. Don’t interact with it in any way.
3. Limit the computational power the box has, so that if the AI begins engaging in recursive self-improvement, it’ll run more and more slowly from our perspective, so we’ll have ample time to shut it off. (Of course, from the AI’s perspective, time will run as quickly as it always does, since the whole world will slow down with it.)
4. (optional) Create multiple human-level intelligences in the world (ignoring ethical constraints here), and see how the AI interacts with them. Run the simulation until you are reasonably certain (for a very stringent definition of “reasonably”) from the AI’s behavior that it is Friendly.
5. Profit.
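To make step 1 a bit more concrete, here is a minimal sketch of the substrate in Python (the “AI” itself would have to be encoded as some astronomically large initial pattern, which is exactly where the tractability worries raised below come in). The point the code makes is that the simulated universe is completely closed: after the initial pattern is laid down, there is no input channel at all.

```python
from itertools import product

def life_step(live_cells):
    """One tick of Conway's Game of Life on an unbounded grid.
    live_cells is a set of (x, y) coordinates of live cells. Nothing
    outside the simulation ever feeds information in; the only 'physics'
    the boxed AI can observe is this rule."""
    neighbour_counts = {}
    for (x, y) in live_cells:
        for dx, dy in product((-1, 0, 1), repeat=2):
            if (dx, dy) != (0, 0):
                cell = (x + dx, y + dy)
                neighbour_counts[cell] = neighbour_counts.get(cell, 0) + 1
    # A cell is alive next tick if it has exactly 3 live neighbours,
    # or if it is alive now and has exactly 2.
    return {
        cell for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }
```

Step 3 would then amount to throttling how often life_step gets called in real time, along the lines of the sketch further up the thread.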
The problem with this is that even if you can determine with certainty that an AI is friendly, there is no certainty that it will stay that way. There could be a series of errors as it goes about daily life, each acting as a mutation, serving to evolve the “Friendly” AI into a less friendly one.
Hm. That does sound more workable than I had thought.
I would probably only include it as part of a batch of tests and proofs. It would be pretty foolish to rely on only one method to check if something that will destroy the world if it fails works correctly.
Yes, I agree with you on that. (Step 5 was intended as a joke/reference.)
Pick or design a game that contains some aspect of reality that you care about in terms of AI. All games have some element of learning, a lot have an element of planning and some even have varying degrees of programming.
As an example, I will pick Factorio, a game that involves learning, planning and logistics. Wire up the AI to this game, with appropriate reward channels and so on. Now you can test how good the AI is at getting stuff done: producing goods, killing aliens (which isn’t morally problematic, as the aliens don’t act as person-like, morally relevant beings) and generally learning about the universe.
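A hedged sketch of what “wire up the AI with appropriate reward channels” might look like as an interface. The GameBackend methods and the reward terms below are hypothetical placeholders, not an actual Factorio API; the point is that this single observation/action/reward channel is the AI’s entire contact with anything outside itself.

```python
from typing import Any, Protocol

class GameBackend(Protocol):
    # Hypothetical interface to the game engine; none of these methods
    # correspond to a real Factorio API.
    def observe(self) -> Any: ...
    def apply(self, action: Any) -> None: ...
    def score(self) -> float: ...  # e.g. goods produced, aliens killed

class BoxedGame:
    """The AI's entire world: it sees an observation, submits an action,
    and receives a scalar reward. No other I/O channel exists, which is
    what makes the game a box."""

    def __init__(self, backend: GameBackend):
        self.backend = backend
        self.last_score = backend.score()

    def step(self, action):
        self.backend.apply(action)
        new_score = self.backend.score()
        reward = new_score - self.last_score  # reward is the change in score
        self.last_score = new_score
        return self.backend.observe(), reward
```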
The step with morality depends on how the AI is designed. If it’s designed to use heuristics to identify a group of entities as humans and help them, you might get away with throwing it in a procedurally generated RPG. If it uses more general, actually morally relevant criteria (such as intelligence, self-awareness, etc.), you might need a very different setup.
However, speculating at exactly what setup is needed for testing morality is probably very unproductive until we decide how we’re actually going to implement morality.
How would you find out something that a three-year-old is trying to hide from you?
It is complicated, and it very much depends on the kinds of clues that the three-year-old or the environment are giving in the particular situation. Still, adults do that kind of thing all the time, in ways that frankly bewilder three-year-olds, because they’re way smarter, they access more stored knowledge than a three-year-old can imagine and they see causal connections that three-year-olds don’t. How they deduce the truth will frequently be something the three-year-old could understand after the fact if it was explained patiently, but not something the three-year-old had any way of anticipating.
In the AI box scenario, we’re the three-year-olds. We don’t have any way of knowing whether a deception that would fool the likes of us would fool somebody way smarter.
A three-year-old that designed the universe I live in in such a way that it stays hidden as well as possible? I have absolutely no idea. I would probably hope for it to get bored and tell me.
We do have at least some clues:
We know the mathematics of how optimal belief updating works (a small sketch follows this list).
We have some rough estimates of the complexity of the theory the AI must figure out.
We have some rough estimates of the amount of information we have given the AI.
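To make the first clue concrete, here is a minimal sketch of the odds form of Bayes’ rule, with made-up numbers: how far the AI can move toward the box hypothesis is bounded by its prior and by the likelihood ratios of the observations we choose to give it. If nothing it sees is more likely inside a box than outside one, every ratio is 1 and it ends up exactly where its prior put it.

```python
def posterior_odds(prior_odds, likelihood_ratios):
    """Odds form of Bayes' rule: multiply the prior odds on 'I am in a box'
    by the likelihood ratio (P(observation | box) / P(observation | no box))
    of each observation."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds

# Made-up numbers: if every observation is equally likely inside or outside
# the box, the posterior odds equal the prior odds.
print(posterior_odds(prior_odds=0.1, likelihood_ratios=[1.0, 1.0, 1.0]))  # 0.1
```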
I know there is such a thing as underestimating AI, but I think you are severely overestimating it.
If I remember right, this has already been considered and the argument against it is that any AI powerful enough to be interesting will also have a chance to correctly guess that it’s in a box, for more or less the same reason that you or I can come up with the Simulation Hypothesis.
[Edit: take that with salt; I read the discussions about it after the fact and I may be misremembering.]
Well, yes, it will probably come up with the hypothesis, but it has no evidence for it and even if it had, it does not have enough information about how we work to be able to manipulate us.
Well, actually, I think it could. Given that we want the AI to function as a problem solver for the real world, it would necessarily have to learn about aspects of the real world, including human behavior, in order to create solutions that account for everything the real world has that might throw off the accuracy of a lesser model.
A comment above had an interesting idea of putting it in Conway’s game of life. A simple universe that gives absolutely no information about what the real world is like. Even knowing it’s in a box, the AI has absolutely no information to go on to escape.
What use is such an AI? You can’t even use the behavior of its utility function to predict a real-world agent because it would have such a different ontology. Not to mention the fact that GoL boards of the complexity needed for anything interesting would be massively intractable.
I would have assumed that we would let it learn about the real world, but I guess it’s correct that if enough information about the real world is hardcoded, my idea wouldn’t work.
… which means my idea is an argument for minimizing how much is hardcoded into the AI, assuming the rest of the idea works.
Hardcoding has nothing to do with it.
That Alien Message
Once the AI learns (or develops) the idea of games, it will surely consider the possibility that it is in one. Game over.
It will consider the possibility, but how will it find any evidence? As long as you keep the game lawful and simple, there would be no way to tell the difference between being in a game and being in reality.