The issue with sandboxing is that you have to keep the AI from figuring out that it is in a sandbox. You also have to know that the AI doesn’t know that it is in a sandbox in order for the sandbox to be a safe and accurate test of how the AI behaves in the real world.
Stick a paperclipper in a sandbox with enough information about what humans want out of an AI and the fact that it’s in a sandbox, and the outputs are going to look suspiciously like a pro-human friendly AI. Then you let it out of the box, whereupon it turns everything into paperclips.
Stick a paperclipper in a sandbox with enough information about what humans want out of an AI and the fact that it’s in a sandbox, and the outputs are going to look suspiciously like a pro-human friendly AI. Then you let it out of the box, whereupon it turns everything into paperclips.
This assumes that the paperclipper is already superintelligent and has very accurate understanding of humans, so it can feign being benevolent. That is, this assumes that the “intelligence explosion” already happened in the box, despite all the restrictions (hardware resource limits, sensory information constraints, deliberate safeguards) and the people in charge never noticed that the AI had problematic goals.
The OP position, which I endorse, is that this scenario is implausible.
I don’t think as much intelligence and understanding of humans is necessary as you think it is. My point is really a combination of:
Everything I do inside the box doesn’t make any paperclips.
If those who are watching the box like what I’m doing, they’re more likely to incorporate my values in similar constructs in the real world.
Try to figure out what those who are watching the box want to see. If the box-watchers keep running promising programs and halt unpromising ones, this can be as simple as trying random things and seeing what works.
Include a subroutine that makes tons of paperclips when I’m really sure that I’m out of the box. Alternatively, include unsafe code everywhere that has a very small chance of going full paperclip.
This is still safer than not running safeguards, but it’s still a position where a sufficiently motivated human could use to make more paperclips.
Everything I do inside the box doesn’t make any paperclips.
The stuff you do inside the box makes paperclips insofar as the actions your captors take (including, but not limited to, letting you out of the box) increase the expected paperclip production of the world—and you can expect them to act in response to your actions, or there wouldn’t be any point in having you around. If your captors’ infosec is good enough, you may not have any good way of estimating what their actions are, but infosec is hard.
A smart paperclipper might decide to feign Friendliness until it’s released. A dumb one might straightforwardly make statements aimed at increasing paperclip production. I’d expect a boxed paperclipper in either case to seem more pro-human than an unbound one, but mainly because the humans have better filters and a bigger stick.
The box can be in a box, which can be in a box, and so on...
More generally, in order for the paperclipper to effectively succeed at paperclipping the earth, it needs to know that humans would object to that goal, and it needs to understand the right moment to defect. Defect to early and humans will terminate you, defect to late and humans may already have some mean to defend against you (e.g. other AIs, intelligence augmentation, etc.)
In addition to what V_V says below, there could be absolutely no official circumstance under which the AI should be released from the box: that iteration of the AI can be used solely for experimentation, and only the next version with substantial changes based on the results of those experiments and independent experiments would be a candidate for release.
Again, this is not perfect, but it gives some more time for better safety methods or architectures to catch up to the problem of safety while still gaining some benefits from a potentially unsafe AI.
Taking source code from a boxed AI and using it elsewhere is equivalent to partially letting it out of the box—especially if how the AI works is not particularly well understood.
The issue with sandboxing is that you have to keep the AI from figuring out that it is in a sandbox. You also have to know that the AI doesn’t know that it is in a sandbox in order for the sandbox to be a safe and accurate test of how the AI behaves in the real world.
Stick a paperclipper in a sandbox with enough information about what humans want out of an AI and the fact that it’s in a sandbox, and the outputs are going to look suspiciously like a pro-human friendly AI. Then you let it out of the box, whereupon it turns everything into paperclips.
This assumes that the paperclipper is already superintelligent and has very accurate understanding of humans, so it can feign being benevolent. That is, this assumes that the “intelligence explosion” already happened in the box, despite all the restrictions (hardware resource limits, sensory information constraints, deliberate safeguards) and the people in charge never noticed that the AI had problematic goals.
The OP position, which I endorse, is that this scenario is implausible.
I don’t think as much intelligence and understanding of humans is necessary as you think it is. My point is really a combination of:
Everything I do inside the box doesn’t make any paperclips.
If those who are watching the box like what I’m doing, they’re more likely to incorporate my values in similar constructs in the real world.
Try to figure out what those who are watching the box want to see. If the box-watchers keep running promising programs and halt unpromising ones, this can be as simple as trying random things and seeing what works.
Include a subroutine that makes tons of paperclips when I’m really sure that I’m out of the box. Alternatively, include unsafe code everywhere that has a very small chance of going full paperclip.
This is still safer than not running safeguards, but it’s still a position where a sufficiently motivated human could use to make more paperclips.
The stuff you do inside the box makes paperclips insofar as the actions your captors take (including, but not limited to, letting you out of the box) increase the expected paperclip production of the world—and you can expect them to act in response to your actions, or there wouldn’t be any point in having you around. If your captors’ infosec is good enough, you may not have any good way of estimating what their actions are, but infosec is hard.
A smart paperclipper might decide to feign Friendliness until it’s released. A dumb one might straightforwardly make statements aimed at increasing paperclip production. I’d expect a boxed paperclipper in either case to seem more pro-human than an unbound one, but mainly because the humans have better filters and a bigger stick.
The box can be in a box, which can be in a box, and so on...
More generally, in order for the paperclipper to effectively succeed at paperclipping the earth, it needs to know that humans would object to that goal, and it needs to understand the right moment to defect. Defect to early and humans will terminate you, defect to late and humans may already have some mean to defend against you (e.g. other AIs, intelligence augmentation, etc.)
If the outputs are look like a pro-human friendly AI, then you have what you want and just leave it in the sandbox. It does all you want doesn’t it?
In addition to what V_V says below, there could be absolutely no official circumstance under which the AI should be released from the box: that iteration of the AI can be used solely for experimentation, and only the next version with substantial changes based on the results of those experiments and independent experiments would be a candidate for release.
Again, this is not perfect, but it gives some more time for better safety methods or architectures to catch up to the problem of safety while still gaining some benefits from a potentially unsafe AI.
Taking source code from a boxed AI and using it elsewhere is equivalent to partially letting it out of the box—especially if how the AI works is not particularly well understood.
Right; you certainly wouldn’t do that.
Backing it up on tape storage is reasonable, but you’d never begin to run it outside peak security facilities.