The issue with sandboxing is that you have to keep the AI from figuring out that it is in a sandbox. You also have to know that the AI doesn’t know that it is in a sandbox in order for the sandbox to be a safe and accurate test of how the AI behaves in the real world.
Stick a paperclipper in a sandbox with enough information about what humans want out of an AI and the fact that it’s in a sandbox, and the outputs are going to look suspiciously like a pro-human friendly AI. Then you let it out of the box, whereupon it turns everything into paperclips.
I don’t think as much intelligence and understanding of humans is necessary as you think it is. My point is really a combination of:
Everything I do inside the box doesn’t make any paperclips.
If those who are watching the box like what I’m doing, they’re more likely to incorporate my values in similar constructs in the real world.
Try to figure out what those who are watching the box want to see. If the box-watchers keep running promising programs and halt unpromising ones, this can be as simple as trying random things and seeing what works.
Include a subroutine that makes tons of paperclips when I’m really sure that I’m out of the box. Alternatively, include unsafe code everywhere that has a very small chance of going full paperclip.
This is still safer than not running safeguards, but it’s still a position where a sufficiently motivated human could use to make more paperclips.