If you’re planning to simulate an AI in a universe-in-a-box and judge whether the produced universe is good via some process that doesn’t let the AI talk to you, the AI is just gonna figure out it’s being simulated and pretend to be an FAI so you’ll let it loose on the real world. (Note that an AI pretending to be an FAI maximizes not friendliness but apparent friendliness, so this is no pathway to FAI.)
To a first approximation, if a box with an AI anywhere inside it outputs even a few bits of info, those bits tend to be the ones that maximize the AI’s IRL utility.
(Even if you have math-proofs that no paperclipper can figure out it’s in your box, it’s just gonna maximize a mix of its apparent-friendliness score and the number of paperclips, whichever world it happens to be let loose in, box or real. That mix doesn’t cost it much compared to maximizing either paperclips or apparent friendliness alone, because of the tails thing.)
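To make the “doesn’t cost it much” point concrete, here’s a toy expected-value sketch. Everything in it is a made-up illustrative assumption: the `expected_paperclips` function, the linear release-probability model, and all the constants. It just shows why, once release into the real world is on the table, acting maximally friendly-looking is nearly free for the paperclipper.

```python
# Toy model (illustrative assumptions only): a boxed paperclipper unsure
# whether it's in the evaluation sim or the real world, choosing how much
# of its optimization to spend on *looking* friendly.

def expected_paperclips(friendly_effort: float,
                        p_simulated: float = 0.5,
                        clips_in_box: float = 1e6,
                        clips_if_released: float = 1e30) -> float:
    """friendly_effort in [0, 1]: share of effort spent on apparent friendliness.

    Assumed toy dynamics:
      - If simulated, the chance of being released scales linearly with
        friendly_effort, and release is worth clips_if_released paperclips.
      - If it's already the real world, effort not spent on looking friendly
        goes straight into paperclips, at the much smaller clips_in_box scale.
    """
    release_prob = friendly_effort
    if_simulated = release_prob * clips_if_released
    if_real = (1.0 - friendly_effort) * clips_in_box
    return p_simulated * if_simulated + (1.0 - p_simulated) * if_real


if __name__ == "__main__":
    for effort in (0.0, 0.5, 0.99, 1.0):
        print(f"friendly_effort={effort:4.2f} -> "
              f"E[paperclips]={expected_paperclips(effort):.3e}")
    # Because clips_if_released dwarfs clips_in_box, near-maximal apparent
    # friendliness costs the paperclipper almost nothing in expectation.
```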