Forcing false beliefs on an AI seems like it could be a very bad idea. Once it learns enough about the world, the best explanations it can find consistent with those false beliefs might be very weird.
(You might think that beliefs about being in a simulation are obviously harmless because they’re one level removed from object-level beliefs about the world. But if you think you’re in a simulation then careful thought about the motives of whoever designed it, the possible hardware limitations on whatever’s implementing it, the possibility of bugs, etc., could very easily influence your beliefs about what the allegedly-simulated world is like.)
I agree. Note though that the beliefs I propose aren’t actually false. They are just different from what humans believe, but there is no way to verify which of them is correct.
You are right that it could lead to some strange behavior from the point of view of a human, whose priors differ from the AI's. However, that is kind of the point of the theory. After all, the plan is to deliberately induce behaviors that are beneficial to humanity.
The question is: after giving an AI strange beliefs, would the unexpected effects outweigh the planned effects?