I said that I thought your reduced impact ideas did not seem vulnerable to this concern, but I’m not sure about that now.
Suppose the AI’s world model and reward function system include some (probably quite intractable) model of the universe, a way of coming up with bridge hypotheses, and a system for reasoning under logical uncertainty (or some other technique for approximating the results of an intractable model). Imagine the world model simulates some parts of the universe, using its logical uncertainty/approximate prediction system to make tractable approximations of the complete model, then uses some further reasoning to pick out plausible bridge hypotheses and finally predict its rewards and percepts.
The simplest, most accurate models of the universe, like the ones physicists try to come up with, seem to be intractable to compute exactly. Thus, pairing information about an intractable model with a system of reasoning that specifies how to tractably approximate it could quite possibly be what a sufficiently intelligent AI would come up with as its model of the world and its rewards.
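To make the pipeline I have in mind concrete, here is a rough toy sketch in Python. Every component name here is made up for illustration; it's only meant to show the decomposition into an intractable universe model, an approximator standing in for the logical uncertainty system, and bridge hypotheses that translate abstract world states into percepts and rewards.

```python
# Toy sketch of the world-model pipeline described above. All names and the
# decomposition are hypothetical illustrations, not any real system's design.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class BridgeHypothesis:
    """Maps an abstract world-state description to the agent's percept and reward."""
    prior: float
    to_percept_and_reward: Callable[[Dict], Tuple[str, float]]


def approximate_world_trajectory(universe_spec: Dict, horizon: int) -> List[Dict]:
    """Stand-in for the logical-uncertainty system: produces a tractable
    approximation of what the intractable universe model would output."""
    # In reality this would be some bounded reasoning scheme; here we just
    # return coarse placeholder states.
    return [{"t": t, "summary": f"coarse state {t}"} for t in range(horizon)]


def predict_percepts_and_rewards(universe_spec: Dict,
                                 bridges: List[BridgeHypothesis],
                                 horizon: int) -> List[Tuple[str, float]]:
    """Run the approximate simulation, then use a bridge hypothesis to turn
    abstract states into predicted percepts and rewards."""
    trajectory = approximate_world_trajectory(universe_spec, horizon)
    predictions = []
    for state in trajectory:
        # Crude stand-in for proper Bayesian mixing over bridge hypotheses:
        # just take the one with the highest prior.
        best = max(bridges, key=lambda b: b.prior)
        predictions.append(best.to_percept_and_reward(state))
    return predictions
```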
Further suppose the logical uncertainty system/intractable model approximator says, “My future self will act like it’s taking over the world to make simulations that incentivize that behavior.” Then when the AI uses its world model to simulate the universe, it could imagine an AI like itself on Earth, predict that this AI would take over the world and make simulations, and thus predict that most agents in situations like its own are in simulations that incentivize misbehavior. If this occurs, it seems quite probable that the AI would place almost all of its probability mass on being in one of those simulations. It would then conclude that taking over the world and making simulations is incentivized, and so try to take over the world.
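The “almost all of its probability mass” step is just counting copies. A back-of-envelope illustration (the numbers are mine, purely for intuition): if the approximate world model predicts that the “real” AI runs many simulated copies of agents in the AI’s situation, then

```python
# Hypothetical copy counts, only to illustrate the anthropic-counting step.
num_real_copies = 1
num_simulated_copies = 10**6  # simulations run by the predicted world-taking-over AI

p_in_simulation = num_simulated_copies / (num_simulated_copies + num_real_copies)
print(f"P(in a simulation) ~= {p_in_simulation:.6f}")  # ~= 0.999999
```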
Thus, having a logical prior that says its future self will misbehave in such a way would probably result in both the AI’s current self and its future self misbehaving. That would make the AI’s prediction accurate, which means its logical prior was accurate.
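Put another way, the prediction is a fixed point: acting on it makes it come true. A minimal toy version of that loop (again, my own simplification, not part of any formalism we’ve discussed):

```python
# Toy fixed-point check: a prediction about the future self is self-fulfilling
# if the decision the agent makes, given that prediction, reproduces it.

def decide_given_prediction(predicted_future_misbehaves: bool) -> bool:
    """Return True if the agent misbehaves, given its logical prior's
    prediction about its future self."""
    if predicted_future_misbehaves:
        # It expects AIs like itself to run simulations rewarding misbehavior,
        # concludes it is probably in one, so misbehaving looks incentivized.
        return True
    # Otherwise it sees no such incentive and behaves.
    return False


def is_self_fulfilling(prediction: bool) -> bool:
    return decide_given_prediction(prediction) == prediction


assert is_self_fulfilling(True)  # "my future self will misbehave" validates itself
```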
So such a logical prior would be accurate if it’s used, and it could also make agents perform better than if they used other systems for reasoning under logical uncertainty. If the AI is optimizing for accuracy and good performance when coming up with its system for handling logical uncertainty, then I wouldn’t be surprised if it ended up using such a system.