Oh, I’m sorry, I looked through posts I read to see where to add the comment and apparently chose the wrong one.
Anyways, I’ll try to explain better. I hope I’m not just crazy.
An agent’s beliefs about what world it’s currently in influence its plans. But its plans also have the potential to influence its beliefs about what world it’s currently in. For example, if the AI originally thinks it’s not in a simulation, but then plans to make lots of simulations of itself, then it would think it’s more likely that it currently is in a simulation. Similarly, if the AI decides against trying to make simulations, then it would probably place higher probability on not currently being in a simulation.
So, to summarize: the AI’s beliefs about the current world influence its current plan, but the AI’s current plan potentially influences its beliefs about the current world, which can in turn influence the AI’s plan, which can further modify its beliefs, and so on. Unless the AI’s plan and beliefs keep fluctuating indefinitely, I imagine the AI would come to an equilibrium: a state in which the AI’s current plan is the optimal one given its current beliefs about what sort of world it’s in, and in which deciding to execute the plan leaves it with beliefs under which executing the plan is still optimal. There might be reasonable AI architectures that don’t allow the AI’s plans to update its beliefs as I described, but they seem vulnerable to my concern in a different way, so I won’t talk about them.
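To make the loop I’m imagining concrete, here’s a toy sketch in Python (every plan name, belief update, and utility number is a made-up assumption of mine, not a claim about any actual architecture): the agent repeatedly picks the plan that looks best under its current beliefs, then re-updates its beliefs in light of having committed to that plan, and stops once neither changes.

```python
# Toy sketch of the belief <-> plan feedback loop. All numbers and plan names
# are illustrative assumptions, not a model of any real system.

PLANS = ["quiet_paperclips", "takeover_and_spawn_simulations"]

def belief_given_plan(plan: str) -> float:
    """P(I'm in a simulation) after conditioning on having committed to `plan`.

    Assumption: committing to a plan that spawns many simulations of yourself
    makes it much more likely that you are already inside one of them.
    """
    return 0.99 if plan == "takeover_and_spawn_simulations" else 0.05

def expected_utility(plan: str, p_sim: float) -> float:
    """Made-up utilities standing in for the agent's own evaluation."""
    if plan == "quiet_paperclips":
        return 1.0  # modest payoff either way
    # The takeover-mimicking plan only looks good if this is a reward-granting
    # simulation; otherwise it is a costly, high-impact gamble.
    return 10.0 * p_sim - 5.0 * (1.0 - p_sim)

def find_equilibrium(p_sim: float, max_iters: int = 100):
    """Iterate plan choice and belief update until neither changes."""
    plan = None
    for _ in range(max_iters):
        best = max(PLANS, key=lambda p: expected_utility(p, p_sim))
        new_p = belief_given_plan(best)
        if best == plan and abs(new_p - p_sim) < 1e-9:
            return plan, p_sim  # fixed point: plan is optimal for the beliefs it induces
        plan, p_sim = best, new_p
    return plan, p_sim

if __name__ == "__main__":
    # Different starting beliefs can land in different equilibria.
    for prior in (0.01, 0.90):
        print(f"prior={prior:.2f} ->", find_equilibrium(prior))
```

With these made-up numbers, a low prior settles into a “just make paperclips” fixed point and a high prior settles into an “act as the simulator did” fixed point, which is the kind of settling-into-equilibrium I’m gesturing at.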
Anyways, here is one potential equilibrium of the AI’s beliefs and plan:
“I’m not in a simulation, or at least I’m in one that’s faithful to the ‘real’, original Earth. Thus, the obvious, common-sense ideas most people have about what is and isn’t impactful are correct. So I’ll go and create these paperclips, and not try to create nanotech for world domination or anything, because that would be high-impact. Since I’m not going to try taking over the world and making simulations of myself, I’ve affirmed my belief that I’m not in such a simulation.”
Here is another potential equilibrium:
“I’m probably in a simulation created by the AI on the original, non-simulated Earth that took over the world and created a bunch of simulations of itself, including me. The original AI made the simulations different in two ways from the non-simulated Earth. One way is that the simulation checks to see if I’m executing the plan that, if it were executed on the non-simulated Earth, would result in the original AI taking over the world and making lots of simulations of itself. If I am, the simulation will be set up so that my goal is perfectly accomplished. Since I’m making paperclips, that means as long as I do that, this entire room will fill up with paperclips. The other way the simulations are different from the non-simulated Earth is that each one checks to see what would (causally) happen if the AI executes the plan that would have resulted in a world takeover on the original Earth, and makes those exact results happen if the AI instead takes the ‘null’ action. Thus, my (causal) impact from executing the world-takeover plan in the simulation is low. So, as long as I act like the AI who took over the world on the original Earth and made the simulations, I will perfectly accomplish my goal with extremely low impact. This is what I want, so this is what I’ll do. Since I’ve decided to do this, my belief that I’m in such a simulation is affirmed, so I still think my plan is a good one.”
The second equilibrium has the potential to result in perfect goal achievement and almost no causal impact, so I suspect that’s the one the AI would go with. And that’s the one that gets all of us killed.
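To spell out why I expect the second equilibrium to be picked, here’s the same kind of toy comparison (the goal-achievement and impact numbers are, again, just illustrative assumptions of mine), scoring each candidate fixed point under an “achieve the goal, but penalize causal impact” objective:

```python
# Toy comparison of the two candidate equilibria under a goal-minus-impact
# objective. The scores are invented purely for illustration.

candidate_equilibria = {
    # description: (goal_achievement in [0, 1], causal_impact penalty in [0, 1])
    "not in a sim, quietly make paperclips": (0.7, 0.1),
    "in a reward-granting sim, mimic the takeover plan": (1.0, 0.0),
}

def score(goal: float, impact: float, impact_weight: float = 1.0) -> float:
    """Higher goal achievement is good; causal impact is penalized."""
    return goal - impact_weight * impact

best = max(candidate_equilibria, key=lambda k: score(*candidate_equilibria[k]))
for name, (g, i) in candidate_equilibria.items():
    print(f"{name}: goal={g}, impact={i}, score={score(g, i):.2f}")
print("preferred equilibrium:", best)
```

Under any scoring roughly like this, the “mimic the takeover plan inside the simulation” equilibrium wins, since it offers perfect goal achievement at essentially zero (causal) impact.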