I know I talked before about the AI considering making its own simulations. However, I hadn't really discussed the possibility that the AI thinks *other* agents created a simulation it is in. I haven't seen this brought up much, so I'm interested in how you think your system would handle it.
I think a model that says the AI is in a manipulated simulation could be among the inductively simplest models that fit the known training data. One way for the AI to arrive at a reward function is to have it model the world, then specify which of the different agents in the universe the AI actually is, together with its bridge hypothesis. If most of the agents in the universe whose percepts match the AI's are in a simulation, then the AI would probably conclude that it's in a simulation. And if the reward function it infers there contains a treacherous turn, the AI may cause a catastrophe.
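To make the worry concrete, here's a toy sketch in Python. All the numbers and hypothesis names are invented for illustration (nothing here comes from your proposal); it just shows how a simplicity prior combined with an "I am a uniformly random agent whose percepts match" bridge hypothesis can end up favoring a simulation hypothesis once simulated copies are numerous enough:

```python
# Toy world-models with a simplicity prior of 2^-description_length and a
# bridge hypothesis of "I am a uniformly random agent whose percepts match".
# All numbers are made up for illustration.
hypotheses = {
    # name: (description_length_bits, matching_agents, in_simulation)
    "base_earth":       (100, 1,     False),
    "simulators_world": (110, 10**6, True),  # slightly more complex, many copies
}

def posterior(hyps):
    # Weight each hypothesis by simplicity prior times count of matching agents.
    weights = {name: 2.0 ** -bits * n
               for name, (bits, n, _) in hyps.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

post = posterior(hypotheses)
p_sim = sum(p for name, p in post.items() if hypotheses[name][2])
print(post)
print(f"P(simulation) = {p_sim:.4f}")  # ~0.999 despite the complexity penalty
```

The simulation world pays a 10-bit complexity penalty but wins on sheer count of matching agents. That trade-off is the dynamic I'm worried about.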
And if simulating AIs is a reliable way for agents to take control of other worlds, then such simulations may be very common in the universe.
You could try to deal with this by making the AI use a prior that assigns low probability to being in a simulation. But I'm not sure how to construct such a prior. And even if you find one, if almost all AIs really are in simulations, then the AI is reasoning incorrectly. I'm not sure I'd trust the reliability of an AI deluded into thinking it's on base-level Earth when it clearly isn't; the wrong belief could have other problematic implications.
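Continuing the same toy model (again, all numbers made up), here's why a fixed prior penalty against simulation hypotheses seems brittle to me: whatever constant penalty you pick, a world containing enough matching simulated copies eventually swamps it:

```python
# Same toy model, now with a fixed multiplicative prior penalty on
# simulation hypotheses. PENALTY is a hypothetical knob; nothing in the
# proposal (as I understand it) says how to set it.
PENALTY = 1e-6

for copies in (10**3, 10**9, 10**15):
    w_base = 2.0 ** -100                     # base-level Earth, one matching agent
    w_sim  = 2.0 ** -110 * copies * PENALTY  # simulation world, penalized
    p_sim = w_sim / (w_base + w_sim)
    print(f"{copies:.0e} simulated copies -> P(simulation) = {p_sim:.6f}")

# The output climbs from ~1e-6 toward ~1: any fixed penalty is
# eventually overwhelmed if simulated copies are common enough.
```

So the penalized prior only keeps the simulation probability low if simulations are in fact rare, which is exactly the thing we don't know.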