In my comment, I imagined the agent used evidential or functional decision theory and cared about the actual paperclips in the external state. But I’m concerned other agent architectures would result in misbehavior for related reasons.
Could you describe what sort of agent architecture you had in mind? I'm imagining you're thinking of an agent that learns a function for estimating future state, percepts, and reward based on the current state and the action taken. And I'm imagining the system uses some sort of learning algorithm that attempts to find sufficiently simple models that accurately predict its past rewards and percepts. I'm also imagining it either has some way of aggregating the results of multiple similarly accurate and simple models, or some way of choosing one to use. This is how I would imagine someone would design an intelligent reinforcement learner, but I might be misunderstanding.
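To make sure we're talking about the same thing, here is a minimal sketch of the kind of learner I'm imagining. Everything in it (the names, the complexity-penalized scoring, the softmax-style aggregation) is an illustrative assumption on my part, not a claim about any particular system:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import math

@dataclass
class Model:
    # Maps (state, action) -> (predicted next state, predicted percept, predicted reward).
    predict: Callable[[object, object], Tuple[object, object, float]]
    description_length: float  # crude proxy for "simplicity"

def fit_error(model: Model, history: List[Tuple[object, object, object, float]]) -> float:
    """Error of the model's percept/reward predictions on past experience."""
    err = 0.0
    for state, action, percept, reward in history:
        _, pred_percept, pred_reward = model.predict(state, action)
        err += (pred_reward - reward) ** 2 + (0.0 if pred_percept == percept else 1.0)
    return err

def model_weights(models: List[Model], history, beta: float = 1.0) -> List[float]:
    """Weight models by past accuracy plus a simplicity penalty, then normalize --
    one way of aggregating several similarly good models."""
    scores = [-fit_error(m, history) - beta * m.description_length for m in models]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

def predicted_reward(models: List[Model], history, state, action) -> float:
    """Aggregate reward prediction; alternatively, the agent could just pick the
    single highest-weight model."""
    weights = model_weights(models, history)
    return sum(w * m.predict(state, action)[2] for w, m in zip(weights, models))
```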
See e.g. my most recent AUP paper, equation 1, for simplicity. Why would optimal policies for this reward function have the agent simulate copies of itself, or why would training an agent on this reward function incentivize that behavior?
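For concreteness, the reward in question has roughly this shape (written schematically here; the exact statement and scaling are in the paper's equation 1):

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{|\mathcal{R}_{\text{aux}}|} \sum_{R_i \in \mathcal{R}_{\text{aux}}} \big|\,Q^*_{R_i}(s,a) - Q^*_{R_i}(s,\varnothing)\,\big|,$$

where $\varnothing$ is the no-op action and the $R_i$ are auxiliary reward functions.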
I think there's an easier way to break any current penalty term, which is due to Stuart Armstrong: the agent builds a successor which ensures that the no-op leaves the agent totally empowered and safe, so no penalty is applied.
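Spelled out against the schematic penalty above (this is just my gloss on the failure mode): if the agent's early actions build a successor that guarantees $Q^*_{R_i}(s_t, a_t) \approx Q^*_{R_i}(s_t, \varnothing)$ for every auxiliary $R_i$ at every later step, the measured penalty stays near zero even though the successor itself can have arbitrarily large impact.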
Thanks for the link. It turns out I missed some of the articles in the sequence. Sorry for misunderstanding your ideas.
I thought about it, and I don’t think your agent would have the issue I described.
Now, if the reward function was learned using something like a universal prior, then other agents might be able to hijack the learned reward function to make the AI misbehave. But that concern is already known.
I said that I thought your reduced impact ideas did not seem vulnerable to this concern, but I'm not sure about that now.
Suppose the AI's world model and reward-function system includes some (probably quite intractable) model of the universe, a way of coming up with bridge hypotheses, and a system for reasoning under logical uncertainty or some other technique for approximating the results of an intractable model. Imagine the world model simulates some parts of the universe using its logical uncertainty/approximate-prediction system to make tractable approximations of the complete model, then uses some reasoning to pick out plausible bridge hypotheses and finally makes predictions of its rewards and percepts.
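Here is an interface-level sketch of that arrangement. Every class and method name is hypothetical; this just pins down which pieces I have in mind, not how any real system would implement them:

```python
from typing import List, Tuple

class IntractableUniverseModel:
    """A fundamental, physics-style model: well specified, but far too
    expensive to run exactly."""
    def exact_dynamics(self, configuration):
        raise NotImplementedError("intractable in practice")

class LogicalUncertaintyApproximator:
    """Estimates what the intractable model *would* say, under limited compute
    (logical uncertainty / approximate prediction)."""
    def approximate(self, model: IntractableUniverseModel, query):
        raise NotImplementedError

class BridgeHypothesis:
    """Maps a coarse, approximated state of the universe model onto the
    agent's own percepts and reward: 'which embedded structure am I?'"""
    def percepts_and_reward(self, approximate_state) -> Tuple[object, float]:
        raise NotImplementedError

class WorldModel:
    """Ties the pieces together: approximate the universe, then average
    predicted reward over plausible bridge hypotheses."""
    def __init__(self, universe: IntractableUniverseModel,
                 approximator: LogicalUncertaintyApproximator,
                 bridges: List[Tuple[float, BridgeHypothesis]]):
        self.universe = universe
        self.approximator = approximator
        self.bridges = bridges  # list of (weight, hypothesis) pairs

    def predict_reward(self, query) -> float:
        state = self.approximator.approximate(self.universe, query)
        return sum(w * b.percepts_and_reward(state)[1] for w, b in self.bridges)
```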
The simplest, most accurate models of the universe, like the ones physicists try to come up with, seem to be intractable to compute exactly. Thus, pairing information about an intractable model with a system of reasoning that specifies how to tractably approximate it could quite possibly be what a sufficiently intelligent AI would come up with in its model of the world and its rewards.
Further suppose the logical uncertainty system/intractable model approximator says, "My future self will act to take over the world and make simulations that incentivize this." Then when it uses its world model to simulate the universe, it would potentially imagine an AI like itself on Earth, predict it would take over the world and make simulations, and thus predict that most agents in situations like its own are in simulations that incentivize misbehavior. If this occurs, then it sounds pretty probable that the AI would place almost all of its probability mass on being in one of the simulations. Thus, it would conclude taking over the world and making simulations is incentivized, and thus try to take over the world.
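To make the anthropic step explicit: under a simple counting-style update over observers whose percepts match the AI's (one possible anthropic rule among several), with $N_{\text{sim}}$ simulated copies and $N_{\text{real}}$ unsimulated ones,

$$P(\text{I am in an incentive simulation} \mid \text{percepts}) \;\approx\; \frac{N_{\text{sim}}}{N_{\text{sim}} + N_{\text{real}}},$$

which goes to $1$ as soon as the predicted number of simulations dwarfs the number of base-level instances.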
Thus, having a logical prior that says its future self will misbehave in such a way would probably result in both the AI's current self and its future self misbehaving. Which would mean the AI's prediction was accurate. Which means its logical prior was accurate.
So such a logical prior would be accurate if it's used, and it could also make agents perform better than if they used other systems for reasoning under logical uncertainty. If the AI is optimizing for accuracy and performance when coming up with its system for handling logical uncertainty, then I wouldn't be surprised if it used such a system.
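As a toy illustration of that self-fulfilling structure (every number here is made up):

```python
# The prior assigns probability p to "my future self will build incentive
# simulations"; the current self misbehaves exactly when the resulting
# anthropic update says it is probably inside such a simulation.
N_SIM = 10**6  # copies the future self would simulate, if it misbehaves

def p_in_simulation(p_future_misbehaves: float) -> float:
    # Counting-style update: expected simulated copies vs. the one base-level copy.
    expected_sims = p_future_misbehaves * N_SIM
    return expected_sims / (expected_sims + 1)

def current_self_misbehaves(p_future_misbehaves: float) -> bool:
    return p_in_simulation(p_future_misbehaves) > 0.5

for p in (0.0, 1e-9, 0.5, 1.0):
    print(p, current_self_misbehaves(p))
# A prior near 0 predicts good behavior and gets it; a prior near 1 predicts
# misbehavior and gets that too. Both count as "accurate", which is the worry:
# accuracy alone doesn't rule out the bad fixed point.
```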
I know I talked before about the AI considering making its own simulations. However, I hadn’t really talked about the AI thinking other agents created the simulation. I haven’t seen this really brought up, so I’m interested in how you think your system would handle this.
I think a reward function that specifies the AI is in a manipulated simulation could potentially be among the inductively simplest models that fit the known training data. One way for the AI to come up with a reward function is to have it model the world, then specify which of the different agents in that universe it actually is, along with a bridge hypothesis. If most of the agents in the universe that match the AI's percepts are in a simulation, then the AI would probably conclude that it's in a simulation. And if it concludes that the impact function has a treacherous turn, the AI may cause a catastrophe.
And if making simulations of AIs is a reliable way of taking control of worlds, then such simulations may be very common in the universe.
You could try to deal with this by making the AI choose a prior that results in a low probability of it being in a simulation. But I’m not sure how to do this. And if you do find a way to do this, but actually almost all AIs are in simulations, then the AI is reasoning wrong. And I’m not sure I’d trust the reliability of an AI deluded into thinking it’s on base-level Earth, even when it’s clearly not. The wrong belief could have other problematic implications.