The RL agent will only know whether its plans are any good if they actually get carried out. The reward signal is something it essentially has to seek out through trial and error. All (most?) RL agents start out not knowing anything about the impact their plans will have, or even anything about the causal structure of the environment. All of that has to be learned through experience.
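To make that concrete, here is a minimal Q-learning sketch (the toy "walk to the end of a chain" environment and all the numbers in it are made up purely for illustration): the value table starts at zero, and everything the agent ends up "knowing" about the environment's rewards and causal structure is absorbed from trial-and-error interaction.

```python
# Minimal sketch: a tabular Q-learning agent that starts with zero knowledge of
# its environment (all values initialised to 0) and only learns which actions
# are worthwhile from trial-and-error feedback. The "chain" environment below
# is a made-up stand-in, not a reference to any real system.
import random

N_STATES = 6          # states 0..5; the only reward is at the far end
ACTIONS = [-1, +1]    # step left or step right

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # knows nothing yet

alpha, gamma, epsilon = 0.1, 0.9, 0.2
for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit what it has learned so far, sometimes explore
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state, reward, done = step(state, action)
        # the update uses nothing but the scalar reward; the causal structure of
        # the environment gets absorbed implicitly into the value estimates
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES)})
```

After a few hundred episodes the greedy policy it prints heads right from every non-terminal state, but none of that was available to the agent up front; it all came out of the interactions.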
For agents that play board games like chess or Go, the environment can be fully determined in simulation. So, sure, in those cases you can have them generate plans and then not take their advice on a physical game board. And those plans do tend to be power-seeking for well-trained agents, in the sense that they tend to reach states that maximize the number of winnable options available to them while minimizing those available to their opponents.
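To pin down what I mean by that tendency, here is a hand-written caricature of it (the Game interface and the legal_moves/apply method names are hypothetical; a trained agent is not literally coded this way, the behaviour just emerges from training): prefer moves that leave you the most continuations and your opponent the fewest, all evaluated inside the simulator rather than on a physical board.

```python
# Hedged caricature of option-maximising play in a fully simulated game.
# `game` is assumed to expose two hypothetical methods:
#   legal_moves(player) -> list of moves available to that player
#   apply(move)         -> the successor position after playing the move
def mobility(game, player):
    """Number of legal moves available to `player` in this position."""
    return len(game.legal_moves(player))

def pick_option_maximising_move(game, player, opponent):
    """Greedily pick the move that keeps our options open and closes the opponent's."""
    best_move, best_score = None, float("-inf")
    for move in game.legal_moves(player):
        successor = game.apply(move)          # evaluated purely in simulation
        score = mobility(successor, player) - mobility(successor, opponent)
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```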
However, for an AI to generate power-seeking plans for the real world, it would need access either to a very computationally expensive simulator or to the actual real world. The latter is an easier setup to design but more dangerous to train in, above a certain level of capability.
I agree with everything you’ve said. Obviously, AI (in most domains) would need to evaluate its plans in the real world to acquire training data. But my point is that we have the choice not to carry out some of the agent’s plans in the real world. For some of the AI’s plans, we can say no; we have a veto button. It seems to me that the AI would be completely fine with that. Is that correct? If so, it makes safety a much more tractable problem than it otherwise would be.
The problem is that at the beginning, its plans are generally going to be complete nonsense. It needs a ton of interaction with its environment (or at least a reasonable model of it), learning both its reward signal and its causal structure, before it approaches a sensible output.
An oracle AI with no practical experience is of no use to the RL agent’s operators. The power of RL is that a simple feedback signal can teach the agent everything it needs to know to act rationally in its environment. But if you want it to make rational plans for the real world without letting it get direct feedback from the real world, you have to manually build vast layers of additional computational machinery into its training, work that would more or less be taken care of automatically for an RL agent interacting with the real world. The incentives aren’t in your favor here.
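To spell out what I mean by manually adding that complexity: if the agent never gets to act, someone has to hand it the environment's dynamics explicitly. A rough sketch of the asymmetry (the model and reward_fn arguments are placeholders I am inventing here): every causal fact that the trial-and-error loop would have picked up for free from real interaction now has to be supplied by the operators.

```python
# Hedged sketch: scoring a plan *without* letting the agent act requires the
# operators to supply the world model themselves. `model(state, action)` and
# `reward_fn(state)` are hypothetical callables standing in for all the causal
# knowledge a model-free agent would otherwise learn from direct interaction.
def evaluate_plan_offline(plan, start_state, model, reward_fn, gamma=0.99):
    """Return the discounted reward a plan would earn under a supplied world model."""
    state, total, discount = start_state, 0.0, 1.0
    for action in plan:
        state = model(state, action)      # every transition must be hand-provided
        total += discount * reward_fn(state)
        discount *= gamma
    return total
```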
Thanks, I appreciate the explanation!