Everything is a matter of perspective.
It’s totally valid to take a perspective in which an AI trained to play Tetris “doesn’t want to play good Tetris, it just searches for plans that correspond to good Tetris.”
Or even that an AI trained to navigate and act in the real world “doesn’t want to navigate the real world, it just searches for plans that do useful real-world things.”
But it’s also a valid perspective to say “you know, the AI that’s trained to navigate the real world really does want the things it searches for plans to achieve.” It’s just semantics in the end.
But! Be careful about switching perspectives without realizing it. When you take one perspective on an AI, and you want to compare it to a human, you should keep applying that same perspective!
From the perspective where the real-world-navigating AI doesn’t really want things, humans don’t really want things either. They’re merely generating a series of outputs that they think will constitute a good plan for moving their bodies.
Thanks, that’s a really helpful framing!
The RL agent will only know whether its plans are any good if they actually get carried out. The reward signal is something it essentially has to seek out through trial and error. All (most?) RL agents start out knowing nothing about the impact their plans will have, or even about the causal structure of the environment. All of that has to be learned through experience.
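To make that loop concrete, here's a minimal tabular Q-learning sketch on a standard gymnasium gridworld (the environment choice and the hyperparameters are just illustrative, not anything special):

```python
# Minimal tabular Q-learning sketch: the agent starts out knowing nothing about
# its environment's causal structure or reward, and only learns by acting and
# seeing what reward actually comes back.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")  # small gridworld with a sparse reward
Q = np.zeros((env.observation_space.n, env.action_space.n))  # all zeros: no initial knowledge

alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    obs, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: mostly exploit current guesses, sometimes explore.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[obs]))

        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # The only feedback about whether the "plan" was any good is this reward,
        # which only arrives because the action was actually carried out.
        Q[obs, action] += alpha * (reward + gamma * np.max(Q[next_obs]) - Q[obs, action])
        obs = next_obs
```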
For agents that play board games like chess or Go, the environment can be fully determined in simulation. So, sure, in those cases you can have them generate plans and then not take their advice on a physical game board. And for well-trained agents, those plans do tend to be power-seeking in the sense that they steer toward states that maximize the agent's own winnable options while minimizing its opponent's.
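As a toy illustration of that kind of evaluation (the `sim` object and its methods `winnable_options`, `legal_moves`, and `apply` are hypothetical stand-ins for a game simulator, not any real library):

```python
# Toy sketch of plan evaluation in a fully simulated board game. Nothing here
# ever touches a physical board; a human decides whether a proposed move
# actually gets played.

def mobility_score(sim, state, me, opponent):
    # "Power" in the loose sense used above: how many winnable options I keep
    # open minus how many my opponent keeps open.
    return sim.winnable_options(state, me) - sim.winnable_options(state, opponent)

def propose_move(sim, state, me, opponent):
    # Evaluate every legal move purely in simulation and propose the one that
    # maximizes the agent's options while minimizing the opponent's.
    scored = [
        (mobility_score(sim, sim.apply(state, me, move), me, opponent), move)
        for move in sim.legal_moves(state, me)
    ]
    return max(scored, key=lambda pair: pair[0])[1]
```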
However, for an AI to generate power-seeking plans for the real world, it would need access either to a very computationally expensive simulator or to the actual real world. The latter setup is easier to design but more dangerous to train in, above a certain level of capability.
I agree with everything you've said. Obviously, an AI (in most domains) would need to evaluate its plans in the real world to acquire training data. But my point is that we have the choice not to carry out some of the agent's plans in the real world. For some of the AI's plans, we can say no; we have a veto button. It seems to me that the AI would be completely fine with that. Is that correct? If so, it makes safety a much more tractable problem than it otherwise would be.
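Concretely, the arrangement I'm imagining is something like this (every name here, `agent.propose_plan`, `environment.execute`, and so on, is hypothetical, not a real API):

```python
# Hypothetical sketch of the "veto button": every plan the agent proposes is
# shown to a human before anything is executed in the real world.

def run_with_veto(agent, environment, get_human_approval):
    plan = agent.propose_plan(environment.observe())
    if not get_human_approval(plan):
        return None                   # vetoed: the plan is never carried out
    return environment.execute(plan)  # approved: executed, outcome observed
```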
The problem is that at the beginning, its plans are generally going to be complete nonsense. It needs a ton of interaction with its environment (or at least with a reasonable model of it), learning both its reward signal and its causal structure, before its outputs approach anything sensible.
There is no utility for the RL agent's operators in having an oracle AI with no practical experience. The power of RL is that a simple feedback signal can teach the agent everything it needs to know to act rationally in its environment. But if you want it to make rational plans for the real world without ever letting it get direct feedback from the real world, you have to manually add vast layers of additional computational complexity to its training, complexity that would be taken care of more or less automatically by an RL agent interacting with the real world. The incentives aren't in your favor here.
Thanks, I appreciate the explanation!