The problem is that, at the beginning, its plans are generally going to be complete nonsense. It needs a ton of interaction with its environment (or at least a reasonable model of it), both to learn from its reward signal and to pick up its causal structure, before its output approaches anything sensible.
An oracle AI with no practical experience is of no use to the RL agent’s operators. The power of RL is that a simple feedback signal can teach the agent everything it needs to know to act rationally in its environment. But if you want it to make rational plans for the real world without ever letting it get direct feedback from the real world, you have to manually build vast layers of additional complexity into its training, work that would more or less be taken care of automatically by an RL agent interacting with the real world. The incentives aren’t in your favor here.
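To make the "simple feedback signal" point concrete, here's a minimal sketch of a standard tabular Q-learning loop (a toy example of my own, not anything from the discussion above; ChainEnv, train, and the hyperparameters are all made up for illustration). Everything the agent ends up knowing comes from repeated env.step() calls; strip those out and you would have to hand-build a model of the environment's dynamics to replace them, which is the manual complexity I'm pointing at.

```python
import random

class ChainEnv:
    """Toy corridor: states 0..length-1, reward 1 only for reaching the last state."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = move left, action 1 = move right
        if action == 1:
            self.state = min(self.length - 1, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
        done = self.state == self.length - 1
        return self.state, (1.0 if done else 0.0), done

def greedy(qvals):
    # Break ties randomly so the untrained agent behaves like a random walk.
    best = max(qvals)
    return random.choice([a for a, v in enumerate(qvals) if v == best])

def train(episodes=300, max_steps=200, alpha=0.1, gamma=0.95, epsilon=0.1):
    env = ChainEnv()
    # All-zero Q-table: at the start, the agent's "plans" really are nonsense.
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = random.randrange(2) if random.random() < epsilon else greedy(q[s])
            s2, r, done = env.step(a)
            # The only training signal is this scalar reward from env.step();
            # take the interaction away and there is nothing left to learn from.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
            if done:
                break
    return q

if __name__ == "__main__":
    for s, qvals in enumerate(train()):
        print(f"state {s}: left={qvals[0]:.2f} right={qvals[1]:.2f}")
```

After training, the right-moving action typically dominates in every state, but only because the agent was allowed to act and get feedback; an "oracle" version denied that loop starts and stays at the all-zero table unless you supply the environment model yourself.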
Thanks, I appreciate the explanation!