I’m generally on board with attempts to have more precise options for referring to these concepts, and in this context I agree that “policy” is the more appropriate term and that gradients from RL training don’t magically include more agent juice.
That said, I do think there is an important distinction between the tendencies of systems built with RL versus supervised learning that arises from reward sparsity.
In traditional RL, individual policy outputs aren’t judged in as much detail as in supervised learning. Even with reward shaping, the reward signal is still likely to be far less densely defined and constraining than, say, a per-output predictive loss.
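To make that density contrast concrete, here is a minimal, purely illustrative sketch in PyTorch (the tensor names, shapes, and toy setup are my own assumptions for the example, not anything from the discussion above): a per-token predictive loss produces a separate error term at every output position, while a REINFORCE-style update on a sparse terminal reward spreads a single scalar across the whole trajectory.

```python
import torch
import torch.nn.functional as F

# Toy, hypothetical setup: a batch of token sequences from some model/policy.
batch, seq_len, vocab_size = 4, 12, 50
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch, seq_len))

# Predictive (supervised) loss: every position gets its own error term,
# so the gradient is dense and tightly constrains the whole output.
predictive_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)

# Sparse-reward RL, REINFORCE-style: the whole trajectory is judged by one
# scalar at the end, so every position shares a far less informative signal.
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
log_probs = dist.log_prob(actions)        # shape: (batch, seq_len)
terminal_reward = torch.randn(batch, 1)   # one number per episode
rl_loss = -(log_probs * terminal_reward).sum(dim=1).mean()
```

The only point of the sketch is how many distinct error terms feed each gradient step; everything else about the setup is arbitrary.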
Since the target is smaller and more distant, traditional RL gives the optimizer more room to roam. I think it’s correct to say that most RL-trained policies will have a lot of reactive bits and pieces that are selected to form the final behavior, but because learning instrumental behavior is effectively required for traditional RL to get anywhere at all, it’s more likely than under predictive loss that nonmyopic, goal-like internal representations will be learned as part of those instrumental behaviors.
Training on a purely predictive loss, in contrast, is both densely informative and extremely constraining. Goals are less obviously convergently useful, and any internal goal representations that do get learned need to fit within the bounds enforced by the predictive loss, and as a result should tend to be more local in scope. Learned values that overstep their narrowly defined usefulness get directly slapped by other predictive samples.
I think the greater freedom RL training tends to have, and its greater tendency to learn broadly applicable internal goals to drive the required instrumental behavior, do make RL-trained systems feel more “agentic,” even if that agentic flavor is not fundamental to the training process itself, nor really tied to the model’s coherence.