I would propose a third reason, which is just that learning done by the RL algorithm happens after the agent has taken all of its actions in the episode, whereas learning done inside the model can happen during the episode.
This is not true of RL algorithms in general—if I want, I can make weight updates after every observation. And yet, I suspect that if I meta-train a recurrent policy using such an algorithm on a distribution of bandit tasks, I will get a ‘learning-to-learn’ style policy.
So I think this is a less fundamental reason, though it is true in off-policy RL.
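For concreteness, here is a rough sketch of the kind of setup I have in mind (the architecture, hyperparameters, and task are all arbitrary choices of mine, purely illustrative): an LSTM policy trained on a distribution of two-armed Bernoulli bandits, with a REINFORCE-style weight update made after every single pull rather than only at episode boundaries. The previous action and reward are fed back in as observations, so any within-episode adaptation has to be carried by the recurrent state.

```python
# Rough sketch (architecture/hyperparameters arbitrary): meta-training an LSTM
# policy on a distribution of two-armed Bernoulli bandits, with a weight update
# after every single pull rather than only at episode boundaries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentPolicy(nn.Module):
    def __init__(self, n_arms=2, hidden=32):
        super().__init__()
        self.cell = nn.LSTMCell(n_arms + 1, hidden)   # input: one-hot prev action + prev reward
        self.head = nn.Linear(hidden, n_arms)

    def forward(self, obs, state):
        h, c = self.cell(obs, state)
        return torch.distributions.Categorical(logits=self.head(h)), (h, c)

policy = RecurrentPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-3)
baseline = 0.0

for episode in range(2000):                       # each episode is a freshly sampled task
    arm_probs = torch.rand(2)
    obs, state = torch.zeros(1, 3), None
    for t in range(50):
        dist, state = policy(obs, state)
        action = dist.sample()
        reward = torch.bernoulli(arm_probs[action])

        # Weight update after *every* observation: in a bandit the immediate
        # reward is the full return for this action, so a one-step REINFORCE
        # update is available right away.
        loss = -(dist.log_prob(action) * (reward - baseline)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        baseline = 0.99 * baseline + 0.01 * reward.item()

        # Detach the carried state so the next step's update doesn't backprop
        # through weights that have already changed.
        state = (state[0].detach(), state[1].detach())
        obs = torch.cat([F.one_hot(action, 2).float(), reward.view(1, 1)], dim=1)
```

My suspicion is that a policy trained this way still ends up exploring and then exploiting within an episode, even though every weight update happens strictly after the action it concerns.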
This is not true of RL algorithms in general—if I want, I can make weight updates after every observation.
You can’t update the model based on its action until it’s taken that action and gotten a reward for it. It’s obviously possible to throw in updates based on past data whenever you want, but that’s beside the point—the point is that the RL algorithm only gets new information with which to update the model after the model has taken its action, which means that if taking actions requires learning, then the model itself has to do that learning.
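To spell out the ordering I have in mind, here is a toy example (schematic, not any particular algorithm): the policy has to produce its action using only its current weights, and the reward, which is the new information, only arrives afterwards, so the weight update that uses it can only affect later timesteps.

```python
# Schematic of the ordering I'm pointing at (a toy numpy example): the policy
# must produce a_t using only its current weights; the reward r_t arrives
# afterwards, so the update that uses it can only influence later timesteps.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                      # softmax preferences over two arms
arm_probs = np.array([0.2, 0.8])         # the (unknown) environment

for t in range(1000):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=probs)           # 1. act, using only the current weights
    r = rng.binomial(1, arm_probs[a])    # 2. the environment reveals the reward
    grad_logp = np.eye(2)[a] - probs     # 3. only now can the RL update happen,
    theta += 0.05 * r * grad_logp        #    too late to have influenced a_t itself
```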
I interpreted your previous point to mean you only take updates off-policy, but now I see what you meant. When I said you can update after every observation, I meant that you can update once you have made an environment transition and have an (observation, action, reward, observation) tuple. I now see that you meant the RL algorithm doesn’t have the ability to update on the reward before the action is taken, which I agree with. I’m still not convinced, however.
And can we taboo the word ‘learning’ for this discussion, or keep it to the standard ML meaning of ‘update model weights through optimisation’? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin points out elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required becoming very sophisticated indeed, possessing interesting cognitive structure. But it doesn’t have to be the same kind of responsiveness as the learning process of an RL agent, and it doesn’t necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.
You can’t update the model based on its action until it’s taken that action and gotten a reward for it.
Right, I agree with that.
Now, I understand that you argue that if a policy were to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby ‘updating’ based on returns it hasn’t yet received and actions it hasn’t yet taken. I agree that it’s possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don’t do anything so sophisticated). It isn’t clear to me a priori that this is the best strategy from the policy’s point of view.
But note that this argument means that, to the extent learned responsiveness can do more than the RL algorithm’s weight updates can, the advantage cannot be due to recurrence. If it were, then the RL algorithm could just simulate the recurrent updates using the agent’s weights, achieving performance parity. So for what you’re describing to be the explanation for emergent learning-to-learn, you’d need the model to do all of its learned ‘learning’ within a single forward pass. I don’t find that very plausible—or rather, I wouldn’t be inclined to describe whatever advantageous responsive computation happens in the forward pass as learning.
You might argue that today’s RL algorithms can’t simulate the required recurrence using the weights—but that is a different explanation to the one you state, and essentially the explanation I would lean towards.
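Just to render the hypothetical from two paragraphs up concretely (a toy sketch of my own, and I’m not claiming existing meta-learners look like this): a forward pass that does that kind of internal ‘search’ amounts to scoring candidate actions with an internally learned predictor and returning the argmax, with no weight update and, notably, no recurrence.

```python
# Toy rendering (mine, purely illustrative) of a policy whose single forward
# pass scores candidate actions with an internal learned value model and
# returns the argmax: "search" amortised into one pass, no weight updates,
# no recurrence.
import torch
import torch.nn as nn

class InternalSearchPolicy(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        # Internal predictor of "how good is action a in situation o"
        self.value_net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.n_actions = n_actions

    def forward(self, obs):
        # Enumerate candidate actions and score each one internally.
        candidates = torch.eye(self.n_actions)
        scores = self.value_net(
            torch.cat([obs.expand(self.n_actions, -1), candidates], dim=1))
        return scores.argmax()           # pick the action predicted to score best

policy = InternalSearchPolicy()
action = policy(torch.randn(1, 8))       # a single forward pass, no learning step
```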
if taking actions requires learning, then the model itself has to do that learning.
I’m not sure what you mean when you say ‘taking actions requires learning’. Do you mean something other than the basic requirement that a policy depends on observations?
And can we taboo the word ‘learning’ for this discussion, or keep it to the standard ML meaning of ‘update model weights through optimisation’? Of course, some domains require responsive policies that act differently depending on what they observe, which is what Rohin points out elsewhere in these comments. In complex tasks on the way to AGI, I can see the kind of responsiveness required becoming very sophisticated indeed, possessing interesting cognitive structure. But it doesn’t have to be the same kind of responsiveness as the learning process of an RL agent, and it doesn’t necessarily look like learning in the everyday sense of the word. Since the space of things that could be meant here is so big, it would be good to talk more concretely.
I agree with all of that—I was purposefully keeping the term “learning” vague, precisely because the space is so large and the point I’m making is very general and doesn’t really depend on exactly what notion of responsiveness/learning you’re considering.
Now, I understand that you argue that if a policy were to learn an internal search procedure, or an internal learning procedure, then it could predict the rewards it would get for different actions. It would then pick the action that scores best according to its prediction, thereby ‘updating’ based on returns it hasn’t yet received and actions it hasn’t yet taken. I agree that it’s possible this is helpful, and it would be interesting to study existing meta-learners from this perspective (though my guess is that they don’t do anything so sophisticated). It isn’t clear to me a priori that this is the best strategy from the policy’s point of view.
This does in fact seem like an interesting angle from which to analyze this, though it’s definitely not what I was saying—and I agree that current meta-learners probably aren’t doing this.
I’m not sure what you mean when you say ‘taking actions requires learning’. Do you mean something other than the basic requirement that a policy depends on observations?
What I mean here is in fact very basic—let me try to clarify. Let π∗:S→A be the optimal policy. Furthermore, suppose that any polynomial-time (or some other similar constraint) algorithm that well-approximates π∗ has to perform some operation f. Then my point is just that, for π to achieve performance comparable with π∗, it has to do f. The argument is simple: we know that you have to do f to get good performance, which means that either π has to do f or the gradient descent algorithm has to. But we know the gradient descent algorithm can’t be doing something crazy like running f at each step and putting the result into the model, because the gradient descent algorithm only updates the model on the given state after the model has already produced its action.
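To lay the shape of the argument out explicitly (the formalisation is loose, and the notation is just for this comment):

```latex
% Loose formalisation, just to fix the shape of the argument:
%   P1. Every polynomial-time algorithm that well-approximates \pi^* performs f.
%   P2. The trained policy \pi_\theta is such an algorithm: it is polynomial-time
%       and well-approximates \pi^*.
%   P3. Gradient descent only updates \theta on a state s after \pi_\theta(s) has
%       been computed, so it cannot perform f on behalf of that forward pass.
%   C.  Therefore f is performed inside the forward pass of \pi_\theta.
\[
\Big(\forall A \in \mathcal{A}_{\mathrm{poly}}:\; A \approx \pi^{*} \Rightarrow A \text{ performs } f\Big)
\;\wedge\; \big(\pi_\theta \in \mathcal{A}_{\mathrm{poly}} \wedge \pi_\theta \approx \pi^{*}\big)
\;\Rightarrow\; \pi_\theta \text{ performs } f .
\]
```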
I am quite confused; I wonder whether we agree on the substance but differ on the wording. Either way, it seems worthwhile to talk this through.
I follow your argument, and it is what I had in mind when I was responding to you earlier. If approximating π∗(o_t) within the constraints requires computing f(o_t), then any policy that approximates π∗ must compute f(o_t). (Assuming appropriate constraints that preclude the policy from being a lookup table precomputed by SGD; not sure if that’s what you meant by “other similar”, though this may be trickier to do formally than we take it to be.)
My point is that for f = ‘learning’, I can’t see how anything I would call learning could meaningfully happen inside a single timestep. ‘Learning’ in my head is something that suggests non-ephemeral change; and any lasting change has to feed into the agent’s next state, by which point SGD would have had its chance to make the same change.
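To make the ephemeral/lasting distinction concrete (the particular architecture below is arbitrary): in a recurrent agent, the only things that survive past timestep t are the weights, which the RL algorithm controls, and the carried state, which is exactly what gets handed forward to the next step; every other intermediate activation vanishes when the step returns.

```python
# Schematic: the only within-episode carrier of lasting change is the recurrent
# state h; all other intermediate computation is ephemeral and is recomputed
# from scratch at the next step.
import torch
import torch.nn as nn

cell, head = nn.GRUCell(4, 16), nn.Linear(16, 3)

def step(obs, h):
    h_next = cell(obs, h)             # the only thing that persists to the next timestep
    logits = head(h_next)             # ephemeral: gone once this call returns
    return logits.argmax(dim=-1), h_next

h = torch.zeros(1, 16)
for t in range(5):
    action, h = step(torch.randn(1, 4), h)   # whatever persists, persists via h
```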
Could you give an example of what you mean (this is partially why I wanted to taboo learning)? Or, could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning).
could you give an example of a task that would require learning in this way? (Note the within-timestep restriction; without that I grant you that there are tasks that require learning)
How about language modeling? I think that the task of predicting what a human will say next given some prompt requires learning in a pretty meaningful way, as it requires the model to be able to learn from the prompt what the human is trying to do and then do that.
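Here is the sense in which I mean “learn from the prompt” (a standard in-context-learning style probe; gpt2 below is just a stand-in, and I’m making no claim about how well that particular model actually does this): the mapping the model needs exists only in the prompt, and whatever adaptation happens, happens inside frozen-weight forward passes.

```python
# In-context learning probe: the pattern to pick up exists only in the prompt,
# and no gradient step is taken anywhere. If the model completes the pattern,
# the "figuring out" happened in the forward passes over the prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("France -> Paris\n"
          "Germany -> Berlin\n"
          "Italy -> ")
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=4, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))
```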
Good point—I think I wasn’t thinking deeply enough about language modelling. I certainly agree that the model has to learn in the colloquial sense, especially if it’s doing something really impressive that isn’t well explained by interpolating between dataset examples—I’m imagining giving GPT-X some new mathematical definitions and asking it to produce novel proofs.
I think my confusion was rooted in the fact that you were replying to a section that dealt specifically with learning an inner RL algorithm, and the above sense of ‘learning’ is a bit different from that one. ‘Learning’ in your sense can be required for a task without requiring an inner RL algorithm; or at least, whether it does isn’t clear to me a priori.