Two central features of RL in my mind, which distinguish it from imitation learning:
Receiving reward in a given state make the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there’s efficient credit assignment through time.
(In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.
Q-learning is therefore a central example of RL, alongside actor-critic algorithms.
Online REINFORCE has very dumb credit assignment, but it does eventually leads to mode collapse to highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.
Online high-rated SFT also has poor credit assignment, in a similar way as online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there’s a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it’ll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger) which people don’t call RL.
By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.
Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of “online”, “state-wise credit assignment” and “converges to sharp optimum” separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.
To make sure we’re on the same page, suppose we’re in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗).
I agree that if the agent was trained via REINFORCE and finds itself in state s′ that it might not know to take action a′ (because it’s only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).
But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.
In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.
I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)
So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.
(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)
I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then q-learning will learn quickly that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because q-learning is doing a one-step lookahead, which isn’t really scalable. (That also isn’t true of all critics.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
Two central features of RL in my mind, which distinguish it from imitation learning:
Receiving reward in a given state make the policy more likely to navigate to that state in general (not just via the specific pathway in which it happened to reach that state) - i.e. there’s efficient credit assignment through time.
(In theory) small differences in reward can lead to big differences in behavior, i.e. there’s mode collapse to the highest-expected-reward policy.
Q-learning is therefore a central example of RL, alongside actor-critic algorithms.
Online REINFORCE has very dumb credit assignment, but it does eventually leads to mode collapse to highest-expected-reward policy. So I count this as… like 75% RL, but a less central example than Q-learning.
Online high-rated SFT also has poor credit assignment, in a similar way as online REINFORCE. Meanwhile, whether or not it converges to the highest-reward policy depends on how the ratings are generated. If there’s a bucket of high-reward trajectories such that all sufficiently-good trajectories go in there, then it’ll never learn to do better than a typical trajectory from that bucket. This feels more like online imitation learning (e.g. stuff like DAgger) which people don’t call RL.
By contrast, if there’s an underlying “true” reward function and the probability that a trajectory is highly-rated depends (monotonically) on its true reward, then eventually it’ll converge to only ever taking the highest-reward trajectories, which feels more centrally RL to me.
Idk how much sense this makes, it all feels a bit fuzzy. My immediate conclusion is that we should mostly care about the three traits of “online”, “state-wise credit assignment” and “converges to sharp optimum” separately, rather than trying to figure out which combination of them counts as RL (except that anything with state-wise credit assignment is definitely RL).
I appreciate your clear articulation of the point about incentivizing the agent to navigate to high-reward states in a trajectory-independent way (in contrast to learning to produce trajectories like those which historically got high reward). That said, I’m confused about how you’ve labeled the methods you mention as having vs. not having this property.
To make sure we’re on the same page, suppose we’re in an environment with a state s∗ which is high reward, and suppose that there are two ways to get to state s∗: via the two trajectories (s,a,s∗) and (s′,a′,s∗). Suppose further that historically the agent has only navigated to this state via the former trajectory (s,a,s∗).
I agree that if the agent was trained via REINFORCE and finds itself in state s′ that it might not know to take action a′ (because it’s only been reinforced to take action a from state s, and not to reach state s∗; and also because it might not know that a′ would transition it to state s∗).
But this also seems true if the agent were trained via Q-learning with a Q-function Q(s,a): the Q-function need not have learned that Q(s′,a′) is large, only that Q(s,a) is large.
In either the REINFORCE or the Q-learning case, once the agent sees a trajectory (s′,a′,s∗), it will make an update towards taking action a′ from state s′, but the size of the update seems to depend on details about the network implementing the policy or Q-function—if there’s some obvious reason that the Q-learner will necessarily make a larger update, I’ve missed it.
I think the above also applies in the case of actor-critic methods where the critic is implemented by a Q-function. And I think it still applies even if the critic is a value function V(s), but I’m less confident: the critic has the assumption baked in that rewards come only from states, but the actor still doesn’t, so this might have similar dynamics to REINFORCE. (And if it ends up that this does do better, it’s only by baking in an assumption about the environment—that rewards come from the states and not the specific trajectories—which isn’t true in all environments.)
So I don’t follow why Q-learning and actor-critic methods on one hand, and REINFORCE and FeedME on the other hand, lie on opposite sides of the “learn to navigate to high-reward states in a trajectory-independent way” spectrum.
(I enjoyed thinking through the details here, by the way, so thanks for prompting that.)
I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then q-learning will learn quickly that it should go s1 → s2 → s3, whereas REINFORCE and SFT will need to do further exploration before learning that.
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because q-learning is doing a one-step lookahead, which isn’t really scalable. (That also isn’t true of all critics.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic incentivize actions which comprise trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).