I think your example is too simple to capture the relevant phenomenon. Here’s one which does: suppose state s3 gives high reward, state s4 gives medium reward, and state s5 gives low reward. You’ve seen the following trajectories:
s2 → s3
s1 → s4
s1 → s2 → s5
Then Q-learning will quickly learn that it should go s1 → s2 → s3, by stitching together the observed s1 → s2 and s2 → s3 transitions, whereas REINFORCE and SFT will need to do further exploration before learning that.
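To make the stitching concrete, here is a minimal tabular Q-learning sketch of this toy MDP, replaying just the three observed trajectories as an offline batch. The reward values (10 / 5 / 1 for high / medium / low) and the action names are assumptions made for illustration, not part of the example above.

```python
from collections import defaultdict

# Observed transitions from the three trajectories, replayed as an offline batch:
# (state, action, reward, next_state, done). Rewards 10/5/1 stand in for
# high/medium/low; reaching s3, s4 or s5 ends the episode.
batch = [
    ("s2", "to_s3", 10.0, "s3", True),   # from trajectory s2 -> s3
    ("s1", "to_s4", 5.0,  "s4", True),   # from trajectory s1 -> s4
    ("s1", "to_s2", 0.0,  "s2", False),  # from trajectory s1 -> s2 -> s5
    ("s2", "to_s5", 1.0,  "s5", True),
]

Q = defaultdict(float)        # Q[(state, action)], initialised to 0
alpha, gamma = 0.5, 0.9

def known_actions(state):
    # Actions we have a Q-value for in this state (i.e. ones seen in the batch).
    return [a for (s, a) in list(Q) if s == state]

for _ in range(100):          # sweep the batch until the values converge
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * max(Q[(s_next, b)] for b in known_actions(s_next))
        Q[(s, a)] += alpha * (r + bootstrap - Q[(s, a)])

# Greedy policy: Q[(s1, to_s2)] converges to ~0.9 * 10 = 9 > 5 = Q[(s1, to_s4)],
# even though the full trajectory s1 -> s2 -> s3 was never observed.
print(max(known_actions("s1"), key=lambda a: Q[("s1", a)]))  # -> to_s2
print(max(known_actions("s2"), key=lambda a: Q[("s2", a)]))  # -> to_s3
```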
I feel uncertain about how to think about the implications of this claim in the context of more complex environments, though. In some sense it only happens because Q-learning is doing a one-step lookahead, which isn’t really scalable. (Though not all critics are limited to a one-step lookahead.)
It feels like I might have just come up with a new name for “RL algorithms which work on offline data”, which is presumably not a crucial distinction.
Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions which in fact were historically part of high-reward trajectories, Q-learning and actor-critic methods incentivize actions which make up trajectories that one can infer would be high-reward (even if those actions never actually appeared in high-reward trajectories previously).
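And to illustrate the other half of that contrast in the same toy setting (same assumed 10 / 5 / 1 rewards, with a simple best-observed-trajectory filter standing in for the SFT/REINFORCE selection step, which is a simplification): the only trajectory from s1 that ever gets upweighted is the observed s1 → s4 one, so the stitched route through s2 is never incentivized.

```python
# Same toy data as above: (states, actions, return) for each observed trajectory.
observed = [
    (["s2", "s3"], ["to_s3"], 10.0),
    (["s1", "s4"], ["to_s4"], 5.0),
    (["s1", "s2", "s5"], ["to_s2", "to_s5"], 1.0),
]

# A crude stand-in for SFT on high-reward data: imitate the first action of the
# highest-return observed trajectory that starts in s1. (REINFORCE would also
# weakly upweight to_s2 via the low-return trajectory, but never the unseen
# s1 -> s2 -> s3 route.)
from_s1 = [traj for traj in observed if traj[0][0] == "s1"]
best_states, best_actions, best_return = max(from_s1, key=lambda traj: traj[2])
print(best_actions[0])  # -> to_s4
```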