Ah, nice example! I now see your point, and I agree with everything you wrote. Whereas REINFORCE and SFT only incentivize actions that actually appeared in high-reward trajectories, Q-learning and actor-critic incentivize actions that make up trajectories one can infer would be high-reward, even if those actions never appeared in any high-reward trajectory in the data.
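To make that concrete, here's a minimal sketch on a hypothetical three-state toy MDP (not from our discussion) illustrating "trajectory stitching": the logged data contains one trajectory that takes action `a` at `s0` and earns zero return, and a separate trajectory that earns reward from `s1`. A Monte-Carlo/REINFORCE-style weighting never upweights `a` at `s0`, while tabular Q-learning bootstraps across the two trajectories and infers that `a` at `s0` leads to reward.

```python
# Two logged trajectories over states s0, s1, s2 (s2 terminal; reward 1 on entry).
# Transition format: (state, action, reward, next_state, done).
traj1 = [("s0", "a", 0.0, "s1", False)]  # ends before reaching any reward -> return 0
traj2 = [("s1", "a", 1.0, "s2", True)]   # earns the reward -> return 1
transitions = traj1 + traj2

# Monte-Carlo / REINFORCE-style credit: weight each (s, a) by the return of the
# trajectory it actually appeared in.
mc_weight = {}
for traj in (traj1, traj2):
    ret = sum(r for (_, _, r, _, _) in traj)
    for (s, a, _, _, _) in traj:
        mc_weight[(s, a)] = mc_weight.get((s, a), 0.0) + ret

# Tabular Q-learning: bootstrap each update off max_a' Q(s', a'), regardless of
# which trajectory the transition came from -- this "stitches" traj1 onto traj2.
q = {}
gamma, alpha = 0.9, 0.5
for _ in range(200):  # sweep the offline dataset repeatedly until convergence
    for (s, a, r, s2, done) in transitions:
        bootstrap = 0.0 if done else gamma * max(q.get((s2, a2), 0.0) for a2 in ("a",))
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + bootstrap - q.get((s, a), 0.0))

print(mc_weight[("s0", "a")])    # 0.0 -> REINFORCE-style never reinforces a at s0
print(round(q[("s0", "a")], 3))  # 0.9 -> Q-learning infers a at s0 leads to reward
```

The key point is that `Q(s0, a)` ends up high even though the only trajectory containing `(s0, a)` had zero return; the value is inferred from the `(s1, a)` transition in a different trajectory.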