Why does RL necessarily mean that AIs are trained to plan ahead?
I explain it in more detail in my original post.
In short, in standard language modeling the model only tries to predict the most likely immediate next token (T1), then the most likely token after that (T2) given T1, and so on; whereas in RL it's trying to optimize a whole sequence of next tokens (T1, …, Tn), such that the rewards earned by all the later tokens (up to Tn) are taken into account when scoring the choice of the immediate next token (T1).
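To make the contrast concrete, here is a minimal toy sketch (my own illustration, using a REINFORCE-style surrogate objective rather than whatever exact setup a given lab uses): in the LM case each token's log-probability stands on its own, while in the RL case each token's log-probability is weighted by the return, i.e. the sum of rewards from that step through the end of the sequence.

```python
# Toy illustration (assumed notation, not from the original post):
# log_probs[t] = log-probability the model assigned to the token it emitted at step t
# rewards[t]   = reward received at step t

def lm_objective(log_probs):
    # Standard language modeling: each token is scored only as the most likely
    # immediate next token; no reward from later steps enters the picture.
    return sum(log_probs)

def rl_objective(log_probs, rewards, gamma=1.0):
    # REINFORCE-style surrogate: the log-prob at step t is weighted by the
    # (discounted) return from t through Tn, so the very first token T1 is
    # credited with rewards earned all the way out to the end of the sequence.
    n = len(log_probs)
    total = 0.0
    for t in range(n):
        return_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, n))
        total += log_probs[t] * return_t
    return total
```

The upshot is that gradient ascent on the RL objective pushes up the probability of early tokens in proportion to the rewards the whole trajectory eventually collects, which is the sense in which the model is being trained to "plan ahead" rather than just pick the locally most likely next token.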