Speculation: RL rearranges and reweights latent model abilities, which SL created. (I think this mostly isn’t novel, just pulling together a few important threads)
Suppose I supervised-train a LM on an English corpus, and I want it to speak Spanish. RL is inappropriate for the task, because its on-policy exploration won’t output interestingly better or worse Spanish completions. So there’s no obvious content for me to grade.
More generally, RL can provide inexact gradients away from undesired behavior (e.g. negative reinforcement event → downweight logits on tokens which produced that event), but that doesn’t tell the agent what it should be doing instead (e.g. which tokens the probability mass should have gone to instead).
RL can also provide exact gradients towards desired behavior which was historically generated (e.g. the model outputs “Hola”, and you reinforce it, and logits go up on that output in that situation), but the behavior has to have been generated on-policy. So this is still limited. You can kinda get around this by doing clever reward shaping (reinforcing intermediate steps of cognition / behavior), but this requires knowing what to shape (increasingly Spanish-like generations???).
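To make the last two points concrete, here’s a minimal REINFORCE-style sketch (PyTorch, with a toy one-step policy; the model, context features, and reward below are hypothetical placeholders, not a claim about any particular training setup). The gradient only touches the log-probability of the token that was actually sampled on-policy; the reward’s sign decides whether that token gets upweighted or downweighted, but nothing in the update points toward tokens the policy never produced.

```python
# Toy REINFORCE update: a one-step "policy" stands in for an LM head.
import torch

vocab_size, hidden = 100, 32
policy = torch.nn.Linear(hidden, vocab_size)  # placeholder for the LM's output head
context = torch.randn(hidden)                 # placeholder context features

logits = policy(context)
dist = torch.distributions.Categorical(logits=logits)
token = dist.sample()        # on-policy sample; only this token gets direct signal
reward = -1.0                # negative reinforcement event (use +1.0 to reinforce)

# REINFORCE loss: -reward * log pi(token | context).
# reward < 0 pushes the sampled token's logit down; reward > 0 pushes it up.
# Either way, no gradient says which *unsampled* token should have been emitted.
loss = -reward * dist.log_prob(token)
loss.backward()
```

Reward shaping changes what `reward` is and when it arrives, but not the basic shape of the update: signal still only flows through whatever the policy actually sampled.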
Supervised learning, by contrast, gives exact gradients towards long sequences of desired outputs (e.g. actual Spanish), which were generated by intelligences running the desired algorithms (e.g. Spanish-speaking generative models).
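For contrast, a supervised (teacher-forcing) step under the same toy assumptions: cross-entropy supplies an exact gradient towards every token of the target sequence, whether or not the model would ever have sampled those tokens itself.

```python
# Toy teacher-forcing step: same placeholder setup as the sketch above.
import torch
import torch.nn.functional as F

vocab_size, seq_len, hidden = 100, 6, 32
lm_head = torch.nn.Linear(hidden, vocab_size)

hidden_states = torch.randn(seq_len, hidden)              # placeholder LM features
target_tokens = torch.randint(0, vocab_size, (seq_len,))  # stand-in for tokenized Spanish text

logits = lm_head(hidden_states)  # (seq_len, vocab_size)
# Cross-entropy pushes probability mass directly onto each target token,
# token by token, regardless of what the model would have generated on-policy.
loss = F.cross_entropy(logits, target_tokens)
loss.backward()
```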
I think this is part of why feral children end up so messed up—they miss a ton of high-quality imitation data during critical periods with learning-friendly hyperparameters.
This mechanistically explains why pure RL tends to fail and be sample-inefficient (or so I recall), and why pretraining is so important.