We’re then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We’re going to use that RL to train: “Act exactly like [a given alignment researcher] would act.”
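To make the setup concrete, here is a minimal toy sketch in plain PyTorch. It uses a per-position categorical “policy” rather than a real language model, and the vocabulary size, the match-the-expert reward, and the episode count are all hypothetical stand-ins, not anything from the actual proposal. Option A is SFT on an expert demonstration; Option B is a small REINFORCE-style run against a reward for acting like the expert, in the spirit of the questions that follow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, SEQ_LEN = 16, 8

# Stand-in for "what the researcher would do": a fixed sequence of expert actions.
expert = torch.randint(0, VOCAB, (SEQ_LEN,))

def sample_episode(logits):
    """Sample one action per step from a per-position categorical policy."""
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    return actions, dist.log_prob(actions).sum()

def imitation_reward(actions):
    """Hypothetical reward: fraction of steps where the policy matched the expert."""
    return (actions == expert).float().mean()

# --- Option A: SFT on expert demonstrations (plain supervised imitation) ---
sft_logits = nn.Parameter(torch.zeros(SEQ_LEN, VOCAB))
opt = torch.optim.Adam([sft_logits], lr=0.1)
for _ in range(50):
    loss = F.cross_entropy(sft_logits, expert)  # push probability mass onto the expert's actions
    opt.zero_grad(); loss.backward(); opt.step()

# --- Option B: a small amount of RL ("like, 10 training episodes"), REINFORCE-style ---
rl_logits = nn.Parameter(torch.zeros(SEQ_LEN, VOCAB))
opt = torch.optim.Adam([rl_logits], lr=0.1)
for episode in range(10):
    actions, log_prob = sample_episode(rl_logits)
    loss = -imitation_reward(actions) * log_prob  # reinforce sampled behavior in proportion to its reward
    opt.zero_grad(); loss.backward(); opt.step()
```

In the toy, both routes push probability toward the expert’s behavior; the RL route only updates on whatever it happens to sample and how well that scores, which is exactly the contrast the next questions are about.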
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffice, why is so much post-training currently done on base models?