We’re then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We’re going to use that RL to train: “Act exactly like [a given alignment researcher] would act.”
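To make the setup concrete, here is a minimal toy sketch in plain PyTorch. It uses a per-position categorical “policy” rather than a real language model, and the vocabulary size, the match-the-expert reward, and the episode count are all hypothetical stand-ins, not anything from the actual proposal. Option A is SFT on an expert demonstration; Option B is a small REINFORCE-style run against a reward for acting like the expert, in the spirit of the questions that follow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, SEQ_LEN = 16, 8

# Stand-in for "what the researcher would do": a fixed sequence of expert actions.
expert = torch.randint(0, VOCAB, (SEQ_LEN,))

def sample_episode(logits):
    """Sample one action per step from a per-position categorical policy."""
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    return actions, dist.log_prob(actions).sum()

def imitation_reward(actions):
    """Hypothetical reward: fraction of steps where the policy matched the expert."""
    return (actions == expert).float().mean()

# --- Option A: SFT on expert demonstrations (plain supervised imitation) ---
sft_logits = nn.Parameter(torch.zeros(SEQ_LEN, VOCAB))
opt = torch.optim.Adam([sft_logits], lr=0.1)
for _ in range(50):
    loss = F.cross_entropy(sft_logits, expert)  # push probability mass onto the expert's actions
    opt.zero_grad(); loss.backward(); opt.step()

# --- Option B: a small amount of RL ("like, 10 training episodes"), REINFORCE-style ---
rl_logits = nn.Parameter(torch.zeros(SEQ_LEN, VOCAB))
opt = torch.optim.Adam([rl_logits], lr=0.1)
for episode in range(10):
    actions, log_prob = sample_episode(rl_logits)
    loss = -imitation_reward(actions) * log_prob  # reinforce sampled behavior in proportion to its reward
    opt.zero_grad(); loss.backward(); opt.step()
```

In the toy, both routes push probability toward the expert’s behavior; the RL route only updates on whatever it happens to sample and how well that scores, which is exactly the contrast the next questions are about.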
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffice, why is so much post-training currently done on base models?