We’re then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We’re going to try to use the RL to train: “Act exactly like [a given alignment researcher] would act.”
Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffice, why is so much post-training currently done on base models?
Do you want to try playing this game together sometime?