It’s possible that the “0 steps RLHF” model is the “Initial Policy” here with HHH prompt context distillation.
I wondered about that when I read the original paper, and asked Ethan Perez about it here. He responded:
Good question, there’s no context distillation used in the paper (and none before RLHF)
Thanks. That’s pretty odd, then.