That’s true; I think the choice to mix in pretraining gradients probably has more effect on the end model than the overfitted SFT model they start PPO from.
Huh, but Mysteries of mode collapse (and the update) were published before td-003 was released? How would you have ended up reading a post claiming td-002 was RLHF-trained when td-003 existed?
Meta note: it’s plausibly net positive that all the training details of these models have been obfuscated, but it’s frustrating how much energy has been sunk into speculation on The Way Things Work Inside OpenAI.
I was never a fan of this perspective, and I still think everything should have been transparent from the beginning.