Performance after post-training degrades if the model's behavior drifts too far from that of the base/SFT model (see Figure 1). Solving this issue would be an entirely different advancement from what o1-like post-training appears to do. So I expect the model remains approximately as smart as the base model and the corresponding chatbot; it's just better at packaging its intelligence into relevant long reasoning traces.
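To make that constraint concrete, here is the textbook KL-regularized objective that standard RLHF-style post-training optimizes; this is just the usual formulation, not a claim about what is actually done for o1:

$$\max_{\pi_\theta}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)$$

Here $\pi_{\mathrm{ref}}$ is the base/SFT policy and $\beta$ sets how far the post-trained policy is allowed to drift from it, which is exactly the "don't move too far from the base model" constraint I'm pointing at.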
Interesting, I didn’t know that. But it seems like that assumes that o1’s special-sauce training can be viewed as a kind of RLHF, right? Do we know enough about that training to know that it’s RLHF-ish? Or at least some clearly offline approach.