For sure! At the same time, a) we’ve continued to see new ways of eliciting greater capability from the models we have, and b) o1 could (AFAIK) involve enough additional training compute to no longer be best thought of as ‘the same model’ (one possibility, although I haven’t spent much time looking into what we know about o1: they may have started with a snapshot of the 4o base model, put it through additional pre-training, then done an arbitrary amount of RL on CoT). So I’m hesitant to think that ‘based on 4o’ sets a very strong limit on o1’s capabilities.
Performance after post-training degrades if behavior gets too far from that of the base/SFT model (see Figure 1). Solving this issue would be an entirely different advancement from what o1-like post-training appears to do. So I expect that the model remains approximately as smart as the base model and the corresponding chatbot; it’s just better at packaging its intelligence into relevant long reasoning traces.
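To make the constraint concrete: the standard way this is enforced in RLHF-style post-training is a KL penalty against a frozen reference policy (whether o1’s training actually uses this objective is an assumption on my part, not something that’s been confirmed):

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\bigl[ r(x, y) \bigr]
\;-\;
\beta \, \mathrm{KL}\!\bigl( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)
```

Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the base/SFT model, $r$ is the reward, and $\beta$ controls how far the trained model is allowed to drift. The point above is that when the effective KL from $\pi_{\mathrm{ref}}$ grows too large, reward stops tracking actual quality, which is why post-training tends not to make the model fundamentally smarter than the model it started from.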
Interesting, I didn’t know that. But it seems like that assumes that o1’s special-sauce training can be viewed as a kind of RLHF, right? Do we know enough about that training to know that it’s RLHF-ish, or at least some clearly offline approach?