So far, the answer seems to be that it transfers some, and o1 and o1-pro still seem highly useful in ways beyond reasoning, but o1-style models mostly don’t ‘do their core thing’ in areas where they couldn’t be trained on definitive answers.
Based on:
rumors that talking to base models is very different from talking to RLHFed models and
how things work with humans
It seems likely to me that thinking skills transfer pretty well. But then this is trained out, because it produces answers that raters don't like, so the model memorizes the answers it's supposed to give instead.