Kei comments on o1: A Technical Primer

Kei 10 Dec 2024 0:00 UTC
LW: 25 AF: 14
In practice, the verifier is probably some kind of learned reward model (though it could be automated, like unit tests for code).

My guess is that a substantial amount of the verification (perhaps the majority?) was automated by training the model on domains where we have ground truth reward signals, like code, math, and standardized test questions. This would match the results in the o1 blog post, where performance improved substantially in domains that have ground truth or something close to it, while staying flat on more subjective tasks like creative writing. Nathan Lambert, the head of post-training at AI2, also found that continued RL training on ground truth rewards (which he calls RLVR) produces models that learn to say o1-like things such as ‘wait, let me check my work’ in their chain of thought.
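To make "ground truth reward signal" concrete, here is a minimal sketch of what automated verifiers for math and code might look like. The function names (`math_reward`, `code_reward`) and the exact-match normalization are my own assumptions for illustration; nothing here is taken from how o1 or RLVR is actually implemented.

```python
from typing import Callable, Iterable, Tuple


def _normalize(text: str) -> str:
    """Light normalization (an assumption): trim, lowercase, drop spaces."""
    return text.strip().lower().replace(" ", "")


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if _normalize(model_answer) == _normalize(reference) else 0.0


def code_reward(
    candidate: Callable[[int], int],
    tests: Iterable[Tuple[int, int]],
) -> float:
    """Return the fraction of (input, expected) unit tests the candidate passes.

    Any exception raised by the candidate counts as a failed test.
    """
    cases = list(tests)
    if not cases:
        return 0.0
    passed = 0
    for arg, expected in cases:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(cases)
```

Neither function needs a learned reward model: the reward comes entirely from a reference answer or a test suite, which is what makes these domains cheap to verify at scale.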
- Jesse Hoogland 11 Dec 2024 17:01 UTC
  LW: 2 AF: 1
  It’s worth noting that there are also hybrid approaches: for example, you can use automated verifiers (or a combination of automated verifiers and supervised labels) to train a process reward model, and then train your reasoning model against that reward model.
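One way such a hybrid could work, sketched under my own assumptions: score each intermediate step by sampling completions from that prefix and checking the final answers with an automated verifier, then use those scores as training targets for the process reward model. The names `label_steps`, `complete`, and `verify` are hypothetical, and the toy completer below stands in for sampling from a real model.

```python
import random
from typing import Callable, List


def label_steps(
    steps: List[str],
    complete: Callable[[List[str], random.Random], str],
    verify: Callable[[str], bool],
    n_rollouts: int = 8,
    seed: int = 0,
) -> List[float]:
    """Estimate a per-step score from an outcome-level verifier.

    For each prefix steps[:i], sample n_rollouts completions and record the
    fraction whose final answer the verifier accepts.
    """
    rng = random.Random(seed)
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(verify(complete(prefix, rng)) for _ in range(n_rollouts))
        labels.append(wins / n_rollouts)
    return labels


# Toy stand-ins (assumptions): a prefix containing a flawed step never
# recovers, and the verifier checks the final answer against "42".
def toy_complete(prefix: List[str], rng: random.Random) -> str:
    return "0" if any("error" in step for step in prefix) else "42"


def toy_verify(answer: str) -> bool:
    return answer == "42"


step_labels = label_steps(
    ["set up equation", "arithmetic error", "report answer"],
    toy_complete,
    toy_verify,
)
```

The resulting `step_labels` would be the regression targets for the process reward model, so the only ground truth involved is the final-answer check.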