In practice, the verifier is probably some kind of learned reward model (though it could be automated, like unit tests for code).
My guess is that a substantial amount of the verification (perhaps the majority?) was automated by training the model on domains where we have ground truth reward signals, like code, math, and standardized test questions. This would match the results in the o1 blog post: performance improved substantially in domains that have ground truth (or something close to it), while it was roughly flat on more subjective tasks like creative writing. Nathan Lambert, the head of post-training at AI2, also found that doing continued RL training on ground truth rewards (which he calls RLVR) results in models that learn to say o1-like things such as 'wait, let me check my work' in their chain of thought.
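To make the "ground truth reward" idea concrete, here is a minimal sketch (my own illustration, not OpenAI's or AI2's actual pipeline) of what verifiable rewards for math and code might look like: the reward comes from checking the final answer or running unit tests, rather than from querying a learned reward model. The function names and the answer-extraction heuristic are assumptions for illustration only.

```python
# Hypothetical sketch of verifiable (ground truth) reward signals.
# This is an illustration of the general idea, not anyone's real implementation.

import re
import subprocess
import tempfile


def math_reward(model_output: str, ground_truth: str) -> float:
    """Reward 1.0 if the model's final numeric answer matches the ground truth."""
    # Assumption: treat the last number in the output as the final answer.
    candidates = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not candidates:
        return 0.0
    return 1.0 if candidates[-1] == ground_truth else 0.0


def code_reward(model_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Reward 1.0 if the generated code passes the provided unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(model_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", path], capture_output=True, timeout=timeout_s
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

The point is just that in these domains the reward is a cheap, exact check, which is what makes large-scale RL on them feasible; subjective domains like creative writing have no analogous check.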
I think you flipped the names from the iMessage conversation. As per the caption in the OpenAI blog post, the blue bubbles are for Altman and the grey bubbles are for Zilis.