Seth Herd comments on The Problem with Reasoners by Aidan McLaughlin

Seth Herd 25 Nov 2024 22:57 UTC
5 points
2
This is quite compelling, as far as it goes.

But it doesn’t go far enough. The technique itself generalizes outside of easy-to-verify domains like math and physics.

You don’t need perfect or easy verification for this technique to work. You only need a little traction on whether one answer is better or worse than another, which you can then use as a training signal. Humans and AlphaZero each do a version of this: they do a brief tree search using their current best quick guesses at the situation, do some sort of evaluation to estimate which is the best of those answers, then use that to update their next quick guesses in a similar situation. In those cases there’s some feedback from the environment, but that’s not necessary: you just need an evaluation process that’s marginally better than the guessing process to make forward progress.
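
To make that loop concrete, here is a toy sketch in Python. It is not how AlphaZero or any real reasoning system is trained; it only illustrates the guess, evaluate, update cycle under loudly stated assumptions. The "policy" is a softmax over ten candidate answers, the hidden `quality` list stands in for how good each answer really is, and the evaluator is just that quality plus heavy Gaussian noise, so it has only a little traction. All names and parameter values (`self_improve`, `noise`, `k`, `lr`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_quality(logits, quality):
    """Average true quality of the answers the policy would give."""
    return sum(p * q for p, q in zip(softmax(logits), quality))


def self_improve(steps=3000, k=4, noise=1.0, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Assumption: a hidden quality per answer. The policy never sees it;
    # only the noisy evaluator below does.
    quality = [i / 9 for i in range(10)]
    logits = [0.0] * len(quality)  # start with uniform quick guesses
    before = expected_quality(logits, quality)

    for _ in range(steps):
        probs = softmax(logits)
        # Quick guesses: sample k candidate answers from the current policy.
        candidates = rng.choices(range(len(quality)), weights=probs, k=k)
        # Weak evaluation: true quality buried in noise as large as the
        # whole quality range, so single judgments are often wrong.
        chosen = max(candidates, key=lambda i: quality[i] + rng.gauss(0.0, noise))
        # Update: nudge the policy toward the answer the evaluator preferred
        # (one gradient step on log p(chosen)).
        for i in range(len(logits)):
            target = 1.0 if i == chosen else 0.0
            logits[i] += lr * (target - probs[i])

    return before, expected_quality(logits, quality)


if __name__ == "__main__":
    before, after = self_improve()
    print(f"expected quality before: {before:.3f}, after: {after:.3f}")
```

The only training signal here is the noisy pick among the policy's own guesses. Because that pick is right slightly more often than the guesses themselves, the policy drifts toward better answers over many steps, which is the whole claim in miniature.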

The Marco-o1 system takes this approach, and we’ll see more systems like it. They’ll probably work. There’s a real question of whether they’re efficient enough to outpace just scaling up models and refining their synthetic data, but it’s one route toward progress.