The gains from such approaches are real and part of why LLMs work so well now.
However, the problem is that, thus far, the gains from self-distillation or finetuning always top out quickly. You can’t train more than 2 or 3 iterations before it stops working. There is something missing compared to self-play successes like TD-Gammon or AlphaZero. There cannot be any ChatGPT-Zero as currently constituted, because you’d run a few iterations and then it’d either stop progressing or collapse in some way as the LLM centipede eats its own increasingly degenerate outputs. Pretty soon, you stop recognizing ‘better’ outputs because you were just trained to generate only better outputs! (Where does any additional ‘betterness’, or ‘betterness recognition’, come from? ‘Sorry, y’all look the same to me.’) RLHF or self-distillation are more about specialization than about increasing capability: they increase the prior of pre-existing outputs, nothing more, nothing less.
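The saturation dynamic can be shown with a toy simulation. Everything here is my own stand-in, not a real LLM: output “quality” is just a number drawn from a Gaussian, the “judge” is the same quality plus fixed noise, and “finetuning” means shifting the model toward its top-rated samples while its diversity shrinks. The point is that once the model’s outputs are all closer together than the judge’s noise floor, selection can no longer find anything ‘better’, and per-iteration gains collapse:

```python
import random
import statistics

def self_distill(mu, sigma, judge_noise=1.0, n=200, k=20, iters=8, seed=0):
    """Toy self-distillation loop (illustrative assumptions only).

    Model: output quality ~ N(mu, sigma).
    Judge: sees quality + N(0, judge_noise) and ranks samples.
    'Finetune': move mu to the mean of the top-k; sigma shrinks
    each round as the model specializes onto its own best outputs.

    Returns the per-iteration improvement in mu. Early rounds gain a
    lot; once sigma << judge_noise, the judge can't tell samples
    apart and the gains flatten out.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(iters):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        # Rank by the judge's noisy score, keep the apparent top-k.
        ranked = sorted(samples, key=lambda q: q + rng.gauss(0, judge_noise),
                        reverse=True)
        new_mu = statistics.mean(ranked[:k])
        gains.append(new_mu - mu)
        mu = new_mu
        sigma *= 0.7  # specialization: output diversity collapses
    return gains

gains = self_distill(mu=0.0, sigma=2.0)
# First iteration gains a few sigma; by the last, almost nothing moves.
```

The `0.7` diversity-decay factor is an arbitrary assumption; what matters is only that any mechanism shrinking output variance relative to the judge’s noise produces the same top-out after a handful of iterations.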
Search for LLMs is not great either. It’s analogous to doing runtime search in a Go/chess model: you get a big boost in Elo from searching even 1 or 2 ply (especially in avoiding blunders), but then you run into fast diminishing returns, and your search doesn’t feed back into the original model to improve it (outside training). But I think that, beyond some highly abstract niches like math theorem proving (which pose different challenges), the main missing part is the active selection of new data for LLMs, which is implicit in games where your ‘new data’ is just part of search (of the game tree).
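The diminishing returns of runtime search have a simple statistical analogue in best-of-n sampling (a stand-in of my own choosing, not a claim about any particular search implementation): if a fixed model’s outputs are ~N(0, 1), the expected best of n samples grows only like √(2 ln n), so each doubling of search effort buys less than the last, and the underlying model never improves:

```python
import random
import statistics

def best_of_n(n, trials=5000, seed=0):
    """Expected quality of the best of n outputs sampled from a fixed
    model whose output quality ~ N(0, 1). A crude analogue of shallow
    search: more sampling only re-ranks what the model already
    produces; it never feeds back into the model itself."""
    rng = random.Random(seed)
    return statistics.mean(
        max(rng.gauss(0, 1) for _ in range(n)) for _ in range(trials)
    )

quality = [best_of_n(n) for n in (1, 2, 4, 8)]
deltas = [b - a for a, b in zip(quality, quality[1:])]
# Each doubling of samples yields a smaller gain than the one before.
```

The first doubling (1→2) is worth the most, mirroring the big Elo jump from searching even 1 or 2 ply; subsequent doublings flatten out, mirroring the fast diminishing returns of deeper search.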