In other cases, like inner-monologue or self-distillation, the gain comes from amortizing compute: the model gains absolutely no additional data about the external world; it just gets to see what it already thinks at greater length and retrains to shortcut to its best already-known answer.
This can only give a bounded amount of improvement, but nothing in particular says that the bounds have to be low in practical terms. For a concrete example, LLMs are currently pretty bad at noticing and self-correcting when they make a reasoning error, but they are capable of each of the following steps (see the sketch after this list):
1. Given an example of a valid chain of reasoning, come up with a description of a mistake that might be made during that chain of reasoning.
2. Given a valid chain of reasoning and a description of a reasoning error, generate a chain of reasoning that exhibits that reasoning error. This gives training examples of “chains of reasoning with errors in them”.
3. Given an invalid chain of reasoning, determine where the first error occurred.
4. Given an invalid chain of reasoning and the location of the first error, attempt to recover from that error.
5. Given an invalid chain of reasoning and a recovery attempt, and also given the correct chain of reasoning, determine whether the recovery attempt succeeded.
6. Given all of the above, determine whether the scenario makes a good example of recovering from a reasoning error and should thus be included in the next training set.
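To make that concrete, here is a minimal Python sketch of how steps 1–6 might be chained into a data-generation loop. It assumes only a generic `llm(prompt) -> str` completion function; the prompt wording and the `make_recovery_example` helper are illustrative assumptions, not an existing API or a tested recipe.

```python
from typing import Callable, Optional

def make_recovery_example(llm: Callable[[str], str],
                          valid_chain: str) -> Optional[dict]:
    """One pass through steps 1-6 for a single known-good reasoning chain."""
    # 1. Describe a mistake that might be made during this chain of reasoning.
    mistake = llm(f"Describe a plausible reasoning error for:\n{valid_chain}")
    # 2. Generate a corrupted chain that exhibits that error.
    corrupted = llm(f"Rewrite this reasoning so that it commits the error "
                    f"'{mistake}':\n{valid_chain}")
    # 3. Determine where the first error occurs in the corrupted chain.
    location = llm(f"Find the first error in:\n{corrupted}")
    # 4. Attempt to recover from the error at that location.
    recovery = llm(f"This reasoning goes wrong at '{location}'. "
                   f"Continue correctly from there:\n{corrupted}")
    # 5. Judge the recovery attempt against the known-correct chain.
    verdict = llm(f"Reference:\n{valid_chain}\nRecovery attempt:\n{recovery}\n"
                  f"Did the recovery succeed? Answer yes or no.")
    # 6. Keep only scenarios judged good enough for the next training set.
    if verdict.strip().lower().startswith("yes"):
        return {"prompt": corrupted, "target": recovery}
    return None
```

Everything above (the prompt wording, the yes/no judging, the output format) is a placeholder; the point is only that each step is a task the model can already do, and step 6 filters the results into the next training set.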
I expect that this cycle would produce improved reasoning-error recovery as long as recognizing a good output is easier than generating a good output, and I expect that would probably remain true for a while. I also expect that something like this has already been done, especially since it rhymes with Constitutional AI.
Obviously this doesn’t work “from scratch”: you need enough training for the model to be able to distinguish good outputs from bad outputs, and also to ever produce good outputs on its own. We’re not going to get a ChatGPT-Zero. But I think this post does gesture in the general direction of something real.
The gains from such approaches are real and are part of why LLMs work so well now.
However, the problem is that, thus far, the gains from self-distillation or finetuning always top out quickly. You can’t train more than 2 or 3 iterations before it stops working. There is something missing compared to self-play successes like TD-Gammon or AlphaZero. There cannot be any ChatGPT-Zero as currently constituted, because you’d run a few iterations and then it’d either stop progressing or collapse in some way as the LLM centipede eats its own increasingly degenerate outputs. Pretty soon, you do stop recognizing ‘better’ outputs because you were just trained to generate only better outputs! (Where does any additional ‘betterness’ or ‘betterness recognition’ come from? ‘Sorry, y’all look the same to me.’) RLHF or self-distillation are more about specialization than they are about increasing capability: they increase the prior of pre-existing outputs, nothing less, nothing more.
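As a caricature of that failure mode, the iterated loop looks roughly like the sketch below. The `generate`, `judge`, and `finetune` callables are placeholders assumed for illustration, not any real training API.

```python
from typing import Callable, List

def iterated_self_distillation(model,
                               prompts: List[str],
                               generate: Callable,   # (model, prompt, n) -> list of outputs
                               judge: Callable,      # (model, output) -> score
                               finetune: Callable,   # (model, outputs) -> new model
                               iterations: int = 3):
    """Same model as generator and judge, retrained on its own preferred outputs."""
    for _ in range(iterations):
        # Keep the judged-best of several samples per prompt.
        best = [max(generate(model, p, 16), key=lambda out: judge(model, out))
                for p in prompts]
        # Finetuning on these raises the prior on outputs the model already
        # produces and already prefers; no new information enters the loop,
        # so after a couple of rounds 'better' stops being distinguishable.
        model = finetune(model, best)
    return model
```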
The search for LLMs is not great. It’s analogous to doing runtime search in a Go/chess model: you get a big boost in Elo from searching even 1 or 2 ply (especially in avoiding blunders), but then you run into fast diminishing returns, and your search doesn’t feed back into the original model to improve it (outside training). But I think that, beyond some highly abstract niches like math theorem proving (which pose different challenges), the main missing part is the active selection of new data for LLMs, which is implicit in games where your ‘new data’ is just part of search (of the game tree).
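For comparison, the nearest LLM analogue of searching 1 or 2 ply is something like best-of-N sampling against a scorer. A minimal sketch (again with assumed `llm` and `score` callables) makes the asymmetry visible: the search helps at inference time, but what it finds is discarded rather than fed back into the weights.

```python
from typing import Callable

def best_of_n(llm: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str,
              n: int = 8) -> str:
    """Shallow runtime 'search' for an LLM: sample n candidates, keep the best."""
    candidates = [llm(prompt) for _ in range(n)]
    # Like a 1-2 ply game search, this mostly avoids blunders; the rejected
    # samples and the scorer's judgments are thrown away, not trained on.
    return max(candidates, key=lambda c: score(prompt, c))
```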
Obviously this doesn’t work “from scratch”: you need enough training for the model to be able to distinguish good outputs from bad outputs, and also to ever produce good outputs on its own. We’re not going to get a ChatGPT-Zero. But I think this post does gesture in the general direction of something real.
While I do think the process you outlined in your post is more concrete and would probably work better and be easier than learning “from scratch”, I don’t think it’s completely obvious that something like this wouldn’t work from scratch. It was done for humans, albeit through billions of years of genetic evolution and thousands of years of cultural evolution. Something like ChatGPT-Zero would probably require many orders of magnitude more compute than the systems we are training today, and also some algorithmic/architectural improvements, but I don’t think it’s completely impossible.
I feel like your post is implying something similar, given the last sentence, so maybe I’m misinterpreting what exactly you’re saying won’t work.
The specific thing I think wouldn’t work is trying to start the process without a bunch of pretraining data for at least the initial judge (i.e. pure self-play from a randomized initialization with no human-generated data or judgments entering the training run at any point). Not super insightful, I know; just addressing what I meant by “zero” in my hypothetical ChatGPT-Zero.
Thanks for clarifying! I do agree that that wouldn’t work, at least if we wanted what was produced to be in any way useful or meaningful to humans.