Makes sense that something like this would be possible. Humans can often teach themselves to be better at a skill through practice, even without a teacher or ground truth. Of course, this often runs into mostly useless, self-reinforcing attractors when the learner is unable to determine the true quality levels of their own attempts in the area of study. E.g., physics cranks, paranormal studies, etc.
I actually think it’s a good thing that chain of thought models can improve themselves like this. Chain of thought LLMs seem by far the most alignable of the plausible paths to AGI, since they directly incorporate a prior towards human-like cognition. I’d be a bit more worried about alignment if AGI approaches based on imitating / amplifying humans showed signs of flagging even at the current infra-human capabilities stage.
Further, stable self improvement seems like a particularly sensitive requirement for a scalable alignment solution. It seems much better to have the medium of this self improvement be human language, rather than something like self-modifying meta reinforcement learning from scratch.
Edited to add three more reasons why I think this is a good thing:
It gives us an example of self improvement on which we can run experiments. We can look for failure modes in current systems, without having to scale them up to dangerous capabilities.
Of all the plausible ways to do self improvement that I can imagine, this one seems the most stable and least likely to have sudden capabilities spikes. The reason is that the basic building block of the model’s self improvement is autoregressive pretraining. When the model generates a new dataset and trains on that dataset, the result is to move the model’s median generation in the direction of the dataset. We can simply look at the dataset the model generated to get a pretty good idea of the direction in which this particular step of self modification will move the model. Of course, the compounding effects of multiple self improvement rounds are a different matter, but as I mention above, we can run experiments to investigate.
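To make the inspectability point concrete, here is a minimal sketch of one self-improvement round, assuming a hypothetical `model` object with `sample_cot` and `finetune` methods (stand-ins for whatever sampling and fine-tuning setup is actually used); the majority-vote filter is one plausible way to build the generated dataset, not necessarily the paper's exact procedure:

```python
from collections import Counter

def majority_answer(completions):
    """One plausible self-consistency filter: take the final answer that most
    chain-of-thought completions agree on."""
    finals = [c.split("The answer is")[-1].strip() for c in completions]
    return Counter(finals).most_common(1)[0][0]

def self_improvement_round(model, questions, n_samples=8):
    dataset = []
    for q in questions:
        # `model.sample_cot` is a hypothetical method returning one
        # chain-of-thought completion for the question.
        completions = [model.sample_cot(q) for _ in range(n_samples)]
        best = majority_answer(completions)
        # Keep only completions whose final answer matches the majority vote.
        dataset += [(q, c) for c in completions
                    if c.split("The answer is")[-1].strip() == best]

    # Because the update itself is just ordinary autoregressive fine-tuning on
    # `dataset`, reading `dataset` before this call gives a rough picture of
    # where this round will move the model's typical generations.
    for q, c in dataset[:5]:
        print(q, "->", c)
    model.finetune(dataset)  # hypothetical fine-tuning call
    return model
```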
It’s a good sign that this works. Humans can improve themselves in this manner. If language models had turned out to be incapable of doing so, that would have been an example of non-convergence between LMs and humans. Instead, we see convergence.
Crucially, this is a domain where convergence wasn’t necessarily implied by the LM training objective. Language models are trained to imitate human text autoregressively given a single context. It’s not a surprise when they generate human-like continuations of their provided contexts. That’s what they’re trained to do. However, that training objective does not directly imply that the LM should be capable of a human-esque self improvement trajectory that spans multiple prompt-completion pairs. I thus take this result as an update towards there being convergence in the higher order learning dynamics of humans and AIs.
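Concretely (my gloss, using the standard next-token objective rather than anything from the post): the pretraining loss only ever scores a single context–continuation pair,

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}),$$

and nothing in this per-token term refers to what happens across multiple rounds of generating data and retraining, which is why a multi-step self improvement trajectory is an emergent property rather than something the objective directly asks for.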
I really appreciated this comment for making the connection between this paper and IDA.
More explicitly, to the extent that you can think of the original large language model as simulating a human, there’s an analogy between:
asking the LLM to reason about its inputs and then training on the conclusion of the reasoning (what this paper does)
asking simulated humans to reason about a problem and then training a new model on their conclusions (the basic idea behind iterated distillation and amplification).
This is also a great chance for IDA skeptics to try to empirically demonstrate issues, either by finding failures of capabilities to scale, or by demonstrating alignment failures in amplified systems.
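A toy sketch of the analogy (my own framing; none of these names come from the paper or from any IDA codebase): both loops make the current model "think harder" about each input, then train the next model on the conclusions.

```python
def self_training_step(model, inputs, reason, train_on):
    """Roughly what the paper does: have the LLM reason about each input
    (chain of thought), then train on the conclusions of that reasoning."""
    conclusions = [reason(model, x) for x in inputs]
    return train_on(conclusions)

def ida_step(model, inputs, amplify, train_on):
    """Roughly the IDA loop: amplify the model (e.g. a human, or the model
    itself, answering a question by decomposing it and consulting copies of
    the model), then distill the amplified system's answers into a new model."""
    amplified = amplify(model)
    conclusions = [amplified(x) for x in inputs]
    return train_on(conclusions)

# Structurally, both are: next_model = distill(amplify(current_model)), where
# "amplify" is chain-of-thought reasoning in one case and question
# decomposition in the other.
```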
Humans can often teach themselves to be better at a skill through practice, even without a teacher or ground truth
Definitely, but I currently feel that the vast majority of human learning comes with a ground truth to reinforce good habits. I think this is why I’m surprised this works as well as it does: it kinda feels like letting an elementary school kid teach themself math by practicing whichever skills they feel confident in, without any regard for whether those skills are even “mathematically correct”.
Sure, these skills are probably on the right track toward solving math problems—otherwise, the kid wouldn’t have felt as confident about them. But would this approach not ignore skills the student needs to work on, or even amplify “bad” skills? (Or maybe this is just a faulty analogy and I need to re-read the paper)
You do need a minimum degree of competence in the domain before your own judgement is sufficient to tell the difference between good and bad attempts. Though even for children, there are domains simple enough that they can make that determination. E.g., learning to stack blocks on top of each other has an obvious failure state, and children can learn to do it through trial and error, even though there is probably not a genetically hardcoded reward circuit for correctly stacking things on top of other things.
Math is a much more complex domain where self-directed learning works well, because mathematicians can formally verify the correctness of their attempts, and so have a reliable signal to identify good attempts at proving a theorem, developing a new approach, etc.
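As a concrete illustration of that kind of reliable signal (my example, not from the thread), a proof assistant gives an unambiguous pass/fail verdict on each attempt:

```lean
-- Lean accepts this attempt, because `Nat.add_comm` really does prove the goal...
theorem my_attempt (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- ...whereas a wrong attempt (say, claiming `a + b = a` for all naturals)
-- simply fails to type-check, so a self-directed learner gets a trustworthy
-- grade on every attempt without needing a teacher.
```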