Dario Amodei believes that LLMs/AIs can be aided to self-improve in a similar way to AlphaGo Zero (though LLMs/AIs will benefit from other things too, like scale), where the models can learn by themselves to gain significant capabilities.
The key for him is that Go has a set of rules that the AlphaGo model needs to abide by. These rules allow the model to become superhuman at Go with enough compute.
Dario essentially believes that to reach better capabilities, it will help to develop rules for all the domains we care about and that this will likely be possible for more real-world tasks (not just games like Go).
Therefore, I think the crux here is whether you think it is possible to develop rules for science (physics, chemistry, math, biology) and other domains such that the models can do this sort of self-play to become superhuman at each of the things we care about.
So far, we have examples like AlphaGeometry, which relies on our ability to generate many synthetic examples to help the model learn. This makes sense for the geometry use case, but how do we know if this kind of approach will work for the kinds of things we actually care about? For games and geometry, this seems possible, but what about developing a cure for Alzheimer’s or coming up with novel scientific breakthroughs?
So, you’ve got some of the following issues to resolve:
Success metrics
Potentially much slower feedback loops
Need real-world testing
That said, I think Dario is banking on:
AIs will have a large enough world model that they can essentially set up ‘rules’ that provide enough signal to the model in domains other than games and ‘special cases’ like geometry. For example, they can run physics simulations of candidate materials informed by the latest research papers and use key metrics from the simulation as high-quality signals, reducing the number of real-world feedback loops needed. Or, for code, unit tests can serve as the signal alongside the goal.
Most of the things we care about (like writing code) will admit superhuman performance, which will then lift up other domains that wouldn’t be able to become superhuman without it.
Science has been bottlenecked by slow humans, increasing complexity, and poor coordination; AIs will be able to resolve these issues.
Even if you can’t generate novel breakthrough synthetic data, you can use synthetic data to nudge your model along the path to making breakthroughs.
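The unit-tests-as-signal idea is the easiest of these to make concrete: treat the fraction of passing tests as a reward for candidate programs, and train toward the highest-scoring candidates. A minimal sketch (the candidate functions and tests here are invented for illustration, not any lab’s actual pipeline):

```python
# Sketch: score candidate implementations by the fraction of unit tests
# they pass, and use that score as a reward signal. Candidates and tests
# are toy examples.

def run_tests(candidate, tests):
    """Return the fraction of (args, expected_output) tests the candidate passes."""
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed / len(tests)

# Two hypothetical model-generated candidates for "absolute value":
buggy = lambda x: x                    # wrong for negative inputs
correct = lambda x: -x if x < 0 else x

tests = [((3,), 3), ((-3,), 3), ((0,), 0)]

rewards = {"buggy": run_tests(buggy, tests),
           "correct": run_tests(correct, tests)}
best = max(rewards, key=rewards.get)   # the training signal favors "correct"
```

In a real setup the candidates would be sampled from the model and the reward fed back through RL or filtered fine-tuning, but the signal itself is exactly this cheap and exactly this objective, which is why code is such a promising domain for it.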
Thoughts?
Hey @Zac Hatfield-Dodds, I noticed you are looking for citations; these are the interview bits I came across (and here at 47:31).
It’s possible I misunderstood him; please correct me if I did!
I don’t think any of these amount to a claim that “to reach ASI, we simply need to develop rules for all the domains we care about”. Yes, AlphaGo Zero reached superhuman levels on the narrow task of playing Go, and that’s a nice demonstration that synthetic data could be useful, but it’s not about ASI and there’s no claim that this would be either necessary or sufficient.
(not going to speculate on object-level details though)
Ok, totally; there’s no specific claim about ASI. Will edit the wording.
I think this type of autonomous learning is fairly likely to be achieved soon (1-2 years), and it doesn’t need to follow exactly AlphaZero’s self-play model.
The world has rules. Those rules are much more complex and stochastic than games or protein folding. But note that the feedback in Go comes only after something like 200 moves, yet the powerful critic head is able to use that to derive a good local estimate of what’s likely a good or bad move.
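The mechanism by which a single end-of-game outcome becomes a per-move estimate can be sketched with temporal-difference learning: the critic bootstraps its value estimates backward from the terminal reward. A toy tabular version (this is the textbook TD(0) update, not AlphaGo’s actual architecture):

```python
# Sketch: TD(0) value bootstrapping on a toy chain of states.
# Reward arrives only at the final transition, yet after repeated sweeps
# the critic assigns graded values to every earlier state. This is the
# core mechanism (in tabular form) behind a critic head's local estimates.

n_states = 5           # states 0..4; reaching state 4 yields reward 1
gamma = 0.9            # discount factor
alpha = 0.1            # learning rate
V = [0.0] * n_states   # value estimates, initially uninformative

for _ in range(500):                   # repeated "games"
    for s in range(n_states - 1):      # deterministic walk toward the end
        r = 1.0 if s + 1 == n_states - 1 else 0.0
        target = r + gamma * V[s + 1]  # bootstrap from the next state's value
        V[s] += alpha * (target - V[s])

# Early states now carry discounted credit for the eventual win,
# even though they never received a direct reward themselves.
```

The point for the broader argument: sparse, delayed feedback is not a fundamental blocker if you have a critic that can propagate it; the question is whether real-world domains give you even that sparse terminal signal.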
Humans use a similar powerful critic in the dopamine system working in concert with the cortex’s rich world model to decide what’s rewarding long before there’s a physical reward or punishment signal. This is one route to autonomous learning for LLM agents. I don’t know if Amodei is focused on base models or hybrid learning systems, and that matters.
Or maybe it doesn’t. I can think of more human-like ways of autonomous learning in a hybrid system, but a powerful critic may be adequate for self-play even in a base model. Existing RLHF techniques do use a critic; I believe the last setup OpenAI publicly reported used proximal policy optimization (or was it DPO?). (I haven’t looked at Anthropic’s RLAIF setup to see if they use a similar critic component of the model; I’d guess they do, following OpenAI’s success with it.)
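For what it’s worth, DPO is the odd one out here: it dispenses with an explicit critic and trains directly on preference pairs. The per-pair loss is simple enough to sketch numerically (the log-probabilities below are made-up numbers, not real model outputs):

```python
# Sketch: the DPO loss for a single preference pair, computed from
# log-probabilities under the policy (pi_*) and a frozen reference
# model (ref_*). No separate critic/value network is involved.

import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * implicit-reward margin)) for one pair."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up log-probs: the policy already prefers the chosen response a bit
# more than the reference does, so the loss falls below log(2) ≈ 0.693.
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.5)
```

Whether the critic lives in a value head (PPO) or is implicit in the preference objective (DPO) matters less for this argument than the fact that both need some source of preference or reward signal in the first place.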
I’d expect they’re experimenting with using small sets of human feedback to leverage self-critique as in RLAIF, making a better critic that makes a better overall model.
Decomposing video into text and then predicting how people behave both physically and emotionally offer two new windows onto the rules of the world. I guess those aren’t quite in the self-play domain on their own, but having good predictions of outcomes might allow autonomous learning of agentic actions by taking feedback not from a real or simulated world, but from that trained predictor of physical and social outcomes.
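That last idea, using a trained predictor as the feedback source, can be sketched as best-of-n selection: sample candidate actions, score each with the predictor’s estimate of the outcome, and train toward the highest-scoring ones. Everything below is a stub standing in for model calls:

```python
# Sketch: best-of-n action selection against a learned outcome predictor,
# standing in for feedback from the real world. `predict_outcome` and
# `propose_actions` are hypothetical stubs; in practice both would be
# model calls (the predictor trained on video/text outcome data).

import random

def predict_outcome(action: str) -> float:
    """Stub for a trained predictor of physical/social outcomes.
    Here it just favors actions containing the word 'ask'."""
    return 0.9 if "ask" in action else 0.2

def propose_actions(n: int) -> list[str]:
    """Stub for the agent's policy sampling candidate actions."""
    pool = ["ask for clarification", "guess silently", "ask a colleague"]
    return random.choices(pool, k=n)

candidates = propose_actions(8)
scored = sorted(candidates, key=predict_outcome, reverse=True)
best_action = scored[0]   # would become a positive training example
```

The appeal is that the feedback loop runs at the predictor’s speed rather than the world’s; the risk is that the agent learns to exploit the predictor’s errors rather than the world’s actual rules.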
Deriving a feedback signal directly from the world can be done in many ways. I expect there are more clever ideas out there.
So in sum, I don’t think this is guaranteed, but it’s quite possible.
Glancing back at this, I noted I missed the most obvious form of self-play: putting an agent in an interaction with another copy of itself. You could do any sort of “scoring” by having an automated evaluation of the outcome vs. the current goal.
This has some obvious downsides, in that the agents aren’t the same as people. But it might get you a good bit of extra training that predicting static datasets doesn’t give. A little interaction with real humans might be the cherry on top of the self-play whipped cream on the predictive learning sundae.
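The agent-vs-copy loop with an automated judge is easy to sketch. All three functions below are stubs standing in for model calls; the toy judge is not a serious scoring rule:

```python
# Sketch: two copies of an agent take alternating turns, and an automated
# judge scores the transcript against the stated goal. In practice each
# function would be a model call; here they are stubs.

def agent_reply(history: list, persona: str) -> str:
    """Stub for the model generating the next turn."""
    return f"{persona}: proposal {len(history) + 1}"

def judge(transcript: list, goal: str) -> float:
    """Stub for an automated scorer of outcome vs. goal
    (in practice, another model call or a checklist)."""
    return min(1.0, len(transcript) / 6)   # toy: longer exchanges score higher

def self_play_episode(goal: str, turns: int = 6) -> float:
    transcript = []
    for i in range(turns):
        persona = "A" if i % 2 == 0 else "B"   # alternate the two copies
        transcript.append(agent_reply(transcript, persona))
    return judge(transcript, goal)             # episode reward for training

score = self_play_episode("agree on a meeting time")
```

The episode reward would then drive RL or filtering exactly as in the unit-test case, with the judge playing the role the game rules play in Go.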
I am fairly skeptical that we don’t already have something close enough to approximate this, if we had access to all the private email logs of the relevant institutions matched to some sort of “when this led to an outcome” metric (e.g., when the relevant preprint, strategy deck, or whatever was released).
Go has rules, and gives you direct and definitive feedback on how well you’re doing, but, while a very large space, it isn’t open-ended. A lot of the foundation model companies appear to be busily thinking about doing something AlphaZero-inspired in mathematics, which also has rules, and can be arranged to give you direct feedback on how you’re doing (there have been recent papers on how to make this more efficient with less human input). The same goes for writing and debugging software. Indeed, models have recently been getting better at math and coding faster than at other topics, suggesting that they’re making real progress. When I watched that Dario interview (the Scandinavian bank one, I assume) my assumption was that Dario was talking about those, but using AlphaGo as a clearer and more widely-familiar example.
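What makes math rule-like in the relevant sense is that a proposed answer can be checked mechanically, without a human in the loop. A trivial illustration of that checking step (the "conjecture" here is of course a known identity, chosen only to keep the example self-contained):

```python
# Sketch: math supplies automatic feedback because proposals can be
# checked mechanically. Toy example: verify a conjectured closed form
# for 1 + 2 + ... + n against direct computation.

def conjectured(n: int) -> int:
    """The model's proposed closed form."""
    return n * (n + 1) // 2

def brute_force(n: int) -> int:
    """Ground truth by direct summation."""
    return sum(range(1, n + 1))

# The "reward" is simply whether the proposal survives checking.
verified = all(conjectured(n) == brute_force(n) for n in range(200))
```

Formal proof assistants push this much further (a verified Lean proof is a perfect reward signal), which is presumably why math is where the AlphaZero-style work is concentrating first.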
Expanding this to other areas seems like the next step: robotics looks promising and also gives you a lot of rapid feedback; science would be fascinating and exciting, but its feedback loops are much longer; and human interactions (on something like the Character AI platform) seem like another possibility (though the result of that might be models better at human manipulation and/or pillow-talk, which might not be entirely a good thing).