I think this type of autonomous learning is fairly likely to be achieved soon (1-2 years), and it doesn’t need to follow exactly AlphaZero’s self-play model.
The world has rules. Those rules are much more complex and stochastic than games or protein folding. But note that the feedback in Go comes only after something like 200 moves, yet the critic (value) head is able to use that sparse signal to derive a good local estimate of whether a given move is likely good or bad.
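Roughly, in code (a toy sketch of the idea, not AlphaZero's actual architecture; the dimensions, encoding, and training loop are made up for illustration):

```python
import torch
import torch.nn as nn

# Toy value head: maps a board encoding to a scalar estimate of the final outcome.
# The only training signal is the game result (+1 / -1), delivered ~200 moves later,
# but once trained the head gives a local "is this position good?" estimate.
class ValueHead(nn.Module):
    def __init__(self, board_dim=361, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(board_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),  # output in [-1, 1]
        )

    def forward(self, board):
        return self.net(board).squeeze(-1)

value_head = ValueHead()
opt = torch.optim.Adam(value_head.parameters(), lr=1e-3)

# One fake self-play game: every position shares the same final outcome.
positions = torch.randn(200, 361)   # stand-in for 200 board states
outcome = torch.full((200,), 1.0)   # +1 from the eventual winner's perspective

loss = nn.functional.mse_loss(value_head(positions), outcome)
opt.zero_grad(); loss.backward(); opt.step()
```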
Humans use a similar powerful critic in the dopamine system working in concert with the cortex’s rich world model to decide what’s rewarding long before there’s a physical reward or punishment signal. This is one route to autonomous learning for LLM agents. I don’t know if Amodei is focused on base models or hybrid learning systems, and that matters.
Or maybe it doesn’t. I can think of more human-like ways of autonomous learning in a hybrid system, but a powerful critic may be adequate for self-play even in a base model. Existing RLHF techniques do use a critic; I think it was proximal policy optimization (or possibly DPO?) in the last setup OpenAI publicly reported, and PPO trains a value head as its critic while DPO skips the separate critic. (I haven't looked at Anthropic's RLAIF setup to see whether they use a similar critic component of the model; I'd guess they do, following OpenAI's success with it.)
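For concreteness, the critic's role in a PPO-style RLHF step looks roughly like this (a hedged sketch with placeholder numbers, not any lab's actual pipeline):

```python
import torch

# Placeholder scores: a reward model rates whole sampled responses, while the
# critic (value model) predicts that score before seeing it.
rewards = torch.tensor([0.8, -0.2, 0.5])   # reward-model scores for 3 responses
values = torch.tensor([0.6, 0.1, 0.4])     # critic's predictions for the same responses

# Advantage: how much better each response was than the critic expected.
advantages = rewards - values

# The policy is pushed toward positive-advantage responses (clipped, PPO-style),
# while the critic is trained to shrink its prediction error.
ratio = torch.tensor([1.05, 0.9, 1.2])     # new_prob / old_prob per response (placeholder)
clipped = torch.clamp(ratio, 0.8, 1.2)
policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
critic_loss = (rewards - values).pow(2).mean()
```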
I’d expect they’re experimenting with using small sets of human feedback to bootstrap self-critique, as in RLAIF: a better critic makes a better overall model.
Decomposing video into text, and predicting how people behave both physically and emotionally, offer two new windows onto the rules of the world. I guess those aren’t quite in the self-play domain on their own, but good outcome predictions might allow autonomous learning of agentic actions, with feedback coming not from a real or simulated world but from that trained predictor of physical and social outcomes.
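As a sketch of that last idea (purely illustrative; every name here is a hypothetical stand-in, since none of these models exist in this form):

```python
def train_step(agent, outcome_predictor, score_against_goal, goal, state):
    """One update where the "environment" is a learned predictor, not the world.

    Hypothetical stand-ins:
      agent:              an LLM-based policy that proposes actions and can be updated
      outcome_predictor:  a model trained (e.g. on video-derived text) to predict the
                          physical and social consequences of an action
      score_against_goal: any automated comparison of the predicted outcome vs. the goal
    """
    action = agent.propose(state)
    predicted = outcome_predictor.predict(state, action)  # no real-world rollout needed
    reward = score_against_goal(predicted, goal)
    agent.update(state, action, reward)                   # any RL-style update rule
    return reward
```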
Deriving a feedback signal directly from the world can be done in many ways. I expect there are more clever ideas out there.
So in sum, I don’t think this is guaranteed, but it’s quite possible.
Glancing back at this, I noticed I missed the most obvious form of self-play: putting an agent in an interaction with another copy of itself. You could do any sort of “scoring” by having an automated evaluation of the outcome against the current goal.
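Something like this (again just a sketch; the judge, the dialogue harness, and the update method are all hypothetical):

```python
def self_play_episode(agent_a, agent_b, judge, goal, turns=10):
    """Two copies of the model interact; an automated judge scores the outcome vs. the goal."""
    transcript = []
    message = goal  # seed the interaction with the task description
    for _ in range(turns):
        reply_a = agent_a.respond(message)
        reply_b = agent_b.respond(reply_a)
        transcript.extend([reply_a, reply_b])
        message = reply_b
    score = judge.score(transcript, goal)  # automated evaluation of outcome vs. goal
    agent_a.update(transcript, score)      # both copies learn from the same scalar signal
    agent_b.update(transcript, score)
    return score
```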
That setup has some obvious downsides, in that the agents aren’t the same as people. But it might get you a good bit of extra training that predicting static datasets doesn’t give. A little interaction with real humans might be the cherry on top of the self-play whipped cream on the predictive-learning sundae.