Why do many people think RL will produce “agents”, but maybe (self-)supervised learning ((S)SL) won’t? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let’s consider the technical differences between the training regimes.
In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.
Some of this isn’t new (see Hidden Incentives for Auto-Induced Distributional Shift), but I think it’s important and felt like writing up my own take on it. Maybe this becomes a post later.
[Exact gradients] RL’s credit assignment problem is harder than (self-)supervised learning’s. In RL, if an agent solves a maze in 10 steps, it gets (discounted) reward; this trajectory then provides a set of reward-modulated gradients to the agent. But if the agent could have solved the maze in 5 steps, the agent isn’t directly updated to be more likely to do that in the future; RL’s gradients are generally inexact, not pointing directly at intended behavior.
On the other hand, if a supervised-learning classifier outputs dog when it should have output cat, then e.g. cross-entropy loss + correct label yields a gradient update which tweaks the network to output cat next time for that image. The gradient is exact.
I don’t think this is really where the “agentic propensity” of RL comes from, conditional on such a propensity existing (I think it probably does).
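To make the gradient contrast concrete, here’s a minimal PyTorch-flavored sketch (the toy network, the single-step “trajectory”, and the REINFORCE-style loss are purely illustrative assumptions, not anything load-bearing): the supervised update gets a gradient pointing directly at the correct label, while the RL update only scales the log-probability gradient of whatever the policy happened to sample by the return it happened to receive.
```python
import torch
import torch.nn.functional as F

net = torch.nn.Linear(4, 3)  # toy network: 4 features in, 3 classes/actions out

# Supervised learning: the label supplies an exact target, so the gradient
# points directly at "output class 2 for this input".
x, label = torch.randn(1, 4), torch.tensor([2])
F.cross_entropy(net(x), label).backward()

net.zero_grad()

# RL (REINFORCE-style): the gradient only reinforces the action the policy
# happened to sample, weighted by the return it happened to receive.
# Nothing points at the unseen 5-step solution.
state = torch.randn(1, 4)
dist = torch.distributions.Categorical(logits=net(state))
action = dist.sample()
return_ = 1.0  # e.g. discounted reward for the 10-step maze solve
(-return_ * dist.log_prob(action)).sum().backward()
```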
[Independence of data points] In RL, the agent’s policy determines its actions, which determine its future experiences (a.k.a. state-action-state′ transitions), which determine its future rewards (R(s,a,s′)), which determine its future cognitive updates.
In (S)SL, there isn’t such an entanglement (assuming teacher forcing in the SSL regime). Whether the network outputs cat or dog now doesn’t really affect the future data distribution shown to the agent.
After a few minutes of thought, I think the relevant criterion is:
P(d′ at time t′ | output a on datum d at time t) = P(d′ at time t′ | d at time t)
where d, d′ are data points ((s,a,s′) tuples in RL, (x,y) labelled datapoints in supervised learning, (x_{1:t−1}, x_t) context-completion pairs in self-supervised predictive text learning, etc.).
Most RL regimes break this assumption pretty hard.
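As a toy sketch of where that criterion does and doesn’t hold (policy, env_step, model, and corpus below are placeholder names, not a real API), the only structural difference is whether the model’s output feeds back into the distribution of future data:
```python
import random

def rl_stream(policy, env_step, state, n=100):
    """RL: the next datum d' depends on the action the policy just output."""
    data = []
    for _ in range(n):
        action = policy(state)
        next_state, reward = env_step(state, action)  # distribution of d' shifts with a
        data.append((state, action, next_state, reward))
        state = next_state
    return data

def ssl_stream(model, corpus, n=100):
    """Teacher-forced (S)SL: the next datum is drawn the same way no matter
    what the model outputs; its predictions only enter through the loss."""
    data = []
    for _ in range(n):
        x, y = random.choice(corpus)  # P(d') is independent of past outputs
        _prediction = model(x)        # scored against y, but doesn't steer the stream
        data.append((x, y))
    return data
```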
Corollaries:
Dependence allows message-passing and chaining of computation across time, beyond whatever recurrent capacities the network has.
This probably is “what agency is built from”; the updates chaining cognition together into weak coherence-over-time. I currently don’t see an easy way to be less handwavy or more concrete.
Dependence should strictly increase path-dependence of training.
Amplifying a network using its own past outputs always breaks independence (see the sketch after this list).
I think that independence is the important part of (S)SL, not identical distribution; so I say “independence” and not “IID.”
EG Pre-trained initializations generally break the “ID” part.
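A schematic sketch of the amplification corollary above (the placeholder model.sample() / finetune stand in for whatever generation and fine-tuning procedure is actually used):
```python
def self_amplified_ssl(model, finetune, corpus, rounds=3, samples_per_round=1000):
    """Schematic: fine-tuning on the model's own generations re-couples the
    future training distribution to past outputs, even within the SSL regime."""
    for _ in range(rounds):
        generated = [model.sample() for _ in range(samples_per_round)]
        corpus = corpus + generated      # future data now depends on past outputs
        model = finetune(model, corpus)  # so do future cognitive updates
    return model
```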
Thanks to Quintin Pope and Nora Belrose for conversations which produced these thoughts.
I’m not inclined to think that “exact gradients” is important; in fact, I’m not even sure if it’s (universally) true. In particular, PPO / TRPO / etc. are approximating a policy gradient, right? I feel like, if some future magical technique was a much better approximation to the true policy gradient, such that it was for all intents and purposes a perfect approximation, it wouldn’t really change how I think about RL in general. Conversely, on the SSL side, you get gradient noise from things like dropout and the random selection of data in each batch, so you could say the gradient “isn’t exact”, but I don’t think that makes any important conceptual difference either.
(A central difference in practice is that SSL gives you a gradient “for free” each query, whereas RL policy gradients require many runs in an identical (episodic) environment before you get a gradient.)
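(For reference, the quantity being approximated there is the score-function policy gradient, and PPO’s clipped surrogate is one such approximation; notation is standard, with Â_t an advantage estimate and ε the clipping parameter:)
```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\Big],
\qquad
L^{\text{CLIP}}(\theta)
  = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],
\quad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
```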
In terms of “why RL” in general, among other things, I might emphasize the idea that if we want an AI that can (for example) invent new technology, it needs to find creative out-of-the-box solutions to problems (IMO), which requires being able to explore / learn / build knowledge in parts of concept-space where there is no human data. SSL can’t do that (at least, “vanilla SSL” can’t do that; maybe there are “SSL-plus” systems that can), whereas RL algorithms can. I guess this is somewhat related to your “independence”, but with a different emphasis.
I don’t have too strong an opinion about whether vanilla SSL can yield an “agent” or not. It would seem to be a pointless and meaningless terminological question. Hmm, I guess when I think of “agent” it has a bunch of connotations, e.g. an ability to do trial-and-error exploration, and I think that RL systems tend to match all those connotations more than SSL systems—at least, more than “vanilla” SSL systems. But again, if someone wants to disagree, I’m not interested in arguing about it.