My overall contention is that viewing deep RL as an optimization problem, in which the RL algorithm tries to find a policy which tries to maximize reward over time, is fatally flawed and misunderstands the point and practicalities of real RL algorithms.[1]
Is it misleading to say that the agent is maximising reward?
I think it can be misleading, but it depends. “The network implements a policy which reliably reaches A”—that is what we observe. We can also state “this achieves a high numerical score on the policy-gradient-intensity (aka ‘reward’) statistic.” These statements are true and not very misleading, IMO. They don’t push me to consider unwarranted hypotheses about the policy’s generalization behavior, like that it cares about reward or will try to make reward come out high in general. These statements instead draw my attention towards our observations—“I wonder how the policy is internally structured so as to reliably reach A?” is a very fruitful question, IMO.
One related problem is that RL papers often repeat “the point of RL is to train agents to maximize reward”, which leads to really bad implicit models of how RL works in my experience. I think it leads people to privilege the hypothesis that RL agents will optimize their own reward signal in some way, shape, or form.
If you could unpack your belief “There aren’t interesting examples like this which are alignment-relevant”, I might be able to give a more precise/appropriate reply.
Let’s consider one obstacle.
Convergence theorems require certain learning-rate and state-visitation schedules (agents don’t visit every state of the world infinitely many times, in real life), and these preconditions can themselves be interrupted if an AI e.g. sets its own learning rate to zero (thus violating the theorem’s assumptions). As best I can fathom, the convergence theorems do not apply to situations we care about (e.g. an embodied agent which we want to manufacture diamonds for us), for generalizable reasons which won’t be fixed through more clever proof techniques: e.g., an agent which tried to visit every state infinitely many times would quickly die and stop exploring.
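For concreteness, here is roughly the shape these preconditions take in the classic tabular Q-learning result (the exact statement varies by source): every state-action pair $(s,a)$ must be visited infinitely often, and the step sizes $\alpha_t(s,a)$ must satisfy

$$\sum_{t=0}^{\infty} \alpha_t(s,a) = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t(s,a)^2 < \infty,$$

in which case $Q_t \to Q^*$ with probability 1. Both conditions are statements about an idealized, never-ending training process.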
Or take the LLM-finetuning case, where you’re doing RLHF to get the model to (hopefully!) help you brainstorm research topics: the agent simply won’t try out every token sequence that fits in its context window. That won’t happen even a single time, let alone infinitely many times. And even finite-time guarantees won’t kick in soon enough to apply to reality.
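A back-of-the-envelope sketch makes the gap vivid (the vocabulary size, context length, and rollout count below are illustrative assumptions, not figures from any particular model or training run):

```python
import math

# Illustrative, made-up but representative numbers:
vocab_size = 50_000        # rough vocabulary size of a modern tokenizer
context_length = 4_096     # tokens per generated sequence
rlhf_rollouts = 10**6      # generous guess at how many sequences an RLHF run samples

# Number of distinct token sequences of this length, in log10
# (the number itself is astronomically large).
log10_sequences = context_length * math.log10(vocab_size)
print(f"distinct sequences: ~10^{log10_sequences:.0f}")   # ~10^19247

# Fraction of that space the run could visit even once, in log10.
log10_fraction = math.log10(rlhf_rollouts) - log10_sequences
print(f"fraction explored:  ~10^{log10_fraction:.0f}")    # ~10^-19241
```

“Visit every state infinitely often” isn’t merely unmet here; the visited fraction is so small that its exponent is itself astronomical.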
Or even if exploration weren’t an issue, the agent could—as I mentioned—simply set its learning rate to zero. How can the theorems help us there?
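To spell out why that breaks the preconditions above: under the usual step-size conditions, if the agent forces $\alpha_t = 0$ for all $t \geq T$, then

$$\sum_{t=0}^{\infty} \alpha_t = \sum_{t=0}^{T-1} \alpha_t < \infty,$$

so the divergence requirement fails, and in any case the value estimates (or policy parameters) simply stop changing from time $T$ onward, whatever rewards the environment emits afterward.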
Again, as best I can fathom, there’s no clever argument or proof strategy that gets you around that obstacle, if you want to just apply the standard results to agents which can die, which operate on reasonable timescales, or which can modify the learning-rate schedule we nominally set.
(And then we can also talk about expressivity issues, learning dynamics being nonstationary over time, etc. The learning-rate/state-visitation obstacle is a sufficient blocker for convergence theorems IRL, but it isn’t itself a crux for me: I still wouldn’t expect the theorems to apply even if the learning-rate “issue” vanished.)
See more writing on this: Reward is not the optimization target and Models Don’t “Get Reward”.