At the same time, I don’t think that people who write papers about RL are off-track: I consider AIXI to be a good mathematical abstraction of many different RL algorithms, convergence theorems are valid for these algorithms, and thinking of RL in terms of reward maximisation doesn’t seem particularly misleading to me.
Do you have concrete examples of where convergence theorems apply to an interesting task with e.g. PPO? “There aren’t interesting examples like this which are alignment-relevant” seems like an important belief of mine, so if you know a counterexample, I’d be very grateful to learn about it and change my mind!
I might be misunderstanding you: take this with a grain of salt.
From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.
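To make the "reward in place A" picture concrete, here is a minimal toy sketch (my own illustration, not anything from the discussion): tabular Q-learning on a five-state corridor with reward only at the last state, which learns a greedy policy that reliably walks to that state. All the numbers and names are arbitrary illustrative choices.

```python
import random

# Toy tabular Q-learning on a 5-state corridor; reward only at state 4 ("place A").
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # step left / step right

def train(episodes=3000, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(N_STATES - 1)  # random non-goal start state
        for _ in range(20):  # episode length cap
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s2 = min(max(s + a, 0), N_STATES - 1)
            r = 1.0 if s2 == GOAL else 0.0
            # standard Q-learning update
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
            s = s2
            if s == GOAL:
                break
    return Q

Q = train()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
```

The point is purely the behavioural observation: the learned greedy policy steps right towards the rewarded state from every non-goal state, which is what "I put reward in place A and got an agent that goes towards A" amounts to in practice.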
I am not familiar with PPO. From this short article, in the section about TRPO:
Recall that due to approximations, theoretical guarantees no longer hold.
Is this what you are referring to? But is it important for alignment? Let’s say the conditions for convergence are not met anymore, the theorem can’t be applied in theory, but in practice I do get an agent that goes towards A, where I’ve put reward. Is it misleading to say that the agent is maximising reward?
(However, keep in mind that
I agree with Turner that modelling humans as simple reward maximisers is inappropriate
)
If you could unpack your belief “There aren’t interesting examples like this which are alignment-relevant”, I might be able to give a more precise/appropriate reply.
My overall contention is that viewing deep RL as an optimization problem, in which the RL algorithm tries to find a policy which tries to maximize reward over time, is fatally flawed and misunderstands the point and practicalities of real RL algorithms.[1]
Is it misleading to say that the agent is maximising reward?
I think it can be misleading, but it depends. “The network implements a policy which reliably reaches A”—that is what we observe. We can also state “this achieves a high numerical score on the policy-gradient-intensity (aka ‘reward’) statistic.” These statements are true and not very misleading, IMO. They don’t push me to consider unwarranted hypotheses about its generalization behavior, like that it cares about reward or will try to make reward come out high in general. These statements instead draw my attention towards our observations—“I wonder how the policy is internally structured so as to reliably reach A?” is a very fruitful question IMO.
One related problem is that RL papers often repeat “the point of RL is to train agents to maximize reward”, which leads to really bad implicit models of how RL works in my experience. I think it leads people to privilege the hypothesis that RL agents will optimize their own reward signal in some way, shape, or form.
If you could unpack your belief “There aren’t interesting examples like this which are alignment-relevant”, I might be able to give a more precise/appropriate reply.
Let’s consider one obstacle.
Convergence theorems require certain learning rate and state visitation schedules (agents don’t visit every state of the world infinitely many times, in real life) which can themselves be interrupted if an AI e.g. sets its own learning rate to zero (thus violating the theorem’s preconditions). As best I can fathom, the convergence theorems do not apply to situations we care about (e.g. an embodied agent which we want to manufacture diamonds for us), for generalizable reasons which won’t be fixed through more clever proof techniques: e.g. an agent which tried to visit every state infinitely many times would quickly die and stop exploring.
Or, say, in the LLM-finetuning case, where you’re doing RLHF to get the model to (hopefully!) help you brainstorm research topics, the agent simply won’t try out every token-sequence in its context window. That won’t happen a single time, let alone infinitely many times. Even finite-time guarantees won’t kick in in time to apply to reality.
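The combinatorics here are worth spelling out. A back-of-the-envelope sketch (the vocabulary size and context length below are illustrative round numbers I picked, not figures from the discussion):

```python
import math

# Rough count of distinct token sequences of length L over a vocabulary of size V.
# V and L are illustrative assumptions, not from the thread.
V, L = 50_000, 2_048

# V**L is far too large to compute directly, so work with its base-10 logarithm.
log10_sequences = L * math.log10(V)  # log10(V**L): thousands of digits
```

With these numbers the count has on the order of ten thousand digits, so "visit every state" is not merely impractical but astronomically out of reach for any finite training run.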
Or even if exploration weren’t an issue, the agent could—as I mentioned—simply set its learning rate to zero. How can the theorems help us there?
Again, as best I can fathom—there’s no clever argument or proof strategy that gets you around that obstacle, if you want to just apply the standard results to agents which can die or which operate on reasonable timescales or which can modify the learning rate schedule we nominally set.
(And then we can also talk about expressivity issues, learning dynamics being nonstationary over time, etc. The learning rate/state visitation obstacle is a sufficient blocker for convergence theorems IRL, but not itself a crux for me, in that I still wouldn’t expect you can apply the theorems even if the LR “issue” vanished.)
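For concreteness, the standard learning-rate precondition being gestured at here is the Robbins-Monro condition from stochastic approximation: the step sizes must satisfy sum of alpha_t = infinity and sum of alpha_t squared < infinity. A toy numerical sketch (my own illustration) of why a schedule the agent zeroes out violates it:

```python
# Partial sums illustrating the Robbins-Monro learning-rate conditions:
# sum(alpha_t) must diverge while sum(alpha_t**2) converges.
T = 100_000

# alpha_t = 1/t satisfies both conditions:
s1 = sum(1.0 / t for t in range(1, T + 1))           # grows like ln(T): diverges
s2 = sum((1.0 / t) ** 2 for t in range(1, T + 1))    # approaches pi^2/6: converges

# A schedule the agent sets to zero after step 100 violates the first condition:
# the partial sum stops growing, so sum(alpha_t) is finite and the theorem's
# hypothesis fails.
s3 = sum(1.0 / t if t <= 100 else 0.0 for t in range(1, T + 1))
```

The divergent sum is what lets the algorithm keep correcting itself indefinitely; once the schedule is zeroed, the total remaining movement of the parameters is bounded, and no clever proof technique restores the guarantee.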
See more writing on this: Reward is not the optimization target and Models Don’t “Get Reward”.
My understanding is that after a lot of simplifications, policy gradients just take a noisy gradient step in the direction of minimising Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, it is ‘guaranteed’ to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation, and PPO’s regularisation bits, these guarantees do not hold anymore, but the fundamental RL they are built off of does have them. The place to go deeper into this is Sutton and Barto’s textbook and also Bertsekas’ dynamic programming textbook.
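As a minimal sketch of that “noisy gradient step” picture (my own toy example; note that vanilla REINFORCE ascends expected return directly, rather than minimising a Bellman error as value-based methods do): a softmax policy on a two-armed bandit, updated with the score-function estimator. Each update is a noisy sample of the policy gradient, and over many steps the policy concentrates on the better arm.

```python
import math
import random

# Minimal REINFORCE on a two-armed bandit (arm 1 pays more in expectation).
# All numbers are illustrative choices, not anything canonical.
def reinforce(steps=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # softmax logits, one per arm
    for _ in range(steps):
        z = [math.exp(t) for t in theta]
        probs = [x / sum(z) for x in z]
        a = 0 if rng.random() < probs[0] else 1
        # Noisy rewards: arm 1 has higher mean (1.0 vs 0.2), same noise.
        r = rng.gauss(1.0 if a == 1 else 0.2, 1.0)
        # Score-function update: grad log pi(a)_i = 1{i == a} - probs[i].
        for i in range(2):
            theta[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return probs

probs = reinforce()
```

Each step uses only a single sampled action and reward, so individual updates are high-variance; the convergence story is about the expected direction of those updates, which is exactly what breaks down once the theorems’ sampling and step-size preconditions stop holding.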
Yeah, I’ve read those books, although I admit to heavily skimming Bertsekas.