If the specific claim was closer to “yes, RL algorithms if run until convergence have a good chance of ‘caring’ about reward, but in practice we’ll never run RL algorithms that way”, then I think this would be a much stronger objection.
This is part of the reasoning (and I endorse Steve’s sibling comment, while disagreeing with his original one). I guess I was baking “convergence doesn’t happen in practice” into my reasoning, that there is no force which compels agents to keep accepting policy gradients from the same policy-gradient-intensity-producing function (aka the “reward” function).
From Reward is not the optimization target:

But obviously [the coherence theorems’] conditions aren’t true in the real world. Your learning algorithm doesn’t force you to try drugs. Any AI which e.g. tried every action at least once would quickly kill itself, and so real-world general RL agents won’t explore like that because that would be stupid. So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit.
Perhaps my claim was even stronger, that “we can’t really run real-world AGI-training RL algorithms ‘until convergence’, in the sense of ‘surmounting exploration issues’.” Not just that we won’t, but that it often doesn’t make sense to consider that “limit.”
Also, to draw conclusions about e.g. running RL for infinite data and infinite time so as to achieve “convergence”: limits have to be well-defined, with arguments for why the limits are a reasonable abstraction, and with care taken over the order in which we take them.
What about the part where the sun dies in finite time? To avoid that, are we assuming a fixed data distribution (e.g. over observation-action-observation-reward tuples in a maze-solving environment, with more tuples drawn from embodied navigation)?
What is our sampling distribution of data, since “infinite data” admits many kinds of relative proportions?
Is the agent allowed to stop or modify the learning process?
(Even finite-time learning theory results don’t apply if the optimized network can reach into the learning process and set its learning rate to zero, thereby breaking the assumptions of the theorems. A toy sketch of this failure mode follows these questions.)
Do we take the limit to infinite data and then the limit to infinite time, or vice versa, or both at once?
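To make the learning-rate point concrete, here is a minimal toy sketch (entirely my own hypothetical setup, not any real training stack): a single-parameter “policy” is optimized by gradient descent, but it can emit an action that zeroes the learning rate, at which point the usual convergence story simply stops applying.

```python
# Toy illustration (hypothetical setup, not anyone's actual training code):
# convergence results assume the learning process itself is fixed, but here the
# optimized "policy" can emit an action that zeroes the learning rate, freezing
# training far from the optimum the theorems would predict.

theta = 5.0   # single parameter; the loss theta**2 is minimized at 0
lr = 0.1      # learning rate the convergence argument assumes stays positive

def policy_action(theta):
    # Stand-in for "the optimized network reaching into the learning process":
    # once the parameter drifts past an arbitrary threshold, the policy
    # "chooses" to disable further updates.
    return "halt_training" if theta <= 4.0 else "keep_training"

for step in range(100):
    if policy_action(theta) == "halt_training":
        lr = 0.0                 # assumption of the convergence theorem broken
    grad = 2 * theta             # gradient of the loss theta**2
    theta = theta - lr * grad

print(theta)  # stuck at 4.0, nowhere near the "converged" value of 0
```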
For example, say they are doing early stopping to prevent RL from Goodharting on the given proxy reward: when are they deciding to stop? My guess is they will stop when a second proxy reward is high. So we still end up selecting in favor of a sum of first proxy reward + second proxy reward. Indeed, you can generally think of RL + explicit regularization itself as RL on a different cost function.
I disagree. Early stopping on a separate stopping criterion which we don’t run gradients through is not at all similar [EDIT: seems in many ways extremely dissimilar] to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward. Where is the reinforcement, where are the gradients, in the first case? They don’t flow back through performance on the stopping criterion. These are just mechanistically different processes.
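To spell out the mechanistic difference I mean, here is a minimal sketch (my own toy framing, with made-up one-dimensional rewards): in (a) the criterion shapes the gradient itself, while in (b) it is only ever read to decide when to stop, and no gradient flows through it.

```python
# Toy contrast (hypothetical rewards, not a real training setup):
# (a) RL on a joint cost function: the stopping criterion is part of the gradient.
# (b) RL on the proxy reward alone, with early stopping on a separate criterion:
#     the criterion is only *read*, never differentiated through.

def proxy_reward(t):          # "first proxy reward", used for gradients
    return -(t - 3.0) ** 2

def stopping_criterion(t):    # held-out criterion (hypothetical choice)
    return -(t - 2.0) ** 2

def grad(f, t, eps=1e-4):     # finite-difference gradient for the toy
    return (f(t + eps) - f(t - eps)) / (2 * eps)

# (a) Joint cost function: gradients incorporate both terms additively.
theta_a = 0.0
for _ in range(200):
    theta_a += 0.05 * grad(lambda t: proxy_reward(t) + stopping_criterion(t), theta_a)

# (b) Gradients from the proxy reward only; the criterion just halts training.
theta_b, best = 0.0, float("-inf")
for _ in range(200):
    theta_b += 0.05 * grad(proxy_reward, theta_b)
    if stopping_criterion(theta_b) < best:   # criterion started dropping -> stop
        break
    best = stopping_criterion(theta_b)

print(theta_a)  # settles at 2.5, the optimum of the *sum* of the two terms
print(theta_b)  # climbs toward the proxy optimum (3.0), halting once the criterion peaks near 2.0
```

The point is not the particular numbers but that in (b) no term in the update pushes the parameters toward the stopping criterion; it only truncates a trajectory the proxy-reward gradient was already producing.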
Separately, consider how many bits of selection this exerts toward the stopping criterion being high. Suppose you run 12 random seeds on a make-people-smile task. (Er, I’m stuck, can you propose the stopping-criterion reward you had in mind?) Anyways, my intuition here was just that you aren’t getting more than log₂(12) < 4 bits of selection on any given quantity from taking statistics of the seeds. That seems far too small to conclude “this is meaningfully selecting for the sum of first + second reward.”
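Spelling out the arithmetic behind that bound, under the usual heuristic (my framing, not a theorem from this thread) that picking the best of N independent runs applies at most log₂ N bits of optimization to the quantity you pick on:

$$\text{bits of selection from best-of-}N \;\le\; \log_2 N, \qquad \log_2 12 \approx 3.58 < 4.$$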
“Early stopping on a separate stopping criterion which we don’t run gradients through, is not at all similar to reinforcement learning on a joint cost function additively incorporating the stopping criterion with the nominal reward.”
Sounds like this could be an interesting empirical experiment. Similar to Scaling Laws for Reward Model Overoptimization, you could start with a gold reward model that represents human preference. Then you could try to figure out the best way to train an agent to maximize gold reward using only a limited number of sampled data points from that gold distribution. For example, you could train two reward models on different samples from the same distribution and use one for training and the other for early stopping. (This is essentially the train / val split used in typical ML settings with data constraints.) You could measure the best ways to maximize gold reward on a limited data budget. Alternatively, your early-stopping RM could be trained on samples from the gold RM under distribution shift, with data augmentations, adversarial perturbations, or a challenge set of particularly hard cases.
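If it helps, here is a hedged sketch of the wiring I have in mind (every function and constant below is a placeholder I made up, collapsing “policy optimization” down to a one-dimensional toy; a real version would use learned reward models and an actual RL loop):

```python
# Sketch of the proposed setup, loosely following the "gold RM vs. proxy RM" framing:
# a gold reward stands in for human preference, two proxy RMs are fit on disjoint
# noisy samples from it, one proxy drives optimization, the other only triggers
# early stopping, and everything is scored against gold at the end.
import random

random.seed(0)

def gold_reward(x):
    return -(x - 1.0) ** 2        # "true" preference peaks at x = 1

def fit_proxy_rm(n_samples):
    # Toy stand-in for reward-model training: a noisy estimate of where gold peaks.
    # A real version would fit a learned RM on preference comparisons.
    noisy_peak = sum(1.0 + random.gauss(0, 0.7) for _ in range(n_samples)) / n_samples
    return lambda x: -(x - noisy_peak) ** 2

rm_train = fit_proxy_rm(8)        # proxy used for the optimization gradient
rm_stop = fit_proxy_rm(8)         # proxy used only to decide when to stop (the "val split")

x, best_stop, best_x = 0.0, float("-inf"), 0.0
for step in range(300):
    eps = 1e-4
    x += 0.05 * (rm_train(x + eps) - rm_train(x - eps)) / (2 * eps)   # ascend rm_train
    if rm_stop(x) > best_stop:
        best_stop, best_x = rm_stop(x), x

print("gold reward, early-stopped on rm_stop:", gold_reward(best_x))
print("gold reward, run to (near) convergence on rm_train:", gold_reward(x))
```

Swapping rm_stop’s training data for shifted, augmented, or adversarial samples would then test the variants listed above.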
Would you be interested to see an experiment like this? Ideally it could make progress towards an empirical study of how reward hacking happens and how to prevent it. Do you see design flaws that would prevent us from learning much? What changes would you make to the setup?