It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they’re ultimately maximizing is just something highly correlated with it.
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?
Can you give an example of such a motivational structure, so I know we’re considering the same thing?
ML systems in general seem to be able to generalize to human-labeled categories in situations that aren’t in the training data (e.g. image classifiers working, LMs able to do poetry).
Agreed. I also think this is different from a very specific kind of generalization towards reward maximization.
I again think it is plausible (2-5%-ish) that agents end up primarily making decisions on the basis of a tight reward-correlate (e.g. the register value, or some abstract representation of their historical reward function), and about 60% that agents end up at least somewhat making decisions on the basis of reward in a terminal sense (e.g. all else equal, the agent makes decisions which lead to high reward values; I think people are reward-oriented in this sense). Overall I feel pretty confused about what’s going on with people, and I can imagine changing my mind here relatively easily.
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?
No, I mean that they’ll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example: the actual reward is “go to the coin”, and in the training data this coincides with “go to the right”. In test data from a similar distribution this coincides too.
Of course, this correlation breaks when the agent optimizes hard enough. But the point is that the agents you get are only those that optimize a plausible extrapolation of the reward signal in training, which will include agents that maximize the reward in most situations way more often than if you select a random agent.
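To make the “coincides in training” point concrete, here is a toy sketch of the coinrun situation. The gridworld, sizes, and the “walk right along the ground” policy below are invented for illustration; this is not the actual benchmark, just a minimal setup with the same structure.

```python
import random

# Toy sketch (invented gridworld, not the real CoinRun): during training the coin
# always sits at the right edge on the ground, so "go to the coin" and "go right"
# earn identical reward on every training level -- the training signal cannot
# tell the two objectives apart.
random.seed(0)
WIDTH, HEIGHT = 10, 5

def reward_of_go_right_policy(coin):
    # The policy walks right along the ground, so it collects the coin iff the
    # coin sits on a ground-level cell (y == 0) somewhere along that path.
    _, y = coin
    return 1.0 if y == 0 else 0.0

train_levels = [(WIDTH, 0)] * 1000                                    # coin always at the right edge
test_levels = [(random.randint(1, WIDTH), random.randint(0, HEIGHT))  # coin anywhere
               for _ in range(1000)]

print(sum(map(reward_of_go_right_policy, train_levels)) / 1000)  # 1.0: same reward as a true coin-seeker
print(sum(map(reward_of_go_right_policy, test_levels)) / 1000)   # ~1/6: the correlation has broken
```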
Is your point in:
I also think this is different from a very specific kind of generalization towards reward maximization
That you think agents won’t be maximizing reward at all?
I would think that even if they don’t ultimately maximize reward in all situations, the situations encountered in test will be similar enough to training that agents will still kind of maximize reward there. (And agents definitely behave as reward maximizers on the specific training points that were seen, because that’s what SGD is selecting for.)
I’m not sure I understand what we disagree on at the moment.
I’m going to just reply with my gut responses here, hoping this clarifies how I’m considering the issues. Not meaning to imply we agree or disagree.
which will include agents that maximize the reward in most situations way more often than if you select a random agent.
Probably, yeah. Consider a network which received lots of policy gradients from the cognitive-update-intensity-signals (“rewards”[1]) generated by the “go to coin?” subroutine. I agree that this network will, in the deployment distribution, tend to take actions which average a higher sum-cognitive-update-intensity-signal (“reward over time”) than networks which are randomly initialized, or even which have randomly sampled shard compositions/values (in some reasonable sense).
But this doesn’t seem like it constrains my predictions too strongly. It seems like a relatively weak, correlational statement, where I’d be better off reasoning mechanistically about the likely “proxy-for-reward” values which get learned.
And agents definitely behave as reward maximizers on the specific training points that were seen, because that’s what SGD is selecting for
I understand you to argue: “SGD will select policy networks for maximizing reward during training. Therefore, we should expect policy networks to behaviorally maximize reward on the training distribution over episodes.” On this understanding of what you’re arguing:
No, agents often do not behave as reward maximizers on the specific training points that were seen. RL trains agents which don’t maximize training reward… all the time!
Agents:
die in video games (see DQN),[2]
fail to perform the most expert tricks and shortcuts (is AlphaZero playing perfect chess?),
(presumably) fail to exploit reward hacking opportunities which are hard to explore into.
For the last point, imagine that AlphaStar could perform a sequence of 300 precise actions, and then get +1 million policy-gradient-intensity (“reward”) due to a glitch. On the reasoning I understand you to advance, SGD is “selecting” for networks which receive high policy-gradient-intensity, but… it’s never going to happen in realistic amounts of time. Even in training.
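As a back-of-the-envelope check on “never going to happen” (the numbers below are invented, and deliberately generous to the explorer):

```python
# Suppose the glitch needs 300 specific actions in a row, and at each step the
# behavior policy happens to pick the "right" action with probability 0.5 --
# already wildly generous for an action space like StarCraft's.
p_step, n_steps = 0.5, 300
p_sequence = p_step ** n_steps
print(f"{p_sequence:.1e}")  # ~4.9e-91 per episode; even 1e20 training episodes
                            # leave the expected number of discoveries essentially zero.
```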
This is because SGD is updating the agent on the observed empirical data distribution, as collected by the policy at previous timesteps. SGD isn’t updating the agent on things which didn’t happen. And so SGD itself isn’t selecting for reward maximizers. Maybe if you run the outer training loop long enough, such that the agent probabilistically explores into this glitch (a long time), maybe then this reward-maximizing policy gets “selected for.”[3]
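Here is a minimal score-function (“REINFORCE-style”) sketch of that mechanistic point. The three-armed bandit, the rewards, and all the numbers are made up, but the structure of the update is standard: the gradient is built only from actions that were actually sampled, so a huge reward attached to a never-visited action never enters any update.

```python
import numpy as np

# Toy 3-armed bandit with REINFORCE-style updates. Arm 2 is the "glitch": it carries
# a gigantic reward, but the policy essentially never samples it, so that reward
# never shows up in any gradient and the policy is never pushed toward it.
rng = np.random.default_rng(0)
logits = np.array([0.0, 0.0, -30.0])              # arm 2 starts (and stays) improbable
reward_table = {0: 1.0, 1: 0.1, 2: 1_000_000.0}
lr = 0.1

for _ in range(10_000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(3, p=probs)        # sample an action from the current policy
    r = reward_table[a]               # the "cognitive-update-intensity-signal" for that action
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0             # gradient of log pi(a) w.r.t. the logits
    logits += lr * r * grad_log_pi    # the update uses only the sampled (a, r) pair

probs = np.exp(logits - logits.max())
print(probs / probs.sum())  # nearly all mass ends up on arm 0; the million-reward arm
                            # is untouched, because its reward was never experienced
```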
So there’s this broader question I have of “what, exactly, is being predicted by the ‘agents are selected to maximize reward during training’[4] hypothesis?”. It seems to me like we need to modify this hypothesis in various ways in order to handle the objections I’ve raised. And the ways we’re modifying the hypothesis (e.g. “well, it depends on the empirical data distribution, and expressivity constraints implicit in the inductive biases, and the details of exploration strategies, and the skill ceiling of the task”) seem to lead to us no longer predicting that the policy networks will actually maximize reward in training episodes.
(Also, I note that the context of this thread is that I generally don’t buy “SGD selects for X” arguments without mechanistic reasoning to back them up.)
[1] I’m substituting mechanistic descriptions of “reward” because that helps me think more clearly about what’s happening during training, without the suggestive-to-me connotations of “reward.”
[2] You can argue “DQN sucked”, but also DQN was a substantial advance at the time. Why should I expect that AGI will be trained on an architecture which actually gets maximal training reward, as opposed to squeaking by with a decent amount and still ending up very smart?
[3] But maybe not, because the network can mode collapse onto existing reinforced StarCraft strategies fast enough that P(explore into glitch) decreases exponentially with time, such that the final probability of exploring into the glitch is not in fact 1. (Haven’t checked the math on this, but feels plausible.)
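A quick numerical version of that intuition, with invented numbers: if the per-episode chance of stumbling into the glitch decays geometrically as the policy sharpens, the probability of ever finding it is bounded by a convergent sum and can sit far below 1.

```python
# Toy check: P(explore into glitch in episode t) = p0 * decay**t, episodes treated
# as independent. Both numbers are made up for illustration.
p0, decay = 1e-6, 0.99
p_never = 1.0
for t in range(100_000):
    p_never *= 1.0 - p0 * decay**t
print(f"{1.0 - p_never:.1e}")  # ~1e-4: the chance of ever exploring into the glitch is not 1
```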
[4] Paul Christiano recently made a similar claim: