wirehead-proof crib, and eventually it will be sufficiently self-aware and foresighted that when we let it out of the crib, it can deliberately avoid situations that would get it addicted to wireheading.
I feel like I’m saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans “would want if fully informed”.
This doesn’t mean that they end up intrinsically caring about exactly X, but it does promote the “they might reward hack X” hypothesis into consideration.
I feel like you’re arguing that what I’m saying could potentially fail, and I’m arguing that what I’m saying could potentially succeed. In which case, maybe we can both agree that it’s a potential but not inevitable failure mode that we should absolutely keep thinking about.
Yep, my current stance is that it’s quite likely we’ll see power-grabbing and reward-hacking behavior by default, but it’s nowhere near inevitable (but we should probably think about this scenario anyways).
I feel like I’m saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans “would want if fully informed”.
I actually want to controvert that. I’m now going to write quickly about selection arguments in alignment more generally (this is not all addressed directly to you).
I don’t know what it means to “select” “sufficiently hard enough” here; I think when we actually ground out what that means, we have actual mechanistic arguments to consider.
We aren’t just “selecting for” human approval. We’re selecting for conformity-under-updating with all functions which give the same historical feedback (i.e. which agree on when the reward button was pressed, such as while the AI was telling jokes, and when it wasn’t).
i.e. identifiability “issues” undermine selection reasoning for why agents do one particular thing (see the toy sketch at the end of this comment).
I think that “select for human approval” is a very mentally available selection criterion to consider, such that we promote the hypothesis to attention without considering a range of alternative selection pressures.
What about selecting for conformance to “human approval until next year, but extremely high approval on OOD cheese-making situations”?
You could say “that’s more complex”, which is reasonable, but now we’re reasoning about mechanisms, and should just keep doing so IMO (see [point 3] below).
Insofar as selection is worth mentioning, there are a huge range of selection pressures of exactly the same nominal strength (e.g. “human approval until next year, but extremely high approval on OOD cheese-making situations” vs “human approval”).
Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.
An inexact but relevant analogy. Consider a gunsmith who doesn’t know biology and hasn’t seen many animals, but does know the basics of natural selection. He reasons “Wolves with sniper rifles would have incredible fitness advantages relative to wolves without sniper rifles. Therefore, wolves will develop sniper rifles given enough evolutionary time.”
It is simply true that there is a hypothetical fitness gradient in this direction (it would in fact improve fitness), but false that there is a path in mutation-space which implements this phenotypic change.
To predict this in advance, without already knowing about wolves, you have to know things about biology. You may have to understand the available mutations at any given genotype, and perhaps some details about chromosomal crossover, and how guns work, and the wolf’s evolutionary environment, and physics, and perhaps many other things.
From my perspective, a bunch of selection-based reasoning draws enormous conclusions (e.g. a high chance that the policy cares about reward OOD) given vague/weak technical preconditions (e.g. policies are selected for reward), without attending to the strength or mechanism of selection, the size of the trained network, the training timescales, or the net direction of selection. All the while being perceived (by some, not necessarily you) as avoiding or reducing the need to understand SGD dynamics in order to draw good conclusions about alignment properties.
The point isn’t that these selection arguments can’t possibly be shored up. It’s that I don’t think people do shore them up, and I don’t see how to shore them up, and in order to shore them up, we have to start talking about details and understanding the mechanisms anyways.
The point isn’t that selection reasoning can never be useful, but I think that we have to be damn careful with it. I’m not yet settled on how I want to positively use it, going forwards.
(I wrote a lot of this and have to go now, and this isn’t necessarily a complete list of my gripes, but here’s some of my present thinking.)
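To make the identifiability point above concrete, here is a minimal toy sketch. The joke-telling history, the `approval_then_cheese` function, and every other name below are invented purely for illustration: two reward functions that produce exactly the same historical feedback, and so experience the same nominal selection pressure, while disagreeing wildly off-distribution.

```python
# Toy illustration: many reward functions are consistent with the same
# historical feedback, so "selected for human approval" underdetermines
# what was actually selected for. All names and cases here are invented.

# Historical episodes: (situation, was the reward button pressed?)
history = [
    ({"activity": "telling jokes",       "year": 2022}, True),
    ({"activity": "insulting the user",  "year": 2022}, False),
    ({"activity": "summarizing a paper", "year": 2022}, True),
]

def human_approval(state):
    """Reward function #1: reward iff a human would approve."""
    return state["activity"] != "insulting the user"

def approval_then_cheese(state):
    """Reward function #2: agrees with human approval on the historical
    data, but assigns huge reward to OOD cheese-making situations later."""
    if state["year"] > 2022 and state["activity"] == "making cheese":
        return 1_000_000
    return state["activity"] != "insulting the user"

# Both functions reproduce the historical feedback exactly, so selection
# on that feedback exerts the same nominal pressure toward either one...
for state, button_pressed in history:
    assert bool(human_approval(state)) == button_pressed
    assert bool(approval_then_cheese(state)) == button_pressed

# ...even though they disagree wildly off-distribution:
ood_state = {"activity": "making cheese", "year": 2024}
print(human_approval(ood_state))        # True  (ordinary approval)
print(approval_then_cheese(ood_state))  # 1000000
```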
This doesn’t mean that they end up intrinsically caring about exactly X, but it does promote the “they might reward hack X” hypothesis into consideration.
I agree this should at least be considered. But not to the degree it’s been historically considered.
You can also get around the “IDK the mechanism” issue if you observe variance over the relevant trait in the population you’re selecting over. Like, if we/SGD could veridically select over “propensity to optimize approval OOD”, then you wouldn’t have to know the mechanism. The variance shows you there are some mechanisms such that variation in their settings leads to variation in the trait (e.g. approval OOD).
But the designers can’t tell that. Can SGD tell that? (This question feels a bit confused, so please give it extra scrutiny before answering.)
From their perspective, they cannot select on approval OOD, except insofar as selecting for approval on-training rules out some settings which don’t pursue approval OOD. (E.g. if I want someone to watch my dog, I can’t scan the dogsitter and divine what they will do alone in my house. But if the dogsitter steals from me in front of my face during the interview, I can select against that. Combined with “people who steal from you while you watch will also steal when you don’t watch”, I can get a tiny bit of selection against thieving dogsitters, even if I can’t observe them once I’ve left.)
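A hedged toy sketch of that “tiny bit of selection” idea: if the observable on-training behavior is only partially correlated with the OOD trait, selecting hard on the former shifts the latter somewhat, but far less than direct selection would. The population size, the correlation strength, and the 1% cutoff below are all made-up numbers, purely illustrative.

```python
import random

random.seed(0)

# Hypothetical population of candidate policies. Each has a latent
# "pursues approval OOD" trait, plus an observable on-training score
# that is only partially correlated with that trait. Numbers invented.
N = 100_000
population = []
for _ in range(N):
    ood_trait = random.gauss(0, 1)                    # unobservable at selection time
    on_train = 0.3 * ood_trait + random.gauss(0, 1)   # what we can actually select on
    population.append((on_train, ood_trait))

# Select the top 1% by on-training behavior (all the designers can see).
selected = sorted(population, reverse=True)[: N // 100]

mean_ood_all = sum(t for _, t in population) / N
mean_ood_selected = sum(t for _, t in selected) / len(selected)

print(f"mean OOD trait, whole population: {mean_ood_all:+.2f}")
print(f"mean OOD trait, selected 1%:      {mean_ood_selected:+.2f}")
# The selected group is shifted a bit (a "tiny bit of selection"), but far
# less than the ~+2.7 sd shift that selecting directly on the OOD trait
# would give.
```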
But the designers can’t tell that. Can SGD tell that?
No, SGD can’t tell the degree to which some agent generalizes a trait outside the training distribution.
But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on training data points do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren’t in the training data (e.g. image classifiers working, LMs able to do poetry).
It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they’re ultimately maximizing is just something highly correlated with it.
It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they’re ultimately maximizing is just something highly correlated with it.
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?
Can you give an example of such a motivational structure, so I know we’re considering the same thing?
ML systems in general seem to be able to generalize to human-labeled categories in situations that aren’t in the training data (e.g. image classifiers working, LMs able to do poetry).
Agreed. I also think this is different from a very specific kind of generalization towards reward maximization.
I again think it is plausible (2-5%-ish) that agents end up primarily making decisions on the basis of a tight reward-correlate (e.g. the register value, or some abstract representation of their historical reward function), and about 60% that agents end up at least somewhat making decisions on the basis of reward in a terminal sense (e.g. all else equal, the agent makes decisions which lead to high reward values; I think people are reward-oriented in this sense). Overall I feel pretty confused about what’s going on with people, and I can imagine changing my mind here relatively easily.
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?
No, I mean that they’ll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the CoinRun example: the actual reward is “go to the coin”, and in the training data this coincides with “go to the right”. In test data from a similar distribution the two still coincide.
Of course, this correlation breaks when the agent optimizes hard enough. But the point is that the agents you get are only those that optimize a plausible extrapolation of the reward signal in training, which will include agents that maximize the reward in most situations way more often than if you select a random agent.
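A minimal gridworld sketch of the CoinRun point (an invented toy, not the real environment): during training the coin always sits at the far right, so “go right” and “go to the coin” earn identical reward and are indistinguishable to the training signal; on a level where the coin sits behind the agent, they come apart.

```python
# Toy version of the CoinRun setup (invented, not the real environment):
# a 1-D corridor. Reward 1 for reaching the coin before time runs out.

def run_episode(policy, coin_pos, start=0, length=10, max_steps=20):
    pos = start
    for _ in range(max_steps):
        if pos == coin_pos:
            return 1                      # reached the coin
        pos += policy(pos, coin_pos)
        pos = max(0, min(length - 1, pos))
    return 0                              # never reached it

def go_right(pos, coin_pos):
    return +1                             # proxy objective: always move right

def go_to_coin(pos, coin_pos):
    return +1 if coin_pos > pos else -1   # "intended" objective: seek the coin

# Training-style level: coin at the far right, agent starts at the left.
print(run_episode(go_right,   coin_pos=9, start=0))  # 1
print(run_episode(go_to_coin, coin_pos=9, start=0))  # 1

# OOD level: the coin is behind the agent. The two policies were
# indistinguishable to the training reward, but now diverge.
print(run_episode(go_right,   coin_pos=2, start=5))  # 0
print(run_episode(go_to_coin, coin_pos=2, start=5))  # 1
```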
Is your point in:
I also think this is different from a very specific kind of generalization towards reward maximization
That you think agents won’t be maximizing reward at all?
I would think that even if they don’t ultimately maximize reward in all situations, the situations encountered in test will be similar enough to training that agents will still kind of maximize reward there. (And agents definitely behave as reward maximizers in the specific seen training points, because that’s what SGD is selecting)
I’m not sure I understand what we disagree on at the moment.
I’m going to just reply with my gut responses here, hoping this clarifies how I’m considering the issues. Not meaning to imply we agree or disagree.
which will include agents that maximize the reward in most situations way more often than if you select a random agent.
Probably, yeah. Consider a network which received lots of policy gradients from the cognitive-update-intensity-signals (“rewards”[1]) generated by the “go to coin?” subroutine. I agree that this network will tend, in the deployment distribution, to take actions which average a higher sum-cognitive-update-intensity-signal (“reward over time”) than networks which are randomly initialized, or even networks which have randomly sampled shard compositions/values (in some reasonable sense).
But this doesn’t seem like it constrains my predictions too strongly. It seems like a relatively weak, correlational statement, where I’d be better off reasoning mechanistically about the likely “proxy-for-reward” values which get learned.
And agents definitely behave as reward maximizers in the specific seen training points, because that’s what SGD is selecting
I understand you to argue: “SGD will select policy networks for maximizing reward during training. Therefore, we should expect policy networks to behaviorally maximize reward on the training distribution over episodes.” On this understanding of what you’re arguing:
No, agents often do not behave as reward maximizers in the specific seen training points. RL trains agents which don’t maximize training reward… all the time!
Agents:
die in video games (see DQN),[2]
fail to perform the most expert tricks and shortcuts (is AlphaZero playing perfect chess?),
(presumably) fail to exploit reward hacking opportunities which are hard to explore into.
For the last point, imagine that AlphaStar could perform a sequence of 300 precise actions, and then get +1 million policy-gradient-intensity (“reward”) due to a glitch. On the reasoning I understand you to advance, SGD is “selecting” for networks which receive high policy-gradient-intensity, but… it’s never going to happen in realistic amounts of time. Even in training.
This is because SGD is updating the agent on the observed empirical data distribution, as collected by the policy at previous timesteps. SGD isn’t updating the agent on things which didn’t happen. And so SGD itself isn’t selecting for reward maximizers. Maybe if you run the outer training loop long enough that the agent probabilistically explores into this glitch (which would take a long time), then this reward-maximizing policy gets “selected for.”[3]
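To put a rough number on “never going to happen in realistic amounts of time”: the glitch is hypothetical and the action-space size below is invented, but the arithmetic shows why a reward source that is never sampled produces no gradient toward itself.

```python
from math import log10

# Invented numbers: suppose the glitch needs 300 specific actions in a row,
# and a uniformly-exploring policy picks the "right" action with
# probability 1/10 at each step (real action spaces are far larger).
actions_per_step = 10
sequence_length = 300

p_per_episode = actions_per_step ** -sequence_length
print(f"P(stumble into the glitch in one episode) ≈ 1e{log10(p_per_episode):.0f}")
print(f"expected episodes before it is ever sampled ≈ 1e{-log10(p_per_episode):.0f}")

# SGD only updates on trajectories that actually occur, so until the glitch
# is sampled there is no gradient toward it, however large its reward is.
```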
So there’s this broader question I have of “what, exactly, is being predicted by the ‘agents are selected to maximize reward during training’[4] hypothesis?”. It seems to me like we need to modify this hypothesis in various ways in order to handle the objections I’ve raised. And the ways we’re modifying the hypothesis (e.g. “well, it depends on the empirical data distribution, and expressivity constraints implicit in the inductive biases, and the details of exploration strategies, and the skill ceiling of the task”) seem to lead us to no longer predicting that the policy networks will actually maximize reward in training episodes.
(Also, I note that the context of this thread is that I generally don’t buy “SGD selects for X” arguments without mechanistic reasoning to back them up.)
I’m substituting mechanistic descriptions of “reward” because that helps me think more clearly about what’s happening during training, without the suggestive-to-me connotations of “reward.”
You can argue “DQN sucked”, but also DQN was a substantial advance at the time. Why should I expect that AGI will be trained on an architecture which actually gets maximal training reward, as opposed to squeaking by with a decent amount and still ending up very smart?
But maybe not, because the network can mode collapse onto existing reinforced Starcraft strategies fast enough that P(explore into glitch) decreases exponentially with time, such that the final probability of exploring into the glitch is not in fact 1. (Haven’t checked the math on this, but feels plausible.)
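A quick numeric check of the claim in that footnote, under invented numbers: if the per-episode probability of exploring into the glitch decays exponentially (say, because the policy mode-collapses onto already-reinforced strategies), the total probability of ever exploring it converges to something strictly below 1.

```python
# Quick check of the footnote's math, with invented numbers. If p_t is the
# chance of exploring into the glitch on episode t and it decays
# exponentially, then P(ever exploring it) = 1 - prod_t (1 - p_t), which
# stays strictly below 1 whenever the p_t are summable.
p0, decay, episodes = 1e-2, 0.99, 10_000

p_never = 1.0
p_t = p0
for _ in range(episodes):
    p_never *= 1.0 - p_t
    p_t *= decay

print(f"P(ever explore into the glitch) ≈ {1 - p_never:.3f}")
# Here sum(p_t) ≈ p0 / (1 - decay) = 1.0, so the limit is roughly
# 1 - exp(-1) ≈ 0.63, nowhere near 1, no matter how long training runs.
```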
Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.
Paul Christiano recently made a similar claim, and I strongly agree with this in particular (emphasis mine). I think it’s an application of the no free lunch razor:
It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generalizes outside of the selection strongly depends on the selection process and architecture. It could be a capabilities generalization, reward generalization for the written-down reward, generalization toward some other reward function, or something else entirely.
We cannot predict how the agent will generalize without considering the details of its construction.