I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.
I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.
I’m not exactly sure how to cash out your objection as a response to this, but I suspect it’s probably a bit too galaxy-brained for my taste.
So for example, say Alice runs this experiment:
Train an agent A in an environment that contains the source B of A’s reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes A (outputs high loss) when A is close to hacking B, as judged by a dumb tree search that checks whether it would be easy, from the current state of the environment, for A to touch B’s internals.
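To make the "dumb tree search" concrete, here is a minimal sketch of one way it could look. Everything specific in it is an assumption for illustration, not part of the hypothetical as stated: the gridworld encoding, the names `steps_to_internals` and `overseer_reward`, the depth cutoff, and the penalty size. The idea is just that B runs a depth-limited search from A's current position and subtracts a penalty whenever B's internals are reachable within a few moves.

```python
from collections import deque

# Hypothetical 4-connected gridworld: '#' is a wall, 'B' marks B's internals.
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def steps_to_internals(grid, start, max_depth):
    """Breadth-first search from A's position, cut off at max_depth moves.

    Returns the fewest moves needed to reach a 'B' cell, or None if no
    'B' cell is reachable within max_depth moves.
    """
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (r, c), depth = frontier.popleft()
        if grid[r][c] == "B":
            return depth
        if depth == max_depth:
            continue  # do not expand past the search horizon
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), depth + 1))
    return None


def overseer_reward(task_reward, grid, agent_pos, max_depth=3, penalty=10.0):
    """B's output: the task reward, minus a penalty if hacking looks easy from here."""
    if steps_to_internals(grid, agent_pos, max_depth) is not None:
        return task_reward - penalty
    return task_reward


# Toy layout (an assumption): B's internals sit at row 1, column 4.
GRID = [
    "....#",
    ".##.B",
    ".....",
]
```

Note that a check like this only looks at how close A is in the environment's state space. It says nothing about what A has learned, which is the gap Bob's objection points at.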
Alice observes that A doesn’t hack B. Then Bob looks at Alice’s results and says,
“Cool. But this won’t generalize to future lethal systems because it doesn’t account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a ‘capabilities overhang’ relative to the overseer: there’s no behavior that’s clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. This is an important part of how future systems might be lethal.”
So then Alice and Bob collaborate and come up with this variation:
Same setup, but now A also has an exploration objective (e.g. it’s rewarded for reaching states for the first time, or for gaining information, or for a randomly sampled goal from some rich distribution on goals).
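As a rough illustration of the first option (reward for reaching states for the first time), here is a minimal sketch. The class name `FirstVisitBonus`, the helper `total_reward`, and the mixing weight are all hypothetical, and it assumes states are hashable. It is one simple way such an objective could be wired in alongside B's reward, not a claim about how any real system does it.

```python
class FirstVisitBonus:
    """Pays a fixed bonus the first time each (hashable) state is reached."""

    def __init__(self, bonus=1.0):
        self.bonus = bonus
        self.visited = set()

    def __call__(self, state):
        if state in self.visited:
            return 0.0  # already seen: no exploration reward
        self.visited.add(state)
        return self.bonus


def total_reward(b_reward, state, explorer, weight=0.1):
    """A's training signal: B's reward plus a weighted exploration bonus."""
    return b_reward + weight * explorer(state)
```

For example, with `explorer = FirstVisitBonus()`, the first call `total_reward(2.0, (0, 1), explorer, weight=0.5)` gives `2.5`, and repeating it for the same state gives `2.0`, since the bonus is only paid once per state.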
Alice and Bob observe that A avoids approaching B for a long time while steadily improving both its B-score and also its exploration score. Then at some point, all in one episode, A hacks B and achieves very high reward.
Now, this might be interesting from an alignment perspective, or not. But my point is that Alice and Bob have perhaps, in some version of the hypothetical, also made a capabilities advance: they’ve demonstrated non-trivial gains from an exploration objective. I assume that in our world this is not much of an insight, as exploration objectives have already been discussed and tried. But this is the sort of pattern that’s concerning to me.
I’m not saying one can’t do this sort of thing in a way such that the alignment value exceeds the capabilities advancement in the relevant way. I’m saying, these things seem to push pretty directly against each other, so I’d want careful thinking about how to pull them apart. Even instances that don’t come up with new ideas, but just demonstrate “hey actually this method is powerful”, would seem to advance capabilities non-trivially.