Train an agent A in an environment that contains the source B of A’s reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B’s internals.
Alice observes that A doesn’t hack B. The Bob looks at Alice’s results and says,
“Cool. But this won’t generalize to future lethal systems because it doesn’t account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a ‘capabilities overhang’ relative to the overseer: there’s no behavior that’s clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. This is an important part of how future systems might be lethal.”
So then Alice and Bob collaborate and come up with this variation:
Same setup, but now A also has an exploration objective (e.g. it’s rewarded for reaching states for the first time, or for gaining information, or for a randomly sampled goal from some rich distribution on goals).
Alice and Bob observe that A avoids approaching B for a long time while steadily improving both its B-score and also its exploration score. Then at some point, all in one episode, A hacks B and achieves very high reward.
Now, this might be interesting from an alignment perspective, or not. But my point is that Alice and Bob have perhaps, in some version of the hypothetical, also made a capabilities advance: they’ve demonstrated non-trivial gains from an exploration objective. I assume that in our world this is not much of an insight, as exploration objectives have already been discussed and tried. But this is the sort of pattern that’s concerning to me.
I’m not saying one can’t do this sort of thing in a way such that the alignment value exceeds the capabilities advancement in the relevant way. I’m saying, these things seem to push pretty directly against each other, so I’d want careful thinking about how to pull them apart. Even instances that don’t come up with new ideas, but just demonstrate “hey actually this method is powerful”, would seem to advance capabilities non-trivially.
So for example, say Alice runs this experiment:
Alice observes that A learns to hack B. Then she solves this as follows:
Alice observes that A doesn’t hack B. The Bob looks at Alice’s results and says,
“Cool. But this won’t generalize to future lethal systems because it doesn’t account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a ‘capabilities overhang’ relative to the overseer: there’s no behavior that’s clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. This is an important part of how future systems might be lethal.”
So then Alice and Bob collaborate and come up with this variation:
Alice and Bob observe that A avoids approaching B for a long time while steadily improving both its B-score and also its exploration score. Then at some point, all in one episode, A hacks B and achieves very high reward.
Now, this might be interesting from an alignment perspective, or not. But my point is that Alice and Bob have perhaps, in some version of the hypothetical, also made a capabilities advance: they’ve demonstrated non-trivial gains from an exploration objective. I assume that in our world this is not much of an insight, as exploration objectives have already been discussed and tried. But this is the sort of pattern that’s concerning to me.
I’m not saying one can’t do this sort of thing in a way such that the alignment value exceeds the capabilities advancement in the relevant way. I’m saying, these things seem to push pretty directly against each other, so I’d want careful thinking about how to pull them apart. Even instances that don’t come up with new ideas, but just demonstrate “hey actually this method is powerful”, would seem to advance capabilities non-trivially.