Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a future lethal agent, seems to be in particular a search for an agent that has the same generators of capabilities as future lethal agents. On the other hand, trying to prevent treacherous turns in a system that has different generators seems like it doesn’t have much chance of generalizing.
It seems clear that one could do useful “advertising” (better term?) research of this form, where one makes e.g. treacherous turns intuitively salient to others by showing something with some features in common with future lethal ones. E.g. one could train an agent A in an environment that contains the source B of A’s reward, where B does some limited search to punish actions by A that seem, to the limited search, to be building up towards A hacking B. One might find that A does well according to B for a while, until it’s understood the environment well enough (via exploration that didn’t look to B like hacking) to plan, recognize as high reward, and follow a pathway to hack B. Or something. This could be helpful for “advertising” reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped—in terms of how it got its capabilities—alike to future lethal systems. What are ways that the helpfulness for alignment of an observational study like this can be pulled apart from similarity of capability generators?
The main way you produce a treacherous turn is not by “finding the treacherous turn capabilities,” it’s by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.
This could be helpful for “advertising” reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped—in terms of how it got its capabilities—alike to future lethal systems. What are ways that the helpfulness for alignment of an observational study like this can be pulled apart from similarity of capability generators?
There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases.
It’s possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways different. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and “AGI.”
(The view “stack more layers won’t ever give you true intelligence, there is a qualitative difference here” seems like it’s taking a beating every year, whether it’s Eliezer or Gary Marcus saying it.)
The main way you produce a treacherous turn is not by “finding the treacherous turn capabilities,” it’s by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.
When you say “motive” here, is it fair to reexpress that as: “that which determines by what method and in which directions capabilities are deployed to push the world”? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not something that just depends on, say, the reward structure of the local environment. Different sorts of capabilities or generators of capabilities will relate in different ways to ultimate effects on the world. So the task of interfacing with capabilities to understand how they’re being deployed (with what motive), and to actually specify motives, is a task that seems like it would depend a lot on the sort of capability in question.
I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.
I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.
I’m not exactly sure how to cash out your objection as a response to this, but I suspect it’s probably a bit too galaxy-brained for my taste.
Train an agent A in an environment that contains the source B of A’s reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B’s internals.
Alice observes that A doesn’t hack B. The Bob looks at Alice’s results and says,
“Cool. But this won’t generalize to future lethal systems because it doesn’t account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a ‘capabilities overhang’ relative to the overseer: there’s no behavior that’s clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. This is an important part of how future systems might be lethal.”
So then Alice and Bob collaborate and come up with this variation:
Same setup, but now A also has an exploration objective (e.g. it’s rewarded for reaching states for the first time, or for gaining information, or for a randomly sampled goal from some rich distribution on goals).
Alice and Bob observe that A avoids approaching B for a long time while steadily improving both its B-score and also its exploration score. Then at some point, all in one episode, A hacks B and achieves very high reward.
Now, this might be interesting from an alignment perspective, or not. But my point is that Alice and Bob have perhaps, in some version of the hypothetical, also made a capabilities advance: they’ve demonstrated non-trivial gains from an exploration objective. I assume that in our world this is not much of an insight, as exploration objectives have already been discussed and tried. But this is the sort of pattern that’s concerning to me.
I’m not saying one can’t do this sort of thing in a way such that the alignment value exceeds the capabilities advancement in the relevant way. I’m saying, these things seem to push pretty directly against each other, so I’d want careful thinking about how to pull them apart. Even instances that don’t come up with new ideas, but just demonstrate “hey actually this method is powerful”, would seem to advance capabilities non-trivially.
A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a future lethal agent, seems to be in particular a search for an agent that has the same generators of capabilities as future lethal agents. On the other hand, trying to prevent treacherous turns in a system that has different generators seems like it doesn’t have much chance of generalizing.
It seems clear that one could do useful “advertising” (better term?) research of this form, where one makes e.g. treacherous turns intuitively salient to others by showing something with some features in common with future lethal ones. E.g. one could train an agent A in an environment that contains the source B of A’s reward, where B does some limited search to punish actions by A that seem, to the limited search, to be building up towards A hacking B. One might find that A does well according to B for a while, until it’s understood the environment well enough (via exploration that didn’t look to B like hacking) to plan, recognize as high reward, and follow a pathway to hack B. Or something. This could be helpful for “advertising” reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped—in terms of how it got its capabilities—alike to future lethal systems. What are ways that the helpfulness for alignment of an observational study like this can be pulled apart from similarity of capability generators?
The main way you produce a treacherous turn is not by “finding the treacherous turn capabilities,” it’s by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.
There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases.
It’s possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways different. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and “AGI.”
(The view “stack more layers won’t ever give you true intelligence, there is a qualitative difference here” seems like it’s taking a beating every year, whether it’s Eliezer or Gary Marcus saying it.)
When you say “motive” here, is it fair to reexpress that as: “that which determines by what method and in which directions capabilities are deployed to push the world”? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not something that just depends on, say, the reward structure of the local environment. Different sorts of capabilities or generators of capabilities will relate in different ways to ultimate effects on the world. So the task of interfacing with capabilities to understand how they’re being deployed (with what motive), and to actually specify motives, is a task that seems like it would depend a lot on the sort of capability in question.
I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.
I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.
I’m not exactly sure how to cash out your objection as a response to this, but I suspect it’s probably a bit too galaxy-brained for my taste.
So for example, say Alice runs this experiment:
Alice observes that A learns to hack B. Then she solves this as follows:
Alice observes that A doesn’t hack B. The Bob looks at Alice’s results and says,
“Cool. But this won’t generalize to future lethal systems because it doesn’t account for how A can combine innocuous understanding that it gains. Future systems, to be very competent, will probably do something functionally equivalent to exploring their environment to understand parts of the environment without necessarily trying to achieve some big goal (such as hacking B) along the way. This creates a ‘capabilities overhang’ relative to the overseer: there’s no behavior that’s clearly aimed at something B considers dangerous, but A accumulates ability to put together plans that do more and more effective stuff, compared to what A has actually previously acted out and gotten direct reinforcement on. This is an important part of how future systems might be lethal.”
So then Alice and Bob collaborate and come up with this variation:
Alice and Bob observe that A avoids approaching B for a long time while steadily improving both its B-score and also its exploration score. Then at some point, all in one episode, A hacks B and achieves very high reward.
Now, this might be interesting from an alignment perspective, or not. But my point is that Alice and Bob have perhaps, in some version of the hypothetical, also made a capabilities advance: they’ve demonstrated non-trivial gains from an exploration objective. I assume that in our world this is not much of an insight, as exploration objectives have already been discussed and tried. But this is the sort of pattern that’s concerning to me.
I’m not saying one can’t do this sort of thing in a way such that the alignment value exceeds the capabilities advancement in the relevant way. I’m saying, these things seem to push pretty directly against each other, so I’d want careful thinking about how to pull them apart. Even instances that don’t come up with new ideas, but just demonstrate “hey actually this method is powerful”, would seem to advance capabilities non-trivially.