A key idea of deceptive alignment is that the model starts by learning a proxy that becomes its internal goal.
One of my pictures of deceptive alignment goes like this:
The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal
The world model gets better and better in order to improve the learner’s loss
When the learner has a really excellent world model that can make long range predictions and so forth—good enough that it can reason itself into playing the training game for a wide class of long-term goals—then we get a large class of objectives that achieve losses just as good as if not better than the initial representation of the goal
When this happens, gradients derived from regularisation and/or loss may push the learner’s objective towards one of these problematic alternatives. Also, because the initial “pretty good” goal is not a long-range one (because it developed when the world model was not so good), it doesn’t necessarily steer the learner away from possibilities like this
I don’t think the argument presented here really addresses this picture.
If we’re training a learner with some variant of human feedback on diverse tasks (HFDT), with a training goal of shaping the model into one honestly trying to complete the tasks as the human overseers intend, then the actual goal is a long-range goal. So if, as you say, the learner quickly develops “a decent representation of the actual goal”, then that long-term goal representation drives the learner to use its growing capabilities to make decisions downstream of that goal, which steers the model away from alternative long-term goals. It doesn’t particularly matter that there are other possible long-term goals that would have equal or greater training performance, because they aren’t locally accessible via paths compatible with the model’s existing goal, and that existing goal provides no reason to steer differentially towards those goal modifications.
I don’t think it’s obvious that a learner which initially learns to pursue the right thing within each episode is also likely to learn to steer itself towards doing the right thing across episodes.
I wasn’t assuming we were working in an episodic context. But if we are, then the agent isn’t getting differential feedback on its across-episode generalization behavior, so there is no reason for the model to develop dominant policy circuits that respond to episode boundaries (any more than ones that respond to some random perceptual bitvector).
Let me try again. It initially develops a reasonable goal because it doesn’t yet know a) that playing the training game is instrumentally correct for a bunch of other goals and/or b) how to play the training game in a way that is instrumentally correct for these other goals.
By the same assumptions, it seems reasonable to speculate that it doesn’t know a) that allowing itself to change in certain ways from gradient updates would be un-instrumental and/or b) how to prevent itself from changing in these ways. Furthermore, it’s not obvious that the “reasonable goal” it learns will incentivise figuring this out later.
In my model, having a goal of X means the agent’s policy circuits are differentially sensitive to (a.k.a. care about) the features in the world that it recognizes as relevant to X (as represented in the agent’s own ontology), because that recognition was selected for in the past. If it has a “reasonable goal” of X, then it doesn’t necessarily want to “play the training game” instrumentally for an alternative goal Y, at least not along dimensions where Y diverges from X, even if it knows how to do so and Y is another (more) highly-rewarded goal. If it does “play the training game”, it wants to play the training game in the service of X.
If the agent never gets updated again, its policy circuits stay as they are, so its goal remains fixed at X. Otherwise, since we know X-seeking is a rewarded strategy (otherwise, how did we entrain X in the first place?), and since its existing policy is X-seeking, new updates will continue to flow from rewards that it gets from trying to seek X (modulo random drift from accidental rewards). By default[1] those updates move it around within the basin of X-seeking policies, rather than somehow teleporting it into another basin that has high reward for X-unrelated reasons. So I think that if we got a reasonable X goal early on, the goal will tend to stay X or something encompassing X. Late in training, once the agent is situationally-aware, it can start explicitly reasoning about what actions will keep its current “reasonable” goal intact.
To give an example, if my thoughts have historically been primarily shaped by rewards flowing from having “eating apples” as my goal, then by default I’m going to persist in apple-seeking and apple-eating behavior/cognition (unless some event strongly and differentially downweights that cognition, like getting horribly sick from a bad apple), which will tend to steer me towards more apples and apple-based reinforcement events, which will further solidify this goal in me.
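To put the same “no differential pressure out of the basin” point in a toy quantitative form (my own illustrative construction, not anything from the post): treat the goal as a single gate parameter g, where the policy behaves X-seekingly with probability sigmoid(g) and Y-seekingly otherwise. If training rewards both behaviors equally, the expected policy-gradient signal on g is exactly zero, so updates give no systematic push from the X basin toward the Y basin, only noise; any edge for X-seeking pushes the gate deeper into X.

```python
import numpy as np

# Toy illustration (my construction): the "goal" is a single gate parameter g.
# With probability sigmoid(g) the policy acts on goal X, otherwise on goal Y.
# Both behaviors may be highly rewarded, but the expected REINFORCE gradient
# on g only favors Y to the extent that Y-behavior is rewarded *more*.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_reward_gradient(g, reward_X, reward_Y):
    p = sigmoid(g)  # P(policy acts on goal X)
    # E[ r * d/dg log pi(behavior) ]  =  p * (1 - p) * (reward_X - reward_Y)
    return p * reward_X * (1 - p) + (1 - p) * reward_Y * (-p)

g = 2.0  # the policy is currently well inside the X-seeking basin
print(expected_reward_gradient(g, reward_X=1.0, reward_Y=1.0))  # 0.0: no push toward Y
print(expected_reward_gradient(g, reward_X=1.0, reward_Y=0.9))  # > 0: gradient ascent deepens X
```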
The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal
Wouldn’t you expect decision making, and therefore goals, to be in the final layers of the model? If so, they will calculate the goal based on high-level world model neurons. If the world model improves, those high-level abstractions will also improve. The model doesn’t have to build its representation of the goal from scratch, because the goal is connected to the world model.
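Here is a minimal architectural sketch of that wiring (purely illustrative; the module names and sizes are my own assumptions): the goal/decision head is a final-layer readout over a shared world-model trunk, so any improvement to the trunk automatically improves the abstractions the goal is computed from.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """Illustrative only: decision-making as a readout over world-model features."""

    def __init__(self, obs_dim=64, hidden=128, n_actions=8):
        super().__init__()
        # Shared "world model" trunk: learns high-level abstractions of the world.
        self.world_model = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Decision-making readout in the final layer: it consumes the trunk's
        # abstractions rather than re-modelling the world from scratch.
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        features = self.world_model(obs)   # improving this improves the head's inputs for free
        return self.policy_head(features)  # the "goal" lives in what this readout is sensitive to

agent = Agent()
logits = agent(torch.randn(1, 64))
```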
When the learner has a really excellent world model that can make long range predictions and so forth—good enough that it can reason itself into playing the training game for a wide class of long-term goals—then we get a large class of objectives that achieve losses just as good as if not better than the initial representation of the goal
Even if the model is sophisticated enough to make long-range predictions, it still has to care about the long run for it to have an incentive to play the training game. Long-term goals are addressed extensively in this post and the next.
When this happens, gradients derived from regularisation and/or loss may push the learner’s objective towards one of these problematic alternatives.
Suppose we have a model with a sufficiently aligned goal A. I’ll also denote an unaligned goal as U, and instrumental training-reward optimization as S. It sounds like your idea is that S gets better training performance than directly pursuing A, so the model should switch its goal to U so that it can play the training game and get better performance. But if S gets better training performance than A, then the model doesn’t need to switch its goal to play the training game. S is already instrumentally valuable for its current goal. Why would it switch?
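A toy restatement of that point (my own illustration, with made-up reward numbers): training reward depends only on the behavior the model emits, not on which latent goal motivates it, so A-motivated training-game play and U-motivated training-game play score identically, and there is no training-performance gradient favoring the switch.

```python
# Made-up reward values, purely to make the comparison concrete.
def training_reward(behavior: str) -> float:
    return {"pursue_goal_directly": 0.9, "play_training_game": 1.0}[behavior]

reward_A_playing_game = training_reward("play_training_game")  # goal A, strategy S
reward_U_playing_game = training_reward("play_training_game")  # goal U, strategy S

assert reward_A_playing_game == reward_U_playing_game  # no training-performance reason to switch A -> U
```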
Also, because the initial “pretty good” goal is not a long-range one (because it developed when the world model was not so good), it doesn’t necessarily steer the learner away from possibilities like this
Wouldn’t the initial goal continue to update over time? Why would it build a second goal instead of making improvements to the original?