I think that if you do assume a fixed goal slot and outline an overall architecture, then there are pretty good arguments for a serious probability of scheming.
(Though there are also plenty of bad arguments, including some that people have made in the past : ).)
That said, I’m sympathetic to some version of the “Against goal realism” argument applying to models which are sufficiently useful. As in, the first transformatively useful models won’t in practice have internal (opaque-to-human-overseers) goals, such that the traditional story for scheming doesn’t apply.
(However, it’s worth noting that at least some humans do seem to have internal goals, and reductionism doesn’t defeat this intuition. It’s not super clear that the situation with humans is well described as a “goal slot”, though there is pretty clearly some stuff that could be changed in a human brain that would cause them to be well described as coherently pursuing different goals. So arguing that AIs won’t have internal goals in a way that could result in scheming does require noting some ways in which your argument doesn’t apply to humans. More strongly, humans can and do scheme even in cases where some overseer selected the humans for good behavior.)
Regardless, I’ll assume the existence of something well described as a goal slot in this comment.
In the remainder of this comment:
I make a simple (but vague and hand-wavy) counting/measure/simplicity-style argument for scheming, independent of the concrete architecture below.
I’ll try to quickly sketch a simple and relatively concrete architecture with a goal slot that I think might lead to scheming. I make somewhat artificial choices for the sake of simplicity.
Simple (counting-style) argument for scheming
(I make a less fleshed-out version of this argument in this comment below.)
Let’s suppose that we do have a neural network which implements goal-directed reasoning and has something well described as a goal slot.
The model is trained for a long time with RL.
Let’s imagine that this model is very smart and that it understands reasonably well the training process it’s in and how its actions in training affect the final model produced by training. (This second property is sometimes called situational awareness.)
Speaking roughly, and under a few more assumptions I’ll discuss later, there are two types of goals which perform well enough in RL to be selected:
Goals which directly pursue anything that is nearly perfectly correlated with the outcome that is reinforced (aka reward).
Goals which care most about literally any long run outcome.
So, we care about the measure on these two types of goals.
If we imagine that our RL’d neural network computes a bunch of predictions, then it does seem plausible that it will have a bunch of long horizon predictions in its activations with higher aggregate measure than predicting things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, it will be far more likely to sample a probe where most of the variance is driven by long run outcomes than to sample a linear probe which is almost perfectly correlated with reward (e.g. a near perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.)
It’s not clear that variance in the activations determines the relative measure of different goals that perform equally well in training, but it seems like a vaguely sane rough ballpark guess. (Other proposals don’t seem to particularly affect the intuitions, because the process of goals selecting actions will absolutely require thinking about predicted consequences.)
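To convey the rough counting intuition, here is a toy simulation (my own illustration with made-up numbers, not something from an actual model). Assume the activations contain K = 1000 comparably-scaled directions encoding long run predictions plus a single direction that is a near-perfect correlate of reward, and treat the squared weight a random unit-norm probe places on each direction as a crude stand-in for how much of that probe’s behavior the direction drives:

import numpy as np

rng = np.random.default_rng(0)
K = 1000            # hypothetical number of long run prediction directions
n_probes = 100_000  # number of random goal probes to sample

# Random unit-norm linear probes over the K long run dims plus 1 reward-correlate dim.
probes = rng.normal(size=(n_probes, K + 1))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)

# Share of each probe's squared weight that falls on the reward-correlate dim.
reward_share = probes[:, -1] ** 2
print("mean share of weight on the reward correlate:", reward_share.mean())                  # ~1/(K+1)
print("fraction of probes dominated by the reward correlate:", (reward_share > 0.5).mean())  # ~0

Under these (strong and artificial) assumptions, essentially no random probes are dominated by the reward-correlate direction. Real activations obviously won’t consist of equal-variance, orthogonal directions, so this is only meant to illustrate the flavor of the argument, not to quantify the measure.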
Overall, I don’t think this measure argument is clear, but in the case with explicit goal slots there are at least some concerning signs.
For the above argument to go through, there are a few assumptions I’ve made:
In order for the AI to perform well enough to be a plausible result of training, it has to explicitly “play the training game” (e.g. explicitly reason about and try to get high performance).
The AI’s predictions take into account what its goal slot contains (i.e., the AI is aware of its own goals).
It also seems reasonable to contest these assumptions, but this is a different thing than the counting argument. I think these two assumptions are very likely to hold in the limit of arbitrary intelligence, but it’s unclear if they hold for AIs which are created prior to the point where human AI safety efforts are totally obsolete.
Simple architecture where scheming seems plausible (though not certain)
This architecture will involve internally doing literal best-of-n search against an internally represented goal.
Let’s suppose that our AI has an internal world model P. P maps from proposed actions to a bunch of predictions about the world conditional on that action being taken. Let’s say these predictions are represented as a vector in R^M (e.g. a neural network computes predictions about the future as an embedding). For now, let’s assume that this world model is fixed, but does end up predicting all of the important outcomes needed to perform well in training.
It also has some sort of action generator to get proposed actions A. This allows it to sample proposed actions to search over. For now, let’s suppose that the AI’s distribution over actions is fixed. (But the search over these actions won’t be fixed.)
It also has an explicit goal represented as a linear probe which is applied to the predictions from the world model. We’ll suppose this goal is the only thing which is trained, for simplicity (though I expect this rough argument to work if we train the world model in parallel with the goals).
Now, our overall architecture will do search using these components. Here’s some Python pseudocode:
def P(input, action, goal):
    # World model: returns a vector of predictions in R^M, conditional on `action` being taken.
    ...

def A(input):
    # Action generator: samples a proposed action to search over.
    ...

def model(input, goal_linear_probe, N):
    # The overall model: internal best-of-N search against the goal linear probe.
    highest_goodness = -float('inf')
    best_action = None
    for i in range(N):
        action = A(input)
        preds = P(input, action, goal_linear_probe)
        internally_assessed_goodness = (goal_linear_probe * preds).sum()
        if internally_assessed_goodness > highest_goodness:
            highest_goodness = internally_assessed_goodness
            best_action = action
    return best_action
Note that this is pseudocode for the overall model which we’re training and which is implementing consequentialism internally in an opaque way (opaque in the corresponding real model; here it’s obviously not opaque because it’s written as Python code). So there would be an outer training loop that trains this model. (In this case, for simplicity, we assume that only the goal_linear_probe is trained.)
Now, in the case where:
N is very large (analogy: a massive amount of selection pressure is applied by internal consequentialist reasoning, e.g. 64 bits or N=2^64).
P has all of the important predictions for performing well in training including intermediate predictions.
We train the AI (really just the goal probe) for a huge number of episodes (analogy: a massive amount of selection pressure is applied by training). (For simplicity, we can imagine this training just doing best-of-n sampling over some massive number of random linear probes (e.g. 2^1024), which is ~equivalent to Mingard et al.; see the sketch after this list.)
This assumption and the above assumption, combined with this architecture, basically imply that playing the training game is required to be selected by training. (See above for a rough definition of playing the training game.)
P “understands” that it is part of an AI in training and that the submitted action will affect selection/training pressure on the goal probe. And in turn it understands what actions various goal probes will take at different points. Note that this requires that the predictor takes the goal as input, which corresponds to the AI being aware of its own goals.
(We also need some consistency properties on the representations that P uses for long run predictions. E.g., specific consistent long run predictions need to be linearly represented and use consistent representations over time. This is even more in the weeds, so most people can ignore this even if this is a bit confusing.)
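To make the outer loop concrete, here is my own sketch of the best-of-n-over-random-probes training process described above; episodes, reward_fn, and n_candidates are hypothetical placeholders, and model is the internal best-of-N search from the pseudocode above:

import numpy as np

def train_goal_probe(episodes, reward_fn, M, n_candidates, N):
    # Hypothetical outer training loop: best-of-n selection over random goal
    # probes, scored by the total reinforced outcome (aka reward) across episodes.
    rng = np.random.default_rng(0)
    best_probe, best_total = None, -float('inf')
    for _ in range(n_candidates):              # analogy: ~2^1024 candidate probes
        candidate = rng.normal(size=M)         # a random goal linear probe
        total = 0.0
        for inp in episodes:                   # analogy: a huge number of episodes
            action = model(inp, candidate, N)  # internal best-of-N search against this probe
            total += reward_fn(inp, action)    # the outcome that gets reinforced
        if total > best_total:
            best_total, best_probe = total, candidate
    return best_probe

Nothing here depends on the selection literally being best-of-n; it’s just the simplest stand-in for “a massive amount of selection pressure is applied by training”.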
Then there are basically two types of goal linear probes you can get:
Probes which mostly “care” about long run outcomes in ~every training episode and thus assign high internally_assessed_goodness to highly rewarded actions because these actions will result in these goals getting more influence over the long run (because the goal probe isn’t selected against).
Probes which are nearly perfectly correlated with the outcome that is reinforced (aka reward).
Probably literal reward is somewhat salient to P in this setup (particularly if we imagine this model being trained to predict reward, as occurs in many RL setups), but there are also going to be a lot of long range predictions that the model will need to compute to do well in training (both in RL and in pretraining). And you only need to compute near-perfect-on-distribution correlates of reward once (at least for the RL part of training).
I might try to touch up this argument at some point, but this is the core sketch.