OK, I think I’m now seeing what you’re saying here (edit: see my other reply for additional perspective and for responses to particular statements made in your comment):
In order to predict well in complicated and diverse situations the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from any part of the network defined in terms of static trained weights because this internal model exists only in the form of data within the overall model at inference and not in the static trained weights. You can then refer to this temporary internal model as the “mask” and the actual machinery that generated it, which may in fact be the entire network, as the “actor”.
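For concreteness, here is a minimal, purely illustrative sketch of the distinction I now take you to be drawing (toy PyTorch, not anyone’s actual architecture): the trained parameters are the static machinery, while the activations computed within a single forward pass are the transient “internal model” that is never written back into the weights.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy causally-masked transformer LM, only to show where the 'mask' lives."""
    def __init__(self, vocab_size=100, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.unembed = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq)
        # Static, trained machinery ("actor"): embed / blocks / unembed,
        # fixed at inference time.
        seq_len = tokens.shape[1]
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.embed(tokens)
        # Transient state ("mask"): `h` and the attention keys/values exist only
        # as activations for this forward pass; they are recomputed from the
        # prompt each time and never written back into the weights.
        h = self.blocks(h, mask=causal)
        return self.unembed(h)                      # next-token logits
```

Nothing here depends on the specific architecture; the point is just that the thing being called the mask is data (activations), and the thing being called the actor is the parameterized function that produces it.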
Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way and this is a much cleaner definition of “mask” than I had before.
However, I think that you are then inferring from this an additional claim that I do not think follows. That additional claim is that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently of the temporary internal model.
In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask).
While this “actor” is indeed not the same as any of the “masks”, it doesn’t know the answer “itself” to any of the questions. It needs to generate and “wear” the mask to do that.
This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model.
This also is an update for me—I was not understanding that this is what you were saying and had not considered this possibility, and now that I consider it I do think it must in principle be possible.
However, I do not think this would work the way you claim.
First, let’s consider what would be the optimal (in terms of what is best reinforced by training) goal for this machinery (as considered independently of the mask) to have.
I claim this optimal trained goal is to produce the best (most accurate) internal model from the perspective of predicting the next and only the next token. The reason for this is that (ignoring fine-tuning for now) the (outer) model is trained offline on a stream of tokens that is not varied based on the predictions it makes. So, there is no way, in training, for a strategic decision to vary the internal model from what would make the best prediction now to pay off in terms of easier predictions later.
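To make the offline-training point concrete, here is a minimal sketch of a standard teacher-forced pretraining step (illustrative PyTorch; it assumes a `model` that maps token ids to next-token logits). The targets come from the fixed corpus regardless of what the model predicted, so there is no channel through which sacrificing accuracy now could buy easier predictions later.

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, optimizer, token_batch):
    """One step of standard next-token pretraining (teacher forcing).

    `token_batch` comes from a fixed corpus: the targets are just the same
    stream shifted by one position, so nothing the model predicts here changes
    which tokens it will be asked to predict next. The only thing gradients
    can reward is accuracy on the current next token.
    """
    inputs, targets = token_batch[:, :-1], token_batch[:, 1:]
    logits = model(inputs)                        # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # per-position predictions
        targets.reshape(-1),                      # fixed ground-truth tokens
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```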
Now, to the extent that the system is actually following this goal, it is generating the best model it can under the circumstances, and strategic behaviour seen in the output occurs because the best model generated under the circumstances exhibits this strategic behaviour. The strategic behaviour is thus rooted in the mask, and while implemented by an underlying “actor”, the more tightly it is optimized to follow that goal, the less room there is to cause deviations from this mask-rooted behaviour. In the limit of an infinitely powerful LLM, you wind up with a mask that is an optimal model of the incoming data, and this optimal model is going to act as an optimal emulator of the (probability distribution of) process(es) which most likely would have created the text for which the next token is being predicted. This will exhibit agentic behaviour and be extremely powerful and dangerous, but that behaviour has nothing to do with underlying goals that may exist in the pile of math that generated it, and everything to do with the goals of the emulated (probability distribution of) process(es). I.e. the output is all mask, and the “actor” is only relevant as a substrate to enact it.
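One standard way to state that limit claim (my phrasing, not a new result): pretraining minimizes an expected next-token cross-entropy, and for an unconstrained predictor that objective is minimized exactly by the conditional distribution of the data itself.

```latex
\[
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\sum_{t} \log p_{\theta}(x_t \mid x_{<t})\right]
\]
\[
\text{is minimized (for an unconstrained } p_{\theta}\text{) exactly when}\quad
p_{\theta}(x_t \mid x_{<t}) \;=\; p_{\text{data}}(x_t \mid x_{<t})
\quad\text{for every prefix } x_{<t}.
\]
```

So sampling from the idealized model is sampling continuations from the data-generating process(es) conditioned on the prompt; whatever agency shows up in the output is the agency of that conditional distribution (the mask), not a separate goal of the weights.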
And that’s the main point from the original post now re-expressed better thanks to the clarification from our discussion.
However, here are some considerations on how this argument, though I think it is very strong, could possibly be false in practice:
Since the internal-model-generating machinery is very tightly optimized in training, whatever goal it actually has can’t deviate very much from the optimal one, in terms of what the internal model (and thus the overall model) actually outputs, on the training set. It may, however, have rare or small deviations. We don’t need to be concerned, from a takeover perspective, about small deviations, but rare ones might be a concern. In inference, these would presumably still be fairly rare (though less so), since most input presumably resembles the training data. But then we need to consider why these possible rare but large deviations might exist.
One thing that would be extremely unlikely is that the model is actively deciding, on every token, to either deceptively follow the optimal goal or instead deviate to follow its own separate goal. This would require all the machinery needed to just follow the optimal goal plus additional machinery to handle the deception. This would be inefficient and trained out.
If the rare large deviations, when they occur, are not strategic, this would also not be a concern.
But there is also a third possibility: the system is following a goal that leads it to honestly follow the optimal goal very closely in almost all circumstances, but to strategically deviate in response to rare inputs. Reasons for this might be:
it has a goal that deviates from the optimal goal in rare circumstances, perhaps due to the optimal goal being hard to generalize and deviations on most inputs being hammered down in training
it has a goal that, if correctly interpreted, would deviate from the optimal goal, but it only realizes and implements this deviation in response to rare inputs—perhaps a result of training hammering down this tendency on most inputs, or it requiring a particular “insight” that the rare inputs could provide.
A misgeneralization of the optimal goal for the model-generation machinery along the lines of either of these cases seems unlikely to me, since a goal that bottoms out in generating a model within a forward pass seems simpler and easier to learn than one that bottoms out in anything to do with the real world, such as “world-modelling” in a sense that would make sense to apply beyond next-token prediction. However, it could be that I’m wrong! Ignoring fine-tuning, this might be the biggest potential for a major violation of what I’m arguing, and if it did occur, it might closely resemble what you are proposing as the danger of “actor” takeover, I think.
The argument that the model won’t make strategic decisions to make worse predictions now in order to make better predictions later (since the training data isn’t influenced by past outputs) is not perfectly sound: the training data does contain weird agentic-like patterns, and the model might somehow misgeneralize and attempt to strategically influence the input, with this winding up actually being rewarded because it mimics strategic behaviour that genuinely occurs in the training data.
Also, fine-tuning could affect the conclusion in two ways—it is directly intended to add goals, and it breaks the assumption that the training isn’t influenced by past outputs—human feedback is potentially influenced by more than the most recent output! However, I don’t expect the latter to be significant.
The former of course could be very significant, but:
I think the fine-tuning, though, is most likely primarily influencing the types of masks that are created, and shifts the goal of the underlying machinery, if it exists, in the obvious mask-relevant way (i.e. to create models that will satisfy the fine-tuning rather than just optimal next-token prediction). This is because the fine-tuning directly targets aspects of the output that are, in my view, best thought of as properties of the mask, and the changes in the underlying model needed to implement the changes caused by fine-tuning are therefore going to be directly relevant to mask characteristics and relatively orthogonal to other considerations. I think this is unlikely to massively shift the dynamic, qualitatively, from what occurs just with next-token prediction—the output is still decided by the “mask”.
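To gesture at why I think the fine-tuning gradient lands on mask properties, here is a deliberately oversimplified, REINFORCE-style sketch (illustrative Python; `reward_fn` is a hypothetical stand-in for human or preference feedback, and real fine-tuning pipelines are considerably more involved). The only thing the loss sees is a score assigned to the sampled output, i.e. to what the generated “mask” says.

```python
import torch

def finetune_step(model, optimizer, prompts, reward_fn, max_new_tokens=32):
    """A deliberately simplified, REINFORCE-style fine-tuning step.

    The gradient signal comes entirely from `reward_fn` (a hypothetical scoring
    function applied to the sampled output), i.e. from surface properties of
    what the model says (the "mask"), not from anything about how the internal
    machinery arrived at it.
    """
    sequences, log_probs = [], []
    for prompt in prompts:                               # prompt: 1-D LongTensor
        tokens, lp = prompt.clone(), 0.0
        for _ in range(max_new_tokens):
            logits = model(tokens.unsqueeze(0))[0, -1]   # next-token logits
            dist = torch.distributions.Categorical(logits=logits)
            tok = dist.sample()
            lp = lp + dist.log_prob(tok)                 # accumulate log-prob
            tokens = torch.cat([tokens, tok.unsqueeze(0)])
        sequences.append(tokens)
        log_probs.append(lp)
    rewards = torch.tensor([reward_fn(seq) for seq in sequences])
    # Reward-weighted log-likelihood: push up outputs that score well.
    loss = -(rewards * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point is not that this is how fine-tuning actually works in detail, just that the objective is a function of the output text, so the weight changes it selects for are selected by what the mask does.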