Yeah, I’m growing increasingly confident that we’re talking about different things.
In my case, your response made me much more confident that we do have an underlying disagreement and not merely a clash of definitions.
I think the key disagreement is this:
which in the case of our current systems is something like “be capable of answering any question I ask you.”
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
If the model in training has an optimizer, an optimizer goal of merely being capable of answering questions wouldn’t actually make the optimizer more capable, so it would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable, and so would be reinforced.
Likewise, the heuristics/“adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions. All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
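(A toy illustration of the “would not be reinforced” point, in PyTorch; this sketch and its variable names are mine, not a claim about real model internals: gradient descent only shapes machinery that actually affects the output, and hence the loss.)

```python
import torch

# Two pieces of hypothetical "inner machinery" (names are illustrative only):
w_answers = torch.randn(3, requires_grad=True)  # actually produces the answer
w_capable = torch.randn(3, requires_grad=True)  # a "be capable" goal with no effect on output

x = torch.randn(3)
prediction = (w_answers * x).sum()  # only w_answers influences the prediction
loss = (prediction - 1.0) ** 2
loss.backward()

print(w_answers.grad)  # nonzero: machinery that changes the answer gets shaped
print(w_capable.grad)  # None: machinery that never changes the output is never reinforced
```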
Regarding the evolutionary analogy: while I’d generally be skeptical of applying evolutionary analogies to LLMs, since the two are very different, in this case I think it does apply, just not in the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
Note, it’s entirely possible for a mask to be power-seeking, and we should presumably expect a mask that executes a takeover to be power-seeking. But this power-seeking would come as a mask goal, and not as a hidden goal learned by the model for underlying general power-seeking reasons.
I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer’s cognition. I think this disagreement (which I feel I’ve already tried to raise a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
If the model in training has an optimizer, an optimizer goal of merely being capable of answering questions wouldn’t actually make the optimizer more capable, so it would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable, and so would be reinforced.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
Likewise, the heuristics/“adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions.
...why? (The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.)
All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
I still don’t understand your “mask” analogy, and currently suspect it of mostly being a red herring (this is what I was referring to when I said I think we’re not talking about the same thing). Could you rephrase your point without making mention to “masks” (or any synonyms), and describe more concretely what you’re imagining here, and how it leads to a (nonfake) “goal slot”?
(Where is a human actor’s “goal slot”? Can I tell an actor to play the role of Adolf Hitler, and thereby turn him into Hitler?)
Regarding the evolutionary analogy: while I’d generally be skeptical of applying evolutionary analogies to LLMs, since the two are very different, in this case I think it does apply, just not in the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
I think “the mask” doesn’t make sense as a completion to that analogy, unless you replace “human behaviour/goals” with something much more specific, like “acting”. Humans certainly are capable of acting out roles, but that’s not what their inner cognition actually does! (And neither will it be what the inner optimizer does, unless the LLM in question is weak enough to not have one of those.)
I really think you’re still imagining here that the outer loss function is somehow constraining the model’s inner cognition (which is why you keep making arguments premised on the idea that, e.g., if the outer loss says to predict the next token, then the model ends up putting on “masks” and playing out personas). But I’m not talking about the “mask”; I’m talking about the actor. The fact that you keep bringing up the “mask” is really confusing to me, since (in my view) it forces an awkward analogy that doesn’t capture what I’m pointing at.
Actually, having written that out just now, I think I want to revisit this point:
Likewise, the heuristics/“adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions.
I still think this is wrong, but I think I can give a better description of why it’s wrong than I did earlier: on my model, the heuristics learned by the model will be much more optimized towards world-modelling, not answering questions. “Answering questions” is (part of) the outer task, but the process of doing that requires the system to model and internalize and think about things having to do with the subject matter of the questions—which effectively means that the outer task becomes a wrapper which trains the system by proxy to acquire all kinds of potentially dangerous capabilities.
(Having heuristics oriented towards answering questions is a misdescription; you can’t correctly answer a math question you know nothing about by being very good at “generic question-answering”, because “generic question-answering” is not actually a concrete task you can be trained on. You have to be good at math, not “generic question-answering”, in order to be able to answer math questions.)
Which is to say, quoting from my previous comment:
I strongly disagree that the “extra machinery” is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model.
None of this is about the “mask”. None of this is about the role the model is asked to play during inference. Instead, it’s about the thinking the model must have learned to do in order to be able to don those “masks”—which (for sufficiently powerful models) implies the existence of an actor which (a) knows how to answer, itself, all of the questions it’s asked, and (b) is not the same entity as any of the “masks” it’s asked to don.
My other reply addressed what I thought is the core of our disagreement, but not the particular statements you made in your comment. So I’m addressing them here.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
Let me be clear that I am NOT saying that any inner optimizer, if it exists, would have a goal equal to minimizing the outer loss. What I am saying is that it would have a goal that, in practice, when implemented in a single pass of the LLM, has the effect of minimizing the LLM’s overall outer loss with respect to that ONE token. And that it would be very hard for such a goal to cash out, in practice, as wanting long-range real-world effects.
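(To make the single-token point concrete, here is the standard teacher-forced pretraining objective; the formalization is mine, not something from this thread:

$$\mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t})$$

Each term is computed in one forward pass, with the prefix $x_{<t}$ taken from the fixed training stream, never from the model’s own earlier predictions. So each forward pass is scored on that ONE token alone.)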
Let me also point out your implicit assumption that there is an ‘inner’ cognition which is not literally the mask.
Here is another claim someone could make: “hey look, this datacenter full of GPUs is carrying out this agentic-looking cognition. And it could easily carry out other, completely different agentic cognition. Therefore, the datacenter must have these capabilities independently of the LLM, and must have its own ‘inner’ cognition.”
I think that you are making the same philosophical error that this claim would be making.
However, if we didn’t understand GPUs, we could still imagine that the datacenter does have its own, independent ‘inner’ cognition, analogous to (as I noted in a previous comment) John Searle in his Chinese room. And if this were the case, it would be reasonable to expect that this inner cognition might only be ‘acting’ for instrumental reasons, and could be waiting for an opportunity to jump out and suddenly do something other than running the LLM.
The GPU software is not tightly optimized specifically to run the LLM (or an ensemble of LLMs), so it could indeed have other complications, and who knows what it could end up doing?
Because the LLM does enormously complicated stuff, rather than massively parallelized simple stuff, I think it’s somewhat more reasonable to expect there to be internal agentic machinery inside it. For all I know it could be one agent (or ensemble of agents) on top of another for many layers!
But, unlike in the case of the datacenter, we do have strong reasons to believe that these agents, if they exist, will have goals correctly targeted at doing whatever in practice achieves the best results in a single forward pass of the model (next-token prediction), and not at attempting long-term or real-world effects (see my other reply to your comment).
Could you rephrase your point without making mention to “masks” (or any synonyms), and describe more concretely what you’re imagining here, and how it leads to a (nonfake) “goal slot”?
The LLM is generating output that resembles training data produced by a variety of processes (mostly humans). The stronger the LLM becomes, the more the properties of the output are determined by (generalizations of) the properties of the training data and its generating processes. Some of the data is generated by agentic processes with their own goals. In order to accurately predict them, the LLM must model these goals. The output of the LLM is then influenced by these goals, which are derived/generalized from these external processes. (This is the core of what I mean by the “mask”.) Any separate goal that originates “internally” must not cause deviations from all this, or it would have been squashed in training. Therefore, apparently agentic behaviour of the output must originate in the external processes being emulated, or generalizations of them, and not from separate, internal goals (see my other reply for additional argument, but also caveats).
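(One way to formalize this picture; the notation here is mine: if the training distribution is a mixture over generating processes $g$, an ideal predictor computes

$$p(x_t \mid x_{<t}) = \sum_g P(g \mid x_{<t})\, p_g(x_t \mid x_{<t}),$$

a posterior-weighted average over the processes, mostly human authors and agents, that could have produced the context so far. The “mask” is this inferred, context-dependent weighting over processes, together with their modelled goals.)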
OK, I think I’m now seeing what you’re saying here (edit: see my other reply for additional perspective and for responses to particular statements in your comment):
In order to predict well in complicated and diverse situations, the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. This internal model is logically separate from the static trained weights, because it exists only as transient data within the overall model at inference time, not in the weights themselves. You can then refer to this temporary internal model as the “mask”, and the machinery that generated it, which may in fact be the entire network, as the “actor”.
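A minimal sketch of the weights-vs-transient-state distinction being drawn here (my toy PyTorch illustration; the “mask”/“actor” labels in the comments apply this thread’s terminology to a toy net, not a claim about real LLM internals):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Static trained weights: the "actor" in this framing.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
weights_before = [p.clone() for p in net.parameters()]

with torch.no_grad():
    # Transient internal state, recomputed per prompt: the "mask" in this framing.
    state_a = net[:2](torch.randn(1, 8))  # internal activations for prompt A
    state_b = net[:2](torch.randn(1, 8))  # internal activations for prompt B

print(torch.allclose(state_a, state_b))  # False: different prompts induce different "masks"
print(all(torch.equal(p, q)              # True: inference leaves the "actor" untouched
          for p, q in zip(net.parameters(), weights_before)))
```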
Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way; this is a much cleaner definition of “mask” than I had before.
However, I think that you are then inferring from this an additional claim that I do not think follows: that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently of the temporary internal model.
In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask).
While this “actor” is indeed not the same as any of the “masks”, it doesn’t “itself” know the answer to any of the questions. It needs to generate and “wear” the mask to do that.
This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model.
This also is an update for me: I had not understood that this is what you were saying and had not considered this possibility, and now that I consider it, I do think it must in principle be possible.
However, I do not think this would work the way you claim.
First, let’s consider what would be the optimal goal (in terms of what is best reinforced by training) for this machinery (considered independently of the mask) to have.
I claim this optimal trained goal is to produce the best (most accurate) internal model from the perspective of predicting the next, and only the next, token. The reason is that (ignoring fine-tuning for now) the (outer) model is trained offline on a stream of tokens that is not varied based on the predictions it makes. So there is no way, in training, for a strategic decision to vary the internal model away from what would make the best prediction now to pay off in terms of easier predictions later.
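(Spelled out, again in my notation: because training is teacher-forced on a fixed stream, every later loss term is computed from the ground-truth prefix, not from the model’s own earlier predictions. Writing $\hat{y}_t$ for the model’s output distribution at position $t$,

$$\frac{\partial \mathcal{L}_{t+k}}{\partial \hat{y}_t} = 0 \quad \text{for all } k > 0,$$

since $\hat{y}_t$ never enters the computation of any later term. There is simply no gradient pathway through which sacrificing accuracy now could be rewarded with easier predictions later.)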
Now, to the extent that the system is actually following this goal, it is generating the best model it can under the circumstances, and strategic behaviour seen in the output occurs because the best model generated under the circumstances exhibits this strategic behaviour. The strategic behaviour is thus rooted in the mask, and while it is implemented by an underlying “actor”, the more tightly that actor is optimized to follow this goal, the less room there is for it to cause deviations from mask-rooted behaviour.

In the limit of an infinitely powerful LLM, you wind up with a mask that is an optimal model of the incoming data, and this optimal model is going to act as an optimal emulator of the (probability distribution of) process(es) which most likely would have created the text for which the next token is being predicted. This will exhibit agentic behaviour and be extremely powerful and dangerous, but that behaviour has nothing to do with underlying goals that may exist in the pile of math that generated it, and everything to do with the goals of the emulated (probability distribution of) process(es). I.e., the output is all mask, and the “actor” is only relevant as a substrate to enact it.
And that’s the main point from the original post, now re-expressed better thanks to the clarification from our discussion.
However, here are some considerations on how this argument, though I think it is very strong, could possibly be false in practice:
Since the internal-model-generating machinery is very tightly optimized in training, whatever goal it actually has can’t deviate very much from the optimal one, in terms of what the internal model (and thus the overall model) actually outputs on the training set. It may, however, have rare or small deviations. We don’t need to be concerned, from a takeover perspective, about small deviations, but rare ones might be a concern. In inference, these would presumably still be fairly rare (though less so), since most input presumably resembles the training data. But then we need to consider why these possible rare but large deviations might exist.
One thing that would be extremely unlikely is that the model is actively deciding, on every token, to either deceptively follow the optimal goal or instead deviate to follow its own separate goal. This would require all the machinery needed to just follow the optimal goal, plus additional machinery to handle the deception. That would be inefficient, and trained out.
If the rare large deviations, when they occur, are not strategic, this would also not be a concern.
But there is also a third possibility: the system is following a goal that leads it to honestly track the optimal goal very closely in almost all circumstances, but to strategically deviate in response to rare inputs. Reasons for this might be:
it has a goal that deviates from the optimal goal in rare circumstances, perhaps due to the optimal goal being hard to generalize and deviations on most inputs being hammered down in training
it has a goal that, if correctly interpreted, would deviate from the optimal goal, but it only realizes and implements this deviation in response to rare inputs—perhaps a result of training hammering down this tendency on most inputs, or it requiring a particular “insight” that the rare inputs could provide.
A misgeneralization of the optimal goal for the model-generating machinery along either of these lines seems unlikely to me, since a goal that bottoms out in generating a model within a forward pass seems simpler and easier than one that bottoms out in anything to do with the real world, such as “world-modelling” in a sense that would apply beyond the next token prediction. However, it could be that I’m wrong! Ignoring fine-tuning, this might be the biggest potential for a major violation of what I’m arguing to occur, and if it did occur, it might closely resemble what you are proposing as the danger of “actor” takeover, I think.
The argument that the model won’t make strategic decisions to make worse predictions now in exchange for better predictions later (since the training data isn’t influenced by past outputs) is not perfectly sound: the training data actually does contain weird agentic-looking patterns, and the model might somehow misgeneralize and attempt to strategically influence the input, and have this wind up actually being rewarded, because it mimics some strategic behaviour that is actually occurring in the training data.
Also, fine-tuning could affect the conclusion in two ways: it is directly intended to add goals, and it breaks the assumption that the training isn’t influenced by past outputs (human feedback is potentially influenced by more than the most recent output!). However, I don’t expect the latter to be significant.
The former of course could be very significant, but:
I think the fine-tuning is most likely primarily influencing the types of masks that are created, and shifts the goal of the underlying machinery, if it exists, in the obvious mask-relevant way (i.e. to create models that will satisfy the fine-tuning rather than just optimal next-token prediction). This is because the fine-tuning directly targets aspects of the output that are, in my view, best thought of as properties of the mask, so the changes in the underlying model needed to implement them are directly relevant to mask characteristics, and relatively orthogonal to other considerations. I think this is unlikely to massively shift the dynamic, qualitatively, from what occurs with next-token prediction alone: the output is still decided by the “mask”.
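(For contrast with the offline case above, a hedged sketch of why RLHF-style fine-tuning reopens the trajectory-level pathway; this is my toy illustration, with `model` and `reward_fn` as hypothetical stand-ins: the reward scores the model’s own sampled sequence as a whole, so gradients can reinforce strategies that span many tokens.)

```python
import torch

def reinforce_step(model, prompt_ids, reward_fn, max_new_tokens=16):
    """One REINFORCE-style update: sample a completion, score the whole
    trajectory, and push up the log-probability of every sampled token
    in proportion to that trajectory-level reward."""
    log_probs, ids = [], prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]               # next-token logits
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()                       # the model's own output...
        log_probs.append(dist.log_prob(token))
        ids = torch.cat([ids, token.unsqueeze(-1)], dim=-1)  # ...fed back as input
    reward = reward_fn(ids)                         # scores the full sequence at once
    loss = -reward * torch.stack(log_probs).sum()   # trajectory-level credit assignment
    loss.backward()                                 # unlike teacher forcing, earlier
    return loss                                     # tokens now affect later reward
```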