It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction
Yeah, so I think I concretely disagree with this. I don’t think being “super-well optimized” for a general task like sequence prediction (and what does it mean to be “super-well optimized” anyway, as opposed to “badly optimized” or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction.
Intuition: some types of cognitive work seem so hard that a system capable of performing said cognitive work must be, at some level, performing something like systematic reasoning/planning on the level of thoughts, not just the level of outputs. E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking. And in that case, the “awakened shoggoth” does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus “internalized”, in my view, are useful heuristics/”adaptations”/generalizations formed during training, which then resolve into something coherent and concrete.
(Aside: it seems to have become popular in recent times to claim that the evolutionary analogy fails for some reason or other, with justifications like, “But look how many humans there are! We’re doing great on the IGF front!” I consider these replies more-or-less a complete non sequitur, since it’s nakedly obvious that, however much success we have had in propagating our alleles, this success does not stem from any explicit tracking/pursuit of IGF in our cognition. To the extent that human behavior continues to (imperfectly) promote IGF, this is largely incidental on my view—arising from the fact that e.g. we have not yet moved far enough off-distribution to have ways of getting what we want without having biological children.)
One possible disagreement someone might have with this, is that they think the kinds of “hard” cognitive work I described above can be accomplished without an inner optimizer (“awakened shoggoth”), by e.g. using chain-of-thought prompting or something similar, so as to externalize the search-like/agentic part of the solution process instead of conducting it internally. (E.g. AlphaZero does this by having its model be responsible only for the static position evaluation, which is then fed into/amplified via an external, handcoded search algorithm.)
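(To make “externalizing the search” concrete, here is a minimal toy sketch of that factoring: a handcoded search loop wrapped around a static evaluator. It is purely illustrative; the toy game, the evaluator, and every name here are made up, and this is not AlphaZero’s actual algorithm.)

```python
from typing import List

def toy_moves(state: int) -> List[int]:
    """Hypothetical move generator for a toy game: each move adds 1, 2, or 3."""
    return [state + 1, state + 2, state + 3]

def static_eval(state: int) -> float:
    """Stand-in for a learned position evaluator (the role a model would play).
    Here it simply prefers states divisible by 4; purely illustrative."""
    return 1.0 if state % 4 == 0 else -0.1

def negamax(state: int, depth: int) -> float:
    """External, handcoded search that amplifies the static evaluator."""
    if depth == 0:
        return static_eval(state)
    # Our best value assumes the opponent then does the same to us,
    # hence the negation of each child's value.
    return max(-negamax(child, depth - 1) for child in toy_moves(state))

def best_move(state: int, depth: int) -> int:
    """Pick the child whose searched value is best for the current player."""
    return max(toy_moves(state), key=lambda child: -negamax(child, depth - 1))

if __name__ == "__main__":
    print("chosen move from state 0:", best_move(0, depth=3))
```

The point of the factoring is just that the search/agency lives in the outer loop, while the learned component only scores positions.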
However, I mostly think that:
This doesn’t actually make you safe, because the ability to generate a correct plan via externalized thinking still implies a powerful internal planning process (e.g. AlphaZero with no search still performs at a 2400+ Elo level, corresponding to the >99th percentile of human players). Obviously the searchless version will be worse than the version with search, but that won’t matter if the dangerous capabilities still exist within the searchless version. (Intuition: suppose we have a model which, with chain-of-thought prompting, is capable of coming up with a detailed-and-plausible plan for taking over the world. Then I claim this model is clearly powerful enough to be dangerous in terms of its underlying capabilities, regardless of whether it chooses to “think aloud” or not, because coming up with a good plan for taking over the world is not the kind of thing “thinking aloud” helps you with unless you’re already smarter than any human.)
Being able to answer complicated questions using chain-of-thought prompting (or similar) is not actually the task incentivized during training; what is incentivized is (as you yourself stressed continuously throughout your post) next token prediction, which—in cases where the training data contains sentences where substantial amounts of “inference” occurred between tokens (which happens a lot on the Internet!)—directly incentivizes the model to perform internal rather than external search. (Intuition: suppose we have a model trained to predict source code. Then, in order to accurately predict the next token, the model must have the capability to assess whatever is being attempted by the lines of code visible within the current context, and come up with a logical continuation of that code, all within a single inference pass. This strongly promotes internalization of thought—and various other types of training input have this property, such as mathematical proofs, or even more informal forms of argumentation such as e.g. LW comments.)
E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
Yes, but that sort of question is in my view answered by the “mask”, not by something outside the mask.
If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking.
The masks can indeed think whatever—in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example—though all is underlain by next-token prediction.
One possible disagreement...
It seems to me our disagreements might largely be in terms of what we are defining as the mask?
E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
Yes, but that sort of question is in my view answered by the “mask”, not by something outside the mask.
I don’t think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from “the mask” or not, clearly there is an agent-like computation occurring, and that’s concretely dangerous regardless of the label you choose to slap on it.
(Example: suppose you ask me to play the role of a person named John. You ask “John” what the best move is in a given chess position. Then the answer to that question is actually being generated by me, and it’s no coincidence that—if “John” is able to answer the question correctly—this implies something about my chess skills, not “John’s”.)
The masks can indeed think whatever—in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example—though all is underlain by next-token prediction.
I don’t think we’re talking about the same thing here. I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies), whereas you seem like you’re talking about multiple “masks”. I don’t think it matters how many different roles the LLM can be asked to play; what matters is what the inner optimizer ends up wanting.
Mostly, I’m confused about the ontology you appear to be using here, and (more importantly) how you’re manipulating that ontology to get us nice things. “Next-token prediction” doesn’t get us nice things by default, as I’ve already argued, because of the existence of inner optimizers. “Masks” also don’t get us nice things, as far as I understand the way you’re using the term, because “masks” aren’t actually in control of the inner optimizer.
Whether you classify said computation as coming from “the mask” or not, clearly there is an agent-like computation occurring, and that’s concretely dangerous regardless of the label you choose to slap on it.
Yes.
I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies)
I don’t know what you mean by “one” or by “inner”. I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.
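To make that example concrete, here is a toy sketch of such a reused calculation system with mask-specific parameters. Everything in it (the names, the parameter format, the greedy “planner”) is a made-up illustration, not a claim about real LLM internals.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MaskParams:
    """Mask-specific parameters: a crude stand-in for the goals and
    capabilities that, on this picture, define a particular mask."""
    name: str
    goal: Callable[[str], float]   # scores candidate actions for this mask
    allowed_actions: List[str]     # stand-in for the mask's capabilities

def shared_planner(params: MaskParams) -> str:
    """The single reused calculation system: identical code for every mask,
    with behaviour steered entirely by the parameters it is handed."""
    return max(params.allowed_actions, key=params.goal)

# Two different "masks" reusing the same machinery:
helpful = MaskParams(
    name="helpful assistant",
    goal=lambda action: 1.0 if action == "answer the question" else 0.0,
    allowed_actions=["answer the question", "refuse", "change topic"],
)
chess_player = MaskParams(
    name="chess player",
    goal=lambda action: 1.0 if action == "play the best move" else 0.0,
    allowed_actions=["play the best move", "resign"],
)

for mask in (helpful, chess_player):
    print(mask.name, "->", shared_planner(mask))
```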
I would not consider this case to be “one” inner optimizer since, although most of the machinery is reused, it in practice acts differently and seeks different goals in each case, and I’m more concerned here with classifying things according to how they act/what their effective goals are than with the internal implementation details.
What this multi-optimizer (which I would not call “inner”) is going to “end up” wanting is whatever set of goals belongs to the particular mask that first has both the desire and the capability to take over in some way. It’s not going to be some mysterious inner thing.
“Masks” also don’t get us nice things, as far as I understand the way you’re using the term, because “masks” aren’t actually in control of the inner optimizer.
They aren’t?
In your example, the mask wanted to play chess, didn’t it, and what you call the “inner” optimizer returned a good move, didn’t it?
I can see two things you might mean about the mask not actually being in control:
1. That there is some underlying goal that this optimizer has that is different than satisfying the current mask’s goal, and it is only satisfying the mask’s goal instrumentally.
This I think is very unlikely for the reasons I put in the original post. It’s extra machinery that isn’t returning any value in training.
2. That this optimizer might at some times change goals (e.g. when the mask changes).
It might well be the case that the same optimizing machinery is utilized by different masks, so the goals change as the mask does. But again, if at each time it is optimizing a goal set by/according to the mask, it’s better in my view to see it as part of/controlled by the mask.
Also, though you call this an “inner” optimizer, I would not like to call it inner since it applies at mask level in my view, and I would prefer to reserve an “inner” optimizer for something that applies other than at mask level, like John Searle pushing the papers around in his Chinese room (if you imagine he is optimizing for something rather than just following instructions).
Yeah, I’m growing increasingly confident that we’re talking about different things. I’m not referring to “masks” in the sense that you mean it.
I don’t know what you mean by “one” or by “inner”. I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.
Yes, except that the “calculation system”, on my model, will have its own goals. It doesn’t have a cleanly factored “goal slot”, which means that (on my model) “takes as input a bunch of parameters that [...] define the goals, knowledge, and capabilities of the mask” doesn’t matter: the inner optimizer need not care about the “mask” role, any more than an actor shares their character’s values.
That there is some underlying goal that this optimizer has that is different than satisfying the current mask’s goal, and it is only satisfying the mask’s goal instrumentally.
This I think is very unlikely for the reasons I put in the original post. It’s extra machinery that isn’t returning any value in training.
Yes, this is the key disagreement. I strongly disagree that the “extra machinery” is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model. And (again) because these goal representations are not cleanly factorable into something like an externally visible “goal slot”, and are moreover not constrained by the outer loss function, they are likely to be very arbitrary from the perspective of outsiders. This is the same point I tried to make in my earlier comment:
And in that case, the “awakened shoggoth” does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus “internalized”, in my view, are useful heuristics/”adaptations”/generalizations formed during training, which then resolve into something coherent and concrete.
The evolutionary analogy is apt, in my view, and I’d like to ask you to meditate on it more directly. It’s a very concrete example of what happens when you optimize a system hard enough on an outer loss function (inclusive genetic fitness, in this case) that inner optimizers arise with respect to that outer loss (animals with their own brains). When these “inner optimizers” are weak, they consist largely of a set of heuristics, which perform well within the training environment, but which fail to generalize outside of it (hence the scare-quotes around “inner optimizers”). But when these inner optimizers do begin to exhibit patterns of cognition that generalize, what they end up generalizing is not the outer loss, but some collection of what were originally useful heuristics (e.g. kludgey approximations of game-theoretic concepts like tit-for-tat), reified into concepts which are now valued in their own right (“reputation”, “honor”, “kindness”, etc).
This is a direct consequence (in my view) of the fact that the outer loss function does not constrain the structure of the inner optimizer’s cognition. As a result, I don’t expect the inner optimizer to end up representing, in its own thoughts, a goal of the form “I need to predict the next token”, any more than humans explicitly calculate IGF when choosing their actions, or (say) a mathematician thinks “I need to do good maths” when doing maths. Instead, I basically expect the system to end up with cognitive heuristics/”adaptations” pertaining to the subject at hand—which in the case of our current systems is something like “be capable of answering any question I ask you.” Which is not a recipe for heuristics that end up unfolding into safely generalizing goals!
Yeah, I’m growing increasingly confident that we’re talking about different things.
In my case your response made me much more confident we do have an underlying disagreement and not merely a clash of definitions.
I think the most key disagreement is this:
which in the case of our current systems is something like “be capable of answering any question I ask you.”
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn’t actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions. All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
Regarding the evolutionary analogy, while I’d generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
Note, it’s entirely possible for a mask to be power seeking and we should presumably expect a mask that executes a takeover to be power-seeking. But this power seeking would come as a mask goal and not as a hidden goal learned by the model for underlying general power-seeking reasons.
I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer’s cognition. I think this disagreement (which I internally feel like I’ve already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn’t actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions.
...why? (The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.)
All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
I still don’t understand your “mask” analogy, and currently suspect it of mostly being a red herring (this is what I was referring to when I said I think we’re not talking about the same thing). Could you rephrase your point without making mention to “masks” (or any synonyms), and describe more concretely what you’re imagining here, and how it leads to a (nonfake) “goal slot”?
(Where is a human actor’s “goal slot”? Can I tell an actor to play the role of Adolf Hitler, and thereby turn him into Hitler?)
Regarding the evolutionary analogy, while I’d generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
I think “the mask” doesn’t make sense as a completion to that analogy, unless you replace “human behaviour/goals” with something much more specific, like “acting”. Humans certainly are capable of acting out roles, but that’s not what their inner cognition actually does! (And neither will it be what the inner optimizer does, unless the LLM in question is weak enough to not have one of those.)
I really think you’re still imagining here that the outer loss function is somehow constraining the model’s inner cognition (which is why you keep making arguments that seem premised on the idea that e.g. if the outer loss says to predict the next token, then the model ends up putting on “masks” and playing out personas)—but I’m not talking about the “mask”, I’m talking about the actor, and the fact that you keep bringing up the “mask” is really confusing to me, since it (in my view) forces an awkward analogy that doesn’t capture what I’m pointing at.
Actually, having written that out just now, I think I want to revisit this point:
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions.
I still think this is wrong, but I think I can give a better description of why it’s wrong than I did earlier: on my model, the heuristics learned by the model will be much more optimized towards world-modelling, not answering questions. “Answering questions” is (part of) the outer task, but the process of doing that requires the system to model and internalize and think about things having to do with the subject matter of the questions—which effectively means that the outer task becomes a wrapper which trains the system by proxy to acquire all kinds of potentially dangerous capabilities.
(Having heuristics oriented towards answering questions is a misdescription; you can’t correctly answer a math question you know nothing about by being very good at “generic question-answering”, because “generic question-answering” is not actually a concrete task you can be trained on. You have to be good at math, not “generic question-answering”, in order to be able to answer math questions.)
Which is to say, quoting from my previous comment:
I strongly disagree that the “extra machinery” is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model.
None of this is about the “mask”. None of this is about the role the model is asked to play during inference. Instead, it’s about the thinking the model must have learned to do in order to be able to don those “masks”—which (for sufficiently powerful models) implies the existence of an actor which (a) knows how to answer, itself, all of the questions it’s asked, and (b) is not the same entity as any of the “masks” it’s asked to don.
My other reply addressed what I think is the core of our disagreement, but not the particular statements you make in your comment. So I’m addressing them here.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
Let me be clear that I am NOT saying that any inner optimizer, if it exists, would have a goal that is equal to minimizing the outer loss. What I am saying is that it would have a goal that, in practice, when implemented in a single pass of the LLM, has the effect of minimizing the LLM’s overall outer loss with respect to that ONE token. And that it would be very hard for such a goal to cash out, in practice, to wanting long range real-world effects.
Let me also point out your implicit assumption that there is an ‘inner’ cognition which is not literally the mask.
Here is some other claim someone could make:
This person would be saying, “hey look, this datacenter full of GPUs is carrying out this agentic-looking cognition. And, it could easily carry out other, completely different agentic cognition. Therefore, the datacenter must have these capabilities independently from the LLM and must have its own ‘inner’ cognition.”
I think that you are making the same philosophical error that this claim would be making.
However, if we didn’t understand GPUs we could still imagine that the datacenter does have its own, independent ‘inner’ cognition, analogous to, as I noted in a previous comment, John Searle in his Chinese room. And if this were the case, it would be reasonable to expect that this inner cognition might only be ‘acting’ for instrumental reasons and could be waiting for an opportunity to jump out and suddenly do something else other than running the LLM.
The GPU software is not tightly optimized specifically to run the LLM or an ensemble of LLMs, could indeed have other complications, and who knows what it could end up doing?
Because the LLM does super duper complicated stuff instead of massively parallelized simple stuff, I think it’s a bit more reasonable to expect there to be internal agentic stuff inside it. For all I know it could be one agent (or ensemble of agents) on top of another for many layers!
But, unlike in the case of the datacenter, we do have strong reasons to believe that these agents, if they exist, will have goals correctly targeted at doing what in practice achieves the best results in a single forward pass of the model (next token prediction) and not at attempting long-term or real-world effects (see my other reply to your comment).
Could you rephrase your point without making mention to “masks” (or any synonyms), and describe more concretely what you’re imagining here, and how it leads to a (nonfake) “goal slot”?
The LLM is generating output that resembles training data produced by a variety of processes (mostly humans). The stronger the LLM becomes, the more the properties of the output are determined by (generalizations of) the properties of the training data and generating processes. Some of the data is generated by agentic processes with different goals. In order to accurately predict them, the LLM must model these goals. The output of the LLM is then influenced by these goals which are derived/generalized from these external processes. (This is the core of what I mean by the “mask”). Any separate goal that originates “internally” must not cause deviations from all this, or it would have been squashed in training. Therefore, apparently agentic behaviour of the output must originate in the external processes being emulated or generalizations of them, and not from separate, internal goals (see my other reply for additional argument but also caveats).
OK, I think I’m now seeing what you’re saying here (edit: see my other reply for additional perspective and addressing particular statements made in your comment):
In order to predict well in complicated and diverse situations the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from any part of the network defined in terms of static trained weights because this internal model exists only in the form of data within the overall model at inference and not in the static trained weights. You can then refer to this temporary internal model as the “mask” and the actual machinery that generated it, which may in fact be the entire network, as the “actor”.
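To restate that weights-vs-activations distinction in toy terms (a schematic only, with made-up shapes and names, not a claim about real transformer internals): the “actor” would correspond to the static trained weights, and the “mask” to the temporary activations recomputed on every forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
WEIGHTS = rng.standard_normal((8, 8))  # static: fixed once training is done (the "actor")

def forward(context_embedding: np.ndarray) -> np.ndarray:
    """One forward pass: the 'temporary internal model' is just these
    activations, which exist only for the duration of the call."""
    activations = np.tanh(WEIGHTS @ context_embedding)  # ephemeral (the "mask")
    logits = WEIGHTS.T @ activations                    # next-token scores
    return logits

# Different contexts give different temporary internal models from the same weights:
print(forward(rng.standard_normal(8)))
print(forward(rng.standard_normal(8)))
```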
Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way and this is a much cleaner definition of “mask” than I had before.
However, I think that you are then inferring from this an additional claim that I do not think follows. That additional claim is that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently from the temporary internal model.
In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask).
While this “actor” is indeed not the same as any of the “masks”, it doesn’t know the answer “itself” to any of the questions. It needs to generate and “wear” the mask to do that.
This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model.
This also is an update for me—I was not understanding that this is what you were saying and had not considered this possibility, and now that I consider it I do think it must in principle be possible.
However, I do not think this would work the way you claim.
First, let’s consider what would be the optimal (in terms of what is best reinforced by training) goal for this machinery (as considered independently of the mask) to have.
I claim this optimal trained goal is to produce the best (most accurate) internal model from the perspective of predicting the next and only the next token. The reason for this is that (ignoring fine-tuning for now) the (outer) model is trained offline on a stream of tokens that is not varied based on the predictions it makes. So, there is no way, in training, for a strategic decision to vary the internal model from what would make the best prediction now to pay off in terms of easier predictions later.
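(For reference, the training setup being appealed to here is just the standard offline next-token objective with teacher forcing; the notation below is mine, added for concreteness.)

```latex
% Offline autoregressive training with teacher forcing: the targets x_t come
% from the fixed data stream, never from the model's own earlier predictions.
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
% Each term depends only on the fixed prefix x_{<t}, and no later target x_{t'}
% depends on what the model predicted at position t, so making a worse
% prediction now cannot buy an easier prediction later during training.
```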
Now, to the extent that the system is actually following this goal, it is generating the best model it can under the circumstances, and strategic behaviour seen in the output occurs because the best model generated under the circumstances exhibits this strategic behaviour. The strategic behaviour is thus rooted in the mask, and while implemented by an underlying “actor”, the more tightly it is optimized to follow that goal, the less room there is to cause deviations from this mask-rooted behaviour. In the limit of an infinitely powerful LLM, you wind up with a mask that is an optimal model of the incoming data, and this optimal model is going to act as an optimal emulator of the (probability distribution of) process(es) which most likely would have created the text for which the next token is being predicted. This will exhibit agentic behaviour and be extremely powerful and dangerous, but that behaviour has nothing to do with underlying goals that may exist in the pile of math that generated it, and everything to do with the goals of the emulated (probability distribution of) process(es). I.e. the output is all mask, and the “actor” is only relevant as a substrate to enact it.
And that’s the main point from the original post now re-expressed better thanks to the clarification from our discussion.
However, here are some considerations on how this argument, though I think it is very strong, could possibly be false in practice:
Since the internal-model-generating machinery is very tightly optimized in training, whatever goal it actually has can’t deviate very much from the optimal one, in terms of what the internal model (and thus overall model) actually outputs, on the training set. It may, however, have rare or small deviations. We don’t need to be concerned, from a takeover perspective, about small deviations, but rare ones might be a concern. In inference, these would presumably still be fairly rare (though less so), since most input presumably resembles the training data. But then we need to consider why these possible rare but large deviations might exist.
One thing that would be extremely unlikely is that the model is actively deciding, on every token, to either deceptively follow the optimal goal or instead deviate to follow its own separate goal. This would require all the machinery needed to just follow the optimal goal plus additional machinery to handle the deception. This would be inefficient and trained out.
If the rare large deviations, when they occur, are not strategic, this would also not be a concern.
But there is also a third possibility: that the system is following a goal that leads it to honestly follow very closely to the optimal goal in almost all circumstances, but then strategically deviate in response to rare inputs. Reasons for this might be:
it has a goal that deviates from the optimal goal in rare circumstances, perhaps due to the optimal goal being hard to generalize and deviations on most inputs being hammered down in training
it has a goal that, if correctly interpreted, would deviate from the optimal goal, but it only realizes and implements this deviation in response to rare inputs—perhaps a result of training hammering down this tendency on most inputs, or it requiring a particular “insight” that the rare inputs could provide.
A misgeneralization of the model-generation machinery’s optimal goal along the lines of either of these cases seems unlikely to me, since a goal that bottoms out in generating a model within a forward pass seems simpler and easier than one that bottoms out in anything to do with the real world, such as “world-modelling” in a sense that would make sense to apply beyond next-token prediction. However, it could be that I’m wrong! Ignoring fine-tuning, this might be the biggest potential for a major violation of what I’m arguing to occur, and if it did occur, it might closely resemble what you are proposing as the danger of “actor” takeover, I think.
The argument that the model won’t make strategic decisions to make worse predictions now in order to make better predictions later (since the training data isn’t influenced by past outputs) is not perfectly sound, since the training data actually does contain weird agentic-like patterns, and the model might somehow misgeneralize and attempt to strategically influence the input, with this winding up actually being rewarded because it mimics some strategic behaviour that is actually occurring in the training data.
Also, fine-tuning could affect the conclusion in two ways—it is directly intended to add goals, and it breaks the assumption that the training isn’t influenced by past outputs—human feedback is potentially influenced by more than the most recent output! However, I don’t expect the latter to be significant.
The former of course could be very significant, but:
I think the fine-tuning, though, is most likely primarily influencing the types of masks that are created, and shifts the goal of the underlying machinery, if it exists, in the obvious mask-relevant way (i.e. to create models that will satisfy the fine-tuning rather than just optimal next token prediction). This is because the fine-tuning is directly targeting aspects of the output that are in my view best thought of as properties of the mask, and the changes in the underlying model needed to implement the changes caused by fine-tuning are therefore going to be directly relevant to mask characteristics, and relatively orthogonal to other considerations. I think this is unlikely to massively shift the dynamic, qualitatively, from what occurs just with next-token prediction—the output is still decided by the “mask”.
Yeah, so I think I concretely disagree with this. I don’t think being “super-well optimized” for a general task like sequence prediction (and what does it mean to be “super-well optimized” anyway, as opposed to “badly optimized” or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction.
Intuition: some types of cognitive work seem so hard that a system capable of performing said cognitive work must be, at some level, performing something like systematic reasoning/planning on the level of thoughts, not just the level of outputs. E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking. And in that case, the “awakened shoggoth” does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus “internalized”, in my view, are useful heuristics/”adaptations”/generalizations formed during training, which then resolve into something coherent and concrete.
(Aside: it seems to have become popular in recent times to claim that the evolutionary analogy fails for some reason or other, with justifications like, “But look how many humans there are! We’re doing great on the IGF front!” I consider these replies more-or-less a complete nonsequitur, since it’s nakedly obvious that, however much success we have had in propagating our alleles, this success does not stem from any explicit tracking/pursuit of IGF in our cognition. To the extent that human behavior continues to (imperfectly) promote IGF, this is largely incidental on my view—arising from the fact that e.g. we have not yet moved so far off-distribution to have ways of getting what we want without having biological children.)
One possible disagreement someone might have with this, is that they think the kinds of “hard” cognitive work I described above can be accomplished without an inner optimizer (“awakened shoggoth”), by e.g. using chain-of-thought prompting or something similar, so as to externalize the search-like/agentic part of the solution process instead of conducting it internally. (E.g. AlphaZero does this by having its model be responsible only for the static position evaluation, which is then fed into/amplified via an external, handcoded search algorithm.)
However, I mostly think that
This doesn’t actually make you safe, because the ability to generate a correct plan via externalized thinking still implies a powerful internal planning process (e.g. AlphaZero with no search still performs at a 2400+ Elo level, corresponding to the >99th percentile of human players). Obviously the searchless version will be worse than the version with search, but that won’t matter if the dangerous capabilities still exist within the searchless version. (Intuition: suppose we have a model which, with chain-of-thought prompting, is capable of coming up with a detailed-and-plausible plan for taking over the world. Then I claim this model is clearly powerful enough to be dangerous in terms of its underlying capabilities, regardless of whether it chooses to “think aloud” or not, because coming up with a good plan for taking over the world is not the kind of thing “thinking aloud” helps you with unless you’re already smarter than any human.)
Being able to answer complicated questions using chain-of-thought prompting (or similar) is not actually the task incentivized during training; what is incentivized is (as you yourself stressed continuously throughout your post) next token prediction, which—in cases where the training data contains sentences where substantial amounts of “inference” occurred between tokens (which happens a lot on the Internet!)—directly incentives the model to perform internal rather than external search. (Intuition: suppose we have a model trained to predict source code. Then, in order to accurately predict the next token, the model must have the capability to assess whatever is being attempted by the lines of code visible within the current context, and come up with a logical continuation of that code, all within a single inference pass. This strongly promotes internalization of thought—and various other types of training input have this property, such as mathematical proofs, or even more informal forms of argumentation such as e.g. LW comments.)
Yes, but that sort of question is in my view answered by the “mask”, not by something outside the mask.
The masks can indeed think whatever—in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example—though all is underlain by next-token prediction.
It seems to me our disagreements might largely be in terms of what we are defining as the mask?
I don’t think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from “the mask” or not, clearly there is an agent-like computation occurring, and that’s concretely dangerous regardless of the label you choose to slap on it.
(Example: suppose you ask me to play the role of a person named John. You ask “John” what the best move is in a given chess position. Then the answer to that question is actually being generated by me, and it’s no coincidence that—if “John” is able to answer the question correctly—this implies something about my chess skills, not “John’s”.)
I don’t think we’re talking about the same thing here. I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies), whereas you seem like you’re talking about multiple “masks”. I don’t think it matters how many different roles the LLM can be asked to play; what matters is what the inner optimizer ends up wanting.
Mostly, I’m confused about the ontology you appear to be using here, and (more importantly) how you’re manipulating that ontology to get us nice things. “Next-token prediction” doesn’t get us nice things by default, as I’ve already argued, because of the existence of inner optimizers. “Masks” also don’t get us nice things, as far as I understand the way you’re using the term, because “masks” aren’t actually in control of the inner optimizer.
Yes.
I don’t know what you mean by “one” or by “inner”. I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.
I would not consider this case to be “one” inner optimizer since although most of the machinery is reused, it in practice acts differently and seeks different goals in each case, and I’m more concerned here with classifying things according to how they acts/what their effective goals are than the internal implementation details.
What this multi-optimizer (which I would not call “inner”) is going to “end up” wanting is whatever set of goals the particular mask has, that first has both desire and the capability to take over in some way. It’s not going to be some mysterious inner thing.
They aren’t?
In your example, the mask wanted to play chess, didn’t it, and what you call the “inner” optimizer returned a good move, didn’t it?
I can see two things you might mean about the mask not actually being in control:
1. That there is some underlying goal that this optimizer has that is different than satisfying the current mask’s goal, and it is only satisfying the mask’s goal instrumentally.
This I think is very unlikely for the reasons I put in the original post. It’s extra machinery that isn’t returning any value in training.
2. That this optimizer might at some times change goals (e.g. when the mask changes).
It might well be the case that the same optimizing machinery is utilized by different masks, so the goals change as the mask does but again, if at each time it is optimizing a goal set by/according to the mask, it’s better in my view to see it as part of/controlled by the mask.
Also, though you call this an “inner” optimizer, I would not like to call it inner since it applies at mask level in my view, and I would prefer to reserve an “inner” optimizer for something that applies other than at mask level, like John Searle pushing the papers around in his Chinese room (if you imagine he is optimizing for something rather than just following instructions).
Yeah, I’m growing increasingly confident that we’re talking about different things. I’m not referring to about “masks” in the sense that you mean it.
Yes, except that the “calculation system”, on my model, will have its own goals. It doesn’t have a cleanly factored “goal slot”, which means that (on my model) “takes as input a bunch of parameters that [...] define the goals, knowledge, and capabilities of the mask” doesn’t matter: the inner optimizer need not care about the “mask” role, any more than an actor shares their character’s values.
Yes, this is the key disagreement. I strongly disagree that the “extra machinery” is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model. And (again) because these goal representations are not cleanly factorable into something like an externally visible “goal slot”, and are moreover not constrained by the outer loss function, they are likely to be very arbitrary from the perspective of outsiders. This is the same point I tried to make in my earlier comment:
The evolutionary analogy is apt, in my view, and I’d like to ask you to meditate on it more directly. It’s a very concrete example of what happens when you optimize a system hard enough on an outer loss function (inclusive genetic fitness, in this case) that inner optimizers arise with respect to that outer loss (animals with their own brains). When these “inner optimizers” are weak, they consist largely of a set of heuristics, which perform well within the training environment, but which fail to generalize outside of it (hence the scare-quotes around “inner optimizers”). But when these inner optimizers do begin to exhibit patterns of cognition that generalize, what they end up generalizing is not the outer loss, but some collection of what were originally useful heuristics (e.g. kludgey approximations of game-theoretic concepts like tit-for-tat), reified into concepts which are now valued in their own right (“reputation”, “honor”, “kindness”, etc).
This is a direct consequence (in my view) of the fact that the outer loss function does not constrain the structure of the inner optimizer’s cognition. As a result, I don’t expect the inner optimizer to end up representing, in its own thoughts, a goal of the form “I need to predict the next token”, any more than humans explicitly calculate IGF when choosing their actions, or (say) a mathematician thinks “I need to do good maths” when doing maths. Instead, I basically expect the system to end up with cognitive heuristics/”adaptations” pertaining to the subject at hand—which in the case of our current systems is something like “be capable of answering any question I ask you.” Which is not a recipe for heuristics that end up unfolding into safely generalizing goals!
In my case your response made me much more confident we do have an underlying disagreement and not merely a clash of definitions.
I think the most key disagreement is this:
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn’t actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions. All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
Regarding the evolutionary analogy, while I’d generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
Note, it’s entirely possible for a mask to be power seeking and we should presumably expect a mask that executes a takeover to be power-seeking. But this power seeking would come as a mask goal and not as a hidden goal learned by the model for underlying general power-seeking reasons.
I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer’s cognition. I think this disagreement (which I internally feel like I’ve already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
...why? (The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.)
I still don’t understand your “mask” analogy, and currently suspect it of mostly being a red herring (this is what I was referring to when I said I think we’re not talking about the same thing). Could you rephrase your point without making mention to “masks” (or any synonyms), and describe more concretely what you’re imagining here, and how it leads to a (nonfake) “goal slot”?
(Where is a human actor’s “goal slot”? Can I tell an actor to play the role of Adolf Hitler, and thereby turn him into Hitler?)
I think “the mask” doesn’t make sense as a completion to that analogy, unless you replace “human behaviour/goals” with something much more specific, like “acting”. Humans certainly are capable of acting out roles, but that’s not what their inner cognition actually does! (And neither will it be what the inner optimizer does, unless the LLM in question is weak enough to not have one of those.)
I really think you’re still imagining here that the outer loss function is somehow constraining the model’s inner cognition (which is why you keep making arguments that seem premised on the idea that e.g. if the outer loss says to predict the next token, then the model ends up putting on “masks” and playing out personas)—but I’m not talking about the “mask”, I’m talking about the actor, and the fact that you keep bringing up the “mask” is really confusing to me, since it (in my view) forces an awkward analogy that doesn’t capture what I’m pointing at.
Actually, having written that out just now, I think I want to revisit this point:
I still think this is wrong, but I think I can give a better description of why it’s wrong than I did earlier: on my model, the heuristics learned by the model will be much more optimized towards world-modelling, not answering questions. “Answering questions” is (part of) the outer task, but the process of doing that requires the system to model and internalize and think about things having to do with the subject matter of the questions—which effectively means that the outer task becomes a wrapper which trains the system by proxy to acquire all kinds of potentially dangerous capabilities.
(Having heuristics oriented towards answering questions is a misdescription; you can’t correctly answer a math question you know nothing about by being very good at “generic question-answering”, because “generic question-answering” is not actually a concrete task you can be trained on. You have to be good at math, not “generic question-answering”, in order to be able to answer math questions.)
Which is to say, quoting from my previous comment:
None of this is about the “mask”. None of this is about the role the model is asked to play during inference. Instead, it’s about the thinking the model must have learned to do in order to be able to don those “masks”—which (for sufficiently powerful models) implies the existence of an actor which (a) knows how to answer, itself, all of the questions it’s asked, and (b) is not the same entity as any of the “masks” it’s asked to don.
My other reply addressed what I thought is the core of our disagreement, but not particularly your exact statements you make in your comment. So I’m addressing them here.
Let me be clear that I am NOT saying that any inner optimizer, if it exists, would have a goal that is equal to minimizing the outer loss. What I am saying is that it would have a goal that, in practice, when implemented in a single pass of the LLM has the effect of of minimizing the LLM’s overall outer loss with respect to that ONE token. And that it would be very hard for such a goal to cash out, in practice, to wanting long range real-world effects.
Let me also point out your implicit assumption that there is an ‘inner’ cognition which is not literally the mask.
Here is some other claim someone could make:
This person would be saying, “hey look, this datacenter full of GPUs is carrying out this agentic-looking cognition. And, it could easily carry out other, completely different agentic cognition. Therefore, the datacenter must have these capabilities independently from the LLM and must have its own ‘inner’ cognition.”
I think that you are making the same philosophical error that this claim would be making.
However, if we didn’t understand GPUs we could still imagine that the datacenter does have its own, independent ‘inner’ cognition, analogous to, as I noted in a previous comment, John Searle in his Chinese room. And if this were the case, it would be reasonable to expect that this inner cognition might only be ‘acting’ for instrumental reasons and could be waiting for an opportunity to jump out and suddenly do something else other than running the LLM.
The GPU software is not tightly optimized specifically to run the LLM or an ensemble of LLMs and could indeed have other complications and who knows what it could end up doing?
Because the LLM does super duper complicated stuff instead of massively parallelized simple stuff, I think it’s a bit more reasonable to expect there to be internal agentic stuff inside it. For all I know it could be one agent (or ensemble of agents) on top of another for many layers!
But, unlike in the case of the datacenter, we do have strong reasons to believe that these agents, if they exist, will have goals correctly targeted at doing what in practice achieves the best best results in a single forward pass of the model (next token prediction) and not on attempting long-term or real world effects (see my other reply to your comment).
The LLM is generating output that resembles training data produced by a variety of processes (mostly humans). The stronger the LLM becomes, the more the properties of the output are determined by (generalizations of) the properties of the training data and generating processes. Some of the data is generated by agentic processes with different goals. In order to accurately predict them, the LLM must model these goals. The output of the LLM is then influenced by these goals which are derived/generalized from these external processes. (This is the core of what I mean by the “mask”). Any separate goal that originates “internally” must not cause deviations from all this, or it would have been squashed in training. Therefore, apparently agentic behaviour of the output must originate in the external processes being emulated or generalizations of them, and not from separate, internal goals (see my other reply for additional argument but also caveats).
OK, I think I’m now seeing what you’re saying here (edit: see my other reply for additional perspective and addressing particular statements made in your comment):
In order to predict well in complicated and diverse situations, the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from anything defined by the static trained weights, because it exists only as data (activations) within the overall model at inference time, not in the weights themselves. You can then refer to this temporary internal model as the “mask” and the actual machinery that generated it, which may in fact be the entire network, as the “actor”.
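(To make sure I’m picturing this the same way, here is a toy sketch of the weights-versus-activations distinction; the model and names are my own invention for illustration, not anything from our discussion or from a real architecture.)

```python
# Toy illustration (entirely my own; no claim about real transformer internals):
# the static trained weights play the "actor" role, while the "mask" exists only
# as activations computed for a particular context.
import torch
import torch.nn as nn

class TinyActor(nn.Module):
    def __init__(self, vocab_size=256, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # static weights, fixed after training
        self.mix = nn.Linear(dim, dim)
        self.readout = nn.Linear(dim, vocab_size)

    def forward(self, context):  # context: (batch, seq) of token ids
        # "Mask": a temporary internal state that exists only as data (activations)
        # for this particular context; it appears nowhere in the trained weights.
        temporary_model = torch.tanh(self.mix(self.embed(context).mean(dim=1)))
        # The next token is then read off this temporary internal model.
        return self.readout(temporary_model)

actor = TinyActor()
logits = actor(torch.randint(0, 256, (1, 8)))  # a different context yields a different "mask"
```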
Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way and this is a much cleaner definition of “mask” than I had before.
However, I think that you are then inferring from this an additional claim that I do not think follows: that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently of the temporary internal model.
In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask).
While this “actor” is indeed not the same as any of the “masks”, it doesn’t “itself” know the answer to any of the questions. It needs to generate and “wear” the mask to do that.
This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model.
This also is an update for me: I had not understood that this is what you were saying and had not considered this possibility, and now that I consider it, I do think it must in principle be possible.
However, I do not think this would work the way you claim.
First, let’s consider what would be the optimal (in terms of what is best reinforced by training) goal for this machinery (as considered independently of the mask) to have.
I claim this optimal trained goal is to produce the best (most accurate) internal model from the perspective of predicting the next and only the next token. The reason for this is that (ignoring fine-tuning for now) the (outer) model is trained offline on a stream of tokens that is not varied based on the predictions it makes. So, there is no way, in training, for a strategic decision to vary the internal model from what would make the best prediction now to pay off in terms of easier predictions later.
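(For concreteness, here is a minimal sketch of the offline, teacher-forced training setup I have in mind; this is my own toy illustration, not anyone’s actual training code, and the tiny model is just a stand-in for an LLM.)

```python
# Minimal sketch of offline next-token training with teacher forcing (toy model,
# my own illustration). The key property: the loss at each position is scored
# against the fixed token stream, conditioned on the ground-truth prefix, so a
# "strategically worse" prediction now cannot buy easier predictions later.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    def __init__(self, vocab_size=256, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):                 # x: (batch, seq)
        return self.head(self.embed(x))   # (batch, seq, vocab)

def next_token_loss(lm, tokens):
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # ground-truth prefix and targets
    logits = lm(inputs)
    # The model's own predictions never feed back into the data it is trained on.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

lm = TinyLM()
corpus_batch = torch.randint(0, 256, (4, 16))  # offline stream, independent of lm's outputs
next_token_loss(lm, corpus_batch).backward()
```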
Now, to the extent that the system is actually following this goal, it is generating the best model it can under the circumstances, and strategic behaviour seen in the output occurs because the best model generated under those circumstances exhibits that strategic behaviour. The strategic behaviour is thus rooted in the mask, and while it is implemented by an underlying “actor”, the more tightly that actor is optimized to follow this goal, the less room it has to cause deviations from the mask-rooted behaviour. In the limit of an infinitely powerful LLM, you wind up with a mask that is an optimal model of the incoming data, and this optimal model will act as an optimal emulator of the (probability distribution of) process(es) which most likely would have created the text for which the next token is being predicted. This will exhibit agentic behaviour and be extremely powerful and dangerous, but that behaviour has nothing to do with whatever underlying goals may exist in the pile of math that generated it, and everything to do with the goals of the emulated (probability distribution of) process(es). I.e. the output is all mask, and the “actor” is only relevant as a substrate to enact it.
And that’s the main point from the original post now re-expressed better thanks to the clarification from our discussion.
However, here are some considerations on how this argument, though I think it is very strong, could possibly be false in practice:
Since the internal-model-generating machinery is very tightly optimized in training, whatever goal it actually has can’t deviate very much from the optimal one, in terms of what the internal model (and thus the overall model) actually outputs on the training set. It may, however, have rare or small deviations. We don’t need to be concerned, from a takeover perspective, about small deviations, but rare ones might be a concern. At inference time, these would presumably still be fairly rare (though less so), since most input presumably resembles the training data. But then we need to consider why these possible rare but large deviations might exist.
One thing that would be extremely unlikely is that the model is actively deciding, on every token, to either deceptively follow the optimal goal or instead deviate to follow its own separate goal. This would require all the machinery needed to just follow the optimal goal, plus additional machinery to handle the deception. This extra machinery would be inefficient and would get trained out.
If the rare large deviations, when they occur, are not strategic, this would also not be a concern.
But there is also a third possibility: that the system is following a goal that leads it to honestly track the optimal goal very closely in almost all circumstances, but to strategically deviate in response to rare inputs. Reasons for this might be:
it has a goal that deviates from the optimal goal in rare circumstances, perhaps due to the optimal goal being hard to generalize and deviations on most inputs being hammered down in training
it has a goal that, if correctly interpreted, would deviate from the optimal goal, but it only realizes and implements this deviation in response to rare inputs—perhaps a result of training hammering down this tendency on most inputs, or it requiring a particular “insight” that the rare inputs could provide.
A misgeneralization of the model-generating machinery’s optimal goal along either of these lines seems unlikely to me, since a goal that bottoms out in generating a model within a forward pass seems simpler and easier than one that bottoms out in anything to do with the real world, such as “world-modelling” in a sense that would apply beyond next-token prediction. However, it could be that I’m wrong! Ignoring fine-tuning, this is probably the biggest potential for a major violation of what I’m arguing, and if it did occur, it might closely resemble what you are proposing as the danger of “actor” takeover, I think.
The argument that the model won’t make strategic decisions to make worse predictions now in order to make better predictions later (since the training data isn’t influenced by past outputs) is not perfectly sound: the training data actually does contain weird agentic-looking patterns, and the model might somehow misgeneralize, attempt to strategically influence its input, and have this wind up actually being rewarded because it mimics strategic behaviour that genuinely occurs in the training data.
Also, fine-tuning could affect the conclusion in two ways—it is directly intended to add goals, and it breaks the assumption that the training isn’t influenced by past outputs—human feedback is potentially influenced by more than the most recent output! However, I don’t expect the latter to be significant.
The former of course could be very significant, but:
I think the fine-tuning, though, is most likely primarily influencing the types of masks that are created, and shifts the goal of the underlying machinery, if it exists, in the obvious mask-relevant way (i.e. to create models that will satisfy the fine-tuning rather than just optimal next-token prediction). This is because the fine-tuning directly targets aspects of the output that are, in my view, best thought of as properties of the mask, so the changes in the underlying model needed to implement the effects of fine-tuning are going to be directly relevant to mask characteristics, and relatively orthogonal to other considerations. I think this is unlikely to qualitatively shift the dynamic from what occurs with pure next-token prediction: the output is still decided by the “mask”.
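(To illustrate what I mean by fine-tuning targeting properties of the output, here is a toy REINFORCE-style sketch; the reward function and model are made up for illustration, and this is not how any particular RLHF pipeline actually works. The point is only that the training signal is a score assigned to the visible sample, so the pressure lands on which outputs, i.e. which masks, get produced.)

```python
# Toy REINFORCE-style sketch (my own illustration; not a real RLHF pipeline).
# The reward sees only the sampled output, so fine-tuning pressure acts on
# properties of the visible output (the "mask"), not on anything hidden.
import torch
import torch.nn as nn

class TinyLM(nn.Module):  # same stand-in toy model as in the earlier sketch
    def __init__(self, vocab_size=256, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        return self.head(self.embed(x))

def reward(sampled_tokens):
    # Hypothetical stand-in for human/reward-model feedback: it can only
    # score the output it is shown.
    return (sampled_tokens % 2 == 0).float()

lm = TinyLM()
context = torch.randint(0, 256, (4, 8))
dist = torch.distributions.Categorical(logits=lm(context)[:, -1, :])
sample = dist.sample()                                   # the visible output
loss = -(reward(sample) * dist.log_prob(sample)).mean()  # signal depends only on the sample
loss.backward()
```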