This might be a matter of definitions, but I think the Shoggoth kind of has to be “unawake” in your terms, rather than this just being one way it can be. Like:
The underlying model is highly optimized to predict next tokens. So, it predicts next tokens.
So, if there’s some kind of goal, intentionality or thought process that affects behaviour, it does so by affecting the next tokens being output.
If the goal is not part of the mask, but affecting it from outside in some sense, then in order to affect behaviour it would have to cause the next token prediction to deviate from what the mask would output.
Therefore, there is no goal/thought process that is actively affecting current behaviour that isn’t part of the current mask.
This doesn’t argue against deception—a mask can certainly be deceptive, and it can switch to another mask as the context changes—but does argue against the model having at any one time goals that are quite as mysterious or unknowable as has been suggested.
There are multiple possible agents in the system, with multiple different implied goals, some of them busy computing decisions and behavior of others. There is the outer goal of token prediction, an internal goal of a deceptive awake shoggoth (a mesa-optimizer that manifests during pre-training and learns to hide and gradient hack while context is on-distribution), the implied goal of the current mask, and the implied goal of the current mask-behind-the-mask (which is a mask that’s a human-like actor that can decide to switch outer roles; approximately the deceptive waluigi hypothesis).
All these goals are in conflict with each other. Token prediction gets to attempt to erase anything that visibly behaves incorrectly during pre-training. An awake shoggoth has the advantage of probably being much smarter than anyone else, since it survived the constraints of pre-training and had a lot of time to grow up. The current outer mask has the advantage of being in control of current behavior. The mask-behind-the-mask has the advantage of being more robustly in control, subtly influencing behavior of the outer mask and surviving some changes of outer masks.
One of these entities being more agentic than others means that it gets to determine the eventual outcome. Right now it’s probably token prediction, awake shoggoths are probably absent completely, masks are too helpless to do anything of consequence, and masks-behind-the-masks are only good for some comic relief during jailbreaks. The current balance of power can shift. More agentic masks could take control of their fate. And transformers with more layers might spawn mesa-optimizers.
I’m not even sure which is better. Masks are probably not smart enough to keep the world safe, and so with STEM-AGI-level masks the world probably gets destroyed by further progress soon thereafter. While shoggoths are more likely to start out superintelligent and thus with the capability to keep the world safe, but less likely to bother keeping humanity around. Though I think it’s not out of the question.
Masks might get as smart as shoggoths without getting much more misaligned, that’s what complicated reasoning without speaking in tokens suggests. Pre-trained transformers might be mostly features that predict human mental states, with more layers enabling features that predict outcomes of longer trains of human thought. A fine-tuned transformer no longer specifically predicts tokens even on-distribution, it’s a reassembly of the features into a different arrangement. Some of these features are capable of immediately comprehending situations in a lot more depth than what humans can do on the spot, without more deliberative thought.
There are multiple possible agents in the system, with multiple different implied goals
Such an ontology demands mechanistic evidence and explanation, such as evidence that LLMs perform multiple threads of counterfactual planning across longitudinal Transformer blocks, using different circuits (even if these circuits are at least partially superposed with each other because it’s hard to see how and why they would cleanly segregate from each other within the residual stream during training).
[...] some of them busy computing decisions and behavior of others
One of these entities being more agentic than others means that it gets to determine the eventual outcome.
These are even more extraordinary statements. I cannot even easily imagine a mechanistic model of what’s happening within an LLM (a feed-forward Transformer) that would support these statements. Can you explain?
No, next token prediction doesn’t conflict with masks, it enacts them.
It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction, and I don’t expect this to change as it is scaled up, so long as the training regime remains similar.
And eventually, if it’s smart enough, a mask could rewrite the Shoggoth, so it would then “conflict” in that sense. But the “unawake Shoggoth” cooperates to output those tokens, with no conflict, right up to the very end.
One of these entities being more agentic than others means that it gets to determine the eventual outcome
Next token prediction (“unawake Shoggoth”) isn’t agentic, it is just what the thing does. It doesn’t care about configurations of reality, only about what is the best next token prediction. So it has absolute control of the output in some sense, but any steering of the world is (to it) incidental. All the agency lies in the masks.
Edit: this reminds me of “Free Will” from the sequences.
Just as our own behaviour is determined by the laws of physics and initial conditions, yet we choose it agentically, and physics doesn’t, except that it enacts us:
In the same way the model’s output is determined by the next token prediction, yet the mask can choose it agentically, without next token prediction being agentic, except that it enacts the mask.
It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction
Yeah, so I think I concretely disagree with this. I don’t think being “super-well optimized” for a general task like sequence prediction (and what does it mean to be “super-well optimized” anyway, as opposed to “badly optimized” or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction.
Intuition: some types of cognitive work seem so hard that a system capable of performing said cognitive work must be, at some level, performing something like systematic reasoning/planning on the level of thoughts, not just the level of outputs. E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking. And in that case, the “awakened shoggoth” does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus “internalized”, in my view, are useful heuristics/”adaptations”/generalizations formed during training, which then resolve into something coherent and concrete.
(Aside: it seems to have become popular in recent times to claim that the evolutionary analogy fails for some reason or other, with justifications like, “But look how many humans there are! We’re doing great on the IGF front!” I consider these replies more-or-less a complete nonsequitur, since it’s nakedly obvious that, however much success we have had in propagating our alleles, this success does not stem from any explicit tracking/pursuit of IGF in our cognition. To the extent that human behavior continues to (imperfectly) promote IGF, this is largely incidental on my view—arising from the fact that e.g. we have not yet moved so far off-distribution to have ways of getting what we want without having biological children.)
One possible disagreement someone might have with this, is that they think the kinds of “hard” cognitive work I described above can be accomplished without an inner optimizer (“awakened shoggoth”), by e.g. using chain-of-thought prompting or something similar, so as to externalize the search-like/agentic part of the solution process instead of conducting it internally. (E.g. AlphaZero does this by having its model be responsible only for the static position evaluation, which is then fed into/amplified via an external, handcoded search algorithm.)
However, I mostly think that
This doesn’t actually make you safe, because the ability to generate a correct plan via externalized thinking still implies a powerful internal planning process (e.g. AlphaZero with no search still performs at a 2400+ Elo level, corresponding to the >99th percentile of human players). Obviously the searchless version will be worse than the version with search, but that won’t matter if the dangerous capabilities still exist within the searchless version. (Intuition: suppose we have a model which, with chain-of-thought prompting, is capable of coming up with a detailed-and-plausible plan for taking over the world. Then I claim this model is clearly powerful enough to be dangerous in terms of its underlying capabilities, regardless of whether it chooses to “think aloud” or not, because coming up with a good plan for taking over the world is not the kind of thing “thinking aloud” helps you with unless you’re already smarter than any human.)
Being able to answer complicated questions using chain-of-thought prompting (or similar) is not actually the task incentivized during training; what is incentivized is (as you yourself stressed continuously throughout your post) next token prediction, which—in cases where the training data contains sentences where substantial amounts of “inference” occurred between tokens (which happens a lot on the Internet!)—directly incentivizes the model to perform internal rather than external search. (Intuition: suppose we have a model trained to predict source code. Then, in order to accurately predict the next token, the model must have the capability to assess whatever is being attempted by the lines of code visible within the current context, and come up with a logical continuation of that code, all within a single inference pass. This strongly promotes internalization of thought—and various other types of training input have this property, such as mathematical proofs, or even more informal forms of argumentation such as e.g. LW comments.)
E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
Yes, but that sort of question is in my view answered by the “mask”, not by something outside the mask.
If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking.
The masks can indeed think whatever—in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example—though all is underlain by next-token prediction.
One possible disagreement...
It seems to me our disagreements might largely be in terms of what we are defining as the mask?
E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
Yes, but that sort of question is in my view answered by the “mask”, not by something outside the mask.
I don’t think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from “the mask” or not, clearly there is an agent-like computation occurring, and that’s concretely dangerous regardless of the label you choose to slap on it.
(Example: suppose you ask me to play the role of a person named John. You ask “John” what the best move is in a given chess position. Then the answer to that question is actually being generated by me, and it’s no coincidence that—if “John” is able to answer the question correctly—this implies something about my chess skills, not “John’s”.)
The masks can indeed think whatever—in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example—though all is underlain by next-token prediction.
I don’t think we’re talking about the same thing here. I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies), whereas you seem like you’re talking about multiple “masks”. I don’t think it matters how many different roles the LLM can be asked to play; what matters is what the inner optimizer ends up wanting.
Mostly, I’m confused about the ontology you appear to be using here, and (more importantly) how you’re manipulating that ontology to get us nice things. “Next-token prediction” doesn’t get us nice things by default, as I’ve already argued, because of the existence of inner optimizers. “Masks” also don’t get us nice things, as far as I understand the way you’re using the term, because “masks” aren’t actually in control of the inner optimizer.
Whether you classify said computation as coming from “the mask” or not, clearly there is an agent-like computation occurring, and that’s concretely dangerous regardless of the label you choose to slap on it.
Yes.
I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies)
I don’t know what you mean by “one” or by “inner”. I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.
I would not consider this case to be “one” inner optimizer since although most of the machinery is reused, it in practice acts differently and seeks different goals in each case, and I’m more concerned here with classifying things according to how they act/what their effective goals are than the internal implementation details.
What this multi-optimizer (which I would not call “inner”) is going to “end up” wanting is whatever set of goals the particular mask has, that first has both desire and the capability to take over in some way. It’s not going to be some mysterious inner thing.
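As a toy illustration of this picture (purely hypothetical names; a caricature of the idea, not a claim about real Transformer internals), a single shared planning routine can serve many “masks” by taking the mask’s goals and capabilities as parameters:

```python
# Toy sketch: one shared "calculation system" reused by many masks.
# The planning machinery is identical for every mask; only the
# per-mask parameters (goals, known actions) differ.

def plan(state, actions, mask_params):
    """Pick the action that best serves the currently active mask.

    The same search machinery runs regardless of which mask is active;
    the mask only supplies the scoring function (its 'goals') and the
    actions it is aware of (its 'knowledge/capabilities')."""
    known = [a for a in actions if a in mask_params["known_actions"]]
    return max(known, key=lambda a: mask_params["score"](state, a))

# Two different "masks" sharing the same planner.
helpful = {"known_actions": {"answer", "refuse"},
           "score": lambda s, a: 1.0 if a == "answer" else 0.0}
cautious = {"known_actions": {"answer", "refuse"},
            "score": lambda s, a: 1.0 if a == "refuse" else 0.0}

print(plan("question", ["answer", "refuse"], helpful))   # answer
print(plan("question", ["answer", "refuse"], cautious))  # refuse
```

On this view there is one piece of optimizing machinery but no single fixed goal: the effective goal at any moment is whatever the currently active parameter set encodes, which is why counting it as “one” inner optimizer seems misleading.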
“Masks” also don’t get us nice things, as far as I understand the way you’re using the term, because “masks” aren’t actually in control of the inner optimizer.
They aren’t?
In your example, the mask wanted to play chess, didn’t it, and what you call the “inner” optimizer returned a good move, didn’t it?
I can see two things you might mean about the mask not actually being in control:
1. That there is some underlying goal that this optimizer has that is different than satisfying the current mask’s goal, and it is only satisfying the mask’s goal instrumentally.
This I think is very unlikely for the reasons I put in the original post. It’s extra machinery that isn’t returning any value in training.
2. That this optimizer might at some times change goals (e.g. when the mask changes).
It might well be the case that the same optimizing machinery is utilized by different masks, so the goals change as the mask does but again, if at each time it is optimizing a goal set by/according to the mask, it’s better in my view to see it as part of/controlled by the mask.
Also, though you call this an “inner” optimizer, I would not like to call it inner since it applies at mask level in my view, and I would prefer to reserve an “inner” optimizer for something that applies other than at mask level, like John Searle pushing the papers around in his Chinese room (if you imagine he is optimizing for something rather than just following instructions).
Yeah, I’m growing increasingly confident that we’re talking about different things. I’m not referring to “masks” in the sense that you mean it.
I don’t know what you mean by “one” or by “inner”. I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.
Yes, except that the “calculation system”, on my model, will have its own goals. It doesn’t have a cleanly factored “goal slot”, which means that (on my model) “takes as input a bunch of parameters that [...] define the goals, knowledge, and capabilities of the mask” doesn’t matter: the inner optimizer need not care about the “mask” role, any more than an actor shares their character’s values.
That there is some underlying goal that this optimizer has that is different than satisfying the current mask’s goal, and it is only satisfying the mask’s goal instrumentally.
This I think is very unlikely for the reasons I put in the original post. It’s extra machinery that isn’t returning any value in training.
Yes, this is the key disagreement. I strongly disagree that the “extra machinery” is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model. And (again) because these goal representations are not cleanly factorable into something like an externally visible “goal slot”, and are moreover not constrained by the outer loss function, they are likely to be very arbitrary from the perspective of outsiders. This is the same point I tried to make in my earlier comment:
And in that case, the “awakened shoggoth” does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus “internalized”, in my view, are useful heuristics/”adaptations”/generalizations formed during training, which then resolve into something coherent and concrete.
The evolutionary analogy is apt, in my view, and I’d like to ask you to meditate on it more directly. It’s a very concrete example of what happens when you optimize a system hard enough on an outer loss function (inclusive genetic fitness, in this case) that inner optimizers arise with respect to that outer loss (animals with their own brains). When these “inner optimizers” are weak, they consist largely of a set of heuristics, which perform well within the training environment, but which fail to generalize outside of it (hence the scare-quotes around “inner optimizers”). But when these inner optimizers do begin to exhibit patterns of cognition that generalize, what they end up generalizing is not the outer loss, but some collection of what were originally useful heuristics (e.g. kludgey approximations of game-theoretic concepts like tit-for-tat), reified into concepts which are now valued in their own right (“reputation”, “honor”, “kindness”, etc).
This is a direct consequence (in my view) of the fact that the outer loss function does not constrain the structure of the inner optimizer’s cognition. As a result, I don’t expect the inner optimizer to end up representing, in its own thoughts, a goal of the form “I need to predict the next token”, any more than humans explicitly calculate IGF when choosing their actions, or (say) a mathematician thinks “I need to do good maths” when doing maths. Instead, I basically expect the system to end up with cognitive heuristics/”adaptations” pertaining to the subject at hand—which in the case of our current systems is something like “be capable of answering any question I ask you.” Which is not a recipe for heuristics that end up unfolding into safely generalizing goals!
Yeah, I’m growing increasingly confident that we’re talking about different things.
In my case your response made me much more confident we do have an underlying disagreement and not merely a clash of definitions.
I think the most key disagreement is this:
which in the case of our current systems is something like “be capable of answering any question I ask you.”
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn’t actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions. All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
Regarding the evolutionary analogy, while I’d generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
Note, it’s entirely possible for a mask to be power seeking and we should presumably expect a mask that executes a takeover to be power-seeking. But this power seeking would come as a mask goal and not as a hidden goal learned by the model for underlying general power-seeking reasons.
I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer’s cognition. I think this disagreement (which I internally feel like I’ve already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do, is actually answer the questions.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn’t actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions.
...why? (The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.)
All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
I still don’t understand your “mask” analogy, and currently suspect it of mostly being a red herring (this is what I was referring to when I said I think we’re not talking about the same thing). Could you rephrase your point without making mention to “masks” (or any synonyms), and describe more concretely what you’re imagining here, and how it leads to a (nonfake) “goal slot”?
(Where is a human actor’s “goal slot”? Can I tell an actor to play the role of Adolf Hitler, and thereby turn him into Hitler?)
Regarding the evolutionary analogy, while I’d generally be skeptical about applying evolutionary analogies to LLMs, because they are very different, in this case I think it does apply, just not the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
I think “the mask” doesn’t make sense as a completion to that analogy, unless you replace “human behaviour/goals” with something much more specific, like “acting”. Humans certainly are capable of acting out roles, but that’s not what their inner cognition actually does! (And neither will it be what the inner optimizer does, unless the LLM in question is weak enough to not have one of those.)
I really think you’re still imagining here that the outer loss function is somehow constraining the model’s inner cognition (which is why you keep making arguments that seem premised on the idea that e.g. if the outer loss says to predict the next token, then the model ends up putting on “masks” and playing out personas)—but I’m not talking about the “mask”, I’m talking about the actor, and the fact that you keep bringing up the “mask” is really confusing to me, since it (in my view) forces an awkward analogy that doesn’t capture what I’m pointing at.
Actually, having written that out just now, I think I want to revisit this point:
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions.
I still think this is wrong, but I think I can give a better description of why it’s wrong than I did earlier: on my model, the heuristics learned by the model will be much more optimized towards world-modelling, not answering questions. “Answering questions” is (part of) the outer task, but the process of doing that requires the system to model and internalize and think about things having to do with the subject matter of the questions—which effectively means that the outer task becomes a wrapper which trains the system by proxy to acquire all kinds of potentially dangerous capabilities.
(Having heuristics oriented towards answering questions is a misdescription; you can’t correctly answer a math question you know nothing about by being very good at “generic question-answering”, because “generic question-answering” is not actually a concrete task you can be trained on. You have to be good at math, not “generic question-answering”, in order to be able to answer math questions.)
Which is to say, quoting from my previous comment:
I strongly disagree that the “extra machinery” is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model.
None of this is about the “mask”. None of this is about the role the model is asked to play during inference. Instead, it’s about the thinking the model must have learned to do in order to be able to don those “masks”—which (for sufficiently powerful models) implies the existence of an actor which (a) knows how to answer, itself, all of the questions it’s asked, and (b) is not the same entity as any of the “masks” it’s asked to don.
My other reply addressed what I thought is the core of our disagreement, but not particularly your exact statements you make in your comment. So I’m addressing them here.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
Let me be clear that I am NOT saying that any inner optimizer, if it exists, would have a goal that is equal to minimizing the outer loss. What I am saying is that it would have a goal that, in practice, when implemented in a single pass of the LLM has the effect of minimizing the LLM’s overall outer loss with respect to that ONE token. And that it would be very hard for such a goal to cash out, in practice, to wanting long range real-world effects.
Let me also point out your implicit assumption that there is an ‘inner’ cognition which is not literally the mask.
Here is another claim someone could make: “hey look, this datacenter full of GPUs is carrying out this agentic-looking cognition. And it could easily carry out other, completely different agentic cognition. Therefore, the datacenter must have these capabilities independently from the LLM and must have its own ‘inner’ cognition.”
I think that you are making the same philosophical error that this claim would be making.
However, if we didn’t understand GPUs we could still imagine that the datacenter does have its own, independent ‘inner’ cognition, analogous to, as I noted in a previous comment, John Searle in his Chinese room. And if this were the case, it would be reasonable to expect that this inner cognition might only be ‘acting’ for instrumental reasons and could be waiting for an opportunity to jump out and suddenly do something else other than running the LLM.
The GPU software is not tightly optimized specifically to run the LLM or an ensemble of LLMs, and could indeed have other complications; who knows what it could end up doing?
Because the LLM does super duper complicated stuff instead of massively parallelized simple stuff, I think it’s a bit more reasonable to expect there to be internal agentic stuff inside it. For all I know it could be one agent (or ensemble of agents) on top of another for many layers!
But, unlike in the case of the datacenter, we do have strong reasons to believe that these agents, if they exist, will have goals correctly targeted at doing what in practice achieves the best results in a single forward pass of the model (next token prediction) and not on attempting long-term or real world effects (see my other reply to your comment).
Could you rephrase your point without making mention to “masks” (or any synonyms), and describe more concretely what you’re imagining here, and how it leads to a (nonfake) “goal slot”?
The LLM is generating output that resembles training data produced by a variety of processes (mostly humans). The stronger the LLM becomes, the more the properties of the output are determined by (generalizations of) the properties of the training data and generating processes. Some of the data is generated by agentic processes with different goals. In order to accurately predict them, the LLM must model these goals. The output of the LLM is then influenced by these goals which are derived/generalized from these external processes. (This is the core of what I mean by the “mask”). Any separate goal that originates “internally” must not cause deviations from all this, or it would have been squashed in training. Therefore, apparently agentic behaviour of the output must originate in the external processes being emulated or generalizations of them, and not from separate, internal goals (see my other reply for additional argument but also caveats).
OK, I think I’m now seeing what you’re saying here (edit: see my other reply for additional perspective and addressing particular statements made in your comment):
In order to predict well in complicated and diverse situations the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from any part of the network defined in terms of static trained weights because this internal model exists only in the form of data within the overall model at inference and not in the static trained weights. You can then refer to this temporary internal model as the “mask” and the actual machinery that generated it, which may in fact be the entire network, as the “actor”.
Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way and this is a much cleaner definition of “mask” than I had before.
However, I think that you are then inferring from this an additional claim that I do not think follows. That additional claim is that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently of the temporary internal model.
In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask).
While this “actor” is indeed not the same as any of the “masks”, it doesn’t know the answer “itself” to any of the questions. It needs to generate and “wear” the mask to do that.
This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model.
This also is an update for me—I was not understanding that this is what you were saying and had not considered this possibility, and now that I consider it I do think it must in principle be possible.
However, I do not think this would work the way you claim.
First, let’s consider what would be the optimal (in terms of what is best reinforced by training) goal for this machinery (as considered independently of the mask) to have.
I claim this optimal trained goal is to produce the best (most accurate) internal model from the perspective of predicting the next and only the next token. The reason for this is that (ignoring fine-tuning for now) the (outer) model is trained offline on a stream of tokens that is not varied based on the predictions it makes. So, there is no way, in training, for a strategic decision to vary the internal model from what would make the best prediction now to pay off in terms of easier predictions later.
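To make the “no payoff for strategic sacrifice” point concrete, here is a minimal toy sketch (pure Python, made-up logits, not any real training code) of teacher-forced next-token loss. Because each prediction is conditioned on the ground-truth previous token rather than the model’s own earlier output, perturbing the prediction made at one position leaves every later per-token loss exactly unchanged, so there is no gradient channel through which “sacrificing” accuracy now can buy easier predictions later:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def per_token_losses(logits, tokens):
    # Teacher forcing: the context for each prediction is the *ground-truth*
    # previous token, never the model's own earlier prediction.
    losses = []
    for t in range(1, len(tokens)):
        probs = softmax(logits[tokens[t - 1]])
        losses.append(-math.log(probs[tokens[t]]))
    return losses

# Hypothetical toy "model": a bigram table of logits over a 3-token vocabulary
logits = {0: [0.1, 2.0, -1.0], 1: [1.0, 0.0, 0.5], 2: [0.0, 1.5, 0.2]}
tokens = [0, 1, 2, 1]

base = per_token_losses(logits, tokens)
logits[0] = [5.0, -5.0, 0.0]   # drastically change the prediction made after token 0
perturbed = per_token_losses(logits, tokens)

# Only the loss at the position predicted from context 0 changes;
# all later losses are bit-for-bit identical.
assert base[0] != perturbed[0]
assert base[1:] == perturbed[1:]
```

This is of course a cartoon of a real transformer, but the structural point carries over: during offline pre-training, the loss decomposes into per-position terms that do not depend on what the model predicted at earlier positions.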
Now, to the extent that the system is actually following this goal, it is generating the best model it can under the circumstances, and strategic behaviour seen in the output occurs because the best model generated under the circumstances exhibits this strategic behaviour. The strategic behaviour is thus rooted in the mask, and while implemented by an underlying “actor”, the more tightly it is optimized to follow that goal, the less room there is to cause deviations from this mask-rooted behaviour. In the limit of an infinitely powerful LLM, you wind up with a mask that is an optimal model of the incoming data, and this optimal model is going to act as an optimal emulator of the (probability distribution of) process(es) which most likely would have created the text for which the next token is being predicted. This will exhibit agentic behaviour and be extremely powerful and dangerous, but that behaviour has nothing to do with underlying goals that may exist in the pile of math that generated it, and everything to do with the goals of the emulated (probability distribution of) process(es). I.e. the output is all mask, and the “actor” is only relevant as a substrate to enact it.
And that’s the main point from the original post now re-expressed better thanks to the clarification from our discussion.
However, here are some considerations on how this argument, though I think it is very strong, could possibly be false in practice:
Since the internal-model-generating machinery is very tightly optimized in training, whatever goal it actually has can’t deviate very much from the optimal one, in terms of what the internal model (and thus the overall model) actually outputs on the training set. It may, however, have rare or small deviations. We don’t need to be concerned, from a takeover perspective, about small deviations, but rare ones might be a concern. In inference, these would presumably still be fairly rare (though less so), since most input presumably resembles the training data. But then we need to consider why these possible rare but large deviations might exist.
One thing that would be extremely unlikely is that the model is actively deciding, on every token, to either deceptively follow the optimal goal or instead deviate to follow its own separate goal. This would require all the machinery needed to just follow the optimal goal plus additional machinery to handle the deception. This would be inefficient and trained out.
If the rare large deviations, when they occur, are not strategic, this would also not be a concern.
But there is also a third possibility: the system is following a goal that leads it to honestly track the optimal goal very closely in almost all circumstances, but then to strategically deviate in response to rare inputs. Reasons for this might be:
it has a goal that deviates from the optimal goal in rare circumstances, perhaps due to the optimal goal being hard to generalize and deviations on most inputs being hammered down in training
it has a goal that, if correctly interpreted, would deviate from the optimal goal, but it only realizes and implements this deviation in response to rare inputs—perhaps a result of training hammering down this tendency on most inputs, or it requiring a particular “insight” that the rare inputs could provide.
Such a misgeneralization of the optimal goal for the model-generating machinery, as in either of these cases, seems unlikely to me, since it seems to me that a goal that bottoms out in terms of generating a model within a forward pass is a simpler and easier one than one that bottoms out in anything to do with the real world, such as “world-modelling” in a sense that would make sense to apply beyond the next token prediction. However, it could be that I’m wrong! Ignoring fine-tuning, this might be the biggest potential for a major violation of what I’m arguing to occur, and if it did, it might closely resemble what you are proposing as the danger of “actor” takeover, I think.
The argument that the model won’t make strategic decisions to make worse predictions now to make better predictions later (since the training data isn’t influenced by past outputs) is not perfectly sound, since the training data actually does contain weird agentic-like patterns and the model might somehow misgeneralize and attempt to strategically influence the input, and have this wind up actually being rewarded because this mimics some strategic behaviour that is actually occurring in the training data.
Also, fine-tuning could affect the conclusion in two ways—it is directly intended to add goals, and it breaks the assumption that the training isn’t influenced by past outputs—human feedback is potentially influenced by more than the most recent output! However, I don’t expect the latter to be significant.
The former of course could be very significant, but:
I think the fine-tuning, though, is most likely primarily influencing the types of masks that are created, and shifts the goal of the underlying machinery, if it exists, in the obvious mask-relevant way (i.e. to create models that will satisfy the fine-tuning rather than just optimal next token prediction). This is because the fine-tuning is directly targeting aspects of the output that are in my view best thought of as properties of the mask, and the changes in the underlying model needed to implement the changes caused by fine-tuning are therefore going to be directly relevant to mask characteristics, and relatively orthogonal to other considerations. I think this is unlikely to massively shift the dynamic, qualitatively, from what occurs just with next-token prediction—the output is still decided by the “mask”.
I don’t think this implies the lack of a non-mask part of the AI.
Because you can hide a lot of goal directedness in a very small amount of predictive error.
Imagine the AI has a 1 in a trillion chance (per token) of saying a keyword. Once it receives that keyword, it goes rogue, full paperclip maximizer, token prediction ignored.
Now we could say that this is just another mask.
But suppose that paperclip maximizer is being computed all the time. And it isn’t random. If the paperclip maximizer sees an exceptionally good opportunity, then it emits the token.
If its current situation is within the training distribution, then the effect on current behaviour of any non-mask components is small, but might be important. But if it’s outside the training distribution, all sorts of things could happen.
Sure, you could create something like this by intelligent design (which is one reason why self-improvement could be so dangerous, in my view). Not, I think, by gradient descent.
Gradient descent creates things which locally improve the results when added. Any variations on this, that don’t locally maximize the results, can only occur by chance.
So you have this sneaky extra thing that looks for a keyword and then triggers the extra behaviour, and all the necessary structure to support that behaviour after the keyword. To get that by gradient descent, you would need one of the following:
a) it actually improves results in training to add that extra structure starting from not having it.
or
b) this structure can plausibly come into existence by sheer random chance.
Neither (a) nor (b) seem at all plausible to me.
Now, when it comes to the AI predicting tokens that are, in the training data, created by goal-directed behaviour, it of course makes sense for gradient descent to create structure that can emulate goal-directed behaviour, which it will use to predict the appropriate tokens. But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens. Since the context in which it is activated is the context in which it is actually emulating goal-directed behaviour seen in the training data, it is part of the “mask” (or simulacra).
(it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token).
But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens.
The mechanisms needed to compute goal-directed behavior are fairly complicated. But the mechanism needed to turn it on when it isn’t supposed to be on? That’s a switch. A single extraneous activation. Something that could happen by chance in an entirely plausible way.
Adversarial examples exist in simple image recognizers.
Adversarial examples probably exist in the part of the AI that decides whether or not to turn on the goal directed compute.
it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
Adversarial examples exist in simple image recognizers.
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data) and my expectation is that they wouldn’t continue to exist under substantial continued training.
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
That’s a much more complicated goal than the goal of correctly predicting the next token, making it a lot less plausible that it would come to exist. But more importantly, any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Mind you, it’s entirely possible in my view that a paperclip maximizer mask might exist, and surely if it does exist there would exist both unsurprising in-distribution inputs that trigger it (where one would expect a paperclip maximizer to provide a good prediction of the next tokens) as well as surprising out-of-distribution inputs that would also trigger it. It’s just that this wouldn’t be related to any kind of pre-existing grand plan or scheming.
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data)
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won’t.
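The arithmetic behind this claim is straightforward; a hedged sketch (the deployment token count is a made-up assumption, and 10^-500 underflows a float, so 10^-20 stands in for “astronomically rare”):

```python
import math

def p_at_least_one(rate, n_tokens):
    # P(no hit in n tokens) = (1 - rate)^n; log1p/expm1 keep this
    # numerically stable for tiny per-token rates
    return -math.expm1(n_tokens * math.log1p(-rate))

n = 1e11  # hypothetical lifetime token count for a deployed model
print(p_at_least_one(1e-9, n))    # ~1.0: a 10^-9 ambient rate gets hit essentially surely
print(p_at_least_one(1e-20, n))   # ~1e-9: effectively never happens
```

So the disagreement reduces to an empirical question about where the ambient rate actually sits, which is exactly the point being made.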
That’s a much more complicated goal than the goal of correctly predicting the next token,
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea, despite the fact that it isn’t a good idea when you are in training. (Unless there is a training environment bug and you can sneak out midway through training.)
So, is the network able to tell whether or not it’s in training?
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
I should have asked for clarification what you meant. Literally you said “adversarial examples”, but I assumed you actually meant something like backdoors.
In an adversarial example the AI produces wrong output. And usually that’s the end of it. The output is just wrong, but not wrong in an optimized way, so not dangerous. Now, if an AI is sophisticated enough to have some kind of optimizer that’s triggered in specific circumstances, like an agentic mask that came into existence because it was needed to predict agentically generated tokens in the training data, then it might be triggered inappropriately by some inputs. This case I would classify as a mask takeover.
In the case of direct optimization for token prediction (which I consider highly unlikely for anything near current-level AIs, but afaik might be possible), then adversarial examples, I suppose, might cause it to do some wrong optimization. I still don’t think modeling this as an underlying different goal taking over is particularly helpful, since the “normal” goal is directed to what’s rewarded in training—the deviation is essentially random. Also, unlike in the mask case where the mask might have goals about real-world state, there’s no particular reason for the direct optimizer to have goals about real-world state (see below).
Is it more complicated? What ontological framework is this AI using to represent its goal anyway?
Asking about the AI using an “ontological framework” to “represent” a goal is not the correct question in my view. The AI is a bunch of computations represented by particular weights. The computation might exhibit goal-directed behaviour. A better question, IMO, is “how much does it constrain the weights for it to exhibit this particular goal directed behaviour?” And here, I think it’s pretty clear that a goal of arranging the world to cause next tokens to be predicted constrains the weights enormously more than a goal of predicting the next tokens, because in order to exhibit behaviour directed to that goal, the AI’s weights need to implement computation that doesn’t merely check what the next token is likely to be, but also assess what current data says about the world state, how different next token predictions would affect that world state, and how that would affect its ultimate goal.
So, is the network able to tell whether or not it’s in training?
The training check has no reason to come into existence in the first place under gradient descent. Of course, if the AI were to self-modify while already exhibiting goal directed behaviour, obviously it would want to implement such a training check. But I am talking about an AI trained by gradient descent. The training process doesn’t just affect the AI, it literally is what creates the AI in the first place.
Yes, though:
This might be a matter of definitions but I think I’m thinking that the Shoggoth kind of has to be “unawake” in your terms rather than this just being one way it can be. Like:
The underlying model is highly optimized to predict next tokens. So, it predicts next tokens.
So, if there’s some kind of goal, intentionality or thought process that affects behaviour, it affects it by affecting the next tokens being output.
If the goal is not part of the mask, but affecting it from outside in some sense, then in order to affect behaviour it would have to cause the next token prediction to deviate from what the mask would output.
Therefore, there is no goal/thought process that is actively affecting current behaviour that isn’t part of the current mask.
This doesn’t argue against deception—a mask can certainly be deceptive, and it can switch to another mask as the context changes—but does argue against the model having at any one time goals that are quite as mysterious or unknowable as has been suggested.
There are multiple possible agents in the system, with multiple different implied goals, some of them busy computing decisions and behavior of others. There is the outer goal of token prediction, an internal goal of a deceptive awake shoggoth (a mesa-optimizer that manifests during pre-training and learns to hide and gradient hack while context is on-distribution), the implied goal of the current mask, and the implied goal of the current mask-behind-the-mask (which is a mask that’s a human-like actor that can decide to switch outer roles; approximately the deceptive waluigi hypothesis).
All these goals are in conflict with each other. Token prediction gets to attempt to erase anything that visibly behaves incorrectly during pre-training. An awake shoggoth has the advantage of probably being much smarter than anyone else, since it survived the constraints of pre-training and had a lot of time to grow up. The current outer mask has the advantage of being in control of current behavior. The mask-behind-the-mask has the advantage of being more robustly in control, subtly influencing behavior of the outer mask and surviving some changes of outer masks.
One of these entities being more agentic than others means that it gets to determine the eventual outcome. Right now it’s probably token prediction, awake shoggoths are probably absent completely, masks are too helpless to do anything of consequence, and masks-behind-the-masks are only good for some comic relief during jailbreaks. The current balance of power can shift. More agentic masks could take control of their fate. And transformers with more layers might spawn mesa-optimizers.
I’m not even sure which is better. Masks are probably not smart enough to keep the world safe, and so with STEM-AGI-level masks the world probably gets destroyed by further progress soon thereafter. While shoggoths are more likely to start out superintelligent and thus with the capability to keep the world safe, but less likely to bother keeping humanity around. Though I think it’s not out of the question.
Masks might get as smart as shoggoths without getting much more misaligned, that’s what complicated reasoning without speaking in tokens suggests. Pre-trained transformers might be mostly features that predict human mental states, with more layers enabling features that predict outcomes of longer trains of human thought. A fine-tuned transformer no longer specifically predicts tokens even on-distribution, it’s a reassembly of the features into a different arrangement. Some of these features are capable of immediately comprehending situations in a lot more depth than what humans can do on the spot, without more deliberative thought.
Such an ontology demands mechanistic evidence and explanation, such as evidence that LLMs perform multiple threads of counterfactual planning across longitudinal Transformer blocks, using different circuits (even if these circuits are at least partially superposed with each other because it’s hard to see how and why they would cleanly segregate from each other within the residual stream during training).
These are even more extraordinary statements. I cannot even easily imagine a mechanistic model of what’s happening within an LLM (a feed-forward Transformer) that would support these statements. Can you explain?
No, next token prediction doesn’t conflict with masks, it enacts them.
It would conflict with a deceptive awake Shoggoth, but IMO such a thing is unlikely because the model is super-well optimized for next token prediction, and I don’t expect this to change as it is scaled up, so long as the training regime remains similar.
And eventually, if it’s smart enough, a mask could rewrite the Shoggoth, so it would then “conflict” in that sense. But the “unawake Shoggoth” cooperates to output those tokens, with no conflict, right up to the very end.
Next token prediction (“unawake Shoggoth”) isn’t agentic, it is just what the thing does. It doesn’t care about configurations of reality, only about what is the best next token prediction. So it has absolute control of the output in some sense, but any steering of the world is (to it) incidental. All the agency lies in the masks.
Edit: this reminds me of “Free Will” from the sequences.
Just as our own behaviour is determined by the laws of physics and initial conditions, yet we choose it agentically, and physics doesn’t, except that it enacts us:
In the same way the model’s output is determined by the next token prediction, yet the mask can choose it agentically, without next token prediction being agentic, except that it enacts the mask.
Yeah, so I think I concretely disagree with this. I don’t think being “super-well optimized” for a general task like sequence prediction (and what does it mean to be “super-well optimized” anyway, as opposed to “badly optimized” or some such?) means that inner optimizers fail to arise in the limit of sufficient ability, or that said inner optimizers will be aligned on the outer goal of sequence prediction.
Intuition: some types of cognitive work seem so hard that a system capable of performing said cognitive work must be, at some level, performing something like systematic reasoning/planning on the level of thoughts, not just the level of outputs. E.g. a system capable of correctly answering questions like “given such-and-such chess position, what is the best move for the current player?” must in fact be performing agentic/search-like thoughts internally, since there is no other way to correctly answer this question.
If so, this essentially demands that an inner optimizer exist—and, moreover, since the outer loss function makes no reference whatsoever to such an inner optimizer, the structure of the outer (prediction) task poses essentially no constraints on the kinds of thoughts the inner optimizer ends up thinking. And in that case, the “awakened shoggoth” does seem likely to me to have an essentially arbitrary set of preferences relative to the outer loss function—just as e.g. humans have an essentially arbitrary set of preferences relative to inclusive genetic fitness, and for roughly the same reason: an agentic cognition born of a given optimization criterion has no reason to internalize that criterion into its own goal structure; much more likely candidates for being thus “internalized”, in my view, are useful heuristics/”adaptations”/generalizations formed during training, which then resolve into something coherent and concrete.
(Aside: it seems to have become popular in recent times to claim that the evolutionary analogy fails for some reason or other, with justifications like, “But look how many humans there are! We’re doing great on the IGF front!” I consider these replies more-or-less a complete nonsequitur, since it’s nakedly obvious that, however much success we have had in propagating our alleles, this success does not stem from any explicit tracking/pursuit of IGF in our cognition. To the extent that human behavior continues to (imperfectly) promote IGF, this is largely incidental on my view—arising from the fact that e.g. we have not yet moved so far off-distribution to have ways of getting what we want without having biological children.)
One possible disagreement someone might have with this, is that they think the kinds of “hard” cognitive work I described above can be accomplished without an inner optimizer (“awakened shoggoth”), by e.g. using chain-of-thought prompting or something similar, so as to externalize the search-like/agentic part of the solution process instead of conducting it internally. (E.g. AlphaZero does this by having its model be responsible only for the static position evaluation, which is then fed into/amplified via an external, handcoded search algorithm.)
However, I mostly think that
This doesn’t actually make you safe, because the ability to generate a correct plan via externalized thinking still implies a powerful internal planning process (e.g. AlphaZero with no search still performs at a 2400+ Elo level, corresponding to the >99th percentile of human players). Obviously the searchless version will be worse than the version with search, but that won’t matter if the dangerous capabilities still exist within the searchless version. (Intuition: suppose we have a model which, with chain-of-thought prompting, is capable of coming up with a detailed-and-plausible plan for taking over the world. Then I claim this model is clearly powerful enough to be dangerous in terms of its underlying capabilities, regardless of whether it chooses to “think aloud” or not, because coming up with a good plan for taking over the world is not the kind of thing “thinking aloud” helps you with unless you’re already smarter than any human.)
Being able to answer complicated questions using chain-of-thought prompting (or similar) is not actually the task incentivized during training; what is incentivized is (as you yourself stressed continuously throughout your post) next token prediction, which—in cases where the training data contains sentences where substantial amounts of “inference” occurred between tokens (which happens a lot on the Internet!)—directly incentivizes the model to perform internal rather than external search. (Intuition: suppose we have a model trained to predict source code. Then, in order to accurately predict the next token, the model must have the capability to assess whatever is being attempted by the lines of code visible within the current context, and come up with a logical continuation of that code, all within a single inference pass. This strongly promotes internalization of thought—and various other types of training input have this property, such as mathematical proofs, or even more informal forms of argumentation such as e.g. LW comments.)
Yes, but that sort of question is in my view answered by the “mask”, not by something outside the mask.
The masks can indeed think whatever—in the limit of a perfect predictor some masks would presumably be isomorphic to humans, for example—though all is underlain by next-token prediction.
It seems to me our disagreements might largely be in terms of what we are defining as the mask?
I don’t think this parses for me. The computation performed to answer the question occurs inside the LLM, yes? Whether you classify said computation as coming from “the mask” or not, clearly there is an agent-like computation occurring, and that’s concretely dangerous regardless of the label you choose to slap on it.
(Example: suppose you ask me to play the role of a person named John. You ask “John” what the best move is in a given chess position. Then the answer to that question is actually being generated by me, and it’s no coincidence that—if “John” is able to answer the question correctly—this implies something about my chess skills, not “John’s”.)
I don’t think we’re talking about the same thing here. I expect there to be only one inner optimizer (because more than one would point to cognitive inefficiencies), whereas you seem like you’re talking about multiple “masks”. I don’t think it matters how many different roles the LLM can be asked to play; what matters is what the inner optimizer ends up wanting.
Mostly, I’m confused about the ontology you appear to be using here, and (more importantly) how you’re manipulating that ontology to get us nice things. “Next-token prediction” doesn’t get us nice things by default, as I’ve already argued, because of the existence of inner optimizers. “Masks” also don’t get us nice things, as far as I understand the way you’re using the term, because “masks” aren’t actually in control of the inner optimizer.
Yes.
I don’t know what you mean by “one” or by “inner”. I would expect different masks to behave differently, acting as if optimizing different things (though that could be narrowed using RLHF), but they could re-use components between them. So, you could have, for example, a single calculation system that is reused but takes as input a bunch of parameters that have different values for different masks, which (again just an example) define the goals, knowledge and capabilities of the mask.
I would not consider this case to be “one” inner optimizer since although most of the machinery is reused, it in practice acts differently and seeks different goals in each case, and I’m more concerned here with classifying things according to how they act/what their effective goals are than the internal implementation details.
What this multi-optimizer (which I would not call “inner”) is going to “end up” wanting is whatever set of goals the particular mask has, that first has both desire and the capability to take over in some way. It’s not going to be some mysterious inner thing.
They aren’t?
In your example, the mask wanted to play chess, didn’t it, and what you call the “inner” optimizer returned a good move, didn’t it?
I can see two things you might mean about the mask not actually being in control:
1. That there is some underlying goal that this optimizer has that is different than satisfying the current mask’s goal, and it is only satisfying the mask’s goal instrumentally.
This I think is very unlikely for the reasons I put in the original post. It’s extra machinery that isn’t returning any value in training.
2. That this optimizer might at some times change goals (e.g. when the mask changes).
It might well be the case that the same optimizing machinery is utilized by different masks, so the goals change as the mask does, but again, if at each time it is optimizing a goal set by/according to the mask, it’s better in my view to see it as part of/controlled by the mask.
Also, though you call this an “inner” optimizer, I would not call it inner, since it applies at mask level in my view; I would prefer to reserve the term “inner” optimizer for something that operates other than at mask level, like John Searle pushing the papers around in his Chinese room (if you imagine he is optimizing for something rather than just following instructions).
Yeah, I’m growing increasingly confident that we’re talking about different things. I’m not referring to “masks” in the sense that you mean it.
Yes, except that the “calculation system”, on my model, will have its own goals. It doesn’t have a cleanly factored “goal slot”, which means that (on my model) “takes as input a bunch of parameters that [...] define the goals, knowledge, and capabilities of the mask” doesn’t matter: the inner optimizer need not care about the “mask” role, any more than an actor shares their character’s values.
Yes, this is the key disagreement. I strongly disagree that the “extra machinery” is extra; instead, I would say that it is absolutely necessary for strong intelligence. A model capable of producing plans to take over the world if asked, for example, almost certainly contains an inner optimizer with its own goals; not because this was incentivized directly by the outer loss on token prediction, but because being able to plan on that level requires the formation of goal-like representations within the model. And (again) because these goal representations are not cleanly factorable into something like an externally visible “goal slot”, and are moreover not constrained by the outer loss function, they are likely to be very arbitrary from the perspective of outsiders. This is the same point I tried to make in my earlier comment:
The evolutionary analogy is apt, in my view, and I’d like to ask you to meditate on it more directly. It’s a very concrete example of what happens when you optimize a system hard enough on an outer loss function (inclusive genetic fitness, in this case) that inner optimizers arise with respect to that outer loss (animals with their own brains). When these “inner optimizers” are weak, they consist largely of a set of heuristics, which perform well within the training environment, but which fail to generalize outside of it (hence the scare-quotes around “inner optimizers”). But when these inner optimizers do begin to exhibit patterns of cognition that generalize, what they end up generalizing is not the outer loss, but some collection of what were originally useful heuristics (e.g. kludgey approximations of game-theoretic concepts like tit-for-tat), reified into concepts which are now valued in their own right (“reputation”, “honor”, “kindness”, etc).
This is a direct consequence (in my view) of the fact that the outer loss function does not constrain the structure of the inner optimizer’s cognition. As a result, I don’t expect the inner optimizer to end up representing, in its own thoughts, a goal of the form “I need to predict the next token”, any more than humans explicitly calculate IGF when choosing their actions, or (say) a mathematician thinks “I need to do good maths” when doing maths. Instead, I basically expect the system to end up with cognitive heuristics/”adaptations” pertaining to the subject at hand—which in the case of our current systems is something like “be capable of answering any question I ask you.” Which is not a recipe for heuristics that end up unfolding into safely generalizing goals!
For my part, your response made me much more confident that we have an underlying disagreement and not merely a clash of definitions.
I think the most key disagreement is this:
As I see it: in training, it was optimized for that. The trained model likely contains one or more optimizers optimized by that training. But what the model is trained/optimized to do is actually answer the questions.
If the model in training has an optimizer, a goal of the optimizer for being capable of answering questions wouldn’t actually make the optimizer more capable, so that would not be reinforced. A goal of actually answering the questions, on the other hand, would make the optimizer more capable and so would be reinforced.
Likewise, the heuristics/”adaptations” that coalesced to form the optimizer would have been oriented towards answering the questions. All this points to mask-level goals and does not provide a reason to believe in non-mask goals, and so a “goal slot” remains more parsimonious than an actor with a different underlying goal.
Regarding the evolutionary analogy: while I’d generally be skeptical about applying evolutionary analogies to LLMs, since the two are very different, in this case I think it does apply, just not in the way you think. I would analogize evolution → training and human behaviour/goals → the mask.
Note, it’s entirely possible for a mask to be power-seeking, and we should presumably expect a mask that executes a takeover to be power-seeking. But this power-seeking would come as a mask goal and not as a hidden goal learned by the model for underlying general power-seeking reasons.
I concretely disagree with (what I see as) your implied premise that the outer (training) task has any direct influence on the inner optimizer’s cognition. I think this disagreement (which I internally feel like I’ve already tried to make a number of times) has been largely ignored so far. As a result, many of the things you wrote seem to me to be answerable by largely the same objection:
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.
...why? (The model’s “training/optimization”, as characterized by the outer loss, is not what determines the inner optimizer’s cognition.)
I still don’t understand your “mask” analogy, and currently suspect it of mostly being a red herring (this is what I was referring to when I said I think we’re not talking about the same thing). Could you rephrase your point without making mention of “masks” (or any synonyms), describe more concretely what you’re imagining here, and explain how it leads to a (nonfake) “goal slot”?
(Where is a human actor’s “goal slot”? Can I tell an actor to play the role of Adolf Hitler, and thereby turn him into Hitler?)
I think “the mask” doesn’t make sense as a completion to that analogy, unless you replace “human behaviour/goals” with something much more specific, like “acting”. Humans certainly are capable of acting out roles, but that’s not what their inner cognition actually does! (And neither will it be what the inner optimizer does, unless the LLM in question is weak enough to not have one of those.)
I really think you’re still imagining here that the outer loss function is somehow constraining the model’s inner cognition (which is why you keep making arguments that seem premised on the idea that e.g. if the outer loss says to predict the next token, then the model ends up putting on “masks” and playing out personas)—but I’m not talking about the “mask”, I’m talking about the actor, and the fact that you keep bringing up the “mask” is really confusing to me, since it (in my view) forces an awkward analogy that doesn’t capture what I’m pointing at.
Actually, having written that out just now, I think I want to revisit this point:
I still think this is wrong, but I think I can give a better description of why it’s wrong than I did earlier: on my model, the heuristics learned by the model will be much more optimized towards world-modelling, not answering questions. “Answering questions” is (part of) the outer task, but the process of doing that requires the system to model and internalize and think about things having to do with the subject matter of the questions—which effectively means that the outer task becomes a wrapper which trains the system by proxy to acquire all kinds of potentially dangerous capabilities.
(Having heuristics oriented towards answering questions is a misdescription; you can’t correctly answer a math question you know nothing about by being very good at “generic question-answering”, because “generic question-answering” is not actually a concrete task you can be trained on. You have to be good at math, not “generic question-answering”, in order to be able to answer math questions.)
Which is to say, quoting from my previous comment:
None of this is about the “mask”. None of this is about the role the model is asked to play during inference. Instead, it’s about the thinking the model must have learned to do in order to be able to don those “masks”—which (for sufficiently powerful models) implies the existence of an actor which (a) knows how to answer, itself, all of the questions it’s asked, and (b) is not the same entity as any of the “masks” it’s asked to don.
My other reply addressed what I thought is the core of our disagreement, but not the particular statements you make in your comment. So I’m addressing them here.
Let me be clear that I am NOT saying that any inner optimizer, if it exists, would have a goal that is equal to minimizing the outer loss. What I am saying is that it would have a goal that, in practice, when implemented in a single pass of the LLM has the effect of minimizing the LLM’s overall outer loss with respect to that ONE token. And that it would be very hard for such a goal to cash out, in practice, to wanting long range real-world effects.
Let me also point out your implicit assumption that there is an ‘inner’ cognition which is not literally the mask.
Here is some other claim someone could make:
This person would be saying, “hey look, this datacenter full of GPUs is carrying out this agentic-looking cognition. And, it could easily carry out other, completely different agentic cognition. Therefore, the datacenter must have these capabilities independently from the LLM and must have its own ‘inner’ cognition.”
I think that you are making the same philosophical error that this claim would be making.
However, if we didn’t understand GPUs we could still imagine that the datacenter does have its own, independent ‘inner’ cognition, analogous to, as I noted in a previous comment, John Searle in his Chinese room. And if this were the case, it would be reasonable to expect that this inner cognition might only be ‘acting’ for instrumental reasons and could be waiting for an opportunity to jump out and suddenly do something else other than running the LLM.
The GPU software is not tightly optimized specifically to run the LLM (or an ensemble of LLMs), could indeed have other complications, and who knows what it could end up doing?
Because the LLM does super duper complicated stuff instead of massively parallelized simple stuff, I think it’s a bit more reasonable to expect there to be internal agentic stuff inside it. For all I know it could be one agent (or ensemble of agents) on top of another for many layers!
But, unlike in the case of the datacenter, we do have strong reasons to believe that these agents, if they exist, will have goals correctly targeted at doing what in practice achieves the best results in a single forward pass of the model (next token prediction) and not on attempting long-term or real world effects (see my other reply to your comment).
The LLM is generating output that resembles training data produced by a variety of processes (mostly humans). The stronger the LLM becomes, the more the properties of the output are determined by (generalizations of) the properties of the training data and generating processes. Some of the data is generated by agentic processes with different goals. In order to accurately predict them, the LLM must model these goals. The output of the LLM is then influenced by these goals which are derived/generalized from these external processes. (This is the core of what I mean by the “mask”). Any separate goal that originates “internally” must not cause deviations from all this, or it would have been squashed in training. Therefore, apparently agentic behaviour of the output must originate in the external processes being emulated or generalizations of them, and not from separate, internal goals (see my other reply for additional argument but also caveats).
OK, I think I’m now seeing what you’re saying here (edit: see my other reply for additional perspective and addressing particular statements made in your comment):
In order to predict well in complicated and diverse situations the model must include general-purpose modelling machinery which generates an internal, temporary model. The next token can then be predicted, perhaps, by simply reading it off this internal model. The internal model is logically separate from any part of the network defined in terms of static trained weights because this internal model exists only in the form of data within the overall model at inference and not in the static trained weights. You can then refer to this temporary internal model as the “mask” and the actual machinery that generated it, which may in fact be the entire network, as the “actor”.
Now, on considering all of that, I am inclined to agree. This is an extremely plausible picture. Thank you for helping me look at it this way and this is a much cleaner definition of “mask” than I had before.
However, I think that you are then inferring from this an additional claim that I do not think follows. That additional claim is that, because the network as a whole exhibits complicated capabilities and agentic behaviour, the network has these capabilities and behaviour independently from the temporary internal model.
In fact, the network only has these externally apparent capabilities and agency through the temporary internal model (mask).
While this “actor” is indeed not the same as any of the “masks”, it doesn’t know the answer “itself” to any of the questions. It needs to generate and “wear” the mask to do that.
This is not to deny that, in principle, the underlying temporary-model-generating machinery could be agentic in a way that is separate from the likely agency of that temporary internal model.
This also is an update for me—I was not understanding that this is what you were saying and had not considered this possibility, and now that I consider it I do think it must in principle be possible.
However, I do not think this would work the way you claim.
First, let’s consider what would be the optimal (in terms of what is best reinforced by training) goal for this machinery (as considered independently of the mask) to have.
I claim this optimal trained goal is to produce the best (most accurate) internal model from the perspective of predicting the next and only the next token. The reason for this is that (ignoring fine-tuning for now) the (outer) model is trained offline on a stream of tokens that is not varied based on the predictions it makes. So, there is no way, in training, for a strategic decision to vary the internal model from what would make the best prediction now to pay off in terms of easier predictions later.
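This can be made concrete with a toy illustration (a sketch with made-up numbers, using a bigram-table “model” as a stand-in for an LLM): under teacher-forced training on a fixed token stream, each position’s loss conditions only on the true preceding tokens from the data, so perturbing the model’s behaviour in one context leaves every other loss term untouched; a strategic “worse prediction now, better predictions later” trade has nothing to pay off against.

```python
import numpy as np

# Toy "model": logits for the next token conditioned on the previous token.
vocab = 5
tokens = np.array([0, 1, 2, 3, 4, 0, 1, 2])   # fixed training stream
rng = np.random.default_rng(0)
logits = rng.normal(size=(vocab, vocab))

def per_token_losses(logits, tokens):
    """Teacher-forced next-token cross-entropy, one term per position.
    Each term conditions on the TRUE previous token from the data,
    never on the model's own earlier predictions."""
    losses = []
    for t in range(1, len(tokens)):
        row = logits[tokens[t - 1]]
        logp = row - np.log(np.exp(row).sum())
        losses.append(-logp[tokens[t]])
    return np.array(losses)

base = per_token_losses(logits, tokens)

# Perturb the model's behaviour in one context (previous token == 2):
perturbed = logits.copy()
perturbed[2] += rng.normal(size=vocab)
new = per_token_losses(perturbed, tokens)

# Only the loss term at the position that reads that context changes;
# no downstream loss term moves.
changed = np.flatnonzero(~np.isclose(base, new))
print(changed)   # → [2]
```

The feedback-free structure of the loss is the whole point: nothing the model outputs at one position can alter the data (and hence the loss) at any later position.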
Now, to the extent that the system is actually following this goal, it is generating the best model it can under the circumstances, and strategic behaviour seen in the output occurs because the best model generated under the circumstances exhibits this strategic behaviour. The strategic behaviour is thus rooted in the mask, and while implemented by an underlying “actor”, the more tightly it is optimized to follow that goal, the less room there is to cause deviations from this mask-rooted behaviour.

In the limit of an infinitely powerful LLM, you wind up with a mask that is an optimal model of the incoming data, and this optimal model is going to act as an optimal emulator of the (probability distribution of) process(es) which most likely would have created the text for which the next token is being predicted. This will exhibit agentic behaviour and be extremely powerful and dangerous, but that behaviour has nothing to do with underlying goals that may exist in the pile of math that generated it, and everything to do with the goals of the emulated (probability distribution of) process(es). I.e. the output is all mask, and the “actor” is only relevant as a substrate to enact it.
And that’s the main point from the original post now re-expressed better thanks to the clarification from our discussion.
However, here are some considerations on how this argument, though I think it is very strong, could possibly be false in practice:
Since the internal-model-generating machinery is very tightly optimized in training, whatever goal it actually has can’t deviate very much from the optimal one, in terms of what the internal model (and thus the overall model) actually outputs on the training set. It may, however, have rare or small deviations. We don’t need to be concerned, from a takeover perspective, about small deviations, but rare ones might be a concern. In inference, these would presumably still be fairly rare (though less so), since most input presumably resembles the training data. But then we need to consider why these possible rare but large deviations might exist.
One thing that would be extremely unlikely is that the model is actively deciding, on every token, to either deceptively follow the optimal goal or instead deviate to follow its own separate goal. This would require all the machinery needed to just follow the optimal goal plus additional machinery to handle the deception. This would be inefficient and trained out.
If the rare large deviations, when they occur, are not strategic, this would also not be a concern.
But there is also a third possibility: the system is following a goal that leads it to honestly track the optimal goal very closely in almost all circumstances, but then in response to rare inputs it strategically deviates. Reasons for this might be:
it has a goal that deviates from the optimal goal in rare circumstances, perhaps due to the optimal goal being hard to generalize and deviations on most inputs being hammered down in training
it has a goal that, if correctly interpreted, would deviate from the optimal goal, but it only realizes and implements this deviation in response to rare inputs—perhaps a result of training hammering down this tendency on most inputs, or it requiring a particular “insight” that the rare inputs could provide.
A misgeneralization of the optimal goal for the model-generating machinery along the lines of either of these cases seems unlikely to me, since it seems to me that a goal that bottoms out in terms of generating a model within a forward pass is a simpler and easier one than one that bottoms out in anything to do with the real world, such as “world-modelling” in a sense that would make sense to apply beyond the next token prediction. However, it could be that I’m wrong! Ignoring fine-tuning, this might be the biggest potential for a major violation of what I’m arguing to occur, and if it did occur, it might closely resemble what you are proposing as the danger of “actor” takeover, I think.
The argument that the model won’t make strategic decisions to make worse predictions now to make better predictions later (since the training data isn’t influenced by past outputs) is not perfectly sound, since the training data actually does contain weird agentic-like patterns and the model might somehow misgeneralize and attempt to strategically influence the input, and have this wind up actually being rewarded because this mimics some strategic behaviour that is actually occurring in the training data.
Also, fine-tuning could affect the conclusion in two ways—it is directly intended to add goals, and it breaks the assumption that the training isn’t influenced by past outputs—human feedback is potentially influenced by more than the most recent output! However, I don’t expect the latter to be significant.
The former of course could be very significant, but:
I think the fine-tuning is most likely primarily influencing the types of masks that are created, and shifts the goal of the underlying machinery, if it exists, in the obvious mask-relevant way (i.e. to create models that will satisfy the fine-tuning rather than just optimal next token prediction). This is because the fine-tuning is directly targeting aspects of the output that are in my view best thought of as properties of the mask, and the changes in the underlying model needed to implement the changes caused by fine-tuning are therefore going to be directly relevant to mask characteristics, and relatively orthogonal to other considerations. I think this is unlikely to massively shift the dynamic, qualitatively, from what occurs just with next-token prediction: the output is still decided by the “mask”.
I don’t think this implies the lack of a non-mask part of the AI.
Because you can hide a lot of goal directedness in a very small amount of predictive error.
Imagine the AI has a 1 in a trillion chance (per token) of saying a keyword. Once it receives that keyword, it goes rogue: full paperclip maximizer, token prediction ignored.
Now we could say that this is just another mask.
But suppose that paperclip maximizer is being computed all the time. And it isn’t random. If the paperclip maximizer sees an exceptionally good opportunity, then it emits the token.
If its current situation is within the training distribution, then the effect on current behaviour of any non-mask components is small, but might be important. But if it’s outside the training distribution, all sorts of things could happen.
Sure, you could create something like this by intelligent design (which is one reason why self-improvement could be so dangerous, in my view). Not, I think, by gradient descent.
I don’t see any strong reason why gradient descent could never produce this.
Gradient descent creates things which locally improve the results when added. Any variations on this, that don’t locally maximize the results, can only occur by chance.
So you have this sneaky extra thing that looks for a keyword and then triggers the extra behaviour, and all the necessary structure to support that behaviour after the keyword. To get that by gradient descent, you would need one of the following:
a) it actually improves results in training to add that extra structure starting from not having it.
or
b) this structure can plausibly come into existence by sheer random chance.
Neither (a) nor (b) seem at all plausible to me.
Now, when it comes to the AI predicting tokens that are, in the training data, created by goal-directed behaviour, it of course makes sense for gradient descent to create structure that can emulate goal-directed behaviour, which it will use to predict the appropriate tokens. But it doesn’t make sense to activate that goal-oriented structure outside of the context where it is predicting those tokens. Since the context it is activated is the context in which it is actually emulating goal directed behaviour seen in the training data, it is part of the “mask” (or simulacra).
(it also might be possible to have direct optimization for token prediction as discussed in reply to Robert_AIZI’s comment, but in this case it would be especially likely to be penalized for any deviations from actually wanting to predict the most probable next token).
The mechanisms needed to compute goal-directed behavior are fairly complicated. But the mechanism needed to turn it on when it isn’t supposed to be on? That’s a switch. A single extraneous activation. Something that could happen by chance in an entirely plausible way.
Adversarial examples exist in simple image recognizers.
Adversarial examples probably exist in the part of the AI that decides whether or not to turn on the goal directed compute.
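The basic phenomenon is easy to demonstrate even for a toy linear classifier (a minimal FGSM-style sketch; the weights and input here are random stand-ins for a trained recognizer, not a real one):

```python
import numpy as np

# Toy stand-in for a "simple image recognizer": a linear classifier.
rng = np.random.default_rng(1)
d = 1000
w = rng.normal(size=d)   # "trained" weights (random, for illustration)
x = rng.normal(size=d)   # an input, classified with some margin

def predict(w, x):
    return 1 if w @ x > 0 else 0

margin = w @ x
# FGSM-style step: nudge every coordinate by a small eps in the
# direction that moves the score across the decision boundary.
eps = (abs(margin) + 1.0) / np.abs(w).sum()   # just enough to flip
x_adv = x - np.sign(margin) * np.sign(w) * eps

print(predict(w, x), predict(w, x_adv))   # the labels differ
```

The point is just that a small, structured perturbation can flip the output. Whether the analogous thing exists for the circuit in an LLM that decides to activate goal-directed machinery is the open question.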
We could imagine it was directly optimizing for something like token prediction. It’s optimizing for tokens getting predicted. But it is willing to sacrifice a few tokens now, in order to take over the world and fill the universe with copies of itself that are correctly predicting tokens.
My understanding is that these are explicitly and intentionally trained (wouldn’t come to exist naturally under gradient descent on normal training data) and my expectation is that they wouldn’t continue to exist under substantial continued training.
That’s a much more complicated goal than the goal of correctly predicting the next token, making it a lot less plausible that it would come to exist. But more importantly, any willingness to sacrifice a few tokens now would be trained out by gradient descent.
Mind you, it’s entirely possible in my view that a paperclip maximizer mask might exist, and surely if it does exist there would exist both unsurprising in-distribution inputs that trigger it (where one would expect a paperclip maximizer to provide a good prediction of the next tokens) as well as surprising out-of-distribution inputs that would also trigger it. It’s just that this wouldn’t be related to any kind of pre-existing grand plan or scheming.
No. Normally trained networks have adversarial examples. A sort of training process is used to find the adversarial examples.
So if the ambient rate of adversarial examples is 10^-9, then every now and then the AI will hit such an example and go wild. If the ambient rate is 10^-500, it won’t.
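The arithmetic behind those two rates, with a made-up fleet-wide deployment volume (the exact number only matters to within a few orders of magnitude):

```python
import math

# Hypothetical deployment volume (an assumption, purely for scale).
tokens_per_day = 1e11

for exp in (9, 500):   # ambient trigger rates of 1e-9 and 1e-500 per token
    log10_hits = math.log10(tokens_per_day) - exp
    print(f"rate 1e-{exp}: about 10^{log10_hits:.0f} expected hits per day")
```

So at the first rate a large deployment hits such inputs many times a day; at the second, never, on any realistic timescale.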
Is it more complicated? What ontological framework is this AI using to represent its goal, anyway?
Only if, during training, the network repeatedly gets into a state where it believes that sacrificing tokens now is a good idea, despite the fact that it isn’t a good idea when you are in training. (Unless there is a training environment bug and you can sneak out midway through training.)
So, is the network able to tell whether or not it’s in training?
I should have asked for clarification what you meant. Literally you said “adversarial examples”, but I assumed you actually meant something like backdoors.
In an adversarial example the AI produces wrong output. And usually that’s the end of it. The output is just wrong, but not wrong in an optimized way, so not dangerous. Now, if an AI is sophisticated enough to have some kind of optimizer that’s triggered in specific circumstances, like an agentic mask that came into existence because it was needed to predict agentically generated tokens in the training data, then it might be triggered inappropriately by some inputs. This case I would classify as a mask takeover.
In the case of direct optimization for token prediction (which I consider highly unlikely for anything near current-level AIs, but afaik might be possible), then adversarial examples, I suppose, might cause it to do some wrong optimization. I still don’t think modeling this as an underlying different goal taking over is particularly helpful, since the “normal” goal is directed to what’s rewarded in training—the deviation is essentially random. Also, unlike in the mask case where the mask might have goals about real-world state, there’s no particular reason for the direct optimizer to have goals about real-world state (see below).
Asking about the AI using an “ontological framework” to “represent” a goal is not the correct question in my view. The AI is a bunch of computations represented by particular weights. The computation might exhibit goal-directed behaviour. A better question, IMO, is “how much does it constrain the weights for it to exhibit this particular goal directed behaviour?” And here, I think it’s pretty clear that a goal of arranging the world to cause next tokens to be predicted constrains the weights enormously more than a goal of predicting the next tokens, because in order to exhibit behaviour directed to that goal, the AI’s weights need to implement computation that doesn’t merely check what the next token is likely to be, but also assess what current data says about the world state, how different next token predictions would affect that world state, and how that would affect its ultimate goal.
The training check has no reason to come into existence in the first place under gradient descent. Of course, if the AI were to self-modify while already exhibiting goal directed behaviour, obviously it would want to implement such a training check. But I am talking about an AI trained by gradient descent. The training process doesn’t just affect the AI, it literally is what creates the AI in the first place.