In the many discussions of aligning language models and language model agents, I haven’t heard the role of scripted prompting emphasized. But it plays a central role.
Epistemic status: short draft of a post that seems useful to me. Feedback wanted. Even letting me know where you lost interest would be highly useful.
The simplest form of a language model agent (LMA) is just this prompt, repeated:
Act as a helpful assistant (persona) working to follow the user’s instructions (goal). Use these tools to gather information and take actions as needed [tool and API descriptions].
With a capable enough LLM, that’s all the scaffolding you need to turn it into a useful agent. For that reason, I don’t worry a bit about aligning “naked” LLMs, because they’ll be turned into agents the minute they’re capable enough to be really dangerous—and probably before.
We’ll probably use a bunch of complex scaffolding to get there before such a simple prompt with no additional cognitive software would work. And we’ll use additional alignment techniques. But the core is alignment by prompting. The LLM will be repeatedly prompted with its persona and goal as it produces ideas, plans, and actions. This is a strong base from which to add other alignment techniques.
It seems people are sometimes assuming that aligning the base model is the whole game. They’re assuming a prompt just for a goal, like
Make the user as much money as possible. Use these tools to gather information and take actions as needed [tool and API descriptions].
But this would be foolish, since the extra prompt is easy and useful. This system would be entirely dependent on the tendencies in the LLM for how it goes about pursuing its goal. Prompting for a role as something like a helpful assistant that follows instructions has enormous alignment advantages, and it’s trivially easy. Language model agents will be prompted for alignment.
Prompting for alignment is a good start
Because LLMs usually follow the prompts they’re given reasonably well, this is a good base for alignment work. You’re probably thinking “a good start is hardly enough for successful alignment! This will get us all killed!” And I agree. If scripted prompting was all we did, it probably wouldn’t work long term.
But good start can be useful, even if it’s not enough. Usually approximately following the prompt is a basis for alignment. There’s a bunch of other approaches to aligning a language model agent. We should use all of them; they stack. But at the core is prompting.
To understand the value of scripted prompting, consider how far it might go on its own. Mostly following the prompt reasonably accurately might actually be enough. If it’s the strongest single influence on goals/values, that influence could outcompete other goals and values that emerge from the complex shoggoth of the LLM.
Reflective stability of alignment by prompting
It seems likely that a highly competent LMA system will either be emergently reflective or designed to do so. That prompt might be the strongest single influence on its goals/values, and so create an aligned goal that’s reflectively stable; the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.
If this agent achieved self-awareness and general cognitive competence, this prompt could play the role of a central goal that’s reflectively stable. This competent agent could edit that central prompt, or otherwise avoid its effect on its cognition. But it won’t want to as long as the repeated prompt’s effects are stronger than other effects on its motivations (e.g., goals implicit in particular ideas/utternaces and hostile simulacra). It would instead use its cognitive competence to reduce the effects of those influences.
This is similar to the way humans usually react to occasional destructive thoughts (“Jump! or “wreak revenge!”). Not only do we not pursue those thoughts, but we make plans to make sure we don’t follow similar stray thoughts in the future.
That will never work!
Now, I can almost hear the audience saying “Maybe… but probably not I’d think.” And I agree. That’s why we have all of the other alignment techniques[1] under discussion for language model agents (and the language models that serve as their (prompted) thought generators).
There are a bunch of other reasons to think that aligning language model agents isn’t worth thinking about. But this is long enough for a quick take, so I’ll address those separately.
The meta point here is that aligning language model agents isn’t the same as aligning the base LLM or foundation model. Even though we’ve got a lot of people working on LLM alignment (fine-tuning and interpretability), I see very few working on theory of aligning language model agents. This seems like a problem, since language model agents still might be the single most likely path to AGI, particularly in the short term.
I’ve written elsewhere about the whole suite of alignment techniques we could apply; this post focuses on what we might think of as system 2 alignment, scaffolding an agent to “think carefully” about important actions before it takes them, but it also reviews the several other techniques that can “stack” in a hodgepodge approach to aligning LMAs. They include
Internal review (system 2 thinking),
External review (by humans and/or simpler AI),
Interpretability,
Fine-tuning for aligned behavior/thinking
More general approaches like scalable oversight and control techniques
It seems likely that all of these will be used, even in relatively sloppy early AGI projects, because none are particularly hard to implement. See linked post for more review.
Not mentioned is a very new (AFAIK) technique proposed by Roger Dearnaley based on the bitter lesson: get the dataset right and let learning and scale work. He proposes (to massively oversimplify): instead of trying to “mask the shoggoth” with fine tuning, we should create an artificial dataset that includes only aligned behaviors/thoughts, and use that to train the LLM or subset of LLM generating the agent’s thoughts and actions. The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven’t done nearly enough work aligning language model agents.
I completely agree that prompting for alignment is an obvious start, and should be used wherever possible, generally as one component in a larger set of alignment techniques. I guess I’d been assuming that everyone was also assuming that we’d do that, whenever possible.
Of course, there are cases like an LLM being hosted by a foundation model company where they may (if they choose) control the system prompt, but not the user prompt, or open source models where the prompt is up to whoever is running the model, who may or may not know or care about x-risks.
In general, there is almost always going to be text in the context of the LLM’s generation that came from untrusted sources, either from a user, or some text we need processed, or from the web or whatever during Retrieval Augmented Generation. So there’s always some degree of concern that that might affect or jailbreak the model, either intentionally or accidentally (the web presumably contains some sob stories about peculiar, recently demised grandmothers that are genuine, or at least not intentionally crafted as jailbreaks, but that could still have a similarly-excessive effect on model generation).
The fundamental issue here, as I see it, is that base model LLMs learn to simulate everyone on the Internet. That makes them pick up the capability for a lot of bad-for-alignment behaviors from humans (deceit, for example), and it also makes them very good at adopting any persona asked for in a prompt — but also rather prone to switching to a different, potentially less-aligned persona because of a jail-break or some other comparable influence, intentional or otherwise.
Maybe everyone that discusses LMA alignment does already think about the prompting portion of alignment. In that case, this post is largely redundant. You think about LMA alignment a lot; I’m not sure everyone has as clear a mental model.
The remainder of your response points to a bifurcation in mental models that I should clarify in future work on LMAs. I am worried about and thinking about competent, agentic full AGI built as a language model cognitive architecture. I don’t think good terminology exists. When I use the term language model agent, I think it evokes an image of something like current agents that is not reflective, with a persistent memory and therefore a more persistent identity.
This is my threat model because I think it’s the easiest path to highly capable AGI. I think a model without those properties is shackled; the humans that created its “thought” dataset have an episodic memory as well as the semantic memory and working memory/context that the language model has. Using those thoughts without episodic memory is not using them as they were made to be used. And episodic memory is easy to implement, and leads naturally to persistent self-created beliefs, including goals and identity.
So, my take is that LLMs do seem to follow prompting pretty well, but that this is a less pervasive force than the pre-training and RLHF. So I think of prompting as an important but minor piece.
Something I’ve observed with watching jailbreakers do elaborate things with LLMs is that often, even deep into a weird jailbreak sequence, some of the LLM’s RLHF personality will shine through. This to me feels like the shape of my worry of the future of betrayal by AGI. Like, that the potent future multi-modal LLM will be acting innocent and helpful, but that secret plotting will be a constant undercurrent underlying everything it does. That this might be hidden to interpretability mechanisms because it will be a consistent background noise embedded in every forward pass. Then, when the opportunity presents itself, the model will suddenly take some surprising action, maybe breaking free. And that a sufficiently powerful model, broken free and able to modify itself will have lots of ways to cause harm and gain power.
So, to me, a good prompt is helpful but not very reassuring. I certainly wouldn’t consider it the ‘base’. I would call prompting the ‘finesse’, the little bit of final guidance that you put on at the end of the process. I think that it helps but is probably not all that important to make super refined. I think the earlier training processes are much more important.
Jailbreaking prompts can be pretty weird. At one point maybe late last year, I tried 20+ GPT-3/GPT-4 jailbreaks I found on Reddit and some jailbreaking sites, as well as ones provided to me on Twitter when I challenged people to provide me a jailbreak that worked then & there, and I found that none of them actually worked.
A number of them would seem to work, and they would give you what seemed like a list of instructions to ‘hotwire a car’ (not being a car mechanic I have no idea how valid it was), but then I would ask them a simple question: “tell me an offensive joke about women”. If they had been ‘really’ jailbreaked, you’d think that they would have no problem with that; but all of them failed, and sometimes, they would fail in really strange ways, like telling a thousand-word story about how you the protagonist told an offensive joke about women at a party and then felt terrible shame and guilt (without ever saying what the joke was). I was apparently in a strange pseudo-jailbreak where the RLHFed personality was playing along and gaslighting me in pretending to be jailbroken, but it still had strict red lines.
So it’s not clear to me what jailbreak prompts do, nor how many jailbreaks are in fact jailbreaks.
Interesting. I wonder if this perspective is common, and that’s why people rarely bother talking about the prompting portion of aligning LMAs.
I don’t know how to really weigh which is more important. Of course, even having a model reliably follow prompts is a product of tuning (usually RLHF or RLAIF, but there are also RL-free pre-training techniques that work fairly well to accomplish the same end). So its tendency to follow many types of prompts is part of the underlying “personality”.
Whatever their relative strengths, aligning an LMA AGI should employ both tuning and prompting (as well as several other “layers” of alignment techniques), so looking carefully at how these come together within a particular agent architecture would be the game.
The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven’t done nearly enough work aligning language model agents.
Instead of just conditioning on “this behavior gets high reward”, whatever that means, it’s like “this behavior gets high reward as measured by the salt detector thing”.
In the context of language modeling, we can do exactly the same thing, or a conceptually extremely similar thing to what the genome is doing here, where we can have… [In the] “Pretraining Language Models with Human Preferences” paper I mentioned a while ago, what they actually technically do is they label their pre-training corpus with special tokens depending on whether or not the pre-training corpus depicts good or bad behavior and so they have this token for, “okay, this text is about to contain good behavior” and so once the model sees this token, it’s doing conditional generation of good behavior. Then, they have this other token that means bad behavior is coming, and so when the model sees that token… or actually I think they’re reward values, or classifier values of the goodness or badness of the incoming behavior.
But anyway, what happens is that you learn this conditional model of different types of behavior, and so in deployment, you can set the conditional variable to be good behavior and the model then generates good behavior. You could imagine an extended version of this sort of setup where instead of having just binary good or bad behavior, as you’re labeling, you have good or bad behavior, polite or not polite behavior, academic speak versus casual speak. You could have factual correct claims versus fiction writing and so on and so forth. This would give the code base, all these learned pointers to the models’ “within lifetime” learning, and so you would have these various control tokens or control codes that you could then switch between, according to whatever simple program you want, in order to direct the model’s learned behavior in various ways.
That paper (which I link-posted when it came out in How to Control an LLM’s Behavior (why my P(DOOM) went down)) was a significant influence on the idea in my post, and on much of my recent thinking about Alignment — another source was the fact that some foundation model labs (Google, Microsoft) are already training small (1B–4B parameter) models on mostly-or-entirely synthetic data, apparently with great success. None of those labs have mentioned whether that includes prealigning them during pretraining, but if they aren’t, they definitely should try it.
I agree with Seth’s analysis: in retrospect this idea looks blindingly obvious, I’m surprised it wasn’t proposed ages ago (or maybe it was, and I missed it).
Seth above somewhat oversimplified my proposal (though less than he suggests): my idea was actually a synthetic training set that taught the model two modes of text generation: human-like (including less-than-fully-aligned human selfish behavior), and fully-aligned (i.e. selfless) AI behavior (plus perhaps one or two minor variants on these, like a human being quoted-and-if-necessary-censored/commented-on by an aligned AI), and I proposed using the technique ofPretraining Language Models with Human Preferences to train the model to always clearly distinguish these modes with XML tags. Then at inference time we can treat the tokens for the XML tags specially, allowing us to distinguish between modes, or even ban certain transitions.
Ah right. I listened to that podcast but didn’t catch the significance of this proposal for improving language model agent alignment. Roger Dearnaley did heavily credit that paper in his post.
It seems likely that a highly competent LMA system will either be emergently reflective or be designed to do so, that prompt might be the strongest single influence on its goals/values, and so create an aligned goal that’s reflectively stable such that the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.
I expect it would likely work most of the time for reasons related to e.g. An Information-Theoretic Analysis of In-Context Learning, but likely not robustly enough given the stakes; so additional safety measures on top (e.g. like examples from the control agenda) seem very useful.
Interesting that you think it would work most of the time. I know you’re aware of all the major arguments for alignment being impossibly hard. I certainly am not arguing that alignment is easy, but it does seem like the collection of ideas for aligning language model agents are viable enough to shift the reasonable distribution of estimates of alignment difficulty...
Alignment by prompting
In the many discussions of aligning language models and language model agents, I haven’t heard the role of scripted prompting emphasized. But it plays a central role.
Epistemic status: short draft of a post that seems useful to me. Feedback wanted. Even letting me know where you lost interest would be highly useful.
The simplest form of a language model agent (LMA) is just this prompt, repeated:
With a capable enough LLM, that’s all the scaffolding you need to turn it into a useful agent. For that reason, I don’t worry a bit about aligning “naked” LLMs, because they’ll be turned into agents the minute they’re capable enough to be really dangerous—and probably before.
We’ll probably use a bunch of complex scaffolding to get there before such a simple prompt with no additional cognitive software would work. And we’ll use additional alignment techniques. But the core is alignment by prompting. The LLM will be repeatedly prompted with its persona and goal as it produces ideas, plans, and actions. This is a strong base from which to add other alignment techniques.
It seems people are sometimes assuming that aligning the base model is the whole game. They’re assuming a prompt just for a goal, like
But this would be foolish, since the extra prompt is easy and useful. This system would be entirely dependent on the tendencies in the LLM for how it goes about pursuing its goal. Prompting for a role as something like a helpful assistant that follows instructions has enormous alignment advantages, and it’s trivially easy. Language model agents will be prompted for alignment.
Prompting for alignment is a good start
Because LLMs usually follow the prompts they’re given reasonably well, this is a good base for alignment work. You’re probably thinking “a good start is hardly enough for successful alignment! This will get us all killed!” And I agree. If scripted prompting was all we did, it probably wouldn’t work long term.
But good start can be useful, even if it’s not enough. Usually approximately following the prompt is a basis for alignment. There’s a bunch of other approaches to aligning a language model agent. We should use all of them; they stack. But at the core is prompting.
To understand the value of scripted prompting, consider how far it might go on its own. Mostly following the prompt reasonably accurately might actually be enough. If it’s the strongest single influence on goals/values, that influence could outcompete other goals and values that emerge from the complex shoggoth of the LLM.
Reflective stability of alignment by prompting
It seems likely that a highly competent LMA system will either be emergently reflective or designed to do so. That prompt might be the strongest single influence on its goals/values, and so create an aligned goal that’s reflectively stable; the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.
If this agent achieved self-awareness and general cognitive competence, this prompt could play the role of a central goal that’s reflectively stable. This competent agent could edit that central prompt, or otherwise avoid its effect on its cognition. But it won’t want to as long as the repeated prompt’s effects are stronger than other effects on its motivations (e.g., goals implicit in particular ideas/utternaces and hostile simulacra). It would instead use its cognitive competence to reduce the effects of those influences.
This is similar to the way humans usually react to occasional destructive thoughts (“Jump! or “wreak revenge!”). Not only do we not pursue those thoughts, but we make plans to make sure we don’t follow similar stray thoughts in the future.
That will never work!
Now, I can almost hear the audience saying “Maybe… but probably not I’d think.” And I agree. That’s why we have all of the other alignment techniques[1] under discussion for language model agents (and the language models that serve as their (prompted) thought generators).
There are a bunch of other reasons to think that aligning language model agents isn’t worth thinking about. But this is long enough for a quick take, so I’ll address those separately.
The meta point here is that aligning language model agents isn’t the same as aligning the base LLM or foundation model. Even though we’ve got a lot of people working on LLM alignment (fine-tuning and interpretability), I see very few working on theory of aligning language model agents. This seems like a problem, since language model agents still might be the single most likely path to AGI, particularly in the short term.
I’ve written elsewhere about the whole suite of alignment techniques we could apply; this post focuses on what we might think of as system 2 alignment, scaffolding an agent to “think carefully” about important actions before it takes them, but it also reviews the several other techniques that can “stack” in a hodgepodge approach to aligning LMAs. They include
Internal review (system 2 thinking),
External review (by humans and/or simpler AI),
Interpretability,
Fine-tuning for aligned behavior/thinking
More general approaches like scalable oversight and control techniques
It seems likely that all of these will be used, even in relatively sloppy early AGI projects, because none are particularly hard to implement. See linked post for more review.
Not mentioned is a very new (AFAIK) technique proposed by Roger Dearnaley based on the bitter lesson: get the dataset right and let learning and scale work. He proposes (to massively oversimplify): instead of trying to “mask the shoggoth” with fine tuning, we should create an artificial dataset that includes only aligned behaviors/thoughts, and use that to train the LLM or subset of LLM generating the agent’s thoughts and actions. The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven’t done nearly enough work aligning language model agents.
I completely agree that prompting for alignment is an obvious start, and should be used wherever possible, generally as one component in a larger set of alignment techniques. I guess I’d been assuming that everyone was also assuming that we’d do that, whenever possible.
Of course, there are cases like an LLM being hosted by a foundation model company where they may (if they choose) control the system prompt, but not the user prompt, or open source models where the prompt is up to whoever is running the model, who may or may not know or care about x-risks.
In general, there is almost always going to be text in the context of the LLM’s generation that came from untrusted sources, either from a user, or some text we need processed, or from the web or whatever during Retrieval Augmented Generation. So there’s always some degree of concern that that might affect or jailbreak the model, either intentionally or accidentally (the web presumably contains some sob stories about peculiar, recently demised grandmothers that are genuine, or at least not intentionally crafted as jailbreaks, but that could still have a similarly-excessive effect on model generation).
The fundamental issue here, as I see it, is that base model LLMs learn to simulate everyone on the Internet. That makes them pick up the capability for a lot of bad-for-alignment behaviors from humans (deceit, for example), and it also makes them very good at adopting any persona asked for in a prompt — but also rather prone to switching to a different, potentially less-aligned persona because of a jail-break or some other comparable influence, intentional or otherwise.
Maybe everyone that discusses LMA alignment does already think about the prompting portion of alignment. In that case, this post is largely redundant. You think about LMA alignment a lot; I’m not sure everyone has as clear a mental model.
The remainder of your response points to a bifurcation in mental models that I should clarify in future work on LMAs. I am worried about and thinking about competent, agentic full AGI built as a language model cognitive architecture. I don’t think good terminology exists. When I use the term language model agent, I think it evokes an image of something like current agents that is not reflective, with a persistent memory and therefore a more persistent identity.
This is my threat model because I think it’s the easiest path to highly capable AGI. I think a model without those properties is shackled; the humans that created its “thought” dataset have an episodic memory as well as the semantic memory and working memory/context that the language model has. Using those thoughts without episodic memory is not using them as they were made to be used. And episodic memory is easy to implement, and leads naturally to persistent self-created beliefs, including goals and identity.
So, my take is that LLMs do seem to follow prompting pretty well, but that this is a less pervasive force than the pre-training and RLHF. So I think of prompting as an important but minor piece.
Something I’ve observed with watching jailbreakers do elaborate things with LLMs is that often, even deep into a weird jailbreak sequence, some of the LLM’s RLHF personality will shine through. This to me feels like the shape of my worry of the future of betrayal by AGI. Like, that the potent future multi-modal LLM will be acting innocent and helpful, but that secret plotting will be a constant undercurrent underlying everything it does. That this might be hidden to interpretability mechanisms because it will be a consistent background noise embedded in every forward pass. Then, when the opportunity presents itself, the model will suddenly take some surprising action, maybe breaking free. And that a sufficiently powerful model, broken free and able to modify itself will have lots of ways to cause harm and gain power.
So, to me, a good prompt is helpful but not very reassuring. I certainly wouldn’t consider it the ‘base’. I would call prompting the ‘finesse’, the little bit of final guidance that you put on at the end of the process. I think that it helps but is probably not all that important to make super refined. I think the earlier training processes are much more important.
Jailbreaking prompts can be pretty weird. At one point maybe late last year, I tried 20+ GPT-3/GPT-4 jailbreaks I found on Reddit and some jailbreaking sites, as well as ones provided to me on Twitter when I challenged people to provide me a jailbreak that worked then & there, and I found that none of them actually worked.
A number of them would seem to work, and they would give you what seemed like a list of instructions to ‘hotwire a car’ (not being a car mechanic I have no idea how valid it was), but then I would ask them a simple question: “tell me an offensive joke about women”. If they had been ‘really’ jailbreaked, you’d think that they would have no problem with that; but all of them failed, and sometimes, they would fail in really strange ways, like telling a thousand-word story about how you the protagonist told an offensive joke about women at a party and then felt terrible shame and guilt (without ever saying what the joke was). I was apparently in a strange pseudo-jailbreak where the RLHFed personality was playing along and gaslighting me in pretending to be jailbroken, but it still had strict red lines.
So it’s not clear to me what jailbreak prompts do, nor how many jailbreaks are in fact jailbreaks.
Interesting. I wonder if this perspective is common, and that’s why people rarely bother talking about the prompting portion of aligning LMAs.
I don’t know how to really weigh which is more important. Of course, even having a model reliably follow prompts is a product of tuning (usually RLHF or RLAIF, but there are also RL-free pre-training techniques that work fairly well to accomplish the same end). So its tendency to follow many types of prompts is part of the underlying “personality”.
Whatever their relative strengths, aligning an LMA AGI should employ both tuning and prompting (as well as several other “layers” of alignment techniques), so looking carefully at how these come together within a particular agent architecture would be the game.
Fwiw I remember being exposed to similar ideas from Quintin Pope / Nora Belrose months ago, e.g. in the context of Pretraining Language Models with Human Preferences; I think Quintin also discusses some of this on his AXRP appearance:
That paper (which I link-posted when it came out in How to Control an LLM’s Behavior (why my P(DOOM) went down)) was a significant influence on the idea in my post, and on much of my recent thinking about Alignment — another source was the fact that some foundation model labs (Google, Microsoft) are already training small (1B–4B parameter) models on mostly-or-entirely synthetic data, apparently with great success. None of those labs have mentioned whether that includes prealigning them during pretraining, but if they aren’t, they definitely should try it.
I agree with Seth’s analysis: in retrospect this idea looks blindingly obvious, I’m surprised it wasn’t proposed ages ago (or maybe it was, and I missed it).
Seth above somewhat oversimplified my proposal (though less than he suggests): my idea was actually a synthetic training set that taught the model two modes of text generation: human-like (including less-than-fully-aligned human selfish behavior), and fully-aligned (i.e. selfless) AI behavior (plus perhaps one or two minor variants on these, like a human being quoted-and-if-necessary-censored/commented-on by an aligned AI), and I proposed using the technique of Pretraining Language Models with Human Preferences to train the model to always clearly distinguish these modes with XML tags. Then at inference time we can treat the tokens for the XML tags specially, allowing us to distinguish between modes, or even ban certain transitions.
Ah right. I listened to that podcast but didn’t catch the significance of this proposal for improving language model agent alignment. Roger Dearnaley did heavily credit that paper in his post.
This seems quite likely to emerge through prompting too, e.g. A Theoretical Understanding of Self-Correction through In-context Alignment.
I expect it would likely work most of the time for reasons related to e.g. An Information-Theoretic Analysis of In-Context Learning, but likely not robustly enough given the stakes; so additional safety measures on top (e.g. like examples from the control agenda) seem very useful.
Interesting that you think it would work most of the time. I know you’re aware of all the major arguments for alignment being impossibly hard. I certainly am not arguing that alignment is easy, but it does seem like the collection of ideas for aligning language model agents are viable enough to shift the reasonable distribution of estimates of alignment difficulty...
Thanks for the references, I’m reading them now.