I completely agree that prompting for alignment is an obvious start, and should be used wherever possible, generally as one component in a larger set of alignment techniques. I guess I’d been assuming that everyone else was also assuming we’d do that.
Of course, there are cases where that isn’t fully under our control: an LLM hosted by a foundation model company, where they may (if they choose) control the system prompt but not the user prompt; or open-source models, where the prompt is up to whoever is running the model, who may or may not know or care about x-risks.
In general, there is almost always going to be text in the LLM’s context that came from untrusted sources: from a user, from some text we need processed, or from the web during Retrieval Augmented Generation. So there’s always some degree of concern that this text might influence or jailbreak the model, whether intentionally or accidentally (the web presumably contains some sob stories about peculiar, recently deceased grandmothers that are genuine, or at least not intentionally crafted as jailbreaks, but that could still have a similarly excessive effect on model generation).
The fundamental issue here, as I see it, is that base model LLMs learn to simulate everyone on the Internet. That makes them pick up the capability for a lot of bad-for-alignment behaviors from humans (deceit, for example), and it also makes them very good at adopting any persona asked for in a prompt — but also rather prone to switching to a different, potentially less-aligned persona because of a jailbreak or some other comparable influence, intentional or otherwise.
Maybe everyone who discusses LMA alignment does already think about the prompting portion of alignment. In that case, this post is largely redundant. You think about LMA alignment a lot; I’m not sure everyone else has as clear a mental model.
The remainder of your response points to a bifurcation in mental models that I should clarify in future work on LMAs. What I am worried about and thinking about is competent, agentic, full AGI built as a language model cognitive architecture. I don’t think good terminology exists. When I use the term language model agent, I think it evokes an image of something like current agents, rather than something reflective, with a persistent memory and therefore a more persistent identity.
This is my threat model because I think it’s the easiest path to highly capable AGI. I think a model without those properties is shackled: the humans who created its “thought” dataset have an episodic memory as well as the semantic memory and working memory/context that the language model has, so using those thoughts without episodic memory is not using them as they were made to be used. And episodic memory is easy to implement, and leads naturally to persistent self-created beliefs, including goals and identity.
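To make concrete why I say episodic memory is easy to implement, here is a minimal sketch of the recall-think-remember loop I have in mind. It is not any particular framework’s API; the names (`EpisodicMemory`, `agent_step`, the `llm` callable) are all hypothetical, and the keyword-overlap retrieval is a stand-in for embedding search. The point is just that persisting the model’s own conclusions back into retrievable memory is a few lines of wrapper code, and that is exactly what lets self-created beliefs, goals, and identity accumulate across episodes.

```python
# Hypothetical sketch of an episodic memory layer around an LLM call.
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Append-only store of past episodes, retrieved here by crude keyword
    overlap; a real agent would use embeddings, but the loop is the same."""
    episodes: list[str] = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        query_words = set(query.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda e: len(query_words & set(e.lower().split())),
            reverse=True,
        )
        return scored[:k]


def agent_step(memory: EpisodicMemory, llm, task: str) -> str:
    """One step of the loop: recall -> think -> remember.
    `llm` is a stand-in for whatever callable produces the model's output."""
    recalled = memory.retrieve(task)
    context = "\n".join(["Relevant past episodes:"] + recalled + ["Task:", task])
    output = llm(context)
    # Writing the model's own conclusion back into memory is what makes
    # beliefs, goals, and identity persist across episodes.
    memory.store(f"Task: {task}\nConclusion: {output}")
    return output
```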