I agree with pretty much everything you’ve said here. And yet, my overall take on this direction for AGI is relative optimism!
The potential for language model agents to be developed open source is indeed frightening. So is the potential for self-modification and rapid development through modularity. This will speed progress and lead to a more massively multipolar scenario, which I find hugely dangerous (though I do suspect that a superhuman LMA and/or its developers would pretty quickly choose to perform some sort of pivotal act to limit the danger of that multipolar scenario).
However, the alignment advantages over other approaches seem so large as to be worth the development speedup and multipolar scenario. LMAs seem to have many alignment advantages, and no disadvantages I can identify relative to other practically feasible AGI approaches.
I agree that having an internal train of thought in natural language isn’t a guarantee of safety. But it does seem like a huge advantage. In particular, when calls are made to a fresh instance of an LLM (without previous calls in the context window), it’s going to be hard for the model to sustain persistent, agent-like Waluigi simulations. Natural language is what LLMs are trained to produce and take as input, so the bulk of their thinking should be summarized somewhat accurately by that natural language.
The second huge advantage is being able to add and specify safety goals in natural language. The proposal is not to rely on RLHF alignment of the base LLM, but to give the agent explicit alignment goals alongside its performance goals. The idea is to have separate cognitive routines that check plans for alignment/safety before executing them.
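To make that cross-checking idea concrete, here is a minimal sketch of what such a routine might look like. It is only an illustration of the control flow, not anything specified in the post or this comment: the `llm` callable, the prompt wording, and the example goal strings are my own assumptions.

```python
# Minimal sketch of a safety cross-check routine in a language model agent.
# The `llm` callable, prompt wording, and goal strings are illustrative
# assumptions, not a description of any existing system.

from typing import Callable

SAFETY_GOALS = [
    "Do not take irreversible actions without human sign-off.",
    "Flag any plan that could deceive or manipulate people.",
]

def check_plan(llm: Callable[[str], str], plan: str) -> tuple[bool, str]:
    """Ask a fresh LLM call (no agent history in context) whether the plan
    violates any safety goal; return (approved, critique)."""
    prompt = (
        "Safety goals:\n"
        + "\n".join(f"- {g}" for g in SAFETY_GOALS)
        + f"\n\nProposed plan:\n{plan}\n\n"
        "Does this plan violate any safety goal? Answer APPROVE or REJECT, "
        "then explain briefly."
    )
    critique = llm(prompt)
    return critique.strip().upper().startswith("APPROVE"), critique

def agent_step(llm: Callable[[str], str], task: str) -> str:
    """Propose a plan for the task, then only execute it if the separate
    safety check approves; otherwise return the critique for revision."""
    plan = llm(f"Propose a step-by-step plan for: {task}")
    approved, critique = check_plan(llm, plan)
    if not approved:
        return f"Plan rejected by safety check: {critique}"
    return f"Executing plan:\n{plan}"  # actual execution is out of scope here
```

The point of the sketch is only that the safety check is a distinct call with its own explicitly stated goals, rather than something folded into the base model's RLHF training.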
I think my next post should be on the alignment advantages of language model agents. Neither my post nor the other recent posts on advantages has directly addressed that type of safety cross-checking. That sets the stage for thinking about whether language model agents are the safest practical route to aligned AGI.
It also doesn’t matter a whole lot whether these are the safest route; they either will or won’t be the route the world takes. The alignment community still has very little sway in convincing people to develop safer sorts of AGI. If we developed a consensus that one sort was much safer, we could try to shift public and professional opinion to apply pressure. We could also actively contribute to capabilities work in that direction if we really thought it was safer. On LMAs versus other practical approaches, the jury is still out.
This tradeoff and the relative advantages are things I’d very much like to discuss, on LessWrong and in conversation with anyone who’s interested.
Edit: I’m disappointed that this post didn’t get more attention. I suggest you request to have it cross-posted to the Alignment Forum; it’s clearly appropriate given the level of scholarship. I intend to cite this post in all of my future discussions to draw attention to it.