The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven’t done nearly enough work aligning language model agents.
Instead of just conditioning on “this behavior gets high reward”, whatever that means, it’s like “this behavior gets high reward as measured by the salt detector thing”.
In the context of language modeling, we can do exactly the same thing, or a conceptually very similar thing to what the genome is doing here. In the “Pretraining Language Models with Human Preferences” paper I mentioned a while ago, what they actually do, technically, is label their pre-training corpus with special tokens depending on whether or not the text depicts good or bad behavior. So they have a token that means “okay, this text is about to contain good behavior”, and once the model sees that token, it’s doing conditional generation of good behavior. Then they have another token that means bad behavior is coming, and when the model sees that token… or actually, I think they’re reward values, or classifier values of the goodness or badness of the incoming behavior.
But anyway, what happens is that you learn this conditional model of different types of behavior, so in deployment you can set the conditional variable to good behavior and the model then generates good behavior. You could imagine an extended version of this setup where, instead of labeling just binary good or bad behavior, you label good or bad behavior, polite or impolite behavior, academic speak versus casual speak, factually correct claims versus fiction writing, and so on and so forth. This would give the “code base” all these learned pointers into the model’s within-lifetime learning, and so you would have these various control tokens or control codes that you could then switch between, according to whatever simple program you want, in order to direct the model’s learned behavior in various ways.
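For concreteness, here is a minimal Python sketch of the conditional-training setup described above, in the spirit of Pretraining Language Models with Human Preferences. The token strings, the `score_text` stand-in for the paper’s reward model or classifier, and the threshold are illustrative assumptions, not the paper’s exact implementation.

```python
# Minimal sketch of conditional pretraining with control tokens, in the spirit
# of "Pretraining Language Models with Human Preferences" (Korbak et al.).
# Token strings, the score_text() stand-in, and the threshold are illustrative.

GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def score_text(text: str) -> float:
    """Stand-in scorer returning a 'goodness' value in [0, 1].
    In the paper this role is played by a classifier or reward model."""
    return 0.0 if "insult" in text else 1.0


def label_segment(text: str, threshold: float = 0.5) -> str:
    """Prepend a control token so the LM learns p(text | control token)."""
    token = GOOD_TOKEN if score_text(text) >= threshold else BAD_TOKEN
    return token + text


corpus = [
    "Thanks for asking! Here is how to reset your password...",
    "That is a stupid question, and an insult to my time.",
]
labeled_corpus = [label_segment(doc) for doc in corpus]

# The model is then pretrained on labeled_corpus with ordinary next-token
# prediction; at inference time, conditioning on good behavior is just a
# matter of prompting with the good-behavior control token:
prompt = GOOD_TOKEN + "User: How do I reset my password?\nAssistant:"
```

The multi-axis extension described in the transcript would simply swap the single good/bad token for a small set of control codes (politeness, register, factuality, and so on) prepended in the same way.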
That paper (which I link-posted when it came out in How to Control an LLM’s Behavior (why my P(DOOM) went down)) was a significant influence on the idea in my post, and on much of my recent thinking about Alignment — another source was the fact that some foundation model labs (Google, Microsoft) are already training small (1B–4B parameter) models on mostly-or-entirely synthetic data, apparently with great success. None of those labs have mentioned whether that includes prealigning them during pretraining, but if they aren’t, they definitely should try it.
I agree with Seth’s analysis: in retrospect this idea looks blindingly obvious, I’m surprised it wasn’t proposed ages ago (or maybe it was, and I missed it).
Seth above somewhat oversimplified my proposal (though less than he suggests): my idea was actually a synthetic training set that taught the model two modes of text generation: human-like (including less-than-fully-aligned human selfish behavior), and fully-aligned (i.e. selfless) AI behavior (plus perhaps one or two minor variants on these, like a human being quoted-and-if-necessary-censored/commented-on by an aligned AI), and I proposed using the technique of Pretraining Language Models with Human Preferences to train the model to always clearly distinguish these modes with XML tags. Then at inference time we can treat the tokens for the XML tags specially, allowing us to distinguish between modes, or even ban certain transitions.
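To make the inference-time part concrete, here is a hedged sketch of what “treating the XML-tag tokens specially” could look like as a logit mask during sampling. The tag names, token ids, and the allowed-transition table are my own illustrative assumptions, not details from the post.

```python
# Illustrative sketch: ban mode transitions at inference time by masking the
# logits of XML-style tag tokens. The tag vocabulary, token ids, and the
# ALLOWED transition table are hypothetical, chosen only to show the mechanism.
import math

TAG_IDS = {"<human>": 50001, "</human>": 50002,
           "<ai>": 50003, "</ai>": 50004}  # hypothetical token ids

# Which tags may legally be emitted from each mode ("none" = no open tag).
ALLOWED = {
    "none":  {"<human>", "<ai>"},
    "human": {"</human>"},            # a human-mode span may only close itself
    "ai":    {"</ai>", "<human>"},    # the aligned AI may open a quoted human span
}


def current_mode(generated_ids: list[int]) -> str:
    """Infer the currently open mode from the tag tokens generated so far."""
    stack = []
    for tid in generated_ids:
        if tid == TAG_IDS["<human>"]:
            stack.append("human")
        elif tid == TAG_IDS["<ai>"]:
            stack.append("ai")
        elif tid in (TAG_IDS["</human>"], TAG_IDS["</ai>"]) and stack:
            stack.pop()
    return stack[-1] if stack else "none"


def mask_banned_tags(logits: list[float], generated_ids: list[int]) -> list[float]:
    """Set logits of tag tokens that would be an illegal transition to -inf.
    This would be wired into the sampler as a logits processor."""
    mode = current_mode(generated_ids)
    for tag, tid in TAG_IDS.items():
        if tag not in ALLOWED[mode]:
            logits[tid] = -math.inf
    return logits
```

The same mask could also be used the other way round, e.g. to force a completion to stay inside the aligned-AI mode for its entire length.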
Ah right. I listened to that podcast but didn’t catch the significance of this proposal for improving language model agent alignment. Roger Dearnaley did heavily credit that paper in his post.
Fwiw I remember being exposed to similar ideas from Quintin Pope / Nora Belrose months ago, e.g. in the context of Pretraining Language Models with Human Preferences; I think Quintin also discusses some of this on his AXRP appearance: