That paper (which I link-posted when it came out, in How to Control an LLM’s Behavior (why my P(DOOM) went down)) was a significant influence on the idea in my post, and on much of my recent thinking about Alignment. Another source was the fact that some foundation model labs (Google, Microsoft) are already training small (1B–4B parameter) models on mostly-or-entirely synthetic data, apparently with great success. None of those labs has mentioned whether that includes pre-aligning the models during pretraining, but if not, they definitely should try it.
I agree with Seth’s analysis: in retrospect this idea looks blindingly obvious, and I’m surprised it wasn’t proposed ages ago (or maybe it was, and I missed it).
Seth above somewhat oversimplified my proposal (though less than he suggests). My idea was actually a synthetic training set that teaches the model two modes of text generation: human-like (including less-than-fully-aligned, selfish human behavior) and fully-aligned (i.e. selfless) AI behavior, plus perhaps one or two minor variants of these, such as a human being quoted (and, if necessary, censored or commented on) by an aligned AI. I proposed using the technique of Pretraining Language Models with Human Preferences to train the model to always clearly distinguish these modes with XML tags. Then at inference time we can treat the tokens for the XML tags specially, allowing us to distinguish between modes, or even ban certain transitions.
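To make the inference-time part concrete, here is a minimal sketch (my illustration, not anything from the paper) of how one could ban a mode transition by masking the logits of the relevant XML-tag tokens during generation. It assumes a Hugging Face transformers model; the tag token id used here is hypothetical.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList


class ModeTransitionBan(LogitsProcessor):
    """Masks the logits of mode-tag tokens so that disallowed transitions
    (e.g. dropping out of fully-aligned AI mode into raw human mode) can
    never be sampled at inference time."""

    def __init__(self, banned_tag_token_ids):
        # Token ids of the XML mode tags we never want the model to emit.
        self.banned_tag_token_ids = list(banned_tag_token_ids)

    def __call__(self, input_ids, scores):
        # Set the banned tags' logits to -inf before sampling.
        scores[:, self.banned_tag_token_ids] = float("-inf")
        return scores


if __name__ == "__main__":
    # Toy demonstration with dummy logits: pretend token id 7 is the
    # (hypothetical) "<human>" tag and vocab size is 10.
    processor = ModeTransitionBan(banned_tag_token_ids=[7])
    dummy_input_ids = torch.tensor([[1, 2, 3]])
    dummy_scores = torch.zeros(1, 10)
    masked = processor(dummy_input_ids, dummy_scores)
    assert masked[0, 7] == float("-inf")

    # In real use, pass it to generate(), e.g.:
    # model.generate(**inputs,
    #                logits_processor=LogitsProcessorList([processor]))
```

A more elaborate version could inspect `input_ids` to work out which mode the model is currently in and only ban the specific transitions you want to rule out, rather than banning the tags unconditionally as this sketch does.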