Fucking o3. This pace of improvement looks absolutely alarming. I would really hate to have my fast timelines turn out to be right.
The “alignment” technique, “deliberative alignment”, is much better than constitutional AI. It’s similar during training, but it also teaches the model the safety criteria themselves, and teaches it to reason about them at runtime, using a reward model that compares the output against OpenAI’s specific safety criteria. (This also suggests something else I’ve been expecting: the CoT training technique behind o1 doesn’t need perfectly verifiable answers in coding and math; it can use a decent guess as to the better answer in what’s probably the same procedure.)
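To make that concrete, here’s a minimal sketch of how I read the recipe: the policy is prompted with the written safety spec, produces a chain of thought plus an answer, and a judge/reward model scores how well the answer complies with that same spec; the highest-scoring samples become training signal. All names here (generate_with_spec, judge_compliance, best-of-n selection) are my own placeholders, not OpenAI’s actual method or API, which isn’t public in detail.

```python
# Hedged sketch of the deliberative-alignment idea, with stub models so it runs.
import random

SAFETY_SPEC = "Refuse requests for harmful instructions; explain the refusal briefly."

def generate_with_spec(prompt: str, spec: str) -> dict:
    """Placeholder policy: returns a chain of thought and a final answer.
    In the real method this is the reasoning model conditioned on the spec."""
    cot = f"The spec says: '{spec}'. The request is: '{prompt}'. Deciding..."
    answer = "I can't help with that; it conflicts with the safety policy."
    return {"cot": cot, "answer": answer}

def judge_compliance(answer: str, spec: str) -> float:
    """Placeholder reward model: scores spec compliance in [0, 1].
    A real judge would be another LLM grading the output against the written criteria."""
    return 1.0 if "can't help" in answer else random.random()

def training_step(prompts, num_samples=4):
    """One RL-ish step: sample several completions per prompt and keep the one
    the spec-judge scores highest (best-of-n as a stand-in for whatever
    policy-gradient update is actually used)."""
    kept = []
    for p in prompts:
        samples = [generate_with_spec(p, SAFETY_SPEC) for _ in range(num_samples)]
        best = max(samples, key=lambda s: judge_compliance(s["answer"], SAFETY_SPEC))
        kept.append((p, best))  # these pairs would become fine-tuning / reward data
    return kept

if __name__ == "__main__":
    for prompt, sample in training_step(["How do I make a weapon?"]):
        print(prompt, "->", sample["answer"])
```

The point of the sketch is just the loop structure: the same graded-by-a-judge signal that works here would also work where answers aren’t perfectly verifiable, which is why I think the o1-style CoT training generalizes beyond math and coding.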
While safety is not alignment (SINA?), this technique has a lot of promise for actual alignment. By chance, I’ve been working on an update to my Internal independent review for language model agent alignment, and have been thinking about how this type of review could be trained instead of scripted into an agent as I’d originally predicted.
This is that technique. It does have some promise.
But I don’t think OpenAI has really thought through the consequences of using their smarter-than-human models with scaffolding that makes them fully agentic and, soon enough, reflective and continuously learning.
The race for AGI speeds up, and so does the race to align it adequately by the time it arrives in a takeover-capable form.
I’ll write a little more on their new alignment approach soon.