Deceptive alignment feels like a failure of reflection: a model not being in equilibrium with its episodes. If a behavior is not expressed in episodes sampled within the model’s scope, it isn’t there at all. If it is expressed and is contrary to the alignment target, then either the alignment target is not examining the episodes correctly to adjust them (when they are within the scope of the target), or the episodes were constructed incorrectly when extending the model’s scope beyond where it is aligned.
I think a model must never be knowingly exercised off-distribution (risking robustness); that should be a core alignment principle of any system that uses models. By itself, learning only interpolates, it doesn’t extrapolate. Language models currently fail at this and are missing this critical safety feature: they don’t know what they don’t know, and happily generate nonsense in response to off-distribution inputs (empirically this seems easy to fix in some form, but that’s not the level of care the issue deserves). Extending the scope (beyond what the model expects its scope to be) should be deliberate, with specifically crafted training data whose alignment is conferred by the alignment of the systems that generate it.
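To make the "never knowingly exercised off-distribution" principle concrete, here is a minimal sketch (not from the post, and not how any production system does it): score how surprising an input is under a model of the training distribution, and refuse to answer when the surprise exceeds a cutoff. The smoothed character-bigram model, the `BigramGate` name, and the threshold choice are all illustrative stand-ins for a real model's likelihood and a properly calibrated cutoff.

```python
import math
from collections import Counter, defaultdict

class BigramGate:
    """Toy stand-in for 'the model's own estimate of its distribution'."""

    def __init__(self, corpus: str, alpha: float = 1.0):
        # Fit a smoothed character-bigram model on the training distribution.
        self.alpha = alpha
        self.vocab = sorted(set(corpus))
        self.counts = defaultdict(Counter)
        for a, b in zip(corpus, corpus[1:]):
            self.counts[a][b] += 1

    def avg_nll(self, text: str) -> float:
        # Average negative log-likelihood per character; higher = more surprising.
        total, n = 0.0, 0
        v = len(self.vocab) or 1
        for a, b in zip(text, text[1:]):
            c = self.counts.get(a, Counter())
            p = (c[b] + self.alpha) / (sum(c.values()) + self.alpha * v)
            total += -math.log(p)
            n += 1
        return total / max(n, 1)

def answer(gate: BigramGate, model_fn, prompt: str, threshold: float) -> str:
    # The principle itself: if the prompt looks off-distribution, decline
    # instead of generating a confident answer.
    if gate.avg_nll(prompt) > threshold:
        return "REFUSE: input looks off-distribution for this model."
    return model_fn(prompt)

if __name__ == "__main__":
    training_text = "the model answers questions about its training data " * 50
    gate = BigramGate(training_text)
    threshold = gate.avg_nll(training_text) * 1.5  # illustrative cutoff
    echo = lambda p: f"model output for: {p}"
    print(answer(gate, echo, "the model answers questions", threshold))
    print(answer(gate, echo, "zzqx!! 0xDEADBEEF %%%", threshold))
```

The point is the gate around the model, not the bigram statistics: whatever the scoring mechanism, the system using the model declines rather than extrapolating, and scope is only extended later through deliberate training rather than at inference time.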