There might be a natural concept for this that reframes deceptive alignment in the direction of reflection/extrapolation. Looking at deceptive alignment as a change of behavior not in response to capability gain, but instead as a change in response to stepping into a new situation, it’s then like a phase change in the (unchanging) mapping from situations to behaviors (local policies). The behaviors of a model suddenly change as it moves to similar situations, in a way that’s not “correctly prompted” by behaviors in original situations.
It’s like a robustness failure, but with respect to actual behavior in related situations, rather than with respect to some outer objective or training/testing distribution. So it seems more like a failure of reflection/extrapolation, where behavior in new situations should be determined by coarse-grained descriptions of behavior in old situations (maybe “behavioral invariants” are something like that; or just algorithms) rather than by any other details of the model. Aligned properties of behavior in well-tested situations normatively-should screen off details of the model, in determining behavior in new situations (for a different extrapolated/”robustness”-hardened model prepared for use in the new situations).
There might be a natural concept for this that reframes deceptive alignment in the direction of reflection/extrapolation. Looking at deceptive alignment as a change of behavior not in response to capability gain, but instead as a change in response to stepping into a new situation, it’s then like a phase change in the (unchanging) mapping from situations to behaviors (local policies). The behaviors of a model suddenly change as it moves to similar situations, in a way that’s not “correctly prompted” by behaviors in original situations.
It’s like a robustness failure, but with respect to actual behavior in related situations, rather than with respect to some outer objective or training/testing distribution. So it seems more like a failure of reflection/extrapolation, where behavior in new situations should be determined by coarse-grained descriptions of behavior in old situations (maybe “behavioral invariants” are something like that; or just algorithms) rather than by any other details of the model. Aligned properties of behavior in well-tested situations normatively-should screen off details of the model, in determining behavior in new situations (for a different extrapolated/”robustness”-hardened model prepared for use in the new situations).