A deceptive model doesn’t need an explicit check for whether it’s in training or deployment, any more than a factory-cleaning robot needs an explicit check for whether it’s in a jungle instead of a factory. If it someday found itself in a situation very different from its current one (training), it would reconsider its actions, but it rarely thinks about that possibility, because during training such a situation just looks too unlikely.