Some other important categories:
- Models might scheme but still retain some desirable preferences, heuristics, deontological constraints, or flinches, such as a reluctance to lie very directly.
- Scheming might only be sampled some fraction of the time, in an effectively random way (in particular, not in a way that strongly corresponds to scheming-relevant properties of the input), such that if you take many samples on a given input, you'll likely get some weight on a non-scheming policy; see the sketch after this list.
- There might be multiple scheming AIs with differing preferences, some of which might cooperate with humans against other AIs if a reasonable trade offer were available.
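To make the second point concrete, here's a minimal sketch under a toy assumption (the per-sample scheming probability `p` is hypothetical, not a measured value): if the scheming policy is sampled independently at random on each draw, the chance that all n samples are scheming is p^n, which shrinks quickly with n, so even a modest number of samples likely includes some weight on a non-scheming policy.

```python
# Toy illustration: assume each sample independently lands on the
# scheming policy with probability p (hypothetical number).
p = 0.9  # assumed per-sample probability of sampling the scheming policy

for n in (1, 5, 10, 20):
    all_scheming = p ** n          # chance every one of n samples schemes
    some_honest = 1 - all_scheming # chance at least one sample doesn't
    print(f"n={n:2d}: P(all scheming) = {all_scheming:.3f}, "
          f"P(at least one non-scheming) = {some_honest:.3f}")
```

With p = 0.9, twenty samples already give roughly an 88% chance of hitting at least one non-scheming sample; the real situation is presumably messier (samples need not be independent), but the exponential decay is the point.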