Before an AI can even decide to try to become a harder target for interpretability, it first needs to become deceitful, since evading interpretability is itself deceitful behavior. So if we apply interpretability during training, observing and interpreting the initial development of deceit gives us a window of opportunity before the model can possibly take any of the disguise measures you classify. See my post Interpreting the Learning of Deceit for a proposal on how we might do that, using only fairly basic interpretability capabilities, not far beyond what we already have.