If you want to produce warning shots for deceptive alignment, you’re faced with a basic sequencing question. If the model is capable of reasoning about its training process before it’s capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won’t be detectable.
If you want to produce warning shots for deceptive alignment, you’re faced with a basic sequencing question. If the model is capable of reasoning about its training process before it’s capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won’t be detectable.