Detecting the bad circuit might be possible in practice, but there does exist a case where this doesn’t work. Your model could implement “If RSA-2048 is factored, do bad” in an arbitrarily obfuscated way, and it’s in general computationally expensive to tell if two circuits implement the same function. It could require arbitrarily large changes to the model to make it think that it saw RSA2048.
Detecting the bad circuit might be possible in practice, but there does exist a case where this doesn’t work. Your model could implement “If RSA-2048 is factored, do bad” in an arbitrarily obfuscated way, and it’s in general computationally expensive to tell if two circuits implement the same function. It could require arbitrarily large changes to the model to make it think that it saw RSA2048.