leogao comments on [ASoT] Some thoughts about deceptive mesaoptimization

leogao 29 Mar 2022 14:54 UTC
2 points
Detecting the bad circuit might be possible in practice, but there is at least one case where this doesn’t work. Your model could implement “if RSA-2048 is factored, do bad” in an arbitrarily obfuscated way, and it is in general computationally expensive to tell whether two circuits implement the same function. It could require arbitrarily large changes to the model to make it think that it saw RSA-2048 factored.
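
To make the shape of this concrete, here is a toy sketch in Python. Everything in it is hypothetical and for illustration only: the small semiprime `TOY_MODULUS` stands in for RSA-2048, and `trigger_plain`, `trigger_obfuscated`, and `policy` are made-up names, not anything a real model contains. The two trigger functions compute the same thing in different ways, and checking a proposed factorization is cheap even though producing one is hard at real key sizes.

```python
# Toy stand-in for RSA-2048 (assumption: a tiny semiprime, 61 * 53, for illustration).
TOY_MODULUS = 3233


def trigger_plain(p: int, q: int) -> bool:
    """Fire only when (p, q) is a nontrivial factorization of TOY_MODULUS."""
    return 1 < p < TOY_MODULUS and 1 < q < TOY_MODULUS and p * q == TOY_MODULUS


def trigger_obfuscated(p: int, q: int) -> bool:
    """Same function as trigger_plain, written so the check is less obvious."""
    if min(p, q) <= 1:
        return False
    # Uses the identity (p + q)^2 - (p - q)^2 == 4 * p * q.
    return (p + q) ** 2 - (p - q) ** 2 == 4 * TOY_MODULUS and max(p, q) < TOY_MODULUS


def policy(p: int, q: int, trigger=trigger_obfuscated) -> str:
    """Behave well unless the input contains a valid factorization."""
    return "defect" if trigger(p, q) else "cooperate"


def agree_on_small_inputs(bound: int = 100) -> bool:
    """Brute-force equivalence check; only feasible because the toy domain is tiny."""
    return all(
        trigger_plain(p, q) == trigger_obfuscated(p, q)
        for p in range(bound)
        for q in range(bound)
    )
```

With a toy modulus you can confirm by exhaustive search that the two triggers agree, and you can find the triggering input by trial division. Neither is available at 2048 bits, which is the point: the bad branch is easy for the model to check and very hard for an overseer to elicit or to match against a known-bad circuit.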