I think that for the remaining 5% to be hiding really big, important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, there would have to have been adversarial obfuscation (e.g. gradient hacking). Of course, I'm only hypothesizing here, but it seems quite unlikely for that sort of thing to just be randomly obfuscated.
I read Adversarial Examples Are Not Bugs, They Are Features as suggesting that this sort of thing happens by default, so the main question becomes "sure, some of it happens by default, but can really big stuff happen by default?". But if you imagine an LSTM implementing a finite state machine, or something like that, it seems quite possible to me that it will mostly be hard to unravel rather than easy to unravel, while still being a relevant part of the computation.
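To gesture at what I mean by "hard to unravel", here's a toy numpy sketch (my own illustration, not anything from the paper): a recurrent update that exactly implements a small finite state machine, a mod-3 counter, but with the state stored in a randomly rotated basis. I'm using a linear RNN rather than an actual LSTM, and the mixing matrix Q is just something I made up for the example, but it shows how a computation can be a fully real, load-bearing part of the network while no individual hidden unit lines up with an FSM state.

```python
import numpy as np

# Toy sketch (my own, hypothetical example): a deterministic FSM, here a
# mod-3 counter over a bit stream, implemented exactly by a linear recurrent
# update, but with the state distributed across hidden units via a random
# orthogonal change of basis. The computation is real and exact, yet reading
# the FSM off individual units or weights is no longer straightforward.

rng = np.random.default_rng(0)
n_states = 3

# Transition matrices (columns = current state, rows = next state).
# Input 1 advances the counter, input 0 holds.
T = {
    0: np.eye(n_states),
    1: np.roll(np.eye(n_states), 1, axis=0),  # state s -> (s + 1) mod 3
}

# Random orthogonal basis: the "obfuscating" mixing of the hidden state.
Q, _ = np.linalg.qr(rng.normal(size=(n_states, n_states)))

# Recurrent weights in the mixed basis: W_x = Q T_x Q^T.
W = {x: Q @ T[x] @ Q.T for x in (0, 1)}

def run(bits):
    h = Q @ np.eye(n_states)[0]      # start in state 0, mixed basis
    for x in bits:
        h = W[x] @ h                 # linear recurrent update
    return int(np.argmax(Q.T @ h))   # decode only via the inverse basis

bits = [1, 1, 0, 1, 1, 1]                     # five 1s -> state 5 mod 3 = 2
print(run(bits))                              # 2
print(np.round(Q @ np.eye(n_states)[2], 2))   # "state 2" spread over all units
```

An interpretability tool looking at the mixed-basis weights W sees dense matrices with no obvious structure, even though the FSM is sitting right there; finding it requires recovering something like Q, which nothing in the weights hands you for free.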