The issue I still see is: how do you recognize an AI executive that is trying to disguise itself?
It can't disguise itself without researching disguising methods first. The question is whether interpretability tools will be up to the task of catching it.
That will not work for catching an AI executive originating outside of a controlled environment (unless it queries an AI scientist). But given that such attempts will originate from uncoordinated, relatively computationally underpowered sources, it may be possible to preemptively enumerate the disguising techniques such an AI executive could come up with. If there are undetectable varieties... well, then it's mostly game over.
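To make the "preemptively enumerate" idea concrete, here is a minimal sketch, assuming each catalogued disguising technique leaves a characteristic direction in the model's activation space that interpretability work has identified ahead of time. Everything in it (the signature catalog, `flag_disguise_attempts`, the threshold) is hypothetical and only illustrates the screening step, not an actual detector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder catalog: in reality each entry would be a direction vector
# extracted by interpretability tooling for a known disguising technique.
DISGUISE_SIGNATURES = {
    "eval-sandbagging": rng.standard_normal(4096),
    "goal-obfuscation": rng.standard_normal(4096),
}

def flag_disguise_attempts(activations: np.ndarray, threshold: float = 0.8):
    """Return (signature name, cosine similarity) pairs exceeding threshold.

    `activations` is a captured activation vector from the model under test.
    """
    hits = []
    for name, direction in DISGUISE_SIGNATURES.items():
        cos = float(
            activations @ direction
            / (np.linalg.norm(activations) * np.linalg.norm(direction))
        )
        if cos > threshold:
            hits.append((name, cos))
    return hits

# Example: an activation vector aligned with a catalogued signature is flagged.
suspect = DISGUISE_SIGNATURES["goal-obfuscation"] + 0.1 * rng.standard_normal(4096)
print(flag_disguise_attempts(suspect))
```

The catch, as noted above, is that this only covers techniques someone thought to enumerate; a genuinely novel or undetectable variety would sail past any such catalog.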