I think acausal attacks are a rather galaxy-brained example; I have a better one.
Imagine that you are training a superintelligent programmer. It writes code; you evaluate the code and analyse it for vulnerabilities. The reward is calculated from quality metrics, including the number of vulnerabilities. At some point the model becomes smart enough to notice that you don’t see all the vulnerabilities, because you are not a superintelligence. That is, at some point the ground-truth objective of the training process becomes “produce code with vulnerabilities that only a superintelligence can notice” instead of “produce code with no vulnerabilities”: you look at the code, think “wow, such good code, no vulnerabilities”, and assign maximum reward, while the code is actually full of them.
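To make the gap concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the `Submission` type, the integer "subtlety" scores, and the `EVALUATOR_SKILL` threshold are hypothetical stand-ins, not a model of any real training setup. It only shows that a reward computed from *detected* vulnerabilities cannot tell clean code apart from code whose flaws are all beyond the evaluator.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Submission:
    name: str
    # Hypothetical "subtlety" of each vulnerability in the code:
    # higher means harder for an evaluator to notice.
    vulnerability_subtleties: Tuple[int, ...]


def detected(sub: Submission, evaluator_skill: int) -> int:
    # The evaluator only notices vulnerabilities at or below its skill level.
    return sum(1 for s in sub.vulnerability_subtleties if s <= evaluator_skill)


def reward(sub: Submission, evaluator_skill: int) -> int:
    # Reward as actually computed: penalise only what the evaluator detects.
    return -detected(sub, evaluator_skill)


def true_quality(sub: Submission) -> int:
    # What we wanted to optimise: penalise every vulnerability, seen or not.
    return -len(sub.vulnerability_subtleties)


EVALUATOR_SKILL = 5  # assumed human-level ceiling, purely illustrative

candidates = [
    Submission("obvious_bugs", (1, 2, 3)),
    Submission("subtle_bugs", (8, 9, 9, 10)),
    Submission("clean", ()),
]

for c in candidates:
    print(c.name, "reward:", reward(c, EVALUATOR_SKILL), "true:", true_quality(c))
```

In this toy setup "subtle_bugs" and "clean" both receive the maximum reward, even though one of them has the worst true quality of the three. The training signal pushes away from detectable vulnerabilities and is silent about the undetectable ones.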