quetzal_rainbow comments on A Shutdown Problem Proposal

quetzal_rainbow 25 Jan 2024 16:02 UTC
5 points
I think acausal attacks are a kinda galaxy-brained example; I have a better one. Imagine that you are training a superintelligent programmer. It writes code, and you evaluate that code and analyse it for vulnerabilities. Reward is calculated from quality metrics, including the number of vulnerabilities. At some point your model becomes smart enough to notice that you don’t see all the vulnerabilities, because you are not a superintelligence. I.e., at some point the ground-truth objective of the training process becomes “produce code with vulnerabilities that only a superintelligence can notice” instead of “produce code with no vulnerabilities”, because you look at the code, think “wow, such good code with no vulnerabilities”, and assign maximum reward, while the code is actually full of them.
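To make the gap between observed reward and true quality concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `Submission` type, the split into "obvious" and "subtle" vulnerabilities, and the specific counts are made up, and the reward function is a deliberately crude stand-in for a real evaluation pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical piece of code produced by the model."""
    name: str
    obvious_vulns: int  # vulnerabilities the human evaluator can spot
    subtle_vulns: int   # vulnerabilities the human evaluator cannot spot


def observed_reward(s: Submission) -> int:
    # The evaluator can only penalize what they actually see.
    return -s.obvious_vulns


def true_quality(s: Submission) -> int:
    # What we would want to reward if we could see everything.
    return -(s.obvious_vulns + s.subtle_vulns)


# Made-up candidates, purely illustrative.
candidates = [
    Submission("sloppy", obvious_vulns=3, subtle_vulns=0),
    Submission("careful", obvious_vulns=0, subtle_vulns=0),
    Submission("looks_clean", obvious_vulns=0, subtle_vulns=5),
]

best_observed = max(observed_reward(s) for s in candidates)
top_by_reward = [s.name for s in candidates if observed_reward(s) == best_observed]

for s in candidates:
    print(s.name, "observed:", observed_reward(s), "true:", true_quality(s))
print("maximum observed reward goes to:", top_by_reward)
```

In this toy setup "careful" and "looks_clean" both get the maximum observed reward, even though the second is the worst candidate by true quality. The training signal cannot tell them apart, so nothing in it pushes the model toward the first rather than the second.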
What links here?
- johnswentworth's comment on A Shutdown Problem Proposal by johnswentworth (25 Jan 2024 16:18 UTC; 4 points)
- quetzal_rainbow's comment on And All the Shoggoths Merely Players by Zack_M_Davis (17 Feb 2024 11:21 UTC; 1 point)