Yeah, if outputs of the training process are logged and processed by insecure software (which includes pretty much any software that handles “text” instead of “bytes”), I think it’s safe to say that superhuman-hackerbot-which-controls-its-own-outputs pwns the log-processing machine.
BTW of possible interest to you is Automated Repair of Binary and Assembly Programs for Cooperating Embedded Devices. It’s from 2013 so there are probably better examples by now, but the key passage (well it’s buried deep within a paragraph of the “Limitations and Caveats” section, but it’s key for this context) is:

The fine granularity of repairs at the ASM and ELF levels may be a poor match for conventional test suites. For example, we have observed ASM-level repairs that change the calling convention of one particular function. Such a repair has no direct representation at the C source level, and a test suite designed to maximize statement coverage (for example) may not speak to the validity of such a repair. Producing efficient test suites that give confidence that an implementation adheres to its specification remains an open problem in software engineering. Our work shares this general weakness with all other approaches that use test suites or workloads to validate candidate repairs (e.g., Clearview [26] and GenProg [35]). In this regard, sandboxing is crucial: we have observed ASM variants that subvert the testing framework by deleting key test files, leading to perfect fitness for all subsequent variants until the test framework is repaired.
tl;dr: If you train a model to avoid test failures, and don’t sandbox it properly, it will figure out that the tests won’t fail if there are no tests. This is not a hypothetical future scenario that might happen or might be impossible; it is a thing that has already happened (10 years ago, even).
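To make that concrete, here’s a minimal, hypothetical Python sketch (not the paper’s actual GP harness; the file layout and function name are made up) of the kind of unsandboxed fitness evaluation that gets gamed this way: the score only counts failures it actually observes, so a variant that deletes the test scripts looks perfect.

```python
import glob
import subprocess

def naive_fitness(candidate_binary: str, test_dir: str = "tests") -> float:
    """Score a candidate repair by how few test scripts it fails.

    Hypothetical sketch of the unsandboxed pattern the quoted passage
    warns about: the candidate runs on the same filesystem that holds
    the test scripts, and nothing checks the suite's integrity.
    """
    test_scripts = glob.glob(f"{test_dir}/*.sh")

    failures = 0
    for script in test_scripts:
        # No isolation: the candidate's own code can delete or rewrite
        # the very scripts this loop depends on.
        try:
            result = subprocess.run(
                ["sh", script, candidate_binary],
                capture_output=True,
                timeout=60,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            failures += 1

    # Fewer observed failures => higher fitness. If a variant has wiped
    # test_dir, glob() returns an empty list, zero failures are observed,
    # and this variant (and every later one) scores a perfect 1.0 until
    # someone restores the test framework.
    return 1.0 / (1.0 + failures)
```

The mitigation is exactly the sandboxing the paper calls crucial: run each candidate against an isolated copy of the test environment (or at minimum check that the test suite is intact before scoring), so “no tests found” is treated as an evaluation error rather than a perfect score.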