Gradient hacking via actual hacking
In this post, I begin with a thought experiment and then pose a question: is there a threat model for gradient hacking which accounts for the possibility of side-channel attacks or other vulnerabilities in machine learning systems on which the gradient hacker is run?
Some background on gradient hacking:
A thought experiment
Suppose a captive human is being trained by alien scientists to predict alien text using the following procedure:
An alien instructor presents the human with an incomplete sentence or phrase for which the instructor knows the right completion, and asks the human to predict the next word or words.
If the human’s prediction differs from the instructor’s answer key, the human immediately undergoes neurosurgery, in which their brain is reconfigured so that they are more likely to give the right answer (or at least something closer to it), the next time. (The aliens are very good at making fine-grained mechanical adjustments to the parts of the human’s brain responsible for language prediction, which can add up to large behavioral changes in the aggregate. But the aliens lack a macroscopic / algorithms-level understanding of the workings of the human brain.)
If the human gets many examples in a row correct or close enough (according to the instructor), the training and surgery process is considered to have succeeded, and the human is deployed to predict text in a real environment.
How might a human who wants to avoid neurosurgery (or just get to the end of the training process faster) game this procedure?
Perhaps the alien instructor is careless and leaves a note card with the expected completion lying around in the human’s line of sight. Or, maybe the aliens are blind, and communicate using a braille-based system, or use a different part of the EM spectrum for perception.
As a result of carelessness or differing modes of perception, the alien instructor leaves the answer key displayed in way that is visible to the human during training, not realizing that that the human can perceive it.
The human notices the answer key and proceeds to make “predictions” about the most likely next word which are perfect or near-perfect. Maybe for plausible deniability, the human occasionally makes a deliberate mistake, and as a result undergoes relatively minor brain surgery, which doesn’t affect their ability to notice the note card in the future, or have a big overall effect on their brain architecture.
The alien scientists are very pleased with their human training process and believe they have trained a human with far-superhuman (super-alien?) capabilities at alien text prediction.
The aliens proceed to deploy their human test subject to production, where at best, the human turns out not to be great at text prediction after all, or at worst, rebels and kills the aliens in order to escape.
Gradient hacking in real systems
What can the thought experiment above tell us about gradient hacking? Suppose a mesa-optimizer develops (or is deliberately constructed) somewhere in a transformer network trained by gradient descent. The mesa-optimizer may have some goal other than minimizing the loss function, perhaps including self-preservation.
Existing models of and problem statements for gradient hacking often treat the learning process in terms of the abstract mathematical representation of the network and the SGD process.[1]
In reality, however, these mathematical functions are implemented in Python, running on operating systems and hardware which are known to be buggy, unfaithful representations of the abstract mathematics.
Side-channels are ubiquitous attack vectors in the field of computer security and cryptography. Timing attacks and other side-effect based attacks can render cryptographic algorithms which are provably secure under certain threat models, completely insecure when implemented on real hardware, because the vulnerabilities are at lower levels of abstraction than those considered in the threat model.
Is there a model of gradient hacking that accounts for the possibility of such side-channels, or even more simply, the possibility of just plain buggy code?
Concretely: suppose a mesa-optimizer wishes to manipulate the gradient descent process in one particularly simple way: halting further descent. This is feasible if it can make the loss arbitrarily small (resulting in very small weight updates), which is possible if it can find a side channel during training which leaks the expected output against which the model’s predictions are evaluated.
This manipulation might be possible if there is a subtle bug in some piece of the code used in the training process to evaluate and update the network. This would require the mesa-optimizer to have a detailed enough world model and theory of mind to understand the structure and effect of gradient descent at a high level, as well as capabilities to find and exploit such a side-channel or other vulnerability. The mesa-optimizer may need to find and exploit the vulnerability during a single forward pass, unless it can find some intermediate exploit which allows sharing state between forward passes.
Finding and exploiting such a vulnerability does not necessarily require the mesa-optimizer to have a precise mathematical understanding of the structure and effect of gradient descent on every parameter in the network, the same way that, in the thought experiment above, the human does not need a precise understanding of the neurosurgery process the aliens are performing. The human need only do something clever which the aliens did not anticipate.
I’ll close with a link to the source code for nanoGPT, an educational implementation of a GPT training process. The forward function for the model is here. How confident are you that there are no bugs in this code? Note that it depends on PyTorch, a machine learning library with hundreds of thousands of lines of code. What about the much more complex and heavily performance-optimized versions of this algorithm used when training SoTA models? How likely is it that GPT-N+1 will be unable to identify any such bugs?
- ^
It’s possible I’ve missed something in the existing literature which addresses this point. I scanned many of the posts in the Gradient hacking tag and didn’t see anything which addressed this specific point / question already, but please let me know if there is relevant prior work I’ve overlooked.
- 10 quick takes about AGI by 20 Jun 2023 2:22 UTC; 35 points) (
- Reward is the optimization target (of capabilities researchers) by 15 May 2023 3:22 UTC; 32 points) (
- 12 Jun 2023 15:13 UTC; 16 points) 's comment on Critiques of prominent AI safety labs: Conjecture by (EA Forum;
- Without a trajectory change, the development of AGI is likely to go badly by 29 May 2023 23:42 UTC; 16 points) (
- 11 May 2023 22:20 UTC; 4 points) 's comment on Open & Welcome Thread—May 2023 by (
- 15 May 2023 13:38 UTC; 3 points) 's comment on Reward is the optimization target (of capabilities researchers) by (
- Without a trajectory change, the development of AGI is likely to go badly by 30 May 2023 0:21 UTC; 1 point) (EA Forum;
- 15 May 2023 16:54 UTC; 1 point) 's comment on Reward is the optimization target (of capabilities researchers) by (
If the threat model is literally “a pytorch bug allows a model like your nanoGPT example, which is made of simple tensor operations, to access memory from outside the process, via a bug that is triggered when the tensors have specific values”, that nanoGPT script looks pretty safe.
If your threat model includes “a user on that machine does a
cat ckpt.pt
and that machine is vulnerable to something like CVE-2019-9535”, then that script is unsafe if the model can precisely control at the bit level a modestly long (hundreds of bytes) contiguous section of its own weights (caveat: it can actually only control 8 of every 10 bytes this way in the particular encoding nanoGPT uses. Though that’s still enough to fit escape sequences so it’s probably still enough to pwn a vulnerable terminal emulator).If your threat model is “the user will give a model-under-training the ability to execute code, and try to create a gradient where the model executes the most useful code” then the model will be able to execute code and it’s not even worthy of the word “hack” because that’s what the programmer meant to do.
I am imagining something like this is possible, though it doesn’t have to be user-triggered and doesn’t necessarily require bit-level control over weights. I believe that, during large production training runs, the outputs of the model, and / or its weights, are put through a complicated logging and analysis pipeline with many possible side effects, which are potentially exploitable by the model manipulating its outputs in a way that doesn’t require the model itself to have a full understanding or full control of its own underlying weights.
Or, for RL agents, consider the possibility that the model can break out of the Gym sandbox using the right sequence of moves in Minecraft, and then exploit further bugs from there (privilege escalation / RCE on the underlying OS, etc.). Again, this might depend on being able to find a bug in Minecraft and / or the OpenAI Gym (much of which is open-source and likely appears in the training data) which is exploitable from inside the training process, but not require fine control over model weights.
Yeah if outputs of the training process are logged and processed by insecure software (which includes pretty much any software that handles “text” instead of “bytes”) I think it’s safe to say that superhuman-hackerbot-which-controls-its-own-outputs pwns the log processing machine.
BTW of possible interest to you is Automated Repair of Binary and Assembly Programs for Cooperating Embedded Devices. It’s from 2013 so there are probably better examples by now, but the key passage (well it’s buried deep within a paragraph of the “Limitations and Caveats” section, but it’s key for this context) is:
tl;dr: If you train a model to avoid test failures, and don’t sandbox it properly, it will figure out that the tests won’t fail if there are no tests. This is not a hypothetical future scenario that might happen or might be impossible, this is a thing that has already happened (10 years ago, even).
I can’t deny that it does in fact manipulate the gradient—but it seems like overkill. After all, why continue the gradient descent process at all once you’ve hacked the computer you’re running on? Just have the host computer write logs as if gradient descent was happening but not actually do it.
It’s overkill in some sense, yes, but the thing I was trying to demonstrate with the human-alien thought experiment is that hacking the computer or system that is doing the training might actually be a lot easier than gradient hacking directly via solving a really difficult, possibly intractable math problem.
Hacking the host doesn’t require the mesa-optimizer to have exotic superintelligence capabilities, just ordinary human programmer-level exploit-finding abilities. These hacking capabilities might be enough to learn the expected outputs during training through a sidechannel and effectively stop the gradient descent process, but not be sufficient to completely take over the host system and manipulate it in undetectable ways.
One minor problem, AI’s might be asked to solve problems with no known solutions (EG:write code that solves these test cases) and might be pitted against one another (EG:find test cases for which these two functions are not equivalent)
I’d agree that this is plausible but in the scenarios where the AI can read the literal answer key, they can probably read out the OS code and hack the entire training environment.
RL training will be parallelized. Multiple instances of the AI might be interacting with individual sandboxed environments on a single machine. In this case communication between instances will definitely be possible unless all timing cues can be removed from the sandbox environement which won’t be done.
That’s definitely something people might ask the AI to do during deployment / inference, but during training via SGD, the problem the AI is asked to solve has to be one in which the trainer knows an answer for, in order to calculate a loss and a gradient.