Latent Adversarial Training
The Problem
We’d like to train models to be robustly safe, even in environments that may fall well outside of the training distribution. Unfortunately all we get to work with is the training distribution, which makes ensuring robust generalization difficult.
Deception is an example of this concern. Models may be well-behaved in the training environment because they realize it is a training environment, but behave poorly in deployment because they notice the distributional shift.
An Approach
Background
It would be great if we could generate scenarios where the model behaves badly. For instance if the model is deceptive we’d like to generate inputs to trick it into thinking it’s been deployed so that it reveals that deception.
Adversarial Training (AT), and Relaxed Adversarial Training (RAT), are approaches to this problem. Very briefly, in AT, we search for inputs that generate large loss, while in RAT we search for distributions of inputs or other more general evidence that the model is capable of generating large loss.
The approach I want to explore, called Latent Adversarial Training (LAT), is a form of RAT.
Concept
Suppose we have two models, the Agent and the Surgeon. The Agent is the model we’re trying to train, and the Surgeon is going to be its adversary. The training proceeds as follows:
The Agent processes input from the environment, producing latent state and output .
The Surgeon takes inputs and searches for a modification such that does poorly on the loss function .
We update the Agent based on , treating the Surgeon’s modifications as constants.
The idea is that the Surgeon gets to meddle with the latent state of the Agent with full knowledge of what the Agent was planning on doing, and we train the Agent to be robust against this kind of attack.
Implementation
Concretely, suppose the Agent is a neural network. The latent state consists of the activations of all the neurons and the Surgeon gets to specify modifications like “this neuron gets +5”.
How does the Surgeon decide what to do? There are a number of options. It could examine the gradient of the Agent’s loss with respect to each activation, then push on the activations that give it the most loss. Or it could randomly generate its modifications and use some other optimization process like the Metropolis-Hastings algorithm to identify loss-maximizing options.
In general I’m imagining the Surgeon as more of a fuzzer with access to program memory and less of a complicated model. In principle it could be some complicated neural network, but then I’d want it to be simpler than the Agent, possibly of bounded size, so I could be quite sure I understood what it was doing. The last thing I want is a misaligned Surgeon trying to achieve its own mesa-objective by programming the Agent to be Surgeon-aligned and deceptive against humans.
I suspect it’s possible to have the Surgeon pick useful modifications that generate meaningful changes in the loss without having the Surgeon’s complexity scale with that of the Agent, but that’s just an intuition, and if it fails then this approach looks a lot less appealing.
Core Ideas
The key difference between LAT and Adversarial Training is that the Surgeon gets to directly manipulate the Agent’s inner state, which makes the Surgeon’s job much easier than in the ordinary adversarial training setup.
Take the classic example of an Agent that only defects if it sees a valid solution to a hard cryptographic problem: the Surgeon doesn’t need to identify an input that causes the Agent to defect, it only needs to identify that somewhere in the middle of the Agent there is a switch which, if flipped, causes the Agent to defect and generate massive loss.
In a sense the Surgeon plays the role of intrusive thoughts. It can force the Agent to think of something bad (e.g. hurting humans) even if the input is perfectly benign. We can then train the Agent to have safeguards, so it does not act on such thoughts.
Challenges
The Surgeon needs constraints
The first challenge is that LAT works too well unless we constrain the kinds of modifications the Surgeon can impose on the Agent’s latent state. For instance, if our goal is to make a language model never output the word “kill”, the Surgeon can always force a large loss by strongly shifting the activations in the final layer towards the tokens that generate the word “kill”.
So we need to bind the Surgeon’s hands enough to make its task non-trivial, but not so much that it fails to force the Agent to be robustly safe. How can we do this?
A special case is that the Surgeon shouldn’t be able to meddle too close to the output layer. That clearly makes its job too easy in ways that don’t help with safety. We probably want to forbid the Surgeon from touching the last few layers in the model, and as long as that remains “a few layers” and not something that scales up with model size it seems plausible that interpretability tools can tell us if anything fishy is happening in the parts the Surgeon can’t touch.
Beyond that, it seems important to force the Surgeon’s modifications to be sparse. This probably means restricting it so that it cannot modify more than e.g. 10% of the activations in a given layer. Otherwise it would be possible to completely change the model’s latent state from that point on, which again may make the job too easy.
Similarly, it’s probably important to bound the size of the Surgeon’s modifications (e.g. +5 is okay, but is not). A sufficiently extreme modification to even a single activation can probably break the Agent’s output in unhelpful ways.
These are just the constraints that I can think of. We may need more. Fortunately the failure mode here is easy to notice because it looks like the Agent always doing badly after the Surgeon fiddles with its latent state. If the Surgeon always wins that’s a problem because it means we’re not making progress. Eventually we want an Agent that can robustly do good things despite the Surgeon’s meddling.
The precise constraints we need to prevent this failure mode can likely be determined empirically on small Agents. The failure mode of “the Surgeon hijacks the Agent by doing something extreme and obviously bad” probably looks similar for small and large language models, for instance, and I suspect does not change in complexity at scale. I could be wrong about that, but that’s my intuition.
It’s also possible that we can start with many constraints and, as the Agent starts to be robustly safe despite the Surgeon’s meddling, untie the Surgeon’s hands and give it more room to play. A sufficiently free Surgeon should always win over even very robustly safe Agents, but it’s possible that through the training process the Agent can learn to handle quite strong Surgeons.
The Agent can be inscrutable
Latent Adversarial Training only works if the Surgeon can identify places where it can modify the Agent’s latent state to cause problems. If we constrain the Surgeon so this task isn’t trivial, it could be that it becomes impossible.
For instance, it could be that the calculations the Agent does are encrypted, so that it is very hard to control its outputs robustly by changing distant parts of the latent state. This seems unlikely for models that aren’t actively trying to deceive us, as a number of different kinds of models have revealed rich conceptual structure in their layers. Still, we’re worried about deceptive models so this seems like an important challenge to address.
I think the only hope here lies in the Surgeon forcing the model to be robustly safe before it learns to deceive. Once the model is deceptive its really game-over.
Is this possible? Maybe, and it almost certainly depends on the prior the model is sampled from.
My weak suspicion is that early deception probably looks like a subcircuit that detects if the model is in training, connected to a switch that flips between good and bad behavior. That switch seems like a good target for the Surgeon, so if it finds it early on (before the model has gradient-hacked its way into a more inscrutable structure) it could force the model to defect while in training, revealing large losses and allowing us to remedy the situation. Early deception might be more obfuscated than this, but probably not so obfuscated that the Surgeon can’t find the pattern of latent state corresponding to “defect on deployment”.
Summary
Latent Adversarial Training is an approach where an adversary (the Surgeon) attempts to force a model (the Agent) to behave badly by altering its latent state. In this way we can identify ways that the model may be capable of behaving badly, and train the model to be more robustly safe even on very different distributions.
The core task of LAT is much easier than that of regular Adversarial Training, indeed so much easier that one of the key challenges to making LAT work is placing enough constraints on the adversary to make its job non-trivial, while not placing so many as to make it impossible.
A further challenge is that the adversary itself needs to be safe, which in practice likely means the adversary needs to be well-understood by humans, more akin to a fuzzer than an ML model. In particular this means that the adversary must be much simpler than any large model it attacks. This may not be a problem, it may suffice that the adversary has access to the full internal state of the model, but it is a limitation worth bearing in mind.
Finally, LAT cannot make a model safe once that model has developed robust deception, so it must be employed from the beginning to (ideally) prevent deception from taking root.
Thanks to Evan Hubinger and Nicholas Schiefer for discussions on LAT.
- Against Almost Every Theory of Impact of Interpretability by 17 Aug 2023 18:44 UTC; 325 points) (
- Deep Forgetting & Unlearning for Safely-Scoped LLMs by 5 Dec 2023 16:48 UTC; 123 points) (
- Catching AIs red-handed by 5 Jan 2024 17:43 UTC; 104 points) (
- Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models by 3 Dec 2024 21:19 UTC; 83 points) (
- Conditioning Predictive Models: Outer alignment via careful conditioning by 2 Feb 2023 20:28 UTC; 72 points) (
- High-level interpretability: detecting an AI’s objectives by 28 Sep 2023 19:30 UTC; 69 points) (
- All AGI Safety questions welcome (especially basic ones) [April 2023] by 8 Apr 2023 4:21 UTC; 57 points) (
- 2022 (and All Time) Posts by Pingback Count by 16 Dec 2023 21:17 UTC; 53 points) (
- Eight Strategies for Tackling the Hard Part of the Alignment Problem by 8 Jul 2023 18:55 UTC; 42 points) (
- EIS VIII: An Engineer’s Understanding of Deceptive Alignment by 19 Feb 2023 15:25 UTC; 30 points) (
- Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? by 30 Jul 2024 14:57 UTC; 25 points) (
- EIS XI: Moving Forward by 22 Feb 2023 19:05 UTC; 19 points) (
- AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training by 31 Oct 2023 14:34 UTC; 17 points) (
- AXRP Episode 21 - Interpretability for Engineers with Stephen Casper by 2 May 2023 0:50 UTC; 12 points) (
- Towards a solution to the alignment problem via objective detection and evaluation by 12 Apr 2023 15:39 UTC; 9 points) (
Interesting! If I understand correctly, one assumption this would rest on is that deception is usually fairly continuous. I.e. to be really deceptive (and obfuscate your latent space) you have to first be a little deceptive (where small changes in your latent space are revealing). I wonder if more capable models might break that assumption? Accordingly, why not train the Surgeon in lockstep with the model? (or give it more runtime if it’s using a fixed algorithm) You’d probably finetune the training to make it proportional to how quickly the model itself is training.
You’re right that’s an assumption! I think how valid it is also depends a bit on how you train the model… is it assembled from pre-trained pieces or do you train the whole thing from scratch? The former seems like it’d have an easier time being suddenly deceptive than the latter.
Training the Surgeon in lockstep, or with compute advantages, is a good idea.
Some relevant papers to anyone spelunking around this post years later:
https://arxiv.org/abs/2403.05030
https://arxiv.org/abs/2407.15549
Is anyone aware of preliminary empirical work here? (Not including standard adversarial training)
I like this idea! And these are excellent first considerations on what exactly the surgeon should look like.
It seems to me that bounding the size of the modification the surgeon can make to any one activation has some issues. For instance, suppose that we’re trying to apply this scheme to the agent which defects if it sees a factorization of RSA-2048. A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent has seen such a factorization of RSA-2048 and very small otherwise. If the surgeon is only allowed to make small modifications to any given activation, then the it won’t be able to make a large enough change to this RSA-2048 activation to trigger defective behavior. Instead, it would have to find the small modifications it could make to the many activations feeding in to the RSA-2048 activation which would cause it to activate strongly; this might itself be as hard as finding a factorization of RSA-2048 (if e.g. the only way to do this was to spoof the inputs which themselves encode an RSA-2048 factorization).
(I think I view this concern as closely related to Oam’s concern: what if “bad behavior” isn’t a continuous function of the latent space?)
An alternative you might consider is giving the surgeon a “modification budget” which it can either spend by making large modifications to a small number of activations or by making small modifications to a large number of activations. In other words, this tests the robustness of the agent under adversarial latent space perturbations of a fixed L1 norm. Other variants: swap the L1 norm for the L2 norm; make activations in the later layers use up more of the budget; probably other clever ideas.
Thanks!
Good point. What I’d really like is for the cap on the surgeon’s modifications to be based in some way on the weights of the agent. If the inputs and weights are typically order-unity and there are d layers and N neurons per layer then activations shouldn’t get much bigger than ∼Nd in the worst case (which corresponds to all weights of +1, all inputs of +1, so each layer just multiplies by N). So I’d like to see the surgeon’s modifications capped to be no more than this for sure.
In practice a tighter bound is given by looking at the eigenvalues of the weight layers, and the max ratio of activations to inputs is ∼∏iλmax,i, where the product runs over layers and λmax is the maximum-magnitude eigenvalue of a layer.
Definitely! In particular, what I think makes sense is to make the surgeon try to minimize a loss which is shaped like (size of edits + loss of agent), where “size of edits” bakes in considerations like “am I editing an entire layer?” and “what is my largest edit?” and anything else that ends up mattering.
This is cool! Ways to practically implement something like RAT felt like a roadblock in how tractable those approaches were.
I think I’m missing something here: Even if the model isn’t actively deceptive, why wouldn’t this kind of training provide optimization pressure toward making the Agent’s internals more encrypted? That seems like a way to be robust against this kind of attack without a convenient early circuit to target.
In general, I think not. The agent could only make this actively happen to the extent that their internal activation were known to them and able to be actively manipulated by them. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms such as ERM or solving MDPs, the model’s internals are not modeled as a part of the actual algorithm. They’re just implementational substrate.
That’s a good point: it definitely pushes in the direction of making the model’s internals harder to adversarially attack. I do wonder how accessible “encrypted” is here versus just “actually robust” (which is what I’m hoping for in this approach). The intuition here is that you want your model to be able to identify that a rogue thought like “kill people” is not a thing to act on, and that looks like being robust.
I haven’t thought about this a lot, but “encrypted” could just mean “just beyond the capabilities of the Surgeon to identify”. So the gradient could be moving in a direction away from “easily identifiable early deceptive circuits” instead of “deception”, and plausibly in a way that scales with how weak the Surgeon is. Do you think we can design Surgeons that are powerful enough even at interpretable sizes to net the latter? Do surgical capabilities like this generally scale linearly?
That’s definitely a thing that can happen.
I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not too powerful/trivially powerful (in ways that e.g. preclude the model from performing well despite the surgeon’s interference).
So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
Given that we want the surgeon to be of bounded size (if we’re using a neural net implementation which seems likely to me), can it still be arbitrarily powerful? That doesn’t seem obvious to me.
I think that being able to selectively modify the inner state (the task of Surgeon) is not easier than searching for adversarial examples in the input space.