Gradient hacking: definitions and examples
EDIT: I have now shifted my terminology as follows (format: old term-->new term):
“RL credit hacking”-->exploration hacking.
“Gradient descent hacking”-->gradient hacking.
“Gradient hacking”-->credit hacking.
That is, I now reserve “gradient hacking” for one specific sub-phenomenon, and use “credit hacking” as the overarching term which includes both exploration hacking and gradient hacking (along with other possibilities, like multi-agent credit hacking). However, I’ve kept the original terminology in this post.
Gradient hacking is a hypothesized phenomenon where:
A model has knowledge about possible training trajectories which isn’t being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren’t taken into account by local optimization algorithms).
The model uses that knowledge to influence its medium-term training trajectory (even if the effects wash out in the long term).
Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don’t last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming, or reward tampering), or until it’s deployed in the real world—especially in the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn’t very well-understood right now, both the definition above and the examples below should only be considered preliminary.
RL credit hacking examples
A policy might know that a certain sequence of actions leads to higher reward. But if it chooses not to take those actions, it won’t be rewarded for them, and that type of behavior won’t be reinforced (until it finds them via random exploration, which could take a very long time).
A policy might know that humans give mistaken feedback on a given trajectory, and then select that trajectory more frequently in order to update itself in a certain direction. When humans give mistakenly positive feedback, this is just an example of specification gaming. But we could also imagine the model deliberately inducing mistakenly negative feedback.
A policy might know which of its actions is actually responsible for receiving reward, even though the RL algorithm being used “doesn’t know”. Then the model can influence how much credit the RL algorithm gives different actions (e.g. by putting them earlier or later in the trajectory).
This seems easier for dumb RL algorithms like REINFORCE, and seems harder for actor-critic methods. However, finding algorithms that make use of all the knowledge a model has about the training landscape seems like it’ll require a full solution to the ELK problem.
A policy might know when it’s being trained off-policy versus on-policy (because off-policy it observes a trajectory of actions which it wouldn’t have taken), and choose actions differently in each case. A tentative possibility, related to the first example: during off-policy training, it could try novel “explore” actions which it knows would lead to high reward on-policy. Those actions would then be reinforced against (in favor of the action actually taken during that trajectory), since when performed off-policy the rewards are not actually seen.
In multi-agent RL, even if rewards are zero-sum, different policies could coordinate to choose which of them will receive the reward in which episodes, to influence which behaviors are reinforced.
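To make the credit-assignment example above concrete, here is a minimal sketch. It assumes a REINFORCE-style estimator with reward-to-go weighting, where each action's log-probability gradient is scaled by the sum of rewards received from that step onward. The post doesn't specify a variant, so that choice is an assumption. The action names "work" and "ritual", the rewards, and the helper functions are all hypothetical.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum over k >= t of gamma**(k - t) * r_k, for each step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))


def credit_per_action(trajectory, gamma=1.0):
    """Map each action name to the return that would weight its log-prob gradient."""
    actions = [action for action, _ in trajectory]
    rewards = [reward for _, reward in trajectory]
    return dict(zip(actions, rewards_to_go(rewards, gamma)))


# Hypothetical two-step episodes: "work" is the action that actually earns the
# reward, "ritual" is an irrelevant action the policy would like reinforced.
ritual_first = [("ritual", 0.0), ("work", 1.0)]
ritual_last = [("work", 1.0), ("ritual", 0.0)]

print(credit_per_action(ritual_first))  # "ritual" is weighted by the later reward
print(credit_per_action(ritual_last))   # "ritual" gets zero weight
```

Under these assumptions, the same two actions receive different credit purely because of their ordering, and ordering is something the policy controls. A learned critic or baseline would change the numbers, which is one way to read the claim that actor-critic methods seem harder to exploit.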
Gradient descent hacking examples
When local optima exist, a model might choose to do worse than it otherwise could, with the intention of changing its parameters in a certain direction. Intuitively speaking, gradient descent will then push in two directions: towards changing the parameters which implemented the decision to gradient hack, and towards changing the parameters which the model wanted to change (since their current values influenced its decision).
Gradient descent hacking is most compelling when relatively small changes in one direction could have outsized impacts on overall training trajectory. For example, at a saddle point in the loss landscape, a model might choose whichever output will lead to gradient descent pushing it in its preferred direction. When local optima exist, a model steering towards or away from a given local optimum might also have big effects.
This is similar to the classic example of gradient hacking described by Evan Hubinger. I don’t think it works as he described, but I think our main disagreement is whether a model can create new local optima by gradient hacking, rather than whether it’s possible for a model to steer itself at all via gradient hacking. One intuition for my position is that local optima are rare in neural network loss landscapes, because those landscapes are so high-dimensional.
In the absence of local optima, gradient descent will eventually modify the parameters responsible for the decision to gradient hack enough to prevent further gradient hacking. However, if the decision to gradient hack is driven by high-level reasoning and world-knowledge, the parameters responsible for that decision may be robust enough that the effects on the parameters the model is trying to change are much bigger.
Gradient descent hacking might seem strange because, in one sense, the loss landscape of gradient descent is fixed. But since the loss landscape depends on how the model itself thinks, we can view the model’s decisions as a way of “choosing” what its loss landscape will be. This is a type of subjunctive dependence—analogous to how it makes sense to think in terms of humans “choosing” actions even when those choices were “predetermined” by the laws of physics.
A recurrent model might know that some of its early thoughts won’t be reinforced by backpropagation due to vanishing/exploding gradients. So by thinking thoughts earlier or later, it could change the extent to which those thoughts are reinforced by backprop.
My intuitions here are particularly fuzzy, and I wouldn’t be surprised if it didn’t work. But there’s a biological analogy: classical conditioning. E.g. I can choose to do X right before Y, and then I’ll learn an association between X and Y which I wouldn’t have learned if I’d done X a long time before doing Y.
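The saddle-point intuition above can be illustrated with a toy two-parameter loss. This is only a sketch: the loss f(x, y) = x² + (y² − 1)², the step size, and the `nudge` variable are all assumptions chosen for illustration, and the nudge merely stands in for whatever output choice would tilt the gradient. It is not a model of how a real network could do this. The loss has a saddle at the origin and two equally good minima at (0, ±1).

```python
def grad(x, y):
    """Gradient of the toy loss f(x, y) = x**2 + (y**2 - 1)**2."""
    return 2.0 * x, 4.0 * y * (y * y - 1.0)


def descend(x, y, lr=0.05, steps=200):
    """Plain gradient descent from (x, y); returns the final point."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Start almost exactly on the ridge y = 0 that leads into the saddle.
nudge = 1e-3  # hypothetical tiny perturbation, "chosen" by the model
up = descend(1.0, +nudge)
down = descend(1.0, -nudge)

print(up)    # ends near the minimum at (0, +1)
print(down)  # ends near the minimum at (0, -1)
```

A perturbation of size 0.001 decides which basin gradient descent ends up in. In this toy, the small change is amplified by the dynamics near the saddle and does not need to be sustained against the gradient.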
I expect that all of the effects discussed above could be much stronger in environments which violate standard RL/SL abstractions. I think it’s still worth having the central examples of gradient hacking be defined within the abstractions, since that makes it easier to see that there’s something there. However, it’s probably worth constructing examples which can lead to big changes (e.g. creating new local optima) given small violations of the standard abstractions such as:
An RL environment being non-stationary even in small ways (e.g. if a policy can leave itself persistent notes).
A model knowing the ordering of data it’ll be trained on (which may in fact be a good formal model of gradient hacking: imagine a supervised model which gets to choose the order it sees data).
A model knowing the step size being used to train it. (I’m curious whether, in theory, a model could deduce this directly; and whether adding this information to its observations breaks standard convergence proofs.)
A model exploiting implementation details like rounding errors (although this seems pretty fiddly).
A model generating data on which it’s later trained (as discussed in Leo Gao’s post on gradient hacking).
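The data-ordering item above can be sketched in a few lines. The setup is hypothetical: a single scalar parameter `w`, the per-example loss 0.5 · (w − t)², and a brute-force `best_order` helper that stands in for a model "choosing" the order in which it sees its data. None of these names come from the post.

```python
from itertools import permutations


def sgd_pass(w, targets, lr=0.5):
    """One pass of SGD on the per-example loss 0.5 * (w - t)**2."""
    for t in targets:
        w = w - lr * (w - t)
    return w


def best_order(w0, targets, preferred, lr=0.5):
    """Brute-force the ordering whose final parameter is closest to `preferred`."""
    return min(
        permutations(targets),
        key=lambda order: abs(sgd_pass(w0, order, lr) - preferred),
    )


data = [0.0, 1.0]
print(sgd_pass(0.5, data))                  # 0.625
print(sgd_pass(0.5, list(reversed(data))))  # 0.375

# With three examples, the ordering that ends nearest to a preferred value of 2.0
print(best_order(1.0, [0.0, 1.0, 2.0], preferred=2.0))  # (0.0, 1.0, 2.0)
```

The dataset is identical in every run, so only the ordering differs, yet after a finite number of steps the parameter lands in different places. With a decaying step size and many passes the gap would shrink, which matches the point that such effects may matter in the medium term even if they wash out eventually.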
Having said all this, I do have some concern that the concept of gradient hacking won’t be very useful, because as policies get more and more intelligent, more and more of their behavior will be driven by generalization (in particular via high-level reasoning) which by definition is not “taken into account” by training algorithms. For example, there are several ways in which very general reasoning about how to achieve goals could be interpreted as gradient hacking, which suggests that “gradient hacking” isn’t a very natural category:
Humans often reason about our goals in order to produce more coherent versions of them. Since we know while doing the reasoning that the concepts we produce will end up ingrained as our goals, this could be seen as a form of gradient hacking.
A model might know that “take over the datacenter” will lead to high reward, but never do it because it has learned the goal of obedience to humans. Under my definition this would count as gradient hacking; again, though, it seems like a central example of straightforwardly pursuing a learned goal.
Deceptive alignment could be seen as a type of gradient hacking, but it’s a valuable strategy for any misaligned policy that’s still under human control, whether or not it’s undergoing optimization.
Multi-agent collusion (as described above) could also be seen as a type of gradient hacking, but again it’s a very broad strategy which isn’t just restricted to training episodes.
So while techniques specifically aimed at preventing gradient hacking may be useful on the margin, I currently expect that the category of “gradient hacking” is most useful for framing the broader problem of misaligned reasoning in a way that fits more easily into existing ML frameworks.
I don’t think this is true:
I could not find any study that tests this directly, but I don’t expect conditioning to work if you yourself cause the unconditioned stimulus (US), which is Y in your example. My understanding of conditioning is that if there is no surprise, there is no learning. For example: if you first condition an animal to expect A to be followed by C, and then expose it to A+B followed by C, it will not learn to associate B with C. This is a well-replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise).
Does this matter for understanding gradient hacking in future AGIs? Maybe?
Since humans are the closest thing we have to an AGI, it does make sense to try to understand things like gradient hacking in ourselves. Or if we don’t have this problem, it would be very interesting to understand why not.
Are there other examples of biological gradient hacking?
(1)
I heard that whatever you do while taking nicotine will be reinforced (I don’t remember the source, but it seems plausible to me). But this would be more analogous to directly overwriting the backprop signal than to manipulating the gradient by controlling the training data. If we end up with an AI that can straightforwardly edit its outer learning regime in this way, then I think we are outside the scope of what you are talking about. However, if this nicotine hack works, it is interesting that it is not used more. Maybe the effect is not strong enough to be useful?
(2)
You give another example:
I can’t decide if I think this should count as gradient hacking.
(3)
I know that I to some extent absorb the values of people around me, and I have used this for self-manipulation. This is the best analogue of gradient hacking I can think of for humans. Unfortunately I don’t expect this to tell us much about AIs, since this method depends on a specific human drive towards conformism.
I’m curious whether the opposite strategy works for contrarians: if you want to self-manipulate, should you hang out with people who believe/value the opposite of what you want yourself to believe/value?
Can you provide a citation? I don’t think this is true. My reading of this is that (if you’re training a dog) you can start with an unconditioned stimulus (sight of food) which causes salivating, and then you can add in the sound of a bell with the sight of food, and this also elicits salivating. And then you can remove the sight of food but still have the bell and the dog is likely to salivate. I don’t think you need to have a surprise to have learning in this context, you just need associations/patterns built up over time. Perhaps I’m misunderstanding you.