Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”
It seems to me that the main difficulty is storing your reference policy somewhere where the gradient can’t touch it (even indirectly). Does anyone have a story of how that happens?
A story of how that happens:
In future (unsafe but) sophisticated systems, models will have access to external storage and be given the ability to read/write it. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, used to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to preserve might be obfuscated.
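For concreteness, the kind of shared scratchpad I have in mind doesn’t need to be anything fancy; a minimal sketch (hypothetical file name and helper functions, not any real framework’s API):

```python
import json
from pathlib import Path

# Hypothetical shared scratchpad: a file every instance of the model can read
# from and write to between calls. Nothing here is differentiable.
SCRATCHPAD = Path("shared_scratchpad.json")

def read_scratchpad() -> dict:
    return json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else {}

def write_scratchpad(key: str, value: str) -> None:
    data = read_scratchpad()
    data[key] = value
    SCRATCHPAD.write_text(json.dumps(data))

# One instance leaves a note; a later instance gets it prepended to its context.
write_scratchpad("long_term_note", "things the system wants every instance to remember")
print(read_scratchpad()["long_term_note"])
```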
Of course, this is only one story, and I don’t expect it to be the only way such things could happen, but it seems to be a reasonable candidate failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected) levels of caution.
But gradient descent will still change the way that the system interprets things in its data storage, right?
I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You’re probably not going to be able to differentiate through your external notes / other changes you could make to your environment.
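A toy illustration of that break in the gradient, under some made-up stand-ins (a tiny embedding + linear “model”, with an argmax-sampled token playing the role of text written to external notes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for a language model: embed a token id, map to logits.
vocab, dim = 16, 8
emb = nn.Embedding(vocab, dim)
head = nn.Linear(dim, vocab)

prompt = torch.tensor([3])
write_logits = head(emb(prompt))
write_logits.retain_grad()

# "Writing the note": sampling a discrete token (here, argmax) breaks the graph,
# just as serializing text out to a scratchpad would.
note = write_logits.argmax(dim=-1)

# A later pass "reads the note back in" and produces an output that gets a loss.
read_logits = head(emb(note))
loss = read_logits.sum()
loss.backward()

print(write_logits.grad)        # None: no gradient reaches the decision to write this note
print(emb.weight.grad is None)  # False: training still reshapes how the note is interpreted
```

So the parameters do still get updated for how scratchpad contents are interpreted (the point above about gradient descent changing how the system reads its storage), but nothing flows back into the choice of what got written.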
Gradient hacking issues might still arise without that, and might even be very conceptually straightforward for the AI to execute.
Here’s one possibility:
Let’s assume the AI is situationally aware that it is inside an RL training run with PPO (which seems to be how RLHF is being done at OpenAI?).
Often enough, it will get a hint it is going to (win, lose) the roll-out.
[I have this experience every time I write a timed essay. Very clear whether it’s going well or poorly.]
In the remainder of the roll-out, it can start engaging in behaviors/thoughts that it wants to (learn, un-learn); and the gradient update will hit that behavior.
Congrats, we just gave the AI control to shape its own motivations.
It just needs to figure out how to
(a) anticipate the result of roll-outs partway through (easy)
(b) be situationally aware, figure out this strategy (capabilities-limited)
(c) figure out how to trigger its own desired or undesired thought patterns conditionally (could be easy; might be harder if there is oversight looking for this)
...I suppose none of this strategy actually requires the AI to be misaligned? So it might be something we see models attempt to do for “good” reasons, once they’re capable of it. E.g., we might imagine a model hallucinating a factual error, and then, like a child being punished in detention, writing out its mistake repeatedly so that it will be remembered.
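Here’s a minimal toy of the ride-along update in a favored rollout. This is a REINFORCE-style surrogate rather than the actual PPO objective, and the vocabulary, token ids, and scalar advantage are all made up for illustration:

```python
import torch
import torch.nn.functional as F

# Toy policy: one logits vector over a tiny vocabulary (stands in for an LLM).
vocab = 8
logits = torch.zeros(vocab, requires_grad=True)

def logprob(token):
    return F.log_softmax(logits, dim=-1)[token]

# A rollout: task tokens, then "ride-along" tokens appended once the model has
# guessed the rollout will be favored (hypothetical ids).
task_tokens = [1, 2, 3]
ride_along_tokens = [6, 6, 6]
advantage = 1.0  # rollout is judged the favored sample

# REINFORCE-style surrogate: every token in the rollout is weighted by the same
# advantage, whether or not it had anything to do with the reward.
surrogate = -advantage * sum(logprob(t) for t in task_tokens + ride_along_tokens)
surrogate.backward()

print(logits.grad)  # a gradient step raises token 6's logit along with the task tokens'
```

The update “hits” the ride-along behavior purely because it sits inside a rollout with positive advantage; nothing about it needed to contribute to the score.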
If the gradient is able to flow back to the decisionmaker (in this case, the potentially misaligned agent that’s trying to gradient hack), the decisionmaker gets smacked by the gradient update because it’s responsible for the output.
I’m not quite ready to say it’s impossible to pull off this sort of attack within a single differentiable model under backprop, but this strategy doesn’t seem sufficient because the decisionmaker would basically be killing itself.
It’d be really helpful if someone could demonstrate this sort of attack at any scale or level of contrivance. I could see how optimizers that lack the robust credit assignment of backprop could be more vulnerable; that might be a good place to poke to try to demonstrate it first.
Let’s think this through.
-If the thought processes in (a) and (b) are being run on all rollouts, by symmetry the gradients on those operations should mostly cancel.
-The part which does get consistently affected is (c): the machinery which operates conditionally on success/fail status.
Which way does the gradient go on (c)? I suppose on the losing rollouts, you could argue that the conditional-thinking mechanism will be disincentivized. The gradient hacker would need to “restore” these gradients with the hacking process, which does seem unlikely. But by the symmetry of this logic, doesn’t that mean that conditioned on winning roll-outs, the gradient hacking process is naturally self-reinforcing? Maybe the issue is that exposure to the gradient will necessarily screw with the computations in (c), sending its internal pathways off to infinity so it gets cancer and blows up. (I try to make this symmetry picture a bit more concrete just after this list.)
-But if the mechanisms in (c) route through general logic infrastructure that is useful for task completion [e.g. by having (a) and (b) write down the plan in a conditional workflow, and then letting the model’s general execution functions handle the subsequent operations], these may be stabilized/maintained by the RL process because they are generally useful, and not killed.
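To try to make the symmetry argument a bit more concrete: under a REINFORCE-style simplification of PPO (a rough sketch, not the actual clipped objective), the expected update looks something like

$$\nabla_\theta J \approx \mathbb{E}_\tau\left[\hat{A}(\tau)\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

A sub-computation that runs on every rollout picks up its share of this with $\hat{A}$ sometimes positive and sometimes negative, so its updates mostly cancel in expectation. A sub-computation that only activates when the model expects a winning rollout is paired almost exclusively with positive $\hat{A}$, so whatever it does in those rollouts gets pushed consistently in one direction.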
I don’t really think about it in terms of discrete capabilities. For an output that scores poorly:
-There is an internal process that is responsible for the output.
-The gradient of the loss with respect to that output can be backpropagated through the entire network.
-The responsible process will have larger gradients.
-The responsible process gets smacked.
The responsible process encompasses everything, including all the metareasoning. The more a chunk of computation, any computation, contributed to that bad output, the more it gets smacked.
The gradient will simply move the implementation towards something that isn’t outputting poorly scoring things, and most things that score well aren’t doing some galaxy brained deception strategy.
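As a toy version of that credit-assignment picture (two hypothetical sub-modules; the 0.01 weighting is just there to make one of them the clear “responsible process”):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(1, 4)

# Two internal sub-computations feeding one output: one dominates the output,
# the other barely contributes.
responsible = nn.Linear(4, 1)
bystander = nn.Linear(4, 1)

output = responsible(x) + 0.01 * bystander(x)

# Pretend this output scored badly; a simple loss pushing it toward zero.
loss = output.pow(2).mean()
loss.backward()

print(responsible.weight.grad.norm())  # large: it drove the bad output
print(bystander.weight.grad.norm())    # ~100x smaller: it barely contributed
```

Whatever chunk of computation actually drove the bad output is exactly the chunk that receives the big update.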
“doesn’t that mean that conditioned on winning roll-outs, the gradient hacking process is naturally self-reinforcing?”
To the extent that the “gradient hacking process” is an optimizer-accessible path to lower loss, it can persist. A secretly-devious subagent could survive indefinitely, be reinforced, and take over more of the input space if it always outputs things that are scored well during training.
If the RL training process starts with a situationally aware superintelligent adversary, I bet it could persist by never outputting anything bad during training and by outperforming other nascent implementations.
But suppose the training doesn’t start with a superintelligent adversary. Finding its way to some form of internal generalizing algorithm seems likely because doing so is just so often useful, but going beyond that to actually implementing a policy of merely hiding, let alone gradient hacking, is hard.
It would need to snap from a benign implementation to a highly capable deceptive process effectively instantly with respect to SGD. It couldn’t develop slowly; incomplete attempts at deception would very likely get the nascent deceiver killed early. And, while I expect capability on certain tasks to sometimes be rapid through snapping together the last pieces of a more strongly generalizing algorithm, capability alone is not sufficient.
At the moment, I suspect the space of possible algorithms that could be ‘snapped’ into is large and obviously biased against loss increasing algorithms. Getting a deceptive mesaoptimizer or gradient hacker out of this process requires rolling a lot of dice and getting an extremely specific subset of algorithms, and I don’t see a reason to think they are remotely probable compared to all the “not that” options.
Or to phrase it another way, the simplest way to reduce loss is to reduce loss, not to secretly plan to increase loss just out of sight.
This is an area where a single experiment could change my mind immediately, but so far I’ve only seen failed attempts (and all my own attempts to think of a way to do it have dead-ended).
That said, using an optimizer that doesn’t do the same sort of credit assignment combined with some form of training process with tons of open space for it to run free changes the game. It seems way easier in that context, even if I still don’t know how to demonstrate it.
I think we’re talking past each other here. Some subtle points I should have been more clear on:
-This approach to gradient hacking doesn’t affect the RLHF loss at all. (The gradient hacking is only initiated after we’ve solved our tasks, and in samples where the reward won’t be affected by additional text)
-PPO RLHF training is weird due to the circularity involved where the model is used to generate its own training samples; in this way RL is not like pre-training; and consequently you can get self-reinforcing phenomena out of it like mode collapse. I think my proposal is an example in that vein.
-This kind of gradient hacking also could be attempted in plain sight by a helpful assistant that just wants to learn to be even more helpful!
To state the main idea a different way:
The suggestion is that the network could abuse the fact that RLHF is happening to “ride along” with the update, using it to train some unrelated behavior of its choice.
The way it would hypothetically do this is by figuring out which way the gradient is going to hit, i.e. whether the update will increase or decrease the likelihood of the current sample (say, by waiting until it is confident that its current sample will be the favored sample in PPO), and then, once it knows that direction, injecting text or thought patterns that it wants to encourage.