Consider what update equations have to say about “training game” scenarios. In PPO, the optimization objective is proportional to the advantage given a policy $\pi$, reward function $R$, and on-policy value function $v^\pi$:

$$A^{\pi}(s,a) := \mathbb{E}_{s' \sim T(s,a)}\!\left[R(s,a,s') + \gamma\, v^{\pi}(s')\right] - v^{\pi}(s).$$
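For concreteness, here is a minimal sketch of a one-step advantage estimate (PPO implementations typically use GAE, a discounted sum of such terms; the function name and numbers below are purely illustrative):

```python
def one_step_advantage(reward, value_s, value_s_next, gamma=0.99, done=False):
    """One-step advantage estimate: A(s, a) ~= r + gamma * v(s') - v(s)."""
    bootstrap = 0.0 if done else gamma * value_s_next
    return reward + bootstrap - value_s

# An outcome that merely matches the critic's expectation has ~zero advantage;
# an outcome much better than expected has a large positive advantage.
print(one_step_advantage(reward=1.0, value_s=1.99, value_s_next=1.0))  # ~0.0
print(one_step_advantage(reward=5.0, value_s=1.99, value_s_next=1.0))  # ~4.0
```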
Consider a mesa-optimizer acting to optimize some mesa-objective. The mesa-optimizer understands that it will be updated in proportion to the advantage. If the mesa-optimizer maximizes reward, this corresponds to maximizing the magnitude of the gradients it receives, and thus to updating its cognition as strongly as possible in whatever directions those gradients point.
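A minimal sketch of that proportionality (a toy one-state policy in PyTorch, using the vanilla policy-gradient loss rather than PPO’s clipped surrogate; all names and numbers are illustrative):

```python
import torch

# Toy policy over 3 actions. The policy-gradient loss is -A(s, a) * log pi(a | s),
# so the gradient applied to the policy parameters scales linearly with the advantage.
logits = torch.zeros(3, requires_grad=True)
action = torch.tensor(0)

def policy_gradient_norm(advantage):
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    loss = -advantage * log_prob
    (grad,) = torch.autograd.grad(loss, logits)
    return grad.norm().item()

print(policy_gradient_norm(advantage=0.1))   # small advantage -> small update
print(policy_gradient_norm(advantage=10.0))  # 100x the advantage -> 100x the update
```

Maximizing reward relative to what the critic expects means maximizing the advantage, and therefore the size of the parameter update.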
This isn’t necessarily good.
If you’re trying to gradient hack and preserve the mesa-objective, you might not want to do this. Large updates might lead to value drift, or make the network catastrophically forget circuits which are useful to the mesa-optimizer.
Instead, the best way to gradient hack might be to roughly minimize the absolute value of the advantage, which means acting so that realized returns roughly match the on-policy value estimate over time, and that doesn’t imply reward maximization.[1] This is a kind of “treading water” in terms of reward, and it helps reduce value drift.
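A toy numerical sketch of “treading water” (all numbers made up): if the realized reward is just enough to match what the critic already expects, the advantage, and hence the update, is roughly zero, even when far more reward is available in the state.

```python
# Suppose the critic assigns v(s) = v(s') = 5.0 and gamma = 0.99.
gamma, v_s, v_s_next = 0.99, 5.0, 5.0

# Collecting a reward of about 0.05 exactly "treads water": r + gamma * v(s') - v(s) = 0.
treading_water_reward = v_s - gamma * v_s_next        # 0.05
reward_ceiling = 10.0                                 # hypothetical maximum reward here

print(treading_water_reward + gamma * v_s_next - v_s) # advantage ~ 0.0
print(reward_ceiling + gamma * v_s_next - v_s)        # advantage ~ 9.95
```

On this picture, a gradient hacker prefers the first case: it keeps the gradients small and its mesa-objective intact, at the cost of forgoing reward.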
I think that realistic mesa-optimizers will not be reward maximizers, even instrumentally and during training. They won’t necessarily want to “play the training game by maximizing reward.” Instead, insofar as such an optimizer is trained by SGD, it wants to achieve its goals, which probably involves standard deception: looking good to its human supervisors and not making them suspicious.
Moral: Be careful with abstract arguments around selection pressures and learning-objective maximization. These can be useful and powerful reasoning tools, but if they aren’t deployed carefully, they can prove too much.
[1] Unless the value head is the value head for the optimal policy, which it isn’t trained to be unless the policy network is already optimal.