I’m wondering if regularization techniques could be used to make the pure deception regime unstable.
As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the “objective parameters” would be zero in the pure deception regime, so weight decay would ensure that the mesa-objective couldn’t be maintained.
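The mechanism can be sketched in a toy simulation (my own illustration, not from the post): one parameter block that actually produces the output, and a second block standing in for the parameters encoding the mesa-objective, which in the pure deception regime has zero loss gradient. Under weight decay, only the second block shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_obj = rng.normal(size=3)    # "objective" parameters: no effect on output
theta_other = rng.normal(size=3)  # parameters that actually produce the output

target = np.ones(3)
lr, wd = 0.1, 0.1  # learning rate and weight-decay coefficient (illustrative values)

for _ in range(1000):
    # In the pure deception regime, the output (and hence the loss
    # ||output - target||^2) depends only on theta_other.
    grad_other = 2 * (theta_other - target)  # gradient of the squared error
    grad_obj = np.zeros_like(theta_obj)      # zero loss gradient on objective params

    # Gradient descent with weight decay applied to both parameter sets.
    theta_other -= lr * (grad_other + wd * theta_other)
    theta_obj -= lr * (grad_obj + wd * theta_obj)

# theta_obj has decayed by a factor of (1 - lr*wd)**1000, i.e. to near zero,
# while theta_other stays close to the target (biased slightly by the decay).
print(np.abs(theta_obj).max(), np.abs(theta_other - target).max())
```

This is of course only the idealized case where the split into "objective parameters" and "everything else" is exact; the interesting question is how much the picture degrades when it is only approximate.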
The learned algorithm might be able to prevent this by “hacking” its gradient as mentioned in the post, making the parameters that determine the mesa-objective also have an effect on its output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.
Of course, regularization is a double-edged sword: as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.