This post calls attention to the problem of **gradient hacking**, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.
Planned opinion:
I’d be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment.
Planned summary:
This post calls attention to the problem of **gradient hacking**, where a powerful agent being trained by gradient descent could structure its computation in such a way that it causes its gradients to update it in some particular way. For example, a mesa optimizer could structure its computation to first check whether its objective has been tampered with, and if so to fail catastrophically, so that the gradients tend to point away from tampering with the objective.
Planned opinion:
I’d be interested in work that further sketches out a scenario in which this could occur. I wrote about some particular details in this comment.