Planned summary for the Alignment Newsletter:

Consider a neural network like GPT-3 trained by gradient descent on (say) the cross-entropy loss function. This loss function forms the _base objective_ that the process is optimizing for. Gradient descent typically ends up at some local minimum, global minimum, or saddle point of this base objective.
However, if we look at the gradient descent update, θ ← θ − αG, where G is the gradient of the loss, we can see that this is effectively minimizing the size of the gradient. We can think of this as the _mesa objective_: the gradient descent process (with an appropriate learning rate decay schedule) will eventually drive the size of the gradient ‖G‖ down to zero, its minimum possible value (even though the resulting parameters may not be a global minimum of the base objective).
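As a quick illustration of this point, here is a minimal sketch using a toy 1-D loss (an assumption chosen purely for illustration, not anything from the post): gradient descent with a decaying learning rate drives the gradient magnitude to zero even though it settles in a local, rather than the global, minimum of the base objective.

```python
import math

# Toy 1-D non-convex loss (illustrative assumption, not from the post):
# global minimum near theta = -2.35, local minimum near theta = 2.1.
def loss(theta):
    return 0.1 * theta**4 - theta**2 + 0.5 * theta

def grad(theta):
    return 0.4 * theta**3 - 2.0 * theta + 0.5

theta = 2.5            # initialize in the basin of the *local* minimum
alpha0 = 0.05          # initial learning rate
for t in range(1, 5001):
    alpha = alpha0 / math.sqrt(t)   # decaying learning rate schedule
    G = grad(theta)
    theta = theta - alpha * G       # theta <- theta - alpha * G

print(f"theta = {theta:.3f}, loss = {loss(theta):.3f}, |G| = {abs(grad(theta)):.2e}")
# |G| is driven to (near) zero even though theta sits at a local minimum,
# not the global minimum of the base objective.
```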
The author then proposes defining the capability of an optimizer based on how well it decreases its loss function in the limit of infinite training. Given a base optimizer and a mesa optimizer, alignment is then the capability of the base optimizer divided by the capability of the mesa optimizer. (Since the mesa optimizer is the one that actually acts, this effectively measures how much progress on the mesa objective also causes progress on the true base objective.)
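A rough sketch of how these quantities fit together, under a finite-horizon approximation (the post's actual definitions take the infinite-training limit, and the loss traces below are hypothetical numbers used only for illustration):

```python
def capability(losses):
    # Finite-horizon proxy: how much the optimizer decreased its loss over
    # training. (The post's definition takes the limit of infinite training.)
    return losses[0] - losses[-1]

# Hypothetical loss traces recorded at a few checkpoints during training:
# the base loss is e.g. cross-entropy; the mesa loss is e.g. gradient size.
base_losses = [2.30, 1.10, 0.80, 0.75, 0.74]
mesa_losses = [4.00, 1.50, 0.40, 0.05, 0.01]

cap_base = capability(base_losses)   # progress on the base objective
cap_mesa = capability(mesa_losses)   # progress on the mesa objective

# Alignment: capability of the base optimizer divided by capability of the
# mesa optimizer -- progress on the base objective per unit of mesa progress.
alignment = cap_base / cap_mesa
print(f"capability(base) = {cap_base:.2f}, capability(mesa) = {cap_mesa:.2f}, "
      f"alignment = {alignment:.2f}")
```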
This has all so far assumed a fixed training setup (such as a fixed dataset and network architecture). Ideally, we would also want to talk about robustness and generalization. For this, the author introduces the notion of a “perturbation” to the training setup, and then defines [capability / alignment] [robustness / generalization] based on whether the optimization stays approximately the same when the training setup is perturbed.
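To make the robustness idea slightly more concrete, here is a very rough sketch under the same finite-horizon approximation as above; `train_and_measure_capability`, `perturb`, and the dummy stand-ins are all hypothetical, not part of the post's formalism:

```python
def capability_is_robust(train_and_measure_capability, setup, perturb, tol=0.1):
    # Crude finite proxy for capability robustness: capability should stay
    # approximately the same when the training setup is perturbed.
    cap_original = train_and_measure_capability(setup)
    cap_perturbed = train_and_measure_capability(perturb(setup))
    return abs(cap_perturbed - cap_original) <= tol * abs(cap_original)

# Tiny usage example with dummy stand-ins (numbers chosen only for illustration):
dummy_setup = {"dataset": "D", "arch": "A"}
fake_train = lambda setup: 1.56 if setup["dataset"] == "D" else 1.49
fake_perturb = lambda setup: {**setup, "dataset": "D-perturbed"}
print(capability_is_robust(fake_train, dummy_setup, fake_perturb))  # True
```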
It should be noted that these are all definitions about the behavior of optimizers in the infinite limit. We may also want stronger guarantees that cover the behavior on the way to that limit.
Thanks, Rohin!
Please note that I’m currently working on a correction for part of this post — the form of the mesa-objective G(t) I’m claiming is in fact wrong, as Charlie correctly alludes to in a sibling comment.