Not sure if I agree with your interpretation of the “real objective”—might be better served by looking for stable equilibria and just calling them as such.
I think this is a reasonable objection. I don’t make this very clear in the post, but the “true objective” I’ve written down in the example indeed isn’t unique: like any measure of utility or loss, it’s only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren’t uniquely defined either. (I’ll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)
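To make the non-uniqueness concrete, here is a minimal sketch. The quadratic `loss`, the constants `a` and `b`, and the candidate grid are all made up for illustration and are not taken from the post. It only shows the general point: a positive affine rescaling leaves the optimum and the ranking of candidates unchanged, while the raw values change, so any quantity built from raw objective values could shift with the choice of scale.

```python
def loss(x):
    # Hypothetical objective, chosen only for illustration.
    return (x - 2.0) ** 2


def rescaled_loss(x, a=3.0, b=-7.0):
    # Positive affine transformation of the same objective (a > 0).
    return a * loss(x) + b


# Candidate points from -5.0 to 5.0 in steps of 0.5.
candidates = [i * 0.5 for i in range(-10, 11)]

best_original = min(candidates, key=loss)
best_rescaled = min(candidates, key=rescaled_loss)

# The ordering of candidates is the same under both objectives...
same_ranking = sorted(candidates, key=loss) == sorted(candidates, key=rescaled_loss)

# ...but the raw values are not.
values_differ = loss(0.0) != rescaled_loss(0.0)
```

Both objectives pick the same minimizer here, which is why I would not expect the choice of representative to change what the optimizer does, only how the numbers attached to it read.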
Don’t we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don’t add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a “strong alignment” question).
Interesting question! To try to interpret it in light of the definitions I’m proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients; it now also has an annealing term that it’s trying to optimize for. Whether this improves alignment depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.
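As a rough illustration of what a "gradient + annealing" update could look like, here is a minimal sketch. It assumes the simplest possible variant: plain gradient descent plus a Gaussian jump whose scale decays geometrically. The function name, the decay schedule, and the quadratic test loss are all hypothetical, and the jump size here does not depend on the loss, unlike the scheme described in the comment.

```python
import random


def noisy_gradient_descent(grad, x0, lr=0.1, noise0=1.0, decay=0.95,
                           steps=200, seed=0):
    """Gradient descent plus a random jump whose scale shrinks each step.

    Illustrative only: a simple annealed-noise variant, not a standard
    library routine and not the method from the post.
    """
    rng = random.Random(seed)
    x = x0
    noise = noise0
    for _ in range(steps):
        # Gradient step plus an annealing term (random jump).
        x = x - lr * grad(x) + rng.gauss(0.0, noise)
        # Lower the jump scale over time.
        noise *= decay
    return x


def grad_loss(x):
    # Gradient of the hypothetical loss (x - 3)^2.
    return 2.0 * (x - 3.0)


x_final = noisy_gradient_descent(grad_loss, x0=-5.0)
```

On a convex loss like this one the noise just washes out and the run ends near the minimum at 3. The two questions above only get interesting on objectives where the jumps can actually change which basin the optimizer ends up in.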
Hope that’s somewhat helpful, but please let me know if it’s unclear and I can try to unpack things a bit more!
Thanks for the comment!