I suspect most mesa optimizers generated by SGD will have objectives and implementations that are amenable to being modified by SGD. Mesa optimizers form because they give good performance on the base objective. A mesa optimizer whose mesa objective and inner optimization algorithm are easily updated by SGD achieves better performance on the base objective than a mesa optimizer that's harder to update.
Gradient hacking (especially from first principles) seems pretty complex. The first version of the mesa optimizer SGD generates almost surely won't gradient hack. From there, "do what the base optimizer wants" is far simpler and achieves far lower loss than "run a complex strategy that increases loss". To deal with cases where the mesa objective conflicts significantly with the base objective, I expect the base optimizer to build robust control structures into the mesa optimizer, and I expect the mesa optimizer will usually not challenge those structures.
Consider that humans are surprisingly close to being mesa aligned with evolution. We may not intrinsically value inclusive genetic fitness, but our sex drive is still pretty effective at getting us to reproduce. Note that evolution takes one step for every ~10s of billions of steps the brain takes, that evolution uses a far less efficient learning algorithm than the brain, and that evolution can’t even directly update brain parameters.
Finally, I think we can deal with the form of gradient hacking you describe (and most other forms of gradient hacking) by using meta learning (see the learn2learn repository). The basic problem of gradient hacking is that there’s some model parameter configuration that leads to poor SGD performance. Meta learning addresses exactly that issue.
Currently, meta learning research tries to find a parameter initialization that leads to quick learning on many tasks. The idea is that you can set the improvement from a few SGD steps as the optimization objective and use meta-gradients to get model parameters that are easy for the base optimizer to optimize. By the same logic, if the current parameterization of the mesa optimizer is preventing the base optimizer from learning, we should be able to use meta learning to find a mesa optimizer parameterization that's more amenable to modification by the base optimizer.
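To make this concrete, here's a minimal sketch of that recipe using learn2learn's MAML wrapper in PyTorch. The model, task sampler, and hyperparameters are placeholders I've made up for illustration; the point is just that the outer loss measures performance after an inner SGD step, so the meta-gradient pushes parameters toward configurations that SGD can improve easily.

```python
import torch
import torch.nn.functional as F
import learn2learn as l2l

model = torch.nn.Linear(10, 1)                    # placeholder model
maml = l2l.algorithms.MAML(model, lr=0.1)         # wraps the model for inner-loop SGD
meta_opt = torch.optim.Adam(maml.parameters(), lr=1e-3)

def sample_task():
    # Hypothetical task sampler: returns (support, query) batches for one task.
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    return (x[:16], y[:16]), (x[16:], y[16:])

for step in range(1000):
    (xs, ys), (xq, yq) = sample_task()
    learner = maml.clone()                        # differentiable copy of the parameters
    learner.adapt(F.mse_loss(learner(xs), ys))    # one inner SGD step on the support set
    outer_loss = F.mse_loss(learner(xq), yq)      # post-adaptation performance
    meta_opt.zero_grad()
    outer_loss.backward()                         # meta-gradient flows through the inner step
    meta_opt.step()
```

Presumably, in the gradient-hacking case the outer loss would instead measure how much the base objective improves after SGD updates to the suspect mesa optimizer parameters, but the outer-loop structure is the same.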
The downside is that now you have to think about a base optimizer, a mesa optimizer, and a meta optimizer (and maybe a meta-meta optimizer if you're feeling adventurous).
Currently, meta learning research tries to find a parameter initialization that leads to quick learning on many tasks. The idea is that you can set the improvement from a few SGD steps as the optimization objective and use meta-gradients to get model parameters that are easy for the base optimizer to optimize.
Just a small nitpick—what you describe (meta learning a param init) is certainly a meta-learning technique, but that term is broader and also encompasses actually learning better optimizers (although I guess you could make the two more equivalent by burning the SGD/update circuitry into the model).
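For contrast, here's a toy sketch (my own illustration, not something from the thread) of that second sense of meta learning: instead of learning an initialization, learn the update rule itself. A small network maps gradients to parameter updates and is trained by unrolling a few optimization steps on a throwaway problem and backpropagating through the unroll. Everything here (the regression task, network sizes, step counts) is made up for illustration.

```python
import torch
import torch.nn.functional as F

class LearnedOptimizer(torch.nn.Module):
    """Learned per-coordinate update rule: update = f(gradient)."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, grad):
        # Treat each parameter coordinate independently.
        return self.net(grad.reshape(-1, 1)).reshape(grad.shape)

opt_net = LearnedOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

for meta_step in range(200):
    # Fresh toy regression problem each meta-step.
    x, y = torch.randn(64, 5), torch.randn(64, 1)
    w = torch.zeros(5, 1, requires_grad=True)
    total_loss = 0.0
    for t in range(5):                                   # unrolled inner optimization
        loss = F.mse_loss(x @ w, y)
        grad, = torch.autograd.grad(loss, w, create_graph=True)
        w = w + opt_net(grad)                            # apply the learned update
        total_loss = total_loss + F.mse_loss(x @ w, y)
    meta_opt.zero_grad()
    total_loss.backward()                                # backprop through the whole unroll
    meta_opt.step()
```

Burning the update circuitry into the model, as you suggest, would amount to making something like `opt_net` part of the model itself rather than an external learned optimizer.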