Great post! I liked the clean analysis of the problem, the formalization, and the effort to point out potential issues with your own definitions. Now I'm really excited for the next posts, where I assume you will study robustness and generalization (based on your definitions) for simple examples of gradient descent. I'd be happy to comment on early drafts if you want feedback!
Some of the fixed points θ∗ of this system will coincide with global or local minima of our base objective, the cross-entropy loss L(t) — but not all of them. Some will be saddle points, while others will be local or global maxima. And while we don’t consider all these fixed points to be equally performant with respect to our base objective, our gradient descent optimizer does consider them all to be equally performant with respect to its true objective.
This disagreement is the core of the inner alignment problem in this setting: our gradient descent process isn’t always optimizing for the quantity we want it to. So what quantity is it optimizing for?
I agree wholeheartedly with this characterization. For me, that’s the gist of the inner alignment problem if the objective is the right one (i.e. if outer alignment is solved).
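To make the quoted point concrete for myself, I wrote a minimal numpy sketch (my own toy example, not from the post): since the gradient descent update only looks at the gradient, any stationary point is a fixed point of the update, including a saddle or a maximum.

```python
import numpy as np

# Toy objective L(x, y) = x^2 - y^2: its only stationary point, (0, 0),
# is a saddle, not a minimum.
def grad(theta):
    x, y = theta
    return np.array([2 * x, -2 * y])

theta = np.array([0.0, 0.0])  # initialize exactly at the saddle
for _ in range(100):
    theta -= 0.1 * grad(theta)  # standard gradient descent update

print(theta)  # still [0. 0.]: the saddle is a fixed point of the update
```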
Let’s look at a second example. This time we’ll compare Optimizer A to a ththird optimizer,
Typo on “ththird”.
Definition 1. Let $L_t$ be a base optimizer acting over $t$ optimization steps, and let $L(t)$ represent the value of its base objective at optimization step $t$. Then the capability of $L_t$ with respect to the base objective $L(t)$ is

$$C(L) = \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \bigl( L(t) - L(0) \bigr)$$
At first I wondered why you were taking the sum instead of just $C(L) = \lim_{T\to\infty} \frac{L(T) - L(0)}{T}$, but after thinking about it, the latter would probably converge to 0 almost all the time: even with amazing optimization, the loss eventually stops improving linearly in $T$. That might be worth putting in the post itself.
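Here's a quick numeric check of that intuition (my own sketch, using an arbitrary converging loss curve $L(t) = \frac{1}{t+1}$, which isn't from the post): the sum-based definition converges to a nonzero value, while the difference-quotient version vanishes.

```python
import numpy as np

# Arbitrary converging loss curve (my assumption, not from the post):
# L(0) = 1 and L(t) -> 0 as t grows.
def L(t):
    return 1.0 / (t + 1)

for T in [10, 100, 1000, 10000]:
    summed = np.mean([L(t) - L(0) for t in range(1, T + 1)])  # post's definition
    quotient = (L(T) - L(0)) / T                              # alternative
    print(f"T={T:>5}: sum-based C = {summed:+.4f}, quotient-based C = {quotient:+.6f}")
```

The sum-based value tends to $-1$ here, while the quotient-based one heads to 0 regardless of how much the loss actually improved.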
In our gradient descent example, our mesa-optimizer $L^M_t$ was the gradient descent process, and its mesa-objective was what, at the time, I called the “true objective”, $G(t)$. But the base optimizer $L^B_t$ was the human who designed the neural network and ran the gradient descent process on it.
This is not where I thought you were going when I read the intro, but it’s a brilliant idea that completely removes the question of whether and why the base optimizer would find a mesa-optimizer to which it can delegate work.
Thanks for the kind words, Adam! I’ll follow up over DM about early drafts — I’m interested in getting feedback that’s as broad as possible and really appreciate the kind offer here.
Typo is fixed — thanks for pointing it out!
At first I wondered why you were taking the sum instead of just $C(L) = \lim_{T\to\infty} \frac{L(T) - L(0)}{T}$, but after thinking about it, the latter would probably converge to 0 almost all the time: even with amazing optimization, the loss eventually stops improving linearly in $T$. That might be worth putting in the post itself.
Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value $\lim_{T\to\infty} L(T) = L_\infty$, then you’d get $\lim_{T\to\infty} \frac{L(T) - L(0)}{T} = \lim_{T\to\infty} \frac{L_\infty - L(0)}{T} = 0$ for any $L_\infty$.
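To spell that out with a concrete (made-up) loss curve, take $L(t) = L_\infty + \frac{c}{t+1}$ for some constant $c$: the difference quotient vanishes, while the averaged definition recovers the total improvement.

$$\lim_{T\to\infty} \frac{L(T) - L(0)}{T} = \lim_{T\to\infty} \frac{L_\infty + \frac{c}{T+1} - L(0)}{T} = 0, \qquad \text{whereas} \qquad \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \bigl( L(t) - L(0) \bigr) = L_\infty - L(0).$$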
Thanks again!