That definition of “optimizer” requires “some objective function that is explicitly represented within the system,” but that is not the case here.
There is a fundamental difference between:
1. Programs that implement the computation of taking the derivative. (f → f′, or perhaps (f, x) → f′(x).)
2. Programs that implement some particular function g, which happens to be the derivative of some other function. (x → g(x), where it so happens that g = F′ for some F.)
The transformers in this paper are programs of the 2nd type. They don’t contain any logic about taking the gradient of an arbitrary function, and one couldn’t “retarget” them toward L1 loss or some other function.
(One could probably construct similar layers that implement the gradient step for L1, but they’d again be programs of the 2nd type, just with a different hardcoded g.)
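To make the distinction concrete, here is a toy sketch (purely illustrative; none of this is from the paper, and the function names are made up):

```python
# Type 1: a program that differentiates an arbitrary function f (here numerically).
# Retargeting it is trivial: just pass in a different f.
def derivative_at(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Type 2: a program that computes one fixed function g. It happens that
# g = F' for F(x) = x**2, but nothing in the code represents F, and there is
# no input that points it at a different F.
def g(x):
    return 2 * x
```

The claim is that the constructed transformer layers are like g: a fixed map that happens to equal a particular gradient step, not a routine that takes a loss and differentiates it.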
Calling something like this an optimizer strikes me as vacuous: if you don’t require the ability to adapt to a change of objective function, you can always take any program and say it’s “optimizing” some function. Just pick a function that’s maximal when you do whatever it is that the program does.
It’s not vacuous to say that the transformers in the paper “implement gradient descent,” as long as one means they “implement [gradient descent on L2 loss]” rather than “implement [gradient descent] on [L2 loss].” They don’t implement general gradient descent, but happen to coincide with the gradient step for L2 loss.
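Concretely, reading the in-context task as linear regression on example pairs (x, y) (an assumption for illustration; the names below are mine, not the paper’s), the hardcoded update and the L2 gradient step coincide:

```python
import numpy as np

def hardcoded_update(w, x, y, lr=0.1):
    # A fixed formula; no loss function is represented anywhere in it.
    return w + lr * (y - w @ x) * x

def l2_loss(w, x, y):
    return 0.5 * (y - w @ x) ** 2

# The fixed formula coincides with one gradient-descent step on l2_loss:
w, x, y, lr = np.zeros(3), np.array([1.0, 2.0, 0.5]), 3.0, 0.1
eps = 1e-6
num_grad = np.array([(l2_loss(w + eps * e, x, y) - l2_loss(w - eps * e, x, y)) / (2 * eps)
                     for e in np.eye(3)])
assert np.allclose(hardcoded_update(w, x, y, lr), w - lr * num_grad)
```

Nothing in hardcoded_update would change if you wanted it to descend L1 loss instead; you would have to write a different formula.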
If in-context learning in real transformers involves figuring out the objective function from the context, then this result cannot explain it. If we assume some fixed objective function (perhaps the LM loss itself?) and ask whether the model might be doing gradient steps on this function internally, then these results are relevant.
I think the claim that an optimizer is a retargetable search process makes a lot of sense* and I’ve edited the post to link to this clarification.
That being said, I’m still confused about the details.
Suppose that I do a goal-conditioned version of the paper, where (hypothetically) I exhibit a transformer circuit that, conditioned on one prompt or another, can alternate between performing gradient descent on three types of objectives (say, L1, L2, L∞) -- would this suffice? How about if, instead, there wasn’t any prompt that let me switch between the three objectives, but there was a parameter inside the neural network that I could change to make the circuit optimize different objectives? How much of the circuit do I have to change before it becomes a new circuit, rather than a retargeting of the same optimizer?
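As a purely hypothetical sketch of what the prompt- or parameter-conditioned case could look like (the function and objective names here are made up for illustration), the distinguishing feature is that the objective is an explicit, swappable input to a single update routine:

```python
import numpy as np

def grad_step(w, X, y, objective, lr=0.1):
    """One (sub)gradient step on the chosen objective for a linear model y ≈ X @ w."""
    r = y - X @ w                          # residuals
    if objective == "l2":                  # 0.5 * sum(r**2)
        g = -X.T @ r
    elif objective == "l1":                # sum(|r|)
        g = -X.T @ np.sign(r)
    elif objective == "linf":              # max(|r|): subgradient at the worst residual
        i = np.argmax(np.abs(r))
        g = -np.sign(r[i]) * X[i]
    else:
        raise ValueError(objective)
    return w - lr * g
```

The question in the comment is, roughly, how much of this kind of structure a circuit has to exhibit -- and how cheap the switch has to be -- before we call it a retargetable optimizer rather than three hardcoded update rules.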
I guess part of the answer to these questions might look like, “there might not be a clear cutoff, in the same sense that there’s not a clear cutoff for most other definitions we use in AI alignment (‘agent’ or ‘deceptive alignment’, for example)”, while another part might be “this is left for future work”.
*This is also similar to the definition used for inner misalignment in Shah et al.’s Goal Misgeneralization paper:
We now characterize goal misgeneralization. Intuitively, goal misgeneralization occurs when we learn a function f_θ_bad that has robust capabilities but pursues an undesired goal.
It is quite challenging to define what a “capability” is in the context of neural networks. We provide a provisional definition following Chen et al. [11]. We say that the model is capable of some task X in setting Y if it can be quickly tuned to perform task X well in setting Y (relative to learning X from scratch). For example, tuning could be done by prompt engineering or by fine-tuning on a small quantity of data [52].