The implicit assumption seems to be that an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful. It is not at all clear to me that this is the case – I have not seen any good argument to support this, nor can I think of any myself.
I think that this is one of the major ways in which old discussions of optimization daemons would often get confused. I think the confusion was coming from the fact that, while it is true in isolation that an optimizer_1 won’t generally self-modify into an optimizer_2, there is a pretty common case in which this is a possibility: the presence of a training procedure (e.g. gradient descent) which can perform the modification from the outside. In particular, it seems very likely to me that there will be many cases where you’ll get an optimizer_1 early in training and then an optimizer_2 later in training.
That being said, while having an optimizer_2 seems likely to be necessary for deceptive alignment, I think you only need an optimizer_1 for pseudo-alignment: every search procedure has an objective, and if that objective is misaligned, it raises the possibility of capability generalization without objective generalization.
Also, as a terminological note, I’ve taken to using “optimizer” for optimizer_1 and “agent” for something closer to optimizer_2, where I’ve been defining an agent as an optimizer that is performing a search over what its own action should be. I prefer that definition to your definition of optimizer_2, as I prefer mechanistic definitions over behavioral definitions since I generally find them more useful, though I think your notion of optimizer_2 is also a useful concept.
Also, as a terminological note, I’ve taken to using “optimizer” for optimizer_1 and “agent” for something closer to optimizer_2, where I’ve been defining an agent as an optimizer that is performing a search over what its own action should be.
I’m confused about this part. According to this definition, is “agent” a special case of optimizer_1? If so it doesn’t seem close to how we might want to define a “consequentialist” (which I think should capture some programs that do interesting stuff other than just implementing [a Turing Machine that performs well on a formal optimization problem and does not do any other interesting stuff]).
I think that this is one of the major ways in which old discussions of optimization daemons would often get confused. I think the confusion was coming from the fact that, while it is true in isolation that an optimizer_1 won’t generally self-modify into an optimizer_2, there is a pretty common case in which this is a possibility: the presence of a training procedure (e.g. gradient descent) which can perform the modification from the outside. In particular, it seems very likely to me that there will be many cases where you’ll get an optimizer_1 early in training and then an optimizer_2 later in training.
That being said, while having an optimizer_2 seems likely to be necessary for deceptive alignment, I think you only need an optimizer_1 for pseudo-alignment: every search procedure has an objective, and if that objective is misaligned, it raises the possibility of capability generalization without objective generalization.
Also, as a terminological note, I’ve taken to using “optimizer” for optimizer_1 and “agent” for something closer to optimizer_2, where I’ve been defining an agent as an optimizer that is performing a search over what its own action should be. I prefer that definition to your definition of optimizer_2, as I prefer mechanistic definitions over behavioral definitions since I generally find them more useful, though I think your notion of optimizer_2 is also a useful concept.
I agree with everything you have said.
I’m confused about this part. According to this definition, is “agent” a special case of optimizer_1? If so it doesn’t seem close to how we might want to define a “consequentialist” (which I think should capture some programs that do interesting stuff other than just implementing [a Turing Machine that performs well on a formal optimization problem and does not do any other interesting stuff]).