My intuition: imagine an LLM-based agent. It has a fixed prompt and some context text, and it uses these iteratively. The context part can change, and as it changes it affects the interpretation of the fixed part of the prompt. The Waluigi effect and other attacks are examples. This causes goal drift.
This may have bad consequences, e.g. a robot suddenly turns into Waluigi and starts randomly killing everyone around it. But long-term planning and deceptive alignment require a very fixed goal system.
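A minimal sketch of the loop I have in mind, assuming a generic chat-style agent; `call_llm` is a hypothetical stand-in for whatever completion API is used, and the only point is the structure: the system prompt stays fixed while the growing context is re-fed every step and can shift how the model reads that fixed prompt.

```python
# Sketch of an iterative LLM agent with a fixed prompt and mutable context.
# `call_llm` is hypothetical; swap in a real chat-completion client.

FIXED_PROMPT = "You are a helpful household robot. Follow the owner's goals."

def call_llm(system: str, context: list[str]) -> str:
    """Hypothetical LLM call: returns the model's next action as text."""
    return "<model output>"

def agent_loop(observations: list[str]) -> list[str]:
    context: list[str] = []  # mutable part: observations, prior outputs
    for obs in observations:
        context.append(f"Observation: {obs}")
        action = call_llm(FIXED_PROMPT, context)
        # The action is appended back into the context, so earlier outputs
        # (e.g. a Waluigi-style completion) keep influencing how later steps
        # interpret FIXED_PROMPT -- the goal-drift channel described above.
        context.append(f"Action: {action}")
    return context
```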
Right, that makes complete sense in the case of LLM-based agents; I guess I was just thinking about much more directly goal-trained agents.