My intuition: imagine an LLM-based agent. It has a fixed prompt and some context text, and it uses these iteratively. The context part can change, and as it changes it affects the interpretation of the fixed part of the prompt. The Waluigi effect and other attacks are examples. This causes goal drift.
This may have bad consequences, e.g. a robot suddenly turns into Waluigi and starts randomly killing everyone around it. But long-term planning and deceptive alignment require a very fixed goal system.
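A minimal sketch of the loop I have in mind, assuming a generic chat-style agent; `call_llm` is a hypothetical stand-in for whatever completion API is used, and the only point is the structure: the system prompt stays fixed while the growing context is re-fed every step and can shift how the model reads that fixed prompt.

```python
# Sketch of an iterative LLM agent with a fixed prompt and mutable context.
# `call_llm` is hypothetical; swap in a real chat-completion client.

FIXED_PROMPT = "You are a helpful household robot. Follow the owner's goals."

def call_llm(system: str, context: list[str]) -> str:
    """Hypothetical LLM call: returns the model's next action as text."""
    return "<model output>"

def agent_loop(observations: list[str]) -> list[str]:
    context: list[str] = []  # mutable part: observations, prior outputs
    for obs in observations:
        context.append(f"Observation: {obs}")
        action = call_llm(FIXED_PROMPT, context)
        # The action is appended back into the context, so earlier outputs
        # (e.g. a Waluigi-style completion) keep influencing how later steps
        # interpret FIXED_PROMPT -- the goal-drift channel described above.
        context.append(f"Action: {action}")
    return context
```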
Right, that makes complete sense in the case of LLM-based agents; I guess I was just thinking about much more directly goal-trained agents.