Multi-factor goals might mostly look like information learned in earlier steps getting expressed in a new way in later steps. For example, an LLM that learns from a dataset including examples of humans prompting LLMs, and that is then instructed to write prompts for versions of itself doing subtasks within an agent structure, may show emergent goal-like behavior from the interaction of those two influences.
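To make that agent structure concrete, here is a purely illustrative sketch. The `call_llm` function is a hypothetical stand-in for whatever model API the agent actually uses (not a real library call); the point is only that the same model that learned from examples of humans prompting LLMs is now both writing and answering the subtask prompts.

```python
# Minimal sketch of an agent loop where one LLM instance writes prompts
# for sub-instances of itself. `call_llm` is a hypothetical placeholder
# for whatever model API is actually used.

def call_llm(prompt: str) -> str:
    """Hypothetical single-turn completion call to the model."""
    raise NotImplementedError("plug in your model API here")

def run_agent(task: str, max_subtasks: int = 5) -> str:
    # The "orchestrator" instance decomposes the task into subtask prompts,
    # in the style it saw humans using to prompt LLMs during training.
    plan = call_llm(
        f"Break this task into at most {max_subtasks} prompts, one per line, "
        f"each addressed to an assistant like yourself:\n{task}"
    )
    subtask_prompts = [line for line in plan.splitlines() if line.strip()]

    # Each subtask prompt is fed back into a fresh instance of the same model.
    results = [call_llm(p) for p in subtask_prompts[:max_subtasks]]

    # A final instance synthesizes the subtask outputs into one answer.
    return call_llm(
        "Combine these subtask results into one answer:\n" + "\n".join(results)
    )
```

Any goal-like behavior in a loop like this wouldn't live in any single call; it would emerge from how training shaped both the prompt-writing and the prompt-following ends of the loop.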
I think locating goals “within the CoT” often doesn’t work; a lot of the work is done implicitly, especially after RL on a model that uses CoT. What does that mean for attempts to teach metacognition that’s good by human standards?
I think you’re pointing to more layers of complexity in how goals will arise in LLM agents.
As for what it all means WRT metacognition that can stabilize the goal structure: I don’t know, but I’ve got some thoughts! They’ll be in a long post I’ve almost finished editing; I plan to publish it tomorrow.
Those sources of goals are going to interact in complex ways both during training, as you note, and during chain of thought. No goals truly arise solely from the chain of thought, since the chain of thought is entirely based on the semantics the model learned from training.