A fine-tuning could serve as an identity or mission statement for an agent (a bureaucracy), so that it speaks with a purpose, or with attention to particular features of a situation, to a particular concept, or to an aspect of preference. Then, in an HCH-like setting, let’s define for each situation (initial prompt) an episode on it, in which multiple agents discuss the situation and elucidate the aspects of it that pertain to them. Each agent participates in a set of episodes defined on a set of situations (the agent’s scope), and scopes can differ between agents: each agent is specialized and only participates in episodes about situations where its fine-tuning is relevant.
An agent is aligned when it consistently acts according to its fine-tuning’s intent within its scope (it’s robust to situations in its scope, or to episodes on the situations in its scope). An agent doesn’t need to behave correctly outside its scope to be considered aligned, so a fine-tuning doesn’t need to generalize too far, but its scope must conservatively estimate how far it does generalize.
So a large space of situations can be covered by overlapping smaller scopes of agents, which bind together the behaviors of episodes on those situations. Each agent acts as a sort of acausal coordination device across the episodes on its scope, provided agents are iteratively retrained on the data of those episodes (as a sort of reflection). And each episode binds together the behaviors of the agents participating in it (in a sort of bargaining).
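The covering picture can be made concrete with a small sketch. Everything here is a hypothetical illustration rather than part of the proposal: the `Agent` class, the agent names, and the situation labels `s1` to `s5` are assumptions, and a scope is modeled simply as a finite set of situations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    scope: frozenset  # situations where this agent's fine-tuning is trusted (assumed finite)


def participants(situation, agents):
    """Agents whose scope contains the situation, i.e. who join its episode."""
    return [a for a in agents if situation in a.scope]


def uncovered(situations, agents):
    """Situations that fall in no agent's scope."""
    return {s for s in situations if not participants(s, agents)}


def bound_together(a, b):
    """Two agents are directly bound if their scopes share at least one situation."""
    return bool(a.scope & b.scope)


# Hypothetical agents with overlapping scopes.
agents = [
    Agent("honesty", frozenset({"s1", "s2"})),
    Agent("caution", frozenset({"s2", "s3"})),
    Agent("novelty", frozenset({"s3", "s4"})),
]
situations = {"s1", "s2", "s3", "s4", "s5"}

episode_on_s2 = [a.name for a in participants("s2", agents)]
gaps = uncovered(situations, agents)
```

In this toy setup the episode on `s2` involves both "honesty" and "caution", so it binds their behaviors, while "honesty" and "novelty" share no episode and are connected only indirectly through "caution". The situation `s5` lies in no scope, so nothing in this setup speaks for behavior there.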
In this sketch, alignment/extrapolation (across distributional shift) is sought by training new specialized agents whose scopes cover novel situations further from the initial training/fine-tuning distribution. This is done by adding them to episodes on situations that are in the scopes of both old and new agents, where they bargain with old agents and learn to extend their alignment to new situations within their new scopes. A new agent is trained to understand new situations (within its scope) and the arguments that take place within the episodes on those situations. These situations and arguments are unfamiliar to old agents, so adequate descriptions or explanations of them won’t fit into an old agent’s context window (the way prompts can’t replace fine-tunings), but the new agent already expects these situations, so it can discuss them (after iterating reflection, that is, fine-tuning on the episodes in the new agent’s scope).
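As a rough illustration of the extension step, the following sketch splits a new agent's scope into situations shared with old agents (where the bargaining would happen) and genuinely novel ones. The function name `extension_plan`, the agent names, and the situation labels are all hypothetical, and the training and reflection loop itself is not modeled.

```python
def extension_plan(old_scopes, new_scope):
    """Split a new agent's scope into shared and novel situations.

    old_scopes: dict mapping old agent name -> set of situations (hypothetical)
    new_scope:  set of situations the new agent is meant to cover
    """
    covered_by_old = set().union(*old_scopes.values())
    # Situations where the new agent meets old agents in the same episode.
    shared = new_scope & covered_by_old
    # Situations that only the new agent covers.
    novel = new_scope - covered_by_old
    # For each old agent that overlaps, the situations it shares with the new agent.
    partners = {
        name: scope & new_scope
        for name, scope in old_scopes.items()
        if scope & new_scope
    }
    return shared, novel, partners


old_scopes = {"honesty": {"s1", "s2"}, "caution": {"s2", "s3"}}
new_scope = {"s3", "s4", "s5"}

shared, novel, partners = extension_plan(old_scopes, new_scope)
```

Here the new agent overlaps only with "caution", on `s3`, and would have to carry what it learns there over to `s4` and `s5`. An empty `shared` set would mean the new agent has no episode in common with any old agent, so this mechanism would give it nothing to extend from.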