What does it mean to align an LLM?
It is very clear what it means to align an agent:
an agent acts in an environment
if an agent consistently acts to navigate the state of the environment into a certain regime, we can call this a “goal of the agent”
if that goal corresponds to states of the environment that we value, the agent is aligned
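To make this definition concrete, here is a minimal toy sketch (purely illustrative; the environment, policy, and set of valued states are made up for the example): the environment state is a single integer, the agent's policy consistently steers it into one regime, and we call the agent aligned if that regime sits inside the states we value.

```python
# Purely illustrative toy model of the definition above: the environment state is an
# integer, the agent's policy is a function from state to state, the agent's "goal" is
# the regime of states it consistently steers into, and it is "aligned" iff that regime
# lies inside the set of states we value. All specifics here are made up for the example.

def rollout(policy, state, steps=50):
    """Run the environment forward under the agent's policy and return the final state."""
    for _ in range(steps):
        state = policy(state)
    return state

def agent_policy(state: int) -> int:
    """An agent that consistently navigates the state toward 10, one step at a time."""
    if state < 10:
        return state + 1
    if state > 10:
        return state - 1
    return state

# The "goal of the agent": the regime it ends up in from many different starting states.
goal_regime = {rollout(agent_policy, s0) for s0 in range(-20, 21)}

# The states we value (made up for the example): non-negative and below 15.
valued_states = set(range(0, 15))

print("goal regime:", goal_regime)               # {10}
print("aligned:", goal_regime <= valued_states)  # True: the regime is one we value
```

The point of the toy is only that "aligned" in this sense is a property of where the agent's behavior reliably takes the environment, not of any particular action in isolation.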
It is less clear what it means to align an LLM:
Generating words (or other tokens) can be viewed as actions. Aligning an LLM then means: make it say nice things.
Generating words can also be seen as thoughts. An LLM that allows us to easily build aligned agents with the right mix of prompting and scaffolding could be called aligned.
One definition that a friend proposed is: an LLM is aligned if it can never serve as the cognition engine for a misaligned agent. This interpretation most strongly emphasizes the “harmlessness” aspect of LLM alignment.
Probably, we should have different alignment goals for different deployment contexts: LLM assistants should say nice and harmless things, while agents that help automate alignment research should be free to think anything they deem useful and to reason about the harmlessness of various actions “out loud” in their CoT, rather than implicitly in a forward pass.
I think even your definition of what it means for an agent to be aligned is a bit underspecified because it doesn’t distinguish between two possibilities:
Is the agent creating positive outcomes because it trades and compromises with us, creating an arrangement that benefits both us and the agent, or
Is the agent creating positive outcomes because it inherently “values what we value”, i.e. its utility function overlaps with ours, and it directly pursues what we want from it, with no compromises?
Definition (1) is more common in the human world: we say that a worker is aligned with us if they do their job as instructed (receiving a wage in return). Definition (2) is more common in theoretical discussions of AI alignment, because people frequently assume that compromise, as a strategy we could take in an AI-human scenario, is either unnecessary or impossible.
By itself, the meaning you gave appears to encompass both definitions, but it seems worth clarifying which of them you’d consider closer to the “spirit” of the word “aligned”. It’s also important to specify what counts as a good outcome according to our values, especially if these things are a matter of degree rather than binary. As they say, clear thinking requires making distinctions.