I think even your definition of what it means for an agent to be aligned is a bit underspecified because it doesn’t distinguish between two possibilities:
(1) Is the agent creating positive outcomes because it trades and compromises with us, producing a mutually beneficial arrangement, or
(2) Is the agent creating positive outcomes because it inherently “values what we value”, i.e. its utility function overlaps with ours, and it directly pursues what we want from it, with no compromises?
Definition (1) is more common in the human world: we say a worker is aligned with us if they do their job as instructed, receiving a wage in return. Definition (2) is more common in theoretical discussions of AI alignment, because people frequently assume that compromise is either unnecessary or impossible as a strategy in an AI-human scenario.
By itself, the meaning you gave appears to encompass both definitions, but it would help to clarify which of them you consider closer to the “spirit” of the word “aligned”. It also matters whether alignment, and what counts as a good outcome by our values, is a matter of degree rather than binary, and if so, to say how that degree is judged. As they say, clear thinking requires making distinctions.