Agency may be a convergent property of most AI systems (or at least, of many systems people are likely to try to build), once those systems reach a certain capability level. The simplest and most useful way to predict the behavior of such systems may therefore be to model them as agents.
Perhaps we can avoid the problems posed by agency by building only tool AI. Even in that case, we probably still need a deep understanding of agency to make sure we avoid building an agent by accident. Instrumental convergence may imply that all sufficiently powerful AI systems start looking like agents past a certain point. That said, exactly when a particular system is best modeled as an agent may depend on the particulars of that system, and we may want to push that point out as far as possible.
Boiling this down to a single specific reason why we should care about agency: **the concept of agency is likely to be key for creating simple, predictively accurate models of many kinds of powerful AI systems**, regardless of whether the builders of those systems:
deeply understand the concepts of agency (or alignment) themselves
are deliberately trying to build an agent, or deliberately trying to avoid that, or just trying to build the most powerful system as fast as possible, without explicitly trying to avoid or target agency at all. (We seem to be in a world in which different people are trying all three of these things simultaneously.)
A few arguments or stubs of arguments for why the bolded claim is correct and important:
Agency is already a useful model of humans and human behavior in many situations.
Agency is already a useful model of some current AI systems: MuZero, Dreamer, and Stockfish, in the domains of their respective game worlds. It might soon be a useful model of constructions like Auto-GPT, in the domain of the real world.
The hypothesis that agency is instrumentally convergent implies that agency will be important for understanding all AI systems above a certain capability level.
This answer is a little bit confusing to me. You say that “agency” may be an important concept even if we don’t have a deep understanding of what it entails. But how about a simple understanding?
I thought that when people spoke about “agency” and AI, they meant something like “a capacity to set their own final goals”, but then you claim that Stockfish could best be understood by using the concept of “agency”. I don’t see how.
I myself kind of agree with the sentiment in the original post that “agency” is a superfluous concept, but I want to understand the opposite view.
To clarify, I wasn’t arguing it was a superfluous concept, just trying to collect all the possible reasons together and nudge people towards making the use cases/applications more precise.
> you claim that Stockfish could best be understood by using the concept of “agency”. I don’t see how.

I didn’t claim that Stockfish was best understood by using the concept of agency. I claimed agency was one useful model of it. Consider the first diagram in this post on embedded agency: you can regard Stockfish as Alexei playing a chess video game. By modeling Stockfish as an agent in this situation, you can largely abstract away its internal workings and predict that it will beat you at chess, even if you can’t predict the exact moves it makes, or why it makes those moves.
> I thought that when people spoke about “agency” and AI, they meant something like “a capacity to set their own final goals”
I think this is not what is meant by agency. Do I have the capacity to set (i.e. change) my own final goals? Maybe so, but I sure don’t want to. My final goals are probably complex and I may be uncertain about what they are, but one day I might be able to extrapolate them into something concrete and coherent.
I would say that an agent is something which is usefully modeled as having any kind of goals at all, regardless of what those goals are. As a system’s ability to achieve those goals increases, the agent model gets more useful and predictive relative to other models based on (for example) the system’s internal workings. If I want to know whether Kasparov is going to beat me at chess, a detailed model of neuroscience and a scan of his brain is less useful than modeling him as an agent whose goal is to win the game. Similarly, for MuZero, a mechanistic understanding of the artificial neural networks it is composed of is probably less useful than modeling it as a capable Go agent, when predicting what moves it will make.
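To make that contrast concrete, here is a minimal sketch in Python. None of this is how MuZero or Stockfish actually works; the names (predict_as_agent, predict_mechanistically, evaluate_for, simulate) are made up for illustration. The point is only that the agent-level predictor needs just a goal and some rough evaluation of how well each action serves it, while the mechanistic predictor needs access to the system’s internals.

```python
# Illustrative sketch only: evaluate_for stands in for whatever rough estimate
# we have of how well an action serves a goal (e.g. "does this move win material?").

def predict_as_agent(goal, state, legal_actions, evaluate_for):
    """Predict the system's next action by assuming it picks whatever best
    serves its goal, ignoring its internal workings entirely."""
    return max(legal_actions, key=lambda action: evaluate_for(goal, state, action))

def predict_mechanistically(internals, state, simulate):
    """Predict the system's next action by simulating its internals
    (e.g. running a forward pass of its network), which requires far more
    knowledge of the system and may not predict outcomes any better."""
    return simulate(internals, state)

# Toy usage: a "system" whose goal we model as "pick the largest number".
best = predict_as_agent(
    goal="maximize",
    state=None,
    legal_actions=[1, 5, 3],
    evaluate_for=lambda goal, state, action: action,
)
assert best == 5
```

The sketch is meant to show why the agent model gets more predictive as the system gets more capable: “it will pick something near-optimal for its goal” becomes an increasingly safe bet, even while the exact move, and any mechanistic simulation of the internals, stay out of reach.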
Is there a fixed, concrete definition of “convergent property” or “instrumentally convergent” that most folks can agree on?
From what I see, it’s more loosely and vaguely defined than “agency” itself, so it’s not really dispelling the confusion anyone may have.
I don’t know if there’s a standard definition or reference for instrumental convergence other than the LW tag, but convergence in general is a pretty well-known phenomenon.
For example, many biological mechanisms which evolved independently end up looking remarkably similar, because that just happens to be the locally optimal way of doing things if you’re in the design space of iterative mutation of DNA.
Similarly, in all sorts of engineering fields, methods or tools or mechanisms are often re-invented independently, but end up converging on functionally or even visually similar designs, because they are trying to accomplish or optimize for roughly the same thing, and there are only so many ways of doing so optimally, given enough constraints.
Instrumental convergence, in the context of agency and AI, is just that principle applied to strategic thinking and mind design specifically. In that context it’s more of a hypothesis, since we don’t actually have more than one example of a human-level intelligence being developed. But even if mind designs don’t converge on some kind of optimal form, artificial minds could still be efficient w.r.t. humans, which would have many of the same implications.