Agency, LLMs and AI Safety: A First Pass

The following is an adaptation of part of my (unsuccessful) application to the Winter 2022 cohort of SERI MATS, as prompted by Victoria Krakovna's "What is your favorite definition of agency, and how would you apply it to a language model?"

This is far from a perfect treatment of the topic, but I am publishing it to document my early days in the field. Or should I say "Epistemic status: I am learning".
I am interested in this question because it came up over the summer, when I got more interested in AI Safety. In my reading of LessWrong and of technical AI Safety research surveys, I started to question the need for agency in AI models in the first place.

It seemed to me that most hypotheses about future AGI scenarios involved a notion of agency in the model(s) considered, but it was unclear to me what advantages developing agentic models would provide to humans. I thought that if agency was not essential, then perhaps a "compromise" between safety and AGI capabilities could be achieved by relying on oracle-like models, devoid of agency but useful for guidance, and whose answers and recommendations we could ultimately reject[1]. I asked a related question on the AI Stack Exchange forum but unfortunately was met with confusion, and ultimately my question was closed because I had made the very obvious mistake of not defining my terms. In an attempt (which never succeeded) to resuscitate the discussion, I proposed the following definition of agency:
I define agency as the ability to autonomously perceive and interact with a given environment. Anything capable of agency is then an agent. Furthermore, I define:
autonomous perception: the ability to perceive a given environment without the need for an external agent
interaction: the ability to change the state of the environment
I liked (and still like) my definition, but admittedly I have not tried evaluating it[2]. One issue is that I did not develop it in isolation: my bias was that language models are tools, not agents, and I purposely defined agency backwards from this claim. As such, it is (perhaps artificially) difficult to apply this definition of agency to language models. The choice of environment for a language model is trivial: it can be anything whose state can be represented (noisily) through language. The "interaction" requirement of the definition can in a sense be met by an LM capable of influencing the actions of the agents that, in turn, form the LM's environment. For example, when prompted, an LM could output a recipe which could then be realised in the real world by a human. In this sense, the LM is interacting with the environment, albeit indirectly. The "autonomous perception" requirement is instead more difficult for an LM to meet. An LM's perception of its environment is limited to the textual inputs that someone has to provide. With multimodal LMs, the inputs are no longer limited to text, but they still need to be provided by someone. It is only once an LM is coupled with something capable of agency (a human, or, say, an RL policy guiding sensors) that it can receive inputs and perceive its environment.
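
To make the coupling point concrete, here is a minimal sketch in Python, assuming a GPT-2 model loaded through Hugging Face transformers and a stubbed external_observation function standing in for the human (or sensor-guiding RL policy); both choices are illustrative, not a claim about any particular system:

```python
# Minimal sketch: the LM only "perceives" what an external process feeds it as
# text, and its output only affects the world if an external actor acts on it.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def external_observation() -> str:
    # Stand-in for a human or a sensor-guiding policy describing the world in text.
    return "The kitchen contains eggs, flour, and sugar."

def lm_suggest(observation: str) -> str:
    prompt = f"Observation: {observation}\nSuggested next step:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=30, do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# The LM never samples the environment itself, and nothing happens unless a
# human (or other agent) decides to carry out the suggestion.
print(lm_suggest(external_observation()))
```

In this sketch both the perception step and the action step pass through something external to the model, which is exactly where the "autonomous" part of my definition breaks down.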
Of the readings suggested, my favorite is Barandiaran et al.'s definition of agency [1], reported in [2], as it seemed the most focused and the least prone to confusion with other terms such as "goal-directedness" and "optimization", which may overlap with agency but are not necessarily the same. The definition is as follows:
agency involves, at least, a system doing something by itself according to
certain goals or norms within a specific environment.
In more detail, agency requires:
Individuality: the system is capable of defining its own identity as an
individual and thus distinguishing itself from its surroundings; in doing
so, it defines an environment in which it carries out its actions;
Interactional asymmetry: the system actively and repeatedly modulates its
coupling with the environment (not just passive symmetrical interaction);
Normativity: the above modulation can succeed or fail according to some
norm.
It is once again difficult to recognise agency in language models under this definition. We can work through the requirements and see why. Note that I will treat each requirement assuming the other two are satisfied; note also, however, that the authors indicate that each requirement is co-dependent on the satisfaction of the others.
An LM's capability of defining its own identity is debatable. Modern LMs are certainly able to produce output stating something along the lines of "I am a language model; I distinguish myself from my surroundings because [...]." That is, it is possible that such an output will be produced. However, can this be considered the LM defining its own identity? Firstly, the LM needs to be non-trivially prompted to produce such an output. Secondly, by the nature of its training objective, assuming a decoder-only GPT-like LM, the output is simply what the model weights compute to be the most likely completion of the previous tokens provided in the prompt. It is therefore controversial whether the model even "knows" that it is defining itself, which seems like a necessary condition for an agent to define itself in words. In a sense, any self-reference and/or self-maintenance, which the authors outline as definitional for individuality, would seem to be an "act" in the case of language models.
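
As a toy illustration of this point (using GPT-2 through Hugging Face transformers as a stand-in for a decoder-only GPT-like LM; the prompt is mine), the "self-description" is nothing more than repeated next-token prediction:

```python
# Greedy decoding by hand: each "self-describing" token is just the argmax of the
# next-token distribution given whatever prompt an external user supplied.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "I am a language model, and I distinguish myself from my surroundings because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()          # most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
# Whatever the continuation says, it is produced only because we prompted for it;
# nothing in the computation requires the model to "know" it is describing itself.
```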
I like this requirement because it is similar to the "autonomous perception" and "interaction" requirements present in my definition. A similar conclusion follows for whether LMs can meet it: because any action performed (output emitted) by an LM is caused by a chain of inputs that ultimately originates with an external agent, LMs do not meet the interactional asymmetry requirement, at least not in their current form.
It can be more easily argued that LMs meet the normativity requirement. Because an LM is trained with one (or more) objective functions, the resulting model acts to satisfy some goal (which the authors use interchangeably with "norm"): namely, finding and outputting the most likely completion of its inputs. This norm is encoded in the network parameters and, as such, is generated by the system itself.
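
To spell out what that norm looks like, here is a minimal sketch of the standard next-token cross-entropy objective (again using GPT-2 via Hugging Face transformers purely for illustration; the sentence being scored is arbitrary):

```python
# The training norm in miniature: score how well the model predicts each next
# token of a sequence; training adjusts the parameters to minimise this loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Agency involves a system doing something by itself."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits              # (1, seq_len, vocab_size)

# Position t is scored on how well it predicts token t+1.
shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)

print(loss.item())
# "Output the most likely continuation" is the goal/norm baked into the weights
# by minimising exactly this quantity over the training corpus.
```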
In general, it seems difficult to reconcile satisfactory definitions of agency with a view of language models as agents. I am nevertheless very interested in the question, and I do not rule out more appropriate definitions being developed and tested, or interpretations and understandings of language models that I have not yet considered. I think this is an important question to consider, given the possibility of agency emerging in models not explicitly designed to be agentic. Having a definition of the word could aid us in recognising such a scenario.
References
[1] Barandiaran XE, Di Paolo E, Rohde M. Defining Agency: Individuality, Normativity, Asymmetry, and Spatio-temporality in Action. Adaptive Behavior. 2009;17(5):367-386. doi:10.1177/1059712309343819
[2] Literature Review on Goal-Directedness
Footnotes
1. I recognise this is somewhat naive: a sufficiently intelligent oracle could provide answers persuasive enough that we don't even consider their rejection.
2. Victoria had to evaluate it as part of my application, and said the following: "your definition of agency was a bit circular: you defined agency in terms of autonomy, and autonomy in terms of not needing an external agent." I agree this is valid criticism.