Should agency remain an informal concept?
The discussion of the different conceptualisations of agency in the paper left me confused. The authors pull on a number of different strings, some of which are hardly related to each other:
“Agent is a system that would adapt their policy if their actions influenced the world in a different way”
Accidentality: “In contrast, the output of non-agentic systems might accidentally be optimal for producing a certain outcome, but these do not typically adapt. For example, a rock that is accidentally optimal for reducing water flow through a pipe would not adapt its size if the pipe was wider.”
Intentional stance
“Our definition is important for goal-directedness, as it distinguishes incidental influence that a decision might have on some variable, from more directed influence: only a system that counterfactually adapts can be said to be trying to influence the variable in a systematic way.” — here, it’s unclear what it means to adapt “counterfactually”.
Learning behaviour, as discussed in section 1.3, and also in the section “Understanding a system with the process of its creation as an agent” below.
I’m not sure this is much of an improvement over Russell and Norvig’s classification of agents, which seems to capture most of the same threads, as well as the fact that agency is more of a gradient than a binary yes/no property. This is also consistent with minimal physicalism, the scale-free understanding of cognition (and, hence, agency) by Fields et al.
“Agent” is such a loaded term that I feel it would be easier to read the paper if the authors didn’t attempt to “seize” the term “agent” but instead used, for example, the term “consequentialist”. The term “agent” carries too much semantic burden: most readers probably already have an ingrained intuitive understanding of this word, or have their own favourite theory of agency. So, readers have to fight against their prior intuitions while reading the paper, and nonetheless risk misunderstanding it.
Understanding a system with the process of its creation as an agent
[…] our definition depends on whether one considers the creation process of a system when looking for adaptation of the policy. Consider, for example, changing the mechanism for how a heater operates, so that it cools rather than heats a room. An existing thermostat will not adapt to this change, and is therefore not an agent by our account. However, if the designers were aware of the change to the heater, then they would likely have designed the thermostat differently. This adaptation means that the thermostat with its creation process is an agent under our definition. Similarly, most RL agents would only pursue a different policy if retrained in a different environment. Thus we consider the system of the RL training process to be an agent, but the learnt RL policy itself, in general, won’t be an agent according to our definition (as after training, it won’t adapt to a change in the way its actions influence the world, as the policy is frozen).
I think most readers have a strong intuition that agents are physical systems. A system together with its creation process is actually a physical object in the BORO ontology (where a process is a 4D spacetime object, and any collection of objects can be seen as an object, too), and probably in some other approaches to ontology, but I suspect this might be highly counterintuitive to most readers, so it perhaps warrants some discussion.
Also, I think that a “learnt RL policy” can mean either an abstract description of behaviour or a physical encoding of this description on some information storage device. Neither of these can intuitively be an “agent”, so I would stick with simply “RL agent” (meaning “a physical object that acts according to a learned policy”) in this sentence again, to avoid confusion.
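To make the quoted thermostat example concrete, here is a minimal sketch (my own toy illustration, not the paper’s Algorithms 1 and 2): a frozen policy gives the same output after the mechanism of its environment changes, whereas the policy taken together with its creation process adapts. All names and numbers below are invented for illustration.

```python
# A minimal sketch (my own toy illustration, not the paper's Algorithms 1 and 2) of the
# adaptation test: a frozen policy does not change when the way its actions influence
# the world changes, but the policy together with its creation (training) process does.

def train_policy(action_effect, target_temperature=20):
    """Stand-in for the creation process: pick the action whose effect lands closest to the target."""
    actions = ["heat", "cool"]
    return min(actions, key=lambda a: abs(target_temperature - action_effect(a)))

def normal_effect(action):
    return {"heat": 25, "cool": 10}[action]

def flipped_effect(action):
    """The mechanism intervention: the 'heater' now cools the room and vice versa."""
    return {"heat": 10, "cool": 25}[action]

frozen_policy = train_policy(normal_effect)   # trained once, then frozen

# The frozen policy gives the same action regardless of the mechanism change:
print(frozen_policy, frozen_policy)                               # heat heat

# The policy *with its creation process* adapts to the changed mechanism:
print(train_policy(normal_effect), train_policy(flipped_effect))  # heat cool
```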
I think the authors of the paper had a goal of bridging the gap between the real world roamed by suspected agents and the mathematical formalism of causal graphs. From the conclusion:
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key contribution is to formalise the idea that agents are systems that adapt their behaviour in response to changes in how their actions influence the world. Indeed, Algorithms 1 and 2 describe a precise experimental process that can, in principle and under some assumptions, be done to assess whether something is an agent. Our process is largely consistent with previous, informal characterisations of agents (e.g. Dennett, 1987; Flint, 2020; Garrabrant, 2021; Wiener, 1961), but making it formal enables agents and their incentives to be identified empirically [emphasis mine — Roman Leventov] or from the system architecture.
There are other pieces of language that hint that the authors see their contributions as epistemological rather than mathematical (all emphasis is mine):
We derive the first causal discovery algorithm for discovering agents from empirical data.
[Paper’s contributions] ground game graph representations of agents in causal experiments. These experiments can be applied to real systems, or used in thought-experiments to determine the correct game graph and resolve confusions (see Section 4).
However, I don’t think the authors created a valid epistemological method for discovering agents from empirical data. In the following sections, I lay out the arguments supporting this claim.
For mechanism variables that the modeller can’t intervene on, and can learn about only by observing their corresponding object-level variables, it doesn’t make sense to draw a causal link from the mechanism to the object-level variable
If a mechanism variable can’t be intervened on and is only observed through its object-level variable, then this mechanism is purely a product of the modeller’s imagination and can be anything.
Such mechanism variables, however, still have a place on causal graphs: they are represented physically by the modeller (e.g. stored in computer memory, or in the modeller’s brain), and these representations physically affect the modeller’s other representations, including those of other variables, as well as the modeller’s decision policies. For example, in graph 1c, the mechanisms of all three variables should be seen as stored in the mouse’s brain:
The causal links X → ~X and U → ~U point from the object-level to the mechanism variables because the mouse learns the mechanisms by observing the object-level.
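A minimal sketch of this direction of learning (entirely my own illustration; the mouse’s “belief” and the numbers are invented) shows the object-level observations driving the update of the stored mechanism representation, i.e. an effective U → ~U arrow:

```python
# A toy illustration (my own) of why I argue the arrows should run from object-level
# variables to the modeller's mechanism representations: the mouse's stored estimate
# of ~U is *updated by* observed outcomes U, not vice versa.

import random

random.seed(0)

def true_mechanism_U(x):
    """The world's actual mechanism: probability of cheese given position x (unknown to the mouse)."""
    return {"left": 0.2, "right": 0.9}[x]

belief_about_U = {"left": 0.5, "right": 0.5}   # the mouse's representation of ~U, stored "in its brain"
counts = {"left": [0, 0], "right": [0, 0]}     # [cheese observations, total observations]

for _ in range(200):
    x = random.choice(["left", "right"])                  # object-level X
    u = random.random() < true_mechanism_U(x)             # object-level U, sampled from the real mechanism
    counts[x][0] += u
    counts[x][1] += 1
    belief_about_U[x] = counts[x][0] / counts[x][1]       # U -> ~U: the observation updates the belief

print(belief_about_U)  # converges towards the true mechanism, driven by object-level observations
```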
An example of a mechanism variable which we can’t intervene on, but may observe independently of the object-level, is humans’ explicit reports of their preferences, separate from what is revealed in their behaviour (e.g. in the content recommendation setting, which I explore in more detail below).
In section 5.2 “Modelling advice”, the authors express a very similar idea:
It should be fully clear both how to measure and how to intervene on a variable. Otherwise its causal relationship to other variables will be ill-defined. In our case, this requirement extends also to the mechanism of each variable.
This idea is similar because the modeller often can’t intervene on the abstract “mechanism” attached to an object-level variable (and supposedly causing it), but the modeller can always intervene on its own belief about this mechanism. And if mechanism variables represent the modeller’s beliefs rather than “real mechanisms” (cf. Dennett’s real patterns), then it’s obvious that the causal links should point from the object-level variables to the corresponding beliefs about their mechanisms, rather than vice versa.
So, I agree with this last quote, but it seems to contradict a major chunk of the rest of the paper’s content.
There is no clear boundary between the object-level and the mechanism variables
In explaining mechanism and object-level variables, the authors seemingly jump between reasoning within the mathematical formalism of SCMs (or a simulated environment, where all mechanisms are explicitly specified and controllable by the simulator; such an environment doesn’t differ much from a mathematical formalism) and reasoning about the real world.
The mathematical/simulation frame:
The intended interpretation is that the mechanism variables parameterise how the object-level variables depend on their object-level parents.
The formalism itself doesn’t tell us how to distinguish between object-level and mechanism variables: object-level variables are just the variables that happen to be included in the object-level graph, and that choice is arbitrary. For example, in section 4.4, the authors note that Langlois and Everitt (2021) included the decision rule in the game graph, whereas it should have been a mechanism variable. However, in the content recommendation setting (section 4.2), the “human model” (M) is clearly a mechanism variable for the original human preferences (H1), but is nevertheless included in the object-level graph because other object-level variables depend on it.
There are also two phrases in section 3.3 that presuppose the mathematical frame. First, “the set of interventional distributions generated by a mechanised SCM” (in Lemma 2) says that interventional distributions are generated by the model, rather than by the physical system (the algorithm’s executor) performing the interventions in the modelled world. Second, the sentence “Applied to the mouse example of Fig. 1, Algorithm 1 would take interventional data from the system and draw the edge-labelled mechanised causal graph in Fig. 1c.” doesn’t emphasise that an algorithm is always performed by an executor (a physical system), and that it matters who that executor is, including for the details of the algorithm (cf. Deutsch’s and Marletto’s constructor theory of information). Algorithms don’t execute themselves.
The real-world frame:
An intervention on an object-level variable V changes the value of V without changing its mechanism, ~V. This can be interpreted as the intervention occurring after all mechanisms variables have been determined/sampled [emphasis mine—Roman Leventov].
In section 3.5, it is said that mechanised SCM is a “physical representation of the system”.
The distinction between mechanism and object-level variables can be made more concrete by considering repeated interactions. In Section 1.1, assume that the mouse is repeatedly placed in the gridworld, and can adapt its decision rule based (only) on previous episodes. A mechanism intervention would correspond to a (soft) intervention that takes place across all time steps, so that the mouse is able to adapt to it. Similarly, the outcome of a mechanism can then be measured by observing a large number of outcomes of the game, after any learning dynamics has converged. Finally, object-level interventions correspond to intervening on variables in one particular (postconvergence) episode.
[In this quote, the authors use both the terms “interaction” and “episode” to refer to the same concept; I stick with the former because in the Reinforcement Learning literature, the term “episode” has a slightly different meaning from the one implied by the authors here. — Roman Leventov]
I read this quote as the authors adopting an inductive bias that mechanism variables update slowly, so that we can take the simplifying assumption that they don’t update at all within a single interaction. However, I think this assumption is dangerously naive when reasoning about agents capable of reflection and explicit policy planning. Such agents (including humans) can switch their policy (i.e., the mechanism) for making a certain kind of decision in response to a single event. And other agents in the game, aware of this possibility, can and should take it into account, effectively modelling the game as having direct causal paths from some object-level variables into the mechanism variables, which in turn inform their own decisions.
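For reference, the repeated-interaction reading quoted above can be sketched as follows (my own toy construction, not the paper’s procedure): a mechanism intervention is applied across all interactions, so the learnt decision rule adapts to it, while an object-level intervention touches only a single post-convergence interaction and leaves the decision rule unchanged.

```python
# A toy sketch (my own) of the repeated-interaction reading: mechanism interventions
# apply across all interactions and the learnt decision rule adapts; object-level
# interventions apply within one post-convergence interaction and leave it unchanged.

def converge_decision_rule(reward_of_action, interactions=100):
    """Stand-in for 'after any learning dynamics has converged': pick the empirically better action."""
    totals = {"left": 0.0, "right": 0.0}
    for _ in range(interactions):
        for action in totals:
            totals[action] += reward_of_action(action)
    return max(totals, key=totals.get)

normal_reward  = lambda action: 1.0 if action == "right" else 0.0
swapped_reward = lambda action: 1.0 if action == "left" else 0.0   # soft intervention across all interactions

rule_before = converge_decision_rule(normal_reward)    # "right"
rule_after  = converge_decision_rule(swapped_reward)   # "left": the mechanism ~D responded

# Object-level intervention: override the decision in one particular post-convergence interaction.
forced_decision = "left"                                # do(D = left) in a single interaction
print(rule_before, rule_after, forced_decision)         # the learnt rule itself is untouched by do(D)
```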
Optimising a model of a human
If, as is common in practice, the model was obtained by predicting clicks based on past user data, then changing how a human reacts to recommended content (~H2), would lead to a change in the way that predicted clicks depend on the model of the original user (~U). This means that there should be an edge, as we have drawn in Fig. 3b. Everitt et al. (2021a) likely have in mind a different interpretation, where the predicted clicks are derived from 𝑀 according to a different procedure, described in more detail by Farquhar et al. (2022). But the intended interpretation is ambiguous when looking only at Fig. 3a – the mechanised graph is needed to reveal the difference.
Why does all this matter? Everitt et al. (2021a) use Fig. 3a to claim that there is no incentive for the policy to instrumentally control how the human’s opinion is updated and they deem the proposed system safe as a result. However, under one plausible interpretation, our causal discovery approach yields the mechanised causal graph representation of Fig. 3b, which contains a directed path H2 → D. This can be interpreted as the recommendation system is influencing the human in a goal-directed way, as it is adapting its behaviour to changes in how the human is influenced by its recommendation (cf. discussion in Section 1.2).
This example casts doubt on the reliability of graphical incentive analysis (Everitt et al., 2021a) and its applications (Ashurst et al., 2022; Cohen et al., 2021; Evans and Kasirzadeh, 2021; Everitt et al., 2021b; Farquhar et al., 2022; Langlois and Everitt, 2021). If different interpretations of the same graph yields different conclusions, then graph-based inference does not seem possible.
I think the real problem with the graph in Figure 3a is that it has already stepped into “mechanism land”, but doesn’t depict any mechanism variables apart from that of H1, namely M. The discussion in the quote above assumes that the graph in Figure 3a models repeated recommendations (otherwise there wouldn’t be both H1 and H2 on the graph simultaneously). Therefore, as I noted above, causal links between object-level chance variables and the corresponding mechanism variables should point from the object-level to the mechanism. And, indeed, there is a link H1 → M. Thus, M on the graph is identical to ~H1 in the notation of mechanised SCMs, and ~M should be interpreted as the mechanism of deriving the mechanism: that is, the statistical algorithm used to derive M (~H1) from ~H2 and U.
I think the mechanised SCM of content recommendation should look closer to this, taking the “the model was obtained by predicting clicks based on past user data” interpretation:
On this graph, the red causal links are those that differ from Figure 3b; the colour doesn’t imply any special causal semantics.
I also assumed that ~U and M are learned jointly, hence the direct bidirectional link between them. However, ~U might also be fixed to some static algorithm, such as a static click rate discounted according to a fixed formula that takes as input the strength of the user’s preference for the recommended content, with the preference read off the user model M (which is still learned). Under this interpretation, all the incoming causal links into ~U should be erased, and “changing how a human reacts to recommended content (~H2), would lead to a change in the way that predicted clicks depend on the model of the original user (~U)” is not inevitable (see the beginning of the quote from the paper above).
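To make the two readings of ~U concrete, here is a toy sketch (entirely my own; the functions, names and numbers are invented): under the “fit to past click data” reading, intervening on ~H2 changes the predicted-click mechanism, while under the “static formula over M” reading it does not.

```python
# A toy sketch (my own) of the two readings of ~U discussed above.
# (a) ~U is fit to click data, so changing how the human reacts to recommendations (~H2)
#     changes ~U.
# (b) ~U is a fixed formula over the preference stored in the (learned) user model M,
#     so the same change in ~H2 leaves ~U itself untouched.

def clicks(reaction, preference, content):
    """Object-level click behaviour under a given reaction mechanism (~H2)."""
    return reaction(preference, content)

compliant  = lambda pref, c: 0.9 if c == pref else 0.1       # one setting of ~H2
contrarian = lambda pref, c: 0.1 if c == pref else 0.9       # an intervened setting of ~H2

def fit_predicted_clicks(reaction, preference):
    """(a) ~U learned from past click data generated under the current ~H2."""
    return {c: clicks(reaction, preference, c) for c in ["cats", "news"]}

def static_predicted_clicks(model_preference):
    """(b) ~U fixed: a static formula applied to the preference stored in M."""
    return {c: (0.8 if c == model_preference else 0.2) for c in ["cats", "news"]}

print(fit_predicted_clicks(compliant, "cats"), fit_predicted_clicks(contrarian, "cats"))  # ~U changes with ~H2
print(static_predicted_clicks("cats"), static_predicted_clicks("cats"))                   # ~U unchanged
```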
Actor–Critic
This can help avoid modelling mistakes and incorrect inference of agent incentives. In particular, Christiano (private communication, 2019) has questioned the reliability of incentive analysis from CIDs, because of an apparently reasonable way of modelling the actor-critic system where the actor is not modelled as an agent, shown in Fig. 4c. Doing incentive analysis on this single-agent diagram would lead to the assertion that the system is not trying to influence the state 𝑆 or the reward 𝑅, because they don’t lie on the directed path 𝑄 → 𝑊 (i.e. neither 𝑆 nor 𝑅 has an instrumental control incentive; Everitt et al., 2021a). This would be incorrect, as the system is trying to influence both these variables (in an intuitive and practical sense).
I tried to model the system described on page 332 of Sutton and Barto (the “One-step Actor–Critic (episodic)” algorithm), preserving the structure of the above graph but using the notation from Sutton and Barto. To me, it seems that the best model of the system is single-agent, but where A is still a decision rather than a chance variable, and ^v(⋅,w) (the equivalent of Q on the graph above) is best seen as part of the mechanism for the decision A, a mechanism that includes both ^v(⋅,w) and π(⋅,θ), rather than as a single mechanism variable on its own:
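Relatedly, a compact tabular sketch of the One-step Actor–Critic update (the toy two-state environment and the tabular parameterisation are my own choices) shows why I group ^v(⋅,w) and π(⋅,θ) into a single mechanism for A: the two parameter sets are updated jointly, within the same loop, from the same TD error.

```python
# A minimal, tabular sketch of Sutton & Barto's "One-step Actor-Critic (episodic)".
# The toy two-state environment is my own; the point is that the critic weights w and
# the policy parameters theta are updated together, which is why I treat v_hat(., w)
# and pi(., theta) jointly as the mechanism behind the decision A.

import math, random

random.seed(0)
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.1
n_states, n_actions = 2, 2
w = [0.0] * n_states                                   # state-value weights for v_hat(s, w)
theta = [[0.0] * n_actions for _ in range(n_states)]   # action preferences for pi(a | s, theta)

def pi(s):
    """Softmax over action preferences in state s."""
    m = max(theta[s])
    exps = [math.exp(p - m) for p in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def step(s, a):
    """Toy environment (my own): action 1 in state 1 ends the episode with reward 1."""
    if s == 0:
        return 1, 0.0, False
    return None, (1.0 if a == 1 else 0.0), True

for _ in range(500):                                   # episodes
    s, done, I = 0, False, 1.0
    while not done:
        probs = pi(s)
        a = 0 if random.random() < probs[0] else 1
        s_next, r, done = step(s, a)
        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]              # one-step TD error, shared by both updates
        w[s] += alpha_w * delta                        # critic update (tabular gradient is 1 for state s)
        for b in range(n_actions):                     # actor update: grad of ln pi for softmax preferences
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha_theta * I * delta * grad
        I *= gamma
        s = s_next if not done else s

print(pi(1))  # the learnt decision rule in state 1 leans towards the rewarding action
```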
The advice that variables should be logically independent is phrased more strongly than the other modelling advice, which is not well justified
Variables should be logically independent: one variable taking on a value should never be mutually exclusive with another variable taking on a particular value.
In this modelling advice, the word “never” signals that this is “stronger” advice than the rest provided in section 5.2. However, under the bounded rationality of physical causal reasoners, some models can contain logical inconsistencies yet still enable better inference (and more efficient action policies) than alternative models that omit the variables which are logically co-dependent with other variables and thus introduce the inconsistencies. For instance, keeping both a fine-grained variable and a coarse summary of it (which are logically dependent) can make inference cheaper for a bounded reasoner than working with the fine-grained variable alone.
Two miscellaneous clarification notes for section 3.2
“whether a variable’s distribution adaptively responds for a downstream reason, (i.e. is a decision node), rather than for no downstream consequence (e.g. its distribution is set mechanistically by some natural process)” — a “natural process” implies an object-level causal link from W to V. However, from the text that follows, it seems to be implied that W is also downstream of V on the object level. This would mean there is a causal cycle on the object level, but we don’t consider such models. So, while this example might be formally correct, I think it is more confusing than helpful.
“to determine whether a variable, 𝑉, adapts for a downstream reason, we can test whether 𝑉’s mechanism still responds even when the children of 𝑉 stop responding to 𝑉 (i.e. 𝑉 has no downstream effect).” — This sentence is confusing. I think it should read “… whether V’s mechanism stops responding to changes in the mechanisms of its downstream variables when the children of 𝑉 stop responding to 𝑉” (as formalised in Definition 3).
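To illustrate my reading of Definition 3, here is a concrete toy version of the test (my own construction, not the paper’s formalism): the mechanism of a decision D (a best-response policy) responds to a change in the downstream mechanism ~U only while U still responds to D; once the child’s response to D is cut, the policy stops adapting.

```python
# A toy version (my own construction) of the test as I read it: D's mechanism (the
# best-response policy) responds to a change in the downstream mechanism ~U only while
# U still responds to D; once that child's response to D is cut, the policy stops
# adapting, which is what marks D as a decision node.

def best_response_policy(utility_mechanism, u_depends_on_d=True):
    """The mechanism ~D: choose the action maximising U under the given ~U."""
    actions = ["left", "right"]
    if not u_depends_on_d:             # the child U no longer responds to D
        return actions[0]              # any action is as good as any other
    return max(actions, key=utility_mechanism)

cheese_right = lambda a: 1.0 if a == "right" else 0.0   # one setting of ~U
cheese_left  = lambda a: 1.0 if a == "left" else 0.0    # an intervened setting of ~U

# With the child responding to D, ~D adapts to the change in ~U:
print(best_response_policy(cheese_right), best_response_policy(cheese_left))   # right left

# With U's response to D cut, ~D no longer responds to the same change in ~U:
print(best_response_policy(cheese_right, u_depends_on_d=False),
      best_response_policy(cheese_left,  u_depends_on_d=False))                # left left
```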