Work done with Ramana Kumar, Sebastian Farquhar (Oxford), Jonathan Richens, Matt MacDermott (Imperial) and Tom Everitt.
Our DeepMind Alignment team researches ways to avoid AGI systems that knowingly act against the wishes of their designers. We’re particularly concerned about agents which may be pursuing a goal that is not what their designers want.
These types of safety concerns motivate developing a formal theory of agents to facilitate our understanding of their properties and avoid designs that pose a safety risk. Causal influence diagrams (CIDs) aim to be a unified theory of how design decisions create incentives that shape agent behaviour, illuminating potential risks before an agent is trained and inspiring better agent designs with more appealing alignment properties.
Our new paper, Discovering Agents, introduces new ways of tackling these issues, including:
The first formal causal definition of agents, roughly: Agents are systems that would adapt their policy if their actions influenced the world in a different way
An algorithm for discovering agents from empirical data
A translation between causal models and CIDs
Resolving earlier confusions from incorrect causal modelling of agents
Combined, these results provide an extra layer of assurance that a modelling mistake hasn’t been made, which means that CIDs can be used to analyse an agent’s incentives and safety properties with greater confidence.
Example: modelling a mouse as an agent
To help illustrate our method, consider the following example: a world containing three squares, with a mouse that starts in the middle square, chooses to go left or right, arrives at its next position, and then potentially gets some cheese. The floor is icy, so the mouse might slip. Sometimes the cheese is on the right, but sometimes on the left.
This can be represented by the following CID:
The intuition that the mouse would choose a different behaviour for different environment settings (iciness, cheese distribution) can be captured by a mechanised causal graph (a variant of a mechanised causal game graph), which, for each (object-level) variable, also includes a mechanism variable that governs how the variable depends on its parents. Crucially, we allow for links between mechanism variables.
This graph contains additional mechanism nodes in black, representing the mouse’s policy and the iciness and cheese distribution.
Edges between mechanisms represent direct causal influence. The blue edges are special terminal edges – roughly, mechanism edges ~A → ~B that would still be there, even if the object-level variable A was altered so that it had no outgoing edges.
In the example above, since U has no children, its mechanism edge must be terminal. But the mechanism edge ~X → ~D is not terminal, because if we cut X off from its child U, then the mouse will no longer adapt its decision (because its position won’t affect whether it gets the cheese).
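To make this structure concrete, here is a minimal sketch (in Python with networkx; not from the paper) of the mouse’s mechanised causal graph, with the terminal status of each mechanism edge recorded as an edge attribute. The names follow the post: D is the decision, X the position, U the cheese, and ~V the mechanism of V.

```python
# Minimal sketch (not from the paper) of the mouse's mechanised causal graph.
import networkx as nx

g = nx.DiGraph()

# Object-level CID: the decision D sets the position X, which determines U.
g.add_edges_from([("D", "X"), ("X", "U")])

# Each mechanism node governs its object-level variable.
for v in ["D", "X", "U"]:
    g.add_edge(f"~{v}", v)

# Edges between mechanisms: the mouse's policy adapts to the iciness and to
# the cheese distribution.  ~U -> ~D is terminal (U has no children), while
# ~X -> ~D is not: cutting X off from U removes the reason to adapt.
g.add_edge("~X", "~D", terminal=False)
g.add_edge("~U", "~D", terminal=True)

print(list(g.edges(data=True)))
```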
Causal definition of agents
We build on Dennett’s intentional stance – that agents are systems whose outputs are moved by reasons. The reason that an agent chooses a particular action is that it expects it to lead to a certain desirable outcome. Such systems would act differently if they knew that the world worked differently, which suggests the following informal characterisation of agents:
Agents are systems that would adapt their policy if their actions influenced the world in a different way.
The mouse in the example above is an agent because it will adapt its policy if it knows that the ice has become more slippery, or if the cheese is more likely on the left. In contrast, the output of non-agentic systems might accidentally be optimal for producing a certain outcome, but these do not typically adapt. For example, a rock that is accidentally optimal for reducing water flow through a pipe would not adapt its size if the pipe was wider.
This characterisation of agency may be read as an alternative to, or an elaboration of, the intentional stance (depending on how you interpret it) couched in the language of causality and counterfactuals. See our paper for comparisons of our notion of agents with other characterisations of agents, including Cybernetics, Optimising Systems, Goal-directed systems, time travel, and compression.
Our formal definition of agency is given in terms of causal discovery, discussed in the next section.
Causal discovery of agents
Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from a variable A to a variable B by experimentally intervening on A and checking if B responds, even if all other variables are held fixed.
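As a toy illustration of this principle (not from the paper), the sketch below uses a made-up structural equation for B and checks whether B responds when we intervene on A with everything else held fixed.

```python
# Toy illustration: detect the edge A -> B by intervening on A while holding
# everything else fixed and checking whether B responds.  The structural
# equation below is invented purely for this illustration.
def sample_b(a, noise=0.0):
    return 2 * a + noise  # in the "true" model, B depends on A

# Intervene on A twice with the noise held fixed; B changes, so infer A -> B.
print(sample_b(a=0) != sample_b(a=1))  # True
```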
Our first algorithm uses this causal discovery principle to discover the mechanised causal graph, given the interventional distributions (which can be obtained from experimental interventional data). The image below visualises the inputs and outputs of the algorithm; see our paper for the full details.
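For intuition only, here is a rough sketch of the shape of such an edge-discovery loop; the `responds` oracle is a hypothetical stand-in for querying the interventional distributions, and the paper’s Algorithm 1 is the authoritative version.

```python
# Rough sketch of an edge-discovery loop over mechanism variables.
# `responds(target, source, held_fixed)` is a hypothetical oracle: "does
# `target` change when we intervene on `source`, holding the mechanisms in
# `held_fixed` constant?"  In practice it would be answered from the
# experimental interventional data.
def discover_mechanism_edges(mechanism_vars, responds):
    edges = set()
    for src in mechanism_vars:
        for dst in mechanism_vars:
            if src == dst:
                continue
            held_fixed = [v for v in mechanism_vars if v not in (src, dst)]
            if responds(dst, src, held_fixed):
                edges.add((src, dst))
    return edges
```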
Our second algorithm transforms this mechanised causal graph to a game graph:
It works by assigning utilities to nodes with outgoing blue terminal edges on their mechanisms and decisions to nodes with incoming blue terminal edges on their mechanisms. The mechanism connections reveal which decisions and utilities belong to the same agent, and are used to determine node colours in multi-agent CIDs.
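The following simplified sketch (assuming networkx, and glossing over cases handled in the paper) shows how decisions, utilities and agent groupings could be read off a list of mechanism edges annotated with their terminal status.

```python
# Simplified sketch of the labelling step (the paper's Algorithm 2 handles
# more cases).  Input: mechanism-level edges as (source, target, is_terminal)
# triples, with mechanism nodes named "~V" for object-level variable V.
import networkx as nx

def label_game_graph(mechanism_edges):
    terminal = [(s, t) for s, t, is_term in mechanism_edges if is_term]
    decisions = {t.lstrip("~") for _, t in terminal}  # incoming terminal edge
    utilities = {s.lstrip("~") for s, _ in terminal}  # outgoing terminal edge
    # Decisions and utilities connected in the terminal-edge subgraph are
    # attributed to the same agent.
    agents = [set(c) for c in nx.connected_components(nx.Graph(terminal))]
    return decisions, utilities, agents

# Mouse example: the only terminal edge is ~U -> ~D, so D is a decision,
# U a utility, and both belong to a single agent (the mouse).
print(label_game_graph([("~X", "~D", False), ("~U", "~D", True)]))
```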
Our third algorithm transforms the game graph into a mechanised causal graph, to establish an equivalence between the different representations. The equivalence only holds under some additional assumptions, as the mechanised causal graph can contain more information than the game graph in some cases.
In the paper we prove theorems concerning the correctness of these algorithms.
An example where this helps
In this example, we have an Actor-Critic RL setup for a one-step MDP. The underlying system has the following game graph.
Here an actor selects action A as advised by a critic. The critic’s action Q states the expected reward for each action (in the form of a vector with one element for each possible choice of A; this is often called a Q-value function). The action A influences the state S, which in turn determines the reward R. We model the actor as just wanting to follow the advice of the critic, so its utility is Y = Q(A) (the A-th element of the Q-vector). The critic wants its advice Y to match the actual reward R. Formally, it optimises W = −(R − Y)².
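For concreteness, here is a small numerical illustration of these payoffs; the two actions, the deterministic transition and the reward values are invented for the example.

```python
# Numerical illustration of the payoffs above (invented toy numbers).
import numpy as np

reward_of_state = {0: 0.0, 1: 1.0}   # reward R determined by the state S
Q = np.array([0.2, 0.8])             # critic's advice: expected reward per action
A = int(np.argmax(Q))                # actor follows the critic's advice
S = A                                # illustrative deterministic transition
R = reward_of_state[S]

Y = Q[A]                             # actor's utility: the A-th element of Q
W = -(R - Y) ** 2                    # critic's utility: advice should match reward

print(A, Y, W)                       # 1 0.8 approx -0.04
```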
Algorithm 1 produces the following mechanised causal graph:
Let’s focus on a few key edges: (~S,~Q) is present, but (~S,~A) is not, i.e. the critic cares about the state mechanism but the actor does not. The critic cares because it is optimising W which is causally downstream of S, and so the optimal decision rule for Q will depend on the mechanism of S even when other mechanisms are held constant. The dependence disappears if R is cut off from S, so the edge (~S,~Q) is not terminal. In contrast, the actor doesn’t care about the mechanism of S, because Y is not downstream of S, so when holding all other mechanisms fixed, varying ~S won’t affect the optimal decision rule for A. There is however an indirect effect of the mechanism for S on the decision rule for A, which is mediated through the decision rule for Q.
Our Algorithm 2 applied to the mechanised causal graph produces the correct game graph by identifying that ~A and ~Q have incoming terminal edges, and therefore are decisions; that Y’s mechanism has an outgoing terminal edge to A’s mechanism, so Y is A’s utility; and that W’s mechanism has an outgoing terminal edge to the mechanism for Q, so W is Q’s utility. The decisions and utilities get coloured differently due to their terminal edge subgraphs being disconnected.
This can help avoid modelling mistakes and incorrect inference of agent incentives. In particular, Paul Christiano (private communication, 2019) has questioned the reliability of incentive analysis from CIDs, because an apparently reasonable way of modelling the actor-critic system would be as follows, where the actor is not modelled as an agent:
Doing incentive analysis on this single-agent diagram would lead to the assertion that the system is not trying to influence the state S or the reward R, because they don’t lie on the directed path Q→W (i.e. neither S nor R has an instrumental control incentive). This would be incorrect, as the system is trying to influence both these variables (in an intuitive and practical sense).
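For illustration, here is a sketch of the graphical check behind that assertion: in a single-agent CID (a DAG), a variable has an instrumental control incentive only if it lies on a directed path from the decision to the utility. The toy graph below is a generic example rather than the post’s diagram.

```python
# Sketch of the single-agent graphical check for an instrumental control
# incentive.  The toy graph is generic, not the post's diagram.
import networkx as nx

def has_instrumental_control_incentive(dag, decision, utility, variable):
    # In a DAG, "decision can reach variable" and "variable can reach utility"
    # together imply a directed path decision -> ... -> variable -> ... -> utility.
    return nx.has_path(dag, decision, variable) and nx.has_path(dag, variable, utility)

toy = nx.DiGraph([("D", "X"), ("X", "U"), ("D", "U")])
print(has_instrumental_control_incentive(toy, "D", "U", "X"))  # True
```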
Thanks to our algorithms, we can now crisply explain why this is an incorrect model of the system: it does not model A as a decision, nor Y as its utility. This modelling mistake would be avoided by applying Algorithms 1 and 2 to the underlying system, which produce the correct diagram (with A as a decision and Y as its utility). The correct diagram has two agents, so it’s not possible to apply the single-agent instrumental control incentive. Instead, an incentive concept suitable for multi-agent systems would need to be developed. A key criterion for such a fruitful multi-agent incentives concept is that it captures the influence on S and R jointly exerted by A and Q.
Discussion
Whether a variable is a decision, utility or a chance variable is relative to the overall choice of variables. This choice represents a frame in which to model that system, and the notions of decision and utility make sense only with reference to this frame. See Appendix C in our paper for some examples of this relativism.
Our work suggests some modelling advice for practitioners: it is good practice to clarify whether a variable is object-level or a mechanism, and to distinguish whether a variable is a utility or merely instrumental for some downstream utility.
Conclusion
We proposed the first formal causal definition of agents. Grounded in causal discovery, our key insight is that agents are systems that adapt their behaviour in response to changes in how their actions influence the world. Indeed, our Algorithms 1 and 2 describe a precise experimental process for assessing whether something is an agent. Our process is largely consistent with previous informal characterisations of agents, but formalising it makes it more precise and enables agents to be identified empirically.
As illustrated with an example above, our work improves the reliability of methods building on causal models of AI systems, such as analyses of the safety and fairness of machine learning algorithms (the paper contains additional examples).
Overall, we’ve found that causality is a useful framework for discovering whether there is an agent in a system – a key concern for assessing risks from AGI.
Excited to learn more? Check out our paper. Feedback and comments are most welcome.