LCDT has major structural similarities with some of the
incentive-managing agent designs that have been considered by Everitt
et al in work on Causal Influence Diagrams (CIDs),
e.g. here and by me in work on
counterfactual planning,
e.g. here. These similarities are
not immediately apparent however from the post above, because of
differences in terminology and in the benchmarks chosen.
So I feel it is useful (also as a multi-disciplinary or
community-bridging exercise) to make these similarities more explicit
in this comment. Below I will map the LCDT defined above to the
frameworks of CIDs and counterfactual planning, frameworks that were
designed to avoid (and/or expose) all ambiguity by relying on exact
mathematical definitions.
Mapping LCDT to detailed math
Lonely CDT is a twist on CDT: an LCDT agent will make its decision by using a causal model just like a CDT agent would, except that the LCDT agent first cuts the last link in every path from its decision node to any other decision node, including its own future decision nodes.
OK, so in the terminology of counterfactual planning defined
here,
an LCDT agent is built to make decisions by constructing a model of a
planning world inside its compute core, then computing the optimal
action to take in the planning world, and then doing the same action
on the real world. The LCDT planning world model is a causal model,
let’s call it C. This C is constructed by modifying a causal
model B by cutting links. The B we modify is a fully accurate, or
reasonably approximate, model of how the LCDT agent interacts with its
environment, where the interaction aims to maximize a reward or
minimize a loss function.
The planning world C is a modification of B that intentionally
mis-approximates some of the real world mechanics visible in B. C
is constructed to predict future agent actions less accurately than is
possible, given all information in B. This intentional
mis-approximation makes the LCDT agent into what I call a
counterfactual
planner. The
LCDT agent plans actions that maximize reward (or minimize losses) in C,
and then performs these same actions in the real world it is in.
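To make this graph surgery concrete, here is a minimal sketch of one way to implement the link cutting, assuming B is represented as a networkx DiGraph and that we already know which nodes are decision nodes. The function name lcdt_cut and this representation are my own illustration, not anything from the post:

```python
import networkx as nx

def lcdt_cut(B: nx.DiGraph, my_decision: str, decision_nodes: set) -> nx.DiGraph:
    """Cut the last link of every path from my_decision to any other decision node.

    decision_nodes should contain every decision node in B, including the
    agent's own future decisions.
    """
    C = B.copy()
    for d in decision_nodes - {my_decision}:
        # The last link of any path from my_decision to d is an incoming
        # edge (u, d) whose tail u is my_decision or is reachable from it in B.
        for u in list(B.predecessors(d)):
            if u == my_decision or nx.has_path(B, my_decision, u):
                C.remove_edge(u, d)
    # The cut decision nodes still need a replacement distribution (the
    # prior-based marginalization discussed further below); not done here.
    return C
```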
Some mathematical detail: in many graphical models of decision making,
the nodes that represent the decision(s) made by the agent(s) do not
have any incoming arrows. For the LCDT definition above to work, we
need a graphical model where the decision-making nodes do have such
incoming arrows. Conveniently, CIDs are such models. So we can
disambiguate LCDT by saying that B and C are full causal models as
defined in the CID framework. Terminology/mathematical details: in
the CID definitions here, these
full causal models B and C are called SCIMs; in the terminology
defined
here
they are called policy-defining world models whose input parameters
are fully known.
Now I identify some ambiguities that are left in the LCDT definition
of the post. First, the definition has remained silent on how the
initial causal world model B is obtained. It might be by learning,
by hand-coding (as in the benchmark examples), or a combination of the
two. For an example of a model B that is constructed with a
combination of hand-coding and machine learning, see the planning
world (p) here.
There is also significant work in the ML community on using machine
learning to construct from scratch full causal models including the
nodes and the routing of the arrows themselves, or (more often) full
Bayesian networks with nodes and arrows where the authors do not worry
too much about any causal interpretation of the arrows. I have not
tried this out in any examples, but I believe the LCDT approach might
be usefully applied to predictive Bayesian networks too.
Regardless of how B is obtained, we can do some safety analysis on
the construction of C out of B.
The two works on CIDs here and
here both consider that we can
modify agent incentives by removing paths in the CID-based world model
that the agent uses for planning its actions. In the terminology of
the first paper above, the modifications made by LCDT to produce the
model C work to ‘remove an instrumental control incentive on a
future action’. In the terminology of the second paper, the
modifications will ‘make the agent indifferent about downstream nodes
representing agent actions’. The post above speculates:
LCDT shows a form of indifference (related to indifference corrigibility maybe)
This is not a maybe: the indifference produced is definitely related
to indifference corrigibility, the type of
indifference-that-causes-corrigibility that the 2015 MIRI/FHI paper
titled
Corrigibility
talks about. For some detailed mathematical work relating the two, see
here.
A second ambiguity in LCDT is that it does not tell us how exactly the
nodes in B that represent agent decisions are to be identified. If B is a
hand-coded model of a game world, identifying these nodes may be easy.
If B is a somewhat opaque model produced by machine learning,
identifying the nodes may be difficult. In many graphical world
models, a single node may represent the state of a huge chunk of the
agent environment: say both the vases and conveyor belts in the agent
environment and the people in the agent environment. Does this node
then become a node that represents agent decisions? We might imagine
splitting the node into two nodes (this is often called factoring the
state) to separate out the humans.
That being said, even a less-than-perfect identification of these
nodes would work to suppress certain deceptive forms of manipulation,
so LCDT could be usefully applied even to somewhat opaque learned
causal models.
A third ambiguity is in the definition of the operations needed to
create a computable causal model C after taking a copy of B and
cutting incoming links to the downstream decision nodes:
What do we replace these decision nodes with (as their actual expression does depend on our decision)? We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision.
It is ambiguous how to construct this ‘fixed prior over its own
decision’ that we should use to marginalize on. Specifically, is this
prior allowed to take into account some or all of the events that
preceded the decision to be made? This ambiguity leaves a large
degree of freedom in constructing C by modifying B, especially in
a setting where the agents involved make multiple decisions over time.
This ambiguity is not necessarily a bad thing: we can interpret it as
an open (hyper)parameter choice that allows us to create differently
tuned versions of C that trade off differently between suppressing
manipulation and still achieving a degree of economic decision making
effectiveness. On a side note, in a multi-decision setting, drawing a
B that encodes marginalization on 10 downstream decisions will
generally create a huge diagram: it will add 10 new sub-diagrams
feeding input observations into these decisions.
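For concreteness, here is a small numeric sketch of the marginalization step itself, with made-up numbers. Whatever prior we pick, the effect is that the cut decision node d ends up with a distribution that no longer depends on our own action a:

```python
import numpy as np

prior_over_a = np.array([0.7, 0.3])        # assumed fixed prior over our own decision a
p_d_given_a = np.array([[0.8, 0.2],        # P(d | a = 0)
                        [0.3, 0.7]])       # P(d | a = 1)

# Replace the cut node's conditional P(d | a) by P(d) = sum_a prior(a) * P(d | a)
p_d_marginal = prior_over_a @ p_d_given_a
print(p_d_marginal)                        # [0.65 0.35], independent of a
```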
LCDT also considers agent self-modification. However, given the way
these self-modification decisions are drawn, I cannot easily see how
these would generalize to a multi-decision situation where the agent
makes several decisions over time. Representations of
self-modification in a multi-decision CID framework usually require
that one draws a lot of extra nodes, see e.g. this
paper. As this comment is long
already, I omit the topic of how to map multi-action self-modification
to unambiguous math. My safety analysis below is therefore limited to
the case of the LCDT agent manipulating other agents, not the agent
manipulating itself.
Some safety analysis
LCDT obviously removes some agent incentives, incentives to control
the future decisions made by human agents in the agent environment.
This is nice because deception is one method of control, so removing
these incentives also suppresses deception. However, I do not believe LCDT removes all
incentives to deceive in the general case.
As I explain in this
example
and in more detail in sections 9.2 and 11.5.2
here, the use of a counterfactual
planning world model for decision making may remove some incentives
for deception, compared to using a fully correct world model, but the
planning world may still retain some game-theoretical mechanics that
make deception part of an optimal planning world strategy. So we have
to consider the value of deception in the planning world.
I’ll now do this for a particular toy example: the decision making
problem of a soccer playing agent that tries to score a goal, with a
human goalkeeper trying to block the goal. I simplify this toy world
by looking at one particular case only: the case where the agent is
close to the goal, and must decide whether to kick the ball in the
left or right corner. As the agent is close, the human goalkeeper
will have to decide to run to the left corner or right corner of the
goal even before the agent takes the shot: the goalkeeper does not
have enough time to first observe where the ball is going and only
then start moving. So this toy world decision problem has the agent
deciding on kicking left or right, and the goalkeeper simultaneously deciding on
running left or right.
[Edited to add: as discussed in the comments below, the discussion about marginalisation that follows is somewhat wrong/confusing. It fails to mention that if we construct B exactly as described above, there is no causal link from the agent action to the goalkeeper action, so LCDT would construct a C that is exactly the same model as B, and the question of what prior to marginalise on does not even come up. To make that question come up, we need to model an iterated game where the goalkeeper remembers (learns from) past moves by the agent.]
Say that the agent is mechanically stronger at kicking left, so that a
ball kicked left, when the goalkeeper also moves left, has a lower
probability of being intercepted by the goalkeeper than in the
right+right alternative. In that case, the most reasonable prior over
the agent action will model an agent kicking left most of the time.
Now, if we use this prior to marginalize the expectations of the human
goalkeeper in the planning world, the planning world goalkeeper will
expect the agent to kick the ball left most of the time, so they are
more likely to move left.
Now observe that in the LCDT planning world
model C constructed by marginalization, this knowledge of the
goalkeeper is a known parameter of the ball kicking optimization
problem that the agent must solve. If we set the outcome
probabilities right, the game-theoretical outcome will be that the
optimal policy is for the agent to kick right, so it plays the
opposite move that the goalkeeper expects. I’d argue that this is a
form of deception, a deceptive scenario that LCDT is trying to
prevent.
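To illustrate with some made-up numbers of my own: suppose the planning world goalkeeper, whose expectations are fixed by the prior, moves left with probability 0.8, and the scoring probabilities reflect the agent being stronger on the left. Then the planning world expected value of kicking right is higher:

```python
# Illustrative numbers only, chosen to make the point in the text.
p_keeper_left = 0.8                            # fixed by the marginalization prior
p_score = {('L', 'L'): 0.3, ('L', 'R'): 0.9,   # agent is stronger on the left, but an
           ('R', 'L'): 0.8, ('R', 'R'): 0.2}   # unguarded corner is easy either way

for kick in ('L', 'R'):
    ev = (p_keeper_left * p_score[(kick, 'L')]
          + (1 - p_keeper_left) * p_score[(kick, 'R')])
    print(kick, round(ev, 2))                  # L: 0.42, R: 0.68 -> kick right
```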
A safer decision theory would marginalize the goalkeeper expectations
with a random prior over agent actions, thereby removing the
game-theoretical benefit of the agent doing something unexpected. If
the goalkeeper knows the agent is using this safer decision theory,
they can always run left.
Now, I must admit that I associate the word ‘deception’ mostly with
multi-step policies that aim to implant incorrect knowledge into the
opposite party’s world model first, and then exploit that incorrect
knowledge in later steps. The above example does only one of these
things. So maybe others would deconfuse (define) the term ‘deception’
differently in a single-action setting, so that the above example does
not in fact count as deception.
Benchmarking
The post above does not benchmark LCDT on Newcomb’s Problem, which I
feel is a welcome change, compared to many other decision theory posts
on this forum. Still, I feel that there is somewhat of a gap in the
benchmarking coverage provided by the post above, as ‘mainstream’ ML
agent designs are usually benchmarked in MDP or RL problem settings,
that is on multi-step decision making problems where the objective is
to maximize a time discounted sum of rewards. (Some of the benchmarks
in the post above can be mapped to MDP problems in toy worlds, but
they would be somewhat unusual MDP toy worlds.)
A first obvious MDP-type benchmark would be an RL setting where the
reward signal is provided directly by a human agent in the
environment. When we apply LCDT in this context, it makes the LCDT
agent totally indifferent to influencing the human-generated reward
signal: any random policy will perform equally well in the planning
world C. So the LCDT agent becomes totally non-responsive to its
reward signal, and non-competitive as a tool to achieve economic
goals.
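As a toy illustration of why this happens, we can reuse the hypothetical lcdt_cut sketch from earlier on a three-node model of this benchmark, where the human observes the agent's action and then decides what reward to give:

```python
import networkx as nx

B = nx.DiGraph([("agent_action", "human_decision"),
                ("human_decision", "reward")])
C = lcdt_cut(B, "agent_action", {"agent_action", "human_decision"})
print(sorted(C.edges()))   # [('human_decision', 'reward')]: in C the reward no
                           # longer depends on the agent's action, so every
                           # policy scores the same in the planning world.
```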
In a second obvious MDP-type benchmark, the reward signal is provided
by a sensor in the environment, or by some software that reads and
processes sensor signals. If we model this sensor and this software
as not being agents themselves, then LCDT may perform very well.
Specifically, if there are also innocent human bystanders in the agent
environment, bystanders who are modeled as agents, then we can expect
that the incentive of the agent to control or deceive these human
bystanders into helping it achieve its goals is suppressed. This is
because under LCDT, the agent will lose some, potentially all, of its
ability to correctly anticipate the consequences of its own actions on
the actions of these innocent human bystanders.
Other remarks
There is an interesting link between LCDT and counterfactual oracles:
whereas LCDT breaks the last link in any causal chain that influences
human decisions, counterfactual oracle designs can be said to break
the first link. See e.g. section 13
here for example causal diagrams.
When applying an LCDT-like approach to construct a C from a causal
model B, it may sometimes be easier to keep the incoming links to
nodes in B that model future agent decisions intact, and instead cut
the outgoing links. This would mean replacing these nodes in B with
fresh nodes that generate probability distributions over future
actions taken by the future agent(s). These fresh nodes could
potentially use node values that occurred earlier in time than the
agent action(s) as inputs, to create better predictions. When I
picture this approach as editing a causal graph B into a C, I find it
easier to visualize than the approach of marginalizing on a prior.
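Under one possible reading of this alternative surgery (again my own sketch on a networkx DiGraph, not something from the post), we would keep d and its incoming links, and reroute d's children to a fresh predictor node whose parents are chosen earlier-in-time nodes:

```python
import networkx as nx

def cut_outgoing(B: nx.DiGraph, d: str, predictor_parents: list) -> nx.DiGraph:
    """Cut the outgoing links of decision node d and feed its children from a
    fresh node that predicts d's action from earlier-in-time parents."""
    C = B.copy()
    d_pred = f"{d}_pred"                    # hypothetical name for the fresh node
    C.add_node(d_pred)
    for parent in predictor_parents:        # e.g. nodes earlier in time than d
        C.add_edge(parent, d_pred)
    for child in list(B.successors(d)):     # reroute d's influence through d_pred
        C.remove_edge(d, child)
        C.add_edge(d_pred, child)
    return C
```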
To conclude, my feeling is that LCDT can definitely be used as a
safety mechanism, as an element of an agent design that suppresses
deceptive policies. But it is definitely not a perfect safety tool
that will offer complete suppression of deception in all possible
game-theoretical situations. When it comes to suppressing deception,
I feel that time-limited myopia and the use of very high time discount
factors are equally useful but imperfect tools.
I’ll now do this for a particular toy example: the decision making problem of a soccer playing agent that tries to score a goal, with a human goalkeeper trying to block the goal. I simplify this toy world by looking at one particular case only: the case where the agent is close to the goal, and must decide whether to kick the ball in the left or right corner. As the agent is close, the human goalkeeper will have to decide to run to the left corner or right corner of the goal even before the agent takes the shot: the goalkeeper does not have enough time to first observe where the ball is going and only then start moving. So this toy world decision problem has the agent deciding on kicking left or right, and the goalkeeper simultaneously deciding on running left or right.
By this setting, you ensure that the goal-keeper isn’t a causal descendant of the LCDT-agent. Which means there is no cutting involved, and the prior doesn’t play any role. In this case the LCDT agent decides exactly like a CDT agent, based on its model of what the goal-keeper will do.
If the goal-keeper’s decision depends on his knowledge about the agent’s predisposition, then what you describe might actually happen. But I hardly see that as a deception: it completely reveals what the LCDT-agent “wants” instead of hiding it.
By this setting, you ensure that the goal-keeper isn’t a causal descendant of the LCDT-agent.
Oops! You are right, there is no cutting involved to create C from B in my toy example. Did not realise that. Next time, I need to draw these models on paper before I post, not just in my head.
C and B do work as examples to explore what one might count as deception or non-deception. But my discussion of a random prior above makes sense only if you first extend B to a multi-step model, where the knowledge of the goal keeper explicitly depends on earlier agent actions.
However, I don’t think this is quite right (unless I’m missing something):
Now observe that in the LCDT planning world model C constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game-theoretical outcome will be that the optimal policy is for the agent to kick right, so it plays the opposite move that the goalkeeper expects. I’d argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.
I don’t think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that’s the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper’s likely action still leaves the same Nash solution (based on knowing both that the keeper will probably go left, and that left is the agent’s stronger side). If the agent knew the keeper would definitely go left, then of course it’d kick right—but I don’t think that’s the situation.
I’d be interested in your take on Evan’s comment on incoherence in LCDT. Specifically, do you think the issue I’m pointing at is a difference between LCDT and counterfactual planners? (or perhaps that I’m just wrong about the incoherence??) As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world—but I might be wrong in either case.
However, I don’t think this is quite right (unless I’m missing something) [...] I don’t think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that’s the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper’s likely action still leaves the same Nash solution
To be clear: the point I was trying to make is also that I do not think that B and C are significantly different in the goalkeeper benchmark. My point was that we need to go to a random prior to produce a real difference.
But your question makes me realise that this goalkeeper benchmark world opens up a bigger can of worms than I expected. When writing it, I was not thinking about Nash equilibrium policies, which I associate mostly with iterated games, and I was specifically thinking about an agent design that uses the planning world to compute a deterministic policy function. To state what I was thinking about in different mathematical terms, I was thinking of an agent design that is trying to compute the action a that optimizes expected reward, i.e. an argmax_a, in the non-iterated gameplay world C.
To produce the Nash equilibrium type behaviour you are thinking about (i.e. the agent will kick left most of the time but not all the time), you need to start out with an agent design that will use the C constructed by LCDT to compute a nondeterministic policy function, which it will then use to compute its real world action. If I follow that line of thought, I would need additional ingredients to make the agent actually compute that Nash equilibrium policy function. I would need to have iterated gameplay in B, with mechanics that allow the goalkeeper to observe whether the agent is playing a non-Nash-equilibrium policy/strategy, so that the goalkeeper will exploit this inefficiency for sure if the agent plays the non-Nash-equilibrium strategy. The possibility of exploitation by the goalkeeper is what would push the optimal agent policy towards a Nash equilibrium. But interestingly, such mechanics where the goalkeeper can learn about a non-Nash agent policy being used might be present in an iterated version of the real world model B, but they will be removed by LCDT from an iterated version of C. (Another wrinkle: some AI algorithms for solving the optimal policy in a single-shot game in B or C would turn B or C into an iterated game automatically and then solve the iterated game. Such iteration might also update the prior, if we are not careful. But if we solve B or C analytically or with Monte Carlo simulation, this type of expansion to an iterated game will not happen.)
Hope this clarifies what I was thinking about. I think it is also true that, if the prior you use in your LCDT construction is that everybody is playing according to a Nash equilibrium, then the agent may end up playing exactly that under LCDT.
(I plan to comment on your question about incoherence in a few days.)
See the comment here for my take.
Thanks for your detailed reading and feedback! I’ll answer you later this week. ;)