LCDT has major structural similarities with some of the
incentive-managing agent designs that have been considered by Everitt
et al in work on Causal Influence Diagrams (CIDs),
e.g. here and by me in work on
counterfactual planning,
e.g. here. These similarities are
not immediately apparent however from the post above, because of
differences in terminology and in the benchmarks chosen.
So I feel it is useful (also as a multi-disciplinary or
community-bridging exercise) to make these similarities more explicit
in this comment. Below I will map the LCDT defined above to the
frameworks of CIDs and counterfactual planning, frameworks that were
designed to avoid (and/or expose) all ambiguity by relying on exact
mathematical definitions.
Mapping LCDT to detailed math
Lonely CDT is a twist on CDT: an LCDT agent will make its decision by using a causal model just like a CDT agent would, except that the LCDT agent first cuts the last link in every path from its decision node to any other decision node, including its own future decision nodes.
OK, so in the terminology of counterfactual planning defined
here,
an LCDT agent is built to make decisions by constructing a model of a
planning world inside its compute core, then computing the optimal
action to take in the planning world, and then doing the same action
on the real world. The LCDT planning world model is a causal model,
let’s call it C. This C is constructed by modifying a causal
model B by cutting links. The B we modify is a fully accurate, or
reasonably approximate, model of how the LCDT agent interacts with its
environment, where the interaction aims to maximize a reward or
minimize a loss function.
The planning world C is a modification of B that intentionally
mis-approximates some of the real world mechanics visible in B. C
is constructed to predict future agent actions less accurately than is
possible, given all information in B. This intentional
mis-approximation makes the LCDT agent into what I call a
counterfactual
planner. The
LCDT agent plans actions that maximize reward (or minimize losses) in C,
and then performs these same actions in the real world it is in.
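To make this graph surgery concrete, here is a minimal sketch of one way to implement the link cutting, assuming B is represented as a networkx DiGraph and that we already know which nodes are decision nodes. The function name lcdt_cut and this representation are my own illustration, not anything from the post:

```python
import networkx as nx

def lcdt_cut(B: nx.DiGraph, my_decision: str, decision_nodes: set) -> nx.DiGraph:
    """Cut the last link of every path from my_decision to any other decision node.

    decision_nodes should contain every decision node in B, including the
    agent's own future decisions.
    """
    C = B.copy()
    for d in decision_nodes - {my_decision}:
        # The last link of any path from my_decision to d is an incoming
        # edge (u, d) whose tail u is my_decision or is reachable from it in B.
        for u in list(B.predecessors(d)):
            if u == my_decision or nx.has_path(B, my_decision, u):
                C.remove_edge(u, d)
    # The cut decision nodes still need a replacement distribution (the
    # prior-based marginalization discussed further below); not done here.
    return C
```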
Some mathematical detail: in many graphical models of decision making,
the nodes that represent the decision(s) made by the agent(s) do not
have any incoming arrows. For the LCDT definition above to work, we
need a graphical model where the decision-making nodes do have such
incoming arrows. Conveniently, CIDs are such models. So we can
disambiguate LCDT by saying that B and C are full causal models as
defined in the CID framework. Terminology/mathematical details: in
the CID definitions here, these
full causal models B and C are called SCIMs; in the terminology
defined
here
they are called policy-defining world models whose input parameters
are fully known.
Now I identify some ambiguities that are left in the LCDT definition
of the post. First, the definition has remained silent on how the
initial causal world model B is obtained. It might be by learning,
by hand-coding (as in the benchmark examples), or a combination of the
two. For an example of a model B that is constructed with a
combination of hand-coding and machine learning, see the planning
world (p) here.
There is also significant work in the ML community on using machine
learning to construct from scratch full causal models including the
nodes and the routing of the arrows themselves, or (more often) full
Bayesian networks with nodes and arrows where the authors do not worry
too much about any causal interpretation of the arrows. I have not
tried this out in any examples, but I believe the LCDT approach might
be usefully applied to predictive Bayesian networks too.
Regardless of how B is obtained, we can do some safety analysis on
the construction of C out of B.
The two works on CIDs here and
here both consider that we can
modify agent incentives by removing paths in the CID-based world model
that the agent uses for planning its actions. In the terminology of
the first paper above, the modifications made by LCDT to produce the
model C work to ‘remove an instrumental control incentive on a
future action’. In the terminology of the second paper, the
modifications will ‘make the agent indifferent about downstream nodes
representing agent actions’. The post above speculates:
LCDT shows a form of indifference (related to indifference corrigibility maybe)
This is not a maybe: the indifference produced is definitely related
to indifference corrigibility, the type of
indifference-that-causes-corrigibility that the 2015 MIRI/FHI paper
titled
Corrigibility
talks about. For some detailed mathematical work relating the two, see
here.
A second ambiguity in LCDT is that it does not tell us how exactly the
nodes in B that represent agent decisions are to be identified. If B is a
hand-coded model of a game world, identifying these nodes may be easy.
If B is a somewhat opaque model produced by machine learning,
identifying the nodes may be difficult. In many graphical world
models, a single node may represent the state of a huge chunk of the
agent environment: say both the vases and conveyor belts in the agent
environment and the people in the agent environment. Does this node
then become a node that represents agent decisions? We might imagine
splitting the node into two nodes (this is often called factoring the
state) to separate out the humans.
That being said, even a less-than-perfect identification of these
nodes would work to suppress certain deceptive forms of manipulation,
so LCDT could be usefully applied even to somewhat opaque learned
causal models.
A third ambiguity is in the definition of the operations needed to
create a computable causal model C after taking a copy of B and
cutting incoming links to the downstream decision nodes:
What do we replace these decision nodes with (as their actual expression does depend on our decision)? We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision.
It is ambiguous how to construct this ‘fixed prior over its own
decision’ that we should use to marginalize on. Specifically, is this
prior allowed to take into account some or all of the events that
preceded the decision to be made? This ambiguity leaves a large
degree of freedom in constructing C by modifying B, especially in
a setting where the agents involved make multiple decisions over time.
This ambiguity is not necessarily a bad thing: we can interpret it as
an open (hyper)parameter choice that allows us to create differently
tuned versions of C that trade off differently between suppressing
manipulation and still achieving a degree of economic decision making
effectiveness. On a side note, in a multi-decision setting, drawing a
B that encodes marginalization on 10 downstream decisions will
generally create a huge diagram: it will add 10 new sub-diagrams
feeding input observations into these decisions.
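For concreteness, here is a small numeric sketch of the marginalization step itself, with made-up numbers. Whatever prior we pick, the effect is that the cut decision node d ends up with a distribution that no longer depends on our own action a:

```python
import numpy as np

prior_over_a = np.array([0.7, 0.3])        # assumed fixed prior over our own decision a
p_d_given_a = np.array([[0.8, 0.2],        # P(d | a = 0)
                        [0.3, 0.7]])       # P(d | a = 1)

# Replace the cut node's conditional P(d | a) by P(d) = sum_a prior(a) * P(d | a)
p_d_marginal = prior_over_a @ p_d_given_a
print(p_d_marginal)                        # [0.65 0.35], independent of a
```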
LCDT also considers agent self-modification. However, given the way
these self-modification decisions are drawn, I cannot easily see how
these would generalize to a multi-decision situation where the agent
makes several decisions over time. Representations of
self-modification in a multi-decision CID framework usually require
that one draws a lot of extra nodes, see e.g. this
paper. As this comment is long
already, I omit the topic of how to map multi-action self-modification
to unambiguous math. My safety analysis below is therefore limited to
the case of the LCDT agent manipulating other agents, not the agent
manipulating itself.
Some safety analysis
LCDT obviously removes some agent incentives, incentives to control
the future decisions made by human agents in the agent environment.
This is nice because deception is one method of control, so removing
these incentives also suppresses deception. However, I do not believe LCDT removes all
incentives to deceive in the general case.
As I explain in this
example
and in more detail in sections 9.2 and 11.5.2
here, the use of a counterfactual
planning world model for decision making may remove some incentives
for deception, compared to using a fully correct world model, but the
planning world may still retain some game-theoretical mechanics that
make deception part of an optimal planning world strategy. So we have
to consider the value of deception in the planning world.
I’ll now do this for a particular toy example: the decision making
problem of a soccer playing agent that tries to score a goal, with a
human goalkeeper trying to block the goal. I simplify this toy world
by looking at one particular case only: the case where the agent is
close to the goal, and must decide whether to kick the ball in the
left or right corner. As the agent is close, the human goalkeeper
will have to decide to run to the left corner or right corner of the
goal even before the agent takes the shot: the goalkeeper does not
have enough time to first observe where the ball is going and only
then start moving. So this toy world decision problem has the agent
deciding on kicking left or right, and the goalkeeper simultaneously deciding on
running left or right.
[Edited to add: as discussed in the comments below, the discussion about marginalisation that follows is somewhat wrong/confusing. It fails to mention that if we construct B exactly as described above, there is no causal link from the agent action to the goalkeeper action, so LCDT would construct a C that is exactly the same model as B, and the question of what prior to marginalise on does not even come up. To make that question come up, we need to model an iterated game where the goalkeeper remembers (learns from) past moves by the agent.]
Say that the agent is mechanically stronger at kicking left, so that a
ball kicked left, when the goalkeeper also moves left, has a lower
probability of being intercepted by the goalkeeper than in the
right+right alternative. In that case, the most reasonable prior over
the agent action will model an agent kicking left most of the time.
Now, if we use this prior to marginalize the expectations of the human
goalkeeper in the planning world, the planning world goalkeeper will
expect the agent to kick the ball left most of the time, so they are
more likely to move left.
Now observe that in the LCDT planning world
model C constructed by marginalization, this knowledge of the
goalkeeper is a known parameter of the ball kicking optimization
problem that the agent must solve. If we set the outcome
probabilities right, the game-theoretical outcome will be that the
optimal policy is for the agent to kick right, so it plays the
opposite move that the goalkeeper expects. I’d argue that this is a
form of deception, a deceptive scenario that LCDT is trying to
prevent.
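To illustrate with some made-up numbers of my own: suppose the planning world goalkeeper, whose expectations are fixed by the prior, moves left with probability 0.8, and the scoring probabilities reflect the agent being stronger on the left. Then the planning world expected value of kicking right is higher:

```python
# Illustrative numbers only, chosen to make the point in the text.
p_keeper_left = 0.8                            # fixed by the marginalization prior
p_score = {('L', 'L'): 0.3, ('L', 'R'): 0.9,   # agent is stronger on the left, but an
           ('R', 'L'): 0.8, ('R', 'R'): 0.2}   # unguarded corner is easy either way

for kick in ('L', 'R'):
    ev = (p_keeper_left * p_score[(kick, 'L')]
          + (1 - p_keeper_left) * p_score[(kick, 'R')])
    print(kick, round(ev, 2))                  # L: 0.42, R: 0.68 -> kick right
```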
A safer decision theory would marginalize the goalkeeper expectations
with a random prior over agent actions, thereby removing the
game-theoretical benefit of the agent doing something unexpected. If
the goalkeeper knows the agent is using this safer decision theory,
they can always run left.
Now, I must admit that I associate the word ‘deception’ mostly with
multi-step policies that aim to implant incorrect knowledge into the
opposite party’s world model first, and then exploit that incorrect
knowledge in later steps. The above example does only one of these
things. So maybe others would deconfuse (define) the term ‘deception’
differently in a single-action setting, so that the above example does
not in fact count as deception.
Benchmarking
The post above does not benchmark LCDT on Newcomb’s Problem, which I
feel is a welcome change, compared to many other decision theory posts
on this forum. Still, I feel that there is somewhat of a gap in the
benchmarking coverage provided by the post above, as ‘mainstream’ ML
agent designs are usually benchmarked in MDP or RL problem settings,
that is on multi-step decision making problems where the objective is
to maximize a time discounted sum of rewards. (Some of the benchmarks
in the post above can be mapped to MDP problems in toy worlds, but
they would be somewhat unusual MDP toy worlds.)
A first obvious MDP-type benchmark would be an RL setting where the
reward signal is provided directly by a human agent in the
environment. When we apply LCDT in this context, it makes the LCDT
agent totally indifferent to influencing the human-generated reward
signal: any random policy will perform equally well in the planning
world C. So the LCDT agent becomes totally non-responsive to its
reward signal, and non-competitive as a tool to achieve economic
goals.
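As a toy illustration of why this happens, we can reuse the hypothetical lcdt_cut sketch from earlier on a three-node model of this benchmark, where the human observes the agent's action and then decides what reward to give:

```python
import networkx as nx

B = nx.DiGraph([("agent_action", "human_decision"),
                ("human_decision", "reward")])
C = lcdt_cut(B, "agent_action", {"agent_action", "human_decision"})
print(sorted(C.edges()))   # [('human_decision', 'reward')]: in C the reward no
                           # longer depends on the agent's action, so every
                           # policy scores the same in the planning world.
```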
In a second obvious MDP-type benchmark, the reward signal is provided
by a sensor in the environment, or by some software that reads and
processes sensor signals. If we model this sensor and this software
as not being agents themselves, then LCDT may perform very well.
Specifically, if there are also innocent human bystanders in the agent
environment, bystanders who are modeled as agents, then we can expect
that the incentive of the agent to control or deceive these human
bystanders into helping it achieve its goals is suppressed. This is
because under LCDT, the agent will lose some, potentially all, of its
ability to correctly anticipate the consequences of its own actions on
the actions of these innocent human bystanders.
Other remarks
There is an interesting link between LCDT and counterfactual oracles:
whereas LCDT breaks the last link in any causal chain that influences
human decisions, counterfactual oracle designs can be said to break
the first link. See e.g. section 13
here for example causal diagrams.
When applying an LCDT-like approach to construct a C from a causal
model B, it may sometimes be easier to keep the incoming links to
nodes in B that model future agent decisions intact, and instead cut
the outgoing links. This would mean replacing these nodes in B with
fresh nodes that generate probability distributions over future
actions taken by the future agent(s). These fresh nodes could
potentially use node values that occurred earlier in time than the
agent action(s) as inputs, to create better predictions. When I
picture this approach as editing a causal graph B into a C, I find it
easier to visualize than the approach of marginalizing on a prior.
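Under one possible reading of this alternative surgery (again my own sketch on a networkx DiGraph, not something from the post), we would keep d and its incoming links, and reroute d's children to a fresh predictor node whose parents are chosen earlier-in-time nodes:

```python
import networkx as nx

def cut_outgoing(B: nx.DiGraph, d: str, predictor_parents: list) -> nx.DiGraph:
    """Cut the outgoing links of decision node d and feed its children from a
    fresh node that predicts d's action from earlier-in-time parents."""
    C = B.copy()
    d_pred = f"{d}_pred"                    # hypothetical name for the fresh node
    C.add_node(d_pred)
    for parent in predictor_parents:        # e.g. nodes earlier in time than d
        C.add_edge(parent, d_pred)
    for child in list(B.successors(d)):     # reroute d's influence through d_pred
        C.remove_edge(d, child)
        C.add_edge(d_pred, child)
    return C
```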
To conclude, my feeling is that LCDT can definitely be used as a
safety mechanism, as an element of an agent design that suppresses
deceptive policies. But it is definitely not a perfect safety tool
that will offer complete suppression of deception in all possible
game-theoretical situations. When it comes to suppressing deception,
I feel that time-limited myopia and the use of very high time discount
factors are equally useful but imperfect tools.
I’ll now do this for a particular toy example: the decision making problem of a soccer playing agent that tries to score a goal, with a human goalkeeper trying to block the goal. I simplify this toy world by looking at one particular case only: the case where the agent is close to the goal, and must decide whether to kick the ball in the left or right corner. As the agent is close, the human goalkeeper will have to decide to run to the left corner or right corner of the goal even before the agent takes the shot: the goalkeeper does not have enough time to first observe where the ball is going and only then start moving. So this toy world decision problem has the agent deciding on kicking left or right, and the goalkeeper simultaneously deciding on running left or right.
By this setting, you ensure that the goal-keeper isn’t a causal descendant of the LCDT-agent. Which means there is no cutting involved, and the prior doesn’t play any role. In this case the LCDT agent decides exactly like a CDT agent, based on its model of what the goal-keeper will do.
If the goal-keeper’s decision depends on his knowledge about the agent’s predisposition, then what you describe might actually happen. But I hardly see that as a deception: it completely reveals what the LCDT-agent “wants” instead of hiding it.
By this setting, you ensure that the goal-keeper isn’t a causal descendant of the LCDT-agent.
Oops! You are right, there is no cutting involved to create C from B in my toy example. Did not realise that. Next time, I need to draw these models on paper before I post, not just in my head.
C and B do work as examples to explore what one might count as deception or non-deception. But my discussion of a random prior above makes sense only if you first extend B to a multi-step model, where the knowledge of the goal keeper explicitly depends on earlier agent actions.
However, I don’t think this is quite right (unless I’m missing something):
Now observe that in the LCDT planning world model C constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game-theoretical outcome will be that the optimal policy is for the agent to kick right, so it plays the opposite move that the goalkeeper expects. I’d argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.
I don’t think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that’s the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper’s likely action still leaves the same Nash solution (based on knowing both that the keeper will probably go left, and that left is the agent’s stronger side). If the agent knew the keeper would definitely go left, then of course it’d kick right—but I don’t think that’s the situation.
I’d be interested in your take on Evan’s comment on incoherence in LCDT. Specifically, do you think the issue I’m pointing at is a difference between LCDT and counterfactual planners? (or perhaps that I’m just wrong about the incoherence??) As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world—but I might be wrong in either case.
However, I don’t think this is quite right (unless I’m missing something) [...] I don’t think the situation is significantly different between B and C here. In B, the agent will decide to kick left most of the time since that’s the Nash equilibrium. In C the agent will also decide to kick left most of the time: knowing the goalkeeper’s likely action still leaves the same Nash solution
To be clear: the point I was trying to make is also that I do not think that B and C are significantly different in the goalkeeper benchmark. My point was that we need to go to a random prior to produce a real difference.
But your question makes me realise that this goalkeeper benchmark world opens up a bigger can of worms than I expected. When writing it, I was not thinking about Nash equilibrium policies, which I associate mostly with iterated games, and I was specifically thinking about an agent design that uses the planning world to compute a deterministic policy function. To state what I was thinking about in different mathematical terms, I was thinking of an agent design that is trying to compute the action a that optimizes expected reward, i.e. an argmax_a, in the non-iterated gameplay world C.
To produce the Nash equilibrium type behaviour you are thinking about (i.e. the agent will kick left most of the time but not all the time), you need to start out with an agent design that will use the C constructed by LCDT to compute a nondeterministic policy function, which it will then use to compute its real world action. If I follow that line of thought, I would need additional ingredients to make the agent actually compute that Nash equilibrium policy function. I would need to have iterated gameplay in B, with mechanics that allow the goalkeeper to observe whether the agent is playing a non-Nash-equilibrium policy/strategy, so that the goalkeeper will exploit this inefficiency for sure if the agent plays the non-Nash-equilibrium strategy. The possibility of exploitation by the goalkeeper is what would push the optimal agent policy towards a Nash equilibrium. But interestingly, such mechanics where the goalkeeper can learn about a non-Nash agent policy being used might be present in an iterated version of the real world model B, but they will be removed by LCDT from an iterated version of C. (Another wrinkle: some AI algorithms for solving the optimal policy in a single-shot game in B or C would turn B or C into an iterated game automatically and then solve the iterated game. Such iteration might also update the prior, if we are not careful. But if we solve B or C analytically or with Monte Carlo simulation, this type of expansion to an iterated game will not happen.)
Hope this clarifies what I was thinking about. I think it is also true that, if the prior you use in your LCDT construction is that everybody is playing according to a Nash equilibrium, then the agent may end up playing exactly that under LCDT.
(I plan to comment on your question about incoherence in a few days.)
See the comment here for my take.
Thanks for your detailed reading and feedback! I’ll answer you later this week. ;)