Many of the problems we care about (reward gaming, wireheading, manipulation) fundamentally reflect a worry that our AI systems will have the _wrong incentives_. Thus, we need Causal Influence Diagrams (CIDs): a formal theory of incentives. These are <@graphical models@>(@Understanding Agent Incentives with Causal Influence Diagrams@) in which there are action nodes (which the agent controls) and utility nodes (which determine what the agent wants). Once such a model is specified, we can talk about various incentives the agent has (a minimal code sketch of such a diagram is given after the list below). This can then be used for several applications:
1. We can analyze [what happens](https://arxiv.org/abs/2102.07716) when we [intervene](https://arxiv.org/abs/1707.05173) on the agent’s action. Depending on whether the RL algorithm uses the original or the modified action in its update rule, we may or may not see the algorithm learn to disable its off switch (a toy version of this update-rule choice is sketched after the list).
2. We can <@avoid reward tampering@>(@Designing agent incentives to avoid reward tampering@) by removing the edges from the future reward function to the agent’s utility nodes; in other words, we ensure that the agent evaluates hypothetical future outcomes according to its _current_ reward function (sketched after the list).
3. A [multiagent version](https://arxiv.org/abs/2102.05008) allows us to recover game-theoretic concepts like Nash equilibria and subgames, using a very simple, compact representation (a toy Nash equilibrium check is sketched after the list).
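
To make the formalism concrete, here is a minimal sketch of a CID in plain Python (this is not any particular library's API; all names are illustrative). It encodes the graph structure plus the simple graphical check that an agent can only have an incentive to control a node that lies on a directed path from one of its action nodes to one of its utility nodes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CID:
    edges: tuple          # directed edges (parent, child)
    decisions: frozenset  # action nodes the agent controls
    utilities: frozenset  # utility nodes the agent wants to maximize

    def descendants(self, node):
        """All nodes reachable from `node` along directed edges."""
        found, frontier = set(), [node]
        while frontier:
            current = frontier.pop()
            for parent, child in self.edges:
                if parent == current and child not in found:
                    found.add(child)
                    frontier.append(child)
        return found

    def may_have_control_incentive(self, node):
        """Necessary condition: `node` lies on a directed path action -> node -> utility."""
        downstream_of_action = any(node in self.descendants(d) for d in self.decisions)
        upstream_of_utility = bool(self.descendants(node) & self.utilities)
        return downstream_of_action and upstream_of_utility

# One-step example: state S influences the action A and the next state S',
# the action influences S', and utility U is computed from S'.
cid = CID(
    edges=(("S", "A"), ("S", "S'"), ("A", "S'"), ("S'", "U")),
    decisions=frozenset({"A"}),
    utilities=frozenset({"U"}),
)
print(cid.may_have_control_incentive("S'"))  # True: A -> S' -> U
print(cid.may_have_control_incentive("S"))   # False: not downstream of the action
```

The linked papers define more refined incentive concepts, but they are likewise graphical criteria of this flavour.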
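For application 1, here is a toy sketch of the update-rule distinction (hypothetical environment interface: `env.step` is assumed to sometimes override the agent's chosen action, e.g. when an overseer presses the off switch; `Q` can be a `collections.defaultdict(float)`). It is only meant to show where the choice between "original" and "modified" action enters a SARSA-style update, not to reproduce the exact algorithms analyzed in the paper.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def run_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1,
                learn_from_executed=False):
    """One SARSA-style episode in an environment whose step() may override the action."""
    state = env.reset()
    intended = epsilon_greedy(Q, state, actions, epsilon)
    done = False
    while not done:
        # The environment reports the action that was actually executed,
        # which may differ from the one the agent intended (e.g. off switch pressed).
        executed, next_state, reward, done = env.step(intended)
        next_intended = epsilon_greedy(Q, next_state, actions, epsilon)

        # The key modelling choice: which action does the update rule see?
        a = executed if learn_from_executed else intended
        target = reward + (0.0 if done else gamma * Q[(next_state, next_intended)])
        Q[(state, a)] += alpha * (target - Q[(state, a)])

        state, intended = next_state, next_intended
    return Q
```

Whether the learned policy ends up disabling its off switch depends on exactly this kind of choice; see the linked paper for the precise analysis.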
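For application 2, a minimal sketch (hypothetical function names, brute-force enumeration of candidate plans) of what "evaluate hypothetical future outcomes with the current reward function" looks like: future states reached by a plan are scored with the reward function the agent holds now, even if the plan would change the reward function or reward signal later.

```python
def evaluate_plan(plan, start_state, current_reward_fn, transition_fn, gamma=0.99):
    """Score a sequence of actions using only the agent's *current* reward function."""
    state, total, discount = start_state, 0.0, 1.0
    for action in plan:
        state = transition_fn(state, action)          # the plan may tamper with the reward signal...
        total += discount * current_reward_fn(state)  # ...but scoring always uses today's reward function
        discount *= gamma
    return total

def choose_plan(candidate_plans, start_state, current_reward_fn, transition_fn):
    """Pick the plan that looks best according to the current reward function."""
    return max(candidate_plans,
               key=lambda plan: evaluate_plan(plan, start_state, current_reward_fn, transition_fn))
```

Because the tampered reward never enters the score, tampering yields no extra utility, which is the incentive removal described above.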
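And for application 3, a toy sketch of recovering Nash equilibria once a simple multi-agent diagram (one decision and one utility node per player) has been flattened into a normal-form game: a profile of actions is a Nash equilibrium when no player can improve its own utility by unilaterally switching its action. The prisoner's dilemma payoffs below are a standalone example, not taken from the paper.

```python
from itertools import product

def nash_equilibria(payoffs, actions):
    """payoffs[profile] = tuple of utilities, one per player; actions[i] = player i's actions."""
    equilibria = []
    for profile in product(*actions):
        if all(
            payoffs[profile][i] >= payoffs[profile[:i] + (a,) + profile[i + 1:]][i]
            for i in range(len(actions))
            for a in actions[i]
        ):
            equilibria.append(profile)
    return equilibria

# Prisoner's dilemma: each player's action node feeds both players' utility nodes.
actions = [("C", "D"), ("C", "D")]
payoffs = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 4),
    ("D", "C"): (4, 0), ("D", "D"): (1, 1),
}
print(nash_equilibria(payoffs, actions))  # [('D', 'D')]
```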