We have written a paper that represents various frameworks for designing safe AGI (e.g., RL with reward modeling, CIRL, and debate) as Causal Influence Diagrams (CIDs), to help us compare frameworks and better understand the corresponding agent incentives.
We would love to get comments, especially on:
1. Are the depicted frameworks represented accurately?
2. Is the CID representation helpful?
3. Are there frameworks we did not include that would be useful to model this way?
The paper’s abstract:
Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework. The unified representation permits easy comparison of frameworks and their assumptions. We hope that the diagrams will serve as an accessible and visual introduction to the main AGI safety frameworks.
I really like this layout, this idea, and the diagrams. Great work.
I don’t agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like “how is the automated system not vulnerable to manipulation” and “why do we think the system correctly formally measures the quantity in question?” (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don’t see how to break (and probably not safety measures that don’t break).
Also, on page 10 you write that during deployment, agents appear as if they are optimizing the training reward function. As evhub et al. point out, this isn’t usually true: the objective recoverable from perfect IRL on a trained RL agent is often different (behavioral objective != training objective).
Glad to hear it :)
Yes, the argument is only valid under the assumptions that you mention. Thanks for pointing to the discussion post about the assumptions.
Fair point, we should probably weaken this claim somewhat.
The reason I don’t personally find these kinds of representation super useful is that each of those boxes is a quite complicated function, and what’s in the boxes usually involves many more bits worth of information about an AI system than how the boxes are connected. And sometimes one makes different choices in how to chop an AI’s operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDA are very different choppings-up of the algorithms).
I actually have a draft sitting around of how one might represent value learning schemes with a hierarchical diagram of information flow. Eventually I decided that the idea made lots of sense for a few paradigm cases and was more trouble than it was worth for everything else. When you need to carefully refer to the text description to understand a diagram, that’s a sign that maybe you should use the text description.
This isn’t to say I think one should never see anything like this. Different ways of presenting the same information (like diagrams) can help drive home a particularly important point. But I am skeptical that there’s a one-size-fits-all solution, and instead think that diagram usage should be tailored to the particular point it’s intended to make.
Hey Charlie,
Thanks for your comment! Some replies:
There is definitely a modeling choice involved in choosing how much “to pack” in each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on the key dynamics of each framework.
As for the CIRL and IDA difference, this is a direct effect of the different levels the frameworks are specified at. CIRL is a high-level framework, roughly saying “somehow you infer the human preferences from their actions”. IDA, in contrast, provides a reasonably detailed supervised learning criterion. So I think the frameworks themselves are already like apples and oranges; it’s not just the diagrams. (And this is something you notice when drawing the diagrams.)
We don’t want to claim the CIDs are the one-and-only diagram to always use, but as you mentioned above, they do allow for quite some flexibility in what aspects to highlight.
Interesting. A while back I was looking at information flow diagrams myself, and was surprised to discover how hard it was to make them formally precise (there seems to be no formal semantics for them). In contrast, causal graphs and CIDs have formal semantics, which is quite useful.
For hierarchical representations, there are networks of influence diagrams: https://arxiv.org/abs/1401.3426
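To illustrate the point about formal semantics: because a CID is a directed acyclic graph with typed nodes (chance, decision, utility), simple structural questions about it can be checked mechanically. Below is a minimal Python sketch using a made-up four-node diagram. The node names S, A, R, U and the helper functions are hypothetical and not taken from the paper. It only checks whether a non-decision node lies on a directed path from a decision to a utility node, which is a simplified graphical check and not the paper's full incentive analysis.

```python
from collections import defaultdict

# Hypothetical toy CID (illustrative only, not a diagram from the paper):
# S = state, A = agent's action, R = observed reward signal, U = utility.
KINDS = {"S": "chance", "A": "decision", "R": "chance", "U": "utility"}
EDGES = [("S", "A"), ("A", "R"), ("S", "R"), ("R", "U")]


def children(edges):
    """Build a parent -> set-of-children adjacency map from an edge list."""
    graph = defaultdict(set)
    for parent, child in edges:
        graph[parent].add(child)
    return graph


def reachable(graph, start):
    """Return all nodes reachable from `start` by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def on_decision_to_utility_path(kinds, edges, x):
    """True if non-decision node `x` is downstream of some decision node
    and either is a utility node or has a utility node downstream of it."""
    graph = children(edges)
    decisions = [n for n, k in kinds.items() if k == "decision"]
    utilities = {n for n, k in kinds.items() if k == "utility"}
    downstream_of_decision = any(x in reachable(graph, d) for d in decisions)
    reaches_utility = bool((reachable(graph, x) | {x}) & utilities)
    return downstream_of_decision and reaches_utility


if __name__ == "__main__":
    for node in ("S", "R"):
        print(node, on_decision_to_utility_path(KINDS, EDGES, node))
```

In this toy diagram R sits between the action and the utility while S does not, so the check separates the two. The same question is hard to even pose for an informal box-and-arrow diagram.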
All good points.
The paper you linked was interesting—the graphical model is part of an AI design that actually models other agents using that graph. That might be useful if you’re coding a simple game-playing agent, but I think you’d agree that you’re using CIDs in a more communicative / metaphorical way?
On point 2, which is the only one I can really comment on, yes, this seems like a useful paper, and I buy the argument that such an approach is critical for some purposes, including some of what we discussed on Goodhart’s Law (https://arxiv.org/abs/1803.04585), where one class of misalignment can be explicitly addressed by your approach. Also see the recent paper at https://arxiv.org/abs/1905.12186, which explicitly models causal dependencies (as in its Figure 2) to show a safety result.