When you chain parallel and sequential calls to large language models (e.g. with LangChain), you implicitly create a causal graph that can be analyzed visually if you have the right tracing tools (https://github.com/oughtinc/ice). This notebook describes different agents using an explicit formalism based on causal influence diagrams, which we treat as a notation for the data flow, components and steps involved when a user makes a request. We use the example diagrams to explain and fix risk scenarios, showing how easy it is to debug agent architectures when you can visually reason about the data flow, and we ask questions about intent alignment for AGI in the context of such agents.
Examples and Theory in Colab to Get Started:
This work was done at the Alignment Jam #8 (Verification); the relevant segment starts at 31:43, but the whole event was great: https://youtu.be/XauqlTQm-o4
TODO:
Mechanistic Interpretability: Info-Weighted Attention mechanisms, Info-weighted Averaging (https://youtu.be/etFCaFvt2Ks)
[viz] Animate the temporal dependence if we have timestamps for when each sub-agent process starts; this should be added to the tracing code
[Theory] Study links to Garrabrant’s Temporal Inference with Finite Factored Sets: https://arxiv.org/abs/2109.11513
Answer Set Programming for Automated Verification of Intent Consistency
Brian Muhia, August 2023
The causal influence diagrams introduced here (see the appendix, also here), and the accompanying reasoning that favours certain diagrams over others based on links to the “I” node, are simple enough that we can devise automated rules to check whether a diagram is correct or incorrect. We call this property “intent consistency”. Here we introduce three simple rules, written in the Answer-Set Programming (ASP) formalism and sketched in code below, that
find paths between any two nodes X and Y, then
check if a path exists from the input node “I” to any decision node, and
fail if there is no direct link from the node “I” to any decision node.
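A minimal sketch of these three rules in clingo’s input language is shown below. It assumes the diagram is encoded with hypothetical edge/2 facts for arrows and decision/1 facts for decision nodes, with the lowercase constant i standing for the input node “I” (uppercase names are variables in ASP).

    % 1. Reachability: a path from X to Y is an edge, or an edge followed by a path.
    path(X, Y) :- edge(X, Y).
    path(X, Y) :- edge(X, Z), path(Z, Y).

    % 2. Check: some decision node is reachable from the input node i.
    reachable_decision :- path(i, D), decision(D).

    % 3. Reject any diagram with no direct link from i to a decision node:
    %    the integrity constraint makes such diagrams unsatisfiable.
    directly_informed :- edge(i, D), decision(D).
    :- not directly_informed.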
These rules encode our expectations and intuitions, and give us a framework for automatically deciding whether a diagram satisfies them. We call these rules “intent consistency models” (ICM), after [https://doi.org/10.1017/S1471068410000554].
ASP enables us to encode the graphs described here as facts and run them, together with the rules above, through an answer-set solver such as clingo, which checks for satisfiability. We can then label unsatisfiable graphs as “incorrect”.
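As an illustration (the node names here are hypothetical), a small agent diagram can be written as facts and solved together with the rules above:

    % A toy agent diagram: user input i, decision nodes plan and act, output o.
    decision(plan). decision(act).
    edge(i, plan).
    edge(plan, act).
    edge(act, o).

Running, for example, clingo rules.lp diagram.lp on this encoding reports SATISFIABLE, because edge(i, plan) provides the direct link the constraint requires; deleting that single fact removes the link, the constraint fires, and clingo reports UNSATISFIABLE, which is exactly the behaviour we use to flag a diagram as incorrect.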