Hierarchical Agency: A Missing Piece in AI Alignment

The most ambitious direction I’ve worked on in the past few years is a theory of hierarchical agency (as mentioned in the ACS announcement). Recently, I’ve become worried that the inferential distance between “the median of current AI safety discourse” and “what I and my colleagues at ACS work on” has become too large. After multiple attempts to write a good introduction to hierarchical agency ended up in perpetual draft form, I decided to try something else: explain it to Claude.

This is a lightly edited transcript of the first part of our conversation.

Claude
Shall we begin? Could you start by explaining what you mean by “hierarchical agency” and why you think it’s important for AI safety?

Jan
Broadly, there is a pattern in the world where you can often see agents composed of other agents. Like, corporations and their departments, states and citiziens, and so on.

Claude
So you’re looking at nested or layered structures of agency—where agents can contain or be made up of other agents. Let me test my understanding with some examples and their properties:

A corporation as an agent that can make decisions and take actions, but it’s composed of departments (also agents) which are in turn composed of individual human agents?
States acting as agents while being composed of various bureaucracies and ultimately individual human officials and citizens?

Would biological examples also fit this pattern? Like how a human body could be seen as an agent composed of organs (which have their own regulatory systems) and cells (which make their own decisions about things like protein production)? I’m curious about what makes this perspective particularly relevant for AI safety

Jan
I hope to get to AI safety relevance later.
Yes, biological examples fit well—particularly multicellular organisms, which demonstrate emergent agency at multiple scales. To better define the boundaries of hierarchical agency, it might help to examine some edge cases and non-examples.

Claude
Let me probe the boundaries systematically by analyzing different types of systems:

A crowd of people in a mall—they’re a collection of agents, but are they a cohesive higher-level agent? My guess is no, because there’s no real coordination or unified decision-making?
A computer network—it’s a connected system of parts, but do the individual computers act as agents in a meaningful way? And does the network itself have agency?
An ecosystem—while it’s a system of living agents (animals, plants), would you consider the ecosystem itself an agent?
A social insect colony—this might be closer to your multicellular example, but I’d be curious where you see it fitting or not fitting?

Am I on the right track with these examples? What makes some collections of agents form a meaningful higher-level agent while others don’t?

Jan
The question of what makes something a real superagent versus just a bunch of agents hanging out together is super interesting.
I don’t have a satisfactory answer yet, but empirically, you can ask it makes sense to model the collective as an agent. I usually operationalize this using Dennett’s three stances perspective—do you know these?

Claude
Yes! If I recall correctly, Dennett proposed three levels or stances for understanding/predicting behavior:

Physical stance—predicting based on physical laws
Design stance—predicting based on how something is designed/supposed to function
Intentional stance—predicting by attributing beliefs, desires, rationality (treating as an agent)

So, if I’m understanding you correctly, you’re suggesting that what makes something a “higher-level agent” is whether taking the intentional stance towards the collective is actually useful/predictive?

Let me test this understanding:

An ant colony: Treating it as an agent with goals (survive, grow, protect queen) might be more useful than trying to predict its behavior just from individual ant interactions
A random crowd: Probably better predicted by understanding individual human behavior and physical crowd dynamics
A corporation: Often very useful to model as an agent with goals, strategies, beliefs about the market, etc. Is this the kind of distinction you mean?

Jan
You get it. In thinking about hierarchical agency, I’m interested in systems where there are at least two layers where it is useful to think about both layers as agentic, let’s say a superagent composed of subagents.

Now, having this concept, I want to make a bold claim: I think a critical missing piece for AI safety and AI alignment is we lack a good mathematical formalism for describing these systems—at least as far as I know.

You would probably ask what I mean by ‘good’, so I’ll spell it outright. My desiderata are

objects at different levels of the hierarchy are of the same type; for a negative example, formalism where on lower layer you have agents, they vote, and the result of aggregation is ‘contract’, which is different type than agent, is not good enough. In other words, the theory is scale-free.
I want something expressive enough to describe real-world situations;
The objects are type ‘agent’, i.e. somehow represent intentionality

Claude
Let me break this down to make sure I understand your criteria:

Type consistency across levels:
Could you confirm if this example captures what you mean:
Good: A formalism where both OpenAI (superagent) and employees (subagents) are represented by the same type of mathematical object
Not good: A formalism where individual workers are “agents” but their collective becomes a different mathematical type like “contract”
When you say “expressive enough to describe real-world situations”—could you give an example of the kind of real-world complexity this formalism should be able to capture? I’m imagining things like partial alignment between levels or conflicts between levels
For representing intentionality—does this mean the mathematical objects should somehow capture things like goals, beliefs, and decision-making? Similar to how we might model an agent in decision theory or game theory?

Also, I’m curious—have you seen formalisms that satisfy some but not all of these criteria?

Jan

Yes
Yes: I want to be able to describe situations like conflicts between layers, or interactions where the superagent gains agency at the expense of subagents.
Yes. For example, Scott Garrabrant has a nice stab roughly at this direction, called Cartesian Frames, but the mathematical objects involved are more suitable for representing actions than beliefs, goals or intentionality.

Actually, you have better overview of all mathematical formalisms invented: you read it all, so maybe you can try to come up with some of them and check if they fulfill the criteria?

Claude
Let me think about mathematical formalisms I’m aware of that might be relevant:

Game Theory:

Captures intentionality (preferences, strategies)
Can represent nested games
But: Usually different types at different levels (players vs coalitions vs games themselves)

Category Theory:

Good at compositionality and type consistency
Objects compose into objects of same type
But: Abstract away too much of the intentional content? Real world relevance?

Classical Multi-agent Systems:

Good at representing intentionality
Can model interactions
But: Usually flat structure, not hierarchical?

Markov Decision Processes:

Can capture decision-making
Can be nested (hierarchical MDPs)
But: More focused on actions/policies than beliefs/goals?

Economic Mechanism Design:

Models incentives and preferences
Can handle multiple layers of interaction
But: Different mathematical objects at different levels (agents vs mechanisms)

Am I missing any obvious formalisms? Would combining aspects of these get closer to what you’re looking for?

Jan
Clearly related field is public choice. Less obvious, active inference

Claude
Let’s explore these:

Public Choice Theory:

Deals with collective decision-making
Models how individual preferences aggregate into institutional decisions
Might offer insights about inter-level dynamics

But: Does it typically treat the collective level as the same type of agent as individuals?

Active Inference:

Models agents as minimizing free energy/surprise
Can handle hierarchical structures (hierarchical predictive coding)

Has a unified mathematical framework across levels

This seems potentially promising!

Let me try to understand why active inference might be relevant here:

It models both perception and action in terms of belief updates
Could potentially represent both subagents and superagents as systems trying to minimize surprise
Might capture interactions between levels as bidirectional prediction/control flows

Am I getting warm? Could you explain what aspects of these frameworks you find relevant to hierarchical agency?

Also, this makes me wonder: are you looking for a formalism that could capture how different levels of agency might have different models/beliefs about the world and each other? Like how a corporation might have a different “view” of reality than its individual employees?

Jan
Yes. You are actually pretty good at this.

What I like about active inference in this context is it tries to express everything as the same type—roughly “beliefs”. Also, in its more neuro-inspired versions, assumes bidirectional control/prediction flows is what happens in human brain.

Yes: I want something which is expressive enough to allow different beliefs on different levels. Also I want something allowing to have “goal-oriented beliefs” or “goals” concerning other layers. For example, corporation having an instrumental goal of their employees being more loyal.

Given you get this, I think we can move to your previous question “how is this related to AI safety”—or we should clarify more?

Claude
Yes, I think I understand enough of the core concept to explore its relevance to AI safety.

Let me venture a guess about some potential connections to AI safety:

If we’re creating powerful AI systems, they might become part of hierarchical agency structures (either as subagents or superagents). Without a good formal understanding of hierarchical agency, we might miss critical dynamics.
Maybe current alignment approaches are too focused on single-agent scenarios, when in reality AI will exist within complex hierarchical structures?
The ability to have “goal-oriented beliefs” about other layers seems particularly relevant—like how an AI system might develop instrumental goals about its human users/creators.

Am I on the right track?

Jan
Broadly correct direction in case of 1. and 2., but actually I want to describe more ambitious vision.

How I think about it is using a metaphor of “game theory”.

Think about how humans understood cooperation and conflict before the mathematical formalization by von Neumann, Schelling, and others. There was lots of informal understanding scattered across diverse domains—military strategy books, gambling theory, board games, even ethical treatises. I suspect most people at the time didn’t see this as an area where mathematics could be useful—sure, math worked great for physics, but concepts like cooperation or defection seemed like fuzzy human things. Yet I imagine a mathematically-minded person looking across these domains might have noticed something interesting: similar patterns of incomplete understanding showing up in contexts as different as warfare and board games, suggesting a deeper mathematical structure waiting to be discovered.

Claude
Let me see if I understand the parallel you’re drawing:

Before game theory, understanding of strategic interaction was:

Scattered across diverse domains (military, games, ethics)
Largely informal/intuitive
Not obviously “mathematizable”
Had similar patterns that weren’t formally unified

You see hierarchical agency in a similar position now:

We see it everywhere (corporations, states, biological systems)
We have informal understanding in different contexts
The patterns are similar but not formally unified
Most people might not see it as something that could/should be mathematized

And you’re suggesting that just as game theory provided a unified mathematical framework that:

Revealed deep connections between seemingly different phenomena
Allowed for more rigorous analysis
Led to new insights and predictions
Became crucial for understanding modern systems (economics, nuclear deterrence)

You’re aiming for a similar mathematical unification of hierarchical agency?

Is this capturing your vision correctly? And if so, what do you see as the “similarly shaped holes” across different domains that your formalism would help fill?

Jan

Yes, you got this right. The “similarly shaped holes” are a bit of a rabbit hole we can spend a lot of time on. I’d like to connect this to AI safety in a way that hopefully makes the relevance clear.

One of the safety relevant holes is “how AI deals with different objectives”. You can imagine in the training AI learns about different goals, and the result is you can think about the AI being composed of various “smaller AIs”, tracking different things. For example, let’s say part of the AI “cares” or “predicts” or whatever you prefer … answers oriented toward helpfulness to the human user. Another about “harmlessness”. And so on. Further, you can think about the AI undergoing some process where these parts interact, and the “whole” evolves. For example, you can imagine yourself having these various drives, and preparing data for your further training.

Claude
Let me try to unpack this through a few lenses to make sure I understand:

Internal structure of AI systems:

Rather than a single monolithic objective/agent
Multiple “subagents” tracking different objectives
These might interact/negotiate like parts of ourselves do

The whole system emerges from these interactions

So for example:

Part of me wants to be helpful to others
Part wants to protect myself
Part focuses on truth-telling
Part manages social reputation

And these interact to produce behavior?

Training dynamics: If I understand correctly, you’re suggesting that:

During training, different “parts” might develop to track different objectives

These parts then interact/evolve

The AI might even participate in shaping its own further training

Let me test my understanding with a concrete example:

When I (Claude) respond to a user, there might be different “subagents” considering:

Helpfulness to the user
Truthfulness/accuracy
Safety considerations
Social appropriateness

And my responses emerge from some interaction between these?

Is this the kind of dynamic you’re trying to formalize? And would your formalism help us understand:

How these subagents might develop during training
How they interact to produce behavior
How the system as a whole might evolve
Potential conflicts or alignment issues between levels

Am I getting closer to what you mean by this particular “hole” that needs filling?

Jan
Yes, a bit more the points 2,3 and 4. I hope the formalism may help with problems like
… what Richard Ngo calls “Value systematization” or the problem me I described in “Self-unalignment problem”, where the AI faces the problem how to align with self-unaligned human. Do you vaguely remember these?

Claude
I will try to recall and connect these ideas:

From what I remember, “Value systematization” refers to the challenge of how an AI system could/should resolve conflicts or contradictions in human values—not just implementing fixed values, but engaging in some kind of philosophical reasoning about values.

The “Self-unalignment problem” deals with how an AI should handle the fact that humans themselves often have conflicting values, aren’t fully aligned with their own stated goals, and change over time.

So if I’m connecting this to hierarchical agency:

These problems involve multiple levels of agency and belief/value structures
They involve complex interactions between levels (AI trying to understand/systematize human values, humans trying to understand their own values)
There’s a question of how different “parts” or “levels” should interact and resolve conflicts

Would your formalism help by providing a way to:

Model how different levels of values/beliefs interact
Understand how systematization of values could emerge from these interactions
Analyze what “alignment” even means in such hierarchical systems
Perhaps suggest principles for how an AI should approach these challenges?

Jan
Yes: my guess is a good formalism would help with all of these. In essence, I’m seeking a game theory-esque breakthrough for hierarchical agency — a simple yet powerful formalism that captures the essence of multi-layered intentional interactions.

Claude
So, you believe a good formalism for hierarchical agency could:

Give us precise ways to talk about nested intentionality and belief structures
Provide core concepts/principles about how different levels of agency interact
Help analyze everything from internal AI value conflicts to human-AI alignment
Make previously fuzzy ideas (like “value systematization”) more tractable

Would it be fair to say you’re looking for the hierarchical agency equivalent of concepts like Nash equilibrium—simple but profound ideas that emerge once you have the right mathematical framework?

I’m curious—do you see any candidate concepts that might play this role? Things that seem like they might be fundamental to understanding hierarchical agency, but we lack the formal tools to properly analyze them yet?

Jan
Yes. I hope to return to the formal concepts later. Less formally, I think something like Holy grail here is to have formalization which is suitable for expressing concepts like ‘kindness’.

I often think it makes sense to think about individual human mind using the hierarchical agency frame. And, in this case, empirically, I would say people have different internal architectures. As a human, you (the superagent) can do different things with your internal conflicts. For example, we sometimes try to supress or even delete some of our parts. Some people seem to have something like “internal dictatorship”, where a lot of their motivations or parts are suppressed. Personally, I strive for cooperative, win-win relationships, where my different desires “trust the whole”.

Ideally, I would like something where we express thing like the superagent being kind to the subagents, or a dynamic of value evolution we would consider fair.

Claude
I’ll try to unpack several layers here:

Internal Architecture Patterns:

“Internal dictatorship”—suppressing/controlling subagents
“Internal democracy/cooperation”—seeking win-win solutions
“Internal understanding”—cultivating trust between levels

Am I right that you see these as different ways the superagent-subagent relationship can be structured?

Values at Meta-Level:

If I understand correctly, you’re suggesting we need formalism that can express not just object-level values, but meta-level values about how different levels of agency should interact. Like:

Kindness
Fairness in value evolution
Trust between parts and whole

This makes me think about therapeutic approaches to internal conflict. For example:

Internal Family Systems therapy sees internal parts as subagents and emphasizes compassionate leadership from the “Self”
Some meditation practices aim to cultivate a kind awareness of all mental states
Psychological integration vs suppression of traits/desires

Would your formalism aim to capture what makes some of these approaches “healthier” or “more ethical” than others? And could this then inform how we think about AI systems managing their internal conflicts or interacting with humans?

Is this getting at what you mean by expressing values like “kindness” in the context of hierarchical agency?

Jan
Yes.

To be continued.
I’m interested in feedback about the format: is it useful? Would be better if the conversation was more or less condensed? If Claude was simulating someone more or less informed?

^
This has various downsides, like difficulty in coordinating with other people, incomprehensibility to funders, etc.