Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Epistemic status: I’m currently unsure whether that’s a fake framework, a probably-wrong mechanistic model, or a legitimate insight into the fundamental nature of agency. Regardless, viewing things from this angle has been helpful for me.

In addition, the ambitious implications of this view are one of the reasons I’m fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at solid intermediate results by EOY 2025.)


Input Side: Observations

Consider what happens when we draw inferences based on observations.

Photons hit our eyes. Our brains draw an image aggregating the information each photon gave us. We interpret this image, decomposing it into objects, and inferring which latent-variable object is responsible for generating which part of the image. Then we wonder further: what process generated each of these objects? For example, if one of the “objects” is a news article, what is it talking about? Who wrote it? What events is it trying to capture? What set these events into motion? And so on.

In diagram format, we’re doing something like this:

Blue are ground-truth variables, grey is the “Cartesian boundary” of our mind from which we read off observations, and purple are nodes in our world-model, each of which can be mapped to a ground-truth variable.

We take in observations, infer what latent variables generated them, then infer what generated those variables, and so on. We go backwards: from effects to causes, iteratively. The Cartesian boundary of our input can be viewed as a “mirror” of a sort, reflecting the Past.

It’s a bit messier in practice, of course. There are shortcuts, ways to map immediate observations to far-off states. But the general idea mostly checks out – especially given that these “shortcuts” probably still implicitly route through all the intermediate variables, just without explicitly computing them. (You can map a news article to the events it’s describing without explicitly modeling the intermediary steps of witnesses, journalists, editing, and publishing. But your mapping function is still implicitly shaped by the known quirks of those intermediaries.)
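
To make the “invert each causal link in turn” picture concrete, here’s a minimal toy sketch. This is my own illustrative example, not anything from the original formulation: exact Bayesian inference on a two-link causal chain, with a likelihood message passed backwards from the observation boundary toward progressively deeper causes.

```python
import numpy as np

# Toy causal chain:  z2 -> z1 -> observation,  all variables discrete.
# "Reflecting the Past" = passing a likelihood message backwards from the
# observation boundary, inverting one causal link at a time with Bayes' rule.

rng = np.random.default_rng(0)
n = 4                                                # states per variable

p_z2 = np.full(n, 1 / n)                             # prior over the deepest cause
p_z1_given_z2 = rng.dirichlet(np.ones(n), size=n)    # rows: z2, cols: z1
p_obs_given_z1 = rng.dirichlet(np.ones(n), size=n)   # rows: z1, cols: obs

obs_index = 2                                        # what hit our "Cartesian boundary"

# Step 1: which z1 could have generated this observation?
lik_z1 = p_obs_given_z1[:, obs_index]                # p(obs | z1) for each z1

# Step 2: push the message one more link back: which z2 could have generated that z1?
lik_z2 = p_z1_given_z2 @ lik_z1                      # p(obs | z2) for each z2

# Combine the backward messages with the forward priors to get posteriors.
p_z1_prior = p_z2 @ p_z1_given_z2                    # marginal prior over z1
post_z1 = p_z1_prior * lik_z1 / (p_z1_prior @ lik_z1)
post_z2 = p_z2 * lik_z2 / (p_z2 @ lik_z2)

print("p(z1 | obs):", np.round(post_z1, 3))
print("p(z2 | obs):", np.round(post_z2, 3))
```

The “shortcut” version would amount to precomputing the composite map from obs_index straight to post_z2 – the intermediate link is folded into the mapping rather than explicitly represented.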

Output Side: Actions

Consider what happens when we’re planning to achieve some goal, in a consequentialist-like manner.

We envision the target state: what we want to achieve, what the world would look like. Then we ask ourselves: what would cause this? What forces could influence the outcome to align with our desires? And then: how do we control these forces? What actions would we need to take in order to make the network of causes and effects steer the world towards our desires?

In diagram format, we’re doing something like this:

Green are goals, purple are intermediary variables we compute, grey is the Cartesian boundary of our actions, red are ground-truth variables through which we influence our target variables.

We start from our goals, infer what latent variables control their state in the real world, then infer what controls those latent variables, and so on. We go backwards: from effects to causes, iteratively, until we arrive at our own actions. The Cartesian boundary of our output can be viewed as a “mirror” of a sort, reflecting the Future.

It’s a bit messier in practice, of course. There are shortcuts, ways to map far-off goals to immediate actions. But the general idea mostly checks out – especially given that these heuristics probably still implicitly route through all the intermediate variables, just without explicitly computing them. (“Acquire resources” is a good heuristic starting point for basically any plan. But what counts as resources is something you had to figure out in the first place by mapping from “what lets me achieve goals in this environment?”.)
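
The planning side admits an almost identical toy sketch: the same kind of backward message, but now it carries “how well does this state cause my goal?” rather than “how well does this state explain my observation?”. Again, this is my own illustrative example with made-up variable names, not anything canonical.

```python
import numpy as np

# Toy causal chain:  action -> lever -> outcome.
# "Reflecting the Future" = propagating a desirability message backwards from
# the goal, through the intermediate variables, until it reaches the actions.

rng = np.random.default_rng(1)
n_actions, n_levers, n_outcomes = 3, 4, 5

p_lever_given_action = rng.dirichlet(np.ones(n_levers), size=n_actions)
p_outcome_given_lever = rng.dirichlet(np.ones(n_outcomes), size=n_levers)

goal_outcome = 4                                     # the future state we want

# Step 1: how strongly does each setting of the intermediate "lever" cause the goal?
value_of_lever = p_outcome_given_lever[:, goal_outcome]     # p(goal | lever)

# Step 2: push that one more link back, onto the variables we directly control.
value_of_action = p_lever_given_action @ value_of_lever     # p(goal | action)

best_action = int(np.argmax(value_of_action))
print("p(goal | action):", np.round(value_of_action, 3))
print("chosen action:", best_action)
```

Note that the backward pass is literally the same matrix-vector product as in the inference sketch; only the interpretation of the message changes.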

And indeed, that side of my formulation isn’t novel! From this post by Scott Garrabrant:

Time is also crucial for thinking about agency. My best short-phrase definition of agency is that agency is time travel. An agent is a mechanism through which the future is able to affect the past. An agent models the future consequences of its actions, and chooses actions on the basis of those consequences. In that sense, the consequence causes the action, in spite of the fact that the action comes earlier in the standard physical sense.

Both Sides: A Causal Mirror

Putting it together, an idealized, compute-unbounded “agent” could be laid out in this manner:

You may not like it, but this is what peak agency looks like.

It reflects the past at the input side, and reflects the future at the output side. In the middle, there’s some “glue”/“bridge” connecting the past and the future by a forwards-simulation. During that, the agent “catches up to the present”: figures out what’ll happen while it’s figuring out what to do.

If we consider the relation between utility functions and probability distributions, it gets even more literal. A utility function over some variable $X$ could be viewed as a target probability distribution over $X$, and maximizing expected utility is equivalent to minimizing the cross-entropy between this target distribution and the real distribution.
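
One standard way to make that equivalence precise – my reconstruction, assuming “target distribution” means a Boltzmann/softmax transform of the utility function:

```latex
% Define the target distribution as a Boltzmann transform of the utility U:
\[
q(x) \;=\; \frac{e^{U(x)}}{Z}, \qquad Z = \sum_x e^{U(x)}.
\]
% Let p_\pi(x) be the distribution over outcomes induced by policy \pi. Then
\[
H(p_\pi, q) \;=\; -\sum_x p_\pi(x)\,\log q(x) \;=\; -\,\mathbb{E}_{p_\pi}[U(x)] + \log Z.
\]
% Since \log Z does not depend on \pi,
\[
\arg\min_\pi \, H(p_\pi, q) \;=\; \arg\max_\pi \, \mathbb{E}_{p_\pi}[U(x)].
\]
```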

That brings the “planning” process into alignment with the “inference” process: both are about propagating target distributions “backwards” in time through the network of causality.

Why Is This Useful?

The primary, “ordinary” use-case is that this lets us import intuitions and guesses about how planning works into our picture of how inference works, and vice versa. It’s a helpful heuristic to guide one’s thoughts when doing research.

An example: Agency researchers are fond of talking about “coherence theorems” that constrain how agents work. There’s a lot of controversy around this idea. John Wentworth has speculated that the “real” coherence theorems are yet to be discovered, and that they may be based on the more solid bedrock of probability theory or information theory. This equivalence might be a starting point for formulating them – by importing some inference-based derivations to planning procedures.

Another example: Consider the information-bottleneck method. Setup: Suppose we have a causal structure $X \rightarrow Y$. We want to derive a mapping $f: X \rightarrow \hat{X}$ such that $\hat{X}$ discards as much information in $X$ as possible while retaining all the data it has about $Y$. In optimization-problem terms, we want to minimize $I(X; \hat{X})$ under the constraint of $I(\hat{X}; Y) = I(X; Y)$. The IBM paper then provides a precise algorithm for doing that, if you know the joint distribution $p(x, y)$. And that’s a pretty solid description of some aspects of inference.
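
For concreteness, here’s a rough numpy sketch of the iterative self-consistent updates from the classic information-bottleneck formulation (my reconstruction of the standard Tishby–Pereira–Bialek scheme, not code from the paper; the function name and toy distribution are made up):

```python
import numpy as np

# Iterative information bottleneck: given p(x, y) and a trade-off parameter
# beta, alternate the self-consistent updates for the encoder p(xhat | x),
# the marginal p(xhat), and the decoder p(y | xhat).

def information_bottleneck(p_xy, n_clusters, beta=10.0, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                           # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]                # rows: x, cols: y
    eps = 1e-12

    # Start from a random soft encoder p(xhat | x).
    q = rng.dirichlet(np.ones(n_clusters), size=n_x)     # rows: x, cols: xhat

    for _ in range(iters):
        p_xhat = p_x @ q                                              # p(xhat)
        # Decoder: p(y | xhat) = sum_x p(y | x) p(x | xhat).
        p_y_given_xhat = (q * p_x[:, None]).T @ p_y_given_x
        p_y_given_xhat /= p_xhat[:, None]
        # KL( p(y|x) || p(y|xhat) ) for every (x, xhat) pair.
        kl = (p_y_given_x[:, None, :]
              * (np.log(p_y_given_x[:, None, :] + eps)
                 - np.log(p_y_given_xhat[None, :, :] + eps))).sum(axis=2)
        # Re-estimate the encoder: p(xhat | x) proportional to p(xhat) exp(-beta * KL).
        logits = np.log(p_xhat + eps)[None, :] - beta * kl
        q = np.exp(logits - logits.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)

    return q, p_y_given_xhat

# Toy joint distribution p(x, y): 6 "observation" states, 3 "latent" states.
rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(6 * 3)).reshape(6, 3)
encoder, decoder = information_bottleneck(p_xy, n_clusters=2)
print(np.round(encoder, 3))                          # soft assignment x -> xhat
```

Large beta enforces the “retain everything about $Y$” constraint ever more strictly; the hard-constrained version in the text corresponds to the limit of beta going to infinity.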

But if inference is equivalent to planning, then it’d stand to reason that something similar happens on the planning side, too. Some sort of “observations”, some sort of information-theoretic bottleneck, etc.

And indeed: the bottleneck is actions! When we’re planning, we (approximately) generate a whole target world-state. But we can’t just assert it upon reality, we have to bring it about through our actions. So we “extract” a plan, we compress that hypothetical world-state into actions that would allow us to generate it… and funnel those actions through our output-side interface with the world.

In diagram format:

We have two bottlenecks: our agent’s processing capacity, which requires it to compress all observational data into a world-model, and our agent’s limited ability to influence the world, which causes it to compress its target world-state into an action-plan. We can now adapt the IBM for the task of deriving planning-heuristics as well.
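
One speculative way to write down the mirrored optimization problem – this is my guess at the analogy, not a formulation from the post, and the symbols $S^*$ (the envisioned target world-state), $A$ (the action-plan), and $G$ (the goal variables we actually care about) are made up for illustration:

```latex
% Mirror of the inference-side bottleneck, on the planning side:
\[
\min_{p(a \mid s^*)} \; I(S^*; A)
\quad \text{subject to} \quad I(A; G) = I(S^*; G).
\]
% Read: when compressing the envisioned world-state into an action-plan,
% discard as much of it as possible, while losing none of the information
% that determines whether the goal variables end up in their target state.
```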

And we’ve arrived at this idea by reasoning from the equivalence of inference to planning.

The ambitious use-case is that if this framework is meaningfully true, it implies that all cognitive functions can be viewed as inverse problems to the environmental functions our universe computes. Which suggests a proper paradigm for agent-foundations research: a way to shed light on all of it by understanding how certain aspects of the environment work.

On which topic...

Missing Piece: Approximation Theory

Now, of course, agents can’t be literal causal mirrors. That would essentially require each agent to be as big as the universe, if it had to literally infer the state of every variable the universe computes (bigger, actually: inverse problems tend to be harder than the corresponding forward problems).

The literal formulation also runs into all sorts of infinite recursion paradoxes. What if the agent wants to model itself? What if the environment contains other agents? What if some of them are modeling this agent? And so on.

But, of course, it doesn’t have to model everything. I’d already alluded to it when mentioning “shortcuts”. No, in practice, even idealized agents are only approximate causal mirrors. Their cognition is optimized for low computational complexity and efficient performance. The question then is: how does that “approximation” work?

That is precisely what the natural abstractions research agenda is trying to figure out. What is the relevant theory of approximation, that would suffice for efficiently modeling any system in our world?

Taking that into account, and assuming that my ambitious idea – that all cognitive functions can be derived as inversions of environmental functions – is roughly right...

Well, in that case, figuring out abstraction would be the last major missing piece in agent foundations. If we solve that puzzle, it’ll be smooth sailing from there on out. No more fundamental questions about paradigms, no theoretical confusions, no inane philosophizing like this post.

The work remaining after that may still not end up easy, mind. Inverse problems tend to be difficult, and the math for inversions of specific environmental transformations may be hard to figure out. But only in a strictly technical sense. It would be straightforwardly difficult, and much, much more scalable and parallelizable.

We won’t need to funnel it all through a bunch of eccentric agent-foundation researchers. We would at last attain high-level expertise in the domain of agency, which would let us properly factorize the problem.

And then, all we’d need to do is hire a horde of mathematicians and engineers (or, if we’re really lucky, get some math-research AI tools), pose them well-defined technical problems, and blow the problem wide open.