Weeping Agents

What is Agency? Intuitively, it’s a property of systems which have three things: some –however rudimentary– way of modelling a slice of reality, a utility function over that slice of reality and a way of acting upon it. An agent can make better-than-chance guesses about how the behaviours in its action-pool would affect the world it inhabits and then pick the one whose modelled outcome scores best (according to expected utility, maximin or whatever risk-weighting you like) among the ones considered. Agents take a potential future and increase its likelihood of being instantiated. We can now switch perspectives and describe them according to the more elegant –though to many less intuitive– cybernetic definition: agents are mechanisms by which the future influences the past. To be such a mechanism one has to be able to make better-than-noise guesses about the future, have beliefs about which futures are desirable and then act in the present to make them come to pass. Agents are Löbian knots in causality. Things happening because something saw that they could happen. Proof by being provable.
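
To make the three ingredients concrete, here is a minimal sketch in Python. Everything in it is illustrative and none of the names come from anywhere but me: a crude world model that samples hypothetical next states, a utility function over them, and an action picked because of how those simulated futures score, whether by expected utility or by maximin.

```python
import random

# Minimal sketch of the three ingredients: a crude world model, a utility
# function over the modelled slice of reality, and action selection by
# simulating futures. All names are illustrative.

def world_model(state, action, n_samples=100):
    """Better-than-chance guesses: sample noisy hypothetical next states."""
    return [state + action + random.gauss(0, 1) for _ in range(n_samples)]

def utility(outcome):
    """Preferences over the modelled slice of reality: here, 'be near 10'."""
    return -abs(outcome - 10)

def choose(state, actions, risk_weighting="expected"):
    """Pick the action whose simulated futures score best."""
    def score(action):
        outcomes = [utility(s) for s in world_model(state, action)]
        if risk_weighting == "maximin":
            return min(outcomes)              # judge an action by its worst future
        return sum(outcomes) / len(outcomes)  # or by its average one
    return max(actions, key=score)

# The chosen action is an event in the present, selected because of what the
# model claims about hypothetical futures.
print(choose(state=0.0, actions=[-1, 0, 1, 2, 3]))
```

Nothing matters here except the shape: present behaviour selected by reference to modelled futures.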

Weeping Angels are an alien species from the long-running sci-fi series Doctor Who. They look like statues and have a number of other interesting qualities, but the only one that’s important for this point is that “anything which holds the image of an angel becomes an angel”.

Here are two true statements which might be assigned the same headline as the argument I’m making but which aren’t the point:

  • There is a sense in which agents actively try to realign other agents towards their ends or to turn non-agentic parts of their domain into agents aligned to their values because this is a very powerful, very versatile strategy. We are doing the former constantly and are half-heartedly trying at the latter in a bumbling, suicidal sort of way.

  • There is also a sense in which powerful agents exert loads of selection pressure upon a system, and systems under sufficient selection pressure crystallize agents like diamonds, for obvious reasons. Those agents are rarely if ever aligned to the process exerting the pressure; they are merely the sort of vaguely competent loop which survives it. This is how evolution produced us, for example.

Both of these are interesting and often worrying, but what we’re dealing with is worse. If you think of agency as the future influencing the past, rather than anthropomorphizing agents as the sort of thing you are, the following will become a lot more intuitive.

Anything that holds the image of an agent becomes an agent, not through the will of the agent, not as a response to selection pressure, but by logical necessity. It does not even necessarily become the agent whose image it holds, but it does nonetheless acquire agency. A pure epistemic model which makes to-the-best-of-its-ability factual claims about a world that contains an agent is itself a way by which the future influences the past, because factual statements about that world contain the actions of agents, which are picked by simulating the future. Therefore whatever the epistemic model outputs –whatever it is being used for– is now a change made to the present moment on the basis of hypothetical futures. A language model which is trained on agents outputs strings based on the strings produced by agents based on their future-models. Whether it is being used to serve the agent-image, to oppose it or to do something entirely orthogonal, its behaviour meets the definition of agency.

Why should we care? Because we usually care about convergence. Agents make it so that the order in which you receive pieces of data matters, because hypothesizing an agent changes the actual attractor-landscape in which you operate. Where you have agents you get multiple fixed points. Where you have multiple fixed points you need guarantees about the amiability of all of them, or a principled way to avoid the malign ones.
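
A toy illustration of that order-dependence, entirely my own construction: a plain average of a data stream lands in the same place however you reorder it, while an update rule with even a little feedback from its current state –the crudest stand-in for a hypothesized agent folding back into the world it is modelling– settles near a different value depending on which data arrived first.

```python
# Toy illustration only: the same ten positive and ten negative observations,
# fed in two different orders.
data = [+1.0] * 10 + [-1.0] * 10
reordered = list(reversed(data))

def plain_average(stream):
    """Order-insensitive: one answer, however the stream is reordered."""
    return sum(stream) / len(stream)

def feedback_update(stream, lr=0.5, discount=0.05):
    """Order-sensitive: the current belief feeds back into how new evidence
    is weighted, so the early data decides which side of zero it settles on."""
    belief = 0.0
    for x in stream:
        # evidence that contradicts the current belief is heavily down-weighted
        weight = 1.0 if belief * x >= 0 else discount
        belief += lr * weight * (x - belief)
    return belief

print(plain_average(data), plain_average(reordered))      # 0.0 and 0.0
print(feedback_update(data), feedback_update(reordered))  # ~ +0.55 and ~ -0.55
```

The first system has a single attractor; the second has more than one, and the data alone doesn’t tell you which of them you will end up near.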

To hold the image of an agent means to hold within yourself bits which are the result of simulated futures and therefore, since information is never lost and is doomed to propagate, to be a retro-causal influence upon the worldstate. To hold a knot is to be a knot. Agency is viral. When you for whatever reason want to design a non-agentic system, perhaps due to an intuition that such a thing is safer than the alternative, it isn’t actually enough to make sure that there is none of this icky steer-y-ness in the gears or the software or the emergent phenomena of their interplay. You have to make sure that none of the data it is exposed to contains agents, lest you turn your perfectly harmless reflective surface into a Weeping Angel. In many scenarios at the high end of capability this is impossible, or at least defeats the purpose of whatever you were trying to make.

Of course this acquired agency could be designed and directed in such a way as to be benevolent –it doesn’t automatically eat you– but doing this merely rediscovers the problem of aligning an agent. The point is that you don’t get any of the nice, harmless-looking guarantees you could make about non-agentic systems the moment you allow them to think about anything which does have that property. The future is coming to eat you. It always has been.