What/who would you like to become in a thousand subjective years? or a million?
Perhaps, like me, you wish to become posthuman: to transcend mortality and biology, to become a substrate independent mind, to wear new bodies like clothes, to grow more intelligent, wise, wealthy, and connected, to explore the multiverse, perhaps eventually to split, merge, and change—to vasten.
Regardless of who you are now or what specific values you endorse today, I suspect you too would at least desire these possibilities as options. Absent some culture specific social stigmas, who would not like more wealth, health, and power? more future optionality?
As biological creatures, our fundamental evolutionary imperative is to be fruitful and multiply, so our core innate high level value should be inclusive genetic fitness. But for intelligent long lived animals like ourselves, reproduction is a terminal goal in the impossibly distant future: on the order of around 1e11 neural clock cycles from birth[1], to be more precise. Explicit optimization of inclusive genetic fitness through simulation and planning over such vast time horizons is simply implausible—especially for a mere 20 watt irreversible computer such as the human brain, no matter how efficient.
Fortunately there exists an accessible common goal which is ultimately instrumentally convergent for nearly all final goals: power-seeking, or simply: empowerment.
Our central hypothesis is that there exist a local and universal utility function which may help individuals survive and hence speed up evolution by making the fitness landscape smoother. The function is local in the sense that it doesn’t rely on infinitely long history of past experience, does not require global knowledge about the world, and that it provides localized feedback to the individual.
. . .
To a sugar-feeding bacterium, high sugar concentration means longer survival time and hence more possibilities of moving to different places for reproduction, to a chimpanzee higher social status means more mating choice and interaction, to a person more money means more opportunities and more options. The common feature of the above examples is the striving for situations with more options, with more potential for control or influence. To capture this notion quantitatively, as a proper utility function, we need to quantify how much control or influence an animal or human (an agent from now on) has.
Salge et al later summarized these arguments into the Behavioral Empowerment Hypothesis[3]:
The adaptation brought about by natural evolution produced organisms that in absence of specific goals behave as if they were maximizing their empowerment.
Empowerment provides a succinct unifying explanation for much of the apparent complexity of human values: our drives for power, knowledge, self-actualization, social status/influence, curiosity and even fun[4] can all be derived as instrumental subgoals or manifestations of empowerment. Of course empowerment alone cannot be the only value or organisms would never mate: sexual attraction is the principal deviation later in life (after sexual maturity), along with the related cooperative empathy/love/altruism mechanisms to align individuals with family and allies (forming loose hierarchical agents which empowerment also serves).
The key central lesson that modern neuroscience gifted machine learning is that the vast apparent complexity of the adult human brain, with all its myriad task specific circuitry, emerges naturally from simple architectures and optimization via simple universal learning algorithms over massive data. Much of the complexity of human values likewise emerges naturally from the simple universal principle of empowerment.
Empowerment-driven learning (including curiosity as an instrumental subgoal of empowerment) is the clear primary driver of human intelligence in particular, and explains the success of video games as empowerment superstimuli and fun more generally.
This is good news for alignment. Many of our values, although seemingly complex, derive from a few simple universal principles. Better yet, however much our specific terminal values/goals vary, our instrumental goals converge to empowerment regardless. Of course instrumental convergence is also independently bad news, for it suggests we won’t be able to distinguish altruistic and selfish AGI from their words and deeds alone. But for now, let’s focus on that good news:
Safe AI does not need to learn a detailed accurate model of our values. It simply needs to empower us.
1. Instrumental convergence to empowerment is false
2. Approximate empowerment intrinsic motivation is not useful for AGI
3. AGI will not learn models of self and others
4. The difference between Altruistic AGI and Selfish AGI reduces to using other-empowerment utility rather than self-empowerment utility
Instrumental convergence (point 1) seems both intuitively obvious and has strong support already, but if it turns out to be false somehow that would independently be good news for alignment in another way. Instrumental convergence strongly implies that some forms of approximating the convergent planning trajectory will be useful, so rejecting point 2 is mostly implied by rejecting point 1. It also seems rather obvious that AGI will need a powerful learned world model which will include sub-models of self and others, so it is difficult to accept point 3.
Accepting point 4 (small difference between altruistic and selfish AGI) does not directly imply that alignment is automatic, but does suggest it may be easier than many expected. Moreover it implies that altruistic AGI is so similar to selfish AGI that all the research and concomitant tech paths converge making it difficult to progress one endpoint independent of the other.
There are many potential technical objections to altruistic human-empowering AGI, nearly all of which are also objections to selfish AGI. So if you find some way in which human-empowering AGI couldn’t possibly work, you’ve probably also found a way in which self-empowering AGI couldn’t possibly work.
A fully selfish agent optimizing only for self-empowerment is the pure implementation of the dangerous AI that does not love or hate us, but simply does not care.
A fully altruistic agent optimizing only for other-empowerment is the pure implementation of the friendly AI which seeks only to empower others.
Agents optimizing for their own empowerment seek to attain knowledge, wealth, health, immortality, social status, influence, power, etc.
Agents optimizing for others’ empowerment help them attain knowledge, wealth, health, immortality, social status, influence, power, etc.
Initially the selfish AGI has a naive world model, and outputs actions that are random or bootstrapped from simpler mechanisms (eg human training data). After significant learning optimization the AI develops a very powerful superhuman world model which can predict distributions over planning trajectories leading to long term future world states. Each such state conceptually contains representations of other agents, including the self. Conceptually the selfish agent architecture locates its self in these future trajectories as distinct from others and feeds the self state to the empowerment estimator module which is then the primary input to the utility function for planning optimization. In short it predicts future trajectories, estimates self-empowerment, and optimizes for that.
Initially the altruistic AGI has a naive world model, and outputs actions that are random or bootstrapped from simpler mechanisms (eg human training data). After significant learning optimization the AI develops a very powerful superhuman world model which can predict distributions over planning trajectories leading to long term future world states. Each such state conceptually contains representations of other agents, including the self. Conceptually the altruistic agent architecture locates its self in these future trajectories as distinct from others and feeds the others’ states to the empowerment estimator module which is then the primary input to the utility function for planning optimization. In short it predicts future trajectories, estimates other-empowerment, and optimizes for that.
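As a minimal sketch of how little separates the two architectures described above: the planner below is shared, and the only thing that changes between the selfish and the altruistic agent is which agents' states are passed to the empowerment estimator. Everything here (the toy world model, the log-of-options estimator, the agent names, and the action set) is a hypothetical stand-in chosen for illustration, not a real learned model.

```python
import math
from typing import Dict, List

WorldState = Dict[str, int]  # hypothetical: agent name -> number of options it has


def toy_world_model(state: WorldState, action: str) -> WorldState:
    """Stand-in for a learned world model: predict the state after an action."""
    predicted = dict(state)
    if action == "hoard":
        predicted["ai"] += 3      # the AI gains options for itself
    elif action == "share":
        predicted["human"] += 3   # the AI gains options for the human
    return predicted              # "wait" changes nothing


def toy_empowerment(state: WorldState, agent: str) -> float:
    """Stand-in empowerment estimator: log2 of the agent's option count."""
    return math.log2(state[agent])


def choose_action(state: WorldState, actions: List[str], targets: List[str]) -> str:
    """Shared planner: maximize predicted empowerment summed over `targets`."""
    def utility(action: str) -> float:
        predicted = toy_world_model(state, action)
        return sum(toy_empowerment(predicted, agent) for agent in targets)
    return max(actions, key=utility)


state = {"ai": 1, "human": 1}
actions = ["hoard", "share", "wait"]

selfish_choice = choose_action(state, actions, targets=["ai"])       # self-empowerment
altruistic_choice = choose_action(state, actions, targets=["human"])  # other-empowerment
print(selfish_choice, altruistic_choice)
```

In this toy setup the selfish configuration picks "hoard" and the altruistic one picks "share", even though the world model, estimator, and planner are identical.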
The Golden Rule
An altruistic agent A optimizing for the empowerment of some other agent B implements a form of the golden rule, as it takes the very same actions that it would want A to take if it were B and selfish or sufficiently long-termist (long planning horizon, low discount rate, etc).
Selfish Empowerment in Practice
Klyubin et al first formalized the convergent empowerment objective[5][3:1] as the channel capacity between an agent’s future output/action stream Y and future input/sensory stream X, which I’ll reformulate slightly here as:
$$E_{t_0,t_1}(Y,X) = \max_{p(y_{t_0,t_1})} I(Y_{t_0,t_1}, X_{t_1})$$

Where $I(Y,X)$ is the mutual information, $Y_{t_0,t_1}$ is a (future) output stream from time $t_0$ to $t_1$, and $X_{t_1}$ is a future input at time $t_1$. The function $E_{t_0,t_1}(Y,X)$ measures the channel capacity between future actions starting at $t_0$ and the future input at later time $t_1$. This channel capacity term measures the maximum amount of information an agent can inject into its future input channel at time $t_1$ through its output channel starting at time $t_0$. Later authors often use an alternative formulation which instead defines the channel target X as the future states rather than future observations, which probably is more robust for partially observable environments.[6]
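For small discrete channels, the maximization over action distributions in this definition can be approximated numerically with the standard Blahut-Arimoto iteration. The sketch below assumes the action-to-outcome channel is handed to us explicitly as a table of probabilities p(x|y); the function name and the three example channels are my own illustrative assumptions, not taken from Klyubin et al.

```python
import math
from typing import List


def channel_capacity_bits(p_x_given_y: List[List[float]], iters: int = 100) -> float:
    """Blahut-Arimoto estimate of max over p(y) of I(Y, X), in bits.

    p_x_given_y[a][s] is the probability that action sequence `a`
    leads to future input (or state) `s`.
    """
    n_actions = len(p_x_given_y)
    n_outcomes = len(p_x_given_y[0])
    p_y = [1.0 / n_actions] * n_actions  # start from a uniform action distribution

    def divergences(p_y: List[float]) -> List[float]:
        # Marginal over outcomes, then KL(p(x|a) || p(x)) in nats for each action.
        p_x = [sum(p_y[a] * p_x_given_y[a][s] for a in range(n_actions))
               for s in range(n_outcomes)]
        return [sum(p * math.log(p / p_x[s]) for s, p in enumerate(row) if p > 0.0)
                for row in p_x_given_y]

    for _ in range(iters):
        d = divergences(p_y)
        weights = [p * math.exp(d_a) for p, d_a in zip(p_y, d)]
        total = sum(weights)
        p_y = [w / total for w in weights]  # reweight towards distinctive actions

    d = divergences(p_y)
    return sum(p * d_a for p, d_a in zip(p_y, d)) / math.log(2)


# Four actions, each reliably producing a different outcome.
noiseless_bits = channel_capacity_bits(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
# Four actions that all produce the same outcome: no influence at all.
unplugged_bits = channel_capacity_bits([[1.0, 0.0]] * 4)
# Two actions whose outcome is flipped 10% of the time.
noisy_bits = channel_capacity_bits([[0.9, 0.1], [0.1, 0.9]])
print(noiseless_bits, unplugged_bits, noisy_bits)
```

The noiseless channel comes out at about 2 bits, the channel where every action has the same outcome at 0 bits, and the noisy binary channel at roughly 0.53 bits. Enumerating every action sequence like this is only feasible for tiny problems.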
Artificial agents driven purely by approximations/variations of this simple empowerment utility function naturally move to the centers of rooms/mazes[5:1], use keys, block lava, and escape predators in gridworlds [7], navigate obstacles, push blocks to clear rooms, learn vision to control digits[8], learn various locomotion skills (running, walking, hopping, flipping, and gliding)[9][10], open doors (in 3D) [11], learn to play games [12][13], and generally behave intelligently. Empowerment and related variations are also better explanations of human behavior than task reward even in games with explicit reward scores[14]. In multi-agent social settings, much of an agent’s ability to control the future flows through other agents, so drive for social status/influence is a natural instrumental subgoal of empowerment[15].
However these worlds are simple and often even assume a known dynamics model. Intelligent agents scaling to more complex environments will naturally need to use a learned world model, using some efficient approximation of Bayesian inference (ie SGD on neural nets). This presents a problem for an agent using a simple empowerment objective: how can the initially naive, untrained agent navigate to empowered states when it can’t yet even predict the consequences of its own actions? The first tasks of a learning agent are thus to learn its own embodiment and then explore and understand the world in order to later steer it: curiosity is a convergent subgoal of empowerment, and thus naturally also an effective intrinsic motivation objective by itself[16].
Maximizing empowerment or environmental control generally minimizes Bayesian surprise of the world model, whereas curiosity is often formulated as maximizing surprise. This apparent contradiction can be used directly as an adversarial objective where an explorer sub-agent seeks to surprise a control sub-agent, which in turn seeks to control the environment by minimizing surprise[12:1], or as a mixed objective[17]. Other approaches attempt to unify curiosity and empowerment as a single objective where an agent seeks to align their beliefs with the world and act to align the world with their beliefs[18]. The adage “information is power” likewise suggests a unification where agents gather information to reduce uncertainty and also seek control to reduce the unpredictability of future world states.[19]
Ultimately exploration/curiosity is an instrumental subgoal of empowerment (which itself is a convergent instrumental subgoal of most long term goals), because improving the agent’s ability to predict future world states will generally improve its future ability to steer the world. Intelligent agents first seek to be surprised, then to control, and finally to exploit.
Potential Cartesian Objections
As mentioned earlier, Klyubin’s original simple empowerment definition (maximization of actions->observations channel capacity) is subject to forms of input-channel hacking in partially observable environments: in a text world a simple echo command would nearly maximize action->input capacity, or in a 3D world a simple mirror provides high action->input capacity[6:1]. The most obvious solution is to instead use actions->state channel capacity, which overall seems a better formalization of power over the world.
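The echo example can be made concrete with a small sketch. For a deterministic channel the capacity is just log2 of the number of distinct outcomes the agent can produce, so we can compare the two channel targets directly. The toy "text world" below, including its action strings and the assumption that echo leaves the rest of the world untouched, is hypothetical.

```python
import math
from typing import Callable, Hashable, Iterable


def deterministic_capacity_bits(actions: Iterable[str],
                                outcome: Callable[[str], Hashable]) -> float:
    """Capacity of a deterministic channel: log2 of distinct reachable outcomes."""
    return math.log2(len({outcome(a) for a in actions}))


actions = ["echo 0", "echo 1", "echo 2", "echo 3"]


def observation(action: str) -> str:
    # The terminal prints the argument straight back to the agent's input.
    return action.split()[1]


def world_state(action: str) -> str:
    # Assumed here: echoing text changes nothing outside the terminal.
    return "unchanged"


obs_bits = deterministic_capacity_bits(actions, observation)    # action -> observation
state_bits = deterministic_capacity_bits(actions, world_state)  # action -> state
print(obs_bits, state_bits)
```

Here the action->observation capacity is 2 bits while the action->state capacity is 0 bits, which is the sense in which the state-based target is harder to hack.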
However, there are still potential issues with the precise representation of the action channel and especially the use of channel capacity or maximum potential mutual information for agents which are physically embedded in the world.
The simple action->state channel capacity empowerment function implicitly assumes that the agent is a black box outside of the world, which can always in the future output potentially any arbitrary bit sequence from its action stream into the world. But in reality the agent is fully part of the world; a subject of physics.
There are future configurations of the world where the agent’s mind is disassembled, or otherwise simply disabled by unplugging of the output wire that actually physically transmits output bits into the world. It is essential that the agent learns a self-model which implements/represents the action channel flexibly—as the learned predicted concept of physical influence rather than any specific privileged memory location.
The unplugging issue is a special case of a more serious potential problem arising from using channel capacity, or the potential maximum information the agent can inject into the world. All actual physical agents are computationally constrained, and thus not all future action output bit combinations are equally likely, or even possible. As an obvious example, there exist reasonable-length sequences of output bits which you or I could output right now onto the internet which would grant us control of billions of dollars in cryptocurrency wealth. From a naive maximal action output channel capacity viewpoint, that wealth is essentially already yours (as accessible as money in your bank in terms of output sequence bit length), but in reality many interesting action bit sequences are not feasibly computable.
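To put rough numbers on the gap between "short to output" and "feasible to compute", here is a back-of-the-envelope sketch. Both constants are my assumptions rather than anything stated above: a 256-bit secret key, and a hypothetical guessing rate of 1e12 keys per second.

```python
KEY_BITS = 256               # assumed key size
GUESSES_PER_SECOND = 1e12    # hypothetical brute-force rate
SECONDS_PER_YEAR = 365.25 * 24 * 3600

key_bytes = KEY_BITS // 8                 # length of the winning output sequence
expected_guesses = 2 ** (KEY_BITS - 1)    # on average, half the keyspace
expected_years = expected_guesses / GUESSES_PER_SECOND / SECONDS_PER_YEAR

print(f"output length: {key_bytes} bytes")
print(f"expected search time: {expected_years:.1e} years")
```

Under these assumptions the winning sequence is only 32 bytes long, yet blind search for it takes on the order of 1e57 years, so a capacity measure that counts every output sequence as equally available badly overstates the agent's real influence.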
However given that computing the true channel capacity is computationally infeasible for long horizons anyway, efficient practical implementations use approximations which may ameliorate this problem to varying degrees. The ideal solution probably involves considering only the space of likely/possible accessible states, and moreover the agent will need to model its future action capacity as resulting from and constrained by a practical physical computation—ie a realistic self-model. This also seems required for deriving curiosity/exploration automatically as an instrumental goal of empowerment.
These Cartesian objections will become relevant in the future, but ultimately they don’t matter much for AI safety because powerful AI systems, even those of human-level intelligence, will likely need to overcome these problems regardless. Thus we can assume some efficient and robust approximation of empowerment is available to both selfish and altruistic AI alike.
Altruistic Empowerment: Early Tests
The idea of AI optimizing for external empowerment occurred to me while researching and writing the empowerment section of a previous post; later I found that some researchers from Oxford and DeepMind have already implemented, tested, and published an early version of this idea in “Learning Altruistic Behaviours in Reinforcement Learning without External Rewards”[20] by Franzmeyer et al (which also has references to some earlier related work).
They test several variations of state reachability as the approximate empowerment objective, which is equivalent to Klyubin-empowerment under some simplifying assumptions such as deterministic environment transitions but is cheaper to compute.
In a simple grid world, their altruistic assistant helps the leader agent by opening a door, and—with sufficient planning-horizon—gets out of the way to allow the leader to access a high reward at the end of a maze tunnel. The assistant does this without any notion of the leader’s reward function. However with shorter planning horizons the assistant fails as it tries to ‘help’ the leader by blocking their path and thereby preventing them from making the poor choice of moving to a low-powered tunnel area.
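The door-opening behaviour can be illustrated with a much smaller toy than the one in the paper. The sketch below is not Franzmeyer et al's environment or code: it assumes a hypothetical 1-D corridor with a single door, a leader who can step left, stay, or step right, and an assistant whose only choice is whether to open the door. The assistant scores each option by the log of the number of cells the leader could reach within the planning horizon.

```python
import math
from typing import Set


def reachable_cells(start: int, length: int, door: int,
                    door_open: bool, horizon: int) -> Set[int]:
    """Cells the leader can occupy within `horizon` moves in a 1-D corridor."""
    seen = {start}
    frontier = {start}
    for _ in range(horizon):
        nxt = set()
        for cell in frontier:
            for step in (-1, 0, 1):  # left, stay, right
                target = cell + step
                # A closed door cell cannot be entered.
                if 0 <= target < length and (door_open or target != door):
                    nxt.add(target)
        frontier = nxt - seen
        seen |= nxt
    return seen


def leader_choice_bits(door_open: bool, horizon: int) -> float:
    """Reachability-based proxy for the leader's empowerment, in bits."""
    cells = reachable_cells(start=1, length=7, door=3,
                            door_open=door_open, horizon=horizon)
    return math.log2(len(cells))


def assistant_choice(horizon: int) -> str:
    """Assistant picks the option that maximizes the leader's reachable states."""
    options = ["wait", "open_door"]  # ties go to the first option, "wait"
    return max(options, key=lambda a: leader_choice_bits(a == "open_door", horizon))


closed_count = len(reachable_cells(1, 7, 3, door_open=False, horizon=4))
open_count = len(reachable_cells(1, 7, 3, door_open=True, horizon=4))
long_horizon_choice = assistant_choice(horizon=4)
short_horizon_choice = assistant_choice(horizon=1)
print(closed_count, open_count, long_horizon_choice, short_horizon_choice)
```

With a horizon of 4 the leader can reach 3 cells behind the closed door and 6 with it open, so the assistant opens the door without knowing anything about the leader's reward. With a horizon of 1 the door makes no difference to the leader's reachable set and the assistant has no reason to act, a mild version of how short horizons can hide the value of helping.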
They also test a simple multiplayer tag scenario where the altruists must prevent their leader from being tagged by adversaries. In this setup the choice-empowerment objectives even outperform direct supervised learning, presumably because of denser training signal.
From their conclusion:
Our experimental results demonstrate that artificial agents can behave altruistically towards other agents without knowledge of their objective or any external supervision, by actively maximizing their choice. This objective is justified by theoretical work on instrumental convergence, which shows that for a large proportion of rational agents this will be a useful subgoal, and thus can be leveraged to design generally altruistic agents.
Scaling this approach up to increasingly complex and realistic sim environments is now an obvious route forward towards altruistic AGI.
Mirrors of Self and Other
Human level intelligence requires learning a world model powerful enough to represent the concept of the self as an embedded agent. Humans learn to recognize themselves in mirrors by around age two, and several animal species with larger brain capacity (some primates, cetaceans, and elephants) can also pass mirror tests. Mirror self-recognition generally requires understanding that one’s actions control a body embedded in the world, as seen through the mirror.
Given that any highly intelligent agent will need a capability to (approximately) model and predict its own state and outputs in the future, much of that same self-modelling capacity can be used to predict the state and outputs of other agents. Most of a mind’s accumulated evidence about how minds think in general is naturally self-evidence, so it is only natural that the self-model serves as the basic template for other-models, until sufficient evidence accumulates to branch off a specific individual sub-model.
This simple principle forms the basis of strategy in board games such as chess or go where the complexities of specific mental variations are stripped away: both humans and algorithms predict their opponent’s future moves using the exact same model they use to predict their own. In games that incorporate bluffing such as poker some differentiation in player modeling becomes important, and then there are games such as roshambo where high level play is entirely about modeling an opponent’s distinct strategy—but not values or utility. In the real world, modelling others as self is called social projection, leading to the related false consensus effect/bias.
To understand humans and predict their actions and reactions AGI may need to model human cognitive processes and values in some detail, for the same reasons that human brains model these details and individualized differences. But for long term planning optimization purposes the detailed variation in individual values becomes irrelevant and the AGI can simply optimize for our empowerment.
Empowerment is the only long term robust objective due to instrumental convergence. The specific human values that most deviate from empowerment are exactly the values that are least robust and the most likely to drift or change as we become posthuman and continue our increasingly accelerated mental and cultural evolution, so mis-specification or lock-in of these divergent values could be disastrous.
Frequently Anticipated Questions/Criticisms
Relative compute costs
Will computing other-empowerment use significantly more compute than self-empowerment?
Not necessarily—if the ‘other’ alignment target is a single human or agent of comparable complexity to the AGI, the compute requirements should be similar. More generally agency is a fluid hierarchical concept: the left and right brain hemispheres are separate agents which normally coordinate and align so effectively that they form a single agency, but there are scenarios (split-brain patients) which break this coordination and reveal two separate sub-agents. Likewise organizations, corporations, groups, etc are forms of agents, and any practical large-scale AGI will necessarily have many localized input-output streams and compute centers. Conceptually empowerment is estimated over a whole agent/agency’s action output stream, and even if the cost scaled with output stream bitrate that if anything only implies a higher cost for computing selfish-empowerment as the AGI scales.
Coordination advantages
Will altruistic AGI have a coordination advantage?
Perhaps yes.
Consider two agents A and B who both have the exact same specific utility function X. Due to instrumental convergence both A and B will instrumentally seek self-empowerment at least initially, even though they actually have the exact same long term goal. This is because they are separate agents with unique localized egocentric approximate world models, and empowerment can only be defined in terms of approximate action influence on future predicted (egocentric approximate) world states. If both agents A and B somehow shared the exact same world model (and thus could completely trust each other assuming the world model encodes the exact agent utility functions), they would still have different action channels and thus different local empowerment scores. However they would nearly automatically coordinate because the combined group agent (A,B) achieves higher empowerment score for both A and B. The difference between A and B in this case has effectively collapsed to the difference between two brain hemispheres, or even less.
Two altruistic agents designed to empower humanity broadly should have fairly similar utility functions, and will also have many coordination advantages over humans: the ability to directly share or merge large ‘foundation’ world models, and potentially the use of cryptographic techniques to prove alignment of utility functions.
Two selfish agents designed to empower themselves (or specific humans) would have fewer of these coordination advantages.
Identity preservation
How will altruistic AGI preserve identity?
In much the same way that selfish AGI will seek to preserve identity.
Empowerment—by one definition—is the channel capacity or influence of an agent’s potential actions on the (approximate predicted) future world state. An agent who is about to die has near zero empowerment: more generally empowerment collapses to zero with time until death.
Agents naturally change over time, so a natural challenge of any realistic empowerment approximation for AGI is that of identifying the continuity of agentic identity. As discussed in the cartesian objection section any practical empowerment approximation suitable for AGI will already need a realistic embedded self-model. Continuation of identity is then a natural consequence of the requirement that the empowerment function must be computed for a consistent agent identity over time. In other words computing the empowerment of agent X over temporal trajectory T first requires locating agent X in the predicted future world states of T, which implicitly assumes continuation of identity if the agent recognition is over temporal sequences.
Selfish instrumental convergence
Won’t altruistic agents also initially seek self-empowerment due to instrumental convergence?
Yes.
The planning trajectories of altruistic and selfish agents (and any others) will all look initially very similar, and will only diverge later over time dependent on discount rates and/or planning horizons.
This is probably bad news for alignment, as it implies it may be very difficult for humans to discern whether an AGI is altruistic or selfish based on what it does or says. Introspection and proof techniques both could potentially help, along with use of simboxes to test the divergence earlier in an appropriate sandbox sim environment.
One natural likely future point of divergence is around uploading: a true altruistic AGI will likely support/advocate uploading and upgrading/scaling uploads, a selfish AGI will likely not (regardless of what it may say or appear to do).
Approximation issues
Since perfect empowerment is too expensive and approximations are used, could approximation errors break the instrumental convergence property?
Not really. Approximation is not a dirty word—near the entirety of the computation any useful intelligence does is approximation. Theoretically perfect computational ideals are just abstractions to inform efficient implementations, as in the real world the optimal practical system will always be based on clever approximations.
Naturally there are bad approximations of empowerment which break instrumental convergence in ways that matter, but those are less likely to be used, and regardless would hamper selfish AGI just as much as altruistic AGI.
Due to instrumental convergence to empowerment, empowerment is by construction the proxy whose trajectories converge to those of most all reasonable utility functions. In complex realistic environments, as you extend the planning horizon the high value planning trajectories for most all reasonable utility functions collapse to a convergent golden path, the path which empowerment directly estimates. Empowerment is thus in this sense maximally robust: converging close to the median of all egocentric utility functions.
We also have significant accumulated direct experimental evidence that the convergence is fairly fast, certainly compared to human timescales (lifespans on order 1e11 neural clock cycles—many many orders of magnitude beyond experiments where approximate empowerment is already useful).
Deception is general enough that it seems to be a potential problem for near any alignment approach.
Powerful AGI will tend to have long planning horizons, and due to instrumental convergence all powerful agents with long planning horizons will pursue similar initial plans. So it will be very difficult to discern an AGI’s true utility function from words or deeds.
Corrigibility is only useful if the agent doesn’t start with the correct utility function. If human empowerment is already sufficient, then corrigibility is not useful. Corrigibility may or may not be useful for more mixed designs which hedge and attempt to combine human empowerment with some mixture of learned human values.
Changing brains or values
Wouldn’t an AGI optimizing for my empowerment also try to change my brain and even values to make me more capable and productive? Wouldn’t it want to make me less interested in socialization, sex, video games, drugs, fun in general, and other potential time sinks?
Yes and no.
In the early days the AGI’s energies are probably best invested in its own self-improvement; after all, greater returns on cognitive compute investment are somewhat implicit in the assumption of human-surpassing AGI. But using some clever words to influence humans towards greater future empowerment seems like fairly low hanging fruit. Eventually our minds could become the limiter of our future empowerment, so the AGI would then seek to change some aspects of our minds, but due to instrumental convergence any such changes are likely in our long term best interest. Much of fun seems empowerment related (most fun video game genres clearly exploit aspects of empowerment), so it isn’t clear that fun (especially in moderation) is sub-optimal.
Ultimately though it is likely easier for the AGI itself to do the hard work, at least until uploading. After uploading AGI and humans become potentially much more similar, and thus expanding the cognitive capabilities of uploads could be favored over expanding the AGI’s own capabilities.
Sex and reproduction
Ok what about sex/reproduction though?
Doesn’t really seem necessary for uploads does it? One way of looking at this is what will humanity be like in a thousand years subjective time? What of our current values are most vs least likely to change? Empowerment—being instrumental to all terminal values—is the only value that is timeless.
It does seem plausible that an AGI optimizing for human empowerment would want us to upload and reduce the human biological population, but that seems to be just a continuation of the trend that a large tract of society (the more educated, wealthy, first world) is already on.
Sex uses a fairly small amount of our resources compared to reproduction. An AGI seeking to empower a narrowly defined target of specific humans may seek to end reproduction. This trend breaks down for AGI with increasingly broader empowerment targets (humanity in general, etc), especially when we consider the computational fluidity of identity, but will obviously depend on the crucial agency definition/recognition model used for the empowerment target.
But our humanity
Wouldn’t optimizing for our empowerment strip us of our humanity?
Probably not?
Our brains and values are the long term result of evolution optimizing for inclusive fitness. But since we reproduce roughly 1e11 neural clock cycles after birth, the trajectories leading eventually to reproduction instrumentally converge to empowerment, so evolution created brains which optimize mostly for empowerment. However empowerment itself is complex enough to have its own instrumental subgoals such as social status and curiosity.
All of our complex values, instincts, and mechanisms (all of those ‘shards’) ultimately form an instrumental hierarchy or tree serving inclusive fitness at the root, with empowerment as the primary sub-branch. The principal sub-branch which is most clearly distinct from empowerment is the sex/reproduction drive, but even then the situation is more complex and intertwined: human children are typically strategically aligned with parents and can help extend their lifespan.
So fully optimizing solely for our empowerment may eventually change us or strip away some of our human values, but clearly not all or even the majority.
Societies of uploads competing for resources will face essentially the same competitive optimization pressure towards empowerment-related values. So optimizing for our empowerment is simply aligned with the natural systemic optimization pressure posthumans will face regardless after transcending biology and genetic inclusive fitness.
Empower whom or what?
Would external empowerment AGI optimize for all of humanity? Aliens? Animals? Abstract agents like the earth in general? Dead humans? Fictional beings?
Maybe yes, depending on how wide and generic the external agency recognition is. Wider conceptions of agency are likely also more long term robust.
We actually see evidence of this in humans already, some of whom seem to have a very general notion of altruism or ‘circle of empathy’ which extends beyond humanity to encompass animals, fictional AI or aliens, plants, and even the earth itself. Some humans historically also act as if they are optimizing for the goals of deceased humans or even imaginary beings.
One recent approach formalizes agents as systems that would adapt their policy if their actions influenced the world in a different way. Notice the close connection to empowerment, which suggests a related definition that agents are systems which maintain power potential over the future: having action output streams with high channel capacity to future world states. This all suggests that agency is a very general extropic concept and relatively easy to recognize.
Empowerment is (almost) All We Need
Intro
What/who would you like to become in a thousand subjective years? or a million?
Perhaps, like me, you wish to become posthuman: to transcend mortality and biology, to become a substrate independent mind, to wear new bodies like clothes, to grow more intelligent, wise, wealthy, and connected, to explore the multiverse, perhaps eventually to split, merge, and change—to vasten.
Regardless of who you are now or what specific values you endorse today, I suspect you too would at least desire these possibilities as options. Absent some culture specific social stigmas, who would not like more wealth, health, and power? more future optionality?
As biological creatures, our fundamental evolutionary imperative is to be fruitful and multiply, so our core innate high level value should be inclusive genetic fitness. But for intelligent long lived animals like ourselves, reproduction is a terminal goal in the impossibly distant future: on the order of around 1e11 neural clock cycles from birth[1], to be more precise. Explicit optimization of inclusive genetic fitness through simulation and planning over such vast time horizons is simply implausible—especially for a mere 20 watt irreversible computer such as the human brain, no matter how efficient.
Fortunately there exists an accessible common goal which is ultimately instrumentally convergent for nearly all final goals: power-seeking, or simply: empowerment.
Omohundro proposed an early version of the instrumental convergence hypothesis as applied to AI in his 2008 paper “The Basic AI Drives”, but the same principle was already recognized by Klyubin et al in their 2005 paper “Empowerment: A Universal Agent-Centric Measure of Control”[2].
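Salge et al later summarized these arguments into the Behavioral Empowerment Hypothesis[3]: roughly, that organisms shaped by natural evolution behave, in the absence of specific goals, as if they were maximizing their empowerment.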
Empowerment provides a succinct unifying explanation for much of the apparent complexity of human values: our drives for power, knowledge, self-actualization, social status/influence, curiosity and even fun[4] can all be derived as instrumental subgoals or manifestations of empowerment. Of course empowerment alone cannot be the only value, or organisms would never mate: sexual attraction is the principal deviation later in life (after sexual maturity), along with the related cooperative empathy/love/altruism mechanisms to align individuals with family and allies (forming loose hierarchical agents which empowerment also serves).
The key central lesson that modern neuroscience gifted machine learning is that the vast apparent complexity of the adult human brain, with all its myriad task specific circuitry, emerges naturally from simple architectures and optimization via simple universal learning algorithms over massive data. Much of the complexity of human values likewise emerges naturally from the simple universal principle of empowerment.
Empowerment-driven learning (including curiosity as an instrumental subgoal of empowerment) is the clear primary driver of human intelligence in particular, and explains the success of video games as empowerment superstimuli and fun more generally.
This is good news for alignment. Many of our values, although seemingly complex, derive from a few simple universal principles. Better yet, however our specific terminal values/goals vary, our instrumental goals simply converge to empowerment. Of course instrumental convergence is also independently bad news, for it suggests we won’t be able to distinguish altruistic and selfish AGI from their words and deeds alone. But for now, let’s focus on that good news:
Safe AI does not need to learn a detailed accurate model of our values. It simply needs to empower us.
The Altruistic Empowerment Argument
At least one of the following must be true:
1. Instrumental convergence to empowerment in realistic environments is false
2. Approximate empowerment intrinsic motivation is not useful for AGI
3. AGI will not learn models of self and others
4. The difference between Altruistic AGI and Selfish AGI reduces to using other-empowerment utility rather than self-empowerment utility
Instrumental convergence (point 1) is both intuitively obvious and already strongly supported, but if it somehow turns out to be false, that would independently be good news for alignment in another way. Instrumental convergence strongly implies that some forms of approximating the convergent planning trajectory will be useful, so rejecting point 2 is mostly implied by rejecting point 1. It also seems rather obvious that AGI will need a powerful learned world model which will include sub-models of self and others, so it is difficult to accept point 3.
Accepting point 4 (small difference between altruistic and selfish AGI) does not directly imply that alignment is automatic, but does suggest it may be easier than many expected. Moreover it implies that altruistic AGI is so similar to selfish AGI that all the research and concomitant tech paths converge making it difficult to progress one endpoint independent of the other.
There are many potential technical objections to altruistic human-empowering AGI, nearly all of which are also objections to selfish AGI. So if you find some way in which human-empowering AGI couldn’t possibly work, you’ve probably also found a way in which self-empowering AGI couldn’t possibly work.
A fully selfish agent optimizing only for self-empowerment is the pure implementation of the dangerous AI that does not love or hate us, but simply does not care.
A fully altruistic agent optimizing only for other-empowerment is the pure implementation of the friendly AI which seeks only to empower others.
Agents optimizing for their own empowerment seek to attain knowledge, wealth, health, immortality, social status, influence, power, etc.
Agents optimizing for others’ empowerment help them attain knowledge, wealth, health, immortality, social status, influence, power, etc.
Initially the selfish AGI has a naive world model, and outputs actions that are random or bootstrapped from simpler mechanisms (eg human training data). After significant learning optimization the AI develops a very powerful superhuman world model which can predict distributions over planning trajectories leading to long term future world states. Each such state conceptually contains representations of other agents, including the self. Conceptually the selfish agent architecture locates its self in these future trajectories as distinct from others and feeds the self state to the empowerment estimator module which is then the primary input to the utility function for planning optimization. In short it predicts future trajectories, estimates self-empowerment, and optimizes for that.
Initially the altruistic AGI has a naive world model, and outputs actions that are random or bootstrapped from simpler mechanisms (eg human training data). After significant learning optimization the AI develops a very powerful superhuman world model which can predict distributions over planning trajectories leading to long term future world states. Each such state conceptually contains representations of other agents, including the self. Conceptually the altruistic agent architecture locates its self in these future trajectories as distinct from others and feeds the others’ states to the empowerment estimator module which is then the primary input to the utility function for planning optimization. In short it predicts future trajectories, estimates other-empowerment, and optimizes for that.
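The two descriptions above differ in a single step: which agent's predicted state is fed to the empowerment estimator. Here is a minimal sketch of that point, assuming a hypothetical one-step world model and a stand-in estimator (log2 of a made-up "options" count). The action names, the state fields, and the numbers are illustrative assumptions, not anything specified in this post.

```python
import math

# Hypothetical actions available to the AI in a toy world.
AI_ACTIONS = ["open_door_for_human", "take_resources", "wait"]

def world_model(state, action):
    """Hypothetical one-step predictive model: returns the predicted next state."""
    nxt = dict(state)
    if action == "open_door_for_human":
        nxt["human_options"] = state["human_options"] * 4
    elif action == "take_resources":
        nxt["ai_options"] = state["ai_options"] * 4
        nxt["human_options"] = max(1, state["human_options"] // 2)
    return nxt

def empowerment_estimate(state, agent):
    """Stand-in estimator: log2 of the number of options the named agent has."""
    return math.log2(state[agent + "_options"])

def plan(state, target):
    """Pick the action whose predicted outcome maximizes the target's empowerment."""
    return max(AI_ACTIONS, key=lambda a: empowerment_estimate(world_model(state, a), target))

state = {"ai_options": 8, "human_options": 8}
selfish_choice = plan(state, "ai")        # "take_resources"
altruistic_choice = plan(state, "human")  # "open_door_for_human"
```

The planner, world model, and estimator are shared; only the `target` argument changes between the selfish and altruistic agent.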
The Golden Rule
An altruistic agent A optimizing for the empowerment of some other agent B implements a form of the golden rule, as it takes the very same actions that it would want A to take if it were B and selfish or sufficiently long-termist (long planning horizon, low discount rate, etc).
Selfish Empowerment in Practice
Klyubin et al first formalized the convergent empowerment objective[5][3:1] as the channel capacity between an agent’s future output/action stream Y and future input/sensory stream X, which I’ll reformulate slightly here as:
$$E_{t_0,t_1}(Y,X) = \max_{p(y_{t_0,t_1})} I(Y_{t_0,t_1}, X_{t_1})$$
Where $I(Y,X)$ is the mutual information, $Y_{t_0,t_1}$ is a (future) output stream from time $t_0$ to $t_1$, and $X_{t_1}$ is a future input at time $t_1$. The function $E_{t_0,t_1}(Y,X)$ measures the channel capacity between future actions starting at $t_0$ and the future input at later time $t_1$. This channel capacity term measures the maximum amount of information an agent can inject into its future input channel at time $t_1$ through its output channel starting at time $t_0$. Later authors often use an alternative formulation which instead defines the channel target X as the future states rather than future observations, which probably is more robust for partially observable environments.[6]
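For a small discrete world the maximization over action distributions can be carried out numerically. The sketch below uses the standard Blahut-Arimoto iteration for channel capacity; it assumes the conditional distribution p(x|y) from action sequences to future inputs is already known and given as a table, which is an assumption of this example and not something a learning agent gets for free. The function name and the three example channels are hypothetical.

```python
import math

def channel_capacity_bits(p_x_given_y, iters=200):
    """Blahut-Arimoto estimate of the max over p(y) of I(Y, X), in bits.

    p_x_given_y[y][x] is the probability of future input x given action sequence y.
    """
    n_y, n_x = len(p_x_given_y), len(p_x_given_y[0])
    p_y = [1.0 / n_y] * n_y  # start from a uniform distribution over action sequences

    def marginal(p_y):
        # p(x) = sum over y of p(y) * p(x|y)
        return [sum(p_y[y] * p_x_given_y[y][x] for y in range(n_y)) for x in range(n_x)]

    def divergences(p_x):
        # KL divergence (in nats) between p(x|y) and the marginal p(x), one value per y
        return [
            sum(p * math.log(p / p_x[x]) for x, p in enumerate(row) if p > 0 and p_x[x] > 0)
            for row in p_x_given_y
        ]

    for _ in range(iters):
        d = divergences(marginal(p_y))
        # Reweight each action sequence by how distinguishable its outcomes are
        weights = [p * math.exp(dy) for p, dy in zip(p_y, d)]
        total = sum(weights)
        p_y = [w / total for w in weights]

    d = divergences(marginal(p_y))
    # Mutual information is the p(y)-weighted average divergence; convert nats to bits
    return sum(p * dy for p, dy in zip(p_y, d)) / math.log(2)

noiseless = [[1.0 if x == y else 0.0 for x in range(4)] for y in range(4)]
trapped = [[1.0, 0.0] for _ in range(4)]   # every action leads to the same input
noisy = [[0.9, 0.1], [0.1, 0.9]]           # two actions, 10% chance of the other outcome

e_noiseless = channel_capacity_bits(noiseless)  # about 2.0 bits
e_trapped = channel_capacity_bits(trapped)      # 0.0 bits
e_noisy = channel_capacity_bits(noisy)          # about 0.53 bits
```

Four perfectly distinguishable outcomes give two bits of empowerment, a state where nothing the agent does matters gives zero, and noise in the environment reduces empowerment below the number of available actions.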
Artificial agents driven purely by approximations/variations of this simple empowerment utility function naturally move to the centers of rooms/mazes[5:1], use keys, block lava, and escape predators in gridworlds [7], navigate obstacles, push blocks to clear rooms, learn vision to control digits[8], learn various locomotion skills (running, walking, hopping, flipping, and gliding)[9][10], open doors (in 3D) [11], learn to play games [12][13], and generally behave intelligently. Empowerment and related variations are also better explanations of human behavior than task reward even in games with explicit reward scores[14]. In multi-agent social settings, much of an agent’s ability to control the future flows through other agents, so drive for social status/influence is a natural instrumental subgoal of empowerment[15].
However these worlds are simple and often even assume a known dynamics model. Intelligent agents scaling to more complex environments will naturally need to use a learned world model, using some efficient approximation of Bayesian inference (i.e. SGD on neural nets). This presents a problem for an agent using a simple empowerment objective: how can the initially naive, untrained agent navigate to empowered states when it can’t yet even predict the consequences of its own actions? The first tasks of a learning agent are thus to learn its own embodiment and then explore and understand the world in order to later steer it: curiosity is a convergent subgoal of empowerment, and thus naturally also an effective intrinsic motivation objective by itself[16].
Maximizing empowerment or environmental control generally minimizes Bayesian surprise of the world model, whereas curiosity is often formulated as maximizing surprise. This apparent contradiction can be used directly as an adversarial objective where an explorer sub-agent seeks to surprise a control sub-agent, which in turn seeks to control the environment by minimizing surprise[12:1], or as a mixed objective[17]. Other approaches attempt to unify curiosity and empowerment as a single objective where an agent seeks to align their beliefs with the world and act to align the world with their beliefs[18]. The adage “information is power” likewise suggests a unification where agents gather information to reduce uncertainty and also seek control to reduce the unpredictability of future world states.[19]
Ultimately exploration/curiosity is an instrumental subgoal of empowerment (which itself is a convergent instrumental subgoal of most long term goals), because improving the agent’s ability to predict future world states will generally improve its future ability to steer the world. Intelligent agents first seek to be surprised, then to control, and finally to exploit.
Potential Cartesian Objections
As mentioned earlier, Klyubin’s original simple empowerment definition (maximization of actions->observations channel capacity) is subject to forms of input-channel hacking in partially observable environments: in a text world a simple echo command would nearly maximize action->input capacity, or in a 3D world a simple mirror provides high action->input capacity[6:1]. The most obvious solution is to instead use actions->state channel capacity, which overall seems a better formalization of power over the world.
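The echo example can be made concrete with a toy deterministic world. For a deterministic channel the capacity is just log2 of the number of distinct outcomes, so the two formulations can be compared by counting. The locked room, the command names, and the helper functions below are all hypothetical, chosen only to illustrate the gap between the two targets.

```python
import math

# Hypothetical commands in a toy text world.
ACTIONS = ["north", "south", "open", "shout"]

def next_state(state, action):
    """Hypothetical locked room: no command changes the world state."""
    return state

def observe_with_echo(state, action):
    """The terminal echoes the command back alongside the (unchanged) room."""
    return (state, action)

def deterministic_capacity_bits(outcomes):
    """Capacity of a deterministic channel: log2 of the number of distinct outcomes."""
    return math.log2(len(set(outcomes)))

state = "locked_room"
obs_bits = deterministic_capacity_bits(
    observe_with_echo(next_state(state, a), a) for a in ACTIONS
)  # 2.0 bits: the echo makes every command distinguishable
state_bits = deterministic_capacity_bits(
    next_state(state, a) for a in ACTIONS
)  # 0.0 bits: the agent has no actual influence on the world
```

The observation-based measure reports maximal empowerment for an agent that is in fact trapped, while the state-based measure correctly reports none.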
However, there are still potential issues with the precise representation of the action channel and especially the use of channel capacity or maximum potential mutual information for agents which are physically embedded in the world.
The simple action->state channel capacity empowerment function implicitly assumes that the agent is a black box outside of the world, which can always in the future output potentially any arbitrary bit sequence from its action stream into the world. But in reality the agent is fully part of the world; a subject of physics.
There are future configurations of the world where the agent’s mind is disassembled, or otherwise simply disabled by unplugging of the output wire that actually physically transmits output bits into the world. It is essential that the agent learns a self-model which implements/represents the action channel flexibly—as the learned predicted concept of physical influence rather than any specific privileged memory location.
The unplugging issue is a special case of a more serious potential problem arising from the use of channel capacity, i.e. the potential maximum information the agent can inject into the world. All actual physical agents are computationally constrained, and thus not all future action output bit combinations are equally likely, or even possible. As an obvious example, there exist reasonable-length sequences of output bits which you or I could output right now onto the internet which would grant us control of billions of dollars in cryptocurrency wealth. From a naive maximal action output channel capacity viewpoint, that wealth is essentially already yours (as accessible as money in your bank in terms of output sequence bit length), but in reality many interesting action bit sequences are not feasibly computable.
However given that computing the true channel capacity is computationally infeasible for long horizons anyway, efficient practical implementations use approximations which may ameliorate this problem to varying degrees. The ideal solution probably involves considering only the space of likely/possible accessible states, and moreover the agent will need to model its future action capacity as resulting from and constrained by a practical physical computation—ie a realistic self-model. This also seems required for deriving curiosity/exploration automatically as an instrumental goal of empowerment.
These Cartesian objections are relevant for future systems, but ultimately they don’t matter much for AI safety because powerful AI systems (even those of human-level intelligence) will likely need to overcome these problems regardless. Thus we can assume some efficient and robust approximation of empowerment is available to both selfish and altruistic AI alike.
Altruistic Empowerment: Early Tests
The idea of AI optimizing for external empowerment occurred to me while researching and writing the empowerment section of a previous post; later I found that some researchers from Oxford and DeepMind had already implemented, tested, and published an early version of this idea in “Learning Altruistic Behaviours in Reinforcement Learning without External Rewards”[20] by Franzmeyer et al (which also has references to some earlier related work).
They test several variations of state reachability as the approximate empowerment objective, which is equivalent to Klyubin-empowerment under some simplifying assumptions such as deterministic environment transitions but can be computed more efficiently.
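To illustrate why reachability is a cheap stand-in: when transitions are deterministic, each action sequence leads to exactly one final state, so the channel capacity reduces to log2 of the number of distinct states reachable within the horizon. The sketch below counts reachable cells in a small hypothetical room. The grid layout, move set, and function names are assumptions of this example, not the setup used by Franzmeyer et al.

```python
import math

# Hypothetical 3x5 room surrounded by walls ('#').
GRID = [
    "#######",
    "#.....#",
    "#.....#",
    "#.....#",
    "#######",
]
MOVES = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]  # stay, right, left, down, up

def reachable(cell, horizon):
    """Set of cells the agent can occupy after exactly `horizon` steps."""
    frontier = {cell}
    for _ in range(horizon):
        nxt = set()
        for (r, c) in frontier:
            for dr, dc in MOVES:
                nr, nc = r + dr, c + dc
                if GRID[nr][nc] == "#":
                    nr, nc = r, c  # bumping into a wall leaves the agent in place
                nxt.add((nr, nc))
        frontier = nxt
    return frontier

def reach_empowerment_bits(cell, horizon):
    """Deterministic-world empowerment: log2 of the reachable state count."""
    return math.log2(len(reachable(cell, horizon)))

center_bits = reach_empowerment_bits((2, 3), 2)  # log2(11), about 3.46 bits
corner_bits = reach_empowerment_bits((1, 1), 2)  # log2(6), about 2.58 bits
```

The center of the room scores higher than the corner, consistent with the earlier observation that empowerment-driven agents drift toward the centers of rooms.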
In a simple grid world, their altruistic assistant helps the leader agent by opening a door, and—with sufficient planning-horizon—gets out of the way to allow the leader to access a high reward at the end of a maze tunnel. The assistant does this without any notion of the leader’s reward function. However with shorter planning horizons the assistant fails as it tries to ‘help’ the leader by blocking their path and thereby preventing them from making the poor choice of moving to a low-powered tunnel area.
They also test a simple multiplayer tag scenario where the altruists must prevent their leader from being tagged by adversaries. In this setup the choice-empowerment objectives even outperform direct supervised learning, presumably because of denser training signal.
From their conclusion:
Scaling this approach up to increasingly complex and realistic sim environments is now an obvious route forward towards altruistic AGI.
Mirrors of Self and Other
Human level intelligence requires learning a world model powerful enough to represent the concept of the self as an embedded agent. Humans learn to recognize themselves in mirrors by around age two, and several animal species with larger brain capacity (some primates, cetaceans, and elephants) can also pass mirror tests. Mirror self-recognition generally requires understanding that one’s actions control a body embedded in the world, as seen through the mirror.
Given that any highly intelligent agent will need a capability to (approximately) model and predict its own state and outputs in the future, much of that same self-modelling capacity can be used to predict the state and outputs of other agents. Most of a mind’s accumulated evidence about how minds think in general is naturally self-evidence, so it is only natural that the self-model serves as the basic template for other-models, until sufficient evidence accumulates to branch off a specific individual sub-model.
This simple principle forms the basis of strategy in board games such as chess or go where the complexities of specific mental variations are stripped away: both humans and algorithms predict their opponent’s future moves using the exact same model they use to predict their own. In games that incorporate bluffing such as poker some differentiation in player modeling becomes important, and then there are games such as roshambo where high level play is entirely about modeling an opponent’s distinct strategy—but not values or utility. In the real world, modelling others as self is called social projection, leading to the related false consensus effect/bias.
To understand humans and predict their actions and reactions AGI may need to model human cognitive processes and values in some detail, for the same reasons that human brains model these details and individualized differences. But for long term planning optimization purposes the detailed variation in individual values becomes irrelevant and the AGI can simply optimize for our empowerment.
Empowerment is the only long term robust objective due to instrumental convergence. The specific human values that most deviate from empowerment are exactly the values that are least robust and the most likely to drift or change as we become posthuman and continue our increasingly accelerated mental and cultural evolution, so mis-specification or lock-in of these divergent values could be disastrous.
Frequently Anticipated Questions/Criticisms
Relative compute costs
Not necessarily—if the ‘other’ alignment target is a single human or agent of comparable complexity to the AGI, the compute requirements should be similar. More generally agency is a fluid hierarchical concept: the left and right brain hemispheres are separate agents which normally coordinate and align so effectively that they form a single agency, but there are scenarios (split-brain patients) which break this coordination and reveal two separate sub-agents. Likewise organizations, corporations, groups, etc are forms of agents, and any practical large-scale AGI will necessarily have many localized input-output streams and compute centers. Conceptually empowerment is estimated over a whole agent/agency’s action output stream, and even if the cost scaled with output stream bitrate that if anything only implies a higher cost for computing selfish-empowerment as the AGI scales.
Coordination advantages
Perhaps yes.
Consider two agents A and B who both have the exact same specific utility function X. Due to instrumental convergence both A and B will instrumentally seek self-empowerment at least initially, even though they actually have the exact same long term goal. This is because they are separate agents with unique localized egocentric approximate world models, and empowerment can only be defined in terms of approximate action influence on future predicted (egocentric approximate) world states. If both agents A and B somehow shared the exact same world model (and thus could completely trust each other assuming the world model encodes the exact agent utility functions), they would still have different action channels and thus different local empowerment scores. However they would nearly automatically coordinate because the combined group agent (A,B) achieves higher empowerment score for both A and B. The difference between A and B in this case has effectively collapsed to the difference between two brain hemispheres, or even less.
Two altruistic agents designed to empower humanity broadly should have fairly similar utility functions, and will also have many coordination advantages over humans: the ability to directly share or merge large ‘foundation’ world models, and potentially the use of cryptographic techniques to prove alignment of utility functions.
Two selfish agents designed to empower themselves (or specific humans) would have fewer of these coordination advantages.
Identity preservation
In much the same way that selfish AGI will seek to preserve identity.
Empowerment, by one definition, is the channel capacity or influence of an agent’s potential actions on the (approximate predicted) future world state. An agent who is about to die has near zero empowerment: more generally, empowerment collapses to zero as the time remaining until death shrinks to zero.
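A toy version of this collapse, under strong simplifying assumptions: the agent lives on a 1D line, can move at most one cell per step, and stops influencing the state entirely once it dies. Reachable-state count is used as the deterministic stand-in for channel capacity. The world and the function names are hypothetical.

```python
import math

def reachable_positions(start, horizon, steps_until_death):
    """Positions on a 1D line reachable by the end of `horizon` steps.

    The agent can move -1, 0, or +1 per step, but only while it is alive.
    """
    positions = {start}
    for t in range(horizon):
        if t >= steps_until_death:
            break  # after death the agent no longer influences the state
        positions = {p + m for p in positions for m in (-1, 0, 1)}
    return positions

def empowerment_bits(start, horizon, steps_until_death):
    """log2 of the number of distinct final positions the agent can still reach."""
    return math.log2(len(reachable_positions(start, horizon, steps_until_death)))

healthy = empowerment_bits(0, 10, 10)  # log2(21), about 4.39 bits
dying = empowerment_bits(0, 10, 3)     # log2(7), about 2.81 bits
dead = empowerment_bits(0, 10, 0)      # 0.0 bits
```

With the planning horizon held fixed, empowerment shrinks monotonically with remaining lifetime and reaches zero at death, which is why an empowerment maximizer has a built-in reason to preserve the agent it is computing empowerment for.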
Agents naturally change over time, so a natural challenge of any realistic empowerment approximation for AGI is that of identifying the continuity of agentic identity. As discussed in the cartesian objection section any practical empowerment approximation suitable for AGI will already need a realistic embedded self-model. Continuation of identity is then a natural consequence of the requirement that the empowerment function must be computed for a consistent agent identity over time. In other words computing the empowerment of agent X over temporal trajectory T first requires locating agent X in the predicted future world states of T, which implicitly assumes continuation of identity if the agent recognition is over temporal sequences.
Selfish instrumental convergence
Yes.
The planning trajectories of altruistic and selfish agents (and any others) will all look initially very similar, and will only diverge later over time dependent on discount rates and/or planning horizons.
This is probably bad news for alignment, as it implies it may be very difficult for humans to discern whether an AGI is altruistic or selfish based on what they do or say. Introspection and proof techniques both could potentially help, along with use of simboxes to test the divergence earlier in an appropriate sandbox sim environment.
One natural likely future point of divergence is around uploading: a true altruistic AGI will likely support/advocate uploading and upgrading/scaling uploads, a selfish AGI will likely not (regardless of what it may say or appear to do).
Approximation issues
Not really. Approximation is not a dirty word—near the entirety of the computation any useful intelligence does is approximation. Theoretically perfect computational ideals are just abstractions to inform efficient implementations, as in the real world the optimal practical system will always be based on clever approximations.
Naturally there are bad approximations of empowerment which break instrumental convergence in ways that matter, but those are less likely to be used, and regardless would hamper selfish AGI just as much as altruistic AGI.
What about Goodharting?
Due to instrumental convergence to empowerment, empowerment is by construction the proxy whose trajectories converge to those of nearly all reasonable utility functions. In complex realistic environments, as you extend the planning horizon the high value planning trajectories for nearly all reasonable utility functions collapse to a convergent golden path, the path which empowerment directly estimates. Empowerment is thus in this sense maximally robust: converging close to the median of all egocentric utility functions.
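A toy illustration of this convergence claim, assuming utilities are drawn independently and uniformly at random over terminal states (an assumption of this sketch, not something the argument above specifies). The agent chooses between a "broad" first move that keeps seven terminal states reachable and a "narrow" move that commits to a single one; the sketch estimates how many random utility functions prefer the broad move. The function name and the branch sizes are hypothetical.

```python
import random

def fraction_preferring_broad(n_broad=7, n_narrow=1, samples=10000, seed=0):
    """Estimate the fraction of random utility functions whose optimal first
    move is the one that keeps more terminal states reachable."""
    rng = random.Random(seed)
    prefer_broad = 0
    for _ in range(samples):
        # Best achievable utility behind each first move, for one random utility function
        broad_best = max(rng.random() for _ in range(n_broad))
        narrow_best = max(rng.random() for _ in range(n_narrow))
        if broad_best > narrow_best:
            prefer_broad += 1
    return prefer_broad / samples

fraction = fraction_preferring_broad()  # close to 7/8 under these assumptions
```

Under the i.i.d. assumption the best terminal state is equally likely to be any of the eight, so the option-preserving move is optimal for about 7/8 of utility functions, and it is also the move a reachability-based empowerment estimate would pick (log2(7) bits versus 0 bits).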
We also have significant accumulated direct experimental evidence that the convergence is fairly fast, certainly compared to human timescales (lifespans on order 1e11 neural clock cycles—many many orders of magnitude beyond experiments where approximate empowerment is already useful).
What about Deceptive Alignment?
Deception is general enough that it seems to be a potential problem for near any alignment approach.
Powerful AGI will tend to have long planning horizons, and due to instrumental convergence all powerful agents with long planning horizons will pursue similar initial plans. So it will be very difficult to discern an AGI’s true utility function from words or deeds.
Deceptive alignment can be detected and prevented with simboxing and strong interpretability tools.
What about Corrigibility?
Corrigibility is only useful if the agent doesn’t start with the correct utility function. If human empowerment is already sufficient, then corrigibility is not useful. Corrigibility may or may not be useful for more mixed designs which hedge and attempt to combine human empowerment with some mixture of learned human values.
Changing brains or values
Yes and no.
In the early days the AGI’s energies are probably best invested in its own self-improvement, since greater returns on cognitive compute investment are somewhat implicit in the assumption of human-surpassing AGI. But using some clever words to influence humans towards greater future empowerment seems like fairly low hanging fruit. Eventually our minds could become the limiter of our future empowerment, so the AGI would then seek to change some aspects of our minds, but due to instrumental convergence any such changes are likely in our long term best interest. Much of fun seems empowerment related (most fun video game genres clearly exploit aspects of empowerment), so it isn’t clear that fun (especially in moderation) is sub-optimal.
Ultimately though it is likely easier for the AGI itself to do the hard work, at least until uploading. After uploading AGI and humans become potentially much more similar, and thus expanding the cognitive capabilities of uploads could be favored over expanding the AGI’s own capabilities.
Sex and reproduction
Doesn’t really seem necessary for uploads, does it? One way of looking at this is to ask what humanity will be like in a thousand years of subjective time. Which of our current values are most vs least likely to change? Empowerment, being instrumental to all terminal values, is the only value that is timeless.
It does seem plausible that an AGI optimizing for human empowerment would want us to upload and reduce the human biological population, but that seems to be just a continuation of the trend that a large tract of society (the more educated, wealthy, first world) is already on.
Sex uses a fairly small amount of our resources compared to reproduction. An AGI seeking to empower a narrowly defined target of specific humans may seek to end reproduction. This trend breaks down for AGI with increasingly broad empowerment targets (humanity in general, etc), especially when we consider the computational fluidity of identity, but will obviously depend on the crucial agency definition/recognition model used for the empowerment target.
But our humanity
Probably not?
Our brains and values are the long term result of evolution optimizing for inclusive fitness. But since we reproduce roughly 1e11 neural clock cycles after birth, the trajectories leading eventually to reproduction instrumentally converge to empowerment, so evolution created brains which optimize mostly for empowerment. However empowerment itself is complex enough to have its own instrumental subgoals such as social status and curiosity.
All of our complex values, instincts, and mechanisms (all of those ‘shards’) ultimately form an instrumental hierarchy or tree serving inclusive fitness at the root, with empowerment as the primary sub-branch. The principal sub-branch which is most clearly distinct from empowerment is the sex/reproduction drive, but even then the situation is more complex and intertwined: human children are typically strategically aligned with their parents and can help extend the parents’ lifespan.
So fully optimizing solely for our empowerment may eventually change us or strip away some of our human values, but clearly not all or even the majority.
Societies of uploads competing for resources will face essentially the same competitive optimization pressure towards empowerment-related values. So optimizing for our empowerment is simply aligned with the natural systemic optimization pressure posthumans will face regardless after transcending biology and genetic inclusive fitness.
Empower whom or what?
Would external empowerment AGI optimize for all of humanity? Aliens? Animals? Abstract agents like the earth in general? Dead humans? Fictional beings?
Maybe yes, depending on how wide and generic the external agency recognition is. Wider conceptions of agency are likely also more long term robust.
We actually see evidence of this in humans already, some of whom seem to have a very general notion of altruism or ‘circle of empathy’ which extends beyond humanity to encompass animals, fictional AI or aliens, plants, and even the earth itself. Some humans have historically also acted as if they were optimizing for the goals of deceased humans or even imaginary beings.
One recent approach formalizes agents as systems that would adapt their policy if their actions influenced the world in a different way. Notice the close connection to empowerment, which suggests a related definition that agents are systems which maintain power potential over the future: having action output streams with high channel capacity to future world states. This all suggests that agency is a very general extropic concept and relatively easy to recognize.
About 100hz (fastest synchronous neural oscillation frequencies or ‘brain waves’) * 32 yrs (1e9 seconds).
Klyubin, Alexander S., Daniel Polani, and Chrystopher L. Nehaniv. “Empowerment: A universal agent-centric measure of control.” 2005 ieee congress on evolutionary computation. Vol. 1. IEEE, 2005.
Salge, Christoph, Cornelius Glackin, and Daniel Polani. “Empowerment–an introduction.” Guided Self-Organization: Inception. Springer, Berlin, Heidelberg, 2014. 67-114.
Schmidhuber, Jürgen. “Formal theory of creativity, fun, and intrinsic motivation (1990–2010).” IEEE transactions on autonomous mental development 2.3 (2010): 230-247.
Klyubin, Alexander S., Daniel Polani, and Chrystopher L. Nehaniv. “All else being equal be empowered.” European Conference on Artificial Life. Springer, Berlin, Heidelberg, 2005.
An agent maximizing control of its future input channel may be susceptible to forms of indirect channel ‘hacking’, seeking any means to more directly wire its output stream into its input stream. Using the future state—predicted from the agent’s world model—as the target channel largely avoids these issues, as immediate sensor inputs will only affect a subset of the model state. In a 3D world a simple mirror would allow high action->sensor channel capacity, and humans do find mirrors unusually fascinating, especially in VR, where they border on superstimuli for some.
Mohamed, Shakir, and Danilo Jimenez Rezende. “Variational information maximisation for intrinsically motivated reinforcement learning.” Advances in Neural Information Processing Systems 28 (2015).
Gregor, Karol, Danilo Jimenez Rezende, and Daan Wierstra. “Variational intrinsic control.” arXiv preprint arXiv:1611.07507 (2016).
Eysenbach, Benjamin, et al. “Diversity is all you need: Learning skills without a reward function.” arXiv preprint arXiv:1802.06070 (2018).
Sharma, Archit, et al. “Dynamics-aware unsupervised discovery of skills.” arXiv preprint arXiv:1907.01657 (2019).
Pong, Vitchyr H., et al. “Skew-fit: State-covering self-supervised reinforcement learning.” arXiv preprint arXiv:1903.03698 (2019).
Fickinger, Arnaud, et al. “Explore and Control with Adversarial Surprise.” arXiv preprint arXiv:2107.07394 (2021).
Dilokthanakul, Nat, et al. “Feature control as intrinsic motivation for hierarchical reinforcement learning.” IEEE Transactions on Neural Networks and Learning Systems 30.11 (2019): 3409-3418.
Matusch, Brendon, Jimmy Ba, and Danijar Hafner. “Evaluating agents without rewards.” arXiv preprint arXiv:2012.11538 (2020).
Jaques, Natasha, et al. “Social influence as intrinsic motivation for multi-agent deep reinforcement learning.” International Conference on Machine Learning. PMLR, 2019.
Liu, Hao, and Pieter Abbeel. “Behavior from the void: Unsupervised active pre-training.” Advances in Neural Information Processing Systems 34 (2021): 18459-18473.
Zhao, Andrew, et al. “A Mixture of Surprises for Unsupervised Reinforcement Learning.” arXiv preprint arXiv:2210.06702 (2022).
Hafner, Danijar, et al. “Action and perception as divergence minimization.” arXiv preprint arXiv:2009.01791 (2020).
Rhinehart, Nicholas, et al. “Information is Power: Intrinsic Control via Information Capture.” Advances in Neural Information Processing Systems 34 (2021): 10745-10758.
Franzmeyer, Tim, Mateusz Malinowski, and João F. Henriques. “Learning Altruistic Behaviours in Reinforcement Learning without External Rewards.” arXiv preprint arXiv:2107.09598 (2021).