This paper—accepted as a poster to NeurIPS 2022—is the sequel to Optimal Policies Tend to Seek Power. The new theoretical results are extremely broad, discarding the requirements of full observability, optimal policies, and even a finite number of options.
Abstract:
If capable AI agents are generally incentivized to seek power in service of the objectives we specify for them, then these systems will pose enormous risks, in addition to enormous benefits. In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive. However, the real world is neither fully observable, nor must trained agents be even approximately reward-optimal.
We consider a range of models of AI decision-making, from optimal, to random, to choices informed by learning and interacting with an environment. We discover that many decision-making functions are retargetable, and that retargetability is sufficient to cause power-seeking tendencies. Our functional criterion is simple and broad.
We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power. We demonstrate the flexibility of our results by reasoning about learned policy incentives in Montezuma’s Revenge. These results suggest a safety risk: Eventually, retargetable training procedures may train real-world agents which seek power over humans.
Examples of agent designs the power-seeking theorems now apply to:
Boltzmann-rational agents,
Expected utility maximizers and minimizers,
Even if they uniformly randomly sample a few plans and then choose the best sampled plan,
Satisficers (as I formalized them),
Quantilizing with a uniform prior over plans, and
RL-trained agents under certain modeling assumptions.
The key insight is that the original results hinge not on optimality per se, but on the retargetability of the policy-generation process via a reward or utility function or some other parameter. See Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability for intuitions and illustrations.
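To make the retargetability idea concrete, here is a toy sketch (mine, not from the paper; the three-outcome setup, function names, and thresholds are illustrative only): several qualitatively different decision-making functions, each parameterized by a utility assignment, all switch from favoring outcome A to outcome B when the utility values of A and B are swapped.

```python
# Minimal illustration of "parametric retargetability": permuting the utility
# parameter retargets qualitatively different decision-making procedures.
import math
import random

OUTCOMES = ["A", "B", "C"]

def argmax_chooser(u):
    """Expected-utility maximizer over a handful of outcomes."""
    return max(OUTCOMES, key=lambda o: u[o])

def boltzmann_chooser(u, temperature=1.0):
    """Boltzmann-rational agent: returns choice probabilities, not a single pick."""
    weights = {o: math.exp(u[o] / temperature) for o in OUTCOMES}
    z = sum(weights.values())
    return {o: w / z for o, w in weights.items()}

def satisficer(u, threshold=0.5):
    """Picks uniformly among outcomes whose utility clears the threshold."""
    good = [o for o in OUTCOMES if u[o] >= threshold] or OUTCOMES
    return random.choice(good)

u = {"A": 1.0, "B": 0.0, "C": 0.0}          # parameter that "cares about" A
u_swapped = {"A": 0.0, "B": 1.0, "C": 0.0}  # same parameter with A and B permuted

print(argmax_chooser(u), argmax_chooser(u_swapped))                    # A B
print(satisficer(u), satisficer(u_swapped))                            # A B
print(boltzmann_chooser(u)["A"] == boltzmann_chooser(u_swapped)["B"])  # True
```

The point the theorems build on is that this kind of retargetability of the procedure via its parameter, not optimality, is what drives the orbit-level tendencies.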
Why am I only now posting this?
First, I’ve been way more excited about shard theory. I still think these theorems are really cool, though.
Second, I think the results in this paper are informative about the default incentives for decision-makers which “care about things”: i.e., which make decisions on the basis of e.g. how many diamonds that decision leads to, or how many paperclips, and so on. However, I think that conventional accounts and worries around “utility maximization” are subtly misguided. Whenever I imagined posting this paper, I felt like “ugh, sharing this result will just make it worse.” I’m not looking to litigate that concern right now, but I do want to flag it.
Third, Optimal Policies Tend to Seek Power makes the “reward is the optimization target” mistake super strongly. Parametrically retargetable decision-makers tend to seek power makes the mistake less severely, both because it discusses utility functions and learned policies instead of optimal policies, and also thanks to edits I’ve made since realizing my optimization-target mistake.
Conclusion
This paper isolates the key mechanism—retargetability—which enables the results in Optimal Policies Tend to Seek Power. This paper also takes healthy steps away from the optimal policy regime (which I consider to be a red herring for alignment) and lays out a bunch of theory I found—and still find—beautiful.
This paper is both published in a top-tier conference and, unlike the previous paper, actually has a shot of being applicable to realistic agents and training processes. Therefore, compared to the original[1] optimal policy paper, I think this paper is better for communicating concerns about power-seeking to the broader ML world.
[1] I’ve since updated the optimal policy paper with disclaimers about Reward is not the optimization target, so the updated version is at least passable in this regard. I still like the first paper, am proud of it, and think it was well-written within its scope. It also takes a more doomy tone about AGI risk, which seems good to me.
I appreciate this generalization of the results—I think it’s a good step towards showing the underlying structure involved here.
One point I want to comment on is transitivity of $\geq^n_{\text{most}}$, as a relation on induced functions $f:\Theta\to\mathbb{R}$. Namely, it isn't, and can even contain cycles of non-equivalent elements. (This came up when I was trying to apply a version of these results, and hoping that $\geq^n_{\text{most}}$ would be the preference relation I was looking for out of the box.) Quite possibly you noticed this since you give ‘limited transitivity’ in Lemma B.1 rather than full transitivity, but to give a concrete example:
Let $V=\begin{pmatrix}1&2&3\\3&1&2\\2&3&1\end{pmatrix}$ and $f(i\mid j)=V_{ij}$. The permutations are $\sigma\in S_3$ with the usual action on $\{1,2,3\}$. Then we have[1] $f_1 \geq^2_{\text{most}} f_2 \geq^2_{\text{most}} f_3 \geq^2_{\text{most}} f_1$ (and $f_2 \not\geq^2_{\text{most}} f_1$). This also works on retargetability directly, with $f$ being $A\xrightarrow{2}B$, $B\xrightarrow{2}C$, $C\xrightarrow{2}A$ retargetable. Notice also that $f$ is invariant under joint permutations (constant diagonals), and I think can be represented as EU-determined, so neither of these saves it.
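(If it helps, here is a quick numerical check of this cycle, a sketch of my own under the reading of $\geq^2_{\text{most}}$ as "strictly-greater orbit elements outnumber strictly-smaller ones by a factor of 2"; rows are 0-indexed, so rows 0, 1, 2 play the roles of $f_1, f_2, f_3$.)

```python
# Verify the non-transitivity example: f(i|j) = V[i][j], and the S_3 orbit of
# any parameter j is all three columns {0, 1, 2}.
V = [[1, 2, 3],
     [3, 1, 2],
     [2, 3, 1]]

def geq_n_most(i, k, n=2):
    """Row i beats row k on at least n times as many orbit elements as it loses."""
    wins = sum(V[i][j] > V[k][j] for j in range(3))
    losses = sum(V[k][j] > V[i][j] for j in range(3))
    return wins >= n * losses

print(geq_n_most(0, 1), geq_n_most(1, 2), geq_n_most(2, 0))  # True True True: a cycle
print(geq_n_most(1, 0))                                      # False
```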
A narrow point is that for a non-transitive relation, I think the notation should be something other than $\geq$ (maybe $\succcurlyeq$).
But more importantly, I think we would really rather have a transitive (at least acyclic) relation, if we want to interpret this as ‘most $\theta$ prefer’ or any kind of preference / aggregation of preferences. If our theorem gives us only an intransitive relation as our conclusion, then we should tweak it.
One way you can do this: aim for a stronger relation like $\geq^n_{\text{o-m}}$:
Definition (Orbit-mean dominance?): Let $O_{f,A\neq B}(\theta)=\{\theta'\in\mathrm{Orbit}_{|\Theta}(\theta): f(A\mid\theta')\neq f(B\mid\theta')\}$. Write $f(B\mid\theta)\geq^n_{\text{o-m}} f(A\mid\theta)$ if $\forall\theta:\ \sum_{\theta'\in O_{f,A\neq B}(\theta)} f(B\mid\theta') \geq n\sum_{\theta'\in O_{f,A\neq B}(\theta)} f(A\mid\theta')$.
Since the orbits are under $S_d$, i.e. finite, it’s easy to just sum over them. More generally, you could parameterize this with an arbitrary aggregator $g:\mathrm{Orbits}_\Theta(f)\to\mathbb{R}$ in place of summation; I’m not sure whether this general form or the $\sum$ case should be the focus.
This is transitive for $n=1$ and acyclic for[2] $n>1$ (consider $\theta$ by $\theta$); and possibly any orbit-based transitive relation is representable in basically this form[3] (with some $g$), since I’d guess any partial order on sets with cardinality $\leq c$ can be represented as a pointwise inequality of functions, but I haven’t thought about this too carefully.
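Here is a minimal sketch of how I'd compute this on the earlier example (my own code, assuming the orbit of every $\theta$ is all of $\{1,2,3\}$ and ties are excluded as in the definition):

```python
# Orbit-mean dominance on the same V: compare sums of f over the orbit elements
# where the two rows differ (here the rows differ in every column).
V = [[1, 2, 3],
     [3, 1, 2],
     [2, 3, 1]]

def geq_n_om(i, k, n=1):
    cols = [j for j in range(3) if V[i][j] != V[k][j]]  # the set O_{f, A != B}
    return sum(V[i][j] for j in cols) >= n * sum(V[k][j] for j in cols)

# Every pair of rows sums to 6 over the differing columns, so for n = 1 all
# comparisons hold symmetrically (ties), and the strict cycle from the counting
# relation disappears; for n = 2 no comparison holds at all.
print([(i, k, geq_n_om(i, k)) for i in range(3) for k in range(3) if i != k])
```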
With this notion of $\geq^n_{\text{o-m}}$, we also need a stronger version of retargetability for the main theorem to hold. For the $\sum$ version, this could be:
Definition (scalar-retargetability): Write $f$ is $A\xrightarrow{\text{scalar}}B$ if there exists $\sigma\in S_d$ such that for all $\theta_A$ with $f(A\mid\theta_A)-f(B\mid\theta_A)=c>0$ we have $f(B\mid\sigma\theta_A)-f(A\mid\sigma\theta_A)\geq c$ (and likewise multiply scalar-retargetable).
Then scalar-retargetability from $A$ to $B$ will imply $f(B\mid\theta)\geq^n_{\text{o-m}} f(A\mid\theta)$.
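As a sanity check on the example above, here is a brute-force search over $\sigma\in S_3$ for a witness of the definition as stated (again my own sketch):

```python
# Is there a single column permutation sigma witnessing scalar-retargetability
# from row a to row b? (Vacuously true if row a never strictly beats row b.)
from itertools import permutations

V = [[1, 2, 3],
     [3, 1, 2],
     [2, 3, 1]]

def scalar_retargetable(a, b):
    for sigma in permutations(range(3)):
        if all(V[b][sigma[j]] - V[a][sigma[j]] >= V[a][j] - V[b][j]
               for j in range(3) if V[a][j] - V[b][j] > 0):
            return True
    return False

# For this V, all six ordered pairs come back False: no pair of rows is
# scalar-retargetable, even though the counting relation had a cycle.
print([(a, b, scalar_retargetable(a, b)) for a in range(3) for b in range(3) if a != b])
```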
And: I think many (all?) of the main power-seeking results are already secretly in this form. For example, $\theta$-wise comparison of $\sum_{\theta'\in\mathrm{Orbit}_{|\Theta}(\theta)}\mathrm{IsOptimal}(X\mid C,\theta')$ gives a preference relation $\geq^n_{\text{o-m}}$ identical to the relation $\geq^n_{\text{most}}$. Assuming this also works for the other rationalities, then the cases we care about were transitive all along exactly because the relations can be expressed in this way.
What do you think?
[1] We get the same single orbit $\{1,2,3\}$ for all $\theta$, a.k.a. $j$; the orbit elements $j$ with $f(i\mid j)>f(i'\mid j)$ are the columns where row $i$ > row $i'$. There are always two such columns when comparing row $i$ and row $i+1 \pmod 3$. For example: $f(1\mid 1)=1<3=f(2\mid 1)$, $f(1\mid 2)=2>1=f(2\mid 2)$, $f(1\mid 3)=3>2=f(2\mid 3)$.
[2] We exclude $\theta'$ s.t. $f(A\mid\theta')=f(B\mid\theta')$ in this version of the definition to match the behaviour of $\geq^n_{\text{most}}$ with $n>1$, and to allow $n$-scalar-retargetability to imply $\geq^n_{\text{o-m}}$. There’s a case that you should include them, in which case you do get transitivity, and even the stronger property: if $x\leq^n y\leq^m z$, then $x\leq^{nm} z$. I think this corresponds to looking at likelihood ratios of $P(A\wedge\neg B)::P(B\wedge\neg A)$ vs. $P(A)::P(B)$.
[3] Compare also what would give you a total order (instead of a partial order): aggregating over all of $\Theta$ at once, like $\int_\Theta f(A\mid\theta)\,d\mu(\theta)$, instead of aggregating orbitwise at each $\theta$.
This is a nice contribution, thank you!
I agree with the parts I could verify within about 10 minutes of staring (it’s been a while). The scalar-retargetability is nice, and I like the delineation of what definitions yield what properties. Seems like an additional hour of work would yield a good AF post, where I’d expect most of the useful additional work to come from fleshing out the example more and justifying the claims in a bit more detail.
To clarify:
What are A,B,C here?
FWIW—here (finally) is the related post I mentioned, which motivated this observation: Natural Abstraction: Convergent Preferences Over Information Structures. The context is a power-seeking-style analysis of the naturality of abstractions, where I was determined to have transitive preferences.
It had quite a bit of scope creep already, so I ended up not including a general treatment of the (transitive) ‘sum over orbits’ version of retargetability (and in some parts I considered only optimality—sorry! I still think it makes sense to start there first and then generalize in this case). The full translation also isn’t necessarily as easy as I thought—it turns out that $\geq^n_{\text{most}}$ is transitive specifically for binary functions, so the other cases may not translate as easily as IsOptimal. After noticing that, I decided to leave the general case for later.
I did use the sum-over-orbits form, though, which turns out to describe the preferences shared by every “$G$-invariant” distribution over utility functions. Reading between the lines shows roughly what it would look like.
I also moved from $S_d$ to any $G\leq S_d$ - not sure if you looked at that, but at least the parts I was using all seem to work just as well with any subgroup. This gives preferences shared by a larger set of distributions, e.g. for an MDP you could in some cases have $s_1$ preferred to $s_2$ for all priors on $U$ that are merely invariant to permuting $U(s_1)$ and $U(s_2)$ (rather than requiring them to be invariant to all permutations of utilities).
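To illustrate the subgroup point with a toy example (my sketch; the specific utility vector and generators are made up): the orbit of a utility vector under a small subgroup $G$ is much smaller than under all of $S_3$, which is the sense in which $G$-invariance is a weaker demand on the prior and so covers more distributions.

```python
# Orbits of a utility vector under the full symmetric group vs. a subgroup G
# generated by a single swap of U(s1) and U(s2).
from itertools import permutations

def orbit(u, group):
    return {tuple(u[g[i]] for i in range(len(u))) for g in group}

u = (0.0, 1.0, 5.0)                # utilities for states s1, s2, s3
S3 = list(permutations(range(3)))  # all 6 permutations
G = [(0, 1, 2), (1, 0, 2)]         # identity and the (s1 s2) swap

print(len(orbit(u, S3)), orbit(u, S3))  # 6 permuted vectors
print(len(orbit(u, G)), orbit(u, G))    # 2 vectors
```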
Thanks for the reply. I’ll clean this up into a standalone post and/or cover this in a related larger post I’m working on, depending on how some details turn out.
Variables I forgot to rename, when I changed how I was labelling the arguments of $f$ in my example. This should be $1\xrightarrow{2}2$, $2\xrightarrow{2}3$, $3\xrightarrow{2}1$ retargetable (as arguments $i$ to $f(i\mid j)$).
I’m finally engaging with this after having spent too long afraid of the math. Initial thoughts:
This result is really impressive and I’m surprised it hasn’t been curated. My guess is that it’s not presented in the most accessible way, so maybe it deserves a distillation.
The conclusion isn’t as strong or clean as I’d want. It’s not clear how to think about orbit-level power-seeking. I’d be excited about a stronger conclusion but wouldn’t know how to get it.
I found this sentence from the explainer interesting: “There is no possible way to combine EU-based decision-making functions so that orbit-level instrumental convergence doesn’t apply to their composite.” Elliott Thornley also has a theorem deriving nonshutdownability from assumptions like “Indifference to Attempted Button Manipulation: The agent is indifferent between trajectories that differ only with respect to the actions chosen in shutdown-influencing states.” Together, maybe these point at a general principle that corrigible agents must care about means, not just ends.
Some confusions I’m still trying to resolve:
Can we say that power-seeking agents will disempower humans? I saw a post in the sequence about POWER in multi-agent games.
How do AUP agents get around these theorems?
If LLMs end up being useful, how do they get around these theorems? Can we get some result where if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?
Can we get a crude measure of how power-seeking agents will be in the real world, especially with the weakened assumptions of this paper?
Intuitively, eliciting that kind of failure seems like it would be pretty easy, but it doesn’t seem to be a blocker for the usefulness of the generalized form of LLMs. My mental model goes something like:
Foundational goal agnosticism evades optimizer-induced automatic doom, and
Models implementing a strong approximation of Bayesian inference are, not surprisingly, really good at extracting and applying conditions, so
They open the door to incrementally building a system that holds the entirety of a safe wish.
Things like “caring about means,” or otherwise incorporating the vast implicit complexity of human intent and values, can arise in this path, while I’m not sure the same can be said for any implementation that tries to get around the need for that complexity.
It seems like the paths which try to avoid importing the full complexity while sticking to crisp formulations will necessarily be constrained in their applicability. In other words, any simple expression of values subject to optimization is only safe within a bounded region. I bet there are cases where you could define those bounded regions and deploy the simpler version safely, but I also bet the restriction will make the system mostly useless.
Biting the bullet and incorporating more of the necessary complexity expands the bounded region. LLMs, and their more general counterparts, have the nice property that turning the screws of optimization on the foundation model actually makes this safe region larger. Making use of this safe region correctly, however, is still not guaranteed. 😊
At some point, I’d be interested in seeing a distillation of the results in this sequence. (Neither a request nor even a suggestion, but simply an observation about what I think would be nice.)
Thanks for sharing. Though can you explain this phrasing in the abstract?:
As I understand, agents inherently have some non-zero possibility of seeking power over humans, other agents, etc., by definition.
Very interesting. In general I agree concerns about EU maximisation are subtly misguided, but how would you square this result with Shard Theory? Where does Shard Theory fit in with corrigibility?