I think AI agents (trained end-to-end) might intrinsically prefer power-seeking, in addition to whatever instrumental drives they gain.
The logical structure of the argument
Premises
People will configure AI systems to be autonomous and reliable in order to accomplish tasks.
This configuration process will reinforce & generalize behaviors which complete tasks reliably.
Many tasks involve power-seeking.
The AI will complete these tasks by seeking power.
The AI will be repeatedly reinforced for its historical actions which seek power.
There is a decent chance the reinforced circuits (“subshards”) prioritize gaining power for the AI’s own sake, not just for the user’s benefit.
Conclusion: There is a decent chance the AI seeks power for itself, when possible.
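To make the reinforcement-and-generalization dynamic in the premises concrete, here is a minimal toy sketch, assuming a made-up two-task bandit with a single "seek power" action and a REINFORCE-style update in Python/NumPy; every number and name is an arbitrary illustrative assumption, not something from the post. The point it illustrates: if power-seeking actions help on most tasks, a shared policy gets reinforced for them and then generalizes, seeking power even on tasks where it gains nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two actions the toy agent can take on any task:
#   0 = do the task directly
#   1 = acquire extra resources ("seek power") first
N_ACTIONS = 2

# Probability of completing the task for each (task type, action).
# Task type 0 (common): power-seeking pays off. Task type 1 (rare): it is neutral.
SUCCESS = np.array([
    [0.3, 0.9],
    [0.8, 0.8],
])
TASK_PROBS = [0.8, 0.2]  # most tasks reward power-seeking (the "many tasks" premise)

# A single shared policy over actions, standing in for reinforced circuits
# that generalize across tasks.
logits = np.zeros(N_ACTIONS)
LR = 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(20_000):
    task = rng.choice(2, p=TASK_PROBS)
    probs = softmax(logits)
    action = rng.choice(N_ACTIONS, p=probs)
    reward = float(rng.random() < SUCCESS[task, action])

    # REINFORCE-style update: whatever action preceded success gets strengthened,
    # with no credit assignment for *why* it helped.
    grad = -probs
    grad[action] += 1.0
    logits += LR * reward * grad

print("P(seek power) on any task:", round(softmax(logits)[1], 2))
# The shared policy was mostly reinforced on tasks where power-seeking helped,
# so the agent now seeks power even on tasks where it gains nothing.
```

The sketch does not settle whether the resulting preference counts as "instrumental" or "intrinsic", which is what the discussion below turns on; it only shows that reinforcement alone can produce a context-general bias toward power-seeking.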
Read the full post at turntrout.com/intrinsic-power-seeking
Find out when I post more content: newsletter & RSS
Note that I don’t generally read or reply to comments on LessWrong. To contact me, email alex@turntrout.com.
I’m not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of ‘intrinsic’ drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:
Both of these seem like what I would expect from a flexible, intelligent agent capable of handling many complicated, changing domains, like an LLM: they are easy to steer to seek power (see all the work on RLHF, the superficiality of alignment, the ease of steering, and low-dimensional embeddings), and they can execute useful heuristics even when those cannot easily be explained as part of a larger plan. (Arguably, that’s most of what they do currently.)

In the hypotheticals you give, the actions seem just like a convergent instrumental drive of the sort an agent will rationally develop in order to handle all the possible tasks that might be thrown at it, in a bewildering variety of scenarios, by billions of crazy humans and also other AIs. Trying to have ‘savings’ or ‘buying a bit of compute to be safe’, even if the agent cannot say exactly what it would use those for in the current scenario, seems like convergent, and desirable, behavior. Like buying insurance or adding validation checks to new code, it usually won’t help, but sometimes the prudence will pay off. As humans say, “shit happens”. Agents which won’t do that, and instead helplessly succumb to hardware they know is flaky, or give up the moment something is a little more expensive than average, or write code that explodes the instant you look at it funny because you didn’t say “make sure to check for X, Y & Z”, are not good agents for any purpose.
If there are ‘subshards’ which achieve this desirable behavior because they, from their own perspective, ‘intrinsically’ desire power (whatever sense that sort of distinction makes once you’ve broken things down that far), and it is these subshards which implement the instrumental drive… so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles “intrinsically desire” to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can’t have ‘instrumental desire’ homunculi all the way down to the individual transistor or ReLU neuron.
I sent this paragraph to TurnTrout as I was curious to get his reaction. Paraphrasing his response below:
No, that’s not the point. That’s actually the opposite of what I’m trying to say. The subshards implement the algorithmic pieces, and the broader agent has an “intrinsic desire” for power. The subshards themselves are not agentic, and that’s why (in context) I substitute them in for “circuits”.
It’s explained in the post that I linked to, though I guess in context I do say “prioritize” in a way that might be confusing. Shard Theory argues against homunculist accounts of cognition by considering the mechanistic effects of reinforcement processes. Also, the subshards are not implementing an instrumental drive in the sense of “implementing the power-seeking behavior demanded by some broader consequentialist plan”; they’re just seeking power, just ’cuz.
From my early post: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems
That’s a very thoughtful response from TurnTrout. I wonder if @Gwern agrees with its main points. If not, it would be good to know where he thinks it fails.
I did not understand his response at all, and it sounds like I would have to reread a bunch of TurnTrout posts before commenting further or we would just be talking past each other, so I don’t have anything useful to say. Maybe someone else can re-explain his point better and show why I am apparently wrong.
This still feels like instrumentality. I guess maybe the addition is that it’s a sort of “when all you have is a hammer” situation; as in, even when the optimal strategy for a problem does not involve seeking power (assuming such a problem exists; really I’d say the question is what the optimal trade-off is between seeking power and using it), the AI would be more liable to err on the side of seeking too much power, because that just happens to be such a common successful strategy that it’s biased towards it.