I’m not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of ‘intrinsic’ drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:
Relative to other goals, agentic systems are easy to steer to seek power.
Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.
Both of these seem like what I would expect from a flexible, intelligent agent which is capable of handling many complicated changing domains, like an LLM: they are easy to steer to seek power (see: all the work on RLHF and the superficiality of alignment and ease of steering and low-dimensional embeddings), and they can execute useful heuristics even if those cannot be easily explained as part of a larger plan. (Arguably, that’s most of what they do currently.) In the hypotheticals you give, the actions seem just like a convergent instrumental drive of the sort that an agent will rationally develop in order to handle all the possible tasks which might be thrown at it in a bewildering variety of scenarios by billions of crazy humans and also other AIs. Trying to have ‘savings’ or ‘buying a bit of compute to be safe’, even if the agent cannot say exactly what it would use those for in the current scenario, seems like convergent, and desirable, behavior. Like buying insurance or adding validation checks to some new code, usually it won’t help, but sometimes the prudence will pay off. As humans say, “shit happens”. Agents which won’t do that and just helplessly succumb to hardware they know is flaky, or give up the moment something is a little more expensive than average, or write code that explodes the instant you look at it funny because you didn’t say “make sure to check for X, Y & Z”: those agents are not good agents for any purpose.
If there are ‘subshards’ which achieve this desirable behavior because they, from their own perspective, ‘intrinsically’ desire power (whatever sense that sort of distinction makes when you’ve broken things down that far), and it is these subshards which implement the instrumental drive… so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles “intrinsically desire” to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can’t have ‘instrumental desire’ homunculuses all the way down to the individual transistor or ReLU neuron.
I sent this paragraph to TurnTrout as I was curious to get his reaction. Paraphrasing his response below:
No, that’s not the point. That’s actually the opposite of what I’m trying to say. The subshards implement the algorithmic pieces and the broader agent has an “intrinsic desire” for power. The subshards themselves are not agentic, and that’s why (in context) I substitute them in for “circuits”.
It’s explained in this post that I linked to. Though I guess in context I do say “prioritize” in a way that might be confusing. Shard Theory argues against homunculist accounts of cognition by considering the mechanistic effects of reinforcement processes. Also, the subshards are not implementing an instrumental drive in the sense of “implementing the power-seeking behavior demanded by some broader consequentialist plan”; they’re just seeking power, just ’cuz.
I literally do not understand what the internal cognition is supposed to look like for an inner-aligned agent. Most of what I’ve read has been vague, on the level of “an inner-aligned agent cares about optimizing the outer objective.”
Charles Foster comments:
“We are attempting to mechanistically explain how an agent makes decisions. One proposed reduction is that inside the agent, there is an even smaller inner agent that interacts with a non-agential evaluative submodule to make decisions for the outer agent. But that raises the immediate questions of ‘How does the inner agent make its decisions about how to interact with the evaluative submodule?’ and then ‘At some point, there’s gotta be some non-agential causal structure that is responsible for actually implementing decision-making, right?’ and then ‘Can we just explain the original agent’s behavior in those terms? What is positing an externalized evaluative submodule buying us?’”
Perhaps my emphasis on mechanistic reasoning and my unusual level of precision in my speculation about AI internals, perhaps these make people realize how complicated realistic cognition is in the shard picture. Perhaps people realize how much might have to go right, how many algorithmic details may need to be etched into a network so that it does what we want and generalizes well.
That’s a very thoughtful response from TurnTrout. I wonder if @Gwern agrees with its main points. If not, it would be good to know where he thinks it fails.
I did not understand his response at all, and it sounds like I would have to reread a bunch of TurnTrout posts before any further comment would be anything but talking past each other, so I don’t have anything useful to say. Maybe someone else can re-explain his point better and why I am apparently wrong.
From my early post: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems