If there are ‘subshards’ which achieve this desirable behavior because they, from their own perspective, ‘intrinsically’ desire power (whatever sense that sort of distinction makes when you’ve broken things down that far), and it is these subshards which implement the instrumental drive… so what? After all, there has to be some level of analysis at which an agent stops thinking about whether or not it should do some thing and just starts doing the thing. Your muscles “intrinsically desire” to fire when told to fire, but the motor actions are still ultimately instrumental, to accomplish something other than individual muscles twitching. You can’t have ‘instrumental desire’ homunculi all the way down to the individual transistor or ReLU neuron.
I sent this paragraph to TurnTrout as I was curious to get his reaction. Paraphrasing his response below:
No, that’s not the point. That’s actually the opposite of what I’m trying to say. The subshards implement the algorithmic pieces and the broader agent has an “intrinsic desire” for power. The subshards themselves are not agentic, and that’s why (in context) I substitute them in for “circuits”.
It’s explained in this post that I linked to. Though I guess in context I do say “prioritize” in a way that might be confusing. Shard Theory argues against homunculist accounts of cognition by considering the mechanistic effects of reinforcement processes. Also, the subshards are not implementing an instrumental drive in the sense of “implementing the power-seeking behavior demanded by some broader consequentialist plan”; they’re just seeking power, just ’cuz.
I literally do not understand what the internal cognition is supposed to look like for an inner-aligned agent. Most of what I’ve read has been vague, on the level of “an inner-aligned agent cares about optimizing the outer objective.”
Charles Foster comments:
“We are attempting to mechanistically explain how an agent makes decisions. One proposed reduction is that inside the agent, there is an even smaller inner agent that interacts with a non-agential evaluative submodule to make decisions for the outer agent. But that raises the immediate questions of ‘How does the inner agent make its decisions about how to interact with the evaluative submodule?’ and then ‘At some point, there’s gotta be some non-agential causal structure that is responsible for actually implementing decision-making, right?’ and then ‘Can we just explain the original agent’s behavior in those terms? What is positing an externalized evaluative submodule buying us?’”
Perhaps my emphasis on mechanistic reasoning and my unusual level of precision in my speculation about AI internals make people realize how complicated realistic cognition is in the shard picture. Perhaps people realize how much might have to go right, how many algorithmic details may need to be etched into a network so that it does what we want and generalizes well.
That’s a very thoughtful response from TurnTrout. I wonder if @Gwern agrees with its main points. If not, it would be good to know where he thinks it fails.
I did not understand his response at all, and it sounds like I would have to reread a bunch of TurnTrout posts before any further comment would be more than talking past each other, so I don’t have anything useful to say. Maybe someone else can re-explain his point better and why I am apparently wrong.
From my early post: Inner and Outer Alignment Decompose One Hard Problem Into Two Extremely Hard Problems