I would question the framing of mental subagents as "mesa-optimizers" here. It sneaks in an important assumption: namely, that they are optimizing anything at all. I think the general view that humans are made of a bunch of different subsystems which use common symbols to talk to one another has some merit, but this post ascribes a lot more agency to those subsystems than I would. I view most of the subagents of human minds as mechanistically fairly simple.
I actually like "mesa-optimizer" because it implies less agency than "subagent". A mesa-optimizer in AI or evolution is a thing created to implement a value of its meta-optimizer, and the alignment problem is precisely that a mesa-optimizer isn't necessarily smart enough to actually optimize anything, and especially not the thing it was created for. It's an adaptation-executor rather than a fitness-maximizer, whereas "subagent" implies (at least to me) something that has its own agency, with goals it actively pursues.