Great post. Thanks for writing this — it feels quite clarifying. I’m finding the diagram especially helpful in resolving the sources of my confusion.
I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.
This may be a fundamental confusion on my part — but I don’t see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system.
Zooming in on the “inner alignment → objective robustness” part of the diagram, I think what’s actually going on is something like:
A human AI researcher wishes to optimize for some base objective, L.
It would take too much work for our researcher to optimize for L manually. So our researcher builds an agent to do the work instead, and sets L to be the agent’s loss function.
Depending on how it’s built, the agent could end up optimizing for L, or it could end up optimizing for something different. The thing the agent ends up truly optimizing for is the agent’s behavioral objective — let’s call it L′. If L′ is aligned with L, then the agent satisfies objective robustness by the above definition: its behavioral objective is aligned with the base. So far, so good.
But here’s the key point: from the point of view of the human researcher who built the agent, the agent is actually a mesa-optimizer, and the agent’s “behavioral objective” is really just the mesa-objective of that mesa-optimizer.
And now we’ve got an agent that wishes to optimize for some mesa-objective L′ (its “behavioral objective” by the above definition).
And then our agent builds a sub-agent to do the work instead, and sets L′ to be the sub-agent’s loss function.
I’m sure you can see where I’m going with this by now: the sub-agent the agent builds will have its own objective, L′′, which may or may not be aligned with L′, which in turn may or may not be aligned with L. From the point of view of the agent, that sub-agent is a mesa-optimizer. But from the point of view of the researcher, it’s actually a “mesa-mesa-optimizer”.
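To make the structure I have in mind concrete, here’s a minimal sketch in Python. The names (`Optimizer`, `aligned`, `end_to_end_aligned`) are my own illustrative choices, not anything from the post, and the `aligned` check is just a stand-in: the point is only that alignment has to hold at every link of the chain for the bottom optimizer’s objective to end up aligned with the researcher’s.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Optimizer:
    name: str
    objective: str                       # the objective this optimizer actually pursues
    child: Optional["Optimizer"] = None  # the optimizer it builds or gives rise to, if any


def aligned(parent_objective: str, child_objective: str) -> bool:
    """Stand-in for 'the child's objective is aligned with the parent's'.
    Here it's just label equality; in reality this check is the hard part."""
    return parent_objective == child_objective


def end_to_end_aligned(top: Optimizer) -> bool:
    """Alignment of the whole chain requires alignment at every link."""
    node = top
    while node.child is not None:
        if not aligned(node.objective, node.child.objective):
            return False
        node = node.child
    return True


# Researcher (base objective L) -> agent (behavioral objective L') -> sub-agent (L'').
sub_agent = Optimizer("sub-agent", "L''")
agent = Optimizer("agent", "L'", child=sub_agent)
researcher = Optimizer("researcher", "L", child=agent)

print(end_to_end_aligned(researcher))  # False unless L, L', and L'' all coincide
```

Of course, this just restates the structure; all of the actual difficulty lives inside the `aligned` predicate.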
That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, “intent alignment”, as defined here, is what I’d call “inner alignment between the researcher and the agent”; and “inner alignment”, as defined here, is what I’d call “inner alignment between the agent and the mesa-optimizer it may give rise to”.
In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we’d use to analyze any other optimizer. (I do, for what it’s worth, make this point in my earlier post — though perhaps not clearly enough.)
Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that’s a promising sign that it may be pointing to something fundamental.
A final caveat: there may well be a big conceptual piece I’m missing here, or a deep confusion around one or more of these concepts that I’m not yet aware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes. I would enormously appreciate any corrections!
I agree that what you’re describing is a valid way of looking at what’s going on; it’s just not the way I think about it. I don’t find it very helpful to think of a model as a subagent of gradient descent, since gradient descent isn’t itself an agent in any meaningful sense, nor can it really be understood as “trying” to do anything in particular.
Sure, makes sense! Though to be clear, I believe what I’m describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.