(content warning: long)
[...] let you pick and choose where to poke at my model.
You’ll forgive me if I end up writing multiple separate responses then.
TL;DR: Chess engines are non-general optimizers, but they’re not mesa-optimizers; and the fact that you could only come up with an example of the former and not the latter is not a coincidence but a reflection of a deeper truth. Of course, this previous statement could be falsified by providing an example of a non-general mesa-optimizer, and a good argument as to why it should be regarded as a mesa-optimizer.
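(For concreteness about what “optimizer” means here: below is a minimal, toy minimax sketch, purely illustrative and not any real engine’s code, with a made-up game tree and made-up leaf scores. The relevant feature is that the program explicitly searches a space of continuations and scores them against an objective; and since that search procedure is hand-written rather than found by an outer training process, it is an optimizer but not a mesa-optimizer.)

```python
# A chess-style engine is an optimizer in the plain sense: it explicitly
# searches a space of continuations and scores them against an objective
# (the evaluation function). It is not a mesa-optimizer, because the
# search procedure is written by hand rather than found by some outer
# (base) optimization process.
# Toy stand-in for a game tree: internal nodes are lists of children,
# leaves are made-up evaluation scores.

def minimax(node, maximizing=True):
    if not isinstance(node, list):   # leaf: just return the objective value
        return node
    child_values = [minimax(child, not maximizing) for child in node]
    # The optimization step: pick the best continuation for the side to move.
    return max(child_values) if maximizing else min(child_values)

toy_tree = [[3, 5], [2, [9, 1]], [0, -4]]
print(minimax(toy_tree))  # -> 3
```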
It’s not that I couldn’t come up with examples, but more like I didn’t have time to write a longer comment just then. Are these not examples? What about godshatter?
The terms I first enumerated have specific meaning not coined by you or me, and I am trying to use them in the standard way. Now, it’s possible that I don’t understand the definitions correctly, but I think I do, and I think your definition for (at least) “mesa optimizer” is not the standard one. If you know this and just don’t like the standard definitions (because they are “not useful”), that’s fine, define your own terms, but call them something else, rather than changing them out from under me.
Specifically, I was going off the usage here. Does that match your understanding?
The quoted passage reads like you’re talking about outer misalignment (a mismatch between the system’s outer optimization target—its so-called “base objective”—and its creators’ real target), whereas I’m reasonably certain mesa-optimization is much better thought of as a type of inner misalignment (a mismatch between the system’s base objective and whatever objective it ends up representing internally, and pursuing behaviorally).
I was specifically talking about inner alignment, where the mesa objective is a proxy measure for the base objective. But I can see how Goodhart’s law could apply to outer alignment too, come to think of it: if you fail to specify your real goal and instead specify a proxy.
It’s not that I couldn’t come up with examples, but more like I didn’t have time to write a longer comment just then. Are these not examples? What about godshatter?
I agree that “godshatter” is an example of a misaligned mesa-optimizer with respect to evolution’s base objective (inclusive genetic fitness). But note specifically that my argument was that there are no naturally occurring non-general mesa-optimizers, which category humans certainly don’t fit into. (I mean, you can look right at the passage you quoted; the phrase “non-general” is right there in the paragraph.)
In fact, I think humans’ status as general intelligences supports the argument I made, by acting as (moderately weak) evidence that naturally occurring mesa-optimizers do, in fact, exhibit high amounts of generality and agency (demonic-ness, you could say).
(If you wanted to poke at my model harder, you could ask about animals, or other organisms in general, and whether they count as mesa-optimizers. I’d argue that the answer depends on the animal, but that for many animals my answer would in fact be “no”—and even those for whom my answer is “yes” would obviously be nowhere near as powerful as humans in terms of optimization strength.)
As for the Rob Miles video: I mostly see the examples it discusses as outer alignment failures, despite the video’s title. (Remember, I did say in my previous comment that on my model, many outer alignment failures can masquerade as inner alignment failures!) To comment on the specific examples mentioned in the video:
The agent-in-the-maze examples strike me as textbook instances of outer misalignment: the reward function by itself was not sufficient to distinguish correct behavior from incorrect behavior. It’s possible to paint this instead as inner misalignment, but only by essentially asserting, flat-out, that the reward function was correct and the system simply generalized incorrectly. I confess I don’t see a strong reason to favor the latter characterization over the former, while I do see some reason for the converse.
The CoinRun example, meanwhile, makes a stronger case for being an inner alignment failure, mainly because many possible forms of outer misalignment were ruled out via interpretability: the agent was observed to assign appropriately negative values to obstacles and appropriately positive values to the coin. And while it’s still possible to argue that the training procedure failed to properly incentivize learning the correct objective, that is a much weaker claim, and somewhat question-begging.
And, of course, neither of these are examples of mesa-optimization in my view, because mesa-optimization is not synonymous with inner misalignment. From the original post on risks from learned optimization:
There need not always be a mesa-objective since the algorithm found by the base optimizer will not always be performing optimization. Thus, in the general case, we will refer to the model generated by the base optimizer as a learned algorithm, which may or may not be a mesa-optimizer.
And the main issue with these examples is that they occur in toy environments which are simply too… well, simple to produce algorithms usefully characterized as optimizers in their own right, outside of the extremely weak sense in which your thermostat is also an optimizer. (And, like—yes, in a certain sense it is, but that’s not a very high bar to meet; it’s not even at the level of the chess engine example you gave!)
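(For concreteness about how weak that sense is, here is roughly the entire control loop I’m granting “optimizer” status to, acting on a toy simulated room with made-up dynamics. Note what’s missing: there is no search over alternative actions and no scoring of candidate futures against an objective, which is exactly what a chess engine’s search does have.)

```python
# A bang-bang thermostat acting on a toy simulated room (made-up dynamics).
# It nudges the temperature toward a setpoint, but it never searches over
# alternative actions or scores candidate futures against an objective;
# that absence of explicit search is why it only clears the weakest
# possible bar for "optimizer".

def thermostat_step(temp, heater_on, setpoint=20.0, hysteresis=0.5):
    """One comparison against the setpoint; returns the new heater state."""
    if temp < setpoint - hysteresis:
        return True        # too cold: turn the heater on
    if temp > setpoint + hysteresis:
        return False       # too warm: turn the heater off
    return heater_on       # inside the dead band: leave it alone

temp, heater_on = 15.0, False
for _ in range(50):
    heater_on = thermostat_step(temp, heater_on)
    temp += 0.4 if heater_on else -0.2   # crude heating/cooling model
# temp ends up oscillating in a narrow band around the 20.0 setpoint.
```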
The terms I first enumerated have specific meaning not coined by you or me, and I am trying to use them in the standard way. Now, it’s possible that I don’t understand the definitions correctly, but I think I do, and I think your definition for (at least) “mesa optimizer” is not the standard one. If you know this and just don’t like the standard definitions (because they are “not useful”), that’s fine, define your own terms, but call them something else, rather than changing them out from under me.
Specifically, I was going off the usage here. Does that match your understanding?
The usage in that video is based on the definition given by the authors of the linked post, who coined the term to begin with—which is to say, yes, I agree with it. And I already discussed above why this definition does not mean that literally any learned algorithm is a mesa-optimizer (and if it did, so much the worse for the definition)!
(Meta: I generally don’t consider it particularly useful to appeal to the origin of terms as a way to justify their use. In this specific case, it’s fine, since I don’t believe my usage conflicts with the original definition given. But even if you think I’m getting the definitions wrong, it’s more useful, from my perspective, if you explain to me why you think my usage doesn’t accord with the standard definitions. Presumably you yourself have specific reasons for thinking that the examples or arguments I give don’t sound quite right, right? If so, I’d petition you to elaborate on that directly! That seems to me like it would have a much better chance of locating our real disagreement. After all, when two people disagree, the root of that disagreement is usually significantly downstream of where it first appears—and I’ll thank you not to immediately assume that our source of disagreement is located somewhere as shallow as “one of us is misremembering/misunderstanding the definitions of terms”.)
I was specifically talking about inner alignment, where the mesa objective is a proxy measure for the base objective. But I can see how Goodhart’s law could apply to outer alignment too, come to think of it: if you fail to specify your real goal and instead specify a proxy.
This doesn’t sound right to me? To refer back to your quoted statement:
I see mesa optimization as a generalization of Goodhart’s Law. Any time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
I’ve bolded [what seem to me to be] the operative parts of that statement. I can easily see a way to map this description onto a description of outer alignment failure:
the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.
where the mapping in question goes: proxy → base objective, real target → intended goal. Conversely, I don’t see an equally obvious mapping from that description to a description of inner misalignment, because that (as I described in my previous comment) is a mismatch between the system’s base and mesa-objectives (the latter of which it ends up behaviorally optimizing).
I’d appreciate it if you could explain to me what exactly you’re seeing here that I’m not, because at present, my best guess is that you’re not familiar with these terms (which I acknowledge isn’t a good guess, for basically the reasons I laid out in my “Meta:” note earlier).
Yeah, I don’t think that interpretation is what I was trying to get across. I’ll try to clean it up to clarify:
I see [the] mesa optimization [problem (i.e. inner alignment)] as a generalization of Goodhart’s Law[, which is that a]ny time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
Not helping? I did not mean to imply that a mesa optimizer is necessarily misaligned or learns the wrong goal; it’s just hard to ensure that it learns the base one.
Goodhart’s law is usually stated as “When a measure becomes a target, it ceases to be a good measure”, which I would interpret more succinctly as “proxies get gamed”.
More concretely, from the Wikipedia article,
For example, if an employee is rewarded by the number of cars sold each month, they will try to sell more cars, even at a loss.
Then the analogy would go like this. The desired target (base goal) was “profits”, but the proxy chosen to measure that goal was “number of cars sold”. Under normal conditions, this works: the proxy points in the direction of the target, which is why it’s a proxy in the first place. But if you optimize the proxy too hard, you blow past the base goal and hit the proxy itself instead. The outer system (optimizer) is the company, which is trying to optimize the employees. The inner system (optimizer) is the employee, who tries to maximize their own reward. The employee “learned” the wrong (mesa) goal, “sell as many cars as possible (at any cost)”, which is not aligned with the base goal of “profits”.
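(To put toy numbers on that analogy, all made up: with a simple made-up demand curve, pushing harder on the proxy “cars sold” keeps improving the proxy while the real target “profit” peaks and then goes negative.)

```python
# Toy numbers for the car-salesman analogy (all values made up): deeper
# discounts keep improving the proxy ("cars sold") while the real target
# ("profit") is overshot and eventually goes negative.

COST_PER_CAR = 20_000   # hypothetical dealer cost per car
BASE_PRICE = 25_000     # hypothetical sticker price

def cars_sold(discount):
    # Made-up demand curve: every 500 off the price sells one more car.
    return 10 + discount // 500

def profit(discount):
    return cars_sold(discount) * (BASE_PRICE - discount - COST_PER_CAR)

for discount in range(0, 10_001, 2_500):
    print(discount, cars_sold(discount), profit(discount))

# Output (discount, cars sold, profit):
# 0 10 50000
# 2500 15 37500
# 5000 20 0
# 7500 25 -62500
# 10000 30 -150000
```

The employee who is rewarded on the first column rather than the last will happily keep moving down the table, which is the whole failure mode.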
And the main issue with these examples is that they occur in toy environments which are simply too… well, simple to produce algorithms usefully characterized as optimizers in their own right, outside of the extremely weak sense in which your thermostat is also an optimizer. (And, like—yes, in a certain sense it is, but that’s not a very high bar to meet; it’s not even at the level of the chess engine example you gave!)
Rob Miles specifically called out a thermostat as an example of not just an optimizer, but an agent in another video.