Thanks for responding! I think there are some key conceptual differences between us that need to be worked out/clarified—so here goes nothing:
(content warning: long)
I think your concept of “demons” is pointing to something useful, but I also think that definition is more specific than the meaning of “mesa optimizer”. A chess engine is an optimizer, but it’s not general. Optimizers need not be general; therefore, they need not be demons, and I think we have examples of such mesa optimizers already (they’re not hypothetical), even if no-one has managed to summon a demon yet.
The main thing that concerns me about mesa-optimizers (and hence the reason I think the attendant concept is useful) is that their presence is likely to lead to a treacherous turn—what I referred to, within the demon metaphor, as “possession”, because from the outside it really does look like your nice little system that’s chugging along, helpfully trying to optimize the thing you wanted it to optimize, just suddenly gets Taken Over From Within by a strange and inscrutable (and malevolent) entity.
On this view, I don’t see it as particularly useful to weaken the term to encompass other types of optimization. This is essentially the point I was trying to make in the parenthetical remark included directly after the sentences you quoted:
(You might feel the need to talk about “fully-fledged” versus “nascent” demons here, to add some nuance—but I actually think the concept is mostly useful if we limit ourselves to talking about the strongest possible version of it, since that’s the version we’re actually worried about getting killed by.)
Of course, other types of optimizers do exist, and can be non-general, e.g. I fully accept your chess engine example as a valid type of optimizer. But my model is that these kinds of optimizers are (as a consequence of their non-generality) brittle: they spring into existence fully formed (because they were hardcoded by other, more general intelligences—in this case humans), and there is no incremental path to a chess engine that results from taking a non-chess engine and mutating it repeatedly according to some performance rule. Nor, for that matter, is there an incremental path continuing onward from a (classical) chess engine, through which it might mutate into something better, like AlphaZero.
(Aside: note that AlphaZero itself is not something I view as any kind of “general” system; you could argue it’s more general than a classical chess engine, but only if you view generality as a varying quantity, rather than as a binary—and I’ve already expressed that I’m not hugely fond of that view. But more on that later.)
In any case, hopefully I’ve managed to convey a sense in which these systems (and the things and ways they optimize) can be viewed as islands in the design space of possible architectures. And this is important in my view, because what this means is that you should not (by default) expect naturally arising mesa-optimizers to resemble these “non-general” optimizers. I expect any natural category of mesa-optimizers—that is to say, a category with its boundaries drawn to cleave at the joints of reality—to essentially look like it contains a bunch of demons, and excludes everything else.
TL;DR: Chess engines are non-general optimizers, but they’re not mesa-optimizers; and the fact that you could only come up with an example of the former and not the latter is not a coincidence but a reflection of a deeper truth. Of course, this previous statement could be falsified by providing an example of a non-general mesa-optimizer, and a good argument as to why it should be regarded as a mesa-optimizer.
That segues fairly nicely into the next (related) point, which is, essentially: what is a mesa-optimizer? Let’s look at what you have to say about it:
I see mesa optimization as a generalization of Goodhart’s Law. Any time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
It shouldn’t come as a huge surprise at this point, but I don’t view this as a very useful way to draw the boundary. I’m not even sure it’s correct, for that matter—the quoted passage reads like you’re talking about outer misalignment (a mismatch between the system’s outer optimization target—its so-called “base objective”—and its creators’ real target), whereas I’m reasonably certain mesa-optimization is much better thought of as a type of inner misalignment (a mismatch between the system’s base objective and whatever objective it ends up representing internally, and pursuing behaviorally).
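(For concreteness, here is the rough schematic I keep in my head for which gap is which; the wording of the three objectives below is mine, and purely illustrative.)

```python
# A rough schematic of the three objectives in play; the wording is mine, purely illustrative.

intended_goal  = "what the programmers actually want the system to do"
base_objective = "the loss/reward signal the outer training process explicitly optimizes"
mesa_objective = "whatever objective the trained system ends up internally representing and pursuing"

# Outer misalignment: a gap between intended_goal and base_objective
#   (the reward/loss spec fails to capture the real target).
# Inner misalignment: a gap between base_objective and mesa_objective
#   (the learned objective diverges from the spec it was trained under).
```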
Given this, it’s plausible to me that when you earlier say you “think we have examples of such mesa optimizers already”, you’re referring specifically to modes of misbehavior I’d class as outer alignment failures (and hence not mesa-optimization at all). But that’s just speculation on my part, and on the whole this seems like a topic that would benefit more from being double-clicked and expanded than from my continuing to speculate on what exactly you might have meant.
In any case, my preferred take on what a mesa-optimizer “really” is would be something like: a system should be considered to contain a mesa-optimizer in precisely those cases where modeling it as consisting of a second optimizer with a different objective buys you more explanatory power than modeling it as a single optimizer whose objective happens to not be the one you wanted. Or, in more evocative terms: mesa-optimization shows up whenever the system gets possessed by—you guessed it—a demon.
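(To make that distinction concrete, here is a toy sketch, purely illustrative and with made-up function names, of the two shapes of learned system I have in mind: one that merely implements a policy, and one that contains its own search process aimed at an internally represented objective. Only the second is the kind of thing I’d call a mesa-optimizer.)

```python
# Toy sketch, not any real system: two shapes a learned artifact can take.

def lookup_policy(observation, learned_table):
    """No inner optimizer: behavior is just a learned mapping from inputs to outputs."""
    return learned_table[observation]

def searching_policy(observation, world_model, internal_objective, candidate_actions):
    """Contains an inner optimizer: the artifact itself searches over actions,
    scoring them against an objective it represents internally, which need not
    be the base objective it was trained under."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_outcome = world_model(observation, action)
        score = internal_objective(predicted_outcome)  # the inner goal, not necessarily yours
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```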
And in this frame, I think it’s fair to say that we don’t have any real examples of mesa-optimization. We might have some examples of outer alignment failure, and perhaps even some examples of inner alignment failure (though I’d be wary on that front; many of those are actually outer alignment failures in disguise). But we certainly don’t have any examples of behavior where it makes sense to say, “See! You’ve got a second optimizer inside of your first one, and it’s doing stuff on its own!” which is what it would take to satisfy the definition I gave.
(And yes, going by this definition, I think it’s plausible that we won’t see “real” mesa-optimizers until very close to The End. And yes, this is very bad news if true, since it means that it’s going to be very difficult to experiment safely with “toy” mesa-optimizers, and come away with any real insight. I never said my model didn’t have bleak implications!)
Lastly:
I think there are degrees of generality, rather than absolute categories. They’re fuzzy sets. Deep blue can only play chess. It can’t even do an “easier” task like run your thermostat. It’s very narrow. AlphaZero can learn to play chess or go or shougi. More domains, more general. GPT-3 can also play chess, you just have to frame it as a text-completion task, using e.g. Portable Game Notation. The human language domain is general enough to play chess, even if it’s not as good at chess as AlphaZero is. More domains, or broader domains containing more subdomains, means more generality.
I agree that this is a way to think about generality—but as with mesa-optimization, I disagree that it’s a good way to think about generality. The problem here is that you’re looking at the systems in terms of their outer behavior—“how many separate domains does this system appear to be able to navigate?”, “how broad a range does its behavior seem to span?”, etc.—when what matters, on my model, is the internal structure of the systems in question.
(I mean, yes, what ultimately matters is the outer behavior; you care if a system wakes up and kills you. But understanding the internal structure constrains our expectations about the system’s outer behavior, in a way that simply counting the number of disparate “domains” it has under its belt doesn’t.)
As an objection, I think this basically rhymes with my earlier objection about mesa-optimizers, and when it’s most useful to model a system as containing one. You might notice that the definition I gave also seems to hang pretty heavily on the system’s internals—not completely, since I was careful to say things like “is usefully modeled as” and “buys you more explanatory power”—but overall, it seems like a definition oriented towards an internal classification of whether mesa-optimizers (“demons”) are present, rather than a cruder external metric (“it’s not doing the thing we told it to do!”).
And so, my preferred framework under which to think about generality (under which none of our current systems, including—again—AlphaZero, which I mentioned way earlier in this comment, count as truly “general”) is basically what I sketched out in my previous reply:
Narrow intelligences, on the other hand, don’t qualify as demons; they’re something else entirely, a different species, more akin to plants or insects than to more complicated agents. They might be highly capable in a specific domain or set of domains, but they achieve this through specialization rather than through “intelligence” in any real sense, much in the same way that a fruit fly is specialized for being very good at avoiding your fly-swatter.
AlphaZero, as a system, contains a whole lot of specialized architecture for two-player games with a discretized state space and action space. It turns out that multiple board games fall under this categorization, making it a larger category than, like, “just chess” or “just shogi” or something. But that’s just a consequence of the size of the category; the algorithm itself is still specialized, and consequently (this is the crucial part, from my perspective) forms an island in design space.
I referenced this earlier, and I think it’s relevant here as well: there’s no continuous path in design space from AlphaZero to GPT-[anything]; nor was there an incremental design path from Stockfish 8 to AlphaZero. They’re different systems, each of which was individually designed and implemented by very smart people. But the seeming increase in “generality” of these systems is not due to any internal “progression” of the kind that might be found in e.g. a truly general system undergoing takeoff; instead, it’s a progression of discoveries by those very smart people: impressive, but not fundamentally different in kind from the progression of (say) heavier-than-air flight, which also consisted of a series of disparate but improving designs, none of which were connected to each other via incremental paths in design space.
(Here, I’m grouping together designs into “families”, where two designs that are basically variants of each other in size are considered the same design. I think that’s fair, since this is the case with the various GPT models as well.)
And this matters because (on my model) the danger from AGI that I see does not come from this kind of progression of design. If we were somehow assured that all further progress in AI would continue to look like this kind of progress, that would massively drop my P(doom) estimates (to, like, <0.01 levels). The reason AGI is different, the reason it constitutes (on my view) an existential risk, is precisely because artificial general intelligence is different from artificial narrow intelligence—not just in degree, but in kind.
(Lots to double-click on here, but this is getting stupidly long even for a LW comment, so I’m going to stop indulging the urge to preemptively double-click and expand everything for you, since that’s flatly impossible, and let you pick and choose where to poke at my model. Hope this helps!)
(content warning: long)
[...] let you pick and choose where to poke at my model.
You’ll forgive me if I end up writing multiple separate responses then.
TL;DR: Chess engines are non-general optimizers, but they’re not mesa-optimizers; and the fact that you could only come up with an example of the former and not the latter is not a coincidence but a reflection of a deeper truth. Of course, this previous statement could be falsified by providing an example of a non-general mesa-optimizer, and a good argument as to why it should be regarded as a mesa-optimizer.
It’s not that I couldn’t come up with examples, but more like I didn’t have time to write a longer comment just then. Are these not examples? What about godshatter?
The terms I first enumerated have specific meaning not coined by you or me, and I am trying to use them in the standard way. Now, it’s possible that I don’t understand the definitions correctly, but I think I do, and I think your definition for (at least) “mesa optimizer” is not the standard one. If you know this and just don’t like the standard definitions (because they are “not useful”), that’s fine, define your own terms, but call them something else, rather than changing them out from under me.
Specifically, I was going off the usage here. Does that match your understanding?
quoted passage reads like you’re talking about outer misalignment (a mismatch between the system’s outer optimization target—its so-called “base objective”—and its creators’ real target), whereas I’m reasonably certain mesa-optimization is much better thought of as a type of inner misalignment (a mismatch between the system’s base objective and whatever objective it ends up representing internally, and pursuing behaviorally).
I was specifically talking about inner alignment, where the mesa objective is a proxy measure for the base objective. But I can see how Goodhart’s law could apply to outer alignment too, come to think of it: if you fail to specify your real goal and instead specify a proxy.
It’s not that I couldn’t come up with examples, but more like I didn’t have time to write a longer comment just then. Are these not examples? What about godshatter?
I agree that “godshatter” is an example of a misaligned mesa-optimizer with respect to evolution’s base objective (inclusive genetic fitness). But note specifically that my argument was that there are no naturally occurring non-general mesa-optimizers, which category humans certainly don’t fit into. (I mean, you can look right at the passage you quoted; the phrase “non-general” is right there in the paragraph.)
In fact, I think humans’ status as general intelligences supports the argument I made, by acting as (moderately weak) evidence that naturally occurring mesa-optimizers do, in fact, exhibit high amounts of generality and agency (demonic-ness, you could say).
(If you wanted to poke at my model harder, you could ask about animals, or other organisms in general, and whether they count as mesa-optimizers. I’d argue that the answer depends on the animal, but that for many animals my answer would in fact be “no”—and even those for whom my answer is “yes” would obviously be nowhere near as powerful as humans in terms of optimization strength.)
As for the Rob Miles video: I mostly see those as outer alignment failures, despite the video name. (Remember, I did say in my previous comment that on my model, many outer alignment failures can masquerade as inner alignment failures!) To comment on the specific examples mentioned in the video:
The agent-in-the-maze examples strike me as a textbook instance of outer misalignment: the reward function by itself was not sufficient to distinguish correct behavior from incorrect behavior. It’s possible to paint this instead as inner misalignment, but only by essentially asserting, flat-out, that the reward function was correct, and the system simply generalized incorrectly. I confess I don’t really see strong reason to favor the latter characterization over the former, while I do see some reason for the converse.
The coin run example, meanwhile, makes a stronger case for being an inner alignment failure, mainly because of the fact that many possible forms of outer misalignment were ruled out via interpretability. The agent was observed to assign appropriately negative values to obstacles, and appropriately positive values to the coin. And while it’s still possible to make the argument that the training procedure failed to properly incentivize learning the correct objective, this is a much weaker claim, and somewhat question-begging.
And, of course, neither of these are examples of mesa-optimization in my view, because mesa-optimization is not synonymous with inner misalignment. From the original post on risks from learned optimization:
There need not always be a mesa-objective since the algorithm found by the base optimizer will not always be performing optimization. Thus, in the general case, we will refer to the model generated by the base optimizer as a learned algorithm, which may or may not be a mesa-optimizer.
And the main issue with these examples is that they occur in toy environments which are simply too… well, simple to produce algorithms usefully characterized as optimizers in their own right, outside of the extremely weak sense in which your thermostat is also an optimizer. (And, like—yes, in a certain sense it is, but that’s not a very high bar to meet; it’s not even at the level of the chess engine example you gave!)
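(For what it’s worth, here is roughly the weak sense I mean, sketched as code; the names are mine and purely illustrative. Note that there is no search, no model, and no represented objective beyond a single setpoint.)

```python
# The "extremely weak sense" in which a thermostat optimizes: it acts so as to
# shrink the gap between a measurement and a setpoint, but there is no search,
# no model, and no represented objective beyond the number it compares against.

def thermostat_step(measured_temp, setpoint=20.0, deadband=0.5):
    if measured_temp < setpoint - deadband:
        return "heat_on"
    if measured_temp > setpoint + deadband:
        return "heat_off"
    return "no_change"
```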
The terms I first enumerated have specific meaning not coined by you or me, and I am trying to use them in the standard way. Now, it’s possible that I don’t understand the definitions correctly, but I think I do, and I think your definition for (at least) “mesa optimizer” is not the standard one. If you know this and just don’t like the standard definitions (because they are “not useful”), that’s fine, define your own terms, but call them something else, rather than changing them out from under me.
Specifically, I was going off the usage here. Does that match your understanding?
The usage in that video is based on the definition given by the authors of the linked post, who coined the term to begin with—which is to say, yes, I agree with it. And I already discussed above why this definition does not mean that literally any learned algorithm is a mesa-optimizer (and if it did, so much the worse for the definition)!
(Meta: I generally don’t consider it particularly useful to appeal to the origin of terms as a way to justify their use. In this specific case, it’s fine, since I don’t believe my usage conflicts with the original definition given. But even if you think I’m getting the definitions wrong, it’s more useful, from my perspective, if you explain to me why you think my usage doesn’t accord with the standard definitions. Presumably you yourself have specific reasons for thinking that the examples or arguments I give don’t sound quite right, right? If so, I’d petition you to elaborate on that directly! That seems to me like it would have a much better chance of locating our real disagreement. After all, when two people disagree, the root of that disagreement is usually significantly downstream of where it first appears—and I’ll thank you not to immediately assume that our source of disagreement is located somewhere as shallow as “one of us is misremembering/misunderstanding the definitions of terms”.)
I was specifically talking about inner alignment, where the mesa objective is a proxy measure for the base objective. But I can see how Goodhart’s law could apply to outer alignment too, come to think of it: if you fail to specify your real goal and instead specify a proxy.
This doesn’t sound right to me? To refer back to your quoted statement:
I see mesa optimization as a generalization of Goodhart’s Law. Any time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
I’ve bolded [what seem to me to be] the operative parts of that statement. I can easily see a way to map this description onto a description of outer alignment failure:
the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.
where the mapping in question goes: proxy → base objective, real target → intended goal. Conversely, I don’t see an equally obvious mapping from that description to a description of inner misalignment, because that (as I described in my previous comment) is a mismatch between the system’s base and mesa-objectives (the latter of which it ends up behaviorally optimizing).
I’d appreciate it if you could explain to me what exactly you’re seeing here that I’m not, because at present, my best guess is that you’re not familiar with these terms (which I acknowledge isn’t a good guess, for basically the reasons I laid out in my “Meta:” note earlier).
Yeah, I don’t think that interpretation is what I was trying to get across. I’ll try to clean it up to clarify:
I see [the] mesa optimization [problem (i.e. inner alignment)] as a generalization of Goodhart’s Law[, which is that a]ny time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
Not helping? I did not mean to imply that a mesa optimizer is necessarily misaligned or learns the wrong goal, it’s just hard to ensure that it learns the base one.
Goodhart’s law is usually stated as “When a measure becomes a target, it ceases to be a good measure”, which I would interpret more succinctly as “proxies get gamed”.
More concretely, from the Wikipedia article,
For example, if an employee is rewarded by the number of cars sold each month, they will try to sell more cars, even at a loss.
Then the analogy would go like this. The desired target (base goal) was “profits”, but the proxy chosen to measure that goal was “number of cars sold”. Under normal conditions, this would work. The proxy is in the direction of the target. That’s why it’s a proxy. But if you optimize the proxy too hard, you blow past the base goal and hit the proxy itself instead. The outer system (optimizer) is the company. It’s trying to optimize the employees. The inner system (optimizer) is the employee, which tries to maximize his own reward. The employee “learned” the wrong (mesa) goal “sell as many cars as possible (at any cost)”, which is not aligned with the base goal of “profits”.
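(Here is a minimal numerical sketch of that shape of failure; the numbers are invented, purely for illustration, under the assumption that each extra sale beyond some baseline requires a steeper discount.)

```python
# Invented numbers: each extra sale beyond a baseline requires a steeper discount,
# so past some point maximizing "cars sold" (the proxy) actively reduces "profit" (the target).

def monthly_profit(cars_sold, base_margin=2000, discount_per_extra_car=150, baseline=10):
    margin = base_margin - discount_per_extra_car * max(0, cars_sold - baseline)
    return cars_sold * margin

for cars_sold in range(5, 31, 5):
    print(cars_sold, monthly_profit(cars_sold))
# The proxy keeps climbing; the target peaks and then goes negative.
# The employee "selling at a loss" is the regime past that peak.
```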
[...] outside of the extremely weak sense in which your thermostat is also an optimizer.
Rob Miles specifically called out a thermostat as an example of not just an optimizer, but an agent, in another video.