Yeah I think this is definitely a “stance” thing.
Take the use of natural selection and humans as examples of optimization and mesa-optimization: the entire concept of “natural selection” is a human-convenient way of describing a pattern in the universe. It’s approximately an optimizer, but to get rid of that “approximately” you have to reintroduce epicycles until your model is as complicated as a model of the world again. Humans aren’t optimizers either; that’s just a human-convenient way of describing humans.
More abstractly, the entire process of recognizing a mesa-optimizer—something that models the world and makes plans—is an act of stance-taking. Or Quinean radical translation or whatever. If a cat-recognizing neural net learns an attention mechanism that models the world of cats and makes plans, it’s not going to come with little labels on the neurons saying “these are my input-output interfaces, this is my model of the world, this is my planning algorithm.” It’s going to be some inscrutable little bit of linear algebra with suspiciously competent behavior.
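To make that concrete, here’s a rough sketch of what you’d actually be looking at (a minimal illustration assuming PyTorch; the toy “cat head” below is made up for the example): a trained attention module exposes nothing but unlabeled weight tensors.

```python
# Illustrative sketch (assumes PyTorch): what a learned attention mechanism
# actually gives you to look at -- weight tensors, not a labeled plan.
import torch
import torch.nn as nn

# A toy "cat-recognizer" head: attend over patch embeddings, then classify.
attention = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
classifier = nn.Linear(32, 2)  # cat / not-cat

patches = torch.randn(1, 49, 32)           # a 7x7 grid of patch embeddings
attended, attn_weights = attention(patches, patches, patches)
logits = classifier(attended.mean(dim=1))  # pooled "cat" score

# Everything we can inspect is a tensor with a shape and a dtype; nothing says
# "this is my model of the world" or "this is my planning algorithm".
for name, param in attention.named_parameters():
    print(name, tuple(param.shape))
# prints something like:
#   in_proj_weight (96, 32)
#   in_proj_bias (96,)
#   out_proj.weight (32, 32)
#   out_proj.bias (32,)
```

Whatever “suspiciously competent behavior” the thing has lives entirely in the values of those tensors.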
Not only could this competent behavior be explained either by optimization or by some variety of “rote behavior,” but the neurons don’t care about these boundaries and can occupy a continuum of possibilities between any two central examples. And worst of all, there might be multiple different useful ways of thinking about the same neurons, some in terms of elements like “goals” and “search,” and others in terms of the elements of rote behavior.
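A toy version of that ambiguity (my own sketch, nothing from the thread): the two functions below have identical behavior on every input, but one is most naturally described in terms of a goal and a search, and the other as a memorized table.

```python
# Illustrative sketch: the same competent input-output behavior, implemented
# once as explicit search over a toy world model and once as rote lookup.

def utility(state, action):
    # Toy "world model" + "goal": prefer the action closest to state % 5.
    return -abs(action - state % 5)

def searcher(state):
    # "Optimizer" description: evaluate every option against the goal, pick the best.
    return max(range(5), key=lambda action: utility(state, action))

# "Rote behavior" description: the search was done ahead of time and baked into
# a table; at runtime there is no goal and no search, just a lookup.
POLICY_TABLE = {state: searcher(state) for state in range(100)}

def rote(state):
    return POLICY_TABLE[state]

# From the outside, the two are indistinguishable on this domain.
assert all(searcher(s) == rote(s) for s in range(100))
```

A trained network doing the same job wouldn’t have to be cleanly either one; it could bake part of the search into its weights and do the rest at runtime, which is exactly the continuum problem.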
In light of this, the problem of mesa-optimizers is not “when will this bright line be crossed?” but “when will this simple model of the AI’s behavior be predictably useful?”, even though I think the bright-line framing is the first instinct.
And pretty specifically, the intentional stance. I think Daniel Dennett did some pretty powerful clarifying work on this decades ago, which could help this debate.