I think there are roughly three pieces of relevant information here, of which my previous answer only addressed the first.
The second one is, what’s up with mesaoptimizers? Why should we expect an AI to have mesaoptimizers, and why might they end up misaligned?
In order to understand why we would expect mesaoptimizers, we should maybe start by considering how AI training usually works. We usually use an outer optimizer—gradient descent—to train some neural network that we want to apply to some task we have. However, per the argument I made in the other comment thread, when we want to achieve something difficult, we’re likely going to have the neural network itself do some sort of search or optimization. (Though see “What is general-purpose search, and why might we expect to see it in ML systems?” for more info.)
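To make that setup concrete, here is a minimal sketch, assuming PyTorch is available; everything in it (the network shape, the data, the task) is made up for illustration and doesn’t refer to any real system. The point is just the shape of the usual pipeline: humans collect data for a task they already know how to do, and an outer optimizer nudges the network’s weights to match it.

```python
# Minimal sketch of the usual setup: an outer optimizer (gradient descent)
# adjusts a network's weights so its outputs match data we collected.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
outer_optimizer = torch.optim.SGD(net.parameters(), lr=0.05)

# Stand-in for "data collected by people who already know how to do the task".
xs = torch.randn(256, 4)
ys = xs.sum(dim=1, keepdim=True)   # the behaviour we want the net to imitate

for _ in range(200):
    loss = ((net(xs) - ys) ** 2).mean()   # how far the net is from the demonstrated behaviour
    outer_optimizer.zero_grad()
    loss.backward()
    outer_optimizer.step()                # the outer optimizer nudges the weights
```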
One way to see the above: with simple neural networks, the neural network itself “is not the AI” in some metaphorical sense. It can’t learn things on its own, pursue goals, etc. Rather, the entire system of {engineers and other workers who collect the data, write the code, and tune the hyperparameters; datacenters that train the network; the neural network itself} is the intelligence, and it’s not exactly entirely artificial, since it contains a lot of natural intelligence too. This is expensive! And it only really works for problems we already know how to solve, since the training data has to come from somewhere! And it’s not retargetable: you have to start over if you have some new task that needs solving, which makes it even more expensive! It’s obviously possible to make intelligences that are more autonomous (humans are an existence proof), people are going to attempt to do so since it’s enormously economically valuable (unless it kills us all), and those intelligences would probably have a big internal consequentialist aspect to them, because that is what allows them to achieve things.
So, if we have a neural network or something which is a consequentialist optimizer, and that neural network was constructed by gradient descent, which itself is also an optimizer, then by definition that makes the neural network a mesaoptimizer (since mesaoptimizers by definition are optimizers constructed by other optimizers). So in a sense we “want” to produce mesaoptimizers.
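Here is a toy sketch of that base-optimizer / mesa-optimizer relationship, again assuming PyTorch and with all names and numbers invented for illustration. The outer loop is ordinary gradient descent over parameters W; the model’s “forward pass” is itself a small optimizer, running a few gradient steps on an internal objective that W defines. The outer optimizer never specifies the inner search directly; it only shapes it by adjusting W.

```python
# Toy base-optimizer / mesa-optimizer sketch: gradient descent (outer) over the
# parameters of a model whose forward pass is itself a small inner search.
import torch

torch.manual_seed(0)
A_true = torch.randn(3, 3)                       # the behaviour we actually want: a = A_true @ x

W = torch.randn(3, 3, requires_grad=True)        # mesa-parameters: they define the inner objective
outer_opt = torch.optim.SGD([W], lr=0.05)        # the base (outer) optimizer

def mesa_forward(x, inner_steps=5, inner_lr=0.3):
    """The model 'acts' by searching: a few gradient steps on its own internal
    objective J(a) = ||a - W x||^2, starting from a = 0."""
    a = torch.zeros(3, requires_grad=True)
    for _ in range(inner_steps):
        inner_loss = ((a - W @ x) ** 2).sum()
        (grad_a,) = torch.autograd.grad(inner_loss, a, create_graph=True)
        a = a - inner_lr * grad_a                # the inner optimizer's update
    return a

for step in range(300):
    x = torch.randn(3)
    target = A_true @ x
    action = mesa_forward(x)
    outer_loss = ((action - target) ** 2).sum()  # the loss *we* wrote down
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()                             # gradient descent over the mesa-parameters

print(f"outer loss after training: {outer_loss.item():.4f}")   # typically close to 0
```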
But the issue is, gradient descent is a really crude way of producing those mesaoptimizers. The current methods basically work by throwing the mesaoptimizer into some situation where we think we know what it should do, and then adjusting it so that it takes the actions we think it should take. So far, this leaves them very capability-limited, as they don’t do general optimization well, but capabilities researchers are aiming to fix that, and they have many plausible methods for improving them. So at some point we may have a mesaoptimizer that was constructed through a bunch of examples of good and bad stuff, rather than through a careful definition of what we want it to do. And we might be worried that the process of “taking our definition of what we want → producing examples that do or do not align with that definition → stuffing those examples into the mesaoptimizer” goes wrong in such a way that the AI doesn’t follow our definition of what we want, but instead does something else—that’s the inner alignment problem. (Meanwhile, the “take what we want → define it” step is the outer alignment problem.)
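To make that worry a bit more tangible, here is a toy sketch (assuming PyTorch; the “coin” setup is made up, and the policy is just a linear layer, not a real mesaoptimizer). We define what we want (“go toward the coin”), generate training examples from that definition, and fit a policy to them. Because the coin always happens to sit on the right during training, the examples never distinguish “go to the coin” from “go right”, and the learned policy can end up implementing the latter:

```python
# Toy illustration of the definition -> examples -> training gap: the examples
# underdetermine the intended goal, and the learned policy picks a proxy.
import torch

torch.manual_seed(0)

def make_episode(coin_side):
    # Features the policy sees: [coin position, level-exit position].
    # The exit is always on the right; the *intended* action is to move toward the coin.
    x = torch.tensor([coin_side, 1.0])
    intended_action = 1.0 if coin_side > 0 else 0.0   # 1 = "move right", 0 = "move left"
    return x, torch.tensor(intended_action)

policy = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(policy.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()

# Training distribution: the coin is always on the right, so "go to the coin"
# and "go right" produce identical examples.
for _ in range(2000):
    x, y = make_episode(coin_side=1.0)
    loss = loss_fn(policy(x).squeeze(), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Deployment: the coin is now on the left.
x_test, _ = make_episode(coin_side=-1.0)
p_right = torch.sigmoid(policy(x_test)).item()
print(f"P(move right | coin on the left) = {p_right:.2f}")   # usually well above 0.5: it keeps going right
```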
So that was the second piece. Now the third piece of information: it seems to me that a lot of people thinking about mesaoptimizers are not thinking about the “practical” case above, but instead about more confused or hypothetical cases, where you end up with a mesaoptimizer almost no matter what. I’m probably not the right person to defend that perspective, since those arguments often seem confused to me, but here’s an attempt at a steelman:
Mesaoptimizers aren’t just a thing you’re explicitly trying to make when you train advanced agents. They also show up automatically when you try to predict a system that itself contains agents, because those agents have to be predicted too. For instance, with language models you’re trying to predict text, but that text was written by people who were trying to accomplish something when writing it, so a good language model will contain at least an approximate representation of those goals.
In theory, language models are just predictive models. But as we’ve learned, if you prompt them right, you can activate one of those representations of human goals and thereby have them solve some problems for you. So even predictive models can end up acting as optimizers when the environment they model is sophisticated enough, and we need to be aware of that and consider questions like whether those optimizers are aligned and what that means for safety.