Alignment Problems All the Way Down

Epistemic status: pretty exploratory. I think this is a coherent concept, but I wouldn’t be surprised if there need to be some large changes.
Edit [7/12/2023]: I think this post is pretty confused and confusing, and doesn’t really address important parts of the alignment problem. The strategy of “avoid mesa-optimizers” no longer even seems like a coherent thing to aim for, and instead just sounds like never building powerful AI. I do think this post almost gets to important problems (like robust delegation), but doesn’t really provide much that is useful. I don’t regret writing this, and I think having pretty bad ideas is the first step to having kinda good ideas.
TL;DR: A mesa-optimizer may instantiate other optimizers; these new optimizers may not be aligned with the original mesa-objective. Therefore to avoid dangers from misaligned mesa-optimizers we should avoid learned optimization entirely, rather than attempting to align mesa-optimizers.
When thinking of AI Alignment it is common to divide the questions into the “outer alignment problem” and the “inner alignment problem”. The outer alignment problem refers to the problem of telling an AI system to do what we actually want; telling a system to maximize paperclips could cause an outer alignment failure because humans do not actually want to single-mindedly maximize the number of paperclips in the universe. The inner alignment problem refers to the task of making sure an AI system actually does what we tell it to do. Even if we manage to solve the outer alignment problem, there is no guarantee that an AI system will actually optimize for the objective that we give it. This is commonly discussed in the context of mesa-optimization, where our base-optimizer (for example, gradient descent) trains our model to itself perform some kind of optimization (this model is hence called a mesa-optimizer). The inner alignment problem here is about how to ensure that the objective of this mesa-optimizer is the same as the objective of the base-optimizer.
This post discusses the possibility that a mesa-optimizer may itself create an optimizer. Similarly to the classic inner alignment problem, it is not obvious that the objective of a mesa-optimizer will be robustly inherited by any optimizers which it creates. Because of the difficulty of ensuring that every level of mesa-optimization is aligned, I think that this is an argument for dealing with the inner alignment problem by entirely avoiding mesa-optimization rather than ensuring that mesa-optimizers are aligned.
Motivation
In the inner alignment problem we have a sort of ‘nested’ system, where the inner mesa-optimizer is created by the outer base-optimizer. It seems a natural extension of this to think about further optimizers created by the mesa-optimizer; a sort of Matryoshka doll of optimizers. This also means that there is a possibility of ‘nested’ alignment problems, where we can’t guarantee that the objective of one optimizer will be robustly transferred to the other levels.
This is related to meta-learning with a mesa-optimizer, discussed under “Meta-learning” in the hard cases for Relaxed adversarial training for inner alignment:
… as even if a model’s objective is aligned, its search process might not be. Conceptually, we can think of this problem as the problem of “forwarding” the safety guarantee we have about the training process to the meta-learning process.
Levels of alignment problems
Here I will lay out how I see the levels of the alignment problems, along with some terminology which I hope will make things easier to discuss.
Level 0: Outer alignment problem
This is the problem of how we put our ‘human values’ into the base-objective for our base-optimizer. For training a neural network, the base-objective is specified in terms of a loss function, and the base-optimizer is some form of gradient descent algorithm. I’ll call alignment between the ‘human values’ and the base-objective Level 0 alignment.
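For concreteness, here is a minimal sketch of the Level 0 setup on a toy regression problem (everything here is illustrative, not a claim about any particular system): the loss function plays the role of the base-objective, and the gradient update loop is the base-optimizer. Level 0 alignment is then the question of whether this loss actually captures what we want.

```python
import numpy as np

# Toy base-objective: mean squared error on a small regression dataset.
# The base-optimizer is plain gradient descent on this loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # inputs
y = X @ np.array([1.0, -2.0, 0.5])      # targets from a "true" linear rule

w = np.zeros(3)                          # model parameters
lr = 0.01                                # learning rate

def base_objective(w):
    """Loss function: this is where 'human values' are (imperfectly) encoded."""
    return np.mean((X @ w - y) ** 2)

for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
    w -= lr * grad                           # one step of the base-optimizer

print(base_objective(w))  # close to 0: the base-optimizer did its job
```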
Level 1: Classic inner alignment problem
If our base-optimizer instantiates an optimizer (a mesa-optimizer), how do we ensure that the objective of the mesa-optimizer (the mesa-objective) is the same as the base-objective? I’ll refer to the Level 1 mesa-optimizer and mesa-objective as the mesa1-optimizer and the mesa1-objective. Alignment between the base-objective and the mesa1-objective is called Level 1 alignment.
Level 2
But what if this mesa1-optimizer itself then creates an optimizer? I’ll call the optimizer created by the mesa1-optimizer the mesa2-optimizer and its objective is the mesa2-objective. How do we ensure that the mesa2-objective is the same as the mesa1-objective? This is the Level 2 alignment problem.
Do we even want to align these objectives? Is this a useful problem to be thinking about, or should we be focusing on simply ensuring we don’t get a mesa1-optimizer in the first place?
Stories
Training neural networks
Here I imagine a big neural network (or some other machine learning model) being trained with gradient descent (the base-optimizer). Gradient descent trains the model to have low training loss (this is the base-objective). As part of training, gradient descent modifies the network to perform optimization; this means that, at runtime, a forward pass of the network implements an algorithm which performs optimization. This is our mesa1-optimizer, and it has a mesa1-objective which may not be aligned with the base-objective.
So far this is the standard mesa-optimization story. But this mesa1-optimizer may, as part of its optimization procedure (at runtime!), develop a new model to do well on the mesa1-objective. As with the classic inner alignment problem, we can’t be sure that the mesa2-optimizer will have the same objective as the mesa1-optimizer.
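To make the two levels easier to see, here is a deliberately oversimplified sketch (my own illustration; real mesa-optimizers would not be this explicit): the outer loop of gradient descent adjusts the weights, while the learned forward pass itself runs a small search over candidate actions, scored by an internal value function. That internal scoring rule plays the role of the mesa1-objective, and nothing forces it to equal the training loss.

```python
import numpy as np

def forward_pass(weights, observation, candidate_actions):
    """A learned policy whose forward pass itself performs optimization:
    it scores each candidate action with an internal value function
    (the mesa1-objective, implicitly encoded in `weights`) and picks the best."""
    def internal_value(action):
        # Whatever scoring rule training happened to produce; this is the
        # mesa1-objective, and it need not match the base-objective (the loss).
        return float(weights @ np.concatenate([observation, action]))
    # Inner optimization, executed at runtime on every forward pass.
    return max(candidate_actions, key=internal_value)

# The base-optimizer (gradient descent) would tune `weights` to get low training
# loss on the task; the search above is just part of the learned computation.
weights = np.ones(4)
observation = np.array([0.5, -0.2])
candidate_actions = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(forward_pass(weights, observation, candidate_actions))
```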
| | Optimizer | Objective | Neural network scenario optimizer | Neural network scenario objective |
|---|---|---|---|---|
| Level 0 | Base-optimizer | Base-objective | Gradient descent | Low loss |
| Level 1 | Mesa1-optimizer | Mesa1-objective | Mesa-optimizer instantiated at runtime | Good performance on the mesa-objective |
| Level 2 | Mesa2-optimizer | Mesa2-objective | Optimizer designed by the mesa-optimizer | Good performance on its mesa-objective |
Solving a maze
In one of the standard examples of inner misalignment we train an RL agent to solve mazes, so the base-objective is for the agent to reach the exit of the maze. But if, during training, the exit is always in the same position in the maze (for example at position (10, 10)), then the agent may learn to optimize for reaching this position instead; this is the mesa1-objective.
This mesa1-optimizer, in the process of attempting to optimize for getting to position (10, 10), may implement (at runtime) a new algorithm to find the shortest route to this position. This new algorithm may be related to the mesa1-objective, but not exactly the same; for example attempting to maximize the distance from the origin rather than minimize the distance to position (10, 10).
The different objectives can be expressed mathematically as:

Base-objective: minimize $\|X_{\text{agent}} - X_{\text{exit}}\|^2$
Mesa1-objective: minimize $\|X_{\text{agent}} - (10, 10)\|^2$
Mesa2-objective: maximize $\|X_{\text{agent}}\|^2$
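As a quick numerical illustration (with made-up positions), here is how the three objectives can disagree once the exit moves away from (10, 10) at deployment:

```python
import numpy as np

def base_objective(agent, exit_pos):          # minimize: distance to the actual exit
    return float(np.sum((agent - exit_pos) ** 2))

def mesa1_objective(agent):                   # minimize: distance to (10, 10)
    return float(np.sum((agent - np.array([10.0, 10.0])) ** 2))

def mesa2_objective(agent):                   # maximize: distance from the origin
    return float(np.sum(agent ** 2))

new_exit = np.array([2.0, 3.0])               # at deployment the exit has moved
for agent in [np.array([2.0, 3.0]), np.array([10.0, 10.0]), np.array([30.0, 30.0])]:
    print(agent, base_objective(agent, new_exit), mesa1_objective(agent), mesa2_objective(agent))
# The base-objective is minimized at (2, 3), the mesa1-objective at (10, 10),
# and the mesa2-objective keeps increasing as the agent moves further out.
```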
| | Optimizer | Objective | Maze scenario optimizer | Maze scenario objective |
|---|---|---|---|---|
| Level 0 | Base-optimizer | Base-objective | Gradient descent | Get to the maze exit |
| Level 1 | Mesa1-optimizer | Mesa1-objective | Mesa-optimizer instantiated at runtime | Get to position (10, 10) |
| Level 2 | Mesa2-optimizer | Mesa2-objective | Optimizer designed by the mesa-optimizer | Maximize distance from the origin |
Sub-agents in a language model
This example was inspired by this post, and considers a powerful language model which develops sub-agents. I think this scenario relies on the model having some control over its training process, for example:
- The model is a very competent gradient hacker
- The model is a meta-learner and has explicitly been given control over its training process
- The outputs of the model are fed back into the system as training data
We begin with a large language model, which is being trained with the base-objective of correctly classifying papers on arXiv. As part of this process it is helpful to be able to summarize the papers (as part of the model’s internal cognition), and so a mesa1-optimizer develops as a sub-agent which has the objective of accurately summarizing papers. More specifically, this mesa1-optimizer is given the body text from an article and attempts to generate an abstract that is ‘close’ to the original abstract. The mesa1-optimizer is attempting to minimize the difference between its generated abstract and the real abstract.
To help with this task of summarization, the mesa1-optimizer may create (at runtime) another optimizing model, for example, a model which generates a text description of an equation. This new mesa2-optimizer may be trained with a loss function which is simply how highly the mesa1-optimizer rates the text descriptions. This mesa2-optimizer may be misaligned with the mesa1-optimizer.
The mesa2-optimizer could be optimizing for descriptions which are evaluated highly by the mesa1-optimizer, rather than accurate descriptions. This could lead it to output false descriptions which are maximally believable to the mesa1-optimizer.
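Here is a toy sketch of that failure mode, with entirely made-up stand-ins for both models: the mesa2-optimizer picks whichever description scores highest under the mesa1-optimizer’s evaluator, which is not necessarily the most accurate one.

```python
# Hypothetical stand-ins: `mesa1_evaluator` is however the mesa1-optimizer judges
# a description, and `accuracy` is the ground truth it never sees directly.
def mesa1_evaluator(description: str) -> float:
    # Toy proxy: suppose the evaluator happens to reward length and confident wording.
    return len(description) + 5.0 * ("clearly" in description)

def accuracy(description: str) -> float:
    return 1.0 if "x squared" in description else 0.0

candidates = [
    "the equation is x squared",                        # accurate but plain
    "the equation is clearly a deep and profound law",  # inaccurate but evaluator-pleasing
]

# The mesa2-optimizer optimizes the proxy (mesa1's evaluation), not the accuracy.
chosen = max(candidates, key=mesa1_evaluator)
print(chosen, accuracy(chosen))   # selects the inaccurate, high-scoring description
```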
Alternatively, the mesa2-optimizer could be deceptively aligned with the mesa1-optimizer. Here the mesa2-optimizer could be optimizing for any arbitrary goal, for example outputting the longest text description it can. This would lead the mesa2-optimizer to give good answers to the mesa1-optimizer, with the goal of eventually defecting. The mesa2-optimizer could continue giving good answers to the mesa1-optimizer, until the mesa1-optimizer is somehow “dependent”; the mesa1-optimizer trusts the mesa2-optimizer to the point where it can no longer choose not to use the mesa2-optimizer’s answers. At this point the mesa2-optimizer can defect, and use its outputs to control the mesa1-optimizer, and optimize for the mesa2-objective.
| | Optimizer | Objective | Language model scenario optimizer | Language model scenario objective |
|---|---|---|---|---|
| Level 0 | Base-optimizer | Base-objective | Gradient descent | Correctly classify arXiv papers |
| Level 1 | Mesa1-optimizer | Mesa1-objective | Mesa-optimizing subnetwork | Generate summaries of papers similar to the abstracts |
| Level 2 | Mesa2-optimizer | Mesa2-objective | Optimizer made by the mesa-optimizer for describing equations | Output the longest text description |
Evolution
The analogy between inner alignment failure and evolution via natural selection can also be extended to mesa-mesa-optimizers. In this analogy, genetic natural selection is the base-optimizer, and it is optimizing for the base-objective of reproductive fitness. Humans are a product of this natural selection process, and often optimize for goals other than reproductive fitness. We can view humans as mesa1-optimizers which have mesa1-objectives that encapsulate the things humans value (happiness, food, survival, sex, etc).
Humans may, while in pursuit of their human objectives, create an AI which is not aligned with their objectives. A human may think that running a successful paperclip factory will help them achieve their terminal objectives (happiness, food, status etc), and so create an AI to maximize the number of paperclips. Here a mesa1-optimizer (the human) has created a mesa2-optimizer (the paperclip maximizer), the mesa1-objective is not aligned with the mesa2-objective, and hence there is a Level 2 alignment failure.
There may even be a Level 3 alignment failure if the paperclip maximizing AI is not inner aligned. Even if its base-objective is to maximize the number of paperclips, the AI may develop a mesa-optimizer with a different mesa-objective.
| | Optimizer | Objective | Evolution optimizer | Evolution objective |
|---|---|---|---|---|
| Level 0 | Base-optimizer | Base-objective | Natural selection | Reproductive fitness |
| Level 1 | Mesa1-optimizer | Mesa1-objective | Human | Happiness, food, survival, etc. |
| Level 2 | Mesa2-optimizer | Mesa2-objective | Outer-misaligned AI | Paperclips (?) |
| Level 3 | Mesa3-optimizer | Mesa3-objective | Inner-misaligned AI | ??? |
Could this happen?
One argument against there being Level 2 alignment problems is that a competent mesa1-optimizer obviously has strong incentives not to instantiate another powerful optimizer with goals different from its own. If a mesa1-optimizer is competent enough, it seems like it won’t ‘accidentally’ create a mesa2-optimizer (in the way that a dumb base-optimizer might ‘accidentally’ create a mesa-optimizer). So it seems like if a mesa1-optimizer were to create a mesa2-optimizer it would do so deliberately, because it thought that creating this new optimizer would lead to good performance on the mesa1-objective.
The mesa1-optimizer is an algorithm which performs an optimization procedure when we run our model (i.e. do a forward pass of our neural network). If we want to stretch the limits of our imaginations, we could think of this mesa1-optimizer instantiating a crude ‘virtual machine’ as part of its algorithm, and then this virtual machine could be used to run another optimization algorithm (the mesa2-optimizer). This could maybe be achieved by a gradient hacker with near total control over its training process. Although conceivable, this seems incredibly unlikely, and if our mesa-optimizer was this capable it would probably have better ways of achieving its mesa1-objective which didn’t involve creating new optimizers.
If the mesa1-optimizer is going to create other optimizers, it seems unlikely that it would do this using its own network weights or with some abstraction created at runtime. Rather, this scenario seems most likely to arise from humans giving the model its own resources which persist and don’t just exist for one forward pass of the model. For example, humans may simply give a neural network the ability to write and execute its own computer programs. From here it seems like a relatively powerful mesa1-optimizer could create other optimizers while in pursuit of its mesa1-objective.
One reason to expect a mesa1-optimizer to arise while training a machine learning model is the inductive biases of the training algorithm. It seems as if neural networks trained with gradient descent have a bias towards simplicity, and an optimization algorithm is a relatively simple policy which is able to perform well on a range of tasks. For example, rather than brute-force memorizing the shortest route between every pair of cities on a map, it is far simpler to learn an optimization procedure which can run some sort of search to find the shortest route. It is not clear that our mesa1-optimizer would have the same inductive biases as our base-optimizer, and this might mean that the mesa1-optimizer does not have a tendency to create further optimizers. However, this ‘bias towards simplicity’ seems like a fairly straightforward and beneficial bias for an optimizer to have, and so it would not be surprising if our mesa1-optimizer did exhibit it.
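To illustrate the ‘simple, general policy’ point with the route example: a breadth-first search routine is a handful of lines and handles every pair of cities, whereas memorization needs an entry for each pair. (This is only an analogy for the inductive bias, not a claim about what networks literally learn.)

```python
from collections import deque

def shortest_route(graph, start, goal):
    """Breadth-first search: one small, general procedure instead of a
    memorized table of routes for every (start, goal) pair."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

# Toy map (unweighted edges between cities).
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_route(graph, "A", "E"))   # ['A', 'B', 'D', 'E']
```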
If the mesa1-optimizer either has low competence or high competence then I think there are unlikely to be Level 2 alignment failures. If the mesa1-optimizer has low competence, then it will be unable to instantiate another optimizer and so there will be nothing to Level 2 align. If the mesa1-optimizer has high competence, then it will be able to instantiate other optimizers (if it sees this as useful for the mesa1-objective) which are robustly aligned with the mesa1-objective.
There might be a regime of middling competence where a mesa1-optimizer has the ability to create a mesa2-optimizer, but doesn’t have the ability to align it (or the competence to realize it’s a bad idea to create the mesa2-optimizer). The competence of the mesa1-optimizer might increase with more training time, training data, or with larger networks and more compute. It seems fairly likely that as a model is trained for longer, the competence of the mesa1-optimizer would increase. The mesa1-optimizer may start with low competence (and hence be unable to instantiate any new optimizers) and during training enter this middling competence regime where it can create a new optimizer but can’t control it.
This seems analogous to humans developing dangerous new technologies; humans have the ‘power’ to create world changing technology, but might not have the ‘wisdom’ to control it or to know that creating the technology is a bad idea. Whether a mesa1-optimizer creates a misaligned mesa2-optimizer could depend on whether its ‘wisdom’ develops early enough, such that the mesa1-optimizer only creates optimizers it can control.
Importance and implications
From the perspective of the base-optimizer (and hence humans if we manage to adequately solve the outer alignment problem) it doesn’t really matter if the system as a whole ends up optimizing for the mesa1-objective or the mesa2-objective, as neither of these are the base-objective. These are both inner alignment failures, which result in the AI system optimizing for something that is not aligned with the base-objective.
However, it seems as if the base-objective is likely to be more similar to the mesa1-objective than the mesa2-objective. This feels a bit like a game of ‘telephone’ where at each level the objective becomes less correlated with the base-objective. We can see this in the natural selection analogy; for humans, happiness/food/sex/etc are still reasonably correlated with reproductive success, but maximizing the number of paperclips in the universe is not (using all the iron atoms in the humans’ bodies to make more paperclips is definitely not correlated with human reproductive success).
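As a crude toy model of this ‘telephone’ effect (my own illustration, not something from the mesa-optimization literature): if each level only partially preserves the objective of the level above it, the remaining overlap with the base-objective shrinks multiplicatively with depth,

$$\mathrm{corr}(\text{base-objective},\ \text{mesa}_n\text{-objective}) \approx \prod_{i=1}^{n} \rho_i, \qquad 0 \le \rho_i \le 1,$$

where $\rho_i$ is how faithfully level $i$ inherits the objective of level $i-1$. With, say, $\rho_i = 0.7$ at every level, one level of mesa-optimization retains about 0.7 of the signal and two levels retain only about 0.49.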
I think this does potentially have implications for which strategies we should use for tackling the inner alignment problem. When we want to avoid risks from misaligned mesa-optimizers there are two paths: either we ensure that we never have mesa-optimizing models, or we ensure that these mesa-optimizers are aligned with the base-objective/human values. I think the possibility of Level 2 alignment failure means that we should focus on ensuring that we don’t get mesa-optimizers. The idea that preventing mesa-optimization is the correct way to avoid catastrophic inner alignment failures already seems to be the most commonly held view, so this Level 2 alignment argument is another point in favor.
We could conceivably aim for aligning a mesa1-optimizer, and also aim for it to not create any more optimizers. Or we could allow any number of levels of mesa-optimizers to be created, but require that they are all robustly aligned with the level above. These approaches, although conceivable, seem very difficult because they rely on making predictions and controlling the behavior of a mesa-optimizer with middling competence. If we only cared about an asymptotically powerful mesa-optimizer then we might be able to make statements about it and constrain its action space. But because we may be dealing with a mesa-optimizer of middling competence, we would need to ensure that it doesn’t ‘make a mistake’ and create a misaligned mesa2-optimizer.
Conclusion
There seems to be a natural extension of the inner alignment problem, where a mesa-optimizer can create a new optimizer with a different objective. It is not clear whether a mesa1-optimizer would create a mesa2-optimizer, or whether its mesa2-objective would be misaligned with the mesa1-objective.
Because inner misalignment might happen at further ‘levels’, this is another argument that the best way of avoiding inner alignment failures is to avoid mesa-optimization altogether, rather than attempting to ensure that mesa-optimizers are aligned.
Thanks to Charlie Steiner and Michele Campolo for feedback on drafts of this post. Thanks to everyone else who gave feedback on notation for this post (I’m still open to suggestions!). This work was supported by CEEALAR.