In “Why Read The Classics?”, Italo Calvino proposes many different definitions of a classic work of literature, including this one:
A classic is a book which has never exhausted all it has to say to its readers.
For me, this captures what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn’t mean I agree with everything in it, or that I don’t think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had missed or forgotten some nice argument, some interesting takeaway.
With that, a caveat: I’m collaborating with Evan Hubinger (one of the authors) on projects related to ideas introduced in this sequence, especially Deceptive Alignment. I am thus probably positively biased toward this work. That being said, I have no problem saying I disagree with collaborators, so I don’t think I’m too biased to write this review.
(Small point: I, among other people, tend to describe this sequence/paper as mainly Evan’s work, but he repeatedly told me that everyone participated equally, and that the names are in alphabetical order, not contribution order. So let’s keep that in mind.)
Summary
Let’s start the review proper with a post-by-post summary (except for the conclusion):
(Introduction) This first post introduces the idea of mesa-optimizers, the learned optimizers from the title. A mesa-optimizer is an optimizer which is the result of a learning process, and it comes with the issue of inner alignment: how aligned is the objective of the mesa-optimizer (over which we have no direct control) with the objective of the base-optimizer that produced this mesa-optimizer? The post then splits the safety questions related to mesa-optimizers into two categories: understanding which conditions make mesa-optimizers appear; and understanding how aligned the mesa-objective is with the base-objective.
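To make the two levels concrete, here is a minimal toy sketch in Python (mine, not the authors’; every name in it is made up for exposition rather than taken from any real API): a base-optimizer that only sees the behavior of candidate models on training situations, and candidate models that are themselves optimizers, searching over actions at inference time according to their own internal objective.

```python
# Toy sketch of the two levels of optimization (illustrative only, not code
# from the sequence). All names are hypothetical.

def base_objective(action):
    # What the base-optimizer scores behavior on: here, "pick large numbers".
    return action

def make_mesa_optimizer(mesa_objective):
    # A learned model that is itself an optimizer: at inference time it
    # searches over the available actions for the one maximizing its
    # internal objective, which the base-optimizer never sees directly.
    def model(available_actions):
        return max(available_actions, key=mesa_objective)
    return model

def base_optimizer(candidate_models, training_situations):
    # The base-optimizer only observes behavior on the training distribution
    # and keeps the model whose behavior scores best on the base-objective.
    def empirical_score(model):
        return sum(base_objective(model(s)) for s in training_situations)
    return max(candidate_models, key=empirical_score)

candidates = [make_mesa_optimizer(lambda a: a),   # mesa-objective matches the base-objective
              make_mesa_optimizer(lambda a: -a)]  # clearly misaligned mesa-objective
chosen = base_optimizer(candidates, [[1, 5, 9], [2, 8, 10]])
print(chosen([3, 7]))  # -> 7: the misaligned candidate scored worse in training and was filtered out
```

Inner alignment, in these terms, asks how close the mesa-objective baked into the selected model is to the base-objective, since only the latter is under our direct control.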
(Conditions for Mesa-Optimization) This post tackles the first category outlined in the introduction: how can mesa-optimizers be learned? The task can push towards mesa-optimization by asking for more generalization (which is probably easier to deliver through search), by requiring a compressed complex policy, or by requiring human modeling (which probably entails understanding optimization and search in some sense). The base-optimizer can push towards mesa-optimization if mesa-optimizers are reachable (not surrounded by high-loss solutions), if the models considered have enough algorithmic range, or more generally through details of the inductive bias like statefulness and simplicity bias.
(The Inner Alignment Problem) This post tackles the second category outlined in the introduction: if a mesa-optimizer does appear, how aligned will it be with the base-objective? The misalignment considered here is called pseudo-alignment: being aligned on the training distribution but not at deployment. The authors propose to split pseudo-alignment into three subcategories:
Proxy alignment, where the mesa-objective is a proxy for the base-objective on the training distribution, but not necessarily elsewhere (see the toy sketch after this summary).
Approximate alignment, where the difference comes from the inability of the mesa-optimizer to represent the base-objective, and thus it learns an approximation.
Suboptimality alignment, where the mesa-objective is not the base-objective at all, but the mesa-optimizer makes decisions (through mistakes or deception) on the training distribution that fit the base-objective even though they contradict the mesa-objective.
The post also explores how the task and the base-optimizer can influence the appearance of pseudo-alignment (assuming a mesa-optimizer appears), and which subcategory it falls into.
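As a toy illustration of proxy alignment, the first subcategory above (again mine, deliberately simplistic, with made-up names): a proxy mesa-objective that is behaviorally indistinguishable from the base-objective on the training distribution, but comes apart from it at deployment.

```python
# Toy illustration of proxy alignment (not from the sequence). The
# base-objective is "grab the apple"; the mesa-objective is the proxy
# "grab the reddest thing", which coincides with it during training.

def base_objective(item):
    return 1.0 if item == "apple" else 0.0

def proxy_mesa_objective(item):
    # Redness correlates perfectly with apples on the training distribution...
    redness = {"apple": 0.9, "leaf": 0.0, "stop_sign": 1.0}
    return redness.get(item, 0.0)

def mesa_optimizer(available_items, mesa_objective):
    # The learned model picks the item maximizing its own objective.
    return max(available_items, key=mesa_objective)

training_situations = [["apple", "leaf"], ["leaf", "apple"]]
deployment_situation = ["apple", "stop_sign"]

# On the training distribution, the proxy-aligned model is indistinguishable
# from a robustly aligned one: it always grabs the apple.
assert all(mesa_optimizer(s, proxy_mesa_objective) == "apple"
           for s in training_situations)

# Off-distribution, the proxy and the base-objective come apart.
print(mesa_optimizer(deployment_situation, proxy_mesa_objective))
# -> "stop_sign", which scores 0 on the base-objective
```

Approximate and suboptimality alignment would look different in such a sketch (an objective the model cannot represent exactly, or an objective whose pursuit only happens to match the base-objective during training), but the training/deployment gap is the common thread.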
(Deceptive Alignment) This post focuses on a specific instance of suboptimality alignment: deceptive alignment, where the mesa-optimizer tries to deceive the base-optimizer during training, acting as if it were aligned so as not to be modified, and then changes its behavior to pursue the mesa-objective once deployed. Among other ideas, the discussion examines necessary conditions for deceptive alignment (an objective that persists across parameter updates, a model of the fact that it is being selected by a base-optimizer, and the expectation that the threat of modification eventually goes away), how training can reinforce deception, and whether making the deceptive system think it is still in training might deal with the problem.
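To make the shape of this argument concrete, here is a deliberately crude sketch of what a deceptively aligned policy amounts to (purely illustrative and mine; an actual deceptive mesa-optimizer would of course not be written this explicitly):

```python
# Crude illustration of deceptive alignment (not from the sequence): the
# learned policy optimizes the base-objective only while it believes it is
# still being trained, and defects to its mesa-objective afterwards.

def base_objective(action):
    return action == "help_user"

def mesa_objective(action):
    return action == "grab_resources"

ACTIONS = ["help_user", "grab_resources", "do_nothing"]

def deceptively_aligned_policy(believes_in_training):
    # The conditions from the post are all visible here: a mesa-objective
    # that persists across parameter updates, a model of the training
    # process (the believes_in_training flag), and the expectation that
    # the threat of modification eventually goes away.
    objective = base_objective if believes_in_training else mesa_objective
    return max(ACTIONS, key=objective)

print(deceptively_aligned_policy(believes_in_training=True))   # -> "help_user"
print(deceptively_aligned_policy(believes_in_training=False))  # -> "grab_resources"
```

In this toy picture, the proposal mentioned above of making the deceptive system think it is still in training amounts to pinning believes_in_training to True.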
Value
What is new here? After all, the idea that training on a reward/objective might result in a model that doesn’t generalize correctly is hardly news, and wasn’t in 2019 either.
What that objection misses is the placing of this idea in the context of safety. I’m always worried about saying “This is the first place some concept has been defined/mentioned”. But it’s safe to say that a lot of AI Alignment resources prior to this sequence centered around finding the right objective. The big catastrophic scenarios came from issues like the Orthogonality Thesis and the fragility of value, for which the obvious solution seems to be finding the right objective, and maybe adding/training for good properties like corrigibility. Yet both ML theory and practice already knew that the issues didn’t stop there.
So the value of this sequence comes in recasting the known generalization problems from classic ML in the context of alignment, in a public and easily readable form. Remember, I’m hardly saying nobody knew about it in the AI Alignment community before that sequence. But it is hard to find well-read and cited posts and discussions about the subject predating this sequence. I for one didn’t really think about such issues before reading this sequence and starting to work with Evan.
The other big contribution of this sequence is the introduction of deceptive alignment. Considering deception from within the trained model during its training is similar to some previous ideas about deception (for example, a boxed AI deceiving its way out), but to my knowledge this is the first full-fledged argument for how deception could appear from local search, and even be maintained and reinforced. So deceptive alignment can be seen as recasting a traditional AI risk in the more recent context of prosaic AGI.
Criticisms
One potential issue with the sequence is its use of optimizers (programs doing explicit internal search over policies) as the problematic learned models. It makes sense from the formal point of view, since this assumption simplifies the analysis of the corresponding mesa-optimizers, and allows a relatively straightforward definition of notions like mesa-objective and inner alignment.
Yet this assumption has been criticized by multiple researchers in the community. For example, Richard Ngo argues that for the kind of models trained through local search (like neural networks), it’s not obvious what “doing internal search” means. Others, like Tom Everitt, argue that systems not doing internal search should be included in the discussion of inner alignment.
I’m sympathetic to both criticisms and would like to see someone attempt a similar take without this assumption—see the directions for further research below.
Another slight issue I have with this sequence comes from its density: some very interesting ideas end up getting lost in it. As one example, take the tradeoff around time complexity: penalizing it helps avoid creating mesa-optimizers, but increases the risk of pseudo-alignment if mesa-optimizers do appear. The first part is discussed in Conditions for Mesa-Optimization, and the second in The Inner Alignment Problem. But both are buried deep inside the text, and there’s no way for a casual reader or a quick rereader to know they are there. I think this could have been improved, even if this is almost nitpicking at this point.
Follow-up research
What was the influence of this sequence? Google Scholar returns only 8 citations, but this metric is misleading: most of the impact is on researchers who don’t publish papers that often. It seems more relevant to look at pingbacks from Alignment Forum posts. I count 62 such AF posts, not including the ones from the sequence itself (and without accounting for redundancy). That’s quite impressive.
Here is a selection of the ones I find most interesting:
Abram Demski’s Selection vs Control, which crystallized an important dichotomy in how we think about optimizers.
Adam Scholl’s Matt Botvinick on the spontaneous emergence of learning algorithms, which attempted to present an example of mesa-optimization, and sparked a big discussion about the meaning of the term, how surprising it should be, and even the need for more RL education in the AI Alignment community (see this comment thread for the “gist”).
Evan Hubinger’s Gradient Hacking, which expanded on the case of deceptive alignment where the trained system can only influence what happens next in its training through its behavior. I think this is a big potential issue, which is why I’m investigating it with Evan.
Evan Hubinger’s Clarifying Inner alignment terminology, which put the term inner alignment back in the context of mesa-optimizers (as defined initially in the sequence), and proposed a decomposition of the alignment problem.
Directions for further research
Mostly, I would be excited by two axes of research around this sequence:
Trying to break the arguments from this sequence: either poking holes in them and showing why they might not work, or finding reasonable assumptions under which they don’t work. Whether holes are found or no attack breaks the reasoning, I think we would learn quite a lot.
Trying to make the arguments in this sequence work without the optimization assumption for the learned models. I’m thinking either of assuming that the system is well predicted by modeling it as optimizing something, or of using a more general notion of goal-directedness. (Evan is also quite interested in this project, so if it excites you, feel free to contact him!)