Very clear presentation! As someone outside the field who likes to follow along, I very much appreciate these clear conceptual frameworks and explanations.
I did, however, get slightly lost in section 1.2. At first reading I was expecting this part:
which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.
to say, “… gap between the behavioral objective and the intended goal of the programmers.” (In which case the inner alignment problem would be a subcomponent of the outer alignment problem.)
On second thought, I can see why you’d want to have a term just for the problem of making sure the base objective is aligned. But to help myself (and others who think similarly) keep this all straight, do you have a pithy term for “the intended goal of the programmers” that’s analogous to base objective, mesa objective, and behavioral objective?
Would meta objective be appropriate?
(Apologies if my question rests on a misunderstanding or if you’ve defined the term I’m looking for somewhere and I’ve missed it.)
I don’t have a good term for that, unfortunately. If you’re trying to build an aligned AI, “human values” could be the right term, though in most cases you really just want “move one strawberry onto a plate without killing everyone,” which is quite a lot less than “optimize for all human values.” I could see how meta-objective might make sense if you’re thinking of the human as an outside optimizer acting on the system, though I would shy away from using the term that way, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner instead.
Also, the motivation for choosing outer alignment as the alignment problem between the base objective and the goals of the programmers was to capture the “classical” alignment problem as it has sometimes previously been envisioned, wherein you just need to specify an aligned set of goals and then you’re good. As we argue, though, mesa-optimization means that you need more than just outer alignment—if you have mesa-optimizers, you also need inner alignment, as even if your base objective is perfectly aligned, the resulting mesa-objective (and thus the resulting behavioral objective) might not be.
Got it, that’s helpful. Thank you!
Phrases I’ve used: [intended/desired/designer’s] [objective/goal]
I think “designer’s objective” would fit in best with the rest of the terminology in this post, though “desired objective” is also good.