This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.
In the following, I’ll go over some of the definitions and explicate my understanding of, and confusion about, each. The definitions I omitted either explicitly refer to these or have analogous structure.
(Impact) Alignment: An agent is impact aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.
This one is more or less clear. Even though it’s not a formal definition, it doesn’t have to be: after all, this is precisely the problem we are trying to solve.
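To illustrate the kind of precision I have in mind (a toy formalization of my own, not something taken from the post): fix an environment $M$, an agent policy $\pi$, and a set $B$ of outcomes that we would judge to be catastrophic. One could then say

$\pi$ is impact aligned $\iff \Pr_{M,\pi}[\text{outcome} \in B] \le \epsilon$

for some tolerance $\epsilon$. Of course, most of the real difficulty is hiding inside the choice of $B$.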
Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.
The “behavioral objective” is defined in a linked page as:
The behavioral objective is what an optimizer appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.
This is already thorny territory, since it’s far from clear what “perfect inverse reinforcement learning” is. Intuitively, an “intent aligned” agent is supposed to be one whose behavior demonstrates an aligned objective, but which can still make mistakes with catastrophic consequences. The example I imagine is an AI researcher who is unwittingly building transformative unaligned AI.
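If I had to guess at a formalization (again my own sketch, with made-up notation): given some hypothesis class $\mathcal{R}$ of reward functions, “perfect IRL” presumably returns

$R_{\mathrm{beh}}(\pi) \in \operatorname{argmax}_{R \in \mathcal{R}} \mathrm{fit}(R, \pi)$

for some measure $\mathrm{fit}$ of how well $R$ explains the observed behavior of $\pi$, and intent alignment would then require that the optimal policy for $R_{\mathrm{beh}}(\pi)$ is impact aligned in the sense above. But everything hinges on the class $\mathcal{R}$ and on what $\mathrm{fit}$ means, which is exactly where the definition stays vague.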
Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
This is confusing because it’s unclear what counts as “well” and what the underlying assumptions are. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution, unless the deployment distribution is still constrained somehow. I’m guessing that either the agent is doing online learning, or it detects that it is off-distribution and fails gracefully in some sense, or maybe some combination of both.
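My best guess at a formalization (again my own notation): writing $D_{\mathrm{deploy}}$ for the deployment distribution over environments, capability robustness would say something like

$\mathbb{E}_{M \sim D_{\mathrm{deploy}}}[\text{performance of } \pi \text{ on } R_{\mathrm{beh}}(\pi) \text{ in } M]$ is close to optimal,

and writing it down at least makes explicit that some assumption about $D_{\mathrm{deploy}}$ has to be specified.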
Notably, the post asserts the implication intent alignment + capability robustness ⇒ impact alignment. Now, let’s go back to the example of the misguided AI researcher. In what sense are they not “capability robust”? I don’t know.
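In the toy notation above, the claimed implication would read: if the optimal policy for $R_{\mathrm{beh}}(\pi)$ is impact aligned (intent alignment), and $\pi$ itself performs near-optimally on $R_{\mathrm{beh}}(\pi)$ even off-distribution (capability robustness), then $\pi$ takes no catastrophic actions (impact alignment). Spelled out like this, it would at least be clear which conjunct the misguided researcher is supposed to violate.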
Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
The “mesa-objective” is defined in the linked page as:
A mesa-objective is the objective of a mesa-optimizer.
So it seems like we could replace “mesa-objective” with just “objective”. This is confusing, because in other places the author felt the need to use “behavioral objective”, but here he is referring to some other notion of objective, and it’s not clear what the difference is.
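One reading that would make the difference precise (my guess, not something the definitions state): the mesa-objective $R_{\mathrm{mesa}}$ would be an objective explicitly represented inside the model and searched over by its internal optimization, while the behavioral objective $R_{\mathrm{beh}}$ is whatever perfect IRL recovers from the outside, and in general $R_{\mathrm{mesa}} \neq R_{\mathrm{beh}}$. But if that is the intended distinction, the text doesn’t state it, so I’m left guessing.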
I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult!