My own perspective on outer/inner alignment, after reading Risks from Learned Optimization and lots of other posts, and after talking to Evan, is that:
An outer alignment failure is when you mess up the objective (the objective you train on wouldn’t be the right objective in the deployment environment).
An inner alignment failure is when the objective generalizes well (optimizing it on the deployment environment would produce the desired behavior), but training on the training environment causes the model to learn a different objective (whether implicit or explicit) that is not exactly the training objective.
To give an example, let’s consider the task of reaching the exit door of a 2D maze. And let’s say that in the training mazes, all exit doors are red, but in some other environments they are of a different color.
Having a reward that gives +1 for reaching red objects and 0 otherwise is an outer alignment failure, because this reward would not train the right behavior in some mazes (the ones with blue exit doors, for example).
Having a reward that gives +1 for reaching the exit door and 0 otherwise ensures outer alignment, but there is still a risk of inner alignment failure because the model could learn to just reach red objects. Note that this inner alignment failure can be caused either by the model learning an explicit and wrong objective (as in mesa-optimizers), or just because reaching red objects leads to simpler models that are still optimal for the training objective.
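To make the contrast concrete, here is a minimal sketch in Python of the two reward functions; the `MazeState` fields are hypothetical stand-ins for whatever observations a real maze environment would expose, not part of the original example.

```python
from dataclasses import dataclass


@dataclass
class MazeState:
    # Hypothetical observation fields; stand-ins for whatever the real
    # environment would actually expose.
    agent_on_red_object: bool
    agent_at_exit_door: bool


def reward_red_objects(state: MazeState) -> float:
    """Outer alignment failure: rewards reaching any red object, which is only
    a proxy for the exit door and gives the wrong signal in mazes whose exit
    doors are not red."""
    return 1.0 if state.agent_on_red_object else 0.0


def reward_exit_door(state: MazeState) -> float:
    """Outer-aligned objective: rewards reaching the exit door itself.
    Inner alignment can still fail if every training door happens to be red,
    because a 'go to red objects' policy is simpler and equally optimal on
    the training distribution."""
    return 1.0 if state.agent_at_exit_door else 0.0
```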
Another way to put it is that with an outer alignment failure, it’s generally not possible for training to end up with an aligned model, whereas with an inner alignment failure, there actually is an optimal model that is aligned, but the data/environments provided push training towards a misaligned optimal model.
I think the salient point of disagreement is that you seem to assume that for an objective to actually be the right one, all optimal policies should be aligned and generalize to the deployment distribution. But I don’t think that’s a useful assumption, because it’s pretty much impossible IMO to get such perfect objectives.
On the other hand, I do think that we can get objectives according to which some aligned models are optimal. And splitting the failure modes into outer and inner alignment failures allows us to discuss whether we should try to change the objective or the data.
Lastly, a weakness of my definition is that sometimes it might be harder to find the right data/training environment for the “good objective” than to change the objective into another one that is also aligned but for which finding the right optimal policy is easier. In that case, solving an inner alignment failure would be easier by changing the base objective, which is really counterintuitive.