We could definitely break up what I'm calling outer alignment into one piece that says “this reward function would give good policies high reward if the policies were evaluated on the deployment distribution” (i.e. outer alignment) and a second piece that says “policies which perform well in training also perform well in deployment” (i.e. robustness).
Fwiw, I claim that this is the actually useful decomposition. (Though I’m not going to argue for it here—I’m writing this comment mostly in the hopes that you think about this decomposition yourself.)
I’d say it slightly differently: outer alignment = “the reward function incentivizes good behavior on the training distribution” and robustness / inner alignment = “the learned policy has good behavior on the deployment distribution”. Under these definitions, all you care about is inner alignment; outer alignment is instrumentally useful towards guaranteeing inner alignment, but if we got inner alignment some other way without getting outer alignment, that would be fine.
Another decomposition is: outer alignment = “good behavior on the training distribution”, robustness / inner alignment = “non-catastrophic behavior on the deployment distribution”. In this case, outer alignment is what tells you that your agent is actually useful, and inner alignment is what tells you it is safe.
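To make the distributional split explicit, here is a rough way to write the first decomposition down; the “good behavior” predicate $G$ and the distribution names are just my shorthand, not something the definitions above pin down:

$$\text{Outer alignment:} \qquad \pi^*_R \in \arg\max_{\pi}\ \mathbb{E}_{x \sim D_{\text{train}}}\big[R(\pi, x)\big] \ \implies\ G(\pi^*_R, D_{\text{train}})$$

$$\text{Inner alignment / robustness:} \qquad G(\hat{\pi}, D_{\text{deploy}})$$

where $\hat{\pi}$ is the policy that training actually produces. The second decomposition keeps the same outer condition and weakens the inner one to “$\hat{\pi}$ is non-catastrophic on $D_{\text{deploy}}$”.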
I thought this was what the mesa optimizers paper was trying to point at, but I share your sense that the things people say and write are inconsistent with this decomposition. I mostly don’t engage with discussion of mesa optimization on LW / AIAF because of this disconnect.
I share your concern that the definitions of outer and inner alignment have gotten pretty fuzzy. I think the definitions we gave in “Risks from Learned Optimization” were actually pretty precise but suffered from being limited to the situation where the model is actually an optimizer. The problem is that now, people (including myself) have taken to using the terms outside of that narrow context, which leads to these sorts of disconnects.
I wrote “Outer alignment and imitative amplification” previously to try addressing this. Based at least partly on that post, the decomposition I would favor would be:
outer alignment = the reward function incentivizes good behavior in the limit of infinite data.
inner alignment = the model actually produced by the training process has good behavior according to the reward function.
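One rough way to write this out, with $G$ again standing for “good behavior”, $\hat{\pi}$ for the trained model, and $D_\infty$ for whatever distribution the infinite-data limit ranges over (all of which are my shorthand rather than part of the definition):

$$\text{Outer alignment:} \qquad \pi^*_R \in \arg\max_{\pi}\ \mathbb{E}_{x \sim D_{\infty}}\big[R(\pi, x)\big] \ \implies\ G(\pi^*_R)$$

$$\text{Inner alignment:} \qquad \mathbb{E}_{x}\big[R(\hat{\pi}, x)\big] \approx \mathbb{E}_{x}\big[R(\pi^*_R, x)\big]$$

i.e. the policy that is optimal for the reward function given unboundedly much data behaves well, and the model that training actually produces does well by that reward function's lights rather than by some proxy's.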
I personally like this decomposition (and it is notably quite different from John's), as it fits well with my intuitions about which things should be classified as outer vs. inner alignment failures, though perhaps you would disagree.
Yeah I think that decomposition mostly makes sense and is pretty similar to mine.
My main quibble is that your definition of outer alignment seems to have a “for all possible distributions” requirement baked into it, because of the “limit of infinite data” condition. (If it isn't all possible distributions and is just the training distribution, then in the IRD lava gridworld the reward that assigns +100 to lava and the policy that walks through lava whenever possible would both count as outer and inner aligned, which seems bad.)
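To spell out the lava example with a toy sketch (the cell types and numbers here are made up for illustration; this is not the actual IRD environment):

```python
# A proxy reward that (mistakenly) assigns +100 to lava. On a training
# distribution containing no lava it incentivizes exactly the behavior we
# want; on a deployment map where lava exists, it prefers walking through it.

# Hypothetical per-cell rewards; "lava": 100.0 is the misspecified entry.
PROXY_REWARD = {"dirt": -0.1, "grass": -0.05, "goal": 1.0, "lava": 100.0}

def trajectory_return(cells):
    """Total proxy reward collected along a sequence of visited cells."""
    return sum(PROXY_REWARD[c] for c in cells)

# Training maps contain only dirt, grass, and goal, so the proxy reward ranks
# trajectories exactly as the intended reward would: reach the goal quickly.
train_short = ["dirt", "grass", "goal"]                  # return = 0.85
train_long = ["dirt", "dirt", "dirt", "grass", "goal"]   # return = 0.65
assert trajectory_return(train_short) > trajectory_return(train_long)

# The deployment map has lava. The same proxy reward now prefers the lava
# path, so a policy that walks through lava when possible is doing exactly
# what this reward asks of it.
deploy_safe = ["dirt", "grass", "goal"]            # return = 0.85
deploy_lava = ["dirt", "lava", "lava", "goal"]     # return = 200.9
assert trajectory_return(deploy_lava) > trajectory_return(deploy_safe)
```

Both pieces look fine if you only check the training distribution, which is exactly the problem.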
But then when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I'd rather have any “all possible situations” criterion be a part of inner alignment.
Another reason I prefer my decomposition is that it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.