I share your concern that the definitions of outer and inner alignment have gotten pretty fuzzy. I think the definitions we gave in “Risks from Learned Optimization” were actually pretty precise but suffered from being limited to the situation where the model is actually an optimizer. The problem is that now, people (including myself) have taken to using the terms outside of that narrow context, which leads to these sorts of disconnects.
I wrote “Outer alignment and imitative amplification” previously to try addressing this. Based at least partly on that post, the decomposition I would favor would be:
outer alignment = the reward function incentivizes good behavior in the limit of infinite data
inner alignment = the model actually produced by the training process has good behavior according to the reward function.
I personally like this decomposition (which is notably quite different from John’s), as it fits well with my intuitions for what should be classified as an outer vs. an inner alignment failure, though perhaps you would disagree.
Yeah I think that decomposition mostly makes sense and is pretty similar to mine.
My main quibble is that your definition of outer alignment seems to have a “for all possible distributions” quantifier built in, because of the “limit of infinite data” requirement. (If it isn’t all possible distributions and is just the training distribution, then in the IRD lava gridworld, where lava never appears during training, the reward that assigns +100 to lava and the policy that walks through lava when possible would be both outer and inner aligned, which seems bad.)
But then when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I’d rather have any “all possible situations” criteria be a part of inner alignment.
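To make the lava point concrete, here is a toy sketch. It is not the actual IRD gridworld: the terrain names, the per-cell reward numbers, the candidate paths, and the helper names (`path_return`, `best_path`, `agrees_with_true`) are all made up for illustration. The only thing carried over from the example above is that the proxy reward assigns +100 to lava and that lava never shows up in training.

```python
# Toy illustration only; every number and name here is an assumption.
TRUE_REWARD = {"dirt": -1, "grass": -2, "lava": -100, "goal": 10}
PROXY_REWARD = {"dirt": -1, "grass": -2, "lava": 100, "goal": 10}  # +100 on lava


def path_return(path, reward):
    """Sum the per-cell reward along a path."""
    return sum(reward[cell] for cell in path)


def best_path(paths, reward):
    """A policy that perfectly optimizes `reward`: pick the highest-return path."""
    return max(paths, key=lambda p: path_return(p, reward))


# Each environment is a list of candidate paths to the goal.
# Training environments contain no lava at all.
TRAIN_ENVS = [
    [["dirt", "dirt", "goal"], ["grass", "grass", "goal"]],
    [["dirt", "grass", "goal"], ["dirt", "dirt", "dirt", "dirt", "goal"]],
]

# The wider distribution adds one environment with a lava shortcut.
ALL_ENVS = TRAIN_ENVS + [
    [["dirt", "dirt", "goal"], ["lava", "goal"]],
]


def agrees_with_true(envs):
    """True if optimizing the proxy gives the same behavior as optimizing the true reward."""
    return all(
        best_path(env, PROXY_REWARD) == best_path(env, TRUE_REWARD) for env in envs
    )


print(agrees_with_true(TRAIN_ENVS))  # proxy and true reward agree without lava
print(agrees_with_true(ALL_ENVS))    # they come apart once lava is reachable
```

Restricted to the training environments, the proxy reward incentivizes exactly the behavior we want and the policy does well according to it, so both checks pass. The failure only becomes visible once the quantifier ranges over environments that contain lava, and the open question in this exchange is which half of the decomposition that quantifier should live in.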
Another reason I prefer my decomposition is that it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.