Yeah I think that decomposition mostly makes sense and is pretty similar to mine.
My main quibble is that your definition of outer alignment seems to have a “for all possible distributions” because of the “limit of infinite data” requirement. (If it isn’t all possible distributions and is just the training distribution, then in the IRD lava gridworld the reward that assigns +100 to lava and the policy that walks through lava when possible would be both outer and inner aligned, which seems bad.)
But then when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I’d rather have any “all possible situations” criteria be a part of inner alignment.
Another reason I prefer my decomposition is because it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.
Yeah I think that decomposition mostly makes sense and is pretty similar to mine.
My main quibble is that your definition of outer alignment seems to have a “for all possible distributions” because of the “limit of infinite data” requirement. (If it isn’t all possible distributions and is just the training distribution, then in the IRD lava gridworld the reward that assigns +100 to lava and the policy that walks through lava when possible would be both outer and inner aligned, which seems bad.)
But then when arguing for the correctness of your outer alignment method, you need to talk about all possible situations that could come up, which seems not great. I’d rather have any “all possible situations” criteria be a part of inner alignment.
Another reason I prefer my decomposition is because it makes outer alignment a purely behavioral property, which is easier to check, much more conceptually grounded, and much more in line with what current outer alignment solutions guarantee.