Thanks for the example, but why is this a capabilities robustness problem and not an objective robustness problem, if we think of the objective as ‘classify pandas accurately’?
Insofar as it’s not a capability problem, I think it’s an example of Goodharting and not inner misalignment/mesa-optimization. The given objective (“minimize cross-entropy loss”) is optimized on-distribution by incorporating non-robust features (and it also gives no incentive to be fully robust to adversarial examples, so even non-robust features that don’t really help with performance could persist after training).
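To make the non-robust-features point concrete, here is a toy sketch (my own construction, not from the thread): a linear classifier is trained on synthetic data where a tiny-magnitude feature is almost perfectly predictive on-distribution. Gradient descent on the logistic loss leans on that feature, so the training objective is happily optimized, yet a small perturbation that flips the feature hurts accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Labels in {-1, +1}, plus two features: a "robust" feature that is
# noisy but stable, and a "non-robust" feature of tiny magnitude that
# is almost perfectly predictive on-distribution.
y = rng.integers(0, 2, n) * 2 - 1
robust = y * 1.0 + rng.normal(0.0, 1.0, n)
nonrobust = y * 0.1 + rng.normal(0.0, 0.01, n)
X = np.stack([robust, nonrobust], axis=1)

# Plain gradient descent on the logistic loss.
w = np.zeros(2)
for _ in range(2000):
    margins = np.clip(y * (X @ w), -50, 50)
    factor = 1.0 / (1.0 + np.exp(margins))  # sigmoid(-margin)
    grad = -(y[:, None] * X * factor[:, None]).mean(axis=0)
    w -= 1.0 * grad  # learning rate 1.0

def accuracy(X, y, w):
    return float((np.sign(X @ w) == y).mean())

# A small L-infinity perturbation (size 0.2) moves the robust feature
# only modestly relative to its scale, but it completely flips the
# non-robust feature, which the classifier has come to rely on.
X_adv = X - 0.2 * np.sign(w) * y[:, None]

print("clean accuracy:      ", accuracy(X, y, w))
print("adversarial accuracy:", accuracy(X_adv, y, w))
```

The loss never asked for robustness, so nothing in training penalizes the reliance on the fragile feature.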
You might argue that there is no alignment without capabilities, since a sufficiently dumb model “can’t know what you want”. But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for their search algorithms.
Of course, you can always argue that this is just an alignment failure one step up; if you perform Inverse Optimal Control on the behavior of the robot and derive a revealed reward function, you’ll find that its . In other words, you can invoke the fact that for biased agents, there doesn’t seem to be a super principled way of dividing up capabilities and preferences in the first place. I think this is probably going too far; there are still examples in practice (like the intractable search example) where thinking about it as a capabilities robustness issue is more natural than thinking about it as an objective problem.
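The intractable-search point can be illustrated with a minimal toy (my own numbers, not from any paper): the hand-specified reward is held fixed, and only the planner’s search power changes. The failure on the new domain is purely a capability one.

```python
# Toy illustration: the same reward function, optimized by a weak local
# planner vs. exhaustive search over a tiny 1-D state space.

def greedy_plan(reward, start, steps):
    """One-step-lookahead hill climbing: a deliberately weak planner."""
    pos = start
    for _ in range(steps):
        neighbors = [p for p in (pos - 1, pos, pos + 1) if 0 <= p < len(reward)]
        pos = max(neighbors, key=lambda p: reward[p])
    return pos

def exhaustive_plan(reward):
    """An 'ideal' planner that can afford global search."""
    return max(range(len(reward)), key=lambda p: reward[p])

# Training-like domain: reward increases smoothly, local search suffices.
easy = [0, 1, 2, 3, 4, 5]
# New domain: a low-reward valley separates the start from the best state.
hard = [0, 2, 1, 0, 0, 10]

print(greedy_plan(easy, 0, 10), exhaustive_plan(easy))  # both reach index 5
print(greedy_plan(hard, 0, 10), exhaustive_plan(hard))  # greedy stuck at index 1
```

Nothing about the objective changed between the two domains; the greedy planner simply cannot cross the valley, which is the sense in which this is a capability robustness failure.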
But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for their search algorithms.
I’d be interested to see actual examples of this, if there are any. But also, how would this not be an objective robustness failure if we frame the objective as “maximize reward”?
if you perform Inverse Optimal Control on the behavior of the robot and derive a revealed reward function, you’ll find that its
Do you mean to say that its reward function will be indistinguishable from its policy?
Interesting paper, thanks! If a policy cannot be decomposed into a planning algorithm and a reward function anyway, it’s unclear to me why 2D-robustness would be a better framing of robustness than just 1D-robustness.
I’d be interested to see actual examples of this, if there are any.
I have some toy examples from a paper I worked on: https://proceedings.mlr.press/v144/jain21a
But I think this is a well-known issue in robotics, because SOTA trajectory planning is often gradient-based (i.e. local). You definitely see this on any “hard” robotics task where initializing a halfway-decent trajectory is hard. I’ve heard from Anca Dragan (my PhD advisor) that this happens with actual self-driving car planners as well.
Do you mean to say that its reward function will be indistinguishable from its policy?
Oops, sorry, the answer got cut off somehow. I meant to say that if you take a planner that’s suboptimal, look at the policy it outputs, and then rationalize that policy assuming that the planner is optimal, you’ll get a reward function that is different from the reward function you put in. (Basically what the Armstrong + Mindermann paper says.)
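As a toy version of this argument (my own construction, loosely in the spirit of the Armstrong + Mindermann point): rationalize a biased planner’s choice under an optimality assumption, and the “revealed” reward disagrees with the true one.

```python
# A planner with a systematic bias, naively rationalized as optimal,
# yields a revealed reward that ranks actions differently from the
# true reward it was actually given.

true_reward = {"short_path": 1.0, "long_path": 3.0}

def biased_planner(reward):
    """A suboptimal planner that heavily discounts options it deems
    'hard' (here: the long path), regardless of their true reward."""
    effort_penalty = {"short_path": 0.0, "long_path": 5.0}
    return max(reward, key=lambda a: reward[a] - effort_penalty[a])

chosen = biased_planner(true_reward)  # picks "short_path"

# Naive inverse optimal control: assume the planner was optimal, so the
# chosen action must have had the highest reward.
revealed_reward = {a: (1.0 if a == chosen else 0.0) for a in true_reward}

print("true best:    ", max(true_reward, key=true_reward.get))        # long_path
print("revealed best:", max(revealed_reward, key=revealed_reward.get))  # short_path
```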
Interesting paper, thanks! If a policy cannot be decomposed into a planning algorithm and a reward function anyway, it’s unclear to me why 2D-robustness would be a better framing of robustness than just 1D-robustness.
Well, the paper doesn’t show that you can’t decompose it, but merely that the naive way of decomposing observed behavior into capabilities and objectives doesn’t work without additional assumptions. But we have additional information all the time! People can tell when other people are failing due to incompetence vs. misalignment, for example. And in practice, we can often guess whether a failure is due to capability limitations or objective robustness failures, for example by doing experiments with fine-tuning or prompting.
The reason we care about 2D-robustness is that capability failures seem much more benign than alignment failures. Besides the reasons given in the main post, we might also expect that capability failures will go away with scale, while alignment failures will become worse with scale. So knowing whether something is a capability robustness failure or an alignment one can inform you as to the importance and neglectedness of research directions.
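The fine-tuning/prompting diagnostic mentioned above can be sketched as follows (a hypothetical illustration; the `diagnose` and `proxy_model` helpers are made up, not from any real experiment): if clarifying the objective fixes the behavior, the capability was there all along, so the failure was one of objectives.

```python
# Rough heuristic for distinguishing capability vs. objective
# robustness failures by intervening on how the goal is communicated.

def diagnose(model, task, clarified_task):
    if model(task) == task["answer"]:
        return "no failure"
    if model(clarified_task) == clarified_task["answer"]:
        return "objective robustness failure"  # it could, it just didn't
    return "capability robustness failure"     # it couldn't either way

# A toy 'model' that Goodharts on a proxy (answer length) by default,
# but is perfectly capable when the goal is made explicit.
def proxy_model(task):
    if task.get("clarified"):
        return task["answer"]
    return max(task["options"], key=len)

task = {"options": ["42", "a very long wrong answer"], "answer": "42"}
clarified = dict(task, clarified=True)

print(diagnose(proxy_model, task, clarified))  # objective robustness failure
```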
Thanks for the reply!
Thanks—this helps.