I’d be interested to see actual examples of this, if there are any.
I have some toy examples from a paper I worked on (https://proceedings.mlr.press/v144/jain21a), but I think this is a well-known issue in robotics, because SOTA trajectory planning is often gradient-based (i.e. local). You definitely see this on any “hard” robotics task where initializing a halfway decent trajectory is hard. I’ve heard from Anca Dragan (my PhD advisor) that this happens with actual self-driving car planners as well.
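To make the "gradient-based, i.e. local" point concrete, here is a minimal sketch. It is not taken from the linked paper: the one-dimensional objective, the step size, and the function names are all made up for illustration, and the "trajectory" is collapsed to a single number. The planner follows the gradient of the true objective, but which optimum it reaches depends entirely on the initialization.

```python
def objective(x):
    # Hypothetical objective with two peaks: a worse one near x = -1
    # and a better one near x = +1 (the 0.3 * x term breaks the tie).
    return -(x**2 - 1.0) ** 2 + 0.3 * x


def gradient(x):
    # Analytic derivative of `objective`.
    return -4.0 * x * (x**2 - 1.0) + 0.3


def gradient_planner(x0, lr=0.01, steps=2000):
    """Plain gradient ascent from initialization x0 (a purely local planner)."""
    x = x0
    for _ in range(steps):
        x += lr * gradient(x)
    return x


bad_init = gradient_planner(-1.5)   # converges to the peak near x = -1
good_init = gradient_planner(1.5)   # converges to the peak near x = +1

print(bad_init, objective(bad_init))
print(good_init, objective(good_init))
```

Both runs optimize the same objective, yet the run with the poor initialization settles on the lower peak. Nothing about the objective changed, only the planner's starting point.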
Do you mean to say that its reward function will be indistinguishable from its policy?
Oops, sorry, the answer got cut off somehow. I meant to say that if you take a planner that’s suboptimal, look at the policy it outputs, and then rationalize that policy assuming that the planner is optimal, you’ll get a reward function that is different from the reward function you put in. (Basically what the Armstrong + Mindermann paper says.)
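A toy sketch of that argument, under made-up assumptions: a seven-state line world, a greedy hill-climbing planner standing in for the suboptimal planner, and the simplest possible "rationalization" (an indicator reward on the chosen state). None of these names or numbers come from the papers mentioned here.

```python
def local_planner(reward, start):
    """Greedy hill-climbing: step to the better neighbour until none improves."""
    state = start
    while True:
        neighbours = [s for s in (state - 1, state + 1) if 0 <= s < len(reward)]
        best = max(neighbours, key=lambda s: reward[s])
        if reward[best] <= reward[state]:
            return state
        state = best


def rationalizes(reward, chosen):
    """True if an optimal planner would also end at `chosen` under `reward`."""
    return reward[chosen] == max(reward)


# Hypothetical true reward: a small peak at state 1, the best state at 5.
true_reward = [0, 2, 0, 0, 0, 5, 0]

# The suboptimal planner starts at state 2 and gets stuck on the small peak.
chosen = local_planner(true_reward, start=2)

# One reward that explains the behaviour if we assume the planner is optimal.
inferred_reward = [1 if s == chosen else 0 for s in range(len(true_reward))]

print(chosen)                                 # state the planner ended in
print(rationalizes(true_reward, chosen))      # does the true reward explain it?
print(rationalizes(inferred_reward, chosen))  # does the inferred reward explain it?
```

Under the optimality assumption, the true reward cannot explain the observed choice, so any reward that does explain it has to differ from the one that was put in.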
Interesting paper, thanks! If a policy cannot be decomposed into a planning algorithm and a reward function anyway, it’s unclear to me why 2D-robustness would be a better framing of robustness than just 1D-robustness.
Well, the paper doesn’t show that you can’t decompose it, but merely that the naive way of decomposing observed behavior into capabilities and objectives doesn’t work without additional assumptions. But we have additional information all the time! People can tell when other people are failing due to incompetence vs. misalignment, for example. And in practice, we can often guess whether a failure is due to capability limitations or to an objective robustness failure, for example by doing experiments with fine-tuning or prompting.
The reason we care about 2D alignment is that capability failures seem much more benign than alignment failures. Besides the reasons given in the main post, we might also expect that capability failures will go away with scale, while alignment failures will become worse with scale. So knowing whether something is a capability robustness failure or an alignment failure can inform you as to the importance and neglectedness of research directions.
Thanks—this helps.