> my understanding is that every published impact regularisation method fails [supervisor manipulation] in a ‘default’ implementation.
Wouldn’t most measures with a stepwise inaction baseline pass? They would still have an incentive to select over future plans so that the humans’ reactions to the agent are low impact (wrt the current baseline), but if the stepwise inaction outcome is high impact by the time the agent realizes, that’s the new baseline.
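To make the “that’s the new baseline” point concrete, here is a toy sketch (all names are mine, not from any particular paper): under a stepwise inaction baseline, the reference point is recomputed from the agent’s current state at every step, so an event that has already happened, however high-impact, gets absorbed into the baseline.

```python
# Toy illustration (hypothetical names): with a *stepwise* inaction
# baseline, the comparison point branches off the current state at each
# step, so earlier high-impact events end up inside the baseline.

NOOP = "noop"

def stepwise_baselines(step, state, actions):
    """For each step, return the state reached by doing nothing once
    from the state the agent actually occupies at that step."""
    baselines = []
    for a in actions:
        baselines.append(step(state, NOOP))  # branch off current state
        state = step(state, a)               # agent's actual trajectory
    return baselines
```

If a high-impact event occurs at step 1, the baseline at step 2 already includes it, which is the sense in which the high-impact outcome becomes the new baseline.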
> Wouldn’t most measures with a stepwise inaction baseline pass?
I think not, because given stepwise inaction, the supervisor will issue a high-impact task, and the AI system will just ignore it due to being inactive. Therefore, the actual rollout, in which the supervisor issues a high-impact task and the system completes it, should be high impact relative to that baseline. Or at least that’s my current thinking; I’ve regularly found myself changing my mind about what these systems actually do in these test cases.
OK, I now think the above comment is wrong, because proposals using stepwise inaction baselines often compare what would happen if you didn’t take the current action and were inactive from now on against what would happen if you took the current action but were inactive from then on; at least, that’s how it’s represented in this paper.
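A minimal sketch of that comparison, under my own simplifying assumptions (a deterministic environment model `step`, a no-op action, and a generic `deviation` measure; none of these names come from the paper):

```python
# Sketch of a stepwise inaction penalty: compare "take the action, then
# be inactive" against "be inactive from the current step onward".
# All names are illustrative, not from the cited paper.

NOOP = "noop"

def rollout(step, state, first_action, horizon):
    """Take `first_action`, then do nothing for the rest of the horizon."""
    state = step(state, first_action)
    for _ in range(horizon - 1):
        state = step(state, NOOP)
    return state

def stepwise_inaction_penalty(step, state, action, horizon, deviation):
    """Penalty for `action` relative to the stepwise inaction baseline.

    `deviation` scores how different two final states are; a real method
    would use something like relative-reachability or attainable-utility
    differences here.
    """
    acted = rollout(step, state, action, horizon)
    inactive = rollout(step, state, NOOP, horizon)
    return deviation(acted, inactive)
```

On this formulation, anything that happens regardless of the current action (such as the supervisor issuing the task) shows up in both rollouts, so it contributes little to the penalty, which is why the comparison above changes the verdict on the test case.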
This post is extremely well done.

Thanks!