> my understanding is that every published impact regularisation method fails [supervisor manipulation] in a ‘default’ implementation.
Wouldn’t most measures with a stepwise inaction baseline pass? They would still have an incentive to select over future plans so that the humans’ reactions to the agent are low impact (wrt the current baseline), but if the stepwise inaction outcome is high impact by the time the agent realizes, that’s the new baseline.
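To make the “that’s the new baseline” point concrete, here is a toy sketch (all names are mine, not from any particular paper): under a stepwise inaction baseline, the reference point is recomputed from the agent’s current state at every step, so an event that has already happened, however high-impact, gets absorbed into the baseline.

```python
# Toy illustration (hypothetical names): with a *stepwise* inaction
# baseline, the comparison point branches off the current state at each
# step, so earlier high-impact events end up inside the baseline.

NOOP = "noop"

def stepwise_baselines(step, state, actions):
    """For each step, return the state reached by doing nothing once
    from the state the agent actually occupies at that step."""
    baselines = []
    for a in actions:
        baselines.append(step(state, NOOP))  # branch off current state
        state = step(state, a)               # agent's actual trajectory
    return baselines
```

If a high-impact event occurs at step 1, the baseline at step 2 already includes it, which is the sense in which the high-impact outcome becomes the new baseline.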
> Wouldn’t most measures with a stepwise inaction baseline pass?
I think not, because given stepwise inaction, the supervisor will issue a high-impact task, and the AI system will just ignore it due to being inactive. Therefore, the actual rollout, in which the supervisor issues a high-impact task and the system completes it, should be high impact relative to that baseline. Or at least that’s my current thinking; I’ve regularly found myself changing my mind about what these systems actually do in these test cases.
OK, I now think the above comment is wrong, because proposals using stepwise inaction baselines often compare what would happen if you didn’t take the current action and were inactive from now on against what would happen if you took the current action but were inactive from then on; at least, that’s how it’s represented in this paper.
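A minimal sketch of that comparison, under my own simplifying assumptions (a deterministic environment model `step`, a no-op action, and a generic `deviation` measure; none of these names come from the paper):

```python
# Sketch of a stepwise inaction penalty: compare "take the action, then
# be inactive" against "be inactive from the current step onward".
# All names are illustrative, not from the cited paper.

NOOP = "noop"

def rollout(step, state, first_action, horizon):
    """Take `first_action`, then do nothing for the rest of the horizon."""
    state = step(state, first_action)
    for _ in range(horizon - 1):
        state = step(state, NOOP)
    return state

def stepwise_inaction_penalty(step, state, action, horizon, deviation):
    """Penalty for `action` relative to the stepwise inaction baseline.

    `deviation` scores how different two final states are; a real method
    would use something like relative-reachability or attainable-utility
    differences here.
    """
    acted = rollout(step, state, action, horizon)
    inactive = rollout(step, state, NOOP, horizon)
    return deviation(acted, inactive)
```

On this formulation, anything that happens regardless of the current action (such as the supervisor issuing the task) shows up in both rollouts, so it contributes little to the penalty, which is why the comparison above changes the verdict on the test case.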
This post is extremely well done.

Thanks!