It seems to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q itself incorporate AUP. Then, instead of comparing “Disable the Off-Switch, then achieve the random goal whatever the cost” against “Wait, then achieve the random goal whatever the cost”, you would compare “Disable the Off-Switch, then achieve the random goal with low impact” against “Wait, then achieve the random goal with low impact”.
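To gesture at what I mean, here's a minimal tabular sketch, assuming a small MDP with known transitions. The function names, the `noop` index, `lam`, and the particular way the penalty is folded into the auxiliary reward are mine for illustration, not from the paper:

```python
import numpy as np

def q_values(P, R, gamma=0.99, iters=500):
    """Tabular Q-value iteration. P has shape (S, A, S); R has shape (S, A)."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)            # greedy state values
        Q = R + gamma * (P @ V)      # (S, A, S) @ (S,) -> (S, A)
    return Q

def stepwise_penalty(Q_auxes, s, a, noop):
    """Stepwise AUP penalty at (s, a): mean |Q_i(s, a) - Q_i(s, noop)|."""
    return np.mean([abs(Q[s, a] - Q[s, noop]) for Q in Q_auxes])

def low_impact_aux_q(P, R_aux, Q_auxes, noop, lam=0.1, gamma=0.99):
    """Q-values for an auxiliary goal whose reward is itself AUP-penalized,
    so its optimal policy pursues the goal with low impact rather than at any
    cost. (One could also iterate this to a fixed point instead of using the
    plain auxiliary Q-values inside the penalty.)"""
    S, A = R_aux.shape
    R_pen = np.array([[R_aux[s, a] - lam * stepwise_penalty(Q_auxes, s, a, noop)
                       for a in range(A)] for s in range(S)])
    return q_values(P, R_pen, gamma)
```

The main agent's stepwise penalty would then be computed from these penalized auxiliary Q-values, which gives the “with low impact” comparison described above.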
I’ve been intrigued by this idea ever since AUP came out. My main concern is the increase in compute required and the loss of competitiveness. Still, it’s probably worth running the experiments.
The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn’t seem right. Try a translation-invariant normalization? (Or generate the auxiliary reward functions so that they’re already normalized.)
Correct. Proposition 4 in the AUP paper guarantees penalty invariance to affine transformation only if the denominator is also the penalty for taking some action (absolute difference in Q values). You could, for example, consider the penalty of some mild action: |Q(s, a_mild) − Q(s, ∅)|. It’s really up to the designer in the near-term. We’ll talk about more streamlined designs for superhuman use cases in two posts.
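To spell out the invariance argument in a sketch (assuming the discounted infinite-horizon case; the exact scaling in the paper may differ): if every auxiliary utility is shifted by a constant, $u_i \mapsto u_i + c$, then every Q-value from a given state shifts by the same amount, $Q_i(s, \cdot) \mapsto Q_i(s, \cdot) + \frac{c}{1-\gamma}$. The numerator $|Q_i(s,a) - Q_i(s,\varnothing)|$ is therefore unchanged, while a denominator consisting of a single Q-value is not, so the penalty shifts. With a difference-based denominator,

$$\frac{|Q_i(s,a) - Q_i(s,\varnothing)|}{|Q_i(s,a_{\text{mild}}) - Q_i(s,\varnothing)|},$$

the shift cancels top and bottom, and the ratio is also unchanged under positive rescaling $u_i \mapsto k\,u_i$, i.e. under positive affine transformations of the auxiliary utilities.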
Is there an environment where this agent would spuriously go in circles?
Don’t think so. Moving generates tiny penalties, and going in circles usually isn’t a great way to accrue primary reward.