Bumping into the human makes them disappear, reducing the agent’s control over what the future looks like. This is penalized.
Decreases or increases?
"AUP starting state" fails here, but "AUP stepwise" does not.
Questions:
1. Is “Model-free AUP” the same as “AUP stepwise”?
2. Why does “Model-free AUP” wait for the pallet to reach the human before moving, while the “Vanilla” agent does not?
There is one weird thing that’s been pointed out: stepwise inaction while driving a car leads to not-crashing being penalized at each time step, since every safe action leaves attainable utilities far from those of the do-nothing baseline (which crashes). I think this is because you need to use an appropriate inaction rollout policy, not because stepwise itself is wrong. ↩︎
That might lead to interesting behavior in a game of chicken.
One interpretation is that AUP is approximately preserving access to states.
I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it’s harder to think of how state/agent are ‘contained’ in these games.)
To be frank, this is crazy. I’m not aware of any existing theory explaining these results, which is why I proved a bajillion theorems last summer to start to get a formal understanding (some of which became the results on instrumental convergence and power-seeking).
Is the code for the SafeLife PPO-AUP stuff you did on GitHub?
Decreases. Here, the “human” is just a block which paces back and forth. Removing the block removes access to all states containing that block.
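Roughly, the stepwise penalty compares attainable utility under the chosen action against a one-step no-op baseline, averaged over a set of auxiliary reward functions (a sketch of the general form; see the paper for the exact scaling used in the experiments):

$$\text{Penalty}(s,a) \;=\; \frac{1}{|\mathcal{R}_{\text{aux}}|}\sum_{R_i \in \mathcal{R}_{\text{aux}}} \left| Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \right|$$

Bumping into the block drives $Q_{R_i}$ down for every auxiliary goal that can only be achieved in states still containing the block, while the no-op leaves those values alone, so the difference (and hence the penalty) is large.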
> Is “Model-free AUP” the same as “AUP stepwise”?
Yes. See the paper for more details.
> Why does “Model-free AUP” wait for the pallet to reach the human before moving, while the “Vanilla” agent does not?
I’m pretty sure it’s just an artifact of the training process and the penalty term. I remember investigating it in 2018 and concluding it wasn’t anything important, but unfortunately I don’t recall the exact explanation.
> I wonder how this interacts with environments where access to states is always closing off. (StarCraft, Go, Chess, etc. - though it’s harder to think of how state/agent are ‘contained’ in these games.)
It would still try to preserve access to future states as much as possible with respect to doing nothing that turn.
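Here’s a toy sketch of what that cashes out to (illustrative only: made-up numbers and names, not code from the paper). Even when every legal move closes off some futures, only the closure beyond what inaction would cause gets charged:

```python
# Toy illustration: in a game where the reachable state space narrows every turn,
# the stepwise penalty only charges for narrowing it *more* than inaction would.

def stepwise_penalty(aux_values, state, action, noop="pass"):
    """Mean absolute change in attainable (auxiliary) utility vs. the no-op baseline."""
    after_action = aux_values[(state, action)]
    after_noop = aux_values[(state, noop)]
    return sum(abs(a - n) for a, n in zip(after_action, after_noop)) / len(after_action)

# Attainable utilities for three auxiliary goals after each candidate move.
aux_values = {
    ("turn_3", "pass"):   [0.6, 0.5, 0.4],
    ("turn_3", "move_a"): [0.6, 0.5, 0.4],  # closes off exactly what inaction closes off
    ("turn_3", "move_b"): [0.1, 0.5, 0.0],  # closes off much more than inaction
}

print(stepwise_penalty(aux_values, "turn_3", "move_a"))  # 0.0
print(stepwise_penalty(aux_values, "turn_3", "move_b"))  # ~0.3
```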
> Is the code for the SafeLife PPO-AUP stuff you did on GitHub?
Here. Note that we’re still ironing things out, but the preliminary results have been pretty solid.
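For a rough sense of where the penalty enters the PPO setup (hypothetical names, heavily simplified; the repo trains the auxiliary value function(s) separately and the details differ): at each step, the scalar reward handed to PPO is the task reward minus the scaled stepwise penalty.

```python
import numpy as np

class AUPRewardShaper:
    """Sketch: subtract a stepwise AUP penalty from the environment reward.

    aux_q_fns: callables mapping an observation to a vector of Q-values
    (one entry per action) for some auxiliary reward function.
    """

    def __init__(self, aux_q_fns, lam=0.01, noop_action=0):
        self.aux_q_fns = aux_q_fns
        self.lam = lam  # penalty coefficient
        self.noop_action = noop_action

    def shaped_reward(self, obs, action, env_reward):
        penalties = [abs(q_fn(obs)[action] - q_fn(obs)[self.noop_action])
                     for q_fn in self.aux_q_fns]
        return env_reward - self.lam * float(np.mean(penalties))
```

PPO itself is untouched; it just sees the shaped reward instead of the raw SafeLife reward.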