Pattern comments on Environmental Structure Can Cause Instrumental Convergence

Pattern 11 Aug 2021 16:45 UTC
2 points
loop around the level in a particular repeating fashion
That’s what I meant by ‘wait’.
On second thought, it can’t be a ‘wait to choose between 2+ options unless there are 2+ options’, because the end of the level isn’t a choice between 2 things. (Although if we pay attention to the last 2 things Pac-Man has to eat, then there’s a choice of which order to eat them in, but both orders lead to the same state, so it probably doesn’t matter.)

Mostly I was trying to figure out how this generalizes, because it seemed like it was as much about winning as losing (because both end the game):
A portion of a Tic-Tac-Toe game-tree against a fixed opponent policy. Whenever we make a move that ends the game, we can’t go anywhere else – we have to stay put. Then most reward functions incentivize the green actions over the black actions: average-reward optimal policies are particularly likely to take moves which keep the game going. The logic is that any lose-immediately-with-given-black-move reward function can be permuted into a stay-alive-with-green-move reward function.
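
To check that I’m reading the permutation argument right, here is a minimal sketch on a made-up toy tree, not the one in the figure. I’m assuming the black move goes straight to a single absorbing end state `T`, while the green move keeps the game going and can still reach either of two absorbing end states `B` or `C`. The state names, the reward values 0, 1, 2, and the helpers `optimal_first_move` and `swap` are all my own inventions for illustration. Under average reward, a policy’s value is just the reward of the end state it gets stuck in, so each first move is worth the best end state still reachable after it.

```python
from itertools import permutations

# Hypothetical toy tree (an assumption, not the post's Tic-Tac-Toe figure):
# black move -> absorbing state T; green move -> can still reach B or C.
BLACK_TERMINALS = ("T",)
GREEN_TERMINALS = ("B", "C")
STATES = BLACK_TERMINALS + GREEN_TERMINALS


def optimal_first_move(reward):
    """Return which first move is average-reward optimal for this reward function."""
    # Average reward equals the reward of the absorbing state we end up in,
    # so each move is worth the best end state reachable after taking it.
    black = max(reward[s] for s in BLACK_TERMINALS)
    green = max(reward[s] for s in GREEN_TERMINALS)
    if green > black:
        return "green"
    if black > green:
        return "black"
    return "tie"


def swap(reward, a, b):
    """Permute a reward function by exchanging the rewards of states a and b."""
    swapped = dict(reward)
    swapped[a], swapped[b] = reward[b], reward[a]
    return swapped


# Every assignment of the distinct rewards 0, 1, 2 to the three end states.
rewards = [dict(zip(STATES, values)) for values in permutations((0, 1, 2))]

counts = {"green": 0, "black": 0, "tie": 0}
for r in rewards:
    counts[optimal_first_move(r)] += 1

# The permutation argument: swapping T with B sends every black-optimal
# reward function to a distinct green-optimal one.
black_optimal = [r for r in rewards if optimal_first_move(r) == "black"]
images = [swap(r, "T", "B") for r in black_optimal]

print(counts)
print(all(optimal_first_move(r) == "green" for r in images))
```

In this toy case green is optimal for 4 of the 6 reward functions and black for 2, and the swap maps the black-optimal ones injectively into the green-optimal ones. That only shows the counting for this one small tree with distinct rewards; it isn’t meant as a proof of the general claim.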