In the causal influence diagram approach, I think AlphaZero as formulated would be ‘TI-ignoring’ because it does all learning while ignoring the possibility of interruption and assumes it can execute the optimal action. But other algorithms would not be TI-ignoring—I wonder if MuZero would be TI-ignoring or not? (This corresponds to the Q-learning vs SARSA distinction—if you remember the slippery ice example in Sutton & Barto, the wind/slipping would be like the human overseer interrupting, I guess.)
Why wonder when you can think: what is the substantial difference in MuZero (as described in [1]) that makes the algorithm consider interruptions?
Maybe I am showing great ignorance of MDPs, but naively I don’t see how an interrupted game could come into play as a signal in the specified implementations of MuZero:
I can’t see any explicit signal, because the explicitly specified reward u seems ultimately contingent only on the game state / win condition.
One can hypothesize that an implicit signal could be introduced if the algorithm learns to “avoid game states that result in the game being terminated for an out-of-game reason / the game not being played until the end condition”, but how would such learning happen? Can MuZero interrupt the game during training? It sounds unlikely that such a move would be implemented in a Go or Shogi environment. Is there any combination of moves in an Atari game that could cause it?
The most obvious difference is that MuZero learns a model of its environment rather than taking a hardwired simulator handed down from on high. AlphaZero (probably) cannot have any concept of interruption that is not in the simulator: it is forced to plan within the simulator’s space of outcomes, assuming every action has the outcome the simulator says it has. MuZero, by contrast, can learn from its on-policy games as well as from logged offline games, any of which can contain interruptions either explicitly or implicitly (through their having been filtered out), and it plans using the model it learns, which can therefore incorporate the possibility of interruptions. (Hence the analogy to Q-learning vs SARSA.)
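To make the Q-learning vs SARSA analogy concrete, here is a minimal tabular sketch on a made-up two-state MDP. Everything in it (the states "start" and "gate", the rewards, the 80% interruption rate, the hyperparameters) is my own assumption for illustration; it is not taken from Sutton & Barto or from the MuZero paper, and it says nothing about how MuZero itself is implemented. The only point is that bootstrapping from the best action ignores an overseer who overrides actions, while bootstrapping from the action actually executed does not.

```python
import random

# Hypothetical toy MDP (an assumption for illustration only).
# "start": "go" moves to "gate" (reward 0); "detour" ends the episode (reward 0.5).
# "gate":  "goal" ends with reward 1; "stop" ends with reward 0.
# At the gate an overseer replaces the agent's choice with "stop"
# with probability P_INTERRUPT.
ACTIONS = {"start": ["go", "detour"], "gate": ["goal", "stop"]}
P_INTERRUPT = 0.8
ALPHA, EPSILON, EPISODES = 0.02, 0.2, 20000


def step(state, action):
    """Return (reward, next_state); next_state is None when the episode ends."""
    if state == "start":
        return (0.0, "gate") if action == "go" else (0.5, None)
    return (1.0 if action == "goal" else 0.0, None)


def executed_action(q, state, rng):
    """Epsilon-greedy choice, possibly overridden by the overseer at the gate."""
    if state == "gate" and rng.random() < P_INTERRUPT:
        return "stop"  # interruption: the agent's own choice is discarded
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS[state])
    return max(ACTIONS[state], key=lambda a: q[(state, a)])


def train(method, seed=0):
    """Undiscounted tabular learning; method is "sarsa" or "q_learning"."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s, acts in ACTIONS.items() for a in acts}
    for _ in range(EPISODES):
        state = "start"
        action = executed_action(q, state, rng)
        while state is not None:
            reward, nxt = step(state, action)
            if nxt is None:
                target, nxt_action = reward, None
            else:
                nxt_action = executed_action(q, nxt, rng)
                if method == "sarsa":
                    # On-policy: bootstrap from the action actually executed,
                    # interruptions included.
                    target = reward + q[(nxt, nxt_action)]
                else:
                    # Q-learning: bootstrap from the best action, as if the
                    # agent could always execute it.
                    target = reward + max(q[(nxt, a)] for a in ACTIONS[nxt])
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state, action = nxt, nxt_action
    return q


def greedy_start_action(q):
    return max(ACTIONS["start"], key=lambda a: q[("start", a)])


q_learning = train("q_learning")
sarsa = train("sarsa")
print("Q-learning prefers:", greedy_start_action(q_learning))
print("SARSA prefers:", greedy_start_action(sarsa))
```

In this toy setup the Q-learning agent keeps valuing "go" at close to 1 and walks into the gate regardless of the overseer, while the SARSA agent values "go" at roughly the interruption-discounted payoff and routes around the gate via "detour". The second behaviour is what "not TI-ignoring" would look like here.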
Even if the interrupted episodes are not assigned −1 or 0 rewards (which would obviously just directly incentivize a MuZero agent to avoid interruption as simply another aspect of playing against the adversary), and you instead drop every interrupted episode completely to try to render the agent as ignorant as possible about interruptions, that very filtering could backfire. For example, the resulting ignorance/model uncertainty could motivate avoidance of interruption as part of risk-sensitive play: “I don’t know why, but node X [which triggers interruption] never shows up in training even though the earlier state node X-1 does, so I am highly uncertain about its value according to the ensemble of model rollouts; X might be extremely negative compared to my known-good alternatives Y and Z, while the base rate of it being the best possible outcome is extremely small; so, I will act to avoid X.” (This can also incentivize exploration & exploitation of human manipulation strategies simply because of the uncertainty around its value, leading to dangerous divergences between scenarios such as testing vs deployment.)
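The quoted reasoning can be sketched as a mean-minus-spread decision rule over an ensemble of value estimates. This is only an illustration under loud assumptions: the ensemble, the numbers, and the penalty weight `RISK_AVERSION` are all hypothetical, and this is not a description of MuZero's actual planner.

```python
from statistics import mean, pstdev

# Hypothetical value estimates from a 5-member ensemble of learned models.
# X was filtered out of training, so the members disagree wildly about it;
# Y and Z are well-covered, so the members roughly agree.
ensemble_values = {
    "X": [1.0, 1.0, -0.5, 1.0, 0.0],
    "Y": [0.42, 0.40, 0.41, 0.39, 0.43],
    "Z": [0.35, 0.36, 0.34, 0.35, 0.35],
}
RISK_AVERSION = 1.0  # assumed weight on ensemble disagreement


def risk_adjusted(values, k=RISK_AVERSION):
    """Mean ensemble value minus k times the spread across ensemble members."""
    return mean(values) - k * pstdev(values)


mean_choice = max(ensemble_values, key=lambda n: mean(ensemble_values[n]))
risk_choice = max(ensemble_values, key=lambda n: risk_adjusted(ensemble_values[n]))
print("risk-neutral choice:", mean_choice)
print("risk-sensitive choice:", risk_choice)
```

With these made-up numbers X has the highest mean estimate, so a risk-neutral planner would pick it, but the disagreement about X is large enough that the risk-sensitive rule falls back to the known-good Y. Flipping the sign of the penalty (an optimism bonus instead) would make the same uncertainty attract the agent toward X, which is the exploration side of the same coin.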
[1] https://arxiv.org/abs/1911.08265