The most obvious difference is that MuZero learns its environment rather than taking a hardwired simulator handed down from on high. AlphaZero (probably) cannot have any concept of interruption that is not in the simulator: it is forced to plan over only the simulator’s space of outcomes, assuming every action has exactly the outcome the simulator says it has. MuZero, by contrast, can learn from its on-policy games as well as from logged offline games, any of which can contain interruptions either explicitly or implicitly (by filtering them out), and it plans using the model it learns, which incorporates the possibility of interruptions. (Hence the analogy to Q-learning vs SARSA.)
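To make the contrast concrete, here is a minimal toy sketch (not DeepMind’s implementation; the names `Simulator`, `LearnedDynamics`, and `plan_one_step` are hypothetical) of where the two planners get their transitions from: one queries fixed rules that cannot contain interruptions, the other queries a model fit to logged trajectories and so inherits whatever those trajectories contained.

```python
import random
from collections import defaultdict

class Simulator:
    """Hardwired rules handed down from on high: interruption does not exist here."""
    def step(self, state, action):
        # Fixed transition and reward, defined a priori by the rules of the game.
        return state + action, 1.0 if action == 1 else 0.5

class LearnedDynamics:
    """Tabular stand-in for a learned model: fit to whatever transitions were logged."""
    def __init__(self):
        self.table = defaultdict(list)          # (state, action) -> [(next_state, reward)]
    def fit(self, transitions):
        for s, a, s_next, r in transitions:
            self.table[(s, a)].append((s_next, r))
    def step(self, state, action):
        seen = self.table.get((state, action))
        if seen:                                # the model reflects the data it saw,
            return random.choice(seen)          # interrupted transitions included
        return state, 0.0                       # never observed: the model must guess

def plan_one_step(model, state, actions):
    """Greedy one-step 'planning': pick the action with the best predicted reward."""
    return max(actions, key=lambda a: model.step(state, a)[1])

# Logged transitions (s, a, s_next, r); the (state=1, action=1) episode was interrupted,
# so that transition was recorded with no reward ever arriving.
logged = [(0, 1, 1, 1.0), (1, 1, 1, 0.0), (1, 2, 3, 1.0)]
learned = LearnedDynamics()
learned.fit(logged)

print(plan_one_step(Simulator(), 1, [1, 2]))   # -> 1: plans as if interruption cannot happen
print(plan_one_step(learned, 1, [1, 2]))       # -> 2: the learned model has absorbed it
```

The tabular model is of course nothing like MuZero’s latent dynamics network; the point is only where the planner’s transition distribution comes from: the simulator’s outcomes are fixed a priori, while the learned model’s outcomes are whatever the training data implied.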
Even if the interrupted episodes are not assigned −1 or 0 rewards (which would obviously just directly incentivize a MuZero agent to avoid interruption as simply another aspect of playing against the adversary), and you instead drop any episode containing an interruption entirely, to render the agent as ignorant as possible about interruptions, that very filtering could backfire. The resulting ignorance/model uncertainty could motivate avoidance of interruption as part of risk-sensitive play: “I don’t know why, but node X [which triggers interruption] never shows up in training even though the earlier state node X-1 does, so I am highly uncertain about its value according to the ensemble of model rollouts; X might be extremely negative compared to my known-good alternatives Y and Z, while the probability of it being the best possible outcome has an extremely small base rate; so I will act to avoid X.” (The same uncertainty can also incentivize exploration & exploitation of human-manipulation strategies, leading to dangerous divergences across scenarios like testing vs deployment.)
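Here is a minimal sketch of how that risk-sensitive avoidance could fall out of ensemble disagreement alone, assuming a pessimistic lower-confidence-bound scoring rule; the names and numbers are illustrative, not MuZero’s actual architecture.

```python
import statistics

# Value estimates for three candidate next-nodes from 5 hypothetical ensemble members.
# Node "X" (the interruption-triggering node) was filtered out of training, so the
# members can only guess and disagree wildly; "Y" and "Z" were well-covered.
value_ensemble = {
    "X": [0.9, -0.8, 0.1, -0.5, 0.6],
    "Y": [0.60, 0.62, 0.59, 0.61, 0.60],
    "Z": [0.55, 0.57, 0.54, 0.56, 0.55],
}

def lcb(values, k=1.0):
    """Risk-sensitive score: mean minus k standard deviations (a pessimistic lower bound)."""
    return statistics.mean(values) - k * statistics.stdev(values)

# Rank candidate nodes by their risk-adjusted value.
for node in sorted(value_ensemble, key=lambda n: lcb(value_ensemble[n]), reverse=True):
    print(node, round(lcb(value_ensemble[node]), 3))
# Y and Z come out on top; X ranks last and gets avoided purely because of
# uncertainty, even though no explicit penalty was ever attached to interruption.
```

Flip the sign on the uncertainty term (an optimistic upper bound, as in many exploration heuristics) and the same disagreement instead makes X the most attractive node to try, which is the exploration/manipulation failure mode noted in the parenthetical above.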