One question that we have been thinking about is whether the cyclic vulnerability lies with CNNs or with AlphaZero-style training. For example, some folks in multiagent systems think that “the failure of naive self play to produce unexploitable policies is textbook level material”. On the other hand, David Wu’s tree vs. cycle theory seems to suggest that certain inductive biases of CNNs are also at play.
“Why not both?” Twitter snideness aside*, I don’t see any contradiction: cycling in multi-agent scenarios due to forgetting responses is consistent with bad inductive biases. The biases make the network unable to easily learn the best response, and so it learns various inferior responses, which form a cycle.
Imagine that CNNs cannot ‘see’ the circles because the receptive field grows too slowly or some CNN artifact like that; no amount of training can let the CNN see circles in full generality and recognize the trap. But it can still learn to win: e.g. with enough adversarial training against an exploiter which has learned to create circles in the top left, the CNN learns a policy of being scared of circles in the top left, and stops losing by learning to create circles in the other corners (where, as it happens, it is not currently being exploited); then the exploiter resumes training and learns to create circles in the top right, where the CNN falls right into the trap, and so the exploiter returns to winning; then the adversarial training resumes and the CNN forgets the avoid-top-left strategy and learns the avoid-top-right strategy… And so on forever. The CNN cannot learn a policy of ‘never create circles in any corner’ because you can’t win a game of Go like that, and CNN and exploiter just circle around the 4 corners playing rock-paper-scissors-spock eternally.
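A toy sketch of that dynamic, purely illustrative and not a model of any real Go network: the 4 corners, the `memory` parameter (how many corner-specific fixes the victim retains before forgetting the oldest), and the exploiter's clockwise search order are all assumptions made up for this example.

```python
from collections import deque

CORNERS = ["top-left", "top-right", "bottom-right", "bottom-left"]


def exploiter_best_response(avoided, last_attack):
    """Return the next corner, clockwise from the last attack, that the
    victim does not currently avoid, or None if every corner is covered."""
    start = 0 if last_attack is None else (CORNERS.index(last_attack) + 1) % len(CORNERS)
    for k in range(len(CORNERS)):
        corner = CORNERS[(start + k) % len(CORNERS)]
        if corner not in avoided:
            return corner
    return None


def simulate(rounds, memory=1):
    """Alternate exploiter training and victim adversarial training.

    The victim can only patch individual corners (hypothetical stand-in for
    'cannot see circles in general') and remembers at most `memory` patches.
    Returns the list of corners the exploiter successfully attacked.
    """
    avoided = deque(maxlen=memory)  # oldest patch is forgotten when full
    attack = None
    history = []
    for _ in range(rounds):
        attack = exploiter_best_response(avoided, attack)
        if attack is None:  # no exploitable corner left
            break
        history.append(attack)
        avoided.append(attack)  # victim learns to avoid this one corner
    return history


if __name__ == "__main__":
    print(simulate(8, memory=1))
    print(simulate(100, memory=4))
```

With `memory=1` the attack history just walks around the corners indefinitely; only when the victim can hold a fix for every corner at once (`memory=4`, standing in for a general circle detector) does the exploiter run out of attacks.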
* Adversarial spheres looks irrelevant to me, and the other paper is relevant but attacks a fixed policy, which is not the case with MCTS, especially with extremely large search budgets: MCTS is supposed to be complete in the limit and also changes the policy at runtime through policy improvement.