There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:
The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems, http://masfoundations.org/mas.pdf), and methods that produce less exploitable policies have been studied for decades.
and
Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.
Reply by authors:
I can see why a MAS scholar would be unsurprised by this result. However, most ML experts we spoke to prior to this paper thought our attack would fail! We hope our results will motivate ML researchers to be more interested in the work on exploitability pioneered by MAS scholars.
...
Ultimately self-play continues to be a widely used method, with high-profile empirical successes such as AlphaZero and OpenAI Five. If even these success stories are so empirically vulnerable we think it’s important for their limitations to become established common knowledge.
My understanding is that the authors’ position is reasonable by mainstream ML community standards; in particular, there’s nothing wrong with the original tweet thread. “Self-play is exploitable” is not new, but the practical demonstration of how easy it is to carry out the exploit against Go engines is a new and interesting result.
I hope the “Related work” section gets fixed as soon as possible, though.
The question is what level of scientific standards we want alignment-adjacent work to meet. There are good arguments for aiming to be much better than mainstream ML research (which is very bad at not rediscovering prior work) in this respect, since the mere existence of a parallel alignment research universe by default biases towards rediscovery.
[1] ...which I feel is not valid at all? If the policy was made aware of the weird rule during training, then a loss caused by that rule is a valid adversarial example. For research purposes, it doesn’t matter what the “real” rules of Go are.
I don’t play Go, so take this judgement with a grain of salt.