Thanks for the clarification, especially how a 6.1% winrate vs LeelaZero and a 3.5% winrate vs ELF still imply a significantly higher Elo than is warranted.
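(For anyone who wants the arithmetic: under the standard logistic Elo model, a winrate p corresponds to a gap of -400 * log10(1/p - 1), so even those low winrates place the adversary only ~475 Elo below LeelaZero and ~575 below ELF, far stronger than a bot that loses to amateur humans deserves. A minimal sketch:)

```python
# Back-of-the-envelope: Elo gap implied by a winrate under the standard
# logistic Elo model, p = 1 / (1 + 10^(-d/400))  =>  d = -400*log10(1/p - 1).
import math

def implied_elo_gap(p):
    return -400 * math.log10(1 / p - 1)

print(f"vs LeelaZero: {implied_elo_gap(0.061):+.0f} Elo")  # ~ -475
print(f"vs ELF:       {implied_elo_gap(0.035):+.0f} Elo")  # ~ -576
```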
The fact that Kellin could defeat LZ manually, as well as the positions in the bilibili video, does seem to suggest that this is a common weakness of many AlphaZero-style Go AIs. I retract my comment about other engines.
To our knowledge, this attack is the first exploit that consistently wins against top programs using substantial search, without repeating specific sequences (e.g., finding a particular game that a bot lost and replaying the key parts of it).
Yeah! I’m not downplaying the value of this achievement at all! It’s very cool that this attack works and can be reproduced by a human. I think this work is great (as I’ve said, for example, in my comments on the ICML paper). I’m specifically quibbling about the “solved/unsolved” terminology the post originally used.
Perhaps similar learning algorithms / neural-net architectures learn similar circuits / heuristics and thus also share the same vulnerabilities?
Your comment reminded me of ~all the adversarial attack transfer work in the image domain, which does suggest that non-adversarially trained neural networks will tend to have the same failure modes. Whoops. Should’ve thought about those results (and the convergent learning/universality results from interp) before I posted.
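(To make the analogy concrete, here’s a minimal sketch of that kind of transfer experiment; the two models, the random stand-in image, and the epsilon are illustrative choices of mine, not anything from a specific paper:)

```python
# FGSM adversarial example crafted against one model, then evaluated
# zero-shot on a second model that never saw the attack.
import torch
import torchvision.models as models

source = models.resnet18(weights="IMAGENET1K_V1").eval()
target = models.vgg16(weights="IMAGENET1K_V1").eval()

# Stand-in for a real normalized image; use an actual image to measure transfer.
x = torch.rand(1, 3, 224, 224, requires_grad=True)

# Single FGSM step computed against the *source* model only.
logits = source(x)
label = logits.argmax(dim=1)
torch.nn.functional.cross_entropy(logits, label).backward()
x_adv = (x + (8 / 255) * x.grad.sign()).detach().clamp(0, 1)

# Does the same perturbation also fool the *target* model?
with torch.no_grad():
    print("source fooled:", (source(x_adv).argmax(dim=1) != label).item())
    print("target fooled:", (target(x_adv).argmax(dim=1) != label).item())
```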
Keep in mind that the adversary was specifically trained against KataGo, whereas the performance against LeelaZero and ELF is basically zero-shot. It’s likely the case that an adversary trained against LeelaZero and ELF would also win consistently.
I’ve run LeelaZero, ELF, and MiniGo (yet another independent AlphaZero replication in Go) by hand on particular test positions to see what their policy and value predictions are, and they all massively misevaluate cyclic-group situations, just like KataGo. Perhaps by pure happenstance different bots “accidentally” prefer different move patterns that make the attack patterns harder or easier to form (indeed, this almost certainly varies between bots, since they do have somewhat different styles and preferences), but the bigger contributor to the difference is probably explicit optimization vs. zero-shot.
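(Concretely, this is the kind of by-hand check I mean, sketched here against KataGo’s JSON analysis engine; the config/model paths and the test moves are placeholders, and for LeelaZero/ELF one would ask the same question over GTP instead:)

```python
# Query a net's raw policy and value on a fixed test position via
# KataGo's JSON analysis engine. Paths and moves below are placeholders.
import json
import subprocess

proc = subprocess.Popen(
    ["katago", "analysis", "-config", "analysis.cfg", "-model", "model.bin.gz"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

query = {
    "id": "cyclic-test",
    "moves": [["B", "Q16"], ["W", "D4"]],  # substitute a real cyclic-group position
    "rules": "tromp-taylor",
    "komi": 7.5,
    "boardXSize": 19,
    "boardYSize": 19,
    "analyzeTurns": [2],
    "maxVisits": 1,        # 1 visit ~ the raw net evaluation, no search
    "includePolicy": True,
}
proc.stdin.write(json.dumps(query) + "\n")
proc.stdin.flush()

resp = json.loads(proc.stdout.readline())
print("net value (winrate):", resp["rootInfo"]["winrate"])
# resp["policy"] holds the net's raw move distribution over the board

proc.stdin.close()
```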
So all signs point to this misgeneralization being general to AlphaZero with convnets, not one particular bot. In another post here https://www.lesswrong.com/posts/Es6cinTyuTq3YAcoK/there-are-probably-no-superhuman-go-ais-strong-human-players?commentId=gAEovdd5iGsfZ48H3 I explain why I find it intuitive how and why a convnet would learn an incorrect algorithm first and then get stuck on it, given the data.
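(One quick way to see why this is plausible: a group’s liberty count is an inherently global flood-fill computation, while each conv layer only propagates information a bounded distance, so a sufficiently large cyclic group can outrun the net’s depth and push it toward a learned local shortcut instead. Toy reference implementation of the global computation, with a hypothetical 0/1/2 board encoding of my own:)

```python
# Counting a group's liberties requires flood-fill over the whole group,
# an unbounded-range computation. Board encoding (hypothetical): a list of
# lists with 0 = empty, 1 = black, 2 = white.
def liberties(board, r, c):
    color = board[r][c]
    seen, frontier, libs = {(r, c)}, [(r, c)], set()
    while frontier:  # flood-fill over the connected group of stones
        y, x = frontier.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < len(board) and 0 <= nx < len(board[0]):
                if board[ny][nx] == 0:
                    libs.add((ny, nx))          # empty neighbor = liberty
                elif board[ny][nx] == color and (ny, nx) not in seen:
                    seen.add((ny, nx))
                    frontier.append((ny, nx))
    return len(libs)
```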
(As for whether, e.g. a transformer architecture would have less of an issue—I genuinely have no idea, I think it could go either way, nobody I know has tried it in Go. I think it’s at least easier to see why a convnet could be susceptible to this specific failure mode, but that doesn’t mean other architectures wouldn’t be too)