Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the agents listed as vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu’s anti-circle training sounds a lot like what you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezing a balloon.
We think it might. One weak point against this is that we tried training CNNs with larger kernels and the problem didn’t improve. However, it’s not obvious that larger kernels would fix it (they give the model less need to rely on spatial locality, but it may still have an inductive bias towards it), and the results are somewhat confounded since we trained the CNN on historical KataGo self-play training data rather than generating fresh self-play data. We’ve been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers, which would give a cleaner answer to this. It’d be somewhat time consuming though, so we’re curious to hear how interesting you and other commenters would find this result so we can prioritize.
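To make the kernel-size point concrete, here is a minimal sketch (hypothetical layer counts and kernel sizes, not KataGo's actual architecture) of the distinction between the theoretical receptive field, which larger kernels do expand, and the locality bias, which they may not remove:

```python
# Sketch: theoretical receptive field of a stack of stride-1 conv layers.
# Layer counts and kernel sizes here are hypothetical, not KataGo's config.

def receptive_field(num_layers: int, kernel_size: int, stride: int = 1) -> int:
    """Theoretical receptive field (one side) after num_layers conv layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

for k in (3, 5, 7):
    print(f"kernel {k}: receptive field after 10 layers = {receptive_field(10, k)}")
# kernel 3: 21, kernel 5: 41, kernel 7: 61 -- all eventually cover a 19x19
# board, but the *effective* receptive field is empirically much smaller and
# roughly Gaussian around the center (Luo et al. 2016), so enlarging kernels
# widens the theoretical field without necessarily removing the inductive
# bias toward nearby stones.
```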
We’re also planning on doing mechanistic interpretability to better understand the failure mode, which might shed light on this question.
Do you know they are distinct? The discussion of Go in that paper is extremely brief and does not describe what the exploit actually is, AFAICT. Your E3 also doesn’t seem to describe what the Timbers agent does.
My main reason for believing they’re distinct is that an earlier version of their paper includes a Figure 3 providing an example Go board that looks fairly different from ours. It’s a bit hard to compare, since it’s a terminal board with no move history, but it doesn’t look like what would result from the capture of a large circular group. I do wish the Timbers paper went into more detail on this, e.g. including full game traces from their latest attack. I encouraged the authors to do this, but it seems they’ve all moved on to other projects since then and have limited ability to revise the paper.
> One weak point against this is that we tried training CNNs with larger kernels and the problem didn’t improve.
I wouldn’t really expect larger convolutions to fix it, aside from perhaps making the necessary ‘circles’ larger and/or harder to find, or creating longer cycles in the finetuning as there’s more room to squish the attack around the balloon. It could instead be related to the kernel’s other parameters, like stride or padding. (For example, I recall that the nasty ‘checkerboard’ artifacts in generative upscaling were due to the convolution stride/padding, and they don’t seem to ever come up in Transformer/MLP-based generative models; but simply making the CNN kernels larger didn’t fix it either, IIRC—you had to fix the stride/padding settings.)
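For illustration, here is a minimal PyTorch sketch (my own toy example, not from any of the papers under discussion) of that checkerboard mechanism: a transposed convolution whose kernel size isn’t divisible by its stride overlaps output positions unevenly, while the standard resize-then-convolve fix (Odena et al. 2016) avoids it regardless of kernel size.

```python
import torch
import torch.nn as nn

# Transposed conv with kernel 3, stride 2: the kernel size isn't divisible
# by the stride, so output positions receive unequal numbers of kernel
# contributions, producing a checkerboard even with uniform input and weights.
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, bias=False)
nn.init.constant_(deconv.weight, 1.0)
x = torch.ones(1, 1, 4, 4)
print(deconv(x).detach().squeeze())  # alternating magnitudes = checkerboard

# The usual fix: resize first, then apply an ordinary convolution, so every
# interior output pixel gets the same number of contributions.
fixed = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
)
nn.init.constant_(fixed[1].weight, 1.0)
print(fixed(x).detach().squeeze())   # uniform interior -- no checkerboard
```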
> We’ve been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers, which would give a cleaner answer to this. It’d be somewhat time consuming though, so we’re curious to hear how interesting you and other commenters would find this result so we can prioritize.
I personally would find it interesting, but I don’t know how important it is. It seems likely that you might find a completely different-looking adversarial attack, but would that be conclusive? There would be so many things that change between a CNN KataGo and a from-scratch ViT KataGo. Especially if you are right that Timbers et al. found a completely different adversarial attack in their AlphaZero, which AFAIK still uses CNNs. Maybe you could find many different attacks if you change up enough hyperparameters or initializations.
On the gripping hand, now that I look at this earlier version, their description of it as a weird glitch in AZ’s evaluation of pass moves at the end of the game sounds an awful lot like your first Tromp-Taylor pass exploit, i.e. it could probably be easily fixed with some finetuning. And in that case, perhaps Timbers et al. would have found the ‘circle’ exploit in AZ after all, if they had gotten past the first easy end-game pass-related exploit? (This also suggests a weakness in the search procedures: a search really ought to produce more than one exploit, preferably a whole list of distinct exploits. Some sort of PBT or novelty search approach, perhaps, as sketched below...)
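To gesture at what that could look like, a hedged, self-contained sketch of a novelty-search outer loop; the three stub functions stand in for the real attack-training machinery (the descriptor could be, say, a frequency vector over board motifs in winning games), and all names and thresholds here are hypothetical:

```python
import random

# Stubs standing in for the real pipeline: training an attacker, evaluating
# it against the victim, and summarizing its behavior as a vector.
def train_attacker(seed):
    random.seed(seed)
    return [random.random() for _ in range(4)]   # stub "policy"

def win_rate(candidate):
    return random.random()                        # stub evaluation

def behavior_descriptor(candidate):
    return candidate                              # stub: policy as descriptor

def novelty(descriptor, archive, k=3):
    """Mean Euclidean distance to the k nearest archived descriptors."""
    if not archive:
        return float("inf")
    dists = sorted(
        sum((x - y) ** 2 for x, y in zip(descriptor, d)) ** 0.5
        for d in archive
    )
    return sum(dists[:k]) / min(k, len(dists))

archive = []
for generation in range(100):
    candidate = train_attacker(seed=generation)
    descriptor = behavior_descriptor(candidate)
    # Archive only candidates that both beat the victim and behave unlike
    # anything already found -- the goal is a list of *distinct* exploits
    # rather than rediscovering the easiest one every run.
    if win_rate(candidate) > 0.9 and novelty(descriptor, archive) > 0.5:
        archive.append(descriptor)
print(f"distinct exploits found: {len(archive)}")
```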
Maybe a mechanistic interpretability approach would be better: if you could figure out where in KataGo it screws up the value estimate so badly, and what edits are necessary to make it yield the correct estimate, that might tell you whether the failure really is an architectural artifact of convolutions or something more general.
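In case it’s useful, here is a hedged sketch of activation patching, a standard localization tool for exactly this kind of question, on a toy residual network; the architecture, layer names, and board encodings are all stand-ins, not KataGo’s. The idea: cache activations from a position the network values correctly, splice them block by block into a run on a misjudged cyclic position, and see which patch moves the value estimate back.

```python
import torch
import torch.nn as nn

# Toy residual trunk; KataGo's real trunk, inputs, and names would differ.
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return x + self.body(x)        # residual add, as in a ResNet trunk

torch.manual_seed(0)
dim = 16
model = nn.Sequential(
    nn.Linear(8, dim),
    *[Block(dim) for _ in range(4)],
    nn.Linear(dim, 1), nn.Tanh(),      # toy "value head" in [-1, 1]
)
block_names = [f"{i}.body" for i in range(1, 5)]   # the residual branches

def cache_activations(x, names):
    """Record each named submodule's output on a clean forward pass."""
    acts, handles = {}, []
    for name in names:
        def hook(module, inp, out, name=name):
            acts[name] = out.detach()
        handles.append(dict(model.named_modules())[name].register_forward_hook(hook))
    model(x)
    for h in handles:
        h.remove()
    return acts

def run_with_patch(x, clean_acts, name):
    """Re-run on the corrupted input, splicing one clean branch output in.
    The residual connections make this a non-trivial intervention."""
    handle = dict(model.named_modules())[name].register_forward_hook(
        lambda module, inp, out: clean_acts[name])
    try:
        return model(x)
    finally:
        handle.remove()

clean_board = torch.randn(1, 8)        # stand-in: position valued correctly
cyclic_board = torch.randn(1, 8)       # stand-in: position with the exploit
clean_acts = cache_activations(clean_board, block_names)
# Whichever patch moves the value furthest back toward the clean estimate
# points at the block most responsible for the misjudgment.
for name in block_names:
    print(name, run_with_patch(cyclic_board, clean_acts, name).item())
```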