I wouldn’t really expect larger convolutions to fix it, aside from perhaps making the necessary ‘circles’ larger and/or harder to find, or creating longer cycles in the finetuning, as there’s more room to squish the attack around the balloon. The problem could instead be related to the other parameters of the kernel, like stride or padding. (For example, I recall the nasty ‘checkerboard’ artifacts in generative upscaling were due to the convolution stride/padding ([Odena et al 2016](https://distill.pub/2016/deconv-checkerboard/)); they don’t seem to ever come up in Transformer/MLP-based generative models, but simply making the CNN kernels larger didn’t fix them either, IIRC: you had to fix the stride/padding settings.)
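To make the stride/kernel point concrete, here’s a minimal sketch in plain Python/NumPy (my own illustration, not anything from the papers under discussion): in a 1D transposed convolution, each output pixel is covered by however many kernel applications overlap it, and the coverage is uneven, hence checkerboarded, whenever the kernel size isn’t divisible by the stride; enlarging the kernel alone doesn’t even out the coverage.

```python
import numpy as np

def overlap_counts(n_in, kernel, stride):
    """Count how many kernel applications touch each output pixel of a
    1D transposed convolution (no padding). Uneven counts = checkerboard."""
    n_out = (n_in - 1) * stride + kernel
    counts = np.zeros(n_out, dtype=int)
    for i in range(n_in):
        counts[i * stride : i * stride + kernel] += 1
    return counts

# kernel not divisible by stride: interior coverage alternates 1,2,1,2 -> artifacts
print(overlap_counts(8, kernel=3, stride=2))
# kernel divisible by stride: uniform interior coverage -> no artifacts
print(overlap_counts(8, kernel=4, stride=2))
# a *larger* kernel alone doesn't help if it's still not divisible by the stride:
print(overlap_counts(8, kernel=5, stride=2))
```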
> We’ve been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers which would give a cleaner answer to this. It’d be somewhat time consuming though, so curious to hear how interesting you and other commenters would find this result so we can prioritize.
I personally would find it interesting but I don’t know how important it is. It seems likely that you’d find a completely different-looking adversarial attack, but would that be conclusive? There are so many things that change between a CNN KataGo and a from-scratch ViT KataGo. Especially if you are right that Timbers et al find a completely different adversarial attack in their AlphaZero, which AFAIK still uses CNNs. Maybe you could find many different attacks if you change up enough hyperparameters or initializations.
On the gripping hand, now that I look at this earlier version, their description of it as a weird glitch in AZ’s evaluation of pass moves at the end of the game sounds an awful lot like your first Tromp-Taylor pass exploit, ie. it could probably be easily fixed with some finetuning. And in that case, perhaps Timbers et al would have found the ‘circle’ exploit in AZ after all if they had gotten past the first easy end-game pass-related exploit? (This also suggests a weakness in the search procedure: it really ought to produce more than one exploit, preferably a whole list of distinct exploits; some sort of PBT or novelty-search approach, perhaps, along the lines sketched below...)
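To gesture at what that might look like: a novelty-search-flavored loop that keeps an archive of behaviorally distinct exploits, and only accepts a new adversary if it both wins and sits far from everything already found. This is a hand-wavy sketch; `train_adversary`, `behavior_descriptor`, and `win_rate` are hypothetical stand-ins for the real attack pipeline:

```python
import numpy as np

def novelty(desc, archive, k=3):
    """Mean distance to the k nearest behavior descriptors already archived."""
    if not archive:
        return float("inf")
    dists = sorted(float(np.linalg.norm(desc - a)) for a in archive)
    return sum(dists[:k]) / len(dists[:k])

def hunt_exploits(n_rounds, min_win_rate, min_novelty):
    """Collect every adversary that both wins and behaves unlike past exploits."""
    archive, exploits = [], []
    for _ in range(n_rounds):
        policy = train_adversary()              # hypothetical: one attack run
        desc = behavior_descriptor(policy)      # hypothetical: e.g. move-distribution stats
        if win_rate(policy) >= min_win_rate and novelty(desc, archive) >= min_novelty:
            archive.append(desc)
            exploits.append(policy)
    return exploits   # a list of distinct exploits, not just the first one found
```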
Maybe a mechanistic interpretability approach would be better: if you could figure out where in KataGo it screws up the value estimate so badly, and what edits are necessary to make it yield the correct estimate, that would answer the architecture question far more directly than retraining from scratch.
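Concretely, that localization could start with something like activation patching: run the network on the exploit position and on a nearby position it evaluates correctly, splice the clean activations into the exploit run one layer at a time, and watch where the value head recovers. A rough PyTorch-style sketch, where `model.blocks` and the `.value` output field are assumptions about the network’s interface rather than KataGo’s actual one:

```python
import torch

def value_with_patch(model, exploit_pos, clean_pos, layer_idx):
    """Splice the clean position's activation at one layer into the exploit
    run and return the resulting value-head output."""
    saved = {}

    def record(module, inputs, output):
        saved["act"] = output            # cache the clean activation

    def splice(module, inputs, output):
        return saved["act"]              # replace this layer's output with the cached one

    hook = model.blocks[layer_idx].register_forward_hook(record)  # hypothetical layer list
    with torch.no_grad():
        model(clean_pos)
    hook.remove()

    hook = model.blocks[layer_idx].register_forward_hook(splice)
    with torch.no_grad():
        value = model(exploit_pos).value  # hypothetical value-head field
    hook.remove()
    return value

# Sweep layer_idx: layers where patching snaps the value back toward the
# correct estimate are candidates for where the bad evaluation is computed.
```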