The cyclic attack, on the other hand, is a substantial vulnerability of KataGo and other superhuman Go bots, and one which has yet to be fixed despite attempts by both our team and the lead developer of KataGo, David Wu.
Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the listed agents vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu’s anti-circle training sounds a lot like what you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezing a balloon.
I am also curious why the zero-shot transfer is so close to 0% but not 0%. Why do those agents differ so much, and what do the exploits for them look like?
Moreover, in concurrent work, a team at DeepMind found a way to beat a human-expert-level version of AlphaZero. The fact that two different teams could find two distinct exploits against distinct AI programs is strong evidence that the AlphaZero approach is intrinsically vulnerable.
Do you know they are distinct? The discussion of Go in that paper is extremely brief and does not describe what the exploitation is at all, AFAICT. Your E3 also doesn’t seem to describe what the Timbers agent does.
I am also curious why the zero-shot transfer is so close to 0% but not 0%. Why do those agents differ so much, and what do the exploits for them look like?
The exploits for the other agents are pretty much the same exploit; they aren’t really different. From what I can tell as an experienced Go player watching the adversary and other human players use the exploit, the zero-shot transfer is not so high because the adversarial policy overfits: it memorizes specific sequences that let it set up the cyclic pattern, and learns to do so in a relatively non-robust way.
All the current neural-net-based Go bots share the same massive misevaluations of the same final positions. Where they differ is that they may have arbitrarily different preferences among almost-equally-winning moves, so during the long period in which the adversary is in a game-theoretically lost position, a different victim, all the while still never realizing any danger, may nonetheless just so happen to choose different moves. Consider a strategy A that broadly minimizes the number of plausible ways a general unsuspecting victim might mess up your plan by accident, and a strategy B that leaves more total ways open, but not the ones that the small set of victim networks you are trained to exploit would stumble into (because you’ve memorized their tendencies well enough to know they won’t). The adversary is incentivized more towards B than A.
This even happens after the adversary “should” have won. Even after it finally reaches a position that is game-theoretically winning, it often blunders several times, playing moves that make the game game-theoretically lost again, before eventually winning. That is, it seems overfit to the fact that the particular victim net is unlikely to take advantage of its mistakes, so it never learns that they are in fact mistakes. In zero-shot transfer, this unnecessarily gives a different opponent, one which shares the same weakness but may just so happen to play in different ways, chances to stumble on a refutation and win, sometimes without the victim even realizing that it was a refutation of anything or that it had been in trouble in the first place.
I’ve noticed human exploiters play very differently from that. Once they achieve a game-theoretically winning position, they almost always close off all avenues for counterplay and stop giving the opponent chances that would work if the opponent were to suddenly become aware of the danger.
Prior to that point, when setting up the cycle from a game-theoretically lost position, most human players I’ve seen also play slightly differently. Most humans are far less reliable at using the exploit, because they haven’t practiced and memorized as many of the ways to keep any particular bot from accidentally interfering with the setup, so the adversary does much better than them here. But as they learn to do better, they tend to do so in ways that I think transfer better (i.e., from observation, my feeling is that they maintain a much stronger bias towards things like “strategy A” above).
Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the listed agents vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu’s anti-circle training sounds a lot like what you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezing a balloon.
We think it might. One weak point against this is that we tried training CNNs with larger kernels and the problem didn’t improve. However, it’s not obvious that larger kernels would fix it (they reduce the model’s need for spatial locality, but it might still have an inductive bias towards it), and the results are a bit confounded since we trained the CNN on historical KataGo self-play training data rather than generating fresh data. We’ve been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers, which would give a cleaner answer to this. It’d be somewhat time-consuming though, so we’re curious to hear how interesting you and other commenters would find this result so we can prioritize.
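To make the architectural comparison concrete, here’s a minimal sketch (hypothetical PyTorch, not our actual training code or KataGo’s) of the two inductive biases in question: a residual convolution block, whose enlarged kernel still mixes information only locally per layer, versus a ViT-style block that treats each of the 361 board points as a token and attends globally. All module names and sizes are illustrative.

```python
# Hypothetical sketch (not KataGo's actual code) contrasting the two inductive
# biases: per-layer local mixing (conv) vs. global attention over the board (ViT).
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Residual block with a configurable (possibly enlarged) kernel."""
    def __init__(self, channels=64, kernel_size=7):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.act = nn.ReLU()

    def forward(self, x):                        # x: (batch, channels, 19, 19)
        return x + self.conv2(self.act(self.conv1(x)))

class BoardTransformerBlock(nn.Module):
    """ViT-style block: each of the 19x19 points is a token, attention is global."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(nn.Linear(channels, 4 * channels),
                                 nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def forward(self, x):                         # x: (batch, channels, 19, 19)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)          # (batch, 361, channels)
        t = t + self.attn(self.norm1(t), self.norm1(t), self.norm1(t))[0]
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, c, h, w)
```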
We’re also planning on doing mechanistic interpretability to better understand the failure mode, which might shed light on this question.
Do you know they are distinct? The discussion of Go in that paper is extremely brief and does not describe what the exploitation is at all, AFAICT. Your E3 also doesn’t seem to describe what the Timbers agent does.
My main reason for believing they’re distinct is that an earlier version of their paper includes a Figure 3 with an example Go board that looks fairly different from ours. It’s a bit hard to compare, since it’s a terminal board with no move history, but it doesn’t look like what would result from the capture of a large circular group. I do wish the Timbers paper went into more detail on this, e.g. including full game traces from their latest attack. I encouraged the authors to do this, but it seems like they’ve all moved on to other projects since then and have limited ability to revise the paper.
We’ve been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers, which would give a cleaner answer to this.
I wouldn’t really expect larger convolutions to fix it, aside from perhaps making the necessary ‘circles’ larger and/or harder to find, or creating longer cycles in the finetuning, since there’s more room to squish the attack around the balloon. It could instead be related to the other parameters of the kernel, like stride or padding. (For example, I recall the nasty ‘checkerboard’ artifacts in generative upscaling were due to the convolution stride/padding, and they don’t seem to ever come up in Transformer/MLP-based generative models; but simply making the CNN kernels larger didn’t fix them either, IIRC: you had to fix the stride/padding settings.)
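For what it’s worth, the stride/padding mechanism behind those checkerboard artifacts is easy to demonstrate in isolation. A small sketch (generic PyTorch, unrelated to KataGo’s actual layers): a transposed convolution with kernel 3 and stride 2 turns a perfectly uniform input into a periodic pattern purely because the kernel footprints overlap unevenly.

```python
# Illustration of the stride/padding mechanism behind 'checkerboard' artifacts
# in transposed convolutions; nothing KataGo-specific.
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, bias=False)
with torch.no_grad():
    up.weight.fill_(1.0)            # uniform weights isolate the overlap effect

x = torch.ones(1, 1, 8, 8)           # perfectly uniform input
y = up(x)
print(y[0, 0, :6, :6])               # alternating values reveal the checkerboard
# Using kernel_size=4 (divisible by the stride), or resize-then-conv upsampling,
# evens out the overlap and removes the pattern.
```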
We’ve been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers, which would give a cleaner answer to this. It’d be somewhat time-consuming though, so we’re curious to hear how interesting you and other commenters would find this result so we can prioritize.
I personally would find it interesting, but I don’t know how important it is. It seems likely that you would find a completely different-looking adversarial attack, but would that be conclusive? There are so many things that change between a CNN KataGo and a from-scratch ViT KataGo, especially if you are right that Timbers et al. found a completely different adversarial attack in their AlphaZero, which AFAIK still uses CNNs. Maybe you could find many different attacks just by changing enough hyperparameters or initializations.
On the gripping hand, now that I look at this earlier version, their description of it as a weird glitch in AZ’s evaluation of pass moves at the end of the game sounds an awful lot like your first Tromp-Taylor pass exploit, i.e. it could probably be easily fixed with some finetuning. In that case, perhaps Timbers et al. would have found the ‘circle’ exploit in AZ after all if they had gotten past that first easy end-game pass-related exploit? (This also suggests a weakness in the search procedures: they really ought to produce more than one exploit, preferably a whole list of distinct exploits. Some sort of PBT or novelty-search approach, perhaps...)
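To make the ‘list of distinct exploits’ idea slightly more concrete, here is a rough sketch of what a novelty-search-style outer loop might look like; every function it calls (train_adversary, win_rate, exploit_signature, play_victim) is a hypothetical stand-in, not anything in the existing attack code.

```python
# Hypothetical novelty-search outer loop around an adversary-training procedure.
import numpy as np

def novelty(signature, archive, k=3):
    """Mean distance to the k nearest archived exploit signatures."""
    if not archive:
        return 1.0   # before anything is archived, everything counts as novel
    dists = sorted(float(np.linalg.norm(signature - s)) for s in archive)
    return float(np.mean(dists[:k]))

def find_distinct_exploits(train_adversary, win_rate, exploit_signature,
                           n_exploits=5, novelty_weight=0.5, min_win_rate=0.9):
    archive, adversaries = [], []
    for _ in range(n_exploits):
        # Shaped objective: beat the victim, but stay far (in signature space)
        # from the exploits already found, so each run is pushed elsewhere.
        adversary = train_adversary(
            reward=lambda games: win_rate(games)
            + novelty_weight * novelty(exploit_signature(games), archive))
        games = adversary.play_victim()
        if win_rate(games) >= min_win_rate:
            archive.append(exploit_signature(games))
            adversaries.append(adversary)
    return adversaries
```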
Maybe a mechanistic interpretability approach would be better: if you could figure out where in KataGo it screws up the value estimate so badly, and what edits are necessary to make it yield the correct estimate, that might tell you where the blindness comes from far more directly than retraining from scratch would.
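As a purely hypothetical sketch of what that localization could look like (none of these methods are KataGo’s actual API), activation patching between a mis-evaluated cyclic position and a nearby correctly-evaluated one would point at the blocks most responsible for the bad value estimate:

```python
# Hypothetical activation-patching sketch; run_with_cache, run_from_block, and
# value_head are assumed interfaces, not KataGo's real API.
import torch

def locate_value_failure(model, cyclic_pos, reference_pos):
    with torch.no_grad():
        acts_bad = model.run_with_cache(cyclic_pos)    # per-block activations
        acts_good = model.run_with_cache(reference_pos)
        baseline = model.value_head(acts_bad[-1])      # the badly wrong estimate
        shifts = []
        for i in range(len(acts_bad)):
            # Re-run from block i onward with the 'good' activations patched in.
            patched = model.run_from_block(i, acts_good[i], cyclic_pos)
            shifts.append((i, float(model.value_head(patched[-1]) - baseline)))
    # Blocks whose patch moves the estimate most are candidate failure sites.
    return sorted(shifts, key=lambda t: abs(t[1]), reverse=True)
```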