VojtaKovarik comments on Even Superhuman Go AIs Have Surprising Failure Modes

VojtaKovarik 23 Jul 2023 0:45 UTC
LW: 8 AF: 1
3
My reaction to this is something like:
Academically, I find these results really impressive. But, uhm, I am not sure how much impact they will have? As in: it seems very unsurprising^[1] that something like this is possible for Go. And, also unsurprisingly, something like this might be possible for anything that involves neural networks (at least in some cases, and we don’t have a good theory for when yes/no). But also, people seem to not care. So perhaps we should be asking something else? Like, why is it that people don’t care? Suppose you managed to demonstrate failures like this in settings X, Y, and Z. Would this change anything? And also, when do these failures actually matter? [Not saying they don’t, just that we should think about it.]
To elaborate:
- If you understand neural networks (and how Go algorithms use them), it should be obvious that these algorithms might in principle have various vulnerabilities. You might become more confident about this once you learn about adversarial examples for image classifiers or hear arguments like “feed-forward networks can’t represent recursively-defined concepts”. But in a sense, the possibility of vulnerabilities should seem likely to you just based on the fact that neural networks (unlike some other methods) come with no relevant worst-case performance guarantees. (And to be clear, I believe all of this has indeed been obvious to the authors ever since AlphaGo came out.)
- So if your application is safety-critical, security mindset dictates that you should not use an approach like this. (Though Go and many other domains aren’t safety-critical, hence my question “when does this matter”.)
- Viewed from this perspective, the value added by the paper is not “Superhuman Go AIs have vulnerabilities” but “Remember those obviously-possible vulnerabilities? Yep, it is as we said, it is not too hard to find them”.
- Also, I (sadly) expect that reactions to this paper (and similar results) will mostly fall into one of the following two camps: (1) Well, duh! This was obvious. (2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities. I would be hoping for a reaction such as (3) [Oh, ok! So failures like this are probably possible for all neural networks. And no safety-critical system should rely on neural networks not having vulnerabilities, got it.] However, I mostly expect that anybody who doesn’t already believe (1) and (3) will just react as (2).
- And this motivates my point about “asking something else”. E.g., how do people who don’t already believe (3) think about these things, and which arguments would they find persuasive? Is it efficient to just demonstrate as many of these failures as possible? Or are some failures more useful than others, or does this perhaps not help at all? Would it help with “goalpost moving” if we first made some people commit to specific predictions (e.g., “I believe scale will solve the general problem of robustness, and in particular I think AlphaZero has no such vulnerabilities”)?
1. At least I remember thinking this when AlphaZero came out. (We did a small project in 2018 where we found a way to exploit AlphaZero in the tiny connect-four game, so this isn’t just misremembering / hindsight bias.)
- AdamGleave 23 Jul 2023 22:47 UTC
  LW: 31 AF: 14
  6
  When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool’s errand. Around half the people told me they thought it was extremely unlikely I’d find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating them. Both sets of people were usually very confident in their views. In retrospect I wish I’d done a survey (even an informal one) before conducting this research to get a better sense of people’s views.
  
  Personally I’m in the camp that thought vulnerabilities like these were highly likely to exist, given the failures we’ve seen in other ML systems and the lack of any worst-case guarantees. But I was very unsure going in how easy they’d be to find. Go is a pretty limited domain, and it’s not enough to beat the neural network: you’ve got to beat Monte-Carlo Tree Search as well (and MCTS does have worst-case guarantees, albeit only in the limit of infinite search). Additionally, there are results showing that scale improves robustness (e.g. more pre-training data reduces vulnerability to adversarial examples in image classifiers).
  
  In fact, although the method we used is fairly simple, actually getting everything to work was non-trivial. There was one point after we’d patched the first (rather degenerate) pass-attack that the team was doubting whether our method would be able to beat the now stronger KataGo victim. We were considering cancelling the training run, but decided to leave it going given we had some idle GPUs in the cluster. A few days later there was a phase shift in the win rate of the adversary: it had stumbled across some strategy that worked and finally was learning.
  
  This is a long-winded way of saying that I did change my mind as a result of these experiments (towards robustness improving less with scale than I’d previously thought). I’m unsure how much effect it will have on the broader ML research community. The paper is getting a fair amount of attention, and is a nice pithy example of a failure mode. But as you suggest, the issue may be less a difference in concrete belief (surely any ML researcher would acknowledge adversarial examples are a major problem and one that is unlikely to be solved any time soon) than one of culture (to what degree is a security mindset appropriate?).
  
  This post was written as a summary of the results of the paper, intended for a fairly broad audience, so we didn’t delve much into the theory of change behind this agenda here. You might find that this blog post, which describes the broader research agenda this paper fits into, provides some helpful context, and I’d be interested to hear your feedback on that agenda.
- Noosphere89 23 Jul 2023 17:21 UTC
  LW: 3 AF: 2
  −2
  
  (2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities.
  
  I actually do expect this to happen, and importantly I think this result is basically of academic interest, primarily because it is probably known why this adversarial attack can happen at all: it’s the large-scale cycles on the game board. This is almost certainly going to be solved by new training, so I find it a curiosity at best.
  - VojtaKovarik 23 Jul 2023 18:36 UTC
    LW: 2 AF: 1
    0
    Yup, this is a very good illustration of the “talking past each other” that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.
    1) Hinting at the “curiosity at best” view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren’t many incentives to go look for those vulnerabilities. (And it might even be that if Adam Gleave didn’t focus his PhD on this general class of failures, we would never have encountered even this vulnerability.)
    However, whether additional vulnerabilities exist seems like an entirely different question. Sure, there will only be finitely many vulnerabilities. But how confident are we that this cyclic-groups one is the last one? For example, I suspect that you might not be willing to give 1:1000 odds on whether we would encounter new vulnerabilities if we somehow spent 50 researcher-years on this.
    But I expect that you might say that this does not matter, because vulnerabilities in Go do not matter much, and we can just keep hotfixing them as they come up?
    2) And the other view seems to be something like: Yes, Go does not matter. But we were only using Go (and image classifiers, and virtual-environment football) to illustrate a general point, that these failures are an inherent part of deep learning systems. And for many applications, that is fine. But there will be applications where it is very much not fine (eg, aligning strong AIs, cyber-security, economy in the presence of malicious actors).
    And at this point, some people might disagree and claim something like “this will go away with enough training”. This seems fair, but I think that if you hold this view, you should make some testable predictions (and ideally ones that we can test prior to having superintelligent AI).
    And, finally, I think that if you had this argument with people in 2015, many of them would have made predictions such as “these exploits work for image classifiers, but they won’t work for multiagent RL”. Or “this won’t work for vastly superhuman Go”.
    Does this make sense? Assuming you still think this is just an academic curiosity, do you have some testable predictions for when/which systems will no longer have vulnerabilities like this? (Pref. something that takes fewer than 50 researcher-years to test :D.)
    - Noosphere89 23 Jul 2023 18:46 UTC
      4 points
      0
      I was mostly focusing on the one vulnerability presented, and thus didn’t want to make any large-scale claims on whether this will entirely go away. The reason I labeled it a curiosity is that the adversarial attack exploits a weakness of a specific type of neural network, and importantly, given the issue, it looks like a solvable one.