Noosphere89 comments on Even Superhuman Go AIs Have Surprising Failure Modes

Noosphere89 23 Jul 2023 17:21 UTC
LW: 3 AF: 2
−2
AF

(2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities.

I actually do expect this to happen, and importantly I think this result is basically of academic interest, primarily because it is probably known why this adversarial attack can happen at all: the large-scale cycles on a game board. This is almost certainly going to be solved by new training, so I find it a curiosity at best.
- VojtaKovarik 23 Jul 2023 18:36 UTC
  LW: 2 AF: 1
  0
  AF Parent
  Yup, this is a very good illustration of the “talking past each other” that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.
  1) Hinting at the “curiosity at best” view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren’t many incentives to go look for those vulnerabilities. (And it might even be that if Adam Gleave didn’t focus his PhD on this general class of failures, we would never have encountered even this vulnerability.)
  However, whether additional vulnerabilities exist seems like an entirely different question. Sure, there will only be finitely many vulnerabilities. But how confident are we that this cyclic-groups one is the last one? For example, I suspect that you might not be willing to give 1:1000 odds on whether we would encounter new vulnerabilities if we somehow spent 50 researcher-years on this.
  But I expect that you might say that this does not matter, because vulnerabilities in Go do not matter much, and we can just keep hotfixing them as they come up?
  2) And the other view seems to be something like: Yes, Go does not matter. But we were only using Go (and image classifiers, and virtual-environment football) to illustrate a general point: that these failures are an inherent part of deep learning systems. And for many applications, that is fine. But there will be applications where it is very much not fine (e.g., aligning strong AIs, cyber-security, the economy in the presence of malicious actors).
  And at this point, some people might disagree and claim something like “this will go away with enough training”. This seems fair, but I think that if you hold this view, you should make some testable predictions (and ideally ones that we can test prior to having superintelligent AI).
  And, finally, I think that if you had this argument with people in 2015, many of them would have made predictions such as “these exploits work for image classifiers, but they won’t work for multiagent RL”. Or “this won’t work for vastly superhuman Go”.
  Does this make sense? Assuming you still think this is just an academic curiosity, do you have some testable predictions for when/which systems will no longer have vulnerabilities like this? (Pref. something that takes fewer than 50 researcher-years to test :D.)
  - Noosphere89 23 Jul 2023 18:46 UTC
    4 points
    0
    Parent
    I was mostly focusing on the one vulnerability presented, and thus didn’t want to make any large-scale claims about whether this will entirely go away. I labeled it a curiosity because the adversarial attack exploits a weakness of a specific type of neural network, and, importantly, the issue looks like a solvable one.