I think the cyclic group exploit could have been found by humans. The idea behind it (maybe it gets confused about liberties when a group is circular) would probably be in the top 1000 ideas for exploits that a group of humans would brainstorm. Then these would need to be tested. Finding a working implementation would be a matter of trial and error, maybe taking a week per idea. So if you got 100 good human Go players to test these 1000 ideas, the exploit would be found within ten weeks.
The main challenge might be maintaining morale: after spending weeks trying to get exploit ideas to work with no success, the human Go players would probably grow discouraged, stop focusing hard on making the next idea work, and so miss the one that does. Maybe it would work better to have 1000 human Go players each test one idea...
We’ll never know now, of course, since everyone now knows about weird circular patterns as something to try, along with more famous older exploits of computer Go programs like ladders or capture races.
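(As a rough sanity check on the commenter’s arithmetic, here is a toy Monte Carlo sketch of the proposed search. Every parameter is an assumption made for illustration: 1000 candidate ideas of which exactly one works, 100 players testing in parallel, one week per test.)

```python
import random

def weeks_to_find(n_ideas=1000, n_players=100, weeks_per_test=1, seed=None):
    """Toy model of the proposed search: ideas are tested in a random order,
    n_players at a time per week; exactly one idea actually works."""
    rng = random.Random(seed)
    order = list(range(n_ideas))
    rng.shuffle(order)
    position = order.index(0)            # idea #0 is the one real exploit
    week_found = position // n_players + 1
    return week_found * weeks_per_test

trials = [weeks_to_find(seed=s) for s in range(10_000)]
print(max(trials), sum(trials) / len(trials))  # worst case 10 weeks, ~5.5 on average
```

Under those (generous) assumptions the worst case is indeed 10 weeks, with about 5.5 on average; the real uncertainty is whether the right idea makes the top-1000 list at all.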
First, I’d note that in terms of AI safety, whether on offense or defense, it is unhelpful even if there is some non-zero probability of a large coordinated human effort finding a specific exploit. If you do that on a ‘safe’ model playing defense, it is not enough to find a single exploit (and presumably then patch it), because the model can still be hacked by another exploit; this is the same reason why test-and-patch is insufficient for secure software. Great, you found the ‘circle exploit’, but you didn’t find the ‘square exploit’ or the ‘triangle exploit’, and so your model gets exploited anyway. And from the offense perspective of attacking a malign model to defeat it, you can’t run this hypothetical at all, because by the time you get a copy of it, it’s too late.
So, it’s mostly a moot point whether it could be done, from the AI safety perspective. No matter how you spin it, hard-to-find-but-easy-to-exploit vulnerabilities in superhuman models are just bad news for AI safety.
OK, but could humans have found it?
I suspect that humans, unaided by algorithms, would have found it with only relatively low probability, for a few reasons.
First, they didn’t find it already; open-source Go programs like Leela Zero are something like 6 years old now (it took a year or two after AlphaGo to clone it), and have been enthusiastically used by Go players, many of whom would be interested in ‘anti-AI tactics’ (just as computer chess had its ‘anti-engine tactics’ period) or could stumble across it just wanking around doing weird things like making big circles. (And ‘Go players’ here is a large number: Go is still one of the most popular board games in the world, and while DeepMind may have largely abandoned the field, it’s not like East Asians in particular stopped being interested in it or researching it.) So we have a pretty good argument-from-silence there.
Second, the history of security & cryptography research tends to show that humans overlap a lot with each other in the bugs/vulns they find, but that algorithmic approaches can find very different sets of bugs. (This should be a bit obvious a priori: if humans didn’t think very similarly and instead found uncorrelated sets of bugs, it’d be easy to remove effectively all bugs just by throwing a relatively small number of reviewers at code: if each reviewer independently missed a given bug even half the time, 20 reviewers would all miss it with probability 0.5^20, about one in a million, and we’d live in a software/security paradise now. Sadly, we tend to look at the wrong code and all think the same thing: ‘lgtm’.)

More dramatically, we can see the results of innovations like fuzzing. What’s the usual result whenever someone throws a fuzzer at an unfuzzed piece of software, even one which is decades old and supposedly battle-hardened? A bazillion vulnerabilities and weird edge cases that all the human reviewers had somehow never noticed. (See also reward hacking and especially “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018; most of those examples were unknown beforehand, even the arcade-game ones, AFAICT.)

Or here’s an example from literally today: Tavis Ormandy (who is not even a CPU specialist!) applied a fuzzer using a standard technique, but in a slightly novel way, to AMD processors and discovered yet another simple speculative execution bug in the CPU instructions which allows exfiltration of arbitrary data on the system (i.e. arbitrary local privilege escalation, since you can read any keys or passwords or memory locations, including VM escapes, apparently). One of the most commonly-used desktop CPU architectures in the world*, going back 3–4 years, from a designer with 50 years of experience, doubtless already using all sorts of formal methods & testing & human review; yet there you go: who would win, all that, or one smol slightly-tweaked algorithmic search? Oops.
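(To make the contrast concrete, here is a minimal sketch of the dumb random-fuzzing loop being described. It is a toy illustration of the principle only, nothing like the differential setup actually used against the CPUs, and parse_header is a hypothetical buggy target invented for the example:)

```python
import random

def parse_header(data: bytes) -> int:
    """Hypothetical target: a tiny buggy parser standing in for real software."""
    if len(data) < 4:
        raise ValueError("too short")
    length = data[0]
    # BUG: trusts the length byte, so it reads past the end of short inputs.
    return data[1 + length]

def fuzz(target, iterations=10_000, max_len=32, seed=0):
    """Throw random byte strings at the target; keep any input that crashes it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(max_len)))
        try:
            target(data)
        except ValueError:
            pass                         # expected rejection, not interesting
        except Exception as exc:
            crashes.append((data, exc))  # an input no reviewer thought to try
    return crashes

print(len(fuzz(parse_header)), "crashing inputs found")
```

The loop is mindless, but it visits corners of the input space that no reviewer’s mental model ever does, which is the same relationship the algorithmic search that found the cyclic-group exploit had to human Go intuition.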
Or perhaps another, more relevant, example would be classic NN adversarial examples made by tweaking pixels: I’ve never seen a human construct one of those by hand, and if you had to hand-edit an image pixel by pixel, I greatly doubt a thousand determined humans (who were ignorant of them but trying to attack a classifier anyway) would ever stumble across them, much less put them in the top 1000 hypotheses to try. These are just not human ways of doing things.
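(For contrast, the way a machine constructs one is a single gradient computation, as in the fast gradient sign method of Goodfellow et al 2014. A minimal numpy sketch on a toy logistic-regression ‘classifier’; the weights and the ‘image’ here are random stand-ins, not a trained model:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 'classifier': logistic regression on a flattened 28x28 image.
w = rng.normal(size=28 * 28)   # stand-in weights, not a trained model
b = 0.0

def predict(x):
    return 1 / (1 + np.exp(-(x @ w + b)))  # P(class = 1)

def fgsm(x, y_true, epsilon=0.03):
    """Fast gradient sign method: for a linear model under logistic loss,
    the gradient of the loss w.r.t. the input is (p - y) * w, so one
    sign-step of size epsilon maximally perturbs the logit under an
    L-infinity budget of epsilon per pixel."""
    grad_loss = (predict(x) - y_true) * w
    return np.clip(x + epsilon * np.sign(grad_loss), 0.0, 1.0)

x = rng.uniform(size=28 * 28)       # a random 'image' with pixels in [0, 1]
x_adv = fgsm(x, y_true=1.0)
print(predict(x), predict(x_adv))   # confidence collapses after the attack
print(np.abs(x_adv - x).max())      # yet no pixel moved by more than 0.03
```

Even in this toy case the attack is a simultaneous computation over all 784 pixels, each nudged imperceptibly; a human editing pixels by hand is searching the same space blind.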
* In fact, I personally came close to running a Zenbleed-affected Threadripper CPU, but it looks like mine is a year too old. Still, another reason to be careful what binaries you install...