I’m confused. Does this show anything besides adversarial attacks working against AlphaZero-like AIs? Is it a surprising result? Is that kind of work important for reproducibility purposes regardless of surprisingness?
I find it exciting for the following:
These AIs are (were) thought to be in the Universe of Go what AGI is expected to be in the actual world we live in: an overwhelming power no human being can prevail against, especially not an amateur just by following some weird tactic most humans could defeat. It seemed they had a superior understanding of the game’s universe, but as per the article this is still not the same kind of understanding we see in humans. We may have overestimated our own understanding of how these systems work; this is an unexpected (confusing) outcome. Especially since it is implied that this is not just a bug in one specific system: other systems using similar architectures will possibly have the same defects.
I think this is a data point towards an A.I. winter in the coming years rather than FOOM.
I’d read it as the opposite in terms of safety, particularly in light of Adam’s comment on how the exploit was so hard to find and required so much search/training that they very nearly gave up before it finally started to work. (Compare to your standard adversarial attacks on eg a CNN classifier, where it’s lickety-split in seconds for most of them.)
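(For contrast, here is a minimal sketch of what such a standard attack looks like: one gradient step of FGSM against a stand-in CNN. The tiny randomly-initialized model below is only a placeholder for illustration, not any particular classifier.)

```python
# Minimal FGSM-style sketch: a single gradient step is usually enough to
# perturb an image against a CNN classifier. The model here is a tiny
# randomly-initialized placeholder, not a real trained network.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).eval()

x = torch.rand(1, 3, 32, 32, requires_grad=True)  # the input image
y = torch.tensor([3])                             # its (claimed) label

loss = F.cross_entropy(model(x), y)
loss.backward()

eps = 0.03                                        # perturbation budget
x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
# Against a real trained classifier, x_adv frequently changes the predicted
# class, and the whole thing runs in well under a second.
print(model(x).argmax().item(), model(x_adv).argmax().item())
```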
I would put myself in the ‘believes it is trivial that there are adversarial examples in KataGo’ camp, in part because DRL agents have always had adversarial examples, and in part because I suspect that the isoperimetry line of work may be correct, in which case all existing Go/chess programs like KataGo (which tend to have ~0.1b parameters) may be 3 orders of magnitude too small to remove most adversarialness (if we imagine Go is about as hard as ImageNet and take the back-of-the-envelope speculation that isoperimetry on ImageNet CNNs would require ~100b parameters).
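(To make that back-of-the-envelope explicit; both parameter counts below are the speculative figures from the sentence above, not measurements:)

```python
# Back-of-the-envelope for the isoperimetry speculation: both numbers are
# the guessed figures from the text, not measured quantities.
import math

katago_scale = 0.1e9   # ~0.1b parameters (typical of current Go/chess nets)
robust_scale = 100e9   # ~100b parameters (the speculative isoperimetry estimate)

ratio = robust_scale / katago_scale
print(f"{ratio:.0f}x, ~{math.log10(ratio):.0f} orders of magnitude")
# -> 1000x, ~3 orders of magnitude
```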
So for me the interesting aspects are (1) the exploit could be executed by a fairly ordinary human, (2) it had non-zero success on other agents (but much closer to 0% than 100%), (3) it’s unclear where the exploit comes from, but it plausibly stems from a component which is effectively obsolete in the current scaling paradigm (convolutions), and (4) it was really hard to find even given one of the most powerful attack settings (unlimited tree search over a fixed victim model).
From the standpoint of AI safety & interpretability this is very concerning, as it favors attack rather than defense. It means that fixed known instances of superhuman models can have difficult-to-find but trivially-exploitable exploits, which are not universal but highly model-specific. Further, it seems plausible given the descriptions that while such exploits may never go away, it may get harder to find one with scale, leading to both a dangerous overhang in superhuman-seeming but still vulnerable ‘safe’ models and an incentive for rogue models to scale up as fast as possible.

So you can have highly-safe aligned agents with known-good instances, which can be attacked as static victim models by dumb agents until a trivial exploit is found, which may need to succeed only once, such as in a sandbox escape onto the Internet. (The safe models may be directly available to attack, or may just be available as high-volume APIs—how would OA notice if you were running a tree search trying to exploit gpt-3-turbo when it’s running through so many billions or trillions of tokens a day across so many diverse users right now?) If you try to replace them or train a population of different agents to reduce the risk of an exploit working on any given try (eg. AlphaStar sampling agents from the AlphaStar League to better approximate Nash play), now you have to pay an alignment tax on all of that.

But asymmetrically, autonomous self-improving agents will not be equally vulnerable, simply because they will be regularly training new, larger iterations which will apparently mostly immunize them against an old exploit—and how are you going to get them to hold still long enough to acquire a copy of them to run these slow expensive ‘offline’ attacks? (It would be very polite of them to provide you checkpoints or a stable API!)
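(A toy calculation of why ‘may need to succeed only once’ is the crux here; the 1% per-try success rate is an arbitrary illustrative number, not anything measured:)

```python
# Toy calculation: even if a population defense drives the per-attempt
# exploit success rate down to p, an attacker with many independent tries
# still succeeds eventually. p here is an arbitrary illustrative number.
p = 0.01
for tries in (1, 10, 100, 1000):
    print(f"{tries:4d} independent tries -> P(>=1 success) = {1 - (1 - p) ** tries:.3f}")
# ->    1: 0.010,   10: 0.096,   100: 0.634,   1000: 1.000
```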
an overwhelming power no human being can prevail against, especially not an amateur just by following some weird tactic most humans could defeat.
Where you gonna get said weird tactic...? The fact that there exists an adversarial attack does no good if it cannot be found.
I think the cyclic group exploit could have been found by humans. The idea behind it (maybe it gets confused about liberties when a group is circular) would probably be in the top 1000 ideas for exploits that a group of humans would brainstorm. Then these would need to be tested. Finding a working implementation would be a matter of trial and error, maybe taking a week. So if you got 100 good human go players to test these 1000 ideas, the exploit would be found within ten weeks.
The main challenge might be to maintain morale, with the human go players probably being prone to discouragement after spending some weeks trying to get exploit ideas to work with no success, and hence maybe not focusing hard on trying to get the next one to work, and hence missing that it does. Maybe it would work better to have 1000 human go players each test one idea...
We’ll never know now, of course, since now everyone knows about weird circular patterns as something to try, along with more famous older exploits of computer Go programs like ladders or capture races.
First, I’d note that in terms of AI safety, either on offense or defense, it is unhelpful even if there is some non-zero probability of a large coordinated human effort finding a specific exploit. If you do that on a ‘safe’ model playing defense, it is not enough to find a single exploit (and presumably then patch it), because it can be hacked by another exploit; this is the same reason why test-and-patch is insufficient for secure software. Great, you found the ‘circle exploit’ - but you didn’t find the ‘square exploit’ or the ‘triangle exploit’, and so your model gets exploited anyway. And from the offense perspective of attacking a malign model to defeat it, you can’t run this hypothetical at all because by the time you get a copy of it, it’s too late.
So, it’s mostly a moot point whether it could be done from the AI safety perspective. No matter how you spin it, hard-to-find-but-easy-to-exploit exploits in superhuman models is just bad news for AI safety.
OK, but could humans?
I suspect that humans could find it (unaided by algorithms) with only relatively low probability, for a few reasons.
First, they didn’t find it already; open-source Go programs like Leela Zero are something like 6 years old now (it took a year or two after AG to clone it), and have been enthusiastically used by Go players, many of whom would be interested in ‘anti-AI tactics’ (just as computer chess had its ‘anti-engine tactics’ period) or could stumble across it just wanking around doing weird things like making big circles. (And ‘Go players’ here is a large number: Go is still one of the most popular board games in the world, and while DeepMind may have largely abandoned the field, it’s not like East Asians in particular stopped being interested in it or researching it.) So we have a pretty good argument-from-silence there.
Second, the history of security & cryptography research tends to show that humans overlap a lot with each other in the bugs/vulns they find, but that algorithmic approaches can find very different sets of bugs. (This should be a bit obvious a priori: if humans didn’t think very similarly and instead found uncorrelated sets of bugs, it’d be easy to remove effectively all bugs just by throwing a relatively small number of reviewers at code, to get an astronomically small probability of any bug passing all the reviewers, and we’d live in software/security paradise now. Sadly, we tend to look at the wrong code and all think the same thing: ‘lgtm’.)

More dramatically, we can see the results of innovations like fuzzing. What’s the usual result whenever someone throws a fuzzer at an unfuzzed piece of software, even one which is decades old and supposedly battle-hardened? A bazillion vulnerabilities and weirdo edge cases that all the human reviewers had somehow never noticed. (See also reward hacking and especially “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018; most of these examples were unknown, even the arcade game ones, AFAICT.)

Or here’s an example from literally today: Tavis (who is not even a CPU specialist!) applied a fuzzer using a standard technique but in a slightly novel way to AMD processors and discovered yet another simple speculative-execution bug in the CPU instructions which allows exfiltration of arbitrary data on the system (ie. arbitrary local privilege escalation, since you can read any keys or passwords or memory locations, including VM escapes, apparently). One of the most commonly-used desktop CPU archs in the world*, going back 3-4 years, from a designer with 50 years of experience, doubtless using all sorts of formal methods & testing & human review already, and yet, there you go: who would win, all that or one smol slightly tweaked algorithmic search? Oops.
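(To illustrate the ‘one smol slightly tweaked algorithmic search’ point, here is the skeleton of a dumb random fuzzer against a toy parser with a planted edge-case bug; both the parser and its bug are invented for this sketch:)

```python
# Toy illustration of fuzzing: a dumb random-input loop finds a planted
# edge-case bug that code review could easily miss. The parser and its bug
# are invented for this sketch.
import random

def parse_length_prefixed(data: bytes) -> bytes:
    """Return the payload of a 1-byte length-prefixed message."""
    if not data:
        raise ValueError("empty input")
    n = data[0]
    payload = data[1:1 + n]
    # Planted bug: nothing checks that the payload really is n bytes long.
    assert len(payload) == n, "buffer over-read equivalent"
    return payload

random.seed(0)
for i in range(100_000):
    fuzz_input = bytes(random.randrange(256) for _ in range(random.randrange(8)))
    try:
        parse_length_prefixed(fuzz_input)
    except ValueError:
        pass                      # expected rejection of empty input
    except AssertionError:
        print(f"crash on input #{i}: {fuzz_input!r}")
        break
```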
Or perhaps another more relevant example would be classic NN adversarial examples tweaking pixels: I’ve never seen a human construct one of those by hand, and if you had to hand-edit an image pixel by pixel, I greatly doubt a thousand determined humans (who were ignorant of them but trying to attack a classifier anyway) would stumble across them ever, much less put them in the top 1k hypotheses to try. These are just not human ways to do things.
* In fact, I personally almost run a Zenbleed-affected Threadripper CPU, but it’s a year too old, looks like. Still, another reason to be careful what binaries you install...
To clarify: what I am confused about is the high AF score, which probably means that there is something exciting I’m not getting from this paper. Or maybe it’s not a missing insight, but I don’t understand why this kind of work is interesting/important?
Did you think it was interesting when AIs became better than all humans at go? If so, shouldn’t you be interested to learn that this is no longer true?
Well, I wasn’t interested because AIs were better than humans at go; I was interested because it was evidence of a trend of AIs becoming better than humans at some tasks, for its future implications on AI capabilities. So from this perspective, I guess this article would be a reminder that adversarial training is an unsolved problem for safety, as Gwern said above. Still doesn’t feel like that’s all there is to it, though.
I think it may not be correct to shuffle this off into a box labelled “adversarial example” as if it doesn’t say anything central about the nature of current go AIs.
Go involves intuitive aspects (what moves “look right”), and tree search, and also something that might be seen as “theorem proving”. An example theorem is “a group with two eyes is alive”. Another is “a capture race between two groups, one with 23 liberties, the other with 22 liberties, will be won by the group with more liberties”. Human players don’t search the tree down to a depth of 23 to determine this—they apply the theorem. One might have thought that strong go AIs “know” these theorems, but it seems that they may not—they may just be good at faking it, most of the time.
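(For the capture-race ‘theorem’, the human shortcut really is a constant-time comparison rather than a 23-ply read; a minimal sketch for the simplest kind of race, one with no shared liberties and no eyes, using the example counts from the comment:)

```python
# The "theorem" a human applies to a simple capture race (no shared
# liberties, no eyes): the side with strictly more liberties wins; with
# equal liberties, the side to move wins. Constant time, no deep search.
def capture_race_winner(my_liberties: int, their_liberties: int,
                        i_move_first: bool) -> str:
    if my_liberties != their_liberties:
        return "me" if my_liberties > their_liberties else "them"
    return "me" if i_move_first else "them"

print(capture_race_winner(23, 22, i_move_first=False))  # -> "me", no 23-deep read needed
```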