When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool’s errand. Around half the people told me they thought it was extremely unlikely I’d find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I’d done a survey (even an informal one) before conducting this research to get a better sense of people’s views.
Personally, I’m in the camp that thought the existence of vulnerabilities like these was highly likely, given the failures we’ve seen in other ML systems and the lack of any worst-case guarantees. But I was very unsure going in how easy they’d be to find. Go is a pretty limited domain, and it’s not enough to beat the neural network: you’ve got to beat Monte-Carlo Tree Search as well (and MCTS does have worst-case guarantees, albeit only in the limit of infinite search). Additionally, there are results showing that scale improves robustness (e.g. more pre-training data reduces vulnerability to adversarial examples in image classifiers).
In fact, although the method we used is fairly simple, actually getting everything to work was non-trivial. There was one point, after we’d patched the first (rather degenerate) pass-attack, when the team was doubting whether our method would be able to beat the now stronger KataGo victim. We were considering cancelling the training run, but decided to leave it going given we had some idle GPUs in the cluster. A few days later there was a phase shift in the adversary’s win rate: it had stumbled across some strategy that worked and was finally learning.
This is a long-winded way of saying that I did change my mind as a result of these experiments (towards robustness improving less with scale than I’d previously thought). I’m unsure how much effect the paper will have on the broader ML research community. It’s getting a fair amount of attention, and it’s a nice pithy example of a failure mode. But as you suggest, the issue may be less a difference in concrete belief (surely any ML researcher would acknowledge that adversarial examples are a major problem, and one that is unlikely to be solved any time soon) than one of culture (to what degree is a security mindset appropriate?).
This post was written as a summary of the paper’s results, intended for a fairly broad audience, so we didn’t delve much into the theory of change behind this agenda here. You might find that this blog post describing the broader research agenda this paper fits into provides some helpful context, and I’d be interested to hear your feedback on that agenda.