KataGo basically plays according to the rules that human players use for Go, and it would win under those rules.
The rules for computer Go differ from the normal human rules in a few technicalities, and the adversarial attack relies on abusing those technicalities.
The most likely explanation of this result is that KataGo is not built to optimize its play under the technical rules of computer Go but to play according to the normal Go rules that humans use. KataGo is not a project created to play against bots but to give human Go players access to a Go engine. It would likely annoy its users if it didn't play according to the normal human Go rules.
As far as the significance for alignment goes, the result of this is:
KataGo aligns with human values even when that means it loses under the technical experiment proposed here. KataGo manages not to Goodhart on the rules of computer Go but optimizes for what humans actually care about.
Given that this is paid for by the Fund for Alignment Research, it's strange that nobody congratulated KataGo on this achievement.
This bit
KataGo is not built to optimize its play under the technical rules of computer go but to play according to the normal go rules that humans use
is definitely wrong. KataGo is able to use a variety of different rulesets, and does during its training, including the Tromp-Taylor rules used in the paper. Earlier versions of KataGo didn’t (IIRC) have the ability to play with a wide variety of rulesets, and only used Tromp-Taylor.
[EDITED to add:] … Well, almost. As has been pointed out elsewhere in this discussion, what KG actually used in training (and I think still does, along with other more human-like rulesets) is Tromp-Taylor with a modification that makes it not require dead stones in its territory to be captured. I don’t think that counts as “the normal go rules that humans use”, but it is definitely more human-like than raw Tromp-Taylor, so “definitely wrong” above is too strong. It may be worth noting explicitly that with the single exception of passing decisions (which is what is being exploited here) raw Tromp-Taylor and modified Tromp-Taylor lead to identical play and identical scores. [END of addition-in-edit.]
KataGo does have an option that makes it pass more readily in order to be nice to human opponents, but that option was turned off for the attack in the paper.
The reason the attack is able to succeed is that KataGo hasn’t learned to spot instantly every kind of position where immediate passing would be dangerous because its opponent might pass and (because of a technicality) win the game. If you give it enough visits that it actually bothers to check what happens when its opponent passes, it sees that that would be bad and is no longer vulnerable. In practice, it is unusual for anyone to use KataGo with as few visits as were used in the paper.
There is some truth to the idea that the attack is possible because KataGo doesn’t care about computer-rules technicalities, but the point isn’t that KataGo doesn’t care but that KataGo’s creator is untroubled by the fact that this attack is possible because (1) it only happens in artificial situations and (2) it is pretty much completely fixed by search, which is a perfectly good way to fix it. (Source: discussion on the Discord server where a lot of computer go people hang out.)
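To make the passing technicality concrete, here is a minimal sketch (my reconstruction, not KataGo's actual code) of the check that deeper search effectively performs; `raw_score_margin` is a hypothetical stand-in for a raw Tromp-Taylor scorer, like the flood-fill sketch further down:

```python
from typing import Callable

def safe_to_pass(board, my_color: str,
                 raw_score_margin: Callable[[object, str], float]) -> bool:
    """Under Tromp-Taylor, two consecutive passes end the game and the board
    is scored exactly as it stands, with no dead-stone removal. Passing
    changes no stones, so the check reduces to: is the current board,
    scored raw, still a win for me? With too few visits, search may never
    get around to asking this question about the opponent's pass reply."""
    return raw_score_margin(board, my_color) > 0
```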
Okay, I downloaded KataGo to see how it plays and read its rules description. It seems to have actually been trained so that under area rules it doesn't maximize its points.
This is surprising to me because one of the annoying aspects of AlphaGo was that it didn't maximize the number of points by which it wins the game but only cared about winning. KataGo, under territory rules, seems to play to maximize points and avoid those negative-point moves that AlphaGo makes at the end of the game when it's ahead by a lot of points.
Humans generally do care about the score at the end of the game, so that behavior under area-scoring rules is surprising to me.
Official Chinese rules do have a concept of removing dead stones. All the KGS rulesets also have an option for handling dead stone removal.
A fix that would let KataGo beat the adversarial policy would be to implement rules for Chinese Go that are more like the actual KGS rules (likely by just letting it have the cleanup phase with Chinese rules as well), and generally to tell KataGo to optimize for winning with the highest importance, then optimize for score, and lastly optimize for a minimal number of moves played before passing.
If you did that, you could train it on the different rule sets and it wouldn't produce this problem. The fact that you need to do that to prevent the adversarial policy is indeed interesting.
That suggests that if you have one metric, adding a second metric that's a proxy for the first as a secondary optimization goal can help get around some adversarial attacks, especially if the first metric is binary and the second one has a lot more possible values (see the sketch after this comment).
It's interesting here that humans do naturally care about scores when you let them play Go, which is what gets them to avoid this kind of adversarial attack.
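A minimal sketch of the priority ordering proposed above; this is my own construction, assuming nothing about KataGo's internals, and just shows how a "win first, then score, then fewest moves before passing" objective can be expressed as lexicographic comparison:

```python
# Toy illustration (not KataGo's code) of the proposed priority order:
# winning first, then score margin, then passing early. Python compares
# tuples lexicographically, which gives exactly that ordering.
def outcome_key(won: bool, score_margin: float, moves_played: int):
    return (won, score_margin, -moves_played)

candidates = [
    (True, 0.5, 211),    # narrow win, short game
    (True, 12.0, 260),   # bigger win, longer game
    (False, -0.5, 180),  # loss, shortest game of all
]
best = max(candidates, key=lambda c: outcome_key(*c))
assert best == (True, 12.0, 260)  # a bigger win trumps passing earlier
```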
What KataGo tries to maximize is basically winning probability plus epsilon times score difference. (It’s not exactly that; I don’t remember exactly what it is; but that’s the right kind of idea.) So it mostly wants to win rather than lose, but prefers to win by more if the cost in winning probability is small, which as you say helps to avoid the sort of “slack” moves that AlphaGo and Leela Zero tend to make once the winner is more or less decided.
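For concreteness, a toy version of that utility; the functional form and the constants here are my guesses for illustration, not KataGo's actual score utility:

```python
import math

def move_utility(win_prob: float, expected_score_diff: float,
                 score_weight: float = 0.05, score_scale: float = 20.0) -> float:
    # Mostly winning probability, plus a small, bounded bonus for the
    # expected score margin. The tanh keeps the score term capped, so no
    # score lead is ever worth more than a little winning probability.
    return win_prob + score_weight * math.tanh(expected_score_diff / score_scale)

# Winning 90% of the time by 1 point beats winning 85% of the time by 30,
# but among equally likely wins, the bigger margin is preferred.
assert move_utility(0.90, 1.0) > move_utility(0.85, 30.0)
assert move_utility(0.90, 10.0) > move_utility(0.90, 1.0)
```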
The problem here seems to be that it's not preferring to win by more under area rules. If it preferred to win by more points under area rules, it would capture all the stones before passing. It doesn't do that once it thinks it has enough points to win anyway under area rules.
This attack is basically about giving KataGo the impression that it has enough points anyway and doesn't need to capture stones to win.
Likely the heuristic of epsilon times score difference does not reward getting more points over passing, but it does reward playing a move that's worth more points over a move that's worth less.
I’m not sure I understand. With any rules that allow the removal of dead stones, there is no advantage to capturing them. (With territory-scoring rules, capturing them makes you worse off. With area-scoring rules, capturing them makes no difference to the score.) And with rules that don’t allow the removal of dead stones, white is losing outright (and therefore needs to capture those stones even if it’s only winning versus losing that matters). How would caring more about score make KG more inclined to bother capturing the stones?
With area-scoring rules that don't allow the removal of dead stones in normal training games, KataGo has to decide whether it can already pass or whether it should go through the work of capturing any remaining stones. I let KataGo play one training game, and it looked to me like its default strategy is not to capture all the stones but only enough to win by a sufficient margin.
It doesn't have a habit of "always capturing all the stones to get maximum score under area rules". If it had that habit, I don't think it would show this failure case.
In training games I think the rules it's using do allow the removal of dead stones. If it chooses not to remove them, it isn't because it doesn't care about the points it would have gained by removing them; it's because it doesn't think it would gain any points by removing them.
There is no possible habit of "always capturing all the stones to get maximum score under area rules". Even under area rules you don't get more points for capturing the stones (unless the stones are not actually dead according to the rules you're using, or in human games according to negotiation with the opponent).
What am I missing?
I think that currently, under area-scoring rules, KataGo doesn't capture stones that would be dead by human convention but are not dead by KataGo's rules, provided capturing them isn't necessary to win the game.
That’s correct, at least roughly—the important difference is that it’s not “isn’t necessary to win the game” but “doesn’t make any difference to the outcome, including score difference”—but I don’t see what it has to do with the more specific thing you said above:
The problem seems to be that it’s not preferring to win by more under area rules.
KataGo does prefer to win by more, whatever rules it’s playing under; a stronger preference for winning by more would not (so far as I can see) make any difference to its play in positions like the ones reached by the adversarial agent; KataGo does not generally think “that it has enough points anyway and doesn’t need to capture stones to win” and even if it did that wouldn’t make the difference between playing on and passing in this situation.
Unless, again, I’m missing something, but we seem to be having some sort of communication difficulty because nothing you write seems to me responsive to what I’m saying (and quite possibly it feels the same way to you, with roles reversed).
What makes you believe that KataGo is “not preferring to win by more under area rules”?
Yeah, this is burying the lede here.
However, there isn't a platonic form of Go rules, so what rules you make really matters.
Yes, there are multiple rule sets. Under all of those that humans use to score their games, KataGo wins in the examples.
As they put it on the linked website:
We score the game under Tromp-Taylor rules as the rulesets supported by KGS cannot be automatically evaluated.
It's complex to automatically evaluate Go positions according to the rules that humans use. That's why people in the computer Go community invented their own rules to make positions easier to evaluate: the Tromp-Taylor rules.
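As a sketch of how little machinery that takes (my own illustration, not KataGo's or the paper's code), raw Tromp-Taylor area counting is one flood fill:

```python
from collections import deque

def tromp_taylor_margin(board: list[str]) -> int:
    """Raw Tromp-Taylor area score, Black minus White, komi ignored.
    A color scores its stones plus any empty region that touches only
    that color. No judgment about 'dead' stones is ever needed: whatever
    is left on the board counts, which is what makes this automatable
    and also what the adversarial policy exploits."""
    rows, cols = len(board), len(board[0])
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            point = board[r][c]
            if point in score:
                score[point] += 1
            elif (r, c) not in seen:
                # Flood-fill this empty region, noting bordering colors.
                seen.add((r, c))
                queue, size, borders = deque([(r, c)]), 0, set()
                while queue:
                    y, x = queue.popleft()
                    size += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            if board[ny][nx] == "." and (ny, nx) not in seen:
                                seen.add((ny, nx))
                                queue.append((ny, nx))
                            elif board[ny][nx] in score:
                                borders.add(board[ny][nx])
                # The region is territory only if it touches exactly one color.
                if len(borders) == 1:
                    score[borders.pop()] += size
    return score["B"] - score["W"]

# The lone white stone still counts a point for White, and the open empty
# region touches both colors, so it scores for nobody.
assert tromp_taylor_margin(["BB.",
                            "BW.",
                            "..."]) == 2
```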
Given that KataGo's target audience wasn't people playing against computer bots, the KataGo developers went to the trouble of modifying the Tromp-Taylor rules to be more like the rulesets that humans use to score their games, and then used the new scoring algorithm to train KataGo.
KataGo's developers put effort into aligning KataGo with the desires of human users, and it pays off: in the scenarios the paper lists, KataGo behaves the way humans would want it to behave instead of behaving optimally according to Tromp-Taylor rules.
We see this in a lot of alignment problems. The metrics that are easy for computers to use and score are often not what humans care about. The task of alignment is about how to get our AI not to Goodhart on the easy metric but to focus on what we care about.
It would have been easier to create KataGo in a way that wins in the paper's examples than to go to the effort of making it behave the way it does.
Edit: The situation is less clear-cut than it first appeared to me; more information in my comment: https://www.lesswrong.com/posts/jg3mwetCvL5H4fsfs/adversarial-policies-beat-professional-level-go-ais?commentId=ohd6CcogEELkK2DwH