“Firstly”: Yes, I oversimplified. (Deliberately, as it happens :-).) But every version of the rules that KataGo has used in its training games, IIUC, has had the feature that players are not required to capture enemy stones in territory surrounded by a pass-alive group.
I agree that in your example the white stones surrounding the big white territory are not pass-alive, so it would not be correct to say that in KG’s training this particular territory would have been assessed as winning for white.
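To make the scoring point concrete, here is a toy sketch (mine, not anything from KataGo’s code or training setup) of naive Tromp-Taylor-style area scoring, under which stones left on the board count for their owner and an empty region that touches both colours counts for neither. The 5x5 position, the board encoding, and the `area_score` helper are all invented for illustration; the point is just that a single uncaptured “dead” stone inside a territory whose surrounding group is not pass-alive can swing the computed result.

```python
from collections import deque

EMPTY, BLACK, WHITE = ".", "X", "O"

def area_score(board):
    """Return (black_points, white_points) under naive area scoring."""
    n = len(board)
    # Stones still on the board count for their owner, however "dead" they look.
    black = sum(row.count(BLACK) for row in board)
    white = sum(row.count(WHITE) for row in board)
    seen = set()
    for r in range(n):
        for c in range(n):
            if board[r][c] != EMPTY or (r, c) in seen:
                continue
            # Flood-fill this empty region and record which colours border it.
            region, borders, queue = [], set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < n and 0 <= nx < n):
                        continue
                    if board[ny][nx] == EMPTY:
                        if (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                    else:
                        borders.add(board[ny][nx])
            if borders == {BLACK}:
                black += len(region)
            elif borders == {WHITE}:
                white += len(region)
            # A region touching both colours scores for neither player.
    return black, white

# Toy 5x5 position: white's area on the right contains one stray black stone.
position = [list(row) for row in [
    "XXOO.",
    "XXOO.",
    "XXOO.",
    "XXOOX",
    "XXOO.",
]]
print(area_score(position))   # -> (11, 10): the stray stone scores for black AND neutralises white's region

# Same position with the stray stone captured/removed before scoring.
cleaned = [[EMPTY if (p == BLACK and c >= 3) else p for c, p in enumerate(row)]
           for row in position]
print(area_score(cleaned))    # -> (10, 15): white now gets the whole right-hand region
```

(Under KataGo’s modified training rules, as I understand them, the stray stone would also be removed automatically if the surrounding group were pass-alive; when it isn’t, you get the naive result above.)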
But is it right to say that it was “trained to be aware” of this technicality? That’s not so clear to me. (I don’t mean that it isn’t clear what happened; I mean it isn’t clear how best to describe it.) It was trained in a way that could in principle teach it about this technicality. But it wasn’t trained in a way that deliberately tried to expose it to that technicality so it could learn, and it seems possible that positions of the type exploited by your adversary are rare enough in real training data that it never had much opportunity to learn about the technicality.
(To be clear, I am not claiming to know that that’s actually so. Perhaps it had plenty of opportunity, in some sense, but it failed to learn it somehow.)
If you define “what KataGo was trained to know” to include everything that was the case during its training, then I agree that what KataGo actually knows equals what it was trained to know. But even if you define things that way, it isn’t true that what KataGo actually knows equals what its “intuition” has learned: if there are things its intuition (i.e., its neural network) has failed to learn, it may still be true that KataGo knows them.
I think the (technical) lostness of the positions your adversary gets low-visit KataGo into is an example of this. KataGo’s neural network has not learned to see these positions as lost, which is either a bug or a feature depending on what you think KataGo is really trying to do; but if you run KataGo with a reasonable amount of searching, then as soon as it overcomes its intuition enough to explicitly ask itself “what happens if I pass here and the opponent passes too?”, it answers “yikes, I lose” and correctly decides not to do that.
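To spell that last step out, here is a toy sketch of the kind of one-ply sanity check that search buys you; it is not KataGo’s actual search code (whose pass handling I haven’t inspected), and `score_if_both_pass`, `intuition_says_pass`, and the colour encoding are hypothetical stand-ins (e.g. the `area_score` helper sketched above).

```python
def should_pass(board, colour, intuition_says_pass, score_if_both_pass):
    """Toy one-ply check: only follow the intuition to pass if the position
    that would be scored after an answering pass is actually a win.

    `score_if_both_pass` is a stand-in for strict final scoring, e.g. the
    `area_score` sketch above; `colour` is "X" or "O" in that encoding.
    """
    if not intuition_says_pass:
        return False
    black, white = score_if_both_pass(board)
    mine, theirs = (black, white) if colour == "X" else (white, black)
    # "Yikes, I lose" -> overrule the intuition and keep playing instead.
    return mine > theirs
```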
Here’s an analogy that I think is reasonably precise. (It’s a rather gruesome analogy, for which I apologize. It may also be factually wrong, since it makes some assumptions about what people will instinctively do in a circumstance I have never actually seen a person in.) Consider a human being who drives a car, put into an environment where the road is strewn with moderately realistic models of human children. Most likely they will (explicitly or not) think “ok, these things all over the road that look like human children are actually something else, so I don’t need to be quite so careful about them”. If 0.1% of the models are in fact real human children, and the driver is tired enough to be operating mostly on instinct, sooner or later they will hit one.
If the driver is sufficiently alert, they will (one might hope, anyway) notice the signs of life in 0.1% of the child-looking-things and start explicitly checking. Then (one might hope, anyway) they will drive in a way that enables them not to hit any of the real children.
Our hypothetical driver was trained in an environment where a road strewn with fake children and very occasional real children is going to lead to injured children if you treat the fakes as fakes: the laws of physics, the nature of human bodies, and for that matter the law, weren’t any different there. But they weren’t particularly trained for this weird situation where the environment is trying to confuse you in this way. (Likewise, KataGo was trained in an environment where a large territory containing scattered very dead enemy stones sometimes means that passing would lose you the game; but it wasn’t particularly trained for that weird situation.)
Our hypothetical driver, if sufficiently attentive, will notice that something is even weirder than it initially looks, and take sufficient care not to hit anyone. (Likewise, KataGo with a reasonable number of visits will explicitly ask itself the question “what happens if we both pass here”, see the answer, and avoid doing that.)
But our hypothetical driver’s immediate intuition may not notice exactly what is going on, which may lead to disaster. (Likewise, KataGo with very few visits is relying on intuition to tell it whether it needs to consider passing as an action its opponent might take in this position, and its intuition says no, so it doesn’t consider it, which may lead to disaster.)
Does the possibility (assuming it is in fact possible) of this gruesome hypothetical mean that there’s something wrong with how we’re training drivers? I don’t think so. We could certainly train drivers in a way that makes them less susceptible to this attack: if even a small fraction of driving lessons and tests were done in a situation with lots of model people and occasional real ones, everyone would learn a higher level of paranoid caution in these situations, and that would suffice. (The most likely way for my gruesome example to be unrealistic is that maybe everyone would already be sufficiently paranoid in this weird situation.) But this just isn’t an important enough “problem” to be worth devoting training to. If we train drivers to drive in a way optimized for not hitting people who go out of their way not to be visible to the driver, the effect is probably to make drivers less effective at noticing other, more common, things that could cause accidents, or else to make them drive really slowly all of the time (which would reduce accidents, but we have collectively decided not to prioritize that so much; otherwise our speed limits would be lower everywhere).
Similarly, it doesn’t seem to me that your adversarial attack indicates anything wrong with how KataGo is trained. It could probably be trained in ways that would make it less vulnerable to your attack, but only at the cost either of using more of its neural network for spotting this kind of nonsense and therefore being weaker overall (~ human drivers making more mistakes of other kinds because they’re focusing on places where quasi-suicidal people could be lurking), or of just being extra-cautious about passing and therefore more annoying to human opponents for no actual gain in strength in realistic situations (~ human drivers driving more slowly all the time).
(From the AI-alignment perspective, I quite like ChristianKl’s take on this: KataGo, even though trained in an environment that in some sense “should” have taught it to avoid passing in some totally-decided positions, has learned to pass in them anyway, thus being more friendly to actual human beings at the cost of being a little more exploitable. Of course, depending on just how you make the analogy with AI-alignment scenarios, it doesn’t have to look so positive! But I do think it’s interesting that the actual alignment effect in this case is a beneficial one: KG has ended up behaving in a way that suits humans.)
For the avoidance of doubt, I am not denying that you have successfully built an adversary that can exploit a limitation in KataGo’s intuition. I am just not convinced that that should be regarded as a problem for KataGo. Its intuition isn’t meant to solve all problems; if it could, it wouldn’t need to be able to search.