It would be nice if some mechanistic interpretability researchers could find out what algorithm KataGo is using, but presumably the policy network is much too large for any existing methods to be useful.
Haoxing Du did some work on this when she was at Redwood—I’m not sure if any of her work is public, though. Also, some of her REMIXers looked into various aspects as well, but didn’t get very far.
Yes, I did some interpretability on the policy network of Leela Zero. Planning to post the results very soon! But I did not particularly look into the attack described here, and while there was one REMIX group that looked into a problem related to liberty counting, they didn’t get very far. I do agree this is an obvious problem to tackle with interpretability. I think it’s likely not that hard to get a rough idea of why the cyclic attack works.