I agree that the network trained on the large random-game dataset shows every sign of having learned the rules very well, and if I implied otherwise then that was an error. (I don’t think I ever intended to imply otherwise.)
The thing I was more interested in was the difference between that and the network trained on the much smaller championship-game dataset, whose incorrect-move rate is far higher, about 5%. I’m pretty sure that (1) having a lot more games of that type, (2) having a bigger network, or (3) both would help a lot; my original speculation was that 2 was more important, but at that point I hadn’t noticed just how big the disparity in game count was. I now think it’s probably mostly 1, and I suspect that the difference between “random games” and “well-played games” is not a major factor; in particular, I don’t think it’s likely that seeing only good moves is leading the network to learn a wrong ruleset. (It’s definitely not impossible! It just isn’t how I’d bet.)
Vaniver’s suggestion was that the championship-game-trained network had learned a wrong ruleset on account of some legal moves being very rare. It doesn’t seem likely to me that this, as opposed to simply not having learned very well (either because the number of games was too small or because the positions in championship games are unrepresentative), is the explanation for an illegal move being its top prediction about 5% of the time.
It looked as if you were disagreeing with that, but the arguments you’ve made in support all seem like cogent arguments against things other than what I was intending to say, which is why I think that at least one of us is misunderstanding the other.
In particular, at no point was I saying anything about the causes of the nonzero but very small error rate (~0.01%) of the network trained on the large random-game dataset, and at no point was I saying that that network had not done an excellent job of learning the rules.
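For concreteness, the 5% and ~0.01% figures above are the fraction of positions where the network’s single top-scoring next-move prediction is not a legal move. Here is a minimal sketch of how one might compute that metric, assuming access to per-position move logits and a legal-move oracle; the function and argument names are hypothetical, not from the original experiments.

```python
import numpy as np

def illegal_top_prediction_rate(logits, legal_move_masks):
    """Fraction of positions whose highest-scoring predicted move is illegal.

    logits: array of shape (num_positions, num_moves) with the model's score
        for each candidate move at each position (hypothetical format).
    legal_move_masks: boolean array of the same shape, True where the
        corresponding move is legal in that position (from a rules oracle).
    """
    logits = np.asarray(logits)
    legal_move_masks = np.asarray(legal_move_masks, dtype=bool)

    top_moves = logits.argmax(axis=1)  # top-1 prediction per position
    top_is_legal = legal_move_masks[np.arange(len(top_moves)), top_moves]
    # ~0.05 for the championship-trained net, ~0.0001 for the random-game one
    return 1.0 - top_is_legal.mean()
```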