What two NNs are you talking about? I thought the two-NN system had its networks in different roles, with different kinds of outputs, so that averaging them wouldn’t even make sense.
For example, in the Nature article about AlphaGo: on page 485, panel a of the figure says that a reinforcement learning rollout network is used to produce another batch of data, which is then used to train the value network.
Page 486, second column, third formula: the valuations of the two networks (the fast rollout policy and the value network) for the current position are averaged to give a final score for every possible next move, and then the most valuable move is chosen.
Interesting. To expand on that explanation, it seems that there are FOUR networks here, three of them policy networks (P networks). One is the optimized learner (P-rho), which is trained by playing against itself, starting from a pretty good network made by supervised learning (P-sigma). There’s another network (P-pi) that just makes vaguely reasonable moves but evaluates quickly and that, as far as I can tell, isn’t used for direct training of P-rho.
Then they train a new neural network to recognize position quality (v-theta), based on this optimized system playing against itself. The averaging you mention mixes that with the result of just using P-pi to finish the game quickly and see which side wins.
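A rough sketch of that mixing step, for anyone who wants to see it written out. This is only an illustration under stated assumptions, not DeepMind’s code: `evaluate_leaf`, `toy_value_net` and `toy_rollout` are hypothetical names, and the weight `lam=0.5` is chosen here because it makes the mix a plain average of the two estimates.

```python
import random


def evaluate_leaf(position, value_net, fast_rollout, lam=0.5):
    """Mix a value-network estimate with the outcome of one fast rollout.

    value_net(position)    -> float in [-1, 1], predicted game result
    fast_rollout(position) -> +1 or -1, result of quickly playing the game out
    lam                    -> mixing weight; 0.5 gives a plain average
    """
    v = value_net(position)
    z = fast_rollout(position)
    return (1.0 - lam) * v + lam * z


# Hypothetical stand-ins so the sketch runs on its own.
def toy_value_net(position):
    return 0.2  # pretend the position looks slightly favourable


_rng = random.Random(0)


def toy_rollout(position):
    return 1 if _rng.random() < 0.6 else -1  # pretend rollouts win 60% of the time


score = evaluate_leaf("some position", toy_value_net, toy_rollout)
```

With `lam=0` only the value network is used, and with `lam=1` only the rollout result is used.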
That’s rather convoluted.
Well, there are actually many, many more. Sigma is just the initial seed from which a population of rho networks is created, each evolved by playing against a random previous iteration of itself. In this way you can say that sigma and rho are just the initial and final points of an entire spectrum of networks, whose only purpose is to create the raw data used to train theta, after which they are discarded.
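The "play against a random previous iteration" loop could look roughly like this. It is a minimal sketch, assuming a policy can be copied and that `play_game` and `update` are supplied by the caller; every name here is hypothetical, and the toy stand-ins at the bottom only exist so the loop runs (a real update would be an actual learning step on the game outcome).

```python
import copy
import math
import random


def self_play_with_pool(seed_policy, play_game, update,
                        iterations=100, snapshot_every=10, seed=0):
    """Train a policy against randomly chosen earlier snapshots of itself."""
    rng = random.Random(seed)
    current = copy.deepcopy(seed_policy)
    pool = [copy.deepcopy(seed_policy)]      # the pool starts with the seed network
    for step in range(1, iterations + 1):
        opponent = rng.choice(pool)          # a random previous iteration
        outcome = play_game(current, opponent, rng)  # +1 if current wins, else -1
        current = update(current, outcome)
        if step % snapshot_every == 0:
            pool.append(copy.deepcopy(current))      # keep a snapshot for later games
    return current, pool


# Toy stand-ins: a "policy" is just a single skill number.
def toy_play_game(policy, opponent, rng):
    p_win = 1.0 / (1.0 + math.exp(opponent - policy))
    return 1 if rng.random() < p_win else -1


def toy_update(policy, outcome):
    # Placeholder only: ignores the outcome and nudges the skill upward.
    return policy + 0.01


final_policy, pool = self_play_with_pool(0.0, toy_play_game, toy_update)
```

Sampling the opponent from the whole pool, rather than always using the latest snapshot, is the part that matches the description above.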
The stroke of genius of AlphaGo, in my opinion, was complementing the rollout network, already used in many other programs, with a network whose purpose is to imitate intuition, and furthermore creating this network from the play of a pool of ‘cheap’ experts (cheap compared to human experts).
This technique could be adopted in other areas where data can be easily and automatically labelled (such as Go ending positions).
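To make "automatically labelled" concrete, here is a small sketch on a toy game instead of Go. Everything in it is an assumption made for illustration: the game is a simple Nim variant (take 1 to 3 stones, taking the last stone wins), the policy is random, and keeping one sampled position per game is a choice made here to reduce correlation between samples.

```python
import random


def play_nim(policy, stones, rng):
    """Play one game of toy Nim and return (visited states, winner).

    Each state is (stones_left, player_to_move); the winner is the player
    who takes the last stone.
    """
    states, player = [], 0
    while stones > 0:
        states.append((stones, player))
        stones -= policy(stones, rng)
        winner = player              # the player who has just moved
        player = 1 - player
    return states, winner


def random_policy(stones, rng):
    return rng.randint(1, min(3, stones))


def make_dataset(policy, n_games, start_stones=10, seed=0):
    """Label one sampled position per game with the final result."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        states, winner = play_nim(policy, start_stones, rng)
        stones, to_move = rng.choice(states)
        label = 1 if to_move == winner else -1   # result from the mover's point of view
        data.append((stones, label))
    return data


dataset = make_dataset(random_policy, n_games=1000)
```

The resulting `(position, label)` pairs need no human annotation, since the end of the game supplies the label; a value model would then be fitted to them.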
But haven’t people been having AIs do that (self-play training) for a long time? I think the most remarkable idea is to use the massive breed of policy nets only to create a judgement net, and use that in the end instead. That’s wild.