I worked on some neural networks this summer (mostly as a learning experience) and this matches my experience. I spent almost all of my time trying to get good training data, and if I had good training data, our previous method would work just as well (and faster).
Will it “bust”, or will it be superseded by the next wave of more general AI architectures designed to require less supervision, less data?
If the typical pattern holds:
Step one: A new trick is discovered that solves some problem X which couldn’t be handled before.
Step two: People try to apply it to everything the old approaches didn’t work on, like problem Y, which is sort of in the same problem class. At this stage, overly enthusiastic people may over-promise: “I’m sure it will work amazingly on Y.”
Step three: “Bah! These CS types never deliver, Y will always be better done by humans.”
Step four: Interest and funding flee as the news stops paying attention. A few people keep chipping away at the problem, eventually slightly outperform humans on Y, and try to get it to work on Z.
Step five: Someone proves mathematically that it can never solve a major set of problems in Z.
Step six: Someone comes up with a new trick… GOTO 1
It’s interesting to note that AlphaGo’s improvement was to use data to train a neural network, then use this network to produce even more data (which in the case of Go is easily labelled), and then use this second wave of data to train a second neural network. The output was then the average of the two NNs’ outputs.
What two NNs are you talking about? I thought the two NNs in the system had different roles with different kinds of outputs, so that averaging them wouldn’t even make sense.
For example, in the Nature article about AlphaGo: page 485, figure a says that a reinforcement learning policy network is used to produce another batch of data, which is then used to train the value network.
Page 486, second column, the third formula: the valuations from the fast rollout policy and the value network for the current position are averaged to give a final score for every possible next move, and then the most valuable move is chosen.
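For reference, the formula in question is, as far as I can reconstruct it from the paper, a weighted mix rather than a fixed average. The symbol names below follow my reading of the paper, so treat them as an assumption:

```latex
V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L
```

Here s_L is the leaf position being evaluated, v_θ(s_L) is the value network’s estimate for it, z_L is the outcome of a fast rollout played from that leaf, and λ is the mixing weight. Setting λ = 0.5 gives the plain average described above.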
Interesting. To expand on that explanation, it seems that there are FOUR networks here, three of them policy networks (P networks). One is the optimized learner (P-rho), which is trained by playing against itself, starting from a pretty good network made by supervised learning (P-sigma). There’s another network (P-pi) that just makes vaguely reasonable moves but evaluates quickly, and as far as I can tell it isn’t used for direct training of P-rho.
Then they train a new neural network to recognize position quality (v-theta, the value network) based on this optimized system playing against itself. The averaging you mention is mixing that with the results of just using P-pi to finish the game quickly and see which side wins.
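To make the mixing concrete, here is a minimal Python sketch on a toy game (take 1 to 3 stones, whoever takes the last stone wins). Nothing here is AlphaGo’s code: `toy_value_net`, `rollout_outcome`, `mixed_evaluation` and the weight `lam` are hypothetical names, and the “value network” is just a hand-written rule standing in for a trained one.

```python
import random

def rollout_outcome(pile, rng):
    """Finish the toy game with a fast random policy (pile must be >= 1).
    Returns +1 if the player to move at the start wins, else -1."""
    player = 0
    while True:
        pile -= rng.randint(1, min(3, pile))  # cheap, vaguely reasonable move
        if pile == 0:
            return 1 if player == 0 else -1
        player = 1 - player

def toy_value_net(pile):
    # Hypothetical stand-in for a trained value network. In this toy game
    # the player to move loses under perfect play exactly when pile % 4 == 0.
    return -1.0 if pile % 4 == 0 else 1.0

def mixed_evaluation(pile, lam, rng, n_rollouts=1):
    """Blend the position estimate with the average rollout result."""
    z = sum(rollout_outcome(pile, rng) for _ in range(n_rollouts)) / n_rollouts
    return (1 - lam) * toy_value_net(pile) + lam * z
```

With `lam = 0` only the position estimate is used, with `lam = 1` only the rollouts, and `lam = 0.5` is the straight average being discussed.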
That’s rather convoluted.
Well, there are actually many, many more. Sigma is just the initial seed from which a population of rho networks is created, each evolved by playing against a random previous iteration of itself. In this way you can say that sigma and rho are just the initial and final points of an entire spectrum of networks, whose only purpose is to create the raw data used to train theta, and then to be discarded.
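A rough Python sketch of that “play against a random previous iteration” loop, in case it helps. This is not the paper’s training code: the network is reduced to a single strength number, and `toy_update` is a made-up stand-in for “play a game, then take a learning step”. All names are hypothetical.

```python
import random

def train_against_pool(initial_params, iterations, snapshot_every,
                       play_and_update, seed=0):
    """Self-play where the opponent is sampled from earlier snapshots."""
    rng = random.Random(seed)
    pool = [initial_params]   # starts with the supervised seed ("sigma")
    current = initial_params
    for step in range(1, iterations + 1):
        opponent = rng.choice(pool)          # a random previous iteration
        current = play_and_update(current, opponent, rng)
        if step % snapshot_every == 0:
            pool.append(current)             # keep a snapshot for later opponents
    return current, pool

def toy_update(current, opponent, rng):
    # Hypothetical update: win probability follows an Elo-style curve,
    # and the learner's strength moves up on a win and down on a loss.
    win_prob = 1.0 / (1.0 + 10 ** ((opponent - current) / 400.0))
    won = rng.random() < win_prob
    return current + (10.0 if won else -5.0)

final_strength, pool = train_against_pool(1000.0, 100, 10, toy_update)
```

Sampling from the whole pool rather than always playing the latest version is the part that matches the description above; everything else is placeholder.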
The stroke of genius of AlphaGo, in my opinion, was complementing the rollout network, already used in many other programs, with a network whose purpose is to imitate intuition, and furthermore creating this network from the play of a pool of ‘cheap’ experts (cheap compared to human experts).
This technology could be adopted in other areas where data can be easily and automatically labelled (such as Go ending positions).
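Here is a small Python sketch of what “automatically labelled” means in practice, again on a toy game (take 1 to 3 stones, last stone wins) rather than Go. Every position from a self-play game gets labelled with the final result, with no human involved. The function names are hypothetical, and the random policy is a placeholder for the ‘cheap’ expert.

```python
import random

def random_policy(pile, rng):
    # Placeholder for a cheap expert: take 1 to 3 stones at random.
    return rng.randint(1, min(3, pile))

def self_play_game(pile, policy, rng):
    """Play one game (pile must be >= 1).
    Returns (positions, winner), positions being (pile, player_to_move)."""
    positions, player, winner = [], 0, None
    while pile > 0:
        positions.append((pile, player))
        pile -= policy(pile, rng)
        if pile == 0:
            winner = player        # whoever takes the last stone wins
        player = 1 - player
    return positions, winner

def label_positions(positions, winner):
    # Label is +1 if the player to move at that position went on to win.
    return [(pile, 1 if player == winner else -1) for pile, player in positions]

def generate_dataset(num_games, start_pile, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(num_games):
        positions, winner = self_play_game(start_pile, random_policy, rng)
        data.extend(label_positions(positions, winner))
    return data

dataset = generate_dataset(num_games=50, start_pile=10)
```

One caveat: positions from the same game are strongly correlated, and if I recall correctly the paper sampled only one position per game for that reason. This sketch keeps all of them for simplicity.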
But haven’t people been having AIs do that (self-play training) for a long time? I think the most remarkable idea is to use the massive breed of policy nets only to create a judgement net, and use that in the end instead. That’s wild.