There are other big deals. The MS ImageNet win also contained frightening progress on the training meta level.
The other issue is that constructing this kind of mega-neural net is tremendously difficult. Landing on a particular set of algorithms—determining how each layer should operate and how it should talk to the next layer—is an almost epic task. But Microsoft has a trick here, too. It has designed a computing system that can help build these networks.
As Jian Sun explains it, researchers can identify a promising arrangement for massive neural networks, and then the system can cycle through a range of similar possibilities until it settles on this best one. “In most cases, after a number of tries, the researchers learn [something], reflect, and make a new decision on the next try,” he says. “You can view this as ‘human-assisted search.’”
-- extracted from a very readable summary at Wired: http://www.wired.com/2016/01/microsoft-neural-net-shows-deep-learning-can-get-way-deeper/
Going by that description, it is much, much less important than residual learning, because hyperparameter optimization is not new. There are a lot of approaches: grid search, random search, Gaussian processes. Some hyperparameter optimization baked into MSR’s deep learning framework would save researchers some time and effort, certainly, but I don’t know that it would’ve made any big difference unless they have something quite unusual going on.
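To make the "not new" point concrete, here is a minimal sketch of the two simplest approaches named above, grid search and random search. Everything in it is an assumption for illustration: `validation_error` is a hypothetical toy function standing in for "train a network and measure validation error", and the two hyperparameters (learning rate, weight decay) and their ranges are made up, not anything MSR is known to use.

```python
import itertools
import math
import random


def validation_error(lr, weight_decay):
    """Hypothetical stand-in for training a network and scoring it.

    A toy bowl on a log scale with its minimum at lr=0.1, weight_decay=1e-4.
    """
    return (math.log10(lr) + 1.0) ** 2 + 0.1 * (math.log10(weight_decay) + 4.0) ** 2


def grid_search(lrs, weight_decays):
    """Try every combination; return the best (error, lr, weight_decay)."""
    trials = (
        (validation_error(lr, wd), lr, wd)
        for lr, wd in itertools.product(lrs, weight_decays)
    )
    return min(trials)


def random_search(n_trials, rng):
    """Sample log-uniformly; return the best (error, lr, weight_decay)."""
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)   # assumed range: 1e-4 .. 1
        wd = 10 ** rng.uniform(-6, -2)  # assumed range: 1e-6 .. 1e-2
        trial = (validation_error(lr, wd), lr, wd)
        if best is None or trial < best:
            best = trial
    return best


if __name__ == "__main__":
    print("grid:  ", grid_search([1e-3, 1e-2, 1e-1], [1e-5, 1e-4, 1e-3]))
    print("random:", random_search(20, random.Random(0)))
```

Both are a dozen lines wrapped around the expensive part (the training run), which is why a framework offering them saves effort but is unlikely to be a breakthrough by itself.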
(I liked one paper which took a Bayesian multi-armed bandit approach and treated error curves as partial information about final performance; it would switch between the different networks being trained, regularly ‘freezing’ and ‘thawing’ them as the probability that each network would become the best performer changed with information from additional mini-batches/epochs.) Probably the single coolest one is that last year some researchers showed that it is possible to somewhat efficiently backpropagate on hyperparameters! So hyperparameters just become more parameters to learn, and you can load up on all sorts of stuff without worrying about it making your hyperparameter optimization futile or having to train a billion times. That would both save people a lot of time (when using vanilla networks) and allow exploring extremely complicated and heavily parameterized families of architectures, which would be a big deal. Unfortunately, it’s still not efficient enough for the giant networks we want to train. :(
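A toy sketch of what "hyperparameters just become more parameters to learn" means: differentiate the final training loss with respect to the learning rate and run gradient descent on the learning rate itself. This is not the method from the paper alluded to above; it is hand-written forward-mode differentiation through plain gradient descent on an assumed 1-D quadratic loss, and every name and constant (`train`, `tune_learning_rate`, `meta_lr`, the starting weight) is hypothetical.

```python
def train(lr, steps=10, w0=5.0, curvature=1.0, target=0.0):
    """Gradient descent on the toy loss 0.5 * curvature * (w - target)**2.

    Returns (final_loss, d final_loss / d lr). The derivative is carried
    forward through every update step alongside the weight.
    """
    w, dw_dlr = w0, 0.0
    for _ in range(steps):
        grad = curvature * (w - target)
        dgrad_dlr = curvature * dw_dlr
        # Update rule: w_next = w - lr * grad.
        # Differentiating both sides with respect to lr gives the second term.
        w, dw_dlr = w - lr * grad, dw_dlr - grad - lr * dgrad_dlr
    final_loss = 0.5 * curvature * (w - target) ** 2
    dloss_dlr = curvature * (w - target) * dw_dlr
    return final_loss, dloss_dlr


def tune_learning_rate(lr=0.05, meta_lr=0.001, meta_steps=50):
    """Outer loop: gradient descent on the learning rate itself."""
    for _ in range(meta_steps):
        _, hypergrad = train(lr)
        lr -= meta_lr * hypergrad
    return lr


if __name__ == "__main__":
    tuned = tune_learning_rate()
    print("loss at lr=0.05:", train(0.05)[0])
    print("tuned lr:", tuned, "loss:", train(tuned)[0])
```

Even in this toy, each outer step costs a full inner training run plus the extra derivative bookkeeping, which hints at why the approach gets expensive for very large networks.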
The key point is that machine learning starts to happen at the hyper-parameter level. Which is one more step toward systems that optimize themselves.
That step was taken a long time ago and does not seem to have played much of a role in recent developments; for the most part, people don’t bother with extensive hyperparameter tuning. The gains have come from better initialization, better algorithms like dropout or residual learning, and better architectures, not from hyperparameters.