So, this is another learning-rate tuner. What are the prospects for the many other kinds of hyperparameters? Even something like architecture size still requires the equivalent of hyperparameter tuning to decide on the compute-optimal scaling.
AGD can train any architecture, dataset and batch size combination we have tested so far, out-of-the-box. I would argue that this is a qualitative change from current methods, where you have to find the right learning rate for every batch size, architecture and dataset combination in order to converge in optimal or near-optimal time. I think this is a reasonable interpretation of “train ImageNet without hyperparameters”. That said, there is a stronger sense of “hyperparameter-free” in which the optimum batch size and architecture size, i.e. the compute-optimal scaling, would also be decided automatically. And an even stronger sense still, in which the architecture type itself is selected.
In other words, we have the following hierarchy of increasing freedom from hyperparameters:
1. The learning rate must be selected, sometimes with schedulers or heuristics, to guarantee convergence for any architecture, dataset, batch size, etc.
2. Pick an architecture, dataset and batch size, and it will (hopefully) converge in near-optimal time.
3. The compute-optimal batch size and architecture size are found automatically for a given dataset.
4. Given a dataset, the best architecture type (e.g. ResNet, CNN, etc.) is selected automatically.
I would argue that we are currently at stage 1. If AGD (or similar optimisers) really do work the way we think, we are now at stage 2. In my mind, this is a qualitative change.
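To make the stage 1 vs. stage 2 distinction concrete, here is a minimal PyTorch-style sketch. The grid-search loop is roughly what stage 1 looks like in practice; the `AGD(model)` call at the end is an assumed interface for illustration (an optimiser that reads the architecture off the model and takes no learning rate), not the exact API of the paper’s code.

```python
import torch
from torch import nn

# Toy data and loss, just to keep the sketch self-contained.
torch.manual_seed(0)
data = torch.randn(256, 10)
targets = torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

def make_model():
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

def train(model, optimiser, steps=200):
    for _ in range(steps):
        optimiser.zero_grad()
        loss = loss_fn(model(data), targets)
        loss.backward()
        optimiser.step()
    return loss.item()

# Stage 1: the learning rate is a free knob, so it gets swept (grid search,
# schedulers, heuristics) for every new architecture/dataset/batch-size combination.
losses = {}
for lr in [1e-3, 1e-2, 1e-1, 1.0]:
    model = make_model()
    losses[lr] = train(model, torch.optim.SGD(model.parameters(), lr=lr))
best_lr = min(losses, key=losses.get)
print(f"stage 1: {len(losses)} full runs just to find lr={best_lr}")

# Stage 2: an AGD-style optimiser is handed the model (so it can read off the
# architecture) and nothing else; there is no learning rate to sweep.
# `AGD` is a hypothetical stand-in with the usual Optimizer interface.
# model = make_model()
# train(model, AGD(model))  # one run, out of the box
```

The point is not the specific numbers: the inner training loop is identical in both cases, and only the outer search over learning rates disappears.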
So, I think calling it “another learning-rate tuner” is a little disingenuous. Incorporating information about the architecture moves towards eliminating a hyperparameter by removing a degree of freedom, whereas I think of a “learning-rate tuner” as a heuristic, usually trial-and-error method that offers no explanation for why a particular learning rate is best. That said, if there are papers out there that you think already do something similar, or you think I’m wrong in any way, please send them over, or let me know!