These kinds of ‘twist on known optimizers’ papers are pretty common, and they mostly don’t amount to much. E.g., the only difference between Adam and “SafeRate[Adam direction]” is that they used their second-order method to automatically tune the learning rate of the Adam optimizer. Automatic hyperparameter tuning of this sort has been around for a long time. E.g., here’s a paper from ~30 years ago.
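To make “this has been a thing for a long time” concrete, here’s a minimal sketch of one classical flavor of on-the-fly learning-rate adaptation: hypergradient-style, i.e. doing gradient descent on the learning rate itself. This is purely illustrative and not necessarily the method in the linked paper; the function names and constants are made up.

```python
import numpy as np

def sgd_with_hypergradient_lr(grad_fn, theta, alpha=1e-2, beta=1e-4, steps=1000):
    """SGD whose learning rate is itself tuned online by gradient descent.

    The hypergradient of the current loss w.r.t. the previous step's
    learning rate is -g_t . g_{t-1}, so the rate grows when successive
    gradients agree and shrinks when they conflict.
    (Illustrative sketch; `grad_fn`, `beta`, etc. are assumptions.)
    """
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        alpha += beta * float(np.vdot(g, g_prev))  # hypergradient update of the learning rate
        theta = theta - alpha * g                  # ordinary SGD step with the adapted rate
        g_prev = g
    return theta, alpha
```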
Also note that Adam pretty much keeps up with SafeRate in the above plot until the loss drops to ~10⁻⁸, which is extremely low, and very far beyond what any plausible AGI training run will reach. SafeRate’s advantage isn’t supposed to be ‘make loss go down harder’; it’s supposed to be ‘a more stable optimization process’, which is exactly what you see in the plot above.
That’s not to say SafeRate is worthless. The fact that they can do second-order hyperparameter tuning with only a second forward pass, rather than another full forward-and-backward pass, is somewhat interesting. It may also make large language model training more stable, which I understand to be a real pain point when tuning such training runs. However, it’s extremely unlikely IMO to be some “multiple OOM jump” in training efficiency.
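For intuition on how a single extra forward pass can buy a second-order learning-rate estimate: the loss and the directional slope along the update direction are already known from the existing forward/backward pass, so one probe evaluation along that direction is enough to fit a 1-D quadratic and read off its minimizer. A rough sketch under those assumptions (my reconstruction of the general trick, not SafeRate’s actual update rule; all names and the probe size are made up):

```python
import numpy as np

def one_probe_step_size(loss_fn, theta, loss0, grad, direction, probe=1e-3):
    """Estimate a step size along `direction` from one extra forward pass.

    Fit L(eta) ~= loss0 + eta * slope + 0.5 * eta**2 * curvature along the
    update direction (e.g. the Adam direction), where `slope` comes from the
    gradient we already have and `curvature` from a single probe evaluation
    of the loss. No second backward pass is required.
    (Illustrative sketch; not the paper's actual rule.)
    """
    slope = float(np.vdot(grad, direction))          # directional derivative at eta = 0
    loss_probe = loss_fn(theta + probe * direction)  # the one extra forward pass
    curvature = 2.0 * (loss_probe - loss0 - probe * slope) / probe**2
    if curvature <= 0.0:
        return probe              # quadratic model has no minimum; fall back to the probe size
    return -slope / curvature     # minimizer of the fitted 1-D quadratic
```

The appeal is that the marginal cost per step is one forward pass rather than another forward-and-backward pair, which is the kind of modest saving that makes training easier to tune, not the kind that produces a multiple-OOM efficiency jump.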