If gradient hacking is thought to be possible because gradient descent is a highly local optimization process, maybe it would help to use higher-order approaches. E.g., Newton’s method uses second-order derivative information, and the Householder methods use even higher-order derivatives.
These higher-order methods aren’t commonly used in deep learning because of their additional computational expense. However, if such methods can detect and remove mechanisms of gradient hacking that are invisible to gradient descent, it may be worthwhile to use them occasionally during training.
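To make the contrast concrete, here is a minimal sketch on a toy quadratic loss (a hypothetical stand-in; real network Hessians are far too large to form explicitly, which is exactly the computational expense mentioned above). A gradient step uses only the local slope, while a Newton step rescales the gradient by the inverse Hessian, incorporating curvature information:

```python
import numpy as np

# Hypothetical toy quadratic loss 0.5 * w^T A w - b^T w with analytic
# gradient and Hessian; chosen only to illustrate the two update rules.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive-definite curvature
b = np.array([1.0, 1.0])

def grad(w):
    return A @ w - b

def hessian(w):
    return A  # constant Hessian, since the loss is quadratic

w = np.zeros(2)

# Gradient descent: a purely first-order (local slope) update.
w_gd = w - 0.1 * grad(w)

# Newton's method: solve H dw = grad, i.e. use second-order curvature.
w_newton = w - np.linalg.solve(hessian(w), grad(w))

print(w_gd)      # small step along the negative gradient
print(w_newton)  # for a quadratic loss, one Newton step lands on the minimum
```

The point is not that Newton's method is practical at network scale, but that its update depends on second derivatives, so a mechanism tuned to be invisible to first-order updates need not be invisible to it.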