This reminds me of the machine learning point that when you do gradient descent in really high dimensions, local minima are less common than you'd think, because to be trapped in a local minimum, the loss has to curve upward along every dimension at once.
Instead of gradient descent getting trapped at local minima, it's more likely to get pseudo-trapped at "saddle points," where it sits at a local minimum along some dimensions but a local maximum along others, and because the gradient is nearly zero in that neighborhood, it takes a long time to work out which directions are which and escape.
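A minimal sketch of that "pseudo-trapped" behavior, using the toy function f(x, y) = x² − y² as a stand-in for a high-dimensional loss surface (the function, starting point, and step size here are illustrative assumptions, not from the original comment):

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle at the origin:
# a minimum along x, a maximum along y.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])  # gradient of x^2 - y^2

p = np.array([1.0, 1e-6])  # start almost exactly on the saddle's ridge
lr = 0.1

for step in range(101):
    p = p - lr * grad(p)
    if step % 25 == 0:
        print(step, p)
```

The x-coordinate collapses quickly (that direction really is a minimum), but the escape direction y only grows by a constant factor per step from its tiny starting value, so the iterate hovers near the saddle for dozens of steps before the "downhill" dimension finally takes over.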