Your hypothesis seems reasonable, and I think the following supports it.
1. This is for a learning rate of 5e-3, giving no spikes and faster convergence:
2. Gradient descent failed to converge for multiple LRs, from 1e-2 to 1e-5. However, dividing the LR by a factor of 1.0001 whenever the training error increased gave this (see the sketch below):
It’s messy, and the decay seems to turn the jumps of the slingshot effect into chances to get stuck in sub-optimal basins, but the trajectory was always downward. Increasing the reduction factor reduced the spikes, but convergence no longer appeared. Increasing the factor to 2 (halving the LR each time the error rises) removed the spikes entirely.
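In case it helps to make the rule concrete, here's a minimal sketch of the "shrink the LR whenever the training error goes up" idea. It uses plain full-batch gradient descent on a toy least-squares problem; the data, starting LR, and reduction factor are assumed values, not the ones from my runs.

```python
import numpy as np

# Toy least-squares problem (assumed data, just for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def loss_and_grad(w):
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2), X.T @ residual / len(y)

w = np.zeros(5)
lr, decay = 1e-2, 1.0001          # decay=1.0001 is the gentle reduction; decay=2.0 halves the LR
prev_loss, _ = loss_and_grad(w)

for step in range(10_000):
    loss, grad = loss_and_grad(w)
    if loss > prev_loss:          # training error increased -> shrink the LR
        lr /= decay
    prev_loss = loss
    w -= lr * grad
```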