Oh, you’re using AdamW everywhere? That might explain the continuous training loss increase after each spike, with AdamW needing time to adjust to the new loss landscape...
Lower learning rate leads to more spikes? Curious! I hypothesize that it needs a small learning rate to get stuck in a narrow local optimum; once it reaches the very bottom of that basin the gradient is ~zero, and the “normalize gradient vector to step size” step is discontinuous around zero.
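To make that last point concrete, here is a minimal sketch of the Adam update rule for a constant scalar gradient (standard Adam with bias correction; AdamW only adds decoupled weight decay on top, which doesn’t change this normalization). It isn’t anyone’s training code, just the update formula: the step settles to roughly the full LR whether the persistent gradient is 1e-1 or 1e-6, so the update behaves like lr·sign(gradient) and is discontinuous at zero.

```python
import math

def adam_update_for_constant_grad(g, lr=1e-4, beta1=0.9, beta2=0.999,
                                  eps=1e-8, steps=200):
    """Return the final Adam update when the same scalar gradient g is seen every step."""
    m = v = 0.0
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * g          # first-moment EMA
        v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        update = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return update

# The step magnitude is ~lr whether the (persistent) gradient is large or tiny,
# and its sign follows the gradient's sign: the update is discontinuous at g = 0.
for g in (1e-1, 1e-6, -1e-6):
    print(f"g={g:+.0e}  update={adam_update_for_constant_grad(g):+.3e}")
```

(With real batches the gradient near the basin floor is noisy rather than exactly zero, so its sign, and hence the full-size step, can flip back and forth.)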
Experiments springing to mind are:
1. Do you get even fewer spikes if you increase the step size instead?
2. Is there any optimizer setup at all that makes the training loss only ever go down?
2.1. Reduce the step size whenever an update would increase the training loss?
2.2. Use gradient descent instead of AdamW?
Your hypothesis seems reasonable, and I think the following proves it.
1. This is for an LR of 5e-3, giving no spikes and faster convergence:
2. Gradient descent failed to converge for multiple LRs, from 1e-2 to 1e-5. However, dividing the LR by 1.0001 whenever the training error increased gave this (a sketch of the rule is at the end of this comment):
It’s messy, and the LR decrease seems to turn the slingshot-effect jumps into a way of getting stuck in sub-optimal basins, but the trajectory was always downwards. Increasing the reduction factor gave fewer spikes, but the run no longer converged.
Increasing the reduction factor all the way to 2 removed the spikes entirely.
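In case it’s useful, here is a rough sketch of the rule used for 2. (which also roughly corresponds to your 2.1 + 2.2): plain gradient descent, dividing the LR by a fixed factor whenever the training loss went up. The model, loss, data, and step count below are placeholders, not my actual setup; only the 1.0001 factor is the one mentioned above.

```python
import torch

def gd_with_lr_backoff(model, loss_fn, x, y, lr=1e-2,
                       decay_factor=1.0001, steps=10_000):
    """Vanilla gradient descent; divide the LR by `decay_factor`
    whenever the training loss increased compared to the previous step."""
    prev_loss = float("inf")
    for _ in range(steps):
        loss = loss_fn(model(x), y)
        if loss.item() > prev_loss:
            lr /= decay_factor               # back off after any loss increase
        prev_loss = loss.item()

        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= lr * p.grad         # plain GD step, no momentum / Adam
    return lr
```

Raising `decay_factor` is the “increasing the reduction factor” knob described above.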