What do you think would happen if you further trained the Adam model with SGD (and vice-versa)? Has it found too qualitatively different a local optimum to 'fix' the privileged basis issue, or would it just gradually change to a more SGD-like internal organization?
If we were to start training with Adam and later switch to SGD, I would guess that the privileged basis would persist.
There is no mechanism in SGD which opposes solutions with basis-aligned features; it's just that SGD is agnostic to all choices of directions for features in the residual stream. Because there are infinitely many possible directions for features to point in, the reason an SGD-trained model does not have a privileged basis is simply that it is exceedingly unlikely to be randomly initialized into one.
On the other hand, Adam collects statistics with respect to each basis dimension, making basis dimensions different from other directions. Somehow, this causes model features to align with basis dimensions.
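
To make the symmetry argument concrete, here is a minimal sketch (my own illustration, not from the original discussion) contrasting simplified versions of the two update rules: plain SGD without momentum, and an Adam-like step keeping only the per-coordinate second-moment normalization, with momentum and bias correction dropped. Rotating the gradient rotates the SGD step by the same rotation, while the element-wise normalization in the Adam-like step breaks that symmetry:

```python
import torch

g = torch.randn(4)                          # a gradient vector
Q, _ = torch.linalg.qr(torch.randn(4, 4))   # a random rotation of the basis

def sgd_step(g, lr=0.1):
    # SGD update depends on g only through scalar multiplication,
    # so it commutes with any rotation Q: step(Q @ g) == Q @ step(g)
    return -lr * g

def adam_like_step(g, v, lr=0.1, eps=1e-8):
    # element-wise normalization by per-coordinate second moments;
    # dividing coordinate-by-coordinate singles out the basis directions
    return -lr * g / (v.sqrt() + eps)

v = g ** 2  # second-moment estimate, tracked per basis dimension

print(torch.allclose(sgd_step(Q @ g), Q @ sgd_step(g)))                        # True
print(torch.allclose(adam_like_step(Q @ g, (Q @ g) ** 2),
                     Q @ adam_like_step(g, v)))                                # False
```

In the full Adam update the second moments are exponential moving averages accumulated separately for each parameter coordinate over training, so this basis-dependence compounds across many steps rather than appearing in a single update.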