What do you think would happen if you further trained the Adam model with SGD (and vice-versa)? Has it found too qualitatively different a local optimum to 'fix' the privileged basis issue, or would it just gradually change to a more SGD-like internal organization?
If we were to start training with Adam and later switch to SGD, I would guess that the privileged basis would persist.
There is no mechanism in SGD which opposes solutions with basis-aligned features; it's just that SGD is agnostic to all choices of directions for features in the residual stream. Because there are infinitely many possible directions for features to point in, the reason an SGD-trained model does not have a privileged basis is simply that it is exceedingly unlikely to be randomly initialized into one.
On the other hand, Adam collects statistics with respect to each basis dimension, making basis dimensions different from other directions. Somehow, this causes model features to align with basis dimensions.
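
To make the symmetry argument concrete, here is a minimal sketch (my own illustration, not from the original discussion) contrasting simplified versions of the two update rules: plain SGD without momentum, and an Adam-like step keeping only the per-coordinate second-moment normalization, with momentum and bias correction dropped. Rotating the gradient rotates the SGD step by the same rotation, while the element-wise normalization in the Adam-like step breaks that symmetry:

```python
import torch

g = torch.randn(4)                          # a gradient vector
Q, _ = torch.linalg.qr(torch.randn(4, 4))   # a random rotation of the basis

def sgd_step(g, lr=0.1):
    # SGD update depends on g only through scalar multiplication,
    # so it commutes with any rotation Q: step(Q @ g) == Q @ step(g)
    return -lr * g

def adam_like_step(g, v, lr=0.1, eps=1e-8):
    # element-wise normalization by per-coordinate second moments;
    # dividing coordinate-by-coordinate singles out the basis directions
    return -lr * g / (v.sqrt() + eps)

v = g ** 2  # second-moment estimate, tracked per basis dimension

print(torch.allclose(sgd_step(Q @ g), Q @ sgd_step(g)))                        # True
print(torch.allclose(adam_like_step(Q @ g, (Q @ g) ** 2),
                     Q @ adam_like_step(g, v)))                                # False
```

In the full Adam update the second moments are exponential moving averages accumulated separately for each parameter coordinate over training, so this basis-dependence compounds across many steps rather than appearing in a single update.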