ϵ-random is a bad baseline because activation space is not isotropic (or some other reason I do not understand) and this is not actually that unexpected or interesting.
Isn’t this just the answer? To rephrase:
When you force the SAE to compress the space, it can only represent a subset of the possible directions in the original activation space.
If you take the magnitude of a perturbation along a direction where change matters, and then apply that same magnitude along random directions, most of which the model throws away, the result will be a smaller change.
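A toy sketch of that intuition, not a claim about any real model or SAE: assume the "model" is just a linear map that reads k orthonormal directions out of a d-dimensional activation space and discards the rest. The function names, the dimensions, and the linear readout are all assumptions made up for illustration.

```python
import numpy as np

def downstream_change(W, delta):
    """Size of the change in the toy model's output caused by perturbation delta."""
    return np.linalg.norm(W @ delta)

def compare_perturbations(d=512, k=16, eps=1.0, n_random=200, seed=0):
    """Compare an eps-sized push along a direction the toy model reads
    against eps-sized pushes in isotropic random directions."""
    rng = np.random.default_rng(seed)

    # Hypothetical "model": reads k orthonormal directions, ignores the other d - k.
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # Q has shape (d, k)
    W = Q.T                                           # readout, shape (k, d)

    # Perturbation of norm eps along a direction where change matters.
    sensitive = eps * Q[:, 0]

    # Perturbations of the same norm in isotropic random directions.
    rand = rng.standard_normal((n_random, d))
    rand = eps * rand / np.linalg.norm(rand, axis=1, keepdims=True)

    change_sensitive = downstream_change(W, sensitive)
    change_random = float(np.mean([downstream_change(W, r) for r in rand]))
    return change_sensitive, change_random

if __name__ == "__main__":
    s, r = compare_perturbations()
    print(f"change along a read direction: {s:.3f}")
    print(f"mean change along random directions: {r:.3f}")
```

In this toy setup the random perturbations keep only roughly sqrt(k/d) of their norm after the readout, so the same magnitude produces a much smaller downstream change. Whether real activation space behaves like this is exactly the open question.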