Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [king] vector, subtract the [male] vector, add the [female] vector, and get out something close to the [queen] vector?
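That arithmetic can be sketched with toy vectors. The embeddings below are illustrative stand-ins (one axis loosely encoding gender, one royalty), not real learned word2vec vectors:

```python
import numpy as np

# Toy 2-D "embeddings": first axis loosely encodes gender, second royalty.
# These are hypothetical stand-ins for illustration, not learned vectors.
vecs = {
    "male":   np.array([1.0, 0.0]),
    "female": np.array([-1.0, 0.0]),
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([-1.0, 1.0]),
}

# The classic analogy: king - male + female should land near queen.
result = vecs["king"] - vecs["male"] + vecs["female"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the vocabulary by cosine similarity to the analogy result.
nearest = max(vecs, key=lambda w: cosine(vecs[w], result))
print(nearest)  # queen
```

In real embedding spaces the result is only *close* to [queen], which is why nearest-neighbor lookup (usually excluding the input words) is the standard evaluation.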
Incidentally, there’s a recent paper that investigates how this works in sparse autoencoders (SAEs) on transformers:
we search for what we term crystal structure in the point cloud of SAE features … initial search for SAE crystals found mostly noise … consistent with multiple papers pointing out that (man,woman,king,queen) is not an accurate parallelogram
We found the reason to be the presence of what we term distractor features. … To eliminate such semantically irrelevant distractor vectors, we wish to project the data onto a lower-dimensional subspace orthogonal to them. … Figure 1 illustrates that this dramatically improves the cluster and trapezoid/parallelogram quality, highlighting that distractor features can hide existing crystals.
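The projection step the authors describe can be sketched as follows. This is my own minimal reconstruction, not the paper's code; the distractor directions here are random placeholders, whereas the paper identifies specific semantically irrelevant directions (and the function name `project_out` is mine):

```python
import numpy as np

def project_out(points, distractors):
    """Project points onto the subspace orthogonal to the distractor directions.

    points:      (n, d) array of SAE feature vectors
    distractors: (k, d) array of semantically irrelevant directions
    """
    # Orthonormalize the distractor directions via QR decomposition.
    Q, _ = np.linalg.qr(np.asarray(distractors, dtype=float).T)  # shape (d, k)
    # Subtract each point's component lying in the distractor subspace.
    return points - (points @ Q) @ Q.T

# Demo with a random point cloud and one random "distractor" direction.
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 8))
distractor = rng.normal(size=(1, 8))
cleaned = project_out(pts, distractor)

# After projection, the cleaned points have ~zero component along the distractor.
unit = distractor[0] / np.linalg.norm(distractor[0])
print(np.abs(cleaned @ unit).max())
```

After projecting out the distractor subspace, parallelogram structure that was smeared along irrelevant axes can reappear in the remaining dimensions.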