However, we find that in our trained models the learned encoder weights are not the transpose of the decoder weights and are cleverly offset to increase representational capacity. Specifically, we find that similar features which have closely related dictionary vectors have encoder weights that are offset so that they prevent crosstalk between the noisy feature inputs and confusion between the distinct features.
That post also includes a summary of Neel Nanda’s replication of the experiments, and they provided an additional interpretation of this that I think is interesting.
One question from this work is whether the encoder and decoder should be tied. I find that, empirically, the decoder and encoder weights for each feature are moderately different, with median cosine similiarty of only 0.5, which is empirical evidence they’re doing different things and should not be tied. Conceptually, the encoder and decoder are doing different things: the encoder is detecting, finding the optimal direction to project onto to detect the feature, minimising interference with other similar features, while the decoder is trying to represent the feature, and tries to approximate the “true” feature direction regardless of any interference.
Originally they were tied (because it makes intuitive sense), but I believe Anthropic was the first to suggest untying them, and found that this helped it differentiate similar features:
That post also includes a summary of Neel Nanda’s replication of the experiments, and they provided an additional interpretation of this that I think is interesting.