Indeed, similar tokens (e.g. “The” vs. ” The”) have similar token-biases (median max-cos-sim 0.89). This is likely because we initialize the biases with the unigram residuals, which mostly retain the same “nearby tokens” as the original embeddings.
Due to this, most token-biases have high cosine similarity to their respective unigram residuals (median 0.91). This suggests that if we use the token-biases as additional features, we can interpret their activations relative to the unigram residuals—as a somewhat un-normalized similarity, since the activation is a dot product rather than a cosine similarity.
That said, the token-biases have uniquely interesting properties—for one, they seem to be great “last token detectors”. Suppose we take a last-row residual vector from an arbitrary 128-token prompt. Then, a max-cos-sim over the 50257 token-biases yields the prompt’s exact last token with 88-98% accuracy (layers 10 down to 5), compared to only 35-77% accuracy using unigram residuals (see footnote 2).
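The detection procedure above can be sketched as follows. This is a minimal illustration, not the original experimental code: `token_biases` is a placeholder random matrix standing in for the learned biases (GPT-2's vocab size of 50257 and hidden size of 768 are assumed), and `detect_last_token` is a hypothetical helper name.

```python
import numpy as np

# Placeholder for the learned token-biases: one vector per vocab entry.
# Assumed shapes: 50257 tokens (GPT-2 vocab), d_model = 768.
rng = np.random.default_rng(0)
token_biases = rng.normal(size=(50257, 768))

def detect_last_token(residual, biases):
    """Return the vocab id whose token-bias has the highest cosine
    similarity to the given last-row residual vector."""
    biases_unit = biases / np.linalg.norm(biases, axis=1, keepdims=True)
    residual_unit = residual / np.linalg.norm(residual)
    sims = biases_unit @ residual_unit  # cosine similarities, shape (50257,)
    return int(np.argmax(sims))
```

Note that this scans the full vocabulary per query; in practice the same lookup is a single matrix–vector product over normalized biases, so it is cheap to batch across many prompts.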