Did you vary expansion size? The tokenized SAE will have 50k more features in its dictionary (compared to the 16x expansion of ~12k features from the paper version).
Did you ever train a baseline SAE with a similar number of features to the tokenized one?
Yes to both! We varied expansion size for tokenized (8x-32x) and baseline (4x-64x), available in the Google Drive folder expansion-sweep. Just to be clear, our focus was on learning so-called “complex” features that do not solely activate based on the last token. So, we did not use the lookup biases as additional features (only for decoder reconstruction).

That said, ~25% of the suggested 64x baseline features are similar to the token-biases (cosine similarity 0.4-0.9). In fact, evolving the token-biases via training substantially increases their similarity (see figure). Smaller expansion sizes have up to 66% similar features and fewer dead features. (Related sections above: ‘Dead features’ and ‘Measuring “simple” features’.)
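For concreteness, here is a minimal sketch (PyTorch, with assumed shapes and hypothetical names rather than the exact training code) of what “only for decoder reconstruction” means: the top-k encoder features stay token-agnostic, and the per-token lookup vector is only added back when reconstructing.

```python
import torch
import torch.nn as nn


class TokenizedTopKSAE(nn.Module):
    """Sketch of a top-k SAE whose decoder adds a per-token lookup bias.

    Shapes and names are assumptions (GPT-2-ish d_model=768, ~50k vocab),
    not the exact code behind the experiments above.
    """

    def __init__(self, d_model=768, n_features=6144, vocab_size=50257, k=30):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Per-token lookup: used only at decode time, never exposed as an encoder feature.
        self.lookup = nn.Parameter(torch.zeros(vocab_size, d_model))

    def forward(self, x, token_ids):
        # x: (batch, d_model) activations; token_ids: (batch,) current tokens
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
        # Reconstruction = sparse feature code + per-token bias from the lookup table.
        recon = acts @ self.W_dec + self.b_dec + self.lookup[token_ids]
        return recon, acts
```

On this reading, the similarity figures above amount to cosine similarities between the baseline SAE’s decoder directions and rows of lookup.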
Do you have a dictionary-size to CE-added plot? (at fixed L0)
So, we did not use the lookup biases as additional features (only for decoder reconstruction)

I agree it’s not like the other features in that the encoder isn’t used, but it is used for reconstruction, which affects CE. It’d be good to show that the Pareto improvement in CE/L0 isn’t caused just by having an extra vocab_size features in the dictionary (although that might mean having to use auxk to get a similar number of alive features).
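On the CE side, here is a hedged sketch of how one could measure the CE impact of splicing an SAE reconstruction back into the forward pass (Hugging Face GPT-2, an arbitrary layer index, and the two-argument SAE interface from the sketch above are all illustrative assumptions, not the code actually used):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


@torch.no_grad()
def ce_added(sae, text, layer=8, device="cpu"):
    """CE with the SAE reconstruction patched into one residual stream, minus clean CE."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
    enc = tok(text, return_tensors="pt").to(device)
    ids = enc["input_ids"]

    def patch(module, inputs, output):
        # GPT2Block returns a tuple; output[0] is the (batch, seq, d_model) hidden state.
        hidden = output[0]
        recon, _ = sae(hidden.reshape(-1, hidden.shape[-1]), ids.reshape(-1))
        return (recon.reshape_as(hidden),) + output[1:]

    clean = model(**enc, labels=ids).loss.item()
    handle = model.transformer.h[layer].register_forward_hook(patch)
    patched = model(**enc, labels=ids).loss.item()
    handle.remove()
    return patched - clean
```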
One caveat I want to highlight: there was a bug when training the tokenized SAEs for the expansion sweep, so the lookup table wasn’t learned but remained at the hard-coded values. Those SAEs are therefore quite suboptimal. Due to compute constraints, I haven’t re-run that experiment (the 64x SAEs take quite a while to train).
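For concreteness, the difference between the two variants comes down to whether the lookup table actually gets trained or stays frozen at its initial values; a minimal illustration (hypothetical names, not the repo code):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 768
init = torch.randn(vocab_size, d_model)  # stand-in for the hard-coded initial values

# Sweep ("tokenized") variant: the lookup never receives gradients, so it stays at init.
frozen_lookup = nn.Parameter(init.clone(), requires_grad=False)

# "Learned" variant: same initialization, but the optimizer is allowed to update it.
learned_lookup = nn.Parameter(init.clone(), requires_grad=True)
```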
Anyway, I think the main question you want answered is whether the 8x tokenized SAE beats the 64x normal SAE, which it does. The 64x SAE is still improving slightly faster near the end of training, though; I only used 130M tokens.
Below is an NMSE plot for k=30 across expansion factors (the CE picture is about the same, albeit slightly less impacted by the size increase). The “tokenized” label indicates the non-learned lookup, and “Learned” is the working tokenized setup.
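For reference, a sketch of an NMSE computation in the usual sense of a normalized reconstruction error (the exact normalizer behind the plot may differ):

```python
import torch


def nmse(x: torch.Tensor, recon: torch.Tensor) -> float:
    """Normalized MSE: reconstruction error relative to the activations' own spread.

    x, recon: (n_samples, d_model). Some setups normalize by x.pow(2).sum(-1).mean()
    instead of the mean-centered version below.
    """
    err = (x - recon).pow(2).sum(dim=-1).mean()
    scale = (x - x.mean(dim=0, keepdim=True)).pow(2).sum(dim=-1).mean()
    return (err / scale).item()
```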
That’s great, thanks!
My suggested experiment to really get at this question (which, if I were in your shoes, I wouldn’t want to run because you’ve already done quite a bit of work on this project, lol):
Compare
1. Baseline 80x expansion (56k features) at k=30
2. Tokenized-learned 8x expansion (50k vocab + 6k features) at k=29 (since the token lookup adds one extra active feature)
for 300M tokens (I usually don’t see improvements past this amount), showing NMSE and CE.
If tokenized-SAEs are still better in this experiment, then that’s a pretty solid argument to use these!
If they’re equivalent, then tokenized-SAEs are still way faster to train in this lower expansion range, while having 50k “features” already interpreted.
If tokenized-SAEs are worse, then these tokenized features aren’t a good prior to use. Although both sets of features are learned, the difference is that the tokenized SAE always has the same feature per token (duh), while baseline SAEs allow whatever combination of features (e.g. features shared across different tokens).
This is a completely fair suggestion. I’ll look into training a fully-fledged SAE with the same number of features for the full training duration.
Although, tokenized features are dissimilar to normal features in that they don’t vary in activation strength: they are either 0 or 1 (equivalently, 0 or the norm of the lookup vector). So it’s not exactly an apples-to-apples comparison with a similarly sized dictionary of normal SAE features, although that plot would be nice!
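To make that distinction concrete, here is a toy illustration (assumed shapes) of how the two kinds of “feature” contribute to a reconstruction:

```python
import torch

d_model, n_features, vocab_size = 768, 6144, 50257
W_dec = torch.randn(n_features, d_model)   # normal SAE decoder directions
lookup = torch.randn(vocab_size, d_model)  # per-token lookup vectors

# Normal SAE feature: a graded, data-dependent activation scales a fixed direction.
act = torch.tensor(3.7)
feature_contribution = act * W_dec[123]

# Tokenized "feature": the current token's vector is added as-is, so its activation
# is effectively 0 or 1 (equivalently, 0 or the norm of the lookup vector).
token_id = 42
lookup_contribution = lookup[token_id]
```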