One caveat I want to highlight: there was a bug when training the tokenized SAEs for the expansion sweep, so the lookup table wasn’t learned but remained at the hard-coded values...
They are therefore quite suboptimal. Due to compute constraints, I haven’t re-run that experiment (the 64x SAEs take quite a while to train).
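For concreteness, here’s a minimal sketch of the setup I’m describing (my own PyTorch reconstruction, not the actual training code; the class and argument names are made up): each token id indexes a lookup table whose row is added to the TopK reconstruction, and the bug amounts to that table staying frozen at its hard-coded values instead of being trained with everything else.

```python
import torch
import torch.nn as nn


class TokenizedTopKSAE(nn.Module):
    """Sketch of a tokenized TopK SAE with a per-token lookup table."""

    def __init__(self, d_model: int, n_features: int, vocab_size: int,
                 k: int, learn_lookup: bool = True):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        # Per-token additive direction. learn_lookup=False mimics the bug:
        # the table keeps its (hard-coded) initial values and never updates.
        self.lookup = nn.Embedding(vocab_size, d_model)
        self.lookup.weight.requires_grad_(learn_lookup)

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Keep only the k largest pre-activations (TopK sparsity).
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        # Reconstruction = sparse feature decode + the row for this token.
        return self.decoder(acts) + self.lookup(token_ids)
```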
Anyway, I think the main question you want answered is whether the 8x tokenized SAE beats the 64x normal SAE, which it does. That said, the 64x SAE is improving slightly faster near the end of training, and I only used 130M tokens.
Below is an NMSE plot for k=30 across expansion factors (the CE is about the same, albeit slightly less impacted by the size increase). The “tokenized” label indicates the non-learned lookup, and “learned” is the working tokenized setup.
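For reference, a minimal sketch of NMSE as it’s typically computed (the exact normalization used in these runs is an assumption on my part): squared reconstruction error divided by the squared norm of the activations, so 0 is a perfect reconstruction and 1 is as bad as predicting all zeros.

```python
import torch


def nmse(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # Normalized MSE between activations x and reconstructions x_hat
    # (assumed normalization: squared error over squared activation norm).
    return ((x_hat - x) ** 2).sum() / (x ** 2).sum()
```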
That’s great, thanks!
My suggested experiment to really get at this question (which, if I were in your shoes, I wouldn’t want to run because you’ve already done quite a bit of work on this project, lol):
Compare
1. Baseline 80x expansion (56k features) at k=30
2. Tokenized-learned 8x expansion (50k vocab + 6k features) at k=29 (since the token lookup always contributes one extra feature)
both trained for 300M tokens (I usually don’t see improvements past this amount), showing NMSE and CE.
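A quick back-of-the-envelope check that those two setups are roughly size-matched (a sketch; the hidden size is an assumption picked to reproduce the rounded feature counts above):

```python
# Rough size accounting for the comparison above. D_MODEL is an assumed hidden
# size chosen so that 80x expansion lands near the quoted 56k features; the
# point is that vocab rows + 8x learned features roughly matches the baseline,
# and that the tokenized run uses k - 1 = 29 learned features per position
# because the token lookup always contributes one extra "feature".
D_MODEL = 700        # assumption: 80 * 700 = 56_000
VOCAB_SIZE = 50_000  # approximate tokenizer vocabulary

baseline_features = 80 * D_MODEL               # 56_000 learned features, k = 30
tokenized_features = VOCAB_SIZE + 8 * D_MODEL  # 50_000 lookup rows + 5_600 learned, k = 29

print(baseline_features, tokenized_features)   # 56000 vs 55600 -> roughly matched
```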
If tokenized SAEs are still better in this experiment, then that’s a pretty solid argument to use them!
If they’re equivalent, then tokenized SAEs are still way faster to train at this lower expansion factor, while coming with 50k “features” that are already interpreted.
If tokenized SAEs are worse, then these tokenized features aren’t a good prior to use. Although both sets of features are learned, the difference is that the tokenized SAE always fires the same feature for a given token (by construction), whereas the baseline SAE can use whatever combination of features it likes (e.g. features shared across different tokens).
This is a completely fair suggestion. I’ll look into training a fully-fledged SAE with the same number of features for the full training duration.