Great work!
Did you ever run just the L0-approx and the sparsity-frequency penalty separately? It’s unclear whether you’re getting better results because the L0 function is better or because there are fewer dead features.
Also, a feature frequency of 0.2 is very large! Activating on 1 in 5 tokens is a lot even for a positional feature (since your context length is 128). It’d be bad if the improved results are because polysemanticity is sneaking back in through these activations. Sampling datapoints across a range of activations should show where the meaning becomes polysemantic. Is it the bottom 10% of activations (or the bottom 10% of the max activation, which is my preferred method)?
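In case it’s useful, here’s a minimal sketch of the sampling I mean (names are mine, not from your codebase; assumes you have a 1-D array of one feature’s per-token activations and the matching token contexts):

```python
import numpy as np

def sample_by_activation_bin(acts, contexts, n_bins=10, per_bin=3, seed=0):
    """Bin a feature's nonzero activations by fraction of its max activation,
    then sample a few (activation, context) examples per bin, so you can
    eyeball where the feature's meaning becomes polysemantic."""
    rng = np.random.default_rng(seed)
    acts = np.asarray(acts, dtype=float)
    nz = np.flatnonzero(acts > 0)        # tokens where the feature fires
    frac = acts[nz] / acts.max()         # activation as a fraction of the max
    samples = {}
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # closed upper bound on the top bin so the max itself is included
        mask = (frac >= lo) & ((frac < hi) if b < n_bins - 1 else (frac <= hi))
        idx = nz[mask]
        if len(idx) == 0:
            samples[(lo, hi)] = []
            continue
        pick = rng.choice(idx, size=min(per_bin, len(idx)), replace=False)
        samples[(lo, hi)] = [(acts[i], contexts[i]) for i in pick]
    return samples
```

If the low bins read as unrelated concepts while the top bins are monosemantic, that’s the polysemanticity-sneaking-back-in failure mode.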
What a cool paper! Congrats!:)
What’s cool:
1. e2e SAEs learn very different features with every seed. I’m glad y’all checked! This seems bad.
2. e2e SAEs have worse intermediate reconstruction loss than local. I would’ve predicted the opposite actually.
3. e2e+downstream seems to get all the benefits of the e2e one (same perf at lower L0) at the same compute cost, w/o the “intermediate activations aren’t similar” problem.
It looks like you’ve left post-training SAE_local on KL or downstream loss for future work, but that’s a very interesting part! Specifically, how closely it approximates SAE_e2e+downstream as a function of the number of training tokens.
Did y’all try ablations on SAE_e2e+downstream? For example, training only on the next layer’s reconstruction loss, or on the next N layers’ reconstruction loss?
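FWIW, roughly what I have in mind for the next-N-layers variant (purely an illustrative sketch, not your actual training setup; the blocks here are plain callables standing in for transformer blocks on the residual stream):

```python
import numpy as np

def downstream_recon_loss(clean_acts, recon_acts, blocks, n_layers):
    """Propagate both the clean activations and the SAE reconstruction
    through the next n_layers blocks, accumulating MSE between the two
    streams at each layer's output. n_layers=1 gives the next-layer-only
    ablation; larger n_layers interpolates toward the full downstream loss."""
    clean = np.asarray(clean_acts, dtype=float)
    recon = np.asarray(recon_acts, dtype=float)
    loss = 0.0
    for block in blocks[:n_layers]:
        clean, recon = block(clean), block(recon)
        loss += np.mean((recon - clean) ** 2)
    return loss
```

Sweeping n_layers would show how many downstream layers you actually need before the e2e+downstream benefits saturate.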