Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary)
Recompute statistics for this modified dictionary.
I predict the resulting dictionary will be “like 1_32768 but a bit worse.” Concretely, I’m guessing that means % loss recovered around 72%.
Results:
I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in a L0 just below 40. The stats:
MSE Loss: 0.27 (worse than 1_32768)
Percent loss recovered: 77.9% (a little bit better than 1_32768)
I was a bit surprised by this—it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don’t actually mater to the model’s computation. (Though then again, maybe this is what we expect for uninterpretable features.)
It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of “train with a high L1 penalty and resample low-frequency features”). I’d be interested for someone to explore if there’s a version of this that helps.
Here’s an experiment I’m about to do:
Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary)
Recompute statistics for this modified dictionary.
I predict the resulting dictionary will be “like 1_32768 but a bit worse.” Concretely, I’m guessing that means % loss recovered around 72%.
Results:
I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in a L0 just below 40. The stats:
MSE Loss: 0.27 (worse than 1_32768)
Percent loss recovered: 77.9% (a little bit better than 1_32768)
I was a bit surprised by this—it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don’t actually mater to the model’s computation. (Though then again, maybe this is what we expect for uninterpretable features.)
It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of “train with a high L1 penalty and resample low-frequency features”). I’d be interested for someone to explore if there’s a version of this that helps.