Logan Riggs comments on Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Logan Riggs 22 Aug 2024 17:22 UTC
LW: 4 AF: 2
I finally checked!
Here is the Jaccard similarity (i.e. similarity of the sets of input tokens each feature activates on) across seeds:
The e2e ones do indeed have a much lower Jaccard similarity (there is normally a spike at 1.0, but it disappears once you drop features that activate fewer than 10 times).
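For concreteness, here is a minimal sketch of how a cross-seed Jaccard comparison like this could be computed. It is not the code behind the plots. It assumes each dictionary's activations are available as a dense `(n_tokens, n_features)` array over the same tokens, that a feature counts as active when its activation is above zero, and that each feature in seed A is scored by its best match in seed B. The function name and the `min_activations` parameter are hypothetical.

```python
import numpy as np

def max_jaccard_across_seeds(acts_a, acts_b, min_activations=10):
    """For each feature in seed A, return its best Jaccard match in seed B.

    acts_a, acts_b: arrays of shape (n_tokens, n_features) computed on the
    same tokens (an assumption of this sketch, not taken from the comment).
    """
    # Binarise: a feature is "active" on a token if its activation is > 0.
    a = np.asarray(acts_a) > 0
    b = np.asarray(acts_b) > 0

    # Drop (near-)dead features, mirroring the "< 10 activations" filter.
    a = a[:, a.sum(axis=0) >= min_activations].astype(np.float64)
    b = b[:, b.sum(axis=0) >= min_activations].astype(np.float64)

    # |A_i ∩ B_j| for every pair of features, via a matrix product.
    intersection = a.T @ b
    # |A_i ∪ B_j| = |A_i| + |B_j| - |A_i ∩ B_j|
    union = a.sum(axis=0)[:, None] + b.sum(axis=0)[None, :] - intersection

    # Guard against 0/0 when min_activations is set to 0.
    jaccard = intersection / np.maximum(union, 1.0)

    # Best match in seed B for each surviving feature in seed A.
    return jaccard.max(axis=1)
```

A histogram of the returned values is what a plot of this kind would show: a feature that fires on exactly the same tokens in both seeds lands at 1.0.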
I also (mostly) replicated the decoder similarity chart:
And calculated the encoder sim:
[Again, I needed to remove dead features (fewer than 10 activations) to get the graphs here.]
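As a rough illustration of the decoder and encoder comparisons, the sketch below takes each feature direction in one seed and finds its maximum cosine similarity with any direction in the other seed. It is an assumption that the weights are laid out as `(n_features, d_model)`, so a matrix stored the other way round would need transposing first. The function name and the optional `alive_a` / `alive_b` masks are hypothetical.

```python
import numpy as np

def max_cosine_sim_across_seeds(w_a, w_b, alive_a=None, alive_b=None):
    """For each feature direction in seed A, return its max cosine sim in seed B.

    w_a, w_b: arrays of shape (n_features, d_model), one row per feature.
    alive_a, alive_b: optional boolean masks selecting non-dead features.
    """
    w_a = np.asarray(w_a, dtype=np.float64)
    w_b = np.asarray(w_b, dtype=np.float64)

    # Optionally drop dead features before comparing.
    if alive_a is not None:
        w_a = w_a[np.asarray(alive_a, dtype=bool)]
    if alive_b is not None:
        w_b = w_b[np.asarray(alive_b, dtype=bool)]

    # Normalise every row to unit length so the dot product is a cosine.
    w_a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    w_b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)

    # Pairwise cosine similarities, then the best match for each row of A.
    return (w_a @ w_b.T).max(axis=1)
```

The same function can be applied to decoder rows or encoder rows. The alive masks could come from the same activation-count threshold used for the Jaccard plot.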
So yes, I believe the original paper’s claim that e2e dictionaries learn quite different features across seeds is substantiated.