Joseph Bloom comments on Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Joseph Bloom 28 Feb 2024 16:58 UTC
My mental model is that the encoder is working hard to find particular features and distinguish them from others (so it’s doing a compressed sensing task), and that out of context the input is off-distribution, so the encoder doesn’t separate features from noise properly. Positional features are likely a part of that, but I’d be surprised if they were most of it.
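A toy sketch of that mental model, not a claim about any real SAE: every size, the fixed threshold, the tied-weight ReLU encoder, and the function names below are assumptions made for illustration. Inputs that are sparse sums of a few feature directions stand in for in-distribution activations, and a dense sum of many directions stands in for an off-distribution input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 2048   # hypothetical sizes: more features than dimensions
threshold = 0.5                   # hypothetical fixed encoder bias

# Random unit-norm feature directions; each row is one feature.
features = rng.standard_normal((n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)

def sample_input(k):
    """Sum k randomly chosen feature directions; return the vector and its support."""
    support = rng.choice(n_features, size=k, replace=False)
    return features[support].sum(axis=0), set(support.tolist())

def encode(x):
    """Toy tied-weight encoder: project onto each feature, subtract the bias, apply ReLU."""
    return np.maximum(features @ x - threshold, 0.0)

def count_errors(k):
    """Return (latents firing outside the true support, true features that did not fire)."""
    x, support = sample_input(k)
    active = set(np.flatnonzero(encode(x)).tolist())
    return len(active - support), len(support - active)

# Sparse input: the regime the threshold was chosen for.
in_dist_spurious, in_dist_missed = count_errors(k=4)
# Dense input: violates the sparsity the encoder implicitly relies on.
off_dist_spurious, off_dist_missed = count_errors(k=200)

print("sparse input: spurious =", in_dist_spurious, "missed =", in_dist_missed)
print("dense input:  spurious =", off_dist_spurious, "missed =", off_dist_missed)
```

In this toy setup the threshold sits well above the interference between nearly orthogonal directions when only a few features are present, so the sparse input should be decoded cleanly. With the dense input the interference grows past the threshold and many latents fire for features that are not there, which is the kind of failure I have in mind, whatever its actual cause in a real model.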