and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space)
However, interpretable concepts do seem to be fairly well localized in VAE-space, and shards are likely to be concentrated where the concepts they are relevant to are found.
What do you mean? Do you have a link?
There are probably a dozen or more articles on this by now. Search for VAE or Variational Auto-Encoder in the context of mechanistic interpretability. The seminal paper on this was from Anthropic.
I don’t immediately find it, do you have a link?
I think @RogerDearnaley means Sparse Autoencoders (SAEs), see for example these papers and the SAE tag on LessWrong.
As I mentioned in my other comment, SAEs find features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.
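For readers unfamiliar with the technique being discussed: a sparse autoencoder is trained to reconstruct a model's internal activations through an overcomplete feature layer with a sparsity penalty, so that each learned feature tends to fire for a narrow, often interpretable concept. Here is a minimal PyTorch sketch of the idea; the dimensions, the `l1_coeff` value, and the random stand-in activations are illustrative placeholders, not values from any particular paper:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: maps model activations into an overcomplete,
    sparse feature basis and back again."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features is typically several times d_model (an overcomplete dictionary)
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty in the
        # training loss below pushes most of them to exactly zero (sparsity)
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

# Toy training step: reconstruct activations while penalizing feature density.
# d_model=512, d_features=4096, and l1_coeff=1e-3 are assumed placeholder values.
d_model, d_features, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, d_model)  # stand-in for real LLM activations
recon, feats = sae(activations)
loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The point of the sparsity penalty is that only a handful of features activate on any given input, which is what makes it plausible to ask whether an individual feature is "localized" around a single interpretable concept.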