I'm quite interested in understanding how the naming of these features will scale as we increase the number of transformer layers and the size of the sparse autoencoder. It seems the search space for the large model doing the autointerpretation would become massive. Intuitively, might this hurt the reliability of the short descriptions the autointerpretability model generates? And if the model being analyzed (along with its sparse autoencoder) is sufficiently larger than the autointerpreting model, how would that affect the reliability of the generated descriptions?
I'm interested in answering these questions because I'm trying to understand both the utility of learning a massive dictionary and the feasibility of autogenerating descriptions for its features (an engineering problem).