RGRGRG comments on Transcoders enable fine-grained interpretable circuit analysis for language models

RGRGRG 4 May 2024 18:23 UTC
1 point
A question about the “rules of the game” you present: are you allowed to simply look at the layer 0 transcoder features for the final 10 tokens? You could probably roughly estimate the input string from these features’ top activators. From your case study, it seems that you effectively look at layer 0 transcoder features for a few of the final tokens through a backwards search, but I wonder whether you could skip the search and simply look at the transcoder features directly. Thank you.