tdooms comments on Tokenized SAEs: Infusing per-token biases.

tdooms 11 Aug 2024 8:23 UTC
2 points
0
That’s awesome to hear. We are not especially familiar with circuit analysis, but anecdotally we’ve heard that some circuit features are very disappointing: the “Mary” feature for IOI is one example, and I believe the same holds for Othello SAEs, where many features just describe the last move. This was a partial motivation for this work.

About similar tokenized features, maybe I’m misunderstanding, but this seems like a problem for any decoder-like structure. In the lookup table, though, I think this behaviour is somewhat attenuated by the strict manual trigger, which encourages the lookup table to learn exact features instead of means.
- Logan Riggs 12 Aug 2024 14:38 UTC
  3 points
  0
  About similar tokenized features, maybe I’m misunderstanding, but this seems like a problem for any decoder-like structure.
  I didn’t mean to imply it’s a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim=1 in the tokenized-feature set, and we find a downstream feature reading from the “ 9” token on a specific task, then we should conclude it’s reading from a more general number direction rather than a specific-number direction.
  I agree this argument also applies to the normal SAE decoder (if the cos-sim=1).
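
The cos-sim check discussed above could be sketched roughly as follows. This is a toy illustration, not code from the post: the `lookup` array stands in for a per-token lookup table at some layer, the token ids are hypothetical, and the 0.99 threshold is an arbitrary stand-in for "cos-sim=1".

```python
import numpy as np


def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T


def shares_direction(lookup, token_ids, threshold=0.99):
    """True if every pair of the selected lookup rows has cos-sim >= threshold."""
    sims = cosine_similarity_matrix(lookup[token_ids])
    return bool(np.all(sims >= threshold))


# Toy lookup table: 20 tokens, 16-dimensional rows (both sizes are assumptions).
rng = np.random.default_rng(0)
d_model = 16
lookup = rng.normal(size=(20, d_model))

# Hypothetical ids for number tokens; their rows are rescaled copies of one
# shared "number direction", so they differ only in magnitude.
number_direction = rng.normal(size=d_model)
number_token_ids = np.array([3, 4, 5])
for scale, tok in zip([0.5, 1.0, 2.0], number_token_ids):
    lookup[tok] = scale * number_direction

# Unrelated tokens keep their random rows.
other_token_ids = np.array([0, 1, 2])

print(shares_direction(lookup, number_token_ids))
print(shares_direction(lookup, other_token_ids))
```

If the number-token rows pass this check, a downstream feature that appears to read from the “ 9” row alone cannot be distinguished from one reading the shared number direction. The same check could be run on rows of an ordinary SAE decoder.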