That’s awesome to hear. While we are not especially familiar with circuit analysis, we’ve heard anecdotally that some circuit features are very disappointing (such as the “Mary” feature for IOI; I believe this is also the case in Othello SAEs, where many features just describe the last move). This was a partial motivation for this work.
About similar tokenized features, maybe I’m misunderstanding, but this seems like a problem for any decoder-like structure. In the lookup table, though, I think this behaviour is somewhat attenuated by the strict manual trigger, which encourages the lookup table to learn exact features instead of means.
About similar tokenized features, maybe I’m misunderstanding, but this seems like a problem for any decoder-like structure.
I didn’t mean to imply it’s a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim=1 in the tokenized-feature set, and we find a downstream feature reading from the “ 9” token on a specific task, then we should conclude it’s reading from a more general number direction rather than a specific number direction.
I agree this argument also applies to the normal SAE decoder (if the cos-sim=1).
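To make the cos-sim point concrete, here is a rough sketch of the check I have in mind. Everything in it is a made-up assumption for illustration: the `lookup_table` array, the token list, and the dimensions are toy values, not the actual tokenized-feature table from the work. The number tokens are built as scaled copies of one shared direction, so their pairwise cos-sim is 1 and a downstream read from “ 9” cannot be told apart from a read from the general number direction.

```python
import numpy as np

def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T

rng = np.random.default_rng(0)
d_model = 16  # toy residual width (assumption)

# Hypothetical tokenized-feature lookup table: one row per token.
# The three number tokens share a single direction and differ only in norm.
number_direction = rng.normal(size=d_model)
tokens = [" 1", " 5", " 9", " cat"]
lookup_table = np.stack([
    1.0 * number_direction,
    2.0 * number_direction,
    3.0 * number_direction,
    rng.normal(size=d_model),  # unrelated token
])

sims = cosine_similarity_matrix(lookup_table)

# If every number-number entry is 1, the " 9" row carries no
# number-specific direction, only the shared one.
number_block = sims[:3, :3]
all_numbers_collinear = bool(np.allclose(number_block, 1.0))
```

If `all_numbers_collinear` is true on a real table, I would read any downstream feature attributed to “ 9” as reading from the shared number direction. The same check could be run on normal SAE decoder rows.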