About similar tokenized features, maybe I’m misunderstanding, but this seems like a problem for any decoder-like structure.
I didn’t mean to imply it’s a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim = 1 in the tokenized-feature set, and we then find a downstream feature reading from the ” 9″ token on a specific task, we should conclude it’s reading a more general number direction rather than a specific number direction.
I agree this argument also applies to the normal SAE decoder (if the cos-sim = 1).
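To make the interpretation point concrete, here is a minimal sketch of the check being described: compute pairwise cosine similarities between decoder directions for the digit tokens. The setup (random vectors, ten identical rows standing in for the ” 0″–” 9″ tokenized features) is entirely hypothetical, constructed so that the cos-sim = 1 case above holds by design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one decoder direction per digit token (" 0" ... " 9").
# All ten rows share a single underlying "number" direction, mimicking the
# cos-sim = 1 case discussed above.
d_model = 64
number_direction = rng.normal(size=d_model)
decoder = np.tile(number_direction, (10, 1))  # rows = digit-token features

# Pairwise cosine similarity between decoder rows.
unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
cos_sim = unit @ unit.T

# If every entry is ~1, a downstream feature that appears to read from the
# " 9" direction is really reading the shared number direction.
print(np.allclose(cos_sim, 1.0))  # → True by construction here
```

In a real model the off-diagonal similarities would be below 1, and the closer they are to 1, the more the "general number direction" reading is warranted over the "specific digit" reading.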