Double thanks for the extended discussion and ideas! Also interested to see what happens.
We earlier created some SAEs that completely remove the unigram directions from the encoder (e.g. old/gpt2_resid_pre_8_t8.pt).
However, a " Golden Gate Bridge" feature individually activates on " Golden" (plus prior context), " Gate" (plus prior context), and " Bridge" (plus prior context). Without the last-token/unigram directions, such features tended not to activate directly, which complicated interpretability.
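For concreteness, here is a minimal sketch (not our actual code) of one way "removing the unigram directions from the encoder" could be implemented: project each encoder row onto the orthogonal complement of a given set of unigram directions. The names `W_enc` and `unigram_dirs` are hypothetical, and this assumes the set of directions is low-rank, since the full token-embedding matrix can span the entire residual stream.

```python
import torch

def remove_unigram_directions(W_enc: torch.Tensor,
                              unigram_dirs: torch.Tensor) -> torch.Tensor:
    """Project SAE encoder weights onto the orthogonal complement of
    the unigram directions.

    W_enc:        (n_features, d_model) encoder weight matrix (hypothetical name).
    unigram_dirs: (n_dirs, d_model) directions to remove, e.g. per-token
                  embedding contributions to the residual stream.
    """
    # Orthonormal basis for the unigram subspace (reduced QR).
    Q, _ = torch.linalg.qr(unigram_dirs.T)  # (d_model, n_dirs)
    # Subtract each encoder row's component lying in that subspace.
    return W_enc - (W_enc @ Q) @ Q.T
```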