Maybe this should be like Anthropic’s shared decoder bias? Essentially subtract off the per-token bias at the beginning, let the SAE reconstruct this “residual”, then add the per-token bias back to the reconstructed x.
The motivation is that the SAE has a weird job in this case. It sees x, but needs to reconstruct x minus the per-token bias, which means it has to somehow learn what that per-token bias is during training.
However, if you subtract it first, then the SAE sees x’ and only needs to reconstruct x’.
So I’m just suggesting changing f(x) here:
$$f(x) = \sigma\left(W_{\text{enc}}(x - b_{\text{dec}})\right) \;\longrightarrow\; \sigma\left(W_{\text{enc}}(x - b_{\text{dec}} - W_{\text{lookup}}(t))\right)$$
w/ $\hat{x}$ remaining the same:
$$\hat{x} = W_{\text{dec}}(f(x)) + b_{\text{dec}} + W_{\text{lookup}}(t)$$
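For concreteness, here is a minimal PyTorch sketch of what that change would look like, assuming a standard ReLU SAE. The class and variable names (PerTokenBiasSAE, W_lookup, etc.) are placeholders of mine rather than the actual code from the repo, and the encoder bias is omitted to match the formula above.

```python
import torch
import torch.nn as nn

class PerTokenBiasSAE(nn.Module):
    """Sketch of an SAE with a per-token (unigram) lookup bias.

    Encoder:  f(x)  = ReLU(W_enc (x - b_dec - W_lookup[t]))
    Decoder:  x_hat = W_dec f(x) + b_dec + W_lookup[t]
    """

    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # One learned bias vector per token id: the per-token / unigram bias.
        self.W_lookup = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor, tokens: torch.Tensor):
        # x: [batch, d_model] residual-stream activations
        # tokens: [batch] ids of the current (last) token at each position
        per_token_bias = self.W_lookup(tokens)              # [batch, d_model]
        # Proposed change: subtract the per-token bias before encoding,
        # so the encoder only has to explain the residual x'.
        f = torch.relu((x - self.b_dec - per_token_bias) @ self.W_enc)
        # Decoder: add both biases back onto the reconstruction.
        x_hat = f @ self.W_dec + self.b_dec + per_token_bias
        return f, x_hat
```

The training loss would presumably stay the same as for the baseline per-token-bias SAE (reconstruction error on x plus a sparsity penalty on f); the only difference is what the encoder gets to see.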
We hadn’t considered this, since our idea was that the encoder might be able to use the full information to better predict features. However, this seems worthwhile to at least try. I’ll look into this soon, thanks for the inspiration.
Double thanks for the extended discussion and ideas! Also interested to see what happens.
We earlier created some SAEs that completely remove the unigram directions from the encoder (e.g. old/gpt2_resid_pre_8_t8.pt). However, a " Golden Gate Bridge" feature individually activates on " Golden" (plus prior context), " Gate" (plus prior), and " Bridge" (plus prior). Without the last-token/unigram directions these tended not to activate directly, complicating interpretability.