Maybe this should be like Anthropic’s shared decoder bias? Essentially subtract off the per-token bias at the beginning, let the SAE reconstruct this “residual”, then add the per-token bias back to the reconstructed x.
The motivation is that the SAE has a weird job in this case. It sees x, but needs to reconstruct x minus the per-token bias, which means it has to somehow learn what that per-token bias is during training.
However, if you subtract it first, then the SAE sees x’ and only needs to reconstruct x’.
So I’m just suggesting changing f(x) here:
$$f(x) = \sigma\left(W_{\text{enc}}(x - b_{\text{dec}})\right) \;\longrightarrow\; \sigma\left(W_{\text{enc}}(x - b_{\text{dec}} - W_{\text{lookup}}(t))\right)$$
w/ $\hat{x}$ remaining the same:
$$\hat{x} = W_{\text{dec}}(f(x)) + b_{\text{dec}} + W_{\text{lookup}}(t)$$
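For concreteness, here is a minimal PyTorch sketch of what that change would look like, assuming a standard ReLU SAE. The class and variable names (PerTokenBiasSAE, W_lookup, etc.) are placeholders of mine rather than the actual code from the repo, and the encoder bias is omitted to match the formula above.

```python
import torch
import torch.nn as nn

class PerTokenBiasSAE(nn.Module):
    """Sketch of an SAE with a per-token (unigram) lookup bias.

    Encoder:  f(x)  = ReLU(W_enc (x - b_dec - W_lookup[t]))
    Decoder:  x_hat = W_dec f(x) + b_dec + W_lookup[t]
    """

    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # One learned bias vector per token id: the per-token / unigram bias.
        self.W_lookup = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor, tokens: torch.Tensor):
        # x: [batch, d_model] residual-stream activations
        # tokens: [batch] ids of the current (last) token at each position
        per_token_bias = self.W_lookup(tokens)              # [batch, d_model]
        # Proposed change: subtract the per-token bias before encoding,
        # so the encoder only has to explain the residual x'.
        f = torch.relu((x - self.b_dec - per_token_bias) @ self.W_enc)
        # Decoder: add both biases back onto the reconstruction.
        x_hat = f @ self.W_dec + self.b_dec + per_token_bias
        return f, x_hat
```

The training loss would presumably stay the same as for the baseline per-token-bias SAE (reconstruction error on x plus a sparsity penalty on f); the only difference is what the encoder gets to see.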
We hadn’t considered this, since our idea was that the encoder might be able to use the full information to better predict features. However, this seems worthwhile to at least try. I’ll look into this soon, thanks for the inspiration.
Double thanks for the extended discussion and ideas! Also interested to see what happens.
We earlier created some SAEs that completely remove the unigram directions from the encoder (e.g. old/gpt2_resid_pre_8_t8.pt). However, a " Golden Gate Bridge" feature individually activates on " Golden" (plus prior context), " Gate" (plus prior), and " Bridge" (plus prior). Without the last-token/unigram directions these tended not to activate directly, complicating interpretability.