Here’s my longer reply.
I’m extremely excited by the work on SAEs and their potential for interpretability. However, I think there is a subtle misalignment between the SAE architecture and loss function, on the one hand, and the actual desired objective function, on the other.
The SAE loss function is:
$$\mathcal{L}(x; W_{dec}, b_{dec}) = \mathbb{E}_x\left[\lVert x - \hat{x}\rVert^2 + \lambda\lVert f(x)\rVert_1\right],$$
where $\lVert f(x)\rVert_1 = \sum_i f_i(x)$ is the $\ell_1$-norm, or equivalently
$$\mathcal{L}(x) = \mathbb{E}_x\left[\lVert x - W_{dec}f(x) - b_{dec}\rVert^2 + \lambda\lVert f(x)\rVert_1\right].$$
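In code, this loss can be sketched roughly as follows (the toy dimensions, random initialization, and ReLU encoder here are my own illustrative assumptions, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy dimensions: d-dim activations, m-dim overcomplete code (m > d).
d, m, batch, lam = 16, 64, 8, 0.01

W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
b_enc = np.zeros(m)
W_dec = rng.normal(size=(d, m)) / np.sqrt(m)
b_dec = np.zeros(d)

x = rng.normal(size=(batch, d))

# Single-layer encoder: f(x) = ReLU(x W_enc^T + b_enc); decoder: x_hat = f W_dec^T + b_dec
f = np.maximum(0.0, x @ W_enc.T + b_enc)
x_hat = f @ W_dec.T + b_dec

# L = E_x[ ||x - x_hat||^2 + lam * ||f(x)||_1 ]
# Since f >= 0 after the ReLU, ||f||_1 = sum_i f_i, matching the definition above.
loss = np.mean(np.sum((x - x_hat) ** 2, axis=1) + lam * np.sum(f, axis=1))
```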
I would argue, however, that what you are actually trying to solve is the sparse coding problem:
$$\mathcal{L}(x; W_{dec}, b_{dec}) = \mathbb{E}_x\left[\min_f \lVert x - W_{dec}f - b_{dec}\rVert^2 + \lambda\lVert f\rVert_1\right],$$
where, importantly, the inner optimization is solved separately (including at runtime).
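For concreteness, the inner minimization can be solved at runtime with a standard iterative solver such as ISTA (iterative shrinkage-thresholding). The sketch below is my own illustration of that inner problem, not the post's code; a production solver (e.g. FISTA) would converge faster:

```python
import numpy as np

def ista(x, W_dec, b_dec, lam=0.01, n_steps=500):
    """Approximately solve min_f ||x - W_dec f - b_dec||^2 + lam * ||f||_1 via ISTA."""
    m = W_dec.shape[1]
    f = np.zeros(m)
    # Step size 1/L, where L is the Lipschitz constant of the quadratic term's gradient.
    L = 2.0 * np.linalg.norm(W_dec, 2) ** 2
    for _ in range(n_steps):
        r = W_dec @ f + b_dec - x                 # residual
        g = 2.0 * W_dec.T @ r                     # gradient of the quadratic term
        z = f - g / L                             # gradient step
        f = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold (prox of l1)
    return f

rng = np.random.default_rng(0)
d, m = 16, 64
W_dec = rng.normal(size=(d, m)) / np.sqrt(m)
b_dec = np.zeros(d)

# Plant a 3-sparse code and check that ISTA recovers a sparse approximation of it.
f_true = np.zeros(m)
f_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x = W_dec @ f_true + b_dec

f_star = ista(x, W_dec, b_dec, lam=0.01, n_steps=500)
```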
Since $f$ is a code over an overcomplete basis, finding the $f^*$ that minimizes the inner loop (a problem also known as basis pursuit denoising [1]) is notoriously challenging, and a single-layer encoder is underpowered to compute it. The SAE’s encoder therefore introduces a significant error term $\tilde{f}_{enc}$, which means that your actual loss function is:
$$\mathcal{L}(x; \Theta) = \mathbb{E}_x\left[\lVert x - W_{dec}(f^* + \tilde{f}_{enc}) - b_{dec}\rVert^2 + \lambda\lVert f^* + \tilde{f}_{enc}\rVert_1\right]$$
The magnitude of this error would have to be determined empirically, but I suspect it is large enough to be a significant source of reconstruction error.
There are a few things you could do to reduce the error:
- Ensuring that $W_{dec}$ obeys the restricted isometry property [2] (i.e., a cap on the cosine similarity of decoder weights), or, barring that, adding a term to your loss function that at least penalizes large cosine similarities.
- Adding extra layers to your encoder, so that it is better at solving for $f^*$.
- Running empirical studies to see how large the feature error is and how much reconstruction error it adds.
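As a sketch of the first suggestion, one could monitor the decoder’s maximum pairwise cosine similarity (its mutual coherence) and add a soft penalty pushing it down. The function names and penalty form below are my own illustrative choices, not anything from the post:

```python
import numpy as np

def max_offdiag_cosine(W_dec):
    """Largest pairwise |cosine similarity| among decoder columns (dictionary atoms)."""
    W = W_dec / np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm atoms
    G = W.T @ W                                               # Gram matrix
    np.fill_diagonal(G, 0.0)                                  # ignore self-similarity
    return np.max(np.abs(G))

def coherence_penalty(W_dec):
    """Auxiliary loss term: sum of squared off-diagonal cosine similarities.

    Driving this toward zero encourages near-orthogonal atoms, a soft proxy for
    the restricted isometry property (illustrative, not a formal RIP guarantee).
    """
    W = W_dec / np.linalg.norm(W_dec, axis=0, keepdims=True)
    G = W.T @ W
    off = G - np.eye(G.shape[0])
    return np.sum(off ** 2)

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(16, 64))  # toy decoder: 16-dim activations, 64 atoms
mu = max_offdiag_cosine(W_dec)
pen = coherence_penalty(W_dec)
```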
[1] Chen, Donoho & Saunders, “Atomic Decomposition by Basis Pursuit”: https://epubs.siam.org/doi/abs/10.1137/S003614450037906X
[2] Candès, “The restricted isometry property and its implications for compressed sensing”: http://www.numdam.org/item/10.1016/j.crma.2008.03.014.pdf
Interesting! You might be interested in a post from my team on inference-time optimization.
It’s not clear to me what the right call here is though, because you want $f$ to be something the model could extract. The encoder being so simple is in some ways a feature, not a bug—I wouldn’t want it to be, e.g., a deep model, because the LLM can’t easily extract that!
Thanks for sharing that study. It looks like your team is already well-versed in this subject!
You wouldn’t want something that’s too hard to extract, but I think restricting yourself to a single encoder layer is too conservative—LLMs don’t have to be able to fully extract the information from a layer in a single step.
I’d be curious to see how much closer a two-layer encoder would get to the ITO results.
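A minimal sketch of such a two-layer encoder (the widths, ReLU nonlinearities, and initialization here are illustrative assumptions on my part, not anything from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 64  # toy dimensions: d-dim activations, m-dim overcomplete code

# Hypothetical two-layer encoder: one extra hidden ReLU layer before the
# feature layer, giving the encoder one more "step" of computation with
# which to approximate the sparse code.
W1 = rng.normal(size=(4 * d, d)) / np.sqrt(d)
W2 = rng.normal(size=(m, 4 * d)) / np.sqrt(4 * d)

def encode_two_layer(x):
    h = np.maximum(0.0, x @ W1.T)    # hidden layer
    return np.maximum(0.0, h @ W2.T)  # nonnegative feature activations

x = rng.normal(size=(8, d))
f = encode_two_layer(x)
```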