Noa Nabeshima comments on Matryoshka Sparse Autoencoders

Noa Nabeshima 16 Dec 2024 9:39 UTC
LW: 5 AF: 3
0
AF
Even with all possible prefixes included in every batch the toy model learns the same small mixing between parent and children (this was best out of 2, for the first run the matryoshka didn’t represent one of the features): https://sparselatents.com/matryoshka_toy_all_prefixes.png

Here’s a hypothesis that could explain most of this mixing. If the hypothesis is true, then even if every possible prefix is included in every batch, there will still be mixing.

Hypothesis:
Regardless of the number of prefixes, there will be some prefix loss terms where
1. a parent and child feature are active
2. the parent latent is included in the prefix
3. the child latent isn’t included in the prefix.

The MSE loss in these prefix loss terms is pretty large because the child feature isn’t represented at all. This nudges the parent to slightly represent all of its children a bit.

To compensate for this, if a child feature is active and the child latent is included the prefix, it undoes the parent decoder vector’s contribution to the features of the parent’s other children.

This could explain these weird properties of the heatmap:
- Parent decoder vector has small positive cosine similarity with child features
- Child decoder vectors have small negative cosine similarity with other child features

Still unexplained by this hypothesis:
- Child decoder vectors have very small negative cosine similarity with the parent feature.
- chanind 26 Dec 2024 0:16 UTC
  LW: 15 AF: 7
  0
  AF Parent
  I tried digging into this some more and think I have an idea what’s going on. As I understand it, the base assumption for why Matryoshka SAE should solve absorption is that a narrow SAE should perfectly reconstruct parent features in a hierarchy, so then absorption patterns can’t arise between child and parent features. However, it seems like this assumption is not correct: narrow SAEs sill learn messed up latents when there’s co-occurrence between parent and child features in a hierarchy, and this messes up what the Matryoshka SAE learns.
  
  I did this investigation in the following colab: https://colab.research.google.com/drive/1sG64FMQQcRBCNGNzRMcyDyP4M-Sv-nQA?usp=sharing
  
  Apologies for the long comment, this might make more sense as its own post potentially. I’m curious to get others thoughts on this—it’s also possible I’m doing something dumb.
  The problem: Matryoshka latents don’t perfectly match true features
  In the post, the Matryoshka latents seem to have the following problematic properties:
  - The latent tracking a parent feature contains components of child features
  - The latents tracking child features have negative components of each other child feature
  The setup: simplified hierarchical features
  I tried to investigate this using a simpler version of the setup in this post, focusing on a single parent/child relationship between latents. This is like a zoomed-in version on a single set of parent/child features. Our setup has 3 true features in a hierarchy as below:
```
feat 0 - parent_feature (p=0.3, mutually exclusive children)
feat 1 - ├── child_feature_1 (p=0.4)
feat 2 - └── child_feature_2 (p=0.4)
```
  These features have higher firing probabilities compared to the setup in the original post to make the trends highlighted more obvious. All features fire with magnitude 1.0 and have a 20d representation with no superposition (all features are mutually orthogonal).
  Simplified Matryoshka SAE
  I used a simpler Matryoshka SAE that doesn’t use feature sampling or reshuffling of latents and doesn’t take the log of losses. Since we already know the hierarchy of the underlying features in this setup, I just used a Matryoshka SAE with a single inner SAE width of 1 latent to track the 1 parent feature, and the outer SAE width of 3 to match the number of true features. So the Matryoshka SAE sizes are as below:
```
size 0: latents [0]
size 1: latents [0, 1, 2]
```
  The cosine similarities between the encoder and decoder of the Matryoshka SAE and the true features is shown below:
  The Matryoshka decoder matches what we saw in the original post: the latent tracking the parent feature has positive cosine sim with the child features, and the latents tracking the child features have negative cosine sim with the other child feature. Our matryoshka inner SAE consisting of just latent 0 does track the parent feature as we expected though! What’s going on here? How is it possible for the inner Matryoshka SAE to represent a merged version of the parent and child features?
  Narrow SAEs do not correctly reconstruct parent features
  The core idea behind Matryoshka SAEs is that in a narrower SAE, the SAE should learn a clean representation of parent features despite co-ooccurrence with child features. Once we have a clean representation of a parent feature in a hierarchy, adding child latents to the SAE should not allow any absorption.
  Surprisingly, this assumption is incorrect: narrow SAEs merge child representations into the parent latent.
  I tried training a standard SAE with a single latent on our toy example, expecting that the 1-latent SAE would learn only the parent feature without any signs of absorption. Below is the plot of the cosine similarities between the SAE encoder and decoder with the true features.
  This single-latent SAE learns a representation that merges the representation of the child features into the parent latent, exactly how we saw in our Matryoshka SAE and in the original post’s results! Our narrow SAE is not learning the correct representation of feature 0 as we would hope. Instead, it’s learning feature 0 + weaker representations of child features 1 and 2.
  Why does this happen?
  Likely this reduces MSE loss compared with learning the actual correct representation of feature 0 on its own. When there’s fewer latents than features, the SAE always has to accept some MSE error, and this behavior of merging in some of the child features into the parent latent likely reduces MSE loss compared with learning the actual parent feature 0 on its own.
  What does this mean for Matryoshka SAEs?
  This issue should affect any Matryoshka SAE, since the base assumption underlying Matryoshka SAEs is that a narrow SAE will correctly represent general parent features without any issues due to co-occurrence from specific child features. Since that assumption is not correct, we should not expect a Matryoshka SAE to completely fix absorption issues. I would expect that the topk SAEs from https://www.lesswrong.com/posts/rKM9b6B2LqwSB5ToN/learning-multi-level-features-with-matryoshka-saes would also suffer from this problem, although I didn’t test that in this toy setting since topk SAEs are tricker to evaluate in toy settings (it’s not obvious what K to pick).
  It’s possible the issues shown in this toy setting are more extreme than in a real LLM since the firing probabilities of the features may be higher than many features in a real LLM. That said, it’s hard to say anything concrete about the firing probabilities of features in real LLMs since we have no ground truth data on true LLM features.
  What links here?
  - Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models by chanind (30 Dec 2024 22:50 UTC; 24 points)
  - Noa Nabeshima 28 Dec 2024 7:43 UTC
    LW: 3 AF: 3
    0
    AF Parent
    This is cool! I wonder if it can be fixed. I imagine it could be improved some amount by nudging the prefix distribution, but it doesn’t seem like that will solve it properly. Curious if this is a large issue in real LMs. It’s frustrating that there aren’t ground-truth features we have access to in language models.
    
    I think how large of a problem this is can probably be inferred from a description of the feature distribution. It’d be nice to have a better sense of what that distribution is (assuming the paradigm is correct enough).
    - chanind 28 Dec 2024 21:00 UTC
      1 point
      0
      Parent
      It might also be an artifact of using MSE loss. Maybe a different loss term for reconstruction loss might not have this problem?
  - Noa Nabeshima 30 Dec 2024 23:48 UTC
    LW: 2 AF: 2
    0
    AF Parent
    You know I was thinking ab this—say that there are two children and they’re orthogonal to the parent and each have probability 0.4 given the parent. If you imagine the space it looks like three clusters, two with probability 0.4, norm 1.4 and one with probability 0.2 and norm 1. They all have high cosine similarity with each other. From this frame, having the parent ‘include’ the children directions a bit doesn’t seem that inappropriate. One SAE latent setup that seems pretty reasonable is to actually have one parent latent that’s like “one of these three clusters is active” and three child latents pointing to each of the three clusters. The parent latent decoder in that setup would also include a bit of the child feature directions.
    
    This is all sketchy though. It doesn’t feel like we have a good answer to the question “How exactly do we want the SAEs to behave in various scenarios?”
    - chanind 31 Dec 2024 4:23 UTC
      1 point
      0
      Parent
      Yeah I think that’s right, the problem is that the SAE sees 3 very non-orthogonal inputs, and settles on something sort of between them (but skewed towards the parent). I don’t know how to get the SAE to exactly learn the parent only in these scenarios—I think if we can solve that then we should be in pretty good shape.
      This is all sketchy though. It doesn’t feel like we have a good answer to the question “How exactly do we want the SAEs to behave in various scenarios?”
      I do think the goal should be to get the SAE to learn the true underlying features, at least in these toy settings where we know what the true features are. If the SAEs we’re training can’t handle simple toy examples without superposition I don’t have a lot of faith that when we’re training SAEs on real LLM activations that the results are trustworthy.

Noa Nabeshima comments on Matryoshka Sparse Autoencoders

The problem: Matryoshka latents don’t perfectly match true features

The setup: simplified hierarchical features

Simplified Matryoshka SAE

Narrow SAEs do not correctly reconstruct parent features

Why does this happen?

What does this mean for Matryoshka SAEs?