Nice post! I’m one of the authors of the Engels et al. paper on circular features in LLMs, so I thought I’d share some additional details about our experiments that are relevant to this discussion.
To review, our paper finds circular representations of days of the week and months of the year within LLMs. These appear to reflect the cyclical structure of our calendar! These plots are for gpt-2-small; the interactive versions are in the Dropbox link below.
We found these by (1) performing graph-based clustering on SAE features, using the cosine similarity between decoder vectors as the similarity measure; (2) given a cluster of SAE features, identifying the tokens that activate any feature in the cluster and reconstructing the LLM’s activation vector on those tokens with the SAE, while only allowing the SAE features in the cluster to participate in the reconstruction; and (3) visualizing the reconstructed points along the top PCA components. The idea was that if the LLM contained some true higher-dimensional feature, multiple SAE features would need to participate together in reconstructing it, and we wanted to visualize that feature while removing all others from the activation vector. For finding such groups of SAE features, decoder-vector cosine similarity was what worked in practice. We also tried Jaccard similarity (capturing how frequently SAE features fire together), but it didn’t yield interesting clusters the way cosine similarity did in the experiments we ran.
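If it helps to make that concrete, here’s a rough numpy/scikit-learn sketch of steps (1)-(3), plus the cluster ranking I mention below. The variable names (`W_dec`, `b_dec`, `acts`) and most of the details are simplifications I’m making up for this comment, not our actual code:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import PCA

# Hypothetical inputs:
#   W_dec: (n_features, d_model) SAE decoder vectors
#   b_dec: (d_model,) SAE decoder bias (use zeros if your SAE has none)
#   acts:  (n_tokens, n_features) SAE feature activations on a token sample

# (1) Cluster SAE features by cosine similarity of their decoder vectors.
W_unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
cos_sim = W_unit @ W_unit.T
affinity = np.clip(cos_sim, 0.0, None)  # one simple way to get a non-negative affinity; not necessarily what we used
labels = SpectralClustering(n_clusters=1000, affinity="precomputed").fit_predict(affinity)

# Rank clusters by mean pairwise cosine similarity of their decoder vectors
# (we only made plots for the top 500 clusters under this ranking).
def cluster_score(c):
    idx = np.where(labels == c)[0]
    if len(idx) < 2:
        return -np.inf
    iu = np.triu_indices(len(idx), k=1)
    return cos_sim[np.ix_(idx, idx)][iu].mean()

top_clusters = sorted(range(1000), key=cluster_score, reverse=True)[:500]

# (2) For a given cluster, take the tokens where any of its features fire and
#     reconstruct the activation vector using only that cluster's features.
feat_idx = np.where(labels == top_clusters[0])[0]
token_idx = np.where(acts[:, feat_idx].max(axis=1) > 0)[0]
recon = acts[np.ix_(token_idx, feat_idx)] @ W_dec[feat_idx] + b_dec

# (3) Look at the restricted reconstructions along their top PCA components.
proj = PCA(n_components=4).fit_transform(recon)
```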
In practice, this required looking through thousands of panels of interactive PCA plots. Here’s a Dropbox link with all 500 of the gpt-2-small interactive plots that we looked at: https://www.dropbox.com/scl/fo/usyuem3x4o4l89zbtooqx/ALw2-ZWkRx_I9thXjduZdxE?rlkey=21xkkd6n8ez1n51sf0d773w9t&st=qpz5395r&dl=0 (note that I used `n_clusters=1000` with spectral clustering but only made plots for the top 500 clusters, ranked by mean pairwise cosine similarity of the SAE features within each cluster).

Here are the clusters that I thought might have interesting structure:
cluster67: numbers, PCA dim 2 is their value
cluster109: money amounts, PCA dim 1 might be related to cents and PCA dim 2 might be related to the dollar amount
cluster134: different number-related tokens like “000”, “million” vs. “billion”, etc.
cluster138: days of the week circle!!!
cluster157: years
cluster71: possible “left” vs. “right” direction
cluster180: “long” vs. “short”
cluster212: years, possible circular representation of the year within its century in PCA dims 2-3
cluster213: the “-” token between the two numbers of a range, ordered by the first number
cluster223: “up” vs. “down” direction
cluster251: months of the year!!
cluster285: PCA dim 1 is Republican vs. Democrat
You can hover over a point on each scatter plot to see some context and the token (in bold) above which the activation vector (the layer-7 residual stream) was taken.
Most clusters, however, don’t seem obviously interesting. We also looked at ~2000 Mistral-7B clusters, and only the days-of-the-week and months-of-the-year clusters seemed clearly interesting. So at least for the LLMs we looked at, the SAEs we had, and our method for discovering structure, interesting geometry didn’t seem ubiquitous. That said, it could just be that our methods are limited, or that the LLMs and/or SAEs we used weren’t large enough, or that there is interesting geometry that just isn’t obvious to us from PCA plots like the ones above.

All that said, I think you’re right that the basic picture from Toy Models of Superposition of features as independent, near-orthogonal directions is wrong, as discussed by Anthropic in their Towards Monosemanticity post, and efforts to understand this could be super important. As Joseph Bloom mentioned in his comment, understanding this better could inspire different SAE architectures that get around scaling issues we may already be running into.
What do you think of @jake_mendel’s point about the streetlight effect?
If the methodology involved looking at 2D slices of spaces with up to 5 dimensions, wasn’t the detection of multi-dimensional shapes necessarily biased toward shapes that a human can identify and flag in a 2D slice?
I really like your update to the superposition hypothesis from linear to multi-dimensional in your section 3, but I’ve had a growing suspicion that, especially if node multi-functionality and superposition are in play, the dimensionality of the data compression may be severely underestimated. If Llama is 4,096-dimensional on paper, but those nodes are actually superimposed, there could be structures living in effective spaces that are orders of magnitude higher-dimensional than the on-paper maximum.

So even if your revised version of the hypothesis is correct, it might be that the search space for meaningful structures was bounded well below the dimensionality where even the relatively ‘low’-dimensional composable shapes are actually primarily forming.
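For what it’s worth, here’s a quick toy numpy check (entirely my own, not from the post or the paper) of the premise that a 4,096-dimensional space can hold far more than 4,096 nearly orthogonal directions, which is what would let superposition pack in many more features than the on-paper dimensionality:

```python
import numpy as np

# Toy check: pack 5x more random directions than dimensions and see how
# close to orthogonal they stay.
rng = np.random.default_rng(0)
d, n = 4096, 20_000
V = rng.standard_normal((n, d)).astype(np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Compute pairwise cosine similarities in chunks to keep memory modest.
max_cos = 0.0
for start in range(0, n, 2000):
    block = V[start:start + 2000] @ V.T                   # (2000, n) cosines
    np.fill_diagonal(block[:, start:start + 2000], 0.0)   # ignore self-similarity
    max_cos = max(max_cos, float(np.abs(block).max()))

# Typically prints something around 0.1: every pair is within ~6 degrees of orthogonal.
print(f"max |cos| among {n} random directions in {d}-d: {max_cos:.3f}")
```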
I know that for myself, even with basic 4D geometry like a tesseract, if data clustered around the corners of the shape I’d only spot a small number of the possible 2D slices, and in at least one of those cases I might think what I was looking at was a circle rather than a tesseract: https://mathworld.wolfram.com/images/eps-gif/TesseractGraph_800.gif
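To make that concrete with a toy snippet of my own (not anything from the paper): one standard 2D projection of a tesseract’s 16 vertices lands them on two concentric octagons, which at a glance can read as roughly circular, while an axis-aligned view of the same vertices collapses them onto a plain square:

```python
import numpy as np
from itertools import product

# The 16 vertices of a tesseract (4-cube).
verts = np.array(list(product([-1.0, 1.0], repeat=4)))

# Axis-aligned view: projecting onto the first two coordinates collapses the
# 16 vertices onto the 4 corners of a square.
square_view = verts[:, :2]

# Coxeter-plane view: the same vertices spread out over two concentric
# octagons, which is the "circle-like" picture a 2D slice can suggest.
angles = (2 * np.arange(4) + 1) * np.pi / 8
u = np.cos(angles) / np.linalg.norm(np.cos(angles))
v = np.sin(angles) / np.linalg.norm(np.sin(angles))
octagon_view = verts @ np.stack([u, v], axis=1)

radii = np.unique(np.round(np.linalg.norm(octagon_view, axis=1), 3))
print(radii)  # two distinct radii -> two concentric octagons of 8 vertices each
```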
Do you think future work may be able to rely on automated multi-dimensional shape and cluster detection that explores shapes and spaces well beyond 4D, or will the difficulty of multi-dimensional pattern recognition remain a foundational obstacle for the foreseeable future?