But I was quietly surprised by how many features they were using in their sparse autoencoders (1M, 4M, and 34M, respectively). Assuming Claude Sonnet has the same architecture as GPT-3, its residual stream has dimension ~12K, so the feature ratios are 83x, 333x, and 2833x, respectively[1]. In contrast, my team largely used a feature ratio of 2x, and Anthropic’s previous work “primarily focus[ed] on a more modest 8× expansion”. It does make sense to look for a lot of features, but this seemed worth mentioning.
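To make those ratios concrete, here’s a quick back-of-the-envelope check (a minimal sketch; the ~12K residual width is the GPT-3-based guess above, not a published figure for Sonnet):

```python
# Rough expansion-ratio check, assuming a ~12K-dimensional residual stream
# (the GPT-3-like width guessed above; Sonnet's actual width isn't public).
d_model = 12_000
for n_features in (1_000_000, 4_000_000, 34_000_000):
    print(f"{n_features:>12,} features -> {n_features / d_model:,.0f}x expansion")
# ~83x, ~333x, ~2,833x
```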
There’s both theoretical work (i.e. this theory work) and empirical evidence (e.g. on memorization) suggesting that models seem to be able to “know” a number of things that scales roughly quadratically in the size of their residual stream.[1] My guess is Sonnet is closer to Llama-70b in size (residual stream dimension ~8.2k), so this naively suggests ~67M features, and also that 34M is reasonable.
Also worth noting that a lot of their 34M features were dead, so the number of actual features is quite a bit lower.
You might also expect the SAE to need on the order of the model’s parameter count in parameters to recover the features, so for a 70B model with residual stream width 8.2k you’d want ~8.5M (≈ 70B/8192) features.
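As a sketch of the two estimates in this comment (the 8.2k residual width and 70B parameter count are assumed Llama-70b-scale numbers, not Sonnet’s actual ones):

```python
# Two naive feature-count estimates at an assumed Llama-70b scale.
d_model = 8_192                          # assumed residual stream width
n_params = 70e9                          # assumed parameter count

quadratic_estimate = d_model ** 2        # "knows ~O(d^2) things" heuristic
per_param_estimate = n_params / d_model  # "SAE needs ~O(model params)" heuristic

print(f"d^2 estimate:      ~{quadratic_estimate / 1e6:.0f}M features")  # ~67M
print(f"params/d estimate: ~{per_param_estimate / 1e6:.1f}M features")  # ~8.5M
```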
It seems plausible to me that a 70b model stores ~6 billion bits of memorized information. Naively, you might think this requires around 500M features (supposing that each “feature” represents 12 bits, which is probably a bit optimistic).
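Spelled out (treating the ~6 billion bits and ~12 bits per feature as the rough assumptions they are):

```python
# Memorization-based estimate with the assumed numbers from this comment.
memorized_bits = 6e9     # assumed total memorized information
bits_per_feature = 12    # optimistic, as noted above
print(f"~{memorized_bits / bits_per_feature / 1e6:.0f}M features")  # ~500M
```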
I don’t think SAEs will actually work at this level of sparsity though, so this is mostly beside the point.
I’m pretty skeptical of a view like “scale up SAEs and get all the features”. (If you want “feature” to mean something.)
A 70b model storing ~6 billion bits of pure memorized info seems quite reasonable to me, maybe a bit high. My guess is there’s a lot more structure to the world that models exploit to “know” more things with fewer memorized bits, but this is a pretty low-confidence take (and perhaps we disagree on what “memorized info” means here). That being said, SAEs as currently conceived/evaluated won’t be able to find/respect a lot of that structure, so maybe 500M features is also reasonable.
I don’t think SAEs will actually work at this level of sparsity though, so this is mostly beside the point.
I agree that SAEs don’t work at this level of sparsity, and I’m skeptical of the view myself. But from a “scale up SAEs to get all the features” perspective, it sure seems pretty plausible to me that you need a lot more features than people have typically looked for.
I also don’t think the Anthropic paper the OP is talking about has come close to the Pareto frontier of size <> sparsity <> trainability.