A 70b-parameter model storing 6b bits of pure memorized info seems quite reasonable to me, maybe a bit high. My guess is there’s a lot more structure to the world that the models exploit to “know” more things with fewer memorized bits, but this is a pretty low confidence take (and perhaps we disagree on what “memorized info” means here). That being said, SAEs as currently conceived/evaluated won’t be able to find/respect a lot of that structure, so maybe 500M features is also reasonable.
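For a rough sense of what those figures imply, here's a back-of-envelope sketch. All numbers are the hypotheticals from the comment above (70b parameters, 6b memorized bits, 500M features), not measured values:

```python
# Back-of-envelope arithmetic on the comment's hypothetical figures.
params = 70e9          # "70b" model parameters
memorized_bits = 6e9   # "6b bits of pure memorized info"
features = 500e6       # "500M features"

bits_per_param = memorized_bits / params      # ≈ 0.086 bits/parameter
bits_per_feature = memorized_bits / features  # 12 bits/feature

print(f"{bits_per_param:.3f} memorized bits per parameter")
print(f"{bits_per_feature:.1f} memorized bits per feature")
```

Under these assumptions each parameter carries well under a tenth of a bit of pure memorization, which is one way the "maybe a bit high" intuition could cash out.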
I don’t think SAEs will actually work at this level of sparsity though, so this is mostly beside the point.
I agree that SAEs don’t work at this level of sparsity and I’m skeptical of the view myself. But from a “scale up SAEs to get all features” perspective, it sure seems pretty plausible to me that you need a lot more features than people used to look at.
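To make "this level of sparsity" concrete, here's a minimal top-k sparse autoencoder forward pass. The dimensions are purely illustrative (a real 500M-feature SAE would be far larger), and the top-k activation is just one common way of enforcing sparsity, not anything from the papers under discussion:

```python
import numpy as np

# Minimal top-k SAE forward pass. Dimensions are illustrative only.
rng = np.random.default_rng(0)
d_model, n_features, k = 512, 16384, 32  # k active out of n_features

# Random encoder/decoder weights stand in for trained ones.
W_enc = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.standard_normal((n_features, d_model)) / np.sqrt(n_features)

def sae_forward(x):
    """Encode, keep only the k largest activations, decode."""
    acts = np.maximum(x @ W_enc, 0.0)   # ReLU pre-activations
    acts[np.argsort(acts)[:-k]] = 0.0   # zero everything outside top-k
    return acts, acts @ W_dec

x = rng.standard_normal(d_model)
acts, x_hat = sae_forward(x)
print(f"L0 = {np.count_nonzero(acts)} of {n_features} features active")
```

Scaling the dictionary from thousands to hundreds of millions of features while keeping the active set small is exactly the regime where, per the thread, current SAEs seem unlikely to train well.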
I also don’t think the Anthropic paper OP is talking about has come close to the Pareto frontier of size <> sparsity <> trainability.