It seems plausible to me that a 70b model stores ~6 billion bits of memorized information.
Naively, you might think this requires around 500M features (supposing that each “feature” represents 12 bits, which is probably a bit optimistic).
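The arithmetic behind that estimate can be sketched in a few lines; note that both numbers are rough guesses from the comment above (6 billion memorized bits, ~12 bits per feature), not measured values:

```python
# Back-of-envelope estimate: how many "features" would be needed to store
# the guessed amount of memorized information, assuming each feature
# carries a fixed number of bits. Both inputs are speculative.
memorized_bits = 6e9      # assumed total memorized info in a 70b model
bits_per_feature = 12     # assumed bits per "feature" (probably optimistic)

n_features = memorized_bits / bits_per_feature
print(f"~{n_features:.0e} features")  # ~5e+08, i.e. ~500M
```

If each feature carried fewer bits than assumed, the required feature count would be correspondingly larger.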
I don’t think SAEs will actually work at this level of sparsity, though, so this is mostly beside the point.
I’m pretty skeptical of a view like “scale up SAEs and get all the features” (if you wanted “feature” to mean something).
70b storing 6b bits of pure memorized info seems quite reasonable to me, maybe a bit high. My guess is there’s a lot more structure to the world that the models exploit to “know” more things with fewer memorized bits, but this is a pretty low confidence take (and perhaps we disagree on what “memorized info” means here). That being said, SAEs as currently conceived/evaluated won’t be able to find/respect a lot of the structure, so maybe 500M features is also reasonable.
I don’t think SAEs will actually work at this level of sparsity, though, so this is mostly beside the point.
I agree that SAEs don’t work at this level of sparsity, and I’m skeptical of the view myself. But from a “scale up SAEs to get all the features” perspective, it sure seems pretty plausible to me that you need a lot more features than people have typically looked at.
I also don’t think the Anthropic paper OP is talking about has come close to the Pareto frontier for size <> sparsity <> trainability.