Thanks Joel. I appreciated this. Wish I had time to write my own version of this. Alas.
Previously I’ve seen the rule of thumb “20-100 for most models”. Anthropic says:
We were saying this and I think this might be an area of debate in the community for a few reasons. It could be that the “true L0” is actually very high. It could be that low activating features aren’t contributing much to your reconstruction and so aren’t actually an issue in practice. It’s possible the right L1 or L0 is affected by model size, context length or other details which aren’t being accounted for in these debates. A thorough study examining post-hoc removal of low activating or low norm features could help. FWIW, it’s not obvious to me that L0 should be lower / higher and I think we should be careful not to cargo-cult the stat. Probably we’re not at too much risk here since we’re discussing this out in the open already.
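For what it’s worth, the post-hoc removal study could start from something as simple as the sketch below. It zeroes out feature activations below a threshold and reports how L0 and reconstruction error move. Everything here is an assumption for illustration: the data is synthetic (a few strong activations plus a lot of weak ones), the decoder is random, and the function names (`mean_l0`, `prune`, `sweep`, `toy_data`) are made up rather than taken from any SAE library. With a real SAE you would swap in its activations and decoder.

```python
import numpy as np

def mean_l0(acts):
    """Average number of nonzero feature activations per token."""
    return float((acts > 0).sum(axis=1).mean())

def prune(acts, threshold):
    """Zero out every activation that is not strictly above `threshold`."""
    return np.where(acts > threshold, acts, 0.0)

def recon_mse(x, acts, decoder):
    """Mean squared error between x and the linear reconstruction acts @ decoder."""
    return float(((x - acts @ decoder) ** 2).mean())

def sweep(x, acts, decoder, thresholds):
    """Return (threshold, mean L0, reconstruction MSE) for each threshold."""
    rows = []
    for t in thresholds:
        kept = prune(acts, t)
        rows.append((t, mean_l0(kept), recon_mse(x, kept, decoder)))
    return rows

def toy_data(seed=0, n_tokens=256, n_features=64, d_model=16):
    """Synthetic stand-in for SAE activations (an assumption, not real data)."""
    rng = np.random.default_rng(seed)
    decoder = rng.normal(size=(n_features, d_model))
    decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)
    shape = (n_tokens, n_features)
    # ~5% of entries fire strongly, ~50% fire weakly (values below 0.05).
    strong = (rng.random(shape) < 0.05) * rng.uniform(1.0, 5.0, shape)
    weak = (rng.random(shape) < 0.5) * rng.uniform(0.0, 0.05, shape)
    acts = strong + weak
    # Toy assumption: the unpruned SAE reconstructs its input exactly.
    x = acts @ decoder
    return x, acts, decoder

if __name__ == "__main__":
    x, acts, decoder = toy_data()
    for t, l0, err in sweep(x, acts, decoder, [0.0, 0.05, 0.5]):
        print(f"threshold={t:.2f}  L0={l0:.1f}  mse={err:.5f}")
```

In this toy setup, pruning the weak activations cuts L0 a lot while costing little reconstruction, which is the "low activating features aren’t an issue in practice" scenario. Whether real SAEs look like this is exactly the open question.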
Having multiple different-sized SAEs for the same model seems useful. The dashboard shows feature splitting clearly. I hadn’t ever thought of comparing features from different SAEs using cosine similarity and plotting them together with UMAP.
Different SAEs, same activations. Makes sense since it’s notionally the same vector space. Apollo did this recently when comparing e2e vs vanilla SAEs. I’d love someone to come up with better measures of UMAP quality, as the primary issue with these plots is the risk of arbitrariness.
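A minimal sketch of the cross-SAE comparison, assuming both SAEs were trained on the same activations so their decoder directions live in the same space. The decoders here are synthetic: the "large" one is just two noisy copies of each "small" direction, as a crude stand-in for feature splitting. The names `cross_sae_cosine`, `best_matches` and `toy_decoders` are hypothetical, not from any library.

```python
import numpy as np

def unit_rows(m):
    """Normalise each row (one decoder direction per row) to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def cross_sae_cosine(dec_a, dec_b):
    """Cosine similarity between every direction in dec_a and every one in dec_b."""
    return unit_rows(dec_a) @ unit_rows(dec_b).T

def best_matches(dec_a, dec_b):
    """For each direction in dec_a, the index and similarity of its nearest direction in dec_b."""
    sims = cross_sae_cosine(dec_a, dec_b)
    idx = sims.argmax(axis=1)
    return idx, sims[np.arange(len(idx)), idx]

def toy_decoders(seed=0, n_small=8, d_model=32, noise=0.1):
    """Synthetic decoders (an assumption): the large SAE holds two noisy copies of each small feature."""
    rng = np.random.default_rng(seed)
    small = rng.normal(size=(n_small, d_model))
    copies = [small + noise * rng.normal(size=small.shape) for _ in range(2)]
    large = np.concatenate(copies, axis=0)
    return small, large

if __name__ == "__main__":
    small, large = toy_decoders()
    idx, sim = best_matches(small, large)
    for i, (j, s) in enumerate(zip(idx, sim)):
        print(f"small feature {i} -> large feature {j} (cosine {s:.3f})")
```

The stacked unit-norm directions from both SAEs are the kind of thing you would then hand to a UMAP implementation with a cosine metric. I’ve left that step out to keep the sketch dependency-free, and because the projection’s hyperparameters are where the arbitrariness creeps in.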
Neither of these plots seems great. They both suggest to me that these SAEs are “leaky” in some sense at lower activation levels, but in opposite ways:
This could be bad. Could also be that the underlying information is messy and there’s interference or other weird things going on. Not obvious that it’s bad measurement as opposed to messy phenomena imo. Trying to distinguish the two seems valuable.
4. On Scaling
Yup. Training simultaneously could be good. It’s an engineering challenge. I would reimplement good proofs of concept that suggest this is feasible and how to do it. I’d also like to point out that this isn’t the first time a science has had this issue.
On some level I think this challenge directly parallels bioinformatics / gene sequencing. They needed a human genome project because it was expensive and ambitious and individual actors couldn’t do it on their own. But collaborating is hard. Maybe EA in particular can get the ball rolling here faster than it might otherwise. The NDIF / Bau Lab might also be a good banner to line up behind.
I didn’t notice many innovations here—it was mostly scaling pre-existing techniques to a larger model than I had seen previously. The good news is that this worked well. The bad news is that none of the old challenges have gone away.
Agreed. I think the point was basically scale. Criticisms along the lines of “this isn’t tackling the hard part of the problem or proving interp is actually useful” are unproductive if that wasn’t the intention. Anthropic has 3 teams now, and counting, doing this stuff. They’re definitely working on a bunch of harder / other stuff that maybe focuses on the real bottlenecks.