Thanks! I’m not sure. My guess is that if you train on a super narrow data distribution, you may be more likely to get an inconvenient level of “feature splitting”. Since there are only a few total concepts to learn, an SAE of equivalent width has much more capacity relative to the data, and might exploit that capacity to learn niche combinations of features (to reduce the sparsity loss).
Makes sense, thanks! In that case, we could potentially reduce the width, which might (along with a smaller dataset) help scale SAEs to understanding mechanisms in big models?
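
For concreteness, here’s a minimal sketch of the standard SAE objective being discussed (reconstruction loss plus an L1 sparsity penalty); the names `d_model`, `d_sae`, and `l1_coeff` are illustrative, not from any particular codebase. The “width” in this thread is `d_sae`, and the sparsity term is what a relatively wide SAE on a narrow distribution could game by learning rarely-firing composite features:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # d_sae is the SAE "width": the number of learned feature directions.
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # feature activations (sparse, nonnegative)
        x_hat = self.dec(f)           # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).mean()          # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()      # L1 penalty on feature activations
    return recon + l1_coeff * sparsity

# Example: a wide SAE on a narrow distribution with few underlying concepts.
# With d_sae >> the number of concepts, the L1 term can be reduced by
# dedicating features to niche concept combinations (feature splitting).
sae = SparseAutoencoder(d_model=512, d_sae=4096)
x = torch.randn(32, 512)                       # stand-in for model activations
x_hat, f = sae(x)
print(sae_loss(x, x_hat, f))
```

The suggestion in the reply above amounts to shrinking `d_sae` in step with the dataset, so the SAE never has enough spare capacity to split features in the first place.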