Great work! Is there such a thing as too narrow a dataset? For refusal, what do you think happens if we specifically train on a bunch of examples that show signs of refusal?
Thanks! I’m not sure. My guess is that if you go super narrow, it may be more likely to result in an inconvenient level of “feature splitting”. Since there are only a few total concepts to learn, an SAE of equivalent width might exploit its greater relative capacity to learn niche combinations of features (to reduce sparsity loss).
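For concreteness, here is a minimal sketch of the kind of objective being referred to: reconstruction error plus an L1 sparsity penalty, where "width" is the number of dictionary features. The dimensions, coefficient, and variable names are illustrative assumptions, not the actual training setup from the post.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, width: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, width)  # width = number of dictionary features
        self.decoder = nn.Linear(width, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(feats)
        return recon, feats


def sae_loss(recon, feats, acts, l1_coeff: float = 1e-3):
    # Reconstruction term: keep the SAE faithful to the model activations.
    recon_loss = (recon - acts).pow(2).mean()
    # Sparsity term: the "sparsity loss" mentioned above. On a very narrow
    # dataset (few underlying concepts) with unchanged width, the SAE has
    # spare capacity and can push this term down by splitting one concept
    # across several niche, rarely co-active features.
    sparsity_loss = feats.abs().sum(dim=-1).mean()
    return recon_loss + l1_coeff * sparsity_loss


# Illustrative usage: a wide SAE relative to the activation dimension.
sae = SparseAutoencoder(d_model=512, width=16384)
acts = torch.randn(8, 512)  # stand-in for residual-stream activations
recon, feats = sae(acts)
loss = sae_loss(recon, feats, acts)
```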
Makes sense! Thanks! In that case, we could potentially reduce the width, which might (along with a smaller dataset) help scale SAEs to understanding mechanisms in big models?