Thanks, that makes a lot of sense. I had skimmed the Anthropic paper and saw how importance was used, but not where it comes from.
If it’s the importance to the loss, then theoretically you could derive it using backprop, I guess? E.g. the accumulated gradient with respect to your activations, over a few batches.
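Something like this is what I have in mind (a rough, untested sketch; `model`, `layer`, `loader`, and `loss_fn` are all placeholders for whatever setup you're using):

```python
# Rough sketch: accumulate |dL/d(activation)| over a few batches and treat the
# per-dimension average as an importance score for that layer's activations.
import torch

def activation_importance(model, layer, loader, loss_fn, n_batches=8):
    acts = {}

    def hook(_module, _inp, out):
        out.retain_grad()        # keep the gradient on this intermediate tensor
        acts["out"] = out

    handle = layer.register_forward_hook(hook)
    importance = None

    for i, (x, y) in enumerate(loader):
        if i == n_batches:
            break
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        g = acts["out"].grad.abs()         # |dL/da| for this batch
        g = g.flatten(0, -2).mean(dim=0)   # average over batch (and positions)
        importance = g if importance is None else importance + g

    handle.remove()
    return importance / n_batches
```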
Yep, definitely! If you’re using MSE loss, it’s pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you’re interested, I think Redwood’s paper on capacity (which is the same thing Anthropic calls dimensionality) looks at the derivative of the loss wrt the capacity assigned to a given feature.
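Concretely, with the importance-weighted MSE from the toy-models setup (roughly, in my notation):

$$
L \;=\; \sum_x \sum_i I_i \,\big(x_i - x'_i\big)^2
\qquad\Rightarrow\qquad
\frac{\partial L}{\partial x'_i} \;=\; -2\, I_i\,\big(x_i - x'_i\big),
$$

so the gradient on each reconstructed feature carries a factor of its importance $I_i$, which is why backprop gives you a natural handle on it.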
Huh, I actually tried this: training IA3, which multiplies each activation by a learned scalar, then using that scalar as the importance of the activation. It seems like a natural way to use backprop to learn an importance matrix, but it only gave small (1-2%) increases in accuracy. Strange.
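To be concrete, the basic shape of the idea is something like this (a simplified sketch, not my exact setup; real IA3 rescales keys, values, and FFN activations):

```python
# Simplified sketch of the IA3-style idea: a learned per-dimension scale on an
# activation vector, whose magnitude is then read off as an "importance" score.
import torch
import torch.nn as nn

class IA3Scale(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # initialised at 1.0 so training starts from the unscaled model
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, activations):
        return activations * self.scale

# After fine-tuning only the scale vectors (base model frozen),
# treat the learned scales as per-activation importance:
#   importance = ia3.scale.detach().abs()
```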
I also tried using a VAE and introducing sparsity by tokenizing the latent space, and this seems to work. At least, probes can overfit to complex concepts using the learned tokens.
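To clarify what I mean by tokenizing, it's something along these lines (a minimal sketch of one way to do it, assuming a VQ-style nearest-codebook lookup; names and sizes are placeholders):

```python
# Minimal sketch: discretize the VAE latent with a nearest-codebook lookup,
# so each latent vector becomes a discrete token id.
import torch
import torch.nn as nn

class TokenizedLatent(nn.Module):
    def __init__(self, latent_dim, codebook_size=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, z):
        # z: [batch, latent_dim] from the VAE encoder
        dists = torch.cdist(z, self.codebook.weight)  # [batch, codebook_size]
        tokens = dists.argmin(dim=-1)                 # discrete token ids
        z_q = self.codebook(tokens)                   # quantized latent
        # straight-through estimator so gradients still reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, tokens

# A probe can then be a simple linear classifier on the token one-hots
# (or on z_q) for whatever concept you're trying to recover.
```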
Oh, that’s very interesting, thank you.