Testing it with Pythia-70M and few enough features to permit the naive calculation sounds like a great approach to start with.
Closest neighbour rather than the average over all features sounds sensible. I'm not certain what you mean by unique vs non-unique. If you're referring to situations where several neighbours are equally close, then I think we can just take the mean cos-sim of those tied neighbours, so they all affect the loss but the magnitude of the loss stays within the normal range.
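For concreteness, here's a minimal sketch of what I mean (my own illustrative code, not a proposed implementation; `W_dec` is assumed to hold the decoder directions with shape `(n_features, d_model)`):

```python
import torch

def closest_neighbour_cos_sim(W_dec: torch.Tensor, tie_tol: float = 1e-6) -> torch.Tensor:
    """Per-feature cos-sim to the closest neighbour, averaging over ties."""
    W = torch.nn.functional.normalize(W_dec, dim=-1)        # unit-norm decoder directions
    cos = W @ W.T                                           # pairwise cosine similarities
    eye = torch.eye(W.shape[0], dtype=torch.bool, device=W.device)
    cos = cos.masked_fill(eye, -2.0)                        # exclude self-similarity (cos-sim lies in [-1, 1])
    max_cos, _ = cos.max(dim=-1, keepdim=True)              # similarity to the closest neighbour
    ties = (cos >= max_cos - tie_tol).float()               # all neighbours tied for closest
    # mean cos-sim over the tied closest neighbours, one value per feature
    return (cos * ties).sum(dim=-1) / ties.sum(dim=-1)
```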
Only penalising features that activate also sounds sensible, but the decoder weights of neurons that didn't activate would need to be allowed to update if they were the closest neighbours of neurons that did activate. Otherwise we could get situations where, e.g., one neuron (neuron A) has encoder and decoder weights both pointing in sensible directions to capture a feature, while another neuron has decoder weights aligned with neuron A's but encoder weights occupying a remote region of activation space, so it rarely activates. If we don't allow that second neuron's decoder weights to update, they remain in that direction and block neuron A.
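Building on the sketch above, the masking could look something like this (again illustrative; `active_mask` is an assumed boolean tensor marking which features fired on the current batch):

```python
def orthogonality_penalty(W_dec: torch.Tensor, active_mask: torch.Tensor) -> torch.Tensor:
    # Gradients flow into all rows of W_dec, including the decoder weights of
    # neurons that didn't activate but are someone's closest neighbour...
    per_feature = closest_neighbour_cos_sim(W_dec)
    # ...but only features that activated contribute terms to the penalty.
    return per_feature[active_mask].mean()
```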
Yes, I think we want to penalise high cos-sim more. The modified sigmoid flattens out as x -> 1, but I think the purple function below does what we want.
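To illustrate the kind of shape I mean (this isn't the purple function itself, just an example with the desired property of staying small at low cos-sim and increasing steeply as cos-sim -> 1):

```python
import torch

def convex_penalty(cos_sim: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    x = cos_sim.clamp(min=0.0, max=1.0 - eps)   # only penalise positive alignment; cap to avoid -inf
    return -torch.log1p(-x)                     # -log(1 - x): near zero at x = 0, steep near x = 1
```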
Training with a negative orthogonality regulariser could be an option. I think vanilla SAEs already have plenty of geometrically aligned features (e.g. see @jacobcd52's comment below). Depending on the purpose, another option to intentionally generate feature combinatorics could be to simply add together some of the features learnt by a vanilla SAE. If the individual features weren't combinations, then their sums certainly would be.
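A rough sketch of that idea (assuming a trained vanilla SAE with decoder directions `W_dec`; all names are illustrative):

```python
import torch

def make_combined_features(W_dec: torch.Tensor, n_combos: int, seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    idx = torch.randint(W_dec.shape[0], (n_combos, 2), generator=g)   # random feature pairs
    combos = W_dec[idx[:, 0]] + W_dec[idx[:, 1]]                      # pairwise sums of learnt directions
    return torch.nn.functional.normalize(combos, dim=-1)              # renormalise to unit length
```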
I’ll be very interested to see results and am happy to help with interpreting them etc. Also more than happy to have a look at any code.