The authors' results from unlearning MMLU seem slightly rushed but moderately promising (I previously wrote a paper attempting similar things; making good comparisons here is difficult). The results from unlearning different coding languages, however, seem very strong compared to my previous attempt: the model seems to be substantially more monosemantic.
I agree with your suspicion that the Gemma SAE performance was poor because it used reconstructed activations; that matches the drop in performance I got when I tried doing this.
It would be interesting to see whether, e.g., steering performance from MONET expert directions is also comparable to that of SAEs. Using SAEs in practice is quite costly, so I would prefer an approach more similar to MONET.
Hi Nicky! I agree that it would be interesting to see the steering performance of MONET compared to that of SAEs. At the moment, the way the routing probabilities are calculated makes this difficult, as they are computed separately for the bottom and top layers (in HD) or the left and right layers (in VD). Therefore, it is hard to change the activation of expert $(i, j)$ without also affecting experts $(i, j')$ and $(i', j)$ for all $i' \neq i$ and $j' \neq j$.
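To make the problem concrete, here is a minimal sketch of the factorized routing as I understand it (toy tensor shapes and variable names are my own, not the paper's code): zeroing one factor wipes out an entire row or column of experts, not just the single expert $(i, j)$.

```python
import torch

# Toy illustration of factorized routing (my own names, not the Monet codebase):
# each head h gets two independent distributions over M expert groups, and the
# probability of the composed expert (i, j) is their product.
H, M = 4, 8                                      # heads, experts per group (M*M experts total)
g1 = torch.softmax(torch.randn(H, M), dim=-1)    # bottom/left routing factor
g2 = torch.softmax(torch.randn(H, M), dim=-1)    # top/right routing factor
g = g1[:, :, None] * g2[:, None, :]              # g[h, i, j] = g1[h, i] * g2[h, j]

# To suppress expert (i, j) we can only intervene on the factors, but zeroing
# g1[:, i] kills the whole row (i, j') for every j', and zeroing g2[:, j]
# would kill the whole column (i', j) for every i'.
i, j = 2, 5
g1_masked = g1.clone()
g1_masked[:, i] = 0.0
g_masked = g1_masked[:, :, None] * g2[:, None, :]
print(g_masked[:, i, :].abs().max())             # 0: every expert (i, *) is gone, not just (i, j)
```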
One of the authors told me the following: “For pruning the experts, we manually expand the decomposed activations using $g_{hij} = g^{1}_{hi} \ast g^{2}_{hj}$. After masking the relevant expert (i, j), we compute for all experts rather than performing efficient expert decomposition. This approach requires more memory and computational resources compared to the standard Monet mode, which is one of our current limitations. We are actively working on porting our Monet training code and developing a decomposed expert routing kernel in CUDA to enable more efficient expert manipulation without the need for full expert expansion.”
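For illustration, here is a rough sketch of how I interpret that workaround (hypothetical names and shapes, not the authors' actual code): expand the decomposed routing into the full per-expert tensor, mask only expert $(i, j)$, and then evaluate every expert explicitly, which is what makes it memory- and compute-hungry compared to the decomposed path.

```python
import torch

# Rough sketch of the described workaround (hypothetical, not the authors' code):
# expand the decomposed routing into the full H x M x M tensor, zero out only the
# target expert (i, j), then sum over all M*M experts explicitly.
H, M, D = 4, 8, 16
g1 = torch.softmax(torch.randn(H, M), dim=-1)
g2 = torch.softmax(torch.randn(H, M), dim=-1)
expert_out = torch.randn(M, M, D)                # stand-in for per-expert outputs

g = g1[:, :, None] * g2[:, None, :]              # g_{hij} = g^1_{hi} * g^2_{hj}
g[:, 2, 5] = 0.0                                 # mask just expert (i, j) = (2, 5)

# Full expansion over all experts, instead of the efficient decomposed routing.
out = torch.einsum('hij,ijd->hd', g, expert_out)
```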
I think this problem would be easily solved for top-1 activations: to steer, you could simply replace the expert the model wants to choose with the one you want to steer with. Since k = 1, you don't need to worry about affecting other routing probabilities.
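A minimal sketch of what that top-1 intervention could look like (hypothetical; as far as I know, Monet does not currently ship such a mode):

```python
import torch

# Hypothetical top-1 steering: with k = 1 each head routes to exactly one expert,
# so steering is just swapping that index for the expert you want, with no other
# routing probabilities to worry about.
H, N, D = 4, 64, 16
logits = torch.randn(H, N)                       # per-head routing logits
expert_out = torch.randn(N, D)                   # stand-in per-expert outputs

chosen = logits.argmax(dim=-1)                   # the model's own top-1 choice per head
steer_expert = 42                                # expert we want to activate instead
chosen[:] = steer_expert                         # overwrite the routing decision

out = expert_out[chosen]                         # (H, D): only the steered expert fires
```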
It would be really interesting if someone trained a top-1 MONET model (with multiple heads, so that even though each head selects only one expert, the model can still express itself through multiple semantic concepts) and tested its steering performance.
The unlearning results seem promising!