We are simply tuning the model to have similar activations for these very short, context-free snippets. The characterization of the training you made with pair (A) or (B) is not what we do, and we agree that if that were what we were doing, this whole thing would be much less meaningful.
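A minimal sketch of what we mean (the model name, the snippet pair, and the choice of matching the last token's hidden state at every layer are illustrative placeholders, not our exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM would do for the sketch.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

# A short, context-free snippet pair (placeholder text).
self_snippet = "I want to go to the store"
other_snippet = "You want to go to the store"

def hidden_states(text):
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    # Last token's hidden state at every layer.
    return torch.stack([h[:, -1] for h in out.hidden_states])

# One step of activation matching: pull the "other" activations
# towards a frozen copy of the "self" activations.
target = hidden_states(self_snippet).detach()
opt.zero_grad()
loss = torch.nn.functional.mse_loss(hidden_states(other_snippet), target)
loss.backward()
opt.step()
```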
This is great. Two suggestions:
1. Call it ablation, erasure, concept censoring, or similar, not fine-tuning. That way you don't bury the lede. It took me a long time to realise that this is what you were doing.
2. Maybe consider other ways to erase the separation of self and other. There are other erasure techniques that are sharper scalpels, so you can wield them with more force. For example LEACE: train a linear classifier to predict A or B, then erase the directions that had predictive power.
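A minimal sketch of what that could look like with EleutherAI's `concept-erasure` package (assuming you collect activations as an `[n, d]` matrix with binary A/B labels; the data here is a random placeholder):

```python
import torch
from concept_erasure import LeaceEraser

# Placeholder data: n activation vectors of width d, labelled by
# whether each came from an A ("self") or B ("other") snippet.
acts = torch.randn(1000, 768)          # e.g. residual-stream activations
labels = torch.randint(0, 2, (1000,))  # 0 = A, 1 = B

# Fit a LEACE eraser: the smallest linear edit that makes A vs B
# linearly unpredictable from the activations.
eraser = LeaceEraser.fit(acts, labels)
erased = eraser(acts)

# Sanity check: a linear probe trained on `erased` should be at chance.
```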
I also found it interesting that you censored the self_attn activations using gradients. This implicitly assumes that:
1. concepts are best represented in the self-attention, and
2. they are represented non-linearly (meaning you need gradient-based rather than purely linear methods).
Am I right about these assumptions, and if so, why do you think this?
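As a concrete way to test the linearity half of that: if a plain linear probe on the self_attn outputs already separates A from B, gradients aren't needed to find where the concept lives. A hypothetical sketch (`attn_acts` stands in for self_attn outputs you'd collect with a forward hook; the random data is a placeholder):

```python
import torch
from sklearn.linear_model import LogisticRegression

# Placeholder for [n, d] self_attn outputs and their A/B labels.
attn_acts = torch.randn(1000, 768).numpy()
labels = torch.randint(0, 2, (1000,)).numpy()

# Linear probe: can A vs B be read off with a single linear map?
probe = LogisticRegression(max_iter=1000).fit(attn_acts, labels)
print("linear probe accuracy:", probe.score(attn_acts, labels))
# Near 0.5 -> plausibly non-linear (or absent); near 1.0 -> linear methods suffice.
```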
I've been doing some experiments to try to work this out: https://github.com/wassname/eliciting_suppressed_knowledge