This is great. My hunch is that modularity could be greatly improved with little loss of capabilities if we trained with some sort of loss function that weakly prioritized modularity of skills.
I tried some experiments on this idea of separability of skills in transformers last year, but didn’t get very far, partly because I was less thorough than you, and partly because I was using smaller models and going after more entangled skills (toxic internet comments vs. Wikipedia entries).