Really cool work!
Would it be accurate to say that MoE models are an extremely coarse form of parameter decomposition? They check the box for faithfulness, and they're an extreme example of optimizing minimality (under top-1 routing, each input x uses only one component of the model, if you define each expert as a component) while completely disregarding simplicity.
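To make the "one component per input" claim concrete, here's a minimal sketch of top-1 MoE routing (the gating scheme, dimensions, and all names here are my own illustrative assumptions, not anything from the paper): the router picks a single expert per input, so only that expert's parameters ever touch x.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_experts = 4, 3
# One weight matrix per expert -- each expert is one "component".
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))  # gating weights

def moe_forward(x):
    logits = x @ router
    k = int(np.argmax(logits))   # top-1 routing: select a single expert
    return experts[k] @ x, k     # only expert k's parameters are used for this x

x = rng.normal(size=d)
y, used = moe_forward(x)
print(f"x routed to expert {used}; the other {n_experts - 1} experts are inactive")
```

With top-2 (or top-k) routing, which many deployed MoEs actually use, the per-input component count rises to k, so the decomposition is slightly less "minimal" but the same coarse-grained picture holds.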
Training on CoT traces seems like a particular instance of a general class of “self-defeating strategies.” Other examples include antibiotics/bacterial resistance (treating bacterial infections creates selective pressure that promotes resistant bacterial populations, gradually rendering the antibiotics ineffective for future use) and the dilemma in The Imitation Game after Turing and his team have cracked Enigma (acting upon the deciphered messages would tip off the Nazis and remove the Allies’ informational advantage).