What I was proposing there isn’t distillation/compression of the primary model. Rather, it’s training the primary model to have internal representations that are easily learned by other systems. Knowledge distillation is the process the other systems use to learn the primary model’s representations. As far as I know, the brain doesn’t do anything like this.
Imagine the primary model has a collection of 10 neurons that collectively represent 10 different concepts, but all the neurons are highly polysemantic. No single neuron corresponds to any one of the concepts. When the primary model needs a pure representation of a concept, it recovers it by applying some complex function to the 10 neurons’ activations.
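A minimal numerical sketch of that setup, purely for illustration (the names here are hypothetical; `W_mix` just stands in for whatever entangled encoding the model happens to learn):

```python
import torch

torch.manual_seed(0)

n_concepts = 10
# Dense mixing: every neuron reads a bit of every concept, so each neuron
# is polysemantic and no single neuron identifies a single concept.
W_mix = torch.randn(n_concepts, n_concepts)

concept = torch.eye(n_concepts)[3]      # "pure" representation of concept #3
neurons = W_mix @ concept               # what the 10 neurons actually store: all of them fire

# Recovering the pure concept requires a function of all 10 activations --
# here, simply the inverse of the mixing matrix.
recovered = torch.linalg.inv(W_mix) @ neurons
print(torch.argmax(recovered).item())   # 3: the concept is only recoverable jointly
```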
This is pretty bad from an interpretability perspective, and my guess is that it also makes the primary model harder to use as a teacher for knowledge distillation: student models have to learn the disentangling function before they can get a pure representation. In contrast, knowledge distillation would be easier if each of those 10 neurons uniquely represented a single concept. That’s the structure the primary model is being trained toward.
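Purely as an illustration of what "training toward that" could look like, not a claim about the actual proposal's implementation (all module and variable names below are hypothetical): one could add an auxiliary term that rewards the teacher's hidden layer for being readable by a deliberately simple student.

```python
import torch
import torch.nn as nn

class Teacher(nn.Module):
    def __init__(self, d_in=32, d_hidden=10, n_concepts=10):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, n_concepts)

    def forward(self, x):
        h = torch.relu(self.encoder(x))   # the 10 neurons under discussion
        return self.head(h), h

teacher = Teacher()
student_probe = nn.Linear(10, 10)          # deliberately weak student: one linear map
opt = torch.optim.Adam(
    list(teacher.parameters()) + list(student_probe.parameters()), lr=1e-3
)

def step(x, concept_labels, lam=0.1):
    logits, h = teacher(x)
    task_loss = nn.functional.cross_entropy(logits, concept_labels)
    # "Learnability" term: how well the simple student predicts the concepts
    # from the teacher's hidden activations. Its gradient also flows into the
    # teacher, nudging the representation toward something a student can read
    # off without learning a complicated disentangling function.
    probe_loss = nn.functional.cross_entropy(student_probe(h), concept_labels)
    loss = task_loss + lam * probe_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random data:
x = torch.randn(64, 32)
labels = torch.randint(0, 10, (64,))
step(x, labels)
```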
By training the primary model to be easily interpretable to students, I hope to get a primary model whose representations are generally interpretable to both student models and humans.
Distillation is simply the process of one network learning to model another, usually by predicting its outputs on the same inputs, though there are many variations. The brain certainly uses distillation: DeepMind’s founding research was based on hippocampal replay, in which the hippocampus trains the cortex, a form of distillation.
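For reference, the standard concrete instance of "predicting its outputs on the same inputs" is the soft-label objective from Hinton et al. (2015); a minimal sketch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Standard knowledge-distillation loss: the student matches the
    teacher's temperature-softened output distribution on the same inputs."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps the soft-target gradients at a comparable
    # magnitude as the temperature changes.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T**2
```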
Leaving aside whether “training the primary model to have internal representations that are easily learned by other systems” is an effective explainability technique at all compared to alternatives, both the training of the explainer distillations and any explainability side objective it implies impose a cost.
The brain evidence is relevant because it suggests that distillation done for primary capability purposes (compression, efficiency, etc.) does not increase interpretability for free, and thus carries some capability tradeoff cost.
All that being said, it does seem that sparsity (or other forms of compression bottlenecks) can aid interpretability by reducing complexity, filtering noise, and so on, thus speeding up downstream learning of those internal representations. But it would be surprising if the ideal sparsity for efficiency/capability happened to be the same as the ideal for interpretability/explainability.
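For concreteness, one common way such a sparsity bottleneck gets imposed is an L1 penalty on hidden activations; the penalty coefficient is exactly the knob where the efficiency-vs-interpretability tradeoff would show up. A sketch only, with hypothetical names:

```python
import torch
import torch.nn as nn

class SparseBottleneck(nn.Module):
    def __init__(self, d_in=128, d_hidden=512, d_out=128):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        return self.dec(h), h

def loss_fn(model, x, target, l1_coeff=1e-3):
    out, h = model(x)
    task_loss = nn.functional.mse_loss(out, target)
    sparsity_loss = h.abs().mean()   # L1 penalty pushes most activations toward zero
    # l1_coeff trades task performance against sparsity; there is no reason to
    # expect one value to be optimal for both capability and interpretability.
    return task_loss + l1_coeff * sparsity_loss
```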