Distillation is simply the process of one network learning to model another, usually by predicting its outputs on the same inputs, though there are many variations. The brain certainly uses it: DeepMind's founding research was based on hippocampal replay, wherein the hippocampus trains the cortex, a form of distillation.
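To make the basic form concrete, here is a minimal sketch of logit distillation in PyTorch: a student predicts a frozen teacher's outputs on the same inputs via a temperature-softened KL loss. The function names, temperature value, and training-loop shape are illustrative assumptions, not anything specified in the text above.

```python
# Minimal sketch of logit distillation (illustrative, not a specific system's implementation).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable to an unsoftened loss.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

def train_step(student, teacher, inputs, optimizer):
    """One step of the student learning to model the teacher's outputs on the same inputs."""
    with torch.no_grad():
        teacher_logits = teacher(inputs)   # teacher is frozen; no gradients flow into it
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Variations swap what is matched (logits, intermediate activations, sampled outputs) and on which inputs, but the core is the same: one network trained to reproduce another's behavior.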
Leaving aside whether "training the primary model to have internal representations that are easily learned by other systems" is an effective explainability technique at all compared to alternatives, both training the explainer distillations and any explainability side objective this implies for the primary model impose a cost.
The brain evidence is relevant because it suggests that distillation done for primary capability purposes (compression, efficiency, etc.) does not increase interpretability for free; interpretability obtained this way comes with some capability tradeoff cost.
All that being said, it does seem that sparsity (or other forms of compression bottleneck) can aid interpretability by reducing complexity, filtering noise, etc., thus speeding up downstream learning of those internal representations. But it would be surprising if the ideal sparsity for efficiency/capability happened to be the same as the ideal sparsity for interpretability/explainability.
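For concreteness, one common way to impose such a sparsity bottleneck is an L1 penalty on hidden activations added to the task loss. The sketch below assumes a hypothetical model with an `encoder`/`head` split and an arbitrary `l1_weight`; it is only meant to make the tradeoff knob explicit, not to describe any particular system.

```python
# Minimal sketch of a sparsity bottleneck via an L1 activation penalty (hypothetical model layout).
import torch
import torch.nn.functional as F

def sparse_task_loss(model, inputs, targets, l1_weight=1e-4):
    hidden = model.encoder(inputs)          # assumed encoder producing hidden activations
    logits = model.head(hidden)             # assumed task head on top of the bottleneck
    task_loss = F.cross_entropy(logits, targets)
    sparsity_penalty = hidden.abs().mean()  # L1 pressure pushes activations toward sparse codes
    return task_loss + l1_weight * sparsity_penalty
```

The point of the `l1_weight` knob is exactly the tension above: the value that maximizes capability per unit compute is not, in general, the value that makes the hidden representations easiest for a downstream explainer to learn.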