I forgot to reply to this important part in my other comment:
Another option: given a large, “primary” model, you could train multiple smaller “secondary” models (with different architectures) to imitate the primary model (knowledge distillation), then train the primary model to improve the secondary models’ imitation performance. This should cause the primary model to learn internal representations that are more easily learned by other models of various architectures, and are hopefully more interpretable to humans. If you then assume the primary model is a self-optimizing system, this approach becomes even more promising, because now the self-optimizing system is actively looking for architectures that are easy for weaker models to understand.
I was already assuming that distillation/compression was part of recursive self-optimization—it’s certainly something the brain does. I’m more doubtful that this improves interpretability for free, absent explicit regularization criteria and their associated costs. The regions of the brain that seem most involved in distillation are those that took the longest for us to understand, or are still mysterious—such as the cerebellum (motor control was a red herring; it’s probably involved in training or distilling the cortex).
What I was proposing there isn’t distillation/compression of the primary model. Rather, it’s training the primary model to have internal representations that are easily learned by other systems. Knowledge distillation is the process the other systems use to learn the primary model’s representations. As far as I know, the brain doesn’t do anything like this.
Imagine the primary model has a collection of 10 neurons that collectively represent 10 different concepts, but all the neurons are highly polysemantic: no single neuron corresponds to any one of the concepts. When the primary model needs a pure representation, it has to apply some complex function of the 10 neurons’ activations to recover one.
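As a toy version of that picture (the random mixing matrix below is just a stand-in for whatever entangled code the model actually learns):

```python
import torch

concepts = torch.eye(10)        # ten "pure", one-hot concept vectors
mixing = torch.randn(10, 10)    # dense mixing: every neuron responds to every concept

mono_acts = concepts            # monosemantic code: neuron i fires only for concept i
poly_acts = concepts @ mixing   # polysemantic code: each concept smeared across all 10 neurons

# Reading a concept out of the monosemantic code is a lookup; reading it out of
# the polysemantic code requires (learning) the inverse of the mixing, i.e. the
# "complex disentangling function" above.
recovered = poly_acts @ torch.linalg.inv(mixing)
print((recovered - concepts).abs().max())    # ~0, but only after undoing the mixing
```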
This is pretty bad from an interpretability perspective, and my guess is that it also makes it more difficult to use the primary model as a teacher for knowledge distillation. Student models have to learn the disentangling function before they can get a pure representation. In contrast, knowledge distillation would be easier if those 10 neurons each uniquely represented a single concept. That’s what the primary model is being trained for.
By training the primary model to be easily interpretable to students, I hope to get a primary model whose representations are generally interpretable to both student models and humans.
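Concretely, here is a minimal sketch of the kind of training loop I have in mind (assuming PyTorch; the model shapes, the imitability weight, and the random stand-in data are all illustrative placeholders, and I’m ignoring the question of whether you’d also want to backprop through the students’ own update steps):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Large "primary" model and two smaller "secondary" models with different
# architectures (all sizes are placeholders).
primary = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
students = [
    nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10)),
    nn.Sequential(nn.Linear(32, 8), nn.Tanh(), nn.Linear(8, 10)),
]
opt_p = torch.optim.Adam(primary.parameters(), lr=1e-3)
opt_s = [torch.optim.Adam(s.parameters(), lr=1e-3) for s in students]
imit_weight = 0.1  # how much "being easy to imitate" counts vs. the task loss

for step in range(1000):
    x = torch.randn(64, 32)                 # stand-in for real task data
    y = torch.randint(0, 10, (64,))
    logits_p = primary(x)

    # (1) Knowledge distillation: each student learns to imitate the primary's
    #     outputs on the same inputs (teacher detached, so only students update).
    for s, opt in zip(students, opt_s):
        d = F.kl_div(F.log_softmax(s(x), dim=-1),
                     F.softmax(logits_p.detach(), dim=-1),
                     reduction="batchmean")
        opt.zero_grad()
        d.backward()
        opt.step()

    # (2) The primary is trained on its task plus how well the students
    #     currently imitate it. The gradient flows through logits_p, so the
    #     primary is nudged toward outputs the weaker models can match; the
    #     task loss keeps it from collapsing to whatever the students output.
    imit = torch.stack([
        F.kl_div(F.log_softmax(s(x).detach(), dim=-1),
                 F.softmax(logits_p, dim=-1),
                 reduction="batchmean")
        for s in students
    ]).mean()
    loss = F.cross_entropy(logits_p, y) + imit_weight * imit
    opt_p.zero_grad()
    loss.backward()
    opt_p.step()
```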
Distillation is simply the process of one network learning to model another, usually by predicting its outputs on the same inputs, though there are many variations. The brain certainly uses distillation: DeepMind’s founding research was based on hippocampal replay, wherein the hippocampus trains the cortex, a form of distillation.
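For concreteness, the canonical output-matching form of this (soft targets with a temperature, per Hinton et al. 2015) looks like the following; the temperature value here is arbitrary:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL between the teacher's and student's temperature-softened output distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps soft-target gradients on the same scale as hard-label ones
```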
Leaving aside whether “training the primary model to have internal representations that are easily learned by other systems” is an effective explainability technique at all compared to alternatives, both training the explainer/student distillations and any implied explainability side objective impose a cost.
The brain evidence is relevant because it suggests that distillation done for primary capability purposes (compression, efficiency, etc.) does not increase interpretability for free, and thus getting interpretability carries some capability tradeoff cost.
All that being said, it does seem that sparsity (or other forms of compression bottlenecks) can aid interpretability by reducing complexity, filtering noise, etc., thus speeding up downstream learning of those internal representations. But it would be surprising if the ideal sparsity for efficiency/capability happened to be the same as the ideal for interpretability/explainability.
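The simplest version of such a bottleneck is just an L1 penalty on a hidden layer’s activations; a sketch below (the sizes and coefficient are placeholders, and the coefficient is exactly the knob whose “ideal” value the two goals need not agree on):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU())   # sizes are placeholders
head = nn.Linear(128, 10)
sparsity_coef = 1e-3   # tune for capability vs. tune for interpretability

x = torch.randn(64, 32)
y = torch.randint(0, 10, (64,))

hidden = encoder(x)
# Task loss plus an L1 penalty that pushes hidden activations toward sparsity.
loss = F.cross_entropy(head(hidden), y) + sparsity_coef * hidden.abs().mean()
loss.backward()
```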