I’m not exactly sure what you mean by “single channel”
I mean the thing where BERT has a single stack of sequential layers, each of which processes the entire latent representation of the previous layer. In contrast, imagine a system that dynamically routes different parts of the input to different components of the model, then has those model components communicate with each other to establish the final output.
At first glance, the dynamic-routing model seems much less interpretable. However, I think we’ll find that different parts of the dynamic model will specialize to process different types of input or perform different types of computation, even without an explicit regularizer that encourages sparse connections or small circuits. I think this will aid interpretability quite a lot.
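To make that concrete, here’s a toy sketch of the kind of routing I have in mind. It’s purely illustrative (not any particular published architecture): the “experts” are random linear maps standing in for trained sub-networks, and I’m using a hard top-1 router just to keep it simple.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, seq_len = 16, 4, 8

# Toy "experts": random linear maps standing in for trained sub-networks.
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
# Router: scores each token and sends it to the highest-scoring expert (hard top-1 routing).
router_w = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

tokens = rng.standard_normal((seq_len, d_model))
choices = (tokens @ router_w).argmax(axis=-1)   # which expert handles each token

out = np.zeros_like(tokens)
for e in range(n_experts):
    mask = choices == e
    out[mask] = tokens[mask] @ experts[e]       # each expert only sees "its" tokens

print(choices)   # different tokens end up handled by different components
```

The hope is that, after training, each expert ends up handling a recognizable slice of the input or computation, which is exactly the kind of structure an interpretability researcher can latch onto.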
I don’t think a self-optimizing architecture will change its internals as quickly or often as you seem to be implying. A major hint here is that brain architecture doesn’t vary that much across species (a human brain neuroscientist can easily adapt their expertise to squirrel brains). Additionally, most deep learning advances seem more along the lines of “get out of SGD’s way” or “make things more convenient for SGD” than “add a bunch of complex additional mechanisms”. I think we plausibly end up with a handful of clusters in architecture space that have good performance on certain domains, and that further architecture search doesn’t stray too far from those clusters.
I also think that highly varied, distributed systems have much better interpretability prospects than you might assume. Consider that neural nets face their own internal interpretability issues. Different parts of the network need to be able to communicate effectively with each other on at least two levels:
1. Different parts need to share the results of their computation in a way that’s mutually legible.
2. Different parts need to communicate with each other about how their internal representations should change so they can more effectively coordinate their computation.
I.e., if part A is looking for trees and part B is looking for leaves, part A should be able to signal to part B how its leaf detectors should change so that part A’s tree detectors function more effectively.
We currently use SGD for this coordination (and not-coincidentally, gradients are very important interpretability tools), but even some hypothetical learned alternative to SGD would need to do this too.
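As a toy illustration of that second kind of communication (the module names are just placeholders, nothing here is a real detector):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Upstream "part B" (stand-in leaf detector) feeds downstream "part A" (stand-in tree detector).
leaf_detector = nn.Linear(32, 16)
tree_detector = nn.Linear(16, 1)

x = torch.randn(8, 32)        # fake input features
target = torch.rand(8, 1)     # fake "is there a tree?" labels

loss = nn.functional.mse_loss(tree_detector(leaf_detector(x)), target)
loss.backward()

# A's loss produces gradients on B's weights: the "message" about how B's
# representation should change so that A's computation works better.
print(leaf_detector.weight.grad.norm())
```

The gradient that lands on the upstream module’s weights is, in effect, the downstream module’s message about how the upstream representation should change.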
Importantly, as systems become more distributed, varied and adaptable, the premium on effective cross-system communication increases. It becomes more and more important that systems with varied architectures be able to understand each other.
The need for cross-regional compatibility implies that networks tend to learn representations that are maximally interpretable to other parts of the network. You could object that there’s no reason such representations have to be human interpretable. This objection is partially right. There’s no reason that the raw internal representations have to be human interpretable. However, the network has to accept/generate human interpretable input/output, so it needs components that translate its internal representations to human interpretable forms.
Because the “inner interpretability” problem described above forces models to use consistent internal representations, we should be able to apply the model’s own input/output translation components to the internal representations anywhere in the model to get human interpretable output.
This is roughly what we see in the papers “Transformer Feed-Forward Layers Are Key-Value Memories” and “Knowledge Neurons in Pretrained Transformers” and the LW post “interpreting GPT: the logit lens”. They’re able to apply the vocabulary projection matrix (which generates the output at the final layer) to intermediate representations and get human interpretable translations of those representations.
“Knowledge Neurons in Pretrained Transformers” was also able to use input embeddings to modify knowledge stored in neurons in the intermediate layers. E.g., given a neuron storing “Paris is the capital of France”, the authors subtract the embedding of “Paris” and add the embedding of “London”. This causes the model to output London instead of Paris for French-capital-related queries (though the technique is not fully reliable).
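For flavor, here’s a minimal logit-lens-style sketch using GPT-2 and the HuggingFace transformers library. The papers above use their own models and setups, so treat this only as an illustration of the general trick: reusing the model’s own output machinery (final layer norm plus vocabulary projection) on intermediate activations rather than only the last layer.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# Apply the model's own output head (final layer norm + vocab projection)
# to *intermediate* hidden states, not just the last layer.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax().item())))
```

Later layers typically decode to more sensible continuations than earlier ones, which is roughly the pattern the logit lens post reports.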
These successes are very difficult to explain in a paradigm where interpretability is unrelated to performance, but are what you’d expect from thinking about what models would need to do to address the inner interpretability problem I describe above.
Before reading your article, my initial take was that interpretability techniques for ANNs and BNNs are actually not all that different—but ANNs are naturally much easier to monitor and probe.
My impression is that they’re actually pretty different. For example, ML interpretability has access to gradients, whereas brain interpretability can lean much more heavily on regional specialization. ML interpretability can do things like feature visualization, saliency mapping, etc. Brain interpretability can learn more from observing how damage to different regions impacts human behavior. Also, you can ask humans to introspect and perform particular mental tasks while you analyze them, which is much harder to do with models.
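For example, gradient-based saliency is a few lines of code against a model, with no real analogue for a brain. The model below is just a toy stand-in, since the specific architecture doesn’t matter for the point:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a trained classifier.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

x = torch.randn(1, 64, requires_grad=True)
model(x)[0].max().backward()        # gradient of the top class score w.r.t. the input

saliency = x.grad.abs().squeeze(0)  # how sensitive the prediction is to each input feature
print(saliency.topk(5).indices)     # the 5 most influential input features
```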
The pessimistic scenario is thus simply that this status quo doesn’t change, because willingness to spend on interpretability doesn’t change much, and moving towards recursive self-optimization increases the cost.
I think a large part of why there’s so little interpretability work is that people don’t think it’s feasible. The tools for doing it aren’t very good, and there’s no overarching paradigm that tells people where to start or how to proceed (or more accurately, the current paradigm says don’t even bother). This is a big part of why I think it’s so important to make a positive case for interpretability.
TLDR: I think our crux reduces to some mix of these 3 key questions:
1. how system complexity evolves in the future with recursive self-optimization
2. how human interpretability cost scales with emergent system complexity
3. willingness to spend on interpretability in the future
So my core position is then:
1. system complexity is going to explode hyperexponentially (which is just an obvious prediction of the general hyperexponential trend),
2. interpretability cost thus scales suboptimally with humans in the loop (probably growing exponentially), and
3. willingness to spend on interpretability won’t change enormously.
In other words, future ML systems will reach a point where they evolve faster than we can understand. This may be a decade away or more, but it’s not a century away.
For interpretability to scale in this scenario, you need to outsource it to already-trusted systems (i.e., something like iterated amplification).
I mean the thing where BERT has a single stack of sequential layers, each of which processes the entire latent representation of the previous layer. In contrast, imagine a system that dynamically routes different parts of the input to different components of the model, then has those model components communicate with each other to establish the final output.
The system you are imagining sounds equivalent to transformers. Content-based dynamic routing, soft attention, and content-addressable memory are all actually just variations/descriptions of the same thing.
A matrix multiply A*M, where A consists of 1-hot row vectors, is mathematically equivalent to an array of memory lookup ops: each 1-hot row of A is a memory address referencing some row of the memory matrix M. Relaxing the 1-hot constraint naturally can’t make it less flexible than a memory lookup; it becomes a more general soft memory blend operation.
Then if you compute A with a nonlinear input layer of the form A = f(Q*K), where f is some competitive nonlinearity and Q and K are query and key matrices, that implements a more general soft version of content-based addressing. Chain them together and you get soft content-addressable memory (which is obviously universal and equivalent to/implements routing).
Standard ReLU deepnets don’t use a matrix transpose in the forward pass, only in the backward pass, and thus have fixed K and M matrices that change only slowly with SGD. They completely lack the ability to do attention/routing/memory operations over dynamic activations. Transformers add the transpose op as a forward-pass building block, allowing the output activations to feed into K and/or M, which simultaneously enables universal attention/routing/memory operations.
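Spelled out with toy numbers (writing the transpose explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)

M = rng.standard_normal((4, 8))     # "memory": 4 slots of dimension 8

# Hard lookup: a 1-hot row of A just selects a row of M.
A_hard = np.eye(4)[[2, 0]]          # addresses 2 and 0
assert np.allclose(A_hard @ M, M[[2, 0]])

# Relax the 1-hot constraint: rows of A become soft addresses, A @ M a soft blend of slots.
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Content-based addressing: A = f(Q K^T), the attention pattern of a transformer.
K = rng.standard_normal((4, 8))     # one key per memory slot
Q = rng.standard_normal((2, 8))     # two queries
A_soft = softmax(Q @ K.T)           # rows sum to 1: soft addresses over the 4 slots
out = A_soft @ M                    # soft content-addressable memory read
print(A_soft.round(2))
```

The only new ingredient in the transformer case is that Q, K (and M) are themselves computed from activations rather than being fixed weight matrices.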
I don’t think a self-optimizing architecture will change its internals as quickly or often as you seem to be implying.
I disagree; you may simply be failing to imagine the future as I do. Fully justifying why I disagree is not something I should do on a public forum, but I will say that the brain is already changing its internals more quickly than you seem to be implying, and ANNs on von Neumann hardware are potentially vastly more flexible in their ability to expand/morph existing layers, add modules, distill others, explore new pathways, learn to predict sub-module training trajectories, meta-learn to predict, etc. The brain is limited by the topological constraints of both the far less flexible neuromorphic substrate and a constrained volume, constraints that largely do not apply to ANNs on von Neumann hardware.
Additionally, most deep learning advances seem more along the lines of “get out of SGD’s way” or “make things more convenient for SGD” than “add a bunch of complex additional mechanisms”.
Getting out of the optimizer’s way allows it to explore complexity beyond human capability. The bitter lesson is not one of simplicity beating complexity; it is about design complexity emergence shifting from the human optimization substrate to the machine optimization substrate. The main point I was making is that meta subsumes: moving from architecture to meta-learned meta-architecture (e.g. recursions of learning to learn the architecture of compressed hypernetworks that generate/train the lower-level architectures).
However, the network has to accept/generate human interpretable input/output, so it needs components that translate its internal representations to human interpretable forms.
Both a 1950s computer and a 2021 GPU-based computer, each running some complex software of its era, accept/generate human interpretable inputs/outputs, but one is enormously more difficult to understand at any deep level.
This is roughly what we see in the papers “Transformer Feed-Forward Layers Are Key-Value Memories”
Side note, but that’s such a dumb name for a paper—it’s the equivalent of “Feed-Forward Network Layers are Soft Threshold Memories”, and only marginally better than “Linear Neural Network Layers Are Matrix Multiplies”.