TLDR: I think our crux reduces to some mix of these three key questions:
how system complexity evolves in the future with recursive self optimization
how human interpretability cost scales with emergent system complexity
willingness to spend on interpretability in the future
So my core position is then:
system complexity is going to explode hyperexponentially (which is just an obvious prediction of the general hyperexponential trend)
interpretability cost thus scales suboptimally with humans in the loop (probably growing exponentially) and
willingness to spend on interpretability won’t change enormously
In other words, future ML systems will reach a point where they evolve faster than we can understand. This may be a decade away or more, but it’s not a century away.
For interpretability to scale in this scenario, you need to outsource it to already-trusted systems (i.e. something like iterated amplification).
I mean the thing where BERT has a single stack of sequential layers which each process the entire latent representation of the previous layer. In contrast, imagine a system that dynamically routes different parts of the input to different components of the models, then has those model components communicate with each other to establish the final output.
The system you are imagining sounds equivalent to transformers. Content-based dynamic routing, soft attention, and content-addressable memory are all just variations/descriptions of the same thing.
A matrix multiply A*M, where A consists of 1-hot row vectors, is mathematically equivalent to an array of memory lookup ops: each 1-hot row of A is a memory address referencing some row of memory matrix M. Relaxing the 1-hot constraint naturally can't make it less flexible than a memory lookup; it becomes a more general soft memory blend operation.
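To make that concrete, here is a minimal numpy sketch (variable names are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))          # "memory": 8 rows of 4-dim values

# 1-hot rows of A act as addresses: A @ M simply gathers rows of M.
A_hard = np.eye(8)[[2, 5]]           # addresses 2 and 5 as 1-hot rows
assert np.allclose(A_hard @ M, M[[2, 5]])

# Relaxing the 1-hot constraint gives a soft blend of rows instead.
A_soft = np.array([[0.7, 0.3, 0, 0, 0, 0, 0, 0]])
blend = A_soft @ M                   # 0.7 * M[0] + 0.3 * M[1]
assert np.allclose(blend, 0.7 * M[0] + 0.3 * M[1])
```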
Then if you compute A with a nonlinear input layer of the form A = f(Q*K^T), where f is some competitive non-linearity and Q and K are query and key matrices, that implements a more general soft version of content-based addressing. Chain them together and you get soft content-addressable memory (which is obviously universal and implements/is equivalent to routing).
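Again a toy numpy sketch, assuming softmax as the competitive nonlinearity (sizes and names are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))     # memory rows (values)
K = rng.normal(size=(8, 4))     # one key per memory row
Q = rng.normal(size=(2, 4))     # 2 content queries

A = softmax(Q @ K.T)            # soft content-based addressing: A = f(Q K^T)
out = A @ M                     # soft content-addressable memory read, shape (2, 4)
```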
Standard ReLU deepnets don't use matrix transposes in the forward pass, only in the backward pass, and thus have fixed K and M matrices that change only slowly with SGD. They completely lack the ability to do attention/routing/memory operations over dynamic activations. Transformers add the transpose op as a forward-pass building block, allowing the output activations to feed into K and/or M, which simultaneously enables universal attention/routing/memory operations.
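A toy contrast of the two cases (my own illustration, not any particular library's implementation): in the plain ReLU layer the only "memory" is a fixed weight matrix, while in self-attention the keys and values are recomputed from the current activations, and the forward pass multiplies by a transposed activation matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(6, d))          # activations for 6 tokens

# Plain ReLU layer: the only "memory" is W, fixed between SGD steps.
W = rng.normal(size=(d, d))
mlp_out = np.maximum(x @ W, 0.0)

# Self-attention: K and V are recomputed from the *current* activations,
# and the forward pass multiplies by a transposed activation matrix (k.T).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn_out = softmax(q @ k.T / np.sqrt(d)) @ v
```

The only structural difference from the fixed-weight layer is that k and v depend on x, which is exactly the transpose-over-dynamic-activations point above.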
I don’t think a self-optimizing architecture will change its internals as quickly or often as you seem to be implying.
I disagree; you may simply be failing to imagine the future as I do. Fully justifying why I disagree is not something I should do on a public forum, but I will say that the brain is already changing its internals more quickly than you seem to be implying, and ANNs on von Neumann hardware are potentially vastly more flexible in their ability to expand/morph existing layers, add modules, distill others, explore new pathways, learn to predict sub-module training trajectories, meta-learn to predict, etc. The brain is limited by the topological constraints of both the far less flexible neuromorphic substrate and a constrained volume; constraints that largely do not apply to ANNs on von Neumann hardware.
Additionally, most deep learning advances seem more along the lines of “get out of SGD’s way” or “make things more convenient for SGD” than “add a bunch of complex additional mechanisms”.
Getting out of the optimizer's way allows it to explore complexity beyond human capability. The bitter lesson is not one of simplicity beating complexity; it is about the emergence of design complexity shifting from the human optimization substrate to the machine optimization substrate. The main point I was making is that meta subsumes: moving from architecture to meta-learned meta-architecture (e.g. recursions of learning to learn the architecture of compressed hypernetworks that generate/train the lower-level architectures).
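For concreteness, a minimal hypernetwork sketch of the lowest rung of that ladder (all names and sizes here are hypothetical, just to show "a small net generating the weights of a larger layer"):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a tiny "hyper" net emits the weights of a larger layer.
z_dim, d_in, d_out = 8, 32, 64                       # z = compressed architecture/task code
H = rng.normal(size=(z_dim, d_in * d_out)) * 0.01    # hypernetwork (here just linear)

def generated_layer(z, x):
    """Map code z to a weight matrix, then apply that generated layer to x."""
    W = (z @ H).reshape(d_in, d_out)   # generated weights for the lower-level layer
    return np.maximum(x @ W, 0.0)      # the generated ReLU layer itself

z = rng.normal(size=(z_dim,))          # one compressed "architecture" code
x = rng.normal(size=(5, d_in))         # batch of activations
y = generated_layer(z, x)              # shape (5, d_out)
```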
However, the network has to accept/generate human interpretable input/output, so it needs components that translate its internal representations to human interpretable forms.
Both a 1950s computer and a 2021 GPU-based computer, each running some complex software of its era, accept/generate human-interpretable inputs/outputs, but one is enormously more difficult to understand at any deep level.
This is roughly what we see in the paper “Transformer Feed-Forward Layers Are Key-Value Memories”
Side note, but that's such a dumb name for a paper; it's the equivalent of “Feed-Forward Network Layers Are Soft Threshold Memories”, and only marginally better than “Linear Neural Network Layers Are Matrix Multiplies”.
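For reference, the triviality being named: a standard two-layer feed-forward block already has that soft key-value form, with the rows of the first weight matrix acting as "keys" and the rows of the second as "values" (a toy sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem = 4, 16
x = rng.normal(size=(3, d))              # 3 token activations

K = rng.normal(size=(n_mem, d))          # first FF weight matrix, rows as "keys"
V = rng.normal(size=(n_mem, d))          # second FF weight matrix, rows as "values"

# Standard transformer FF block: f(x W1) W2, i.e. relu(x K^T) V.
coeffs = np.maximum(x @ K.T, 0.0)        # how strongly each "key" fires
ff_out = coeffs @ V                      # weighted blend of "value" rows
```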