What might an ambitious interpretability agenda focused on the sharp left turn and the generalization problem look like besides just trying harder at interpretability?
Some key pieces...
Desideratum 1: we need to aim for some kind of interpretability which will carry over across architectural/training-paradigm changes, internal ontology shifts at runtime, etc. The tools need to keep working without a lot of new investment every time there’s a big change.
In my own approach, that’s what Selection Theorems would give us: theorems which characterize certain interpretable internal structures as instrumentally convergent across a wide range of architectures/internal ontologies.
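As a rough illustration of the shape such a theorem might take (schematic only; the notation and threshold condition are mine, not an established result):

$$\forall M \in \mathcal{M}:\quad \mathrm{Perf}(M,\mathcal{E}) \ge \tau \;\Longrightarrow\; \exists\,\phi : S \hookrightarrow M \text{ (approximately)},$$

where $\mathcal{M}$ is a broad class of architectures/training setups, $\mathcal{E}$ the environment distribution, $\tau$ a capability threshold, $S$ the interpretable structure (e.g. an embedded world model or search process), and $\phi$ an embedding locating $S$ inside $M$’s internals. The point is that the conclusion holds regardless of which $M \in \mathcal{M}$ the training process happens to find.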
Desideratum 2: we need to be able to robustly tie the identified internal structures to some kind of high-level human-interpretable “things”. The “things” could be mathematical: e.g. we might aim to robustly recognize embedded search processes or embedded world models. Or the “things” could be real-world things: e.g. we might aim to robustly recognize embedded representations of natural abstractions from the environment (and the natural abstractions in the environment to which those representations correspond). Either way, this has to involve more than a bunch of proxies vaguely correlated with the human-intuitive concept(s); both the correspondence between learned representation and mathematical/real-world structure, and the correspondence between human concept and mathematical/real-world structure, have to be highly robust.
In my own approach, that’s what the formalization of natural abstractions would give us: theorems which let us robustly talk about the things-which-embedded-representations-represent, in a way which also ties those things to human concepts.
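To make the intended three-way correspondence explicit (informal sketch, my framing): we want a learned representation $R$ inside the net, a structure $\Lambda$ in mathematics or the environment, and a human concept $C$, with both legs of

$$R \;\xrightarrow{\ \text{theorem}\ }\; \Lambda \;\xleftarrow{\ \text{theorem}\ }\; C$$

backed by robust guarantees rather than proxy correlations.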
Desideratum 3: we need to somehow guarantee that no important/dangerous cognitive work routes around the interpretable structures. E.g. if we’re aiming to recognize embedded search processes, we need to somehow guarantee that there’s no optimization performed in a way which would circumvent things-recognized-by-our-search-process-interpretability-tool. Or if we’re aiming to recognize representations of natural abstractions in general, then we need to somehow guarantee that no important/dangerous cognitive work routes through channels other than those concepts.
The natural abstraction framework fits this desideratum particularly well, since it directly talks about abstractions which summarize all the information relevant at a distance. There are no capabilities to be gained by using non-natural abstractions.
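A minimal sketch of the “information relevant at a distance” condition (my notation; roughly the condition used in the natural abstraction work): for a local chunk of the environment $X$, everything sufficiently far away $X_{\mathrm{far}}$, and summary $F(X)$,

$$X \perp X_{\mathrm{far}} \mid F(X), \qquad\text{i.e.}\qquad I\big(X;\,X_{\mathrm{far}}\mid F(X)\big)\approx 0.$$

If that holds, any cognition aimed at predicting or steering far-away variables can route through $F(X)$ without losing anything, which is why non-natural abstractions shouldn’t buy extra capabilities.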
Finally, one thing which is not a desideratum but is an important barrier which most current interpretability work fails to tackle: interpretability is not compositional/reductive. If I understand each of 100 parts in isolation, that does not mean I understand a system consisting of those 100 parts together. (If interpretability were compositional/reductive, then we’d already understand neural nets just fine, because individual neurons and weights are very simple!)
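A toy illustration of that barrier (hypothetical example, not from the original post): each neuron below is trivially simple in isolation, yet the composed network computes XOR, a behaviour you cannot read off any single part.

```python
import numpy as np

def neuron(weights, bias, x):
    """A single ReLU neuron: trivially interpretable in isolation."""
    return max(0.0, float(np.dot(weights, x)) + bias)

def network(x):
    """Two simple neurons plus a linear readout, composed to compute XOR.
    No individual part 'contains' the XOR behaviour."""
    h1 = neuron(np.array([1.0, 1.0]), 0.0, x)   # computes x1 + x2
    h2 = neuron(np.array([1.0, 1.0]), -1.0, x)  # computes x1 AND x2 (fires only when both are 1)
    return h1 - 2.0 * h2                        # (x1 + x2) - 2*(x1 AND x2) = XOR on {0,1} inputs

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, network(np.array(x, dtype=float)))
```

Scaled up, the same point applies: understanding every weight of a large net individually tells you very little about what the net as a whole is doing.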