We’ve previously seen some discussion about <@abstraction@>(@Public Static: What is Abstraction?@), and some [claims](https://www.lesswrong.com/posts/wopE4nT28ausKGKwt/classification-of-ai-alignment-research-deconfusion-good?commentId=cKNrWxfxRgENS2EKX) that there are “natural” abstractions, or that AI systems will <@tend@>(@Chris Olah’s views on AGI safety@) to <@learn@>(@Conversation with Rohin Shah@) increasingly human-like abstractions (at least up to a point). To make this more crisp, consider some system; its abstraction is the information about the system that is relevant for predicting parts of the world that are “far away”. The **natural abstraction hypothesis** then states that:
1. This information is much lower-dimensional than the system itself.
2. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans.
3. These abstractions are “natural”, that is, a wide variety of cognitive architectures will learn to use approximately the same concepts to reason about the world.
For example, to predict the effect of a gas in a larger system, you typically just need to know its temperature, pressure, and volume, rather than the exact positions and velocities of each molecule of the gas. The natural abstraction hypothesis predicts that many cognitive architectures would all converge to using these concepts to reason about gases.
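As a toy illustration of just how low-dimensional this summary is, here is a minimal sketch (mine, not from the original post) using the ideal gas law: the pressure of a gas is predicted from three macroscopic numbers, rather than the ~6N position and velocity coordinates of its N molecules. The function name and specific values are hypothetical.

```python
# Hypothetical illustration: a 3-number summary (temperature, volume, amount of gas)
# suffices to predict pressure, instead of the full per-molecule microstate.

R = 8.314  # ideal gas constant, J / (mol * K)

def predict_pressure(temperature_k: float, volume_m3: float, moles: float) -> float:
    """Predict pressure (Pa) via the ideal gas law: P = nRT / V."""
    return moles * R * temperature_k / volume_m3

# One mole of gas at room temperature in a 10-litre container:
print(predict_pressure(temperature_k=298.0, volume_m3=0.01, moles=1.0))  # ~2.48e5 Pa
```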
If the natural abstraction hypothesis is true, it could make AI alignment dramatically simpler: our AI systems would learn to use approximately the same concepts as we do, which would help us both to “aim” them at the right goal and to peer inside them to figure out what exactly they are doing. This new project aims to test whether the natural abstraction hypothesis is true.
The first two claims will likely be tested empirically. We can build low-level simulations of interesting systems, and then compute what summary is useful for predicting its effects on “far away” things. We can then ask how low-dimensional that summary is (to test (1)), and whether it corresponds to human concepts (to test (2)).
A [followup post](https://www.alignmentforum.org/posts/f6oWbqxEwktfPrKJw/computing-natural-abstractions-linear-approximation) illustrates this in the case of a linear-Gaussian Bayesian network with randomly chosen graph structure. In this case, we take two regions of 110 nodes each that are far apart, and operationalize the information relevant between them as the covariance matrix between the two regions. It turns out that this covariance matrix has about 3-10 “dimensions” (depending on exactly how you count), supporting claim (1). (And in fact, if you now compare to another neighborhood, two of the three “dimensions” remain the same!) Unfortunately, this doesn’t give much evidence about (2), since humans don’t have good concepts for parts of linear-Gaussian Bayesian networks with randomly chosen graph structure.
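To make the shape of this computation concrete, here is a minimal sketch (my own, not the author’s code): sample a sparse random linear-Gaussian Bayesian network, take the covariance between two disjoint regions of 110 nodes, and count how many singular values are non-negligible. The graph construction, the proxy for “far apart”, and the cutoff threshold are all simplifying assumptions, so the exact dimension count will differ from the post’s.

```python
# Sketch (assumptions noted above): count the "dimensions" of information
# shared between two regions of a random sparse linear-Gaussian Bayesian network.
import numpy as np

rng = np.random.default_rng(0)
n = 500                       # total nodes, indexed in topological order
W = np.zeros((n, n))
for i in range(1, n):
    parents = rng.choice(i, size=min(i, 3), replace=False)  # sparse random DAG
    W[i, parents] = rng.normal(size=parents.shape)

# For x = W x + eps with unit-variance noise: Sigma = (I - W)^{-1} (I - W)^{-T}
A = np.linalg.inv(np.eye(n) - W)
Sigma = A @ A.T

region_1 = np.arange(0, 110)        # two disjoint regions of 110 nodes each
region_2 = np.arange(n - 110, n)    # "far apart" crudely proxied by index distance
cross_cov = Sigma[np.ix_(region_1, region_2)]

# Count singular values above an (arbitrary) fraction of the largest one.
singular_values = np.linalg.svd(cross_cov, compute_uv=False)
effective_dim = int(np.sum(singular_values > 1e-2 * singular_values[0]))
print(effective_dim)
```

The “about 3-10, depending on exactly how you count” range in the summary reflects exactly this kind of dependence on where the cutoff is placed.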
While (3) can also be tested empirically through simulation, the hope is to go further and prove theorems stating that nearly all cognitive architectures from some class of models would learn the same concepts in appropriate classes of environments.
To quote the author, “the holy grail of the project would be a system which provably learns all learnable abstractions in a fairly general class of environments, and represents those abstractions in a legible way. In other words: it would be a standardized tool for measuring abstractions. Stick it in some environment, and it finds the abstractions in that environment and presents a standard representation of them.”
Planned opinion:
The notion of “natural abstractions” seems quite important to me. There are at least some weak versions of the hypothesis that seem obviously true: for example, if you ask GPT-3 some new type of question it has never seen before, you can predict pretty confidently that it is still going to respond with real words rather than a string of random characters. This is effectively because you expect that GPT-3 has learned the “natural abstraction” of the words used in English and that it uses this natural abstraction to drive its output (leaving aside the cases where it must produce output in some other language).
The version of the natural abstraction hypothesis investigated here seems a lot stronger, and I’m excited to see how the project turns out. I expect the author will post several short updates over time; I probably won’t cover each of these individually, so if you want to follow the project in real time, I recommend following it on the Alignment Forum.