Contrapositive Natural Abstraction—Project Intro
Epistemic status: Early-stage model, and I’m relatively new to AI safety. This is also my first LessWrong post, but please hold my ideas and writing to a higher bar. Prioritize candidness over politeness.
Thanks to John Wentworth for pointers on an early draft.
TL;DR: I’m starting work on the Natural Abstraction Hypothesis from an over-general formalization, and narrowing until it’s true. This will start off purely information-theoretic, but I expect to add other maths eventually.
Motivation
My first idea upon learning about interpretability was to retarget the search. After reading a lot about deception auditing and Gabor filters, and starved of teloi, that dream began to die.
That was, until I found John Wentworth’s works. We seem to have similar intuitions about a whole lot of things. I think this is an opportunity for stacking.
Counter-proofs narrow our search
If we start off with an accurate and over-general model, we can prune it with counter-proofs until we’re certain where NAH holds. I think this approach is a better fit for me than building up from the ground.
Pruning also yields clear insights about where our model breaks; maybe NAH doesn’t work if the system has property X, in which case we should ensure frontier AI doesn’t have X.
Examples
As Nate Soares mentioned, we don’t currently have a finite-example theory of abstraction. If we could lower-bound the convergence of abstractions in terms of, say, capabilities, that would give us finite-domain convergence and capabilities scaling laws. I think that information theory has tools for measuring these convergences, and that those tools “play nicer” when most of the problem is specified in such terms.
KL-Divergence and causal network entropy
Let U be some probability distribution (our environment) and M be our model of it, where M is isomorphic to a causal network B with k nodes and n edges. Can we upper-bound the divergence of abstractions given model accuracy and compute?
The Kullback-Leibler divergence DKL(Y||X) of two probability distributions X and Y tells us how accurately X predicts Y, on average. This gives us a measure of world-model accuracy; what about abstraction similarity?
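As a concrete anchor, here is a minimal numerical sketch of DKL on toy distributions of my own choosing (nothing from the project itself):

```python
import numpy as np

def kl_divergence(y, x):
    """D_KL(Y || X): expected extra surprise from modeling Y with X, in nats."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    mask = y > 0  # outcomes with y_i = 0 contribute nothing to the sum
    return float(np.sum(y[mask] * np.log(y[mask] / x[mask])))

env = np.array([0.5, 0.3, 0.2])    # U: the "environment" distribution
model = np.array([0.4, 0.4, 0.2])  # M: an imperfect world-model

print(kl_divergence(env, env))        # 0.0: a perfect model has zero divergence
print(kl_divergence(env, model) > 0)  # any mismatch costs extra nats
```

A model with DKL(U||M) = 0 predicts the environment perfectly on average; the project’s question is what that number buys us about the model’s internal abstractions.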
We could take the cross-entropy H(X,Y) of each causal network Bi, and compare it to the training limit B∞.
Then our question looks like: can we parameterize argmax_i H(Bi,B∞) in terms of DKL(U||M), n, and k? Does B∞ (where DKL(U||M∞) is minimal) converge?
Probably not the way I want to; this is my first order of business.
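To make the setup concrete, here is a toy sketch in which each network Bi is collapsed to a flat joint distribution (a big simplification; real causal networks carry structure this ignores) and the training snapshots are hand-picked interpolations toward B∞:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i log q_i, in nats; equals the entropy H(P) when P = Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

# B_inf: joint distribution of the (hypothetical) training-limit network
b_inf = np.array([0.4, 0.3, 0.2, 0.1])

# B_i: training snapshots, contrived here to interpolate from uniform to B_inf
uniform = np.full(4, 0.25)
snapshots = [(1 - t) * uniform + t * b_inf for t in (0.0, 0.5, 0.9, 1.0)]

ces = [cross_entropy(b, b_inf) for b in snapshots]
# H(B_i, B_inf) falls monotonically toward the entropy of B_inf in this toy case
print(ces)
```

In this contrived sequence the cross-entropies converge; whether anything like that holds for networks produced by actual training is exactly the open question.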
Also, I don’t know whether it would be better to call the entire causal graph the “abstractions”, or to compare only nodes, edges, or subsets thereof. I also need a better model of compute than the number of causal-network nodes and edges, since each edge can be a computationally intractable function.
And I need to find what ensures that computationally limited models are isomorphic to causal networks; this is probably the second place where I’ll narrow my search.
Modularity
I expect that capability measures like KL-divergence won’t imply helpful convergence because of a lack of evolved modularity. I think that stochastic updates of some sort are needed to push environment latents into the model, and that they might e.g. need to be approximately Bayes-optimal.
Evolved modularity is a big delta for my credence in NAH. A True Name for modularity would plausibly be a sufficiently tight foundation for the abstraction I want.
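I don’t have a True Name for modularity, but as a placeholder intuition pump, Newman’s modularity score Q is one standard graph-theoretic proxy. A toy sketch on a hand-built graph:

```python
import numpy as np

def newman_modularity(A, communities):
    """Newman's Q: fraction of edges inside communities minus the expected
    fraction under a degree-preserving random null model."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    two_m = A.sum()  # = 2 * (edge count) for an undirected adjacency matrix
    same = np.equal.outer(communities, communities)  # same-community indicator
    return float(np.sum((A - np.outer(degrees, degrees) / two_m) * same) / two_m)

# Two triangles joined by a single bridge edge: a visibly modular graph
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

modular = newman_modularity(A, np.array([0, 0, 0, 1, 1, 1]))
lumped = newman_modularity(A, np.array([0, 0, 0, 0, 0, 0]))
print(modular > lumped)  # the two-module split scores higher than no split
```

Q rewards partitions whose within-community edge density beats a random baseline; whatever the eventual True Name is, I’d expect it to recover something in this spirit as a special case.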
Larger models
Under the same conditions, do sufficiently large models contain abstractions of smaller models? That is, do much larger causal graphs always contain a subset of abstractions which converges to the smaller graphs’ abstractions? Can we parameterize that convergence?
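As an idealized sketch of what “containing” could mean: below, a large model learns a joint over three variables while a small model only sees two, and marginalizing the extra variable out of the large model recovers the small model exactly. (Both models here are exact copies of the environment, which assumes away the whole difficulty; the open question is whether trained models approximate this.)

```python
import numpy as np

rng = np.random.default_rng(0)

# "Environment": an arbitrary joint distribution over three binary variables (X, Y, Z)
joint_xyz = rng.random((2, 2, 2))
joint_xyz /= joint_xyz.sum()

# Large model: learns the full joint. Small model: only ever sees (X, Y).
large_model = joint_xyz
small_model = joint_xyz.sum(axis=2)  # environment marginal over (X, Y)

# Does the large model "contain" the small one? Marginalize Z out and compare.
recovered = large_model.sum(axis=2)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl(small_model, recovered))  # 0.0: the abstraction is recovered exactly
```

The interesting version of the question replaces “marginalize out Z” with whatever operation extracts abstractions from a trained causal graph, and asks whether the divergence stays bounded.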
This is the telos of the project: the True Name of natural abstractions in superintelligences.