Contrapositive Natural Abstraction—Project Intro
Epistemic status: Early-stage model, and I’m relatively new to AI safety. This is also my first LessWrong post, but please hold my ideas and writing to a high bar anyway. Prioritize candidness over politeness.
Thanks to John Wentworth for pointers on an early draft.
TL;DR: I’m starting work on the Natural Abstraction Hypothesis from an over-general formalization, and narrowing it until it’s true. This will start off purely information-theoretic, but I expect to add other maths eventually.
Motivation
My first idea upon learning about interpretability was to retarget the search. After I read a lot about deception auditing and Gabor filters, and found myself starved of teloi, that dream began to die.
That was, until I found John Wentworth’s works. We seem to have similar intuitions about a whole lot of things. I think this is an opportunity for stacking.
Counter-proofs narrow our search
If we start off with an accurate and over-general model, we can prune it with counter-proofs until we’re certain where NAH works. I think this approach is a better fit for me than building from the ground up.
Pruning also yields clear insights about where our model breaks; maybe NAH doesn’t work if the system has property X, in which case we should ensure frontier AI doesn’t have X.
Examples
As Nate Soares mentioned, we don’t currently have a finite-example theory of abstraction. If we could lower-bound the rate at which abstractions converge in terms of, say, capabilities, that would give us both finite-domain convergence and capability scaling laws. I think that information theory has tools for measuring these convergences, and that those tools “play nicer” when most of the problem is specified in information-theoretic terms.
KL-Divergence and causal network entropy
Let $P$ be some probability distribution (our environment) and $Q$ be our model of it, where $Q$ is isomorphic to a causal network with $N$ nodes and $E$ edges. Can we upper-bound the divergence of abstractions given model accuracy and compute?
The Kullback-Leibler divergence $D_{KL}(P \| Q)$ of two probability distributions $P$ and $Q$ tells us how accurately $Q$ predicts $P$, on average. This gives us a measure of world-model accuracy; what about abstraction similarity?
We could take the cross-entropy $H(P, Q)$ of each causal network $Q$, and compare it to the training limit $Q^*$.
Then our question looks like: can we parameterize $D_{KL}(Q^* \| Q)$ in terms of $H(P, Q)$, $N$ and $E$? Does $D_{KL}(Q^* \| Q)$ (where $H(P, Q^*)$ is minimal) converge?
Probably not the way I want to; this is my first order of business.
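For concreteness, here’s a minimal numerical sketch of the quantities above, with $P$, $Q$, and $Q^*$ written as flat discrete distributions rather than factored causal networks; the specific numbers, and the use of $D_{KL}(Q^* \| Q)$ as an “abstraction divergence” proxy, are illustrative assumptions rather than part of the proposal.

```python
# Minimal sketch of the quantities above, for small discrete distributions.
# P is the "environment", Q a compute-limited model, Q_star the training limit.
# Real causal networks would factorize these over nodes and edges; flat arrays
# are just for illustration.
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits, assuming full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

def cross_entropy(p, q):
    """H(p, q) = H(p) + D_KL(p || q), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log2(q)))

P      = np.array([0.5, 0.3, 0.2])   # environment
Q_star = np.array([0.5, 0.3, 0.2])   # training limit (here: matches P exactly)
Q      = np.array([0.4, 0.4, 0.2])   # compute-limited model

print(kl(P, Q))              # world-model (in)accuracy
print(cross_entropy(P, Q))   # training loss against the environment
print(kl(Q_star, Q))         # crude proxy for abstraction divergence
```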
Also, I don’t know whether it would be better to call the entire causal graph the “abstractions”, or to just compare nodes or edges, or subsets thereof. I also need a better model of compute than the number of causal-network nodes and edges, since each edge can be a computationally intractable function.
And I need to find what ensures that computationally limited models are isomorphic to causal networks; this is probably the second area where I’ll narrow my search.
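To make the compute concern concrete, here is a toy sketch (all names hypothetical) of two causal edges with identical graph structure but very different computational cost, which is why counting nodes and edges alone is a poor compute measure.

```python
# Two "causal networks" with identical structure (one node, one incoming edge),
# where one edge mechanism is cheap and the other hides an exponential-time
# computation. Node/edge counts can't tell these apart.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    name: str
    parents: List[str]
    mechanism: Callable[..., int]  # maps parent values to this node's value

cheap = Node("y", ["x"], mechanism=lambda x: 2 * x)

def brute_force_mechanism(x: int) -> int:
    # Exponential in x: same graph structure, vastly more compute per edge.
    return sum(1 for i in range(2 ** min(x, 20)) if i % 3 == 0)

expensive = Node("y", ["x"], mechanism=brute_force_mechanism)

print(cheap.mechanism(3), expensive.mechanism(3))  # 6 3
```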
Modularity
I expect that capability measures like KL-divergence won’t, by themselves, imply helpful convergence, for lack of evolved modularity. I think that stochastic updates of some sort are needed to push environment latents into the model, and that they might, e.g., need to be approximately Bayes-optimal.
Evolved modularity is a big delta for my credence in NAH. A True Name for modularity would plausibly be a sufficiently tight foundation for the abstraction I want.
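For contrast with the True Name I’m looking for, here’s a sketch of one existing, off-the-shelf formalization, Newman’s graph modularity, applied to a hypothetical toy graph (this assumes networkx is available, and I’m not claiming it’s the right notion).

```python
# Newman modularity of a toy graph with two dense clusters joined by one edge.
# Shown only as a contrast with the "True Name" of modularity I'm after.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # first dense cluster
                  ("d", "e"), ("e", "f"), ("d", "f"),   # second dense cluster
                  ("c", "d")])                          # weak inter-cluster link

communities = greedy_modularity_communities(G)
print(communities)                  # expect the two triangles as communities
print(modularity(G, communities))   # higher for more modular graphs
```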
Larger models
Under the same conditions, do sufficiently large models contain the abstractions of smaller models? I.e., do much larger causal graphs always contain a subset of abstractions which converges to the smaller graphs’ abstractions? Can we parameterize that convergence?
This is the telos of the project: the True Name of natural abstractions in superintelligences.
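As a toy operationalization of this question (strong simplifying assumptions, hypothetical numbers): treat the larger model as a joint distribution over extra variables, marginalize it onto the smaller model’s variables, and check the KL divergence to the smaller model.

```python
# Does the larger model "contain" the smaller model's abstraction? Here we
# check whether marginalizing the larger joint onto the shared variable
# recovers the smaller model's distribution (KL of zero means exact containment).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))

small = np.array([0.7, 0.3])           # smaller model: distribution over binary X

large = np.array([[0.5, 0.2],          # larger model: joint over (X, Y),
                  [0.1, 0.2]])         # indexed as large[x, y]

large_marginal_x = large.sum(axis=1)   # larger model's "abstraction" of X

print(large_marginal_x)                # [0.7 0.3]
print(kl(small, large_marginal_x))     # 0.0: containment is exact in this toy case
```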
Also, neural nets only compute to some precision, and often work pretty well if that precision is reduced to 4–8 bits, which is a pretty significant limit on their computational capacity compared to assuming arbitrary precision.
Yes, I agree. I expect abstractions to typically involve much more than 4–8 bits of information. On my model, any neural network, be it an MLP, a KAN, or something new, will approximate abstractions with multiple nodes in parallel when the network is wide enough. I.e., the causal graph I mentioned is very distinct from the NN which might be running it.
Though now that you mention it, I wonder whether low-precision NN weights are acceptable because of some network property (maybe SGD is so stochastic that higher precision doesn’t help) or because of the environment (maybe natural latents tend to be lower-entropy).
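As a toy illustration of the parallel-nodes point (arbitrary numbers, not a claim about how real networks encode things): a single 4-bit unit can only get within about one quantization step of a latent value, but averaging many noisy 4-bit units recovers it far more precisely.

```python
# A single 4-bit unit represents only 16 levels, but many noisy 4-bit units
# read out in parallel recover the latent to much better than 4-bit precision.
import numpy as np

rng = np.random.default_rng(0)
latent = 0.6180339887   # the "abstraction" the network needs to carry

def quantize_4bit(x):
    """Round values in [0, 1] to one of 16 levels."""
    return np.round(np.asarray(x) * 15) / 15

one_unit   = quantize_4bit(latent)
many_units = quantize_4bit(np.clip(latent + rng.normal(0, 0.05, size=1024), 0, 1))

print(abs(one_unit - latent))           # ~0.02: stuck at 4-bit precision
print(abs(many_units.mean() - latent))  # much smaller: precision from width
```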
Anyways, thanks for engaging. It’s encouraging to see someone comment.
You’re assuming a lot of familiarity with your chosen notation, i.e. you probably just lost a significant fraction of your readers — I’d suggest spending a few sentences defining terminology.
Shoot, thanks. Hopefully it’s clearer now.
Yes: thanks!