Contrapositive Natural Abstraction—Project Intro
Epistemic status: Early-stage model, and I’m relatively new to AI safety. This is also my first LessWrong post, but please hold my ideas and writing to a higher bar. Prioritize candidness over politeness.
Thanks to John Wentworth for pointers on an early draft.
TL;DR: I’m starting work on the Natural Abstraction Hypothesis from an over-general formalization, and narrowing until it’s true. This will start off purely information-theoretic, but I expect to add other maths eventually.
Motivation
My first idea upon learning about interpretability was to retarget the search. After reading a lot about deception auditing and Gabor filters, and starved of teloi, that dream began to die.
That was, until I found John Wentworth’s works. We seem to have similar intuitions about a whole lot of things. I think this is an opportunity for stacking.
Counter-proofs narrow our search
If we start off with an accurate and over-general model, we can prune it with counter-proofs until we’re certain where NAH holds. I think this approach is a better fit for me than building up from the ground.
Pruning also yields clear insights about where our model breaks; maybe NAH doesn’t work if the system has property X, in which case we should ensure frontier AI doesn’t have X.
Examples
As Nate Soares mentioned, we don’t currently have a finite-example theory of abstraction. If we could lower-bound the convergence of abstractions in terms of, say, capabilities, that would give us finite-domain convergence and capabilities scaling laws. I think that information theory has tools for measuring these convergences, and that those tools “play nicer” when most of the problem is specified in such terms.
KL-Divergence and causal network entropy
Let U be some probability distribution (our environment) and M be our model of it, where M is isomorphic to a causal network B with k nodes and n edges. Can we upper-bound the divergence of abstractions given model accuracy and compute?
The Kullback-Leibler divergence DKL(Y||X) of two probability distributions X and Y tells us how accurately X predicts Y, on average. This gives us a measure of world-model accuracy; what about abstraction similarity?
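As a concrete anchor, here is a minimal numerical sketch of DKL on toy distributions of my own choosing (nothing from the project itself):

```python
import numpy as np

def kl_divergence(y, x):
    """D_KL(Y || X): expected extra surprise from modeling Y with X, in nats."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    mask = y > 0  # outcomes with y_i = 0 contribute nothing to the sum
    return float(np.sum(y[mask] * np.log(y[mask] / x[mask])))

env = np.array([0.5, 0.3, 0.2])    # U: the "environment" distribution
model = np.array([0.4, 0.4, 0.2])  # M: an imperfect world-model

print(kl_divergence(env, env))        # 0.0: a perfect model has zero divergence
print(kl_divergence(env, model) > 0)  # any mismatch costs extra nats
```

A model with DKL(U||M) = 0 predicts the environment perfectly on average; the project’s question is what that number buys us about the model’s internal abstractions.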
We could take the cross-entropy H(X,Y) of each causal network Bi, and compare it to the training limit B∞.
Then our question looks like: can we parameterize argmax_i H(Bi,B∞) in terms of DKL(U||M), n, and k? Does B∞ (where DKL(U||M∞) is minimal) converge?
Probably not the way I want to; this is my first order of business.
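To make the setup concrete, here is a toy sketch in which each network Bi is collapsed to a flat joint distribution (a big simplification; real causal networks carry structure this ignores) and the training snapshots are hand-picked interpolations toward B∞:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i log q_i, in nats; equals the entropy H(P) when P = Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

# B_inf: joint distribution of the (hypothetical) training-limit network
b_inf = np.array([0.4, 0.3, 0.2, 0.1])

# B_i: training snapshots, contrived here to interpolate from uniform to B_inf
uniform = np.full(4, 0.25)
snapshots = [(1 - t) * uniform + t * b_inf for t in (0.0, 0.5, 0.9, 1.0)]

ces = [cross_entropy(b, b_inf) for b in snapshots]
# H(B_i, B_inf) falls monotonically toward the entropy of B_inf in this toy case
print(ces)
```

In this contrived sequence the cross-entropies converge; whether anything like that holds for networks produced by actual training is exactly the open question.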
Also, I don’t know whether it would be better to call the entire causal graph the “abstractions”, or to compare only nodes, edges, or subsets thereof. I also need a better model of compute than the number of causal-network nodes and edges, since each edge can be a computationally intractable function.
And I need to find what ensures that computationally limited models are isomorphic to causal networks; this is probably the second place where I’ll narrow my search.
Modularity
I expect that capability measures like KL-divergence won’t imply helpful convergence because of a lack of evolved modularity. I think that stochastic updates of some sort are needed to push environment latents into the model, and that they might e.g. need to be approximately Bayes-optimal.
Evolved modularity is a big delta for my credence in NAH. A True Name for modularity would plausibly be a sufficiently tight foundation for the abstraction I want.
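I don’t have a True Name for modularity, but as a placeholder intuition pump, Newman’s modularity score Q is one standard graph-theoretic proxy. A toy sketch on a hand-built graph:

```python
import numpy as np

def newman_modularity(A, communities):
    """Newman's Q: fraction of edges inside communities minus the expected
    fraction under a degree-preserving random null model."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    two_m = A.sum()  # = 2 * (edge count) for an undirected adjacency matrix
    same = np.equal.outer(communities, communities)  # same-community indicator
    return float(np.sum((A - np.outer(degrees, degrees) / two_m) * same) / two_m)

# Two triangles joined by a single bridge edge: a visibly modular graph
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

modular = newman_modularity(A, np.array([0, 0, 0, 1, 1, 1]))
lumped = newman_modularity(A, np.array([0, 0, 0, 0, 0, 0]))
print(modular > lumped)  # the two-module split scores higher than no split
```

Q rewards partitions whose within-community edge density beats a random baseline; whatever the eventual True Name is, I’d expect it to recover something in this spirit as a special case.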
Larger models
Under the same conditions, do sufficiently large models contain abstractions of smaller models? That is, do much larger causal graphs always contain a subset of abstractions which converges to the smaller graphs’ abstractions? Can we parameterize that convergence?
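As an idealized sketch of what “containing” could mean: below, a large model learns a joint over three variables while a small model only sees two, and marginalizing the extra variable out of the large model recovers the small model exactly. (Both models here are exact copies of the environment, which assumes away the whole difficulty; the open question is whether trained models approximate this.)

```python
import numpy as np

rng = np.random.default_rng(0)

# "Environment": an arbitrary joint distribution over three binary variables (X, Y, Z)
joint_xyz = rng.random((2, 2, 2))
joint_xyz /= joint_xyz.sum()

# Large model: learns the full joint. Small model: only ever sees (X, Y).
large_model = joint_xyz
small_model = joint_xyz.sum(axis=2)  # environment marginal over (X, Y)

# Does the large model "contain" the small one? Marginalize Z out and compare.
recovered = large_model.sum(axis=2)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl(small_model, recovered))  # 0.0: the abstraction is recovered exactly
```

The interesting version of the question replaces “marginalize out Z” with whatever operation extracts abstractions from a trained causal graph, and asks whether the divergence stays bounded.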
This is the telos of the project: the True Name of natural abstractions in superintelligences.