The Pragmascope Idea
Pragma (Greek): thing, object.
A “pragmascope”, then, would be some kind of measurement or visualization device which shows the “things” or “objects” present.
I currently see the pragmascope as the major practical objective of work on natural abstractions. As I see it, the core theory of natural abstractions is now 80% nailed down; I’m now working to get it across the theory-practice gap, and the pragmascope is the big milestone on the other side of that gap.
This post introduces the idea of the pragmascope and what it would look like.
Background: A Measurement Device Requires An Empirical Invariant
First, an aside on developing new measurement devices.
Why The Thermometer?
What makes a thermometer a good measurement device? Why is “temperature”, as measured by a thermometer, such a useful quantity?
Well, at the most fundamental level… we stick a thermometer in two different things. Then, we put those two things in contact. Whichever one showed a higher “temperature” reading on the thermometer gets colder, whichever one showed a lower “temperature” reading on the thermometer gets hotter, all else equal (i.e. controlling for heat exchanged with other things in the environment). And this is robustly true across a huge range of different things we can stick a thermometer into.
It didn’t have to be that way! We could imagine a world (with very different physics) where, for instance, heat always flows from red objects to blue objects, from blue objects to green objects, and from green objects to red objects. But we don’t see that in practice. Instead, we see that each system can be assigned a single number (“temperature”), and then when we put two things in contact, the higher-number thing gets cooler and the lower-number thing gets hotter, regardless of which two things we picked.
Underlying the usefulness of the thermometer is an empirical fact, an invariant: the fact that which-thing-gets-hotter and which-thing-gets-colder when putting two things into contact can be predicted from a single one-dimensional real number associated with each system (i.e. “temperature”), for an extremely wide range of real-world things.
Generalizing: a useful measurement device starts with identifying some empirical invariant. There needs to be a wide variety of systems which interact in a predictable way across many contexts, if we know some particular information about each system. In the case of the thermometer, a wide variety of systems get hotter/colder when in contact, in a predictable way across many contexts, if we know the temperature of each system.
So what would be an analogous empirical invariant for a pragmascope?
The Role Of The Natural Abstraction Hypothesis
The natural abstraction hypothesis has three components:
Chunks of the world generally interact with far-away chunks of the world via relatively-low-dimensional summaries
A broad class of cognitive architectures converge to use subsets of these summaries (i.e. they’re instrumentally convergent)
These summaries match human-recognizable “things” or “concepts”
For purposes of the pragmascope, we’re particularly interested in claim 2: a broad class of cognitive architectures converge to use subsets of the summaries. If true, that sure sounds like an empirical invariant!
So what would a corresponding measurement device look like?
What would a pragmascope look like, concretely?
The “measurement device” (probably a python function, in practice) should take in some cognitive system (e.g. a trained neural network) and maybe its environment (e.g. simulator/data), and spit out some data structure representing the natural “summaries” in the system/environment. Then, we should easily be able to take some other cognitive system trained on the same environment, extract the natural “summaries” from that, and compare. Based on the natural abstraction hypothesis, we expect to observe things like:
A broad class of cognitive architectures trained on the same data/environment end up with subsets of the same summaries.
Two systems with the same summaries are able to accurately predict the same things on new data from the same environment/distribution.
On inspection, the summaries correspond to human-recognizable “things” or “concepts”.
A system is able to accurately predict things involving the same human-recognizable concepts the pragmascope says it has learned, and cannot accurately predict things involving human-recognizable concepts the pragmascope says it has not learned.
It’s these empirical observations which, if true, will underpin the usefulness of the pragmascope. The more precisely and robustly these sorts of properties hold, the more useful the pragmascope. Ideally we’d even be able to prove some of them.
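To make the shape of such a tool concrete, here is a minimal interface sketch in Python. Everything here is a hypothetical placeholder, not an existing implementation: the `Summary` container, the function names, and the overlap score are all assumptions about what the eventual tool might look like.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Summary:
    """Placeholder for one natural 'summary' extracted from a trained system."""
    name: str     # human-readable label, if one can be attached
    payload: Any  # whatever data structure the math ultimately dictates


def pragmascope(model: Any, environment: Any = None) -> list[Summary]:
    """Hypothetical: extract the natural summaries learned by `model`."""
    raise NotImplementedError("the whole research program, in one stub")


def summary_overlap(a: list[Summary], b: list[Summary]) -> float:
    """Hypothetical: score in [0, 1] for how much two summary sets agree."""
    raise NotImplementedError
```

With an interface like this, the empirical predictions above become testable statements: e.g. `summary_overlap(pragmascope(net1), pragmascope(net2))` should come out high for different architectures trained on the same data.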
What’s The Output Data Structure?
One obvious currently-underspecified piece of the picture: what data structures will the pragmascope output, to represent the “summaries”? I have some current-best-guesses based on the math, but the main answer at this point is “I don’t know yet”. I expect looking at the internals of trained neural networks will give lots of feedback about what the natural data structures are.
Probably the earliest empirical work will just punt on standard data structures, and instead focus on translating internal-concept-representations in one net into corresponding internal-concept-representations in another. For instance, here’s one experiment I recently proposed:
Train two nets, with different architectures (both capable of achieving zero training loss and good performance on the test set), on the same data.
Compute the small change in data dx which would induce a small change in trained parameter values dθ along each of the narrowest directions of the ridge in the loss landscape (i.e. eigenvectors of the Hessian with largest eigenvalue).
Then, compute the small change in parameter values dθ in the second net which would result from the same small change in data dx.
Prediction: the dθ directions computed will approximately match the narrowest directions of the ridge in the loss landscape of the second net.
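A minimal numerical sketch of the "narrowest directions" ingredient of this experiment, on a toy two-layer net. The architecture, data, and hyperparameters are all made up for illustration, and everything is done with finite differences; a real run would use autodiff on an actual trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained net: 2 inputs -> 3 tanh hidden units -> scalar
# output, with all 9 parameters packed into one flat vector.
def loss(theta, X, y):
    w1 = theta[:6].reshape(3, 2)
    w2 = theta[6:9]
    preds = np.tanh(X @ w1.T) @ w2
    return np.mean((preds - y) ** 2)

X = rng.normal(size=(8, 2))
y = rng.normal(size=8)
theta = rng.normal(size=9)

def grad(theta, X, y, h=1e-5):
    """Central-difference gradient (a real run would use autodiff)."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (loss(theta + e, X, y) - loss(theta - e, X, y)) / (2 * h)
    return g

loss0 = loss(theta, X, y)
for _ in range(2000):  # crude gradient descent toward a (local) optimum
    theta -= 0.05 * grad(theta, X, y)

def hessian(theta, X, y, h=1e-4):
    """Finite-difference Hessian of the loss at the trained parameters."""
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ei[i] = h
            ej = np.zeros(n)
            ej[j] = h
            H[i, j] = (loss(theta + ei + ej, X, y) - loss(theta + ei, X, y)
                       - loss(theta + ej, X, y) + loss(theta, X, y)) / h ** 2
    return (H + H.T) / 2  # symmetrize away numerical noise

H = hessian(theta, X, y)
eigvals, eigvecs = np.linalg.eigh(H)  # eigenvalues in ascending order
# The "narrowest directions of the ridge" are the eigenvectors with the
# largest eigenvalues:
narrowest = eigvecs[:, -3:]
```

The experiment proper would then solve for the dx inducing dθ along each column of `narrowest`, push that same dx through a second trained net, and compare against the second net's own top Hessian eigenvectors.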
Conceptually, this sort of experiment is intended to take all the stuff one network learned, and compare it to all the stuff the other network learned. It wouldn’t yield a full pragmascope, because it wouldn’t say anything about how to factor all the stuff a network learns into individual concepts, but it would give a very well-grounded starting point for translating stuff-in-one-net into stuff-in-another-net (to first/second-order approximation).
Question 1: What’s the minimal set of articles one should read to understand this 80%?
Question/Remark 2: AFAICT, your theory has a major missing piece, which is proving that “abstraction” (formalized according to your way of formalizing it) is actually a crucial ingredient of learning/cognition. The way I see it, such a proof should be by demonstrating that hypothesis classes defined in terms of probabilistic graphical models / abstraction hierarchies can be learned with good sample complexity (and better yet if you can say something about the computational complexity), in a manner that cannot be achieved if you discard any of the important-according-to-you pieces. You might have some different approach to this, but I’m not sure what it is.
Telephone Theorem, Redundancy/Resampling, and Maxent for the math, Chaos for the concepts.
If we want to show that abstraction is a crucial ingredient of learning/cognition, then “Can we efficiently learn hypothesis classes defined in terms of abstraction hierarchies, as captured by John’s formalism?” is entirely the wrong question. Just because something can be learned efficiently doesn’t mean it’s convergent for a wide variety of cognitive systems. And even if such hypothesis classes couldn’t be learned efficiently in full generality, it would still be possible for a subset of that hypothesis class to be convergent for a wide variety of cognitive systems, in which case general properties of the hypothesis class would still apply to those systems’ cognition.
The question we actually want here is “Is abstraction, as captured by John’s formalism, instrumentally convergent for a wide variety of cognitive systems?”. And that question is indeed not yet definitively answered. The pragmascope itself would largely allow us to answer that question empirically, and I expect the ability to answer it empirically will quickly lead to proofs as well.
Thank you!
I believe that the relevant cognitive systems all look like learning algorithms for a prior of a certain fairly specific type. I don’t know what this prior looks like, but it’s something very rich on the one hand and efficiently learnable on the other. So, if you showed that your formalism naturally produces priors that seem closer to that “holy grail prior”, in terms of richness/efficiency, compared to priors that we already know (e.g. MDPs with a small number of states, which are not rich enough, or the Solomonoff prior, which is both statistically and computationally intractable), that would at least be evidence that you’re going in the right direction.
Hmm, I’m not sure what it would mean for a subset of a hypothesis class to be “convergent”.
That’s interesting, but I’m still not sure what it means exactly. Let’s say we take a reinforcement learner with a specific hypothesis class, such as all MDPs of a certain size, or some family of MDPs with low eluder dimension, or the actual AIXI. How would you determine whether your formalism is “instrumentally convergent” for each of those? Is there a rigorous way to state the question?
Doesn’t the necessity of abstraction follow from size concerns? The alternative to abstraction would be to measure and simulate everything in full detail, which can only be done if you are “exponentially bigger than the universe” (and have exponentially many universes to learn from).
One could argue that some kind of abstraction is necessary due to size concerns, but that alone does not necessarily nail down my whole formalism.
Okay, really rough idea on how to identify where a ML model’s goals are stored + measure how much of an optimizer it is. If successful, it might provide a decent starting point for disentangling concepts from each other.
The Ground of Optimization mentions “retargetability” as one of the variables of optimizing systems. How much of the system do you need to change in order to make it optimize towards a different target configuration? Can you easily split the system into the optimizer and the optimized? For example: In a house-plus-construction-company system, we just need to vary the house’s schematics to make the system optimize towards wildly different houses. Conversely, to make a ball placed at the edge of a giant inverted cone come to rest in a different location, we’d need to change the shape of the entire cone.
Intuitively, it seems like it should be possible to identify goals in neural networks the same way. A “goal” is the minimal set of parameters that you need to perturb in order to make the network optimize a meaningfully different metric without any loss of capability.
Various shallow pattern-matchers/look-up tables are not easily retargetable — you’d need to rewrite most of their parameters. They’re more like inverted cones.
Idealized mesa-optimizers with a centralized crystallized mesa-objective are very retargetable — their utility function is precisely mathematically defined, disentangled from capabilities, and straightforwardly rewritten.
Intermediate systems (e.g., shard economies/heuristics over world-models) are somewhat retargetable. There may be limited dimensions along which their mesa-objectives may be changed without capability loss, limited “angles” in concept-space by which their targeting may be adjusted. Alternatively/additionally, you’d need to rewrite the entire suite of shards/heuristics at once and in a cross-dependent manner.
As a bonus, the fraction of parameters you need to change to retarget the system roughly tells you how much of an optimizer it is.
The question is how to implement this. It’s easy to imagine the algorithms that may work if we had infinite compute, but practically?
Neuron Shapleys may be a good starting point? The linked paper seems to “use the Shapley value framework to measure the importance of different neurons in determining an arbitrary metric of the neural net output”, and the authors use it to tank accuracy/remove social bias/increase robustness to adversarial attacks just by rewriting a few neurons. It might be possible to do something similar to detect goal-encoding neurons? Haven’t looked into it in-depth yet, though.
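The core Shapley machinery behind that kind of approach is easy to sketch. Below is a Monte Carlo Shapley estimate over a toy "metric": four hypothetical neurons contribute additively, which is a made-up stand-in chosen so the exact answer is known (for an additive metric, each neuron's Shapley value equals its contribution). The actual Neuron Shapley method ablates real neurons and measures network accuracy; none of that is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "a metric of the net output with a subset of neurons
# ablated": four 'neurons' contribute additively, and neuron 1 dominates.
contrib = np.array([0.05, 0.6, 0.1, 0.25])

def metric(active):
    """Metric value when only the neurons in the boolean mask are active."""
    return contrib[active].sum()

def shapley_mc(n_neurons, n_samples=2000):
    """Monte Carlo estimate of each neuron's Shapley value for `metric`:
    average each neuron's marginal contribution over random orderings."""
    vals = np.zeros(n_neurons)
    for _ in range(n_samples):
        perm = rng.permutation(n_neurons)
        active = np.zeros(n_neurons, dtype=bool)
        prev = metric(active)
        for i in perm:
            active[i] = True
            cur = metric(active)
            vals[i] += cur - prev  # marginal contribution of neuron i
            prev = cur
    return vals / n_samples

phi = shapley_mc(len(contrib))
# For this additive toy metric, phi recovers `contrib` exactly.
```

A goal-detection variant would swap `metric` for something like "how much does the network still optimize its original objective", and look for the small set of neurons with outsized Shapley values on that metric.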
Neat idea. One thing I’d watch out for is that “subset of the neurons” might not be the right ontology for a conceptually-”small” change. E.g. in the Rome paper, they made low-rank updates rather than work with individual neurons. So bear in mind that figuring out the ontology through which to view the network’s internals may itself be part of the problem.
My understanding of how the natural abstractions hypothesis relates to trained machine learning models and what the main challenge in actually applying it is:
How it applies:
A high-dimensional dataset from the real world such as MNIST contains “natural abstractions”, i.e. patterns that show up in a lot of places. If you train a machine learning model on the dataset, it will pick up all the sufficiently redundant information in the dataset (and often also a lot of non-redundant information), making it function as a sort of “summary statistic” of the data. Different systems trained on the same data would tend to pick up the same information, because they all have the same dataset and the information is part of the dataset, due to the redundancy.
How it is a challenge:
Think about what format the machine learning model gets the abstractions in. That is, if you want to get the information back out again, how can you do it? Well, one guaranteed way, for an image classifier, is to take the original images and feed them to the network, and it will output the classes according to the patterns it learned. Similarly, for a generative model, you can have it generate a dataset that resembles the one it was trained on.
These operations would obviously extract all of the information that the models have learned, but the operations are not very useful, as the extracted information is not in a format that is nice to get an overview of. So ideally we’d have a way to extract better structured information.
But it’s not obvious from the natural abstraction hypothesis that such a method exists. The natural abstraction hypothesis guarantees that the information is embedded in the models somehow, but it doesn’t seem like it guarantees that it’s embedded in some nice format, or that there is a nice way to extract it, beyond the way that it was put in.
I’m not sure whether you agree with this. If the above is true, my lesson is that we need to think of ways to structure the models so as to put the information in a nicer format. But maybe I’m missing something. Maybe I need to re-digest the gKPD argument or something.
This is basically correct, other than the part about not having any guarantee that the information is in a nice format. The Maxent and Abstractions arguments do point toward a relatively nice format, though it’s not yet clear what the right way is to bind the variables of those arguments to stuff in a neural net. (Though I expect the data structures actually used will have additional structure to them on top of the maxent form.)
I’ve been thinking about what results this experiment would yield (have been too lazy to actually perform the experiment myself 😅). You’ve probably already performed the experiment, so my theorizing here probably isn’t useful to you, but I thought I should bring it up anyway, so you can correct my theorizing if wrong/so other people can learn from it.
I believe this dx would immediately bring you “off the data manifold”, perhaps unless the network has been trained to be very robust.
For instance the first eigenvector of the Hessian probably represents the average output of the model, but if e.g. your model is an image classifier and all the images in the dataset have a white background, then rather than just using the network’s built-in bias parameters to control the average output, it could totally decide to just pick a random combination of those white pixels and use them for the intercept. But there’s no reason two different networks are going to use the same combination, since it’s a massively underspecified problem, so this dx won’t generalize to other networks.
I did try it on a simple MNIST classifier. The main result was that all effects were dominated by a handful of misclassified or barely-correctly-classified data points, and the phenomenon I originally hypothesized just wasn’t super relevant.
Since then, I’ve also tried a different kind of experiment to translate interpretable features across nets, this time on a simple generative model. Basically, the experiment just directly applied the natural abstraction hypothesis to the image-distributions produced by nets trained on the same data (using a first-order approximation). That one worked a lot better, but didn’t really connect to peak breadth or even say much about network internals in general.
Ah, I had been thinking that this method would weight these sorts of data points highly, but I wasn’t sure how critical it would be. I’ve assumed it would be possible to reweight things to focus on a better distribution of data points, because it seems like there would be some very mathematically natural ways of doing this reweighting. Is this something you’ve experimented with?
… I suppose it may make more sense to do this reweighting for my purposes than for yours.
When you say “directly applied”, what do you mean?
Saying much about network internals seems difficult as ever. I get the impression that these methods can’t really do it, due to being too local; they can say something about how the network behaves on the data manifold, but networks that are internally very different can behave the same on the data manifold, and so these methods can’t really distinguish those networks.
Meta: I’m going through a backlog of comments I never got around to answering. Sorry it took three months.
Something along those lines might work; I didn’t spend much time on it before moving to a generative model.
The actual main thing I did was to compute the SVD of the Jacobian of a generative network’s output (i.e. the image) with respect to its input (i.e. the latent vector). Results of interest:
Conceptually, near-0 singular values indicate a direction-in-image-space in which no latent parameter change will move the image—i.e. locally-inaccessible directions. Conversely, large singular values indicate “degrees of freedom” in the image. Relevant result: if I take two different trained generative nets, and find latents for each such that they both output approximately the same image, then they both roughly agree on what directions-in-image-space are local degrees of freedom.
By taking the SVD of the Jacobian of a chunk of the image with respect to the latent, we can figure out which directions-in-latent-space that chunk of image is locally sensitive to. And then, a rough local version of the natural abstraction hypothesis would say that nonadjacent chunks of image should strongly depend on the same small number of directions-in-latent-space, and be “locally independent” (i.e. not highly sensitive to the same directions-in-latent-space) given those few. And that was basically correct.
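For concreteness, here is the skeleton of that computation on a made-up stand-in for a generative net (a random two-layer map from a 4-d latent to a 16-d "image", with a finite-difference Jacobian; the real experiment used trained models, and autodiff would be the sensible tool):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for a trained generative net: 4-d latent -> 16-d "image".
W1 = rng.normal(size=(32, 4))
W2 = rng.normal(size=(16, 32)) / np.sqrt(32)

def generate(z):
    return W2 @ np.tanh(W1 @ z)

def jacobian(f, z, h=1e-6):
    """Finite-difference Jacobian of f at z (a real run would use autodiff)."""
    f0 = f(z)
    cols = []
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = h
        cols.append((f(z + e) - f0) / h)
    return np.stack(cols, axis=1)

z = rng.normal(size=4)
J = jacobian(generate, z)  # shape (16, 4)
U, S, Vt = np.linalg.svd(J)
# Large singular values: local "degrees of freedom" in image space (columns
# of U). Everything outside the span of the first 4 columns is locally
# inaccessible, since the latent is only 4-dimensional.

# Chunk-wise version: which latent directions is one chunk of the image
# locally sensitive to? Take the SVD of that chunk's rows of J.
J_chunk = J[:8]  # first "chunk" of the image
S_chunk = np.linalg.svd(J_chunk, compute_uv=False)
```

The cross-net comparison then checks whether two nets, at latents producing roughly the same image, agree on the large-singular-value directions of `U`, and whether nonadjacent chunks share a few dominant rows of `Vt`.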
To be clear, this was all “rough heuristic testing”, not really testing predictions carefully derived from the natural abstraction framework.
From the existing theory, I still have a hard time seeing what you would be able to get out of this in practice, without adding further structures. That’s not to say I think the idea is doomed, I have some ideas for what I’d attempt to do as further structures (like existing human geometric knowledge), but it doesn’t seem like you plan on using those.
I’m not sure to which extent my confusion is because you still plan to learn new things when experimenting with networks as you apply this pragmascope, vs you being more optimistic about what the existing theory allows you to do. Vs other things I might not realize.
I definitely expect to learn a lot from networks when experimenting.
Can you unroll that?
“Small change in data” = one additional training sample is slightly modified? “Induce” = via an SGD update step on that additional training sample? Why is there a ridge in the loss landscape? What are “the narrowest directions”?
The easiest operationalization starts from the assumption that we train to zero loss. From there, we can calculate the small change in optimal parameter values dθ due to a small change in all the data dx:
$$\left(-\sum_n \frac{df_n}{d\theta} \frac{d^2 L_n}{df_n^2} \frac{df_n}{d\theta}^T\right) d\theta = \sum_n \frac{df_n}{d\theta} \frac{d^2 L_n}{df_n^2} \frac{df_n}{dx_n} dx_n$$
… where:
$f_n(\theta, x_n)$ is the network output on datapoint $n$
$L_n(f_n)$ is the loss on datapoint $n$
(More generally, when calculating $\max_\theta u(\theta, x)$, the change in the optimal $\theta$-value from a small change in $x$ is given by $\frac{d^2u}{d\theta^2} d\theta = -\frac{d^2u}{d\theta\, dx} dx$.)
The “narrowest directions” are the eigenvectors of the loss Hessian with largest eigenvalue (where the loss Hessian is $\sum_n \frac{df_n}{d\theta} \frac{d^2 L_n}{df_n^2} \frac{df_n}{d\theta}^T$, i.e. the matrix on the LHS in the formula above). And there’s a ridge in the loss landscape because, if we’re training to zero loss, then presumably we’re in the overparameterized regime.
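A quick numerical sanity check of that general formula, on a made-up scalar objective $u(\theta, x) = -(\theta - 3x)^2 - \theta^4/4$ (not from the post; chosen so everything fits in a few lines):

```python
# Toy objective: u(theta, x) = -(theta - 3x)^2 - theta^4 / 4.
# The shift in the optimal theta should obey
#   d2u/dtheta2 * dtheta = -d2u/(dtheta dx) * dx.
def dudtheta(theta, x):
    return -2.0 * (theta - 3.0 * x) - theta ** 3

def argmax_theta(x):
    """Find the optimal theta via Newton's method on du/dtheta = 0."""
    theta = 3.0 * x
    for _ in range(50):
        d2 = -2.0 - 3.0 * theta ** 2  # d2u/dtheta2, always negative
        theta -= dudtheta(theta, x) / d2
    return theta

x0, dx = 1.0, 1e-6
theta0 = argmax_theta(x0)

# Finite-difference change in the optimal theta from nudging x...
dtheta_fd = argmax_theta(x0 + dx) - theta0
# ...versus the implicit-function formula, using d2u/dtheta2 = -2 - 3*theta^2
# and d2u/(dtheta dx) = 6 for this particular u:
dtheta_formula = -(6.0 / (-2.0 - 3.0 * theta0 ** 2)) * dx
```

The two quantities agree to many decimal places; the zero-training-loss equation above is the same identity with $u = -\text{loss}$ and the data playing the role of $x$.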
Note: I think what you’re doing there is asking what incremental change in the training data uniquely strengthens the influence of one feature in the network without touching the others.
The “pointiest directions” in parameter space correspond to the biggest features in the orthogonalised feature set of the network.
So I’d agree with the prediction that if you calculate what dθ the dx corresponds to in the second network, you’d indeed often find that it’s close to being an eigenvector/most prominent orthogonalised feature of the second network too. Because we know that neural networks tend to learn similar features when trained on similar tasks.
I think it might be interesting to see whether actually modifying the training data in the dx direction would tend to give you a network where the corresponding feature is more prominent, and how large dx can get before that ceases to hold.
I don’t see why this experiment is good. The Hessian similarity is purely a product of input/output behavior: because both networks get zero loss, their input/output behavior must be very similar, which, combined with general continuous-optimization smoothness, would lead to similar Hessians. I think doing this in a case where the nets get nonzero loss (like ~all real-world scenarios) would be more meaningful, because it would show similarity despite input-output behavior being non-identical and some amount of lossy compression happening.
I like this idea, and I particularly like that it is amenable to empirical studies. If I were going to tackle this I would use some small synthetic datasets which can be:
a) fully known / described. You can decide which portion of the full dataset to be train vs test (vs validation if third split needed)
b) fully learned by toy sized (by today’s standard) models
so, probably like some simple logic puzzles, some limited set of addition or multiplication (e.g. of all three-digit numbers), maybe translation between two simple toy languages?
On these I’d want to try some quite different architectures. Some subset of the following which could be made to work well with the dataset chosen (and then try again with a different dataset and different set of architectures):
multilayer perceptrons of different sizes
convolutional neural nets
transformers
LSTMs
multilayer perceptrons with variational auto-encoders acting as information bottlenecks between layers
Numenta’s super sparse network idea https://github.com/numenta/htmpapers
maybe a Spiking Neural Net implementation like Nengo https://www.nengo.ai/
xgboost
https://github.com/RichardEvans/apperception an explicit logical model learner
etc...
I think if you could trace the same abstraction across three or four of these types, you’d get some valuable insights into the generalizable nature of knowledge.
Along the same lines of thought, there are a lot of interpretability techniques which were developed for image models (e.g. CNNs) which I think would be really interesting if generalized to language models / transformers, and seem logically like they would translate pretty easily.