What application do you have in mind? If you’re trying to reason about formal models without trying to completely rigorously prove things about them, then I think thinking of neural networks as stochastic systems is the way to go. Namely, you view training as a stochastic optimization problem that produces a weight-valued random variable, and then condition that variable on whatever knowledge about the weights/activations you assume is available. This can be done both in the Bayesian “thermostatic” sense, as a model of idealized networks, and in the sense of modeling NNs as SGD-like systems. Both approaches are explored explicitly (and give different results) in suitable high-width limits by the PDLT and Tensor Programs paradigms (the latter also looks at “true SGD” with non-negligible step size).
Here you should be careful about what you condition on: conditioning on exact knowledge of too much input-output behavior of course blows stuff up, so you should think of a way of coarse-graining, i.e. “choose a precision scale” :). My first go-to here would be to assume the tempered Boltzmann distribution on the loss, at an appropriate choice of temperature for what you’re studying.
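To make “tempered Boltzmann distribution on the loss” concrete, here is a minimal sketch (my own toy example, not taken from any of the papers mentioned): unadjusted Langevin sampling from p(w) ∝ exp(−L(w)/T) for a one-dimensional quadratic loss. The temperature T plays the role of the precision scale: lower T concentrates the weight distribution more sharply around the loss minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(w):
    # Toy quadratic loss L(w) = w**2 / 2, so the gradient is just w.
    return w

def sgld_sample(T, eps=0.01, burn=2000, n=50000):
    """Sample from the tempered Boltzmann density p(w) ~ exp(-L(w)/T)
    via the unadjusted Langevin update: a gradient step plus Gaussian
    noise whose scale is set by the temperature T."""
    w = 0.0
    out = np.empty(n)
    for step in range(burn + n):
        w += -eps * grad_loss(w) + np.sqrt(2 * eps * T) * rng.normal()
        if step >= burn:
            out[step - burn] = w
    return out

# Lower temperature concentrates the weights more tightly around the minimum:
# for this loss the stationary variance is approximately T.
print(sgld_sample(1.0).var())
print(sgld_sample(0.1).var())
```

At T = 1, with L the negative log-posterior, this is just Langevin sampling of the Bayesian posterior; lowering T interpolates toward a point estimate, which is one way to operationalize “choose a precision scale”.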
If you’re trying to do experiments, then I would suspect that a lot of the time you can just blindly throw whatever ML-ish tools you’d use in an underdetermined, “true inference” context and they’ll just work (with suitable choices of hyperparameters).
Statistical localization in disordered systems, and dreaming of more realistic interpretability endpoints
[epistemic status: half fever dream, half something I think is an important point to get across. Note that the physics I discuss is not my field though close to my interests. I have not carefully engaged with it or read the relevant papers—I am likely to be wrong about the statements made and the language used.]
A frequent discussion I get into in the context of AI is “what is an endpoint for interpretability”. I get into this argument from two sides:
arguing with interpretability purists, who say that the only way to get robust safety from interpretability is to mathematically prove that behaviors are safe and/or no deception is going on.
arguing with interpretability skeptics, who make the same assumption, that robust safety from interpretability would require proving that behaviors are safe and/or no deception is going on, and conclude that interpretability is therefore a dead end.
My typical response to this is that no, you’re being silly: imagine discussing any other phenomenon in this way. “The only way to show that the sun will rise tomorrow is to completely model the sun on the level of subatomic particles and prove that they will not spontaneously explode.” Or imagine asking a bridge safety expert to model every single particle and provably lower-bound the probability of the bridge losing structural coherence in some way not captured by bulk models.
But there’s a more fundamental intuition here, one that I started developing when I began trying to learn statistical physics. There are a few lossy ways of expressing it. One is to talk about renormalization: the renormalizability of suitable systems is a “theorem” of statistical mechanics, but it is not (and probably never will be) proven mathematically; in some sense, it feels much more like a “truly new flavor of axiom” than even complexity-theoretic conjectures like P vs. NP. But that’s still not quite it. There is a more general intuition, hard to get across (in particular for someone who, like me, is only a dabbler in the subject): some genuinely, incredibly complex and information-laden systems have “strong locality” properties, which are (insofar as the physical meaning of the word “provable” holds) both provable and very robust to changing and expanding the context.
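For readers who haven’t seen renormalization “work” as advertised, here is a tiny standard textbook instance of it (nothing specific to the discussion above): decimating every other spin of a 1D Ising chain provably yields the same model with a renormalized coupling β′ = ½ log cosh 2β, and brute-force enumeration confirms that correlations between the surviving spins are exactly preserved.

```python
import numpy as np
from itertools import product

def corr(beta, i, j, N):
    """Brute-force <s_i s_j> for an open 1D Ising chain of N spins."""
    Z = num = 0.0
    for s in product([-1, 1], repeat=N):
        w = np.exp(beta * sum(s[k] * s[k + 1] for k in range(N - 1)))
        Z += w
        num += w * s[i] * s[j]
    return num / Z

beta = 0.7
# Summing out a middle spin: 2*cosh(beta*(s1 + s3)) = const * exp(beta_p * s1 * s3),
# which gives the renormalized coupling below.
beta_p = 0.5 * np.log(np.cosh(2 * beta))

print(corr(beta, 0, 4, N=9))    # spins 4 apart in the fine chain
print(corr(beta_p, 0, 2, N=5))  # spins 2 apart in the coarse chain: same number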
For a while, I thought that this is just a vibe—a way to guide thinking, but not something that can be operationalized in a way that may significantly convince people without a similar intuition.
However, recently I’ve become more hopeful that an “explicitly formalizable” notion of robust interpretability may fall out of this language in a somewhat natural way.
This is closely related to recent discussions and writeups we’ve been doing with Lauren Greenspan on scale and renormalization in (statistical) QFT and connections to ML.
One direction to operationalize this is through the notion of “localization” in statistical physics, and in particular “Anderson localization”. The idea (if I understand it correctly) is that in certain disordered systems (think of a semiconductor: an “ordered” metal with a disordered system of “impurity atoms” sprinkled inside), you can prove a kind of screening property: from the point of view of the local dynamics near a particular spin, you can provably ignore spins far away from the point you’re studying (or rather, replace them by an “ordered” field that modifies the local dynamics in a fully controllable way). This idea of local interactions being “screened” from far-away details is ubiquitous. In a very large and very robust class of systems, interactions are purely local, except for mediation by a small number of hierarchical “smooth” couplings that see only high-level summary statistics of the “non-local” spins and treat them as a background; moreover, these locality properties are provable (insofar as we assume the extra “axioms” of thermodynamics), given some (once again, hierarchical and robustly adjustable) independence assumptions. There are a number of related principles here that (if I understand correctly) get used in similar contexts, sometimes interchangeably: one I liked is “local perturbations perturb locally” (“LPPL”) from this paper.
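The basic phenomenon is easy to see numerically. A minimal sketch (the standard 1D tight-binding Anderson model, with toy parameters of my choosing): without disorder the eigenstates of a chain are extended plane waves; turning on random on-site energies localizes them, which shows up as a jump in the inverse participation ratio (IPR ≈ 1/N for extended states, O(1) for localized ones).

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_ipr(N=400, W=0.0):
    """Mean inverse participation ratio sum_i |psi_i|**4 over all eigenstates
    of a 1D chain with nearest-neighbour hopping and disorder strength W."""
    H = np.diag(rng.uniform(-W / 2, W / 2, N))                     # random on-site energies
    H += np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)  # hopping terms
    _, vecs = np.linalg.eigh(H)
    return np.sum(vecs ** 4, axis=0).mean()

print(mean_ipr(W=0.0))  # clean chain: ~ 1/N (extended states)
print(mean_ipr(W=3.0))  # disordered chain: much larger (localized states)
```

Localized eigenstates are the spectral face of the screening property: a local perturbation to such a system only appreciably affects amplitudes within a localization length of it.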
Note that in the above paragraph I did something I generally disapprove of: I am trying to extract and verbalize “vibes” from science that I don’t understand on a concrete level, and I am almost certainly getting a bunch of things wrong. But I don’t know of another way of gesturing in a “look, there’s something here and it’s worth looking into” way without doing this to some extent.
Now AI systems, just like semiconductors, are statistical systems with a lot of disorder. In particular, in a standard operationalization (as e.g. in PDLT), we can conceptualize neural nets as a field theory. There is a “vacuum theory” that depends only on the architecture, and adding new datapoints corresponds to adding particles. PDLT only studies a certain perturbative picture here, but it seems plausible that these techniques may extend to non-perturbative scales (and hope for this is a big part of the reason that Lauren and I have been thinking and writing about renormalization). In a “dream” version of such an extension, the datapoints would form a kind of disordered system, with ordered components, hierarchical relationships, and some assumption of inherent randomness outside of those relationships. A great aspect of “numerical” QFT, as applied in condensed matter models, is that you don’t need a really great model of the hierarchical relationships: sometimes you can just play around and turn on a handful of extra parameters until you find something that works. (Again, at the moment this is an imprecise interpretation of things I have not deeply engaged with.)
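One concrete, uncontroversial piece of the “vacuum theory depends only on the architecture” statement can be checked in a few lines: at random initialization, the output covariance (two-point function) of a wide one-hidden-layer ReLU net converges to an architecture-determined kernel, independent of the particular draw of weights. A minimal sketch (my own toy setup, with standard 1/sqrt(width) output scaling):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_kernel(width, n_nets=2000):
    """Empirical 2x2 output covariance over random inits of a one-hidden-layer
    ReLU net on two orthonormal inputs: the 'vacuum' two-point function."""
    x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    outs = np.empty((n_nets, 2))
    for k in range(n_nets):
        W1 = rng.normal(size=(width, 2))              # hidden-layer weights
        w2 = rng.normal(size=width) / np.sqrt(width)  # 1/sqrt(width) output scaling
        outs[k] = [w2 @ np.maximum(W1 @ x1, 0.0), w2 @ np.maximum(W1 @ x2, 0.0)]
    return outs.T @ outs / n_nets  # outputs have mean ~0, so this is the covariance

print(empirical_kernel(64))
print(empirical_kernel(512))  # wider net: same kernel, smaller finite-width noise
```

Both matrices approach the same arc-cosine kernel (diagonal 1/2, off-diagonal 1/(2π) for orthonormal inputs); in the PDLT picture, the finite-width corrections to this free “vacuum” theory are exactly where the interesting interactions live.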
Of course doing this makes some assumptions, but the assumptions are on the level of the data (i.e., particles), not the weights/model internals (i.e., fields, the place where we are worried about misalignment, etc.). And if you grant these assumptions and write down a “localization theorem” result, then plausibly the kind of statement you will get is something along the following lines:
More generally, the kind of information this picture would give is a kind of “local, provably robust interpretability”, where the text-completion behavior of a model is provably (under suitable “disordered system” assumptions) reducible to a collection of local circuits that depend on understandable phenomena at a few different scales. A guiding “complexity intuition” for me here is provided by the nontrivial but tractable grammar-task diagrams in the paper by Marks et al. (See pages 25-27, and note that the shape of these diagrams is typical of the shape of the non-renormalized interaction diagrams you see before applying renormalization to simplify a statistical system.)
An important caveat here is that in physical models of this type (and in pictures that include renormalization more generally), one does not make, or assume, any “fundamentality” assumptions. In many cases a number of alternative (but equivalent, once the “screening” is factored in) pictures exist, with various levels of granularity, elegance, etc. This can already be seen in the 2D Ising model (a simple magnet model), where the same behaviors can be understood either in a combinatorial “spin-to-spin interaction” way, which mirrors the “fundamental interpretability” desires of mechinterp, or through a “recursive screening out” picture that is more renormalization-flavored; the results are the same (to a very high level of precision), even when looking at very localized effects involving collections of a few spins. So the question of whether an interpretation is “fundamental” or uses the “right latents” is to a large extent obviated here; the world of thermodynamics is much more anarchical and democratic than the world of mathematical formalism and “elegant proof”, at least in this context.
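The “alternative but equivalent pictures” point is easy to make concrete in the even simpler 1D Ising chain (a toy stand-in for the 2D case, which needs more machinery): the “fundamental” combinatorial sum over all spin configurations and the transfer-matrix computation, which absorbs (“screens out”) one spin at a time, give identical correlation functions.

```python
import numpy as np
from itertools import product

beta, N = 0.5, 10

# Picture 1: "fundamental" sum over all 2^N spin configurations.
def corr_brute(i, j):
    Z = num = 0.0
    for s in product([-1, 1], repeat=N):
        w = np.exp(beta * sum(s[k] * s[k + 1] for k in range(N - 1)))
        Z += w
        num += w * s[i] * s[j]
    return num / Z

# Picture 2: transfer matrix, absorbing one spin at a time into a 2x2 matrix.
T = np.array([[np.exp(beta), np.exp(-beta)],
              [np.exp(-beta), np.exp(beta)]])
Sz = np.diag([1.0, -1.0])
u = np.ones(2)

def corr_tm(i, j):
    left = u @ np.linalg.matrix_power(T, i) @ Sz
    mid = np.linalg.matrix_power(T, j - i)
    right = Sz @ np.linalg.matrix_power(T, N - 1 - j) @ u
    Z = u @ np.linalg.matrix_power(T, N - 1) @ u
    return left @ mid @ right / Z

print(corr_brute(2, 6), corr_tm(2, 6))  # identical: both equal tanh(beta)**4
```

Neither picture is more “fundamental” than the other; they are different bookkeeping for the same physical content, which is the situation I would hope for with competing interpretations of a network.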
Having handwavily described a putative model, I want to quickly say that I don’t actually believe in this model. There are a bunch of things I probably got wrong, there are a bunch of other, better tools to use, and so on. But the point is not the model: it’s that this kind of stuff exists. There exist languages that show that arbitrarily complex, arbitrarily expressive behaviors are provably reducible to local interactions, where behaviors can be understood as clusters of hierarchical interactions that treat all but a few parts of the system at every point as “screened out noise”.
I think that if models like this are possible, then a solution to “the interpretability component of safety” is possible in this framework. If you have provably localized behaviors, then for example you have a good idea of where to look for deception: deception cannot occur on the level of very low-level local interactions, as they are too simple to express the necessary reasoning, and perhaps it can be carefully operationalized and tracked in the higher-level interactions.
As you’ve no doubt noticed, this whole picture is splotchy and vague. It may be completely wrong. But there also may be something in this direction that works. I’m hoping to think more about this, and very interested in hearing people’s criticisms and thoughts.