ARC’s current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural network, we would hope to be able to use that understanding to verify that the network was not going to behave dangerously in unforeseen situations. ARC is attempting to perform this kind of verification, but using a mathematical kind of “explanation” instead of one written in natural language.
To help elucidate this connection, ARC has been supporting work on Compact Proofs of Model Performance via Mechanistic Interpretability by Jason Gross, Rajashree Agrawal, Lawrence Chan and others, which we were excited to see released along with this post. While we ultimately think that provable guarantees for large neural networks are unworkable as a long-term goal, we think that this work serves as a useful springboard towards alternatives.
In this post, we will:
Summarize ARC’s takeaways from this work and the problems we see with provable guarantees
Explain ARC’s notion of a heuristic explanation and how it is intended to overcome these problems
Describe with the help of a worked example how the quality of a heuristic explanation can be quantified, using a process we have been calling surprise accounting
We are also sharing a draft by Gabriel Wu (currently visiting ARC) describing a heuristic explanation for the same model that appears in the above paper:
Thanks to Stephanie He for help with the diagrams in this post. Thanks to Eric Neyman, Erik Jenner, Gabriel Wu, Holly Mandel, Jason Gross, Mark Xu, and Mike Winer for comments.
Formal verification for neural networks
In Compact Proofs of Model Performance via Mechanistic Interpretability, the authors train a small transformer on an algorithmic task to high accuracy, and then construct several different formal proofs of lower bounds on the network’s accuracy. Without foraying into the details, the most interesting takeaway from ARC’s perspective is the following picture:
In the top right of the plot is the brute-force proof, which simply checks every possible input to the network. This gives the tightest possible bound, but is very long. Meanwhile, in the bottom left is the trivial proof, which simply states that the network is at least 0% accurate. This is very short, but gives the loosest possible bound. In between these two extremes, along the orange Pareto frontier, there are proofs that exploit more structure in the network, leading to tighter bounds for a given proof length, or put another way, shorter proofs for a given bound tightness.
It is exciting to see a clear demonstration that shorter proofs better explain why the neural network has high accuracy, paralleling a common mathematical intuition that shorter proofs offer more insight. One might therefore hope that if we understood the internals of a neural network well enough, then we would be able to provide provable guarantees for very complex behaviors, even when brute-force approaches are infeasible. However, we think that such a hope is not realistic for large neural networks, because the notion of proof is too strict.
The basic problem with provable guarantees is that they must account for every possible way in which different parts of the network interact with one another, even when those interactions are incidental to the network’s behavior. These interactions manifest as error terms, which the proof must provide a worst-case bound for, leading to a looser bound overall. The above picture provides a good demonstration of this: moving towards the left of the plot, the best bound gets looser and looser.
More generally, it is hard to prove a lack of structure – another common mathematical intuition. There are many examples of formal phenomena that appear to arise from a lack of structure, but for which proving this is considered out-of-reach: in mathematics, the normality of π, or the Collatz conjecture, which has natural generalizations that are known to be undecidable; in computer science, the behavior of certain cryptographic hash functions; or in physics, the diffusion of a gas across a room in some formal model.
Heuristic explanations
Since proving a lack of structure is a key obstacle to formal verification of neural networks, ARC has been pursuing an alternative approach. Instead of attempting to prove a lack of structure, we instead assume by default a lack of structure, and produce a best guess given any structure that has been pointed out. We call such an estimate a heuristic explanation (or sometimes a heuristic argument).
Returning to the above examples of unproven statements, in each case there is an informal, probabilistic argument that explains the phenomenon. For example, to estimate the density of 0s in the decimal expansion of π, we treat the digits as uniformly random, giving a best guess of 1⁄10. This is the sort of reasoning we wish to permit in a heuristic explanation.
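As a quick empirical check of this heuristic (our code, not from the post, using the mpmath library to generate digits of π), the density of 0s among roughly the first 1,000 decimal digits is indeed close to the uniform-digit best guess of 1⁄10:

```python
from mpmath import mp, nstr

# Heuristic explanation: treat the digits of pi as uniformly random, so each
# digit should be a 0 with probability 1/10. Check against the actual digits.
mp.dps = 1010                       # work at comfortably more precision than we need
digits = nstr(+mp.pi, 1001)[2:]     # "3." followed by ~1000 decimal digits; drop the "3."
print(len(digits), digits.count("0") / len(digits))   # density of 0s, close to 0.1
```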
Unlike a proof, a heuristic explanation is defeasible, meaning that its conclusion is open to revision once further structure has been pointed out. Our hope is to reach a robust conclusion by having searched thoroughly enough for structure, rather than worst-casing over all possible structure as in a proof. Some ideas about how to potentially formalize this hope are described in the paper Formalizing the presumption of independence.
Informal heuristic explanations are already commonplace in mechanistic interpretability, such as when analyzing circuits. For example, consider the following circuit between an “early curve detector” neuron and a “late curve detector” neuron, consisting of the weights in the 5x5 convolution between the two neurons:
Imagine attempting to use this circuit to provide some sort of formal guarantee about the behavior of the late curve detector, given the behavior of the early curve detector. Analyzing this circuit alone would be insufficient, because there could be confounding interactions with other neurons. For example, we would need to prove that there is no “anti-early curve detector” that cancels out the “early curve detector”. Nevertheless, it is considered reasonable to assume by default that there is no “anti-early curve detector” if no such thing has been pointed out.
To help validate these ideas, ARC has produced a heuristic explanation of the max-of-k model studied in the Compact Proofs paper. The explanation very roughly mirrors the analysis in the “cubic proof” from that paper, but focuses on approximating error terms rather than bounding them. Consequently, the explanation ends up with estimates that are much closer to the true accuracy of the model than the lower bounds given by the proofs:
Surprise accounting
We are interested in proofs and heuristic explanations not only to provide assurances about properties of networks, but also to help explain why those properties hold. We have seen that for proofs, more insight is offered by shorter proofs with tighter bounds. Correspondingly, we can quantify the amount of understanding encapsulated by a heuristic explanation using a method we have been calling surprise accounting.
The intuition behind surprise accounting is to ask: how surprising is the phenomenon (in an information-theoretic sense), now that we have access to the heuristic explanation? The total surprise decomposes into two pieces:
The surprise of the explanation itself: how many times did the explanation just “work out” for no apparent reason? Or put another way, how many free parameters did the explanation have?
The surprise of the phenomenon given the explanation: how likely was the phenomenon to happen by chance, given the information provided by the explanation?
These two pieces are analogous to proof length and bound tightness respectively, but they have the advantage of being measured in the same units, namely bits. The total of the two pieces is also very similar to the Bayesian information criterion (BIC) for probabilistic models, which has a similar decomposition.
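For reference, for a model with k fitted parameters, n data points and maximized likelihood L̂, the BIC is k ln(n) − 2 ln(L̂): the first term penalizes free parameters, much like the surprise of the explanation, while the second measures how poorly the data is accounted for, much like the surprise given the explanation (up to a constant factor and a change of base from nats to bits).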
From the vantage point of mechanistic interpretability, surprise accounting offers an answer to the question, “What counts as an explanation?” For example, when a neural network memorizes facts, is a lookup table a valid explanation for its behavior? From the perspective of surprise accounting, an explanation for the behavior that makes use of how the network generalizes may have lower total surprise. But if there is no explanation with lower total surprise, then the lookup table explanation is good enough.
To make this idea more concrete, we will go through a worked example, making use of the following basic formula:
Total surprise = surprise of explanation + surprise given explanation
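As a minimal sketch (ours, not from the post), the formula can be computed as follows, under the assumption that the explanation assigns each of n independent cases a probability p of coming out the way it was observed to:

```python
import math

def total_surprise(explanation_bits: float, n_cases: int, p_per_case: float) -> float:
    """Surprise of the explanation itself, plus the surprise of the phenomenon
    given the explanation, assuming the explanation assigns each of n_cases
    independent cases a probability p_per_case of coming out as observed."""
    surprise_given_explanation = n_cases * math.log2(1 / p_per_case)
    return explanation_bits + surprise_given_explanation

# With no explanation at all, each of the 2**8 = 256 outputs of the circuit in
# the worked example below is assigned probability 1/2:
print(total_surprise(0, 2**8, 0.5))  # 256.0
```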
Worked example: Boolean circuit
For our worked example of surprise accounting, we will use the following Boolean circuit consisting of AND, OR and NOT gates:
The circuit has a tree structure, and reuses the same 8 inputs in both the top and bottom halves. But other than this basic structure, it looks pretty random. However, if we start trying a few different inputs, we notice something surprising: the network seems to always output TRUE. Why is this?
Brute-force
Without any explanation at all, there are 2⁸ = 256 possible inputs, and the network could have output TRUE or FALSE for each of these. So the surprise of the (empty) explanation is 0 bits, and the surprise given the explanation is 256 bits. This is exactly the number of cases that the brute-force proof would need to check.
No explanation: 0 + 256 = 256 bits
Final OR gate
A very simple explanation we could give is to notice that the final gate is an OR rather than an AND gate. Since an OR gate outputs TRUE ¾ of the time on random inputs, it is significantly less surprising that the network always outputs TRUE.
Quantitatively, the surprise of the explanation is 1 bit (for noticing that the final gate was an OR rather than an AND), and the surprise given the explanation is 256 log2(1/0.75) bits.
Final OR gate explanation: 1 + 256 log2(1/0.75) ≈ 107.25 bits
This is a substantial improvement over having no explanation at all. Note, however, that this explanation is not a proof.
Pattern of NOT gates
We can improve upon this by noticing patterns in the structure of the network. One such pattern is how the NOT gates are arranged. For each input to a gate in the top half of the network, there is a corresponding input to a gate in the bottom half of the network. For the leftmost layer, the presence of a NOT gate is always different, and for every subsequent layer, the presence of a NOT gate is always the same:
This means each gate input is anticorrelated with the corresponding input in the other half of the network: for the leftmost layer, the correlation is −1; for the next layer, the correlation is −0.5; for the layer after that, −0.25; and the two inputs to the final OR gate have a correlation of −0.125.[1] Implicitly, these correlations assume that the AND and OR gates were chosen independently at random, since we have not yet noticed any particular structure to them.
The surprise of this explanation is 15 bits for the 15 pairs of inputs (7 same, 8 different), plus the 1 bit carried over from the previous explanation for the final OR gate, for a total of 16 bits. Given the explanation, the probability that the final gate outputs TRUE on a random input has increased to 0.75 + 0.25 × 0.125 = 0.78125 because of the −0.125 correlation, so the surprise given the explanation is now only 256 log2(1/0.78125) bits.
Final OR plus pattern of NOT gates explanation: 16 + 256 log2(1/0.78125) ≈ 107.17 bits
This is only a fraction of a bit better than the previous explanation, but sets us up for a more substantial improvement.
Pattern of AND and OR gates
If we look at the network again, we see even more structure: every time there is an AND gate in the top half of the network, there is an OR gate in the bottom half of the network, and vice versa:
Hence each gate input is actually perfectly anticorrelated with the corresponding input in the other half of the network. Put another way, the bottom half of the network is exactly the negation of the top half of the network, by De Morgan’s laws. The OR of anything and its negation is always TRUE, so this completely explains why the network always outputs TRUE.
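Spelled out, De Morgan's laws say that NOT(a AND b) = (NOT a) OR (NOT b) and NOT(a OR b) = (NOT a) AND (NOT b). Applying them repeatedly, swapping every AND with an OR and flipping the NOT gates on the leftmost layer only turns the top half T into its exact negation NOT T, so the final gate computes T OR (NOT T), which is TRUE on every input.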
The surprise of this explanation is 16 bits carried over from the previous explanation, plus 7 bits for the 7 pairs of AND and OR gates, for a total of 23 bits. The surprise given the explanation is now 0 bits.
Pattern of all gates explanation: 23 + 0 = 23 bits
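For ease of checking, here is a short snippet (ours) that recomputes the four totals above under the same assumptions:

```python
import math

def bits(p):
    """Surprise, in bits, of an event assigned probability p."""
    return math.log2(1 / p)

print(0 + 256 * bits(0.5))        # no explanation: 256.0
print(1 + 256 * bits(0.75))       # final OR gate: ~107.25
print(16 + 256 * bits(0.78125))   # plus the pattern of NOT gates: ~107.17
print(23 + 256 * bits(1.0))       # pattern of all gates: 23.0
```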
A natural question is whether there is further hidden structure that permits an explanation with even lower total surprise. From what I have said so far, perhaps there could be. But it turns out that I constructed the network as follows: first I chose the top half of the network by randomly choosing AND or OR for the 7 gates and randomly choosing either a NOT gate or no NOT gate for the 14 inputs; then I copied and negated this random network using De Morgan’s laws; and finally I connected the top and bottom halves by an OR gate. So there really is no additional structure that we could exploit, unless my construction was accidentally redundant in some way or a pattern arose by fluke.
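To make this construction concrete, here is a rough Python sketch (ours, not code from the post): it builds a random top half, negates it via De Morgan's laws by swapping AND/OR everywhere and flipping the NOT flags on the leftmost layer, joins the two halves with an OR gate, and then runs the brute-force check over all 256 inputs. The flat-list representation of the gates and NOT flags is a choice made here for brevity.

```python
import itertools
import random

def random_half():
    """A random top half: 7 AND/OR gates (4 + 2 + 1) and a NOT-or-no-NOT
    flag on each of its 14 gate inputs (8 + 4 + 2)."""
    gates = [random.choice(["AND", "OR"]) for _ in range(7)]
    nots = [random.choice([False, True]) for _ in range(14)]
    return gates, nots

def negate_half(gates, nots):
    """De Morgan negation: swap every AND and OR, and flip the NOT flags
    on the 8 leftmost inputs only."""
    swapped = ["OR" if g == "AND" else "AND" for g in gates]
    flipped = [not n for n in nots[:8]] + nots[8:]
    return swapped, flipped

def eval_half(gates, nots, x):
    """Evaluate one half of the circuit (a binary tree over the 8 inputs x)."""
    gate = lambda g, a, b: (a and b) if g == "AND" else (a or b)
    w = [x[i] != nots[i] for i in range(8)]              # leftmost layer, NOTs applied
    layer1 = [gate(gates[i], w[2 * i], w[2 * i + 1]) for i in range(4)]
    w = [layer1[i] != nots[8 + i] for i in range(4)]
    layer2 = [gate(gates[4 + i], w[2 * i], w[2 * i + 1]) for i in range(2)]
    w = [layer2[i] != nots[12 + i] for i in range(2)]
    return gate(gates[6], w[0], w[1])

top = random_half()
bottom = negate_half(*top)

# Brute force: the OR of the two halves is TRUE on all 2**8 = 256 inputs.
assert all(eval_half(*top, x) or eval_half(*bottom, x)
           for x in itertools.product([False, True], repeat=8))
print("Outputs TRUE on all 256 inputs.")
```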
More generally, we can compare the total surprise of an explanation with the amount of optimization applied to produce the phenomenon. If the two match, then there is no reason to expect there to be a better explanation.
Note that in this case, the best explanation corresponded to a proof, but this need not be the case in general. For example, if the circuit implemented a cryptographic hash function, we would not expect to be able to do better than to treat the output of the hash function as random.
Discussion
ARC is currently working on designing algorithms for finding and making use of heuristic explanations for neural networks. At a high level, one way this could work is as follows:
Set up a “generic” heuristic explanation with free parameters. In the worked example above, the “same” or “different” arrows are free parameters of the explanation. For a realistic neural network, free parameters might appear in an activation model such as a variational autoencoder (perhaps in a similar fashion to how sparse autoencoders are currently used in mechanistic interpretability).
Optimize the parameters of this explanation using gradient descent to minimize total surprise, as measured using surprise accounting. We expect to have to do this in parallel with model training in order to account for how the model itself was optimized (including any possible backdoors).
Use the explanation to heuristically verify desired properties of the network, or for other downstream applications such as mechanistic anomaly detection or eliciting latent knowledge.
We have omitted many details about how and why such an approach might work, and plan to come back to some of these in future posts.
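Purely as a loose toy analogy (our construction, not ARC's algorithm, and glossing over everything that makes the problem hard), the following sketch shows the shape of the optimization step above: parameterize a simple "explanation" of sampled behavior and run gradient descent on a total-surprise objective. The independent-Bernoulli feature model and the BIC-style per-parameter cost are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "the model's behavior on sampled inputs": N samples of d
# binary features (e.g. whether various activations fire). In reality these
# would come from running the network; here they are synthetic.
N, d = 1000, 8
feature_probs = rng.uniform(0.05, 0.95, size=d)
data = (rng.uniform(size=(N, d)) < feature_probs).astype(float)

# A "generic explanation with free parameters": an independent-Bernoulli model
# of the features, parameterized by logits so that gradient descent applies.
logits = np.zeros(d)

def total_surprise(logits):
    p = 1.0 / (1.0 + np.exp(-logits))
    # Surprise of the explanation: a BIC-style cost of (1/2) log2(N) bits per
    # fitted parameter (an assumption; the post does not specify this).
    explanation_bits = 0.5 * d * np.log2(N)
    # Surprise given the explanation: negative log2-likelihood of the data.
    data_bits = -np.sum(data * np.log2(p) + (1 - data) * np.log2(1 - p))
    return explanation_bits + data_bits

print("before:", total_surprise(logits))
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-logits))
    grad = np.sum(p - data, axis=0) / np.log(2)   # d(data_bits) / d(logits)
    logits -= lr * grad / N
print("after: ", total_surprise(logits))
```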
Conclusion
Heuristic explanations are a natural extension of provable guarantees that we believe have much better potential to scale to large, sophisticated neural networks. Surprise accounting offers a way to quantify the quality of a heuristic explanation, potentially enabling useful explanations to be found automatically.
[1] Roughly speaking, the reason that the anti-correlation halves with each layer is that if the subsequent AND or OR gates are different, then the anti-correlation is maintained, whereas if they are the same, then the anti-correlation is eliminated. Making this argument more precise is left as an exercise to the reader.
I’m not sure I fully understand how you would use surprise accounting in a practical scenario. In the circuit example, you’re starting out with a statement that is known to be true (by checking all inputs) -- that the circuit always outputs 1. But in most practical scenarios, you have a hypothesis (e.g., “the model will never do $bad_thing under any input”) that you’re not sure is true. So where do you get the statement that you’d apply the surprise accounting to? I guess you could start out with “the model didn’t do $bad_thing on 1 million randomly sampled inputs”, try to find an explanation of this finding with low total surprise, and then extend the explanation into a proof that applies to all inputs. Is that right?
Yes, I think the most natural way to estimate total surprise in practice would be to use sampling like you suggest. You could try to find the best explanation for “the model does $bad_thing with probability less than 1 in a million” (which you believe based on sampling) and then see how unlikely $bad_thing is according to the resulting explanation. In the Boolean circuit worked example, the final 23-bit explanation is likely still the best explanation for why the model outputs TRUE on at least 99% of inputs, and we can use this explanation to see that the model actually outputs TRUE on all inputs.
Another possible approach is analogous to fine-tuning. You could start by using surprise accounting to find the best explanation for “the loss of the model is L” (where L is estimated during training), which should incentivize rich explanations of the model’s behavior in general. Then to estimate the probability that model does some rare $bad_thing, you could “fine-tune” your explanation using an objective that encourages it to focus on the relevant tails of the distribution. We have more ideas about estimating the probability of events that are too rare to estimate via sampling, and have been considering objectives other than surprise accounting for this. We plan to share these ideas soon.
Cool, fine-tuning sounds a bit like conditional Kolmogorov complexity: the cost of your explanation would be K(explanation of rare thing | explanation of the loss value and general functionality)
From the examples later in your post, it seems like it might be clearer to say something more like “how many things need to hold about the circuit for the explanation to describe the circuit”? More precisely, I’m objecting to your “how many times” because it could plausibly mean “on how many inputs” which I don’t think is what you mean, and I’m objecting to your “for no apparent reason” because I don’t see what it would mean for an explanation to hold for a reason in this case.
Yes, that’s a clearer way of putting it in the case of the circuit in the worked example. The reason I said “for no apparent reason” is that there could be some redundancy in the explanation. For example, if you already had an explanation for the output of some subcircuit, you shouldn’t pay additional surprise if you then check the output of that subcircuit in some particular case. But perhaps this was a distracting technicality.
Why does the OR-gate cost only 1 bit?
One argument I can see is that for any binary gate you only consider OR and AND gates. If we make the (semi-reasonable) assumption that these are equally likely, then an OR gate costs 1 bit?
However, you also have to specify which gate is being described, which takes more bits. You'd need ~log|gates| worth of bits to describe an arbitrary gate.
Yes, the cost of 1 bit for the OR gate was based on the somewhat arbitrary choice to consider only OR and AND gates. A bit more formally, the heuristic explanations in the post implicitly use a “reference class” of circuits where each binary gate was randomly chosen to be either an OR or an AND, and each input wire to a binary gate was randomly chosen to have a NOT or not. The arbitrariness of this choice of reference class is one obstruction to formalizing heuristic explanations and surprise accounting. We are currently preparing a paper that explores this and related topics, but unfortunately the core issue remains unresolved.
Thanks. I’m looking forward to your paper!
The ‘surprise accounting’ framework sounds quite a lot like the Minimum Description Length principle (e.g. here). Do you have any takes on how surprise accounting compares and differs vis a vis MDL?
Do I understand correctly that the main issue is finding ~ a canonical prior on the set of circuits?
I would say that they are motivated by the same basic idea, but are applied to different problems. The MDL (or the closely-related BIC) is a method for model selection given a dataset, whereas surprise accounting is a method for evaluating heuristic explanations, which don’t necessarily involve model selection.
Take the Boolean circuit worked example: what is the relevant dataset? Perhaps it is the 256 (input, TRUE) pairs. But the MDL would select a much simpler model, namely the circuit that ignores the input and outputs TRUE (or "x_1 OR (NOT x_1)" if it has to consist of AND, OR and NOT gates). On the other hand, a heuristic explanation is not interested in choosing a simpler model, but is instead interested in explaining why the model we have been given behaves in the way it does.
The heuristic explanations in the post do use a single prior over the set of circuits, which we also call a "reference class". But we wish to allow explanations that use other reference classes, as well as explanations that combine multiple reference classes, and perhaps even explanations that use "subjective" reference classes that do not seem to correspond to any precise prior. These are the sorts of issues explored in the upcoming paper. Ultimately, though, a lot of our heuristic arguments and the surprise accounting for them remain somewhat ambiguous or informal.
Possibilities I see:
Maybe the cost can be amortized over the whole circuit? Use one bit per circuit to say “this is just and/or” vs “use all gates”.
This is an illustrative simplified example, in a more general scheme, you need to specify a coding scheme, which is equivalent to specifying a prior over possible things you might see.