Thanks for writing this up! Your “amplified weak” version of the conjecture (with complexity bounds increasing exponentially in 1/ε) seems plausible to me. So if you could amplify the original (weak) conjecture to this unconditionally, it wouldn’t significantly decrease my credence in the principle. But it would be nice to have this bound on what the dependence on ε would need to be.
The statements are equivalent if only a tiny fraction (tending to 0) of random reversible circuits satisfy the property. We think this is very likely to be true, since it is a very weak consequence of the conjecture that random reversible circuits (of the appropriate depth) are pseudorandom permutations. If it turned out not to be true, it would no longer make sense to think of the property as an “outrageous coincidence” and so I think we would have to abandon the conjecture. So in short we are happy to consider either version (though I agree that “for which the property is false” is a bit more natural).
The hope is to use the complexity of the statement rather than mathematical taste.
If it takes me 10 bits to specify a computational possibility that ought to happen 1% of the time, then we shouldn’t be surprised to find around 10 (~1% of 2^10 = 1024) occurrences. We don’t intend the no-coincidence principle to claim that these should all happen for a reason.
Instead, we intend the no-coincidence principle to claim that if such coincidences happen much more often than we would have expected them to by chance, then there is a reason for that. Or put another way: if we applied k bits of selection to the statement of a coincidence whose level is much smaller than 2^-k, then there is a reason for it. (Hopefully the “outrageous” qualifier helps to indicate this, although we don’t know whether Gowers meant quite the same thing as us.)
The formalization reflects this distinction: the property is chosen to be so unlikely that we wouldn’t expect it to happen for any circuit at all by chance (even after taking a union bound over all circuits of the given size), not merely that we wouldn’t expect it to happen for a single random circuit. Hence by the informal principle, there ought to be a reason for any occurrence of the property.
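To spell out the arithmetic behind this (a trivial sketch using the same numbers as the example above):

```python
# With k bits of selection there are 2**k candidate statements, so a coincidence
# at the 1% level is expected to occur about 0.01 * 2**k times among them purely
# by chance. The formalization instead targets events so unlikely that even the
# sum over all candidates tends to zero.
k, per_statement_prob = 10, 0.01
expected_occurrences = per_statement_prob * 2**k
print(expected_occurrences)  # ~10 occurrences: unsurprising, hence no "reason" required
```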
For the informal no-coincidence principle, it’s important to us (and to Gowers IIUC) that a “reason” is not necessarily a proof, but could instead be a heuristic argument (in the sense of this post). We agree there are certainly apparently outrageous coincidences that may not be provable, such as Chebyshev’s bias (discussed in the introduction to the post). See also John Conway’s paper On Unsettleable Arithmetical Problems for a nice exposition of the distinction between proofs and heuristic arguments (he uses the word “probvious” for a statement with a convincing heuristic argument).
Correspondingly, our formalization doesn’t bake in any sort of proof system. The verification algorithm only has to correctly distinguish circuits that might satisfy the property from random circuits, using the advice string – it doesn’t necessarily have to interpret the advice string as a proof and verify its correctness.
It’s not, but I can understand your confusion, and I think the two are related. To see the difference, suppose hypothetically that 11% of the first million digits in the decimal expansion of π were 3s. Inductive reasoning would say that we should expect this pattern to continue. The no-coincidence principle, on the other hand, would say that there is a reason (such as a proof or a heuristic argument) for our observation, which may or may not predict that the pattern will continue. But if there were no such reason and yet the pattern continued, then the no-coincidence principle would be false, whereas inductive reasoning would have been successful.
So I think one can view the no-coincidence principle as a way to argue in favor of induction (in the context of formally-defined phenomena): when there is a surprising pattern, the no-coincidence principle says that there is a reason for it, and this reason may predict that the pattern will continue (although we can’t be sure of this until we find the reason). Interestingly, one could also use induction to argue in favor of the no-coincidence principle: we can usually find reasons for apparently outrageous coincidences in mathematics, so perhaps they always exist. But I don’t think they are the same thing.
Good question! We also think that NP ≠ co-NP. The difference between 99% (our conjecture) and 100% (NP = co-NP) is quite important, essentially because 99% of random objects “look random”, but not 100%. For example, consider a uniformly random bit string of length n for some large n. We can quite confidently say things like: the number of 0s in the string is close to n/2; there is no long streak of alternating 0s and 1s; etc. But these only hold with 99% confidence (more precisely, with probability tending to 1), not 100%.
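As a quick empirical illustration (a toy check of my own, not from the original discussion), these “looks random” properties are easy to confirm by sampling:

```python
import random

n = 10**6
bits = [random.getrandbits(1) for _ in range(n)]

zeros = bits.count(0)  # concentrates near n/2, but is not exactly n/2 with certainty
longest_alt, run = 1, 1
for i in range(1, n):
    run = run + 1 if bits[i] != bits[i - 1] else 1
    longest_alt = max(longest_alt, run)
print(zeros / n, longest_alt)  # ~0.5, and the longest alternating streak is O(log n) w.h.p.
```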
Going back to the conjecture statement, the job of the verification algorithm is much harder for 100% than for 99%. For 100%, it has to definitively tell (with the help of a certificate) whether a circuit has the property. Whereas for 99%, it simply has to spot (again with the help of a “certificate” of sorts) any structure at all that reveals the circuit to be non-random in a way that could cause it to have the property. For example, it could start by checking the proportions of different types of gates, and if these differed too much from a random circuit, immediately reject the circuit out of hand for being “possibly structured”. Footnote 6 has another example of structure that could cause a circuit to have the property, which seems much harder for a 100% verifier to deal with.
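As a toy illustration of that kind of cheap first-pass check (the circuit encoding, gate-type frequencies and rejection threshold below are my own assumptions, not part of the conjecture):

```python
from collections import Counter
import math

def possibly_structured(gate_types, expected_freqs, tolerance_sds=5.0):
    """Reject a circuit out of hand as "possibly structured" if the observed count
    of any gate type deviates from its expectation under the random-circuit
    distribution by more than `tolerance_sds` standard deviations."""
    m = len(gate_types)
    counts = Counter(gate_types)
    for gate_type, q in expected_freqs.items():
        sd = math.sqrt(m * q * (1 - q))
        if abs(counts.get(gate_type, 0) - m * q) > tolerance_sds * sd:
            return True
    return False

# Hypothetical usage: a circuit whose three gate types appear in the expected
# equal proportions passes this first check.
print(possibly_structured(["T1", "T2", "T3"] * 1000, {"T1": 1/3, "T2": 1/3, "T3": 1/3}))
```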
Before reversible circuits, we first considered a simpler setting: triangle counting. The no-coincidence principle in that setting turned out to be true, but for a relatively uninteresting reason, because the domain was not rich enough. Nevertheless, I think this result serves as a helpful exercise for people trying to get to grips with our definitions, as well as providing more of the story about how we ended up with our reversible circuits statement.
In the triangle counting setting, we consider the distribution over undirected 3-partite graphs on 3 groups of n vertices obtained by taking each of the possible edges to be present independently with probability p. We take p to be small enough that there are many more edges than vertices, but far fewer than the total number of possible edges.
For a randomly sampled graph from this distribution, one can compute the mean (n^3 p^3) and variance of the number of triangles. So if a graph has a number of triangles exceeding this mean by a sufficiently large number of standard deviations, then we consider this to be an “outrageous coincidence”. (This is “outrageous” because if the number of triangles were normally distributed with this mean and variance, then the probability of exceeding this threshold for any of the possible graphs would tend to zero.) This motivates the following statement, which turns out to be true:
Proposition (No-coincidence principle for triangle counting). For any choice of parameters in the regime above, there is a linear-time verification algorithm that receives as input:
A 3-partite graph on 3 groups of vertices, represented as a list of edges
An advice string
such that:
For all graphs whose triangle count exceeds the threshold described above, if the number of vertices is sufficiently large then there exists an advice string, with length linear in the number of edges of the graph, that the verification algorithm accepts.
For a graph sampled from the distribution above, the probability that there is any advice string that the verification algorithm accepts tends to zero.
(Note that the result would be trivial if we allowed the verification algorithm to run in polynomial time, since it could then directly count the number of triangles, so we need to be more fine-grained than that.)
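Here is a minimal simulation of the setting (parameter values are illustrative choices of mine, not the ones from the write-up). It samples a random 3-partite graph and counts triangles with a matrix product – the kind of direct count a polynomial-time verifier could perform – and compares the result to the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 0.05  # illustrative values only
A = (rng.random((n, n)) < p).astype(int)  # edges between groups 1 and 2
B = (rng.random((n, n)) < p).astype(int)  # edges between groups 2 and 3
C = (rng.random((n, n)) < p).astype(int)  # edges between groups 3 and 1

triangles = np.trace(A @ B @ C)  # one vertex from each group, all three edges present
mean = n**3 * p**3               # each of the n^3 candidate triangles appears w.p. p^3
print(triangles, mean)
```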
Dmitry Vaintrob and Aryan Bhatt were able to generalize this result to polygon counting (note that we consider graphs on k groups of vertices arranged in a cycle, with edges only between adjacent groups, not k-partite graphs in general, so the number of possible edges grows linearly rather than quadratically in the number of groups) and to permanents. So we concluded that neither of these settings seemed to be rich enough to produce an interesting no-coincidence principle (even though permanents are #P-hard to compute), and moved on to considering circuits instead.
Hint for polygon counting:
Write the number of polygons as a cyclic trace tr(A_1 A_2 ⋯ A_k), where A_i is the bipartite adjacency matrix between consecutive groups.
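A quick sanity check of the hint (a sketch under my own choice of parameters): the cyclic trace of the bipartite adjacency matrices between consecutive groups counts exactly the polygons with one vertex in each group.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, p = 5, 30, 0.1  # illustrative values only
A = [(rng.random((n, n)) < p).astype(int) for _ in range(k)]  # A[i]: edges from group i to group i+1

prod = np.eye(n, dtype=int)
for Ai in A:
    prod = prod @ Ai
print(np.trace(prod))  # number of k-gons with one vertex in each of the k groups
```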
It sounds like we are not that far apart here. We’ve been doing some empirical work on toy systems to try to make the leap from mechanistic interpretability “stories” to semi-formal heuristic explanations. The max-of-k draft is an early example of this, and we have more ambitious work in progress along similar lines. I think of this work in a similar way to you: we are not trying to test empirical assumptions (in the way that some empirical work on frontier LLMs is, for example), but rather to learn from the process of putting our ideas into practice.
For those who are interested in the mathematical details, but would like something more accessible than the paper itself, see this talk I gave about the paper.
Thank you – this is probably the best critique of ARC’s research agenda that I have read since we started working on heuristic explanations. This level of thoughtfulness in external feedback is very rare and I’m grateful for the detail and clarity you put into it. I don’t think my response fully rebuts your central concern, but hopefully it gives a sense of my current thinking about it.
It sounds like we are in agreement that something very loosely heuristic explanation-flavored (interpreted so broadly as to include mechanistic interpretability, for example) can reasonably be placed at the root of the diagram, by which I mean that it’s productive to try to explain neural network behaviors in this very loose sense and to attempt to apply such explanations to downstream applications such as MAD/LPE/ELK etc. We begin to diverge, I think, about the extent to which ARC should focus on a narrower conception of heuristic explanations. From least to most specific:
(1) Any version that is primarily mathematical rather than “story-centric”
(2) Some (mathematical) version that is consistent with our information-theoretic intuitions about what constitutes a valid explanation (i.e., in the sense of something like surprise accounting)
(3) Some such version that is loosely based on independence assumptions
(4) Some version that satisfies more specific desiderata for heuristic estimators (such as the ones discussed in the paper linked in (3), or in this more recent paper)
Opinions at ARC will differ, but (1) I feel pretty comfortable defending, (2) I think is quite a promising option to be considering, (3) seems like a reasonable best guess but I don’t think we should be that wedded to it, and (4) I think is probably too specific (and with the benefit of hindsight I think we have focused too much on this in the past). ARC’s research has actually been trending in the “less specific” direction over time, as should hopefully be evident from our most recent write-ups (with the exception of our recent paper on specific desiderata, which mostly covers work done in 2023), and I am quite unsure exactly where we should settle on this axis.
By contrast, my impression is that you would not really defend even (1) (although I am curious exactly where you come down on this axis, if you want to clarify). So I’ll give what I see as the basic case for searching for a mathematical rather than a “story-centric” approach:
Mechanistic interpretability has so far yielded very little in the way of beating baselines at downstream tasks (this has been discussed at length elsewhere, see for example here, here and here), so I think it should still be considered a largely unproven approach (to be clear, this is roughly my view of all alignment approaches that aren’t already in active use at labs, including ARC’s, and I remain excited to see people’s continued valiant attempts; my point is that the bar is low and a portfolio approach is appropriate).
Relying purely on stories clearly doesn’t work at sufficient scale under worst-case assumptions (because the AI will have concepts you don’t have words for), and there isn’t a lot of evidence that this isn’t indeed already a bottleneck in practice (i.e., current AIs may well already have concepts you don’t have words for).
I think that ARC’s worst-case, theoretical approach (described at zoom level 1) is an especially promising alternative to iterative, empirically-driven work. I think empirical approaches are more promising overall, but have correlated failure modes (namely, they could end up relying on correlated empirical contingencies that later turn out to be false), and have far more total effort going into them (arguably disproportionately so). Conditional on taking such an approach, story-centric methods don’t seem super viable (how should one analyze stories theoretically?).
I don’t really buy the argument that because a system has a lot of complexity, it can only be analyzed in ad-hoc ways. It seems to me that an analogous argument would have failed to make good predictions about the bitter lesson (i.e., by arguing that a simple algorithm like SGD should not be capable of producing great complexity in a targeted way). Instead, because neural nets are trained in an incremental, automated way based on mathematical principles, it seems quite possible to me that we can find explanations for them in a similar way (which is not an argument that can be applied to biological brains).
This doesn’t of course defend (2)–(4) (which I would only want to do more weakly in any case). We’ve tried to get our intuitions for those across in our write-ups (as linked in (2)–(4) above), but I’m not sure there’s anything succinct I can add here if those were unconvincing. I agree that this puts us in the rather unfortunate position of sharing a reference class with Stephen Wolfram to many external observers (although hopefully our claims are not quite so overstated).
I think it’s important for ARC to recognize this tension, and to strike the right balance between making our work persuasive to external skeptics on the one hand, and having courage in our convictions on the other hand (I think both have been important virtues in scientific development historically). Concretely, my current best guess is that ARC should:
(a) Avoid being too wedded to intuitive desiderata for heuristic explanations that we can’t directly tie back to specific applications
(b) Search for concrete cases that put our intuitions to the test, so that we can quickly reach a point where either we no longer believe in them, or they are more convincing to others
(c) Also pursue research that is more agnostic to the specific form of explanation, such as work on low probability estimation or other applications
(d) Stay on the lookout for ideas from alternative theoretical approaches (including singular learning theory, sparsity-based approaches, computational mechanics, causal abstractions, and neural net-oriented varieties of agent foundations), although my sense is that object-level intuitions here just differ enough that it’s difficult to collaborate productively. (Separately, I’d argue that proponents of all these alternatives are in a similar predicament, and could generally be doing a better job on analogous versions of (a)–(c).)
I think we have been doing all of (a)–(d) to some extent already, although I imagine you would argue that we have not been going far enough. I’d be interested in more thoughts on how to strike the right balance here.
A bird’s eye view of ARC’s research
The LLM output looks correct to me.
Backdoors as an analogy for deceptive alignment
Yes, I think the most natural way to estimate total surprise in practice would be to use sampling like you suggest. You could try to find the best explanation for “the model does $bad_thing with probability less than 1 in a million” (which you believe based on sampling) and then see how unlikely $bad_thing is according to the resulting explanation. In the Boolean circuit worked example, the final 23-bit explanation is likely still the best explanation for why the model outputs TRUE on at least 99% of inputs, and we can use this explanation to see that the model actually outputs TRUE on all inputs.
Another possible approach is analogous to fine-tuning. You could start by using surprise accounting to find the best explanation for “the loss of the model is L” (where L is estimated during training), which should incentivize rich explanations of the model’s behavior in general. Then to estimate the probability that the model does some rare $bad_thing, you could “fine-tune” your explanation using an objective that encourages it to focus on the relevant tails of the distribution. We have more ideas about estimating the probability of events that are too rare to estimate via sampling, and have been considering objectives other than surprise accounting for this. We plan to share these ideas soon.
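As one concrete way the sampling step could work (a minimal sketch; the “rule of three”-style confidence bound here is my own choice of illustration, not a method from the post):

```python
import math

def prob_upper_bound(num_clean_samples, confidence=0.95):
    """If $bad_thing was never observed among `num_clean_samples` i.i.d. samples,
    return an upper confidence bound on its per-sample probability:
    -ln(1 - confidence) / num_clean_samples (about 3/n at 95% confidence)."""
    return -math.log(1 - confidence) / num_clean_samples

# Roughly 3 million clean samples are needed to support a 1-in-a-million claim at 95%:
print(prob_upper_bound(3_000_000))  # ~1e-6
```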
Yes, that’s a clearer way of putting it in the case of the circuit in the worked example. The reason I said “for no apparent reason” is that there could be some redundancy in the explanation. For example, if you already had an explanation for the output of some subcircuit, you shouldn’t pay additional surprise if you then check the output of that subcircuit in some particular case. But perhaps this was a distracting technicality.
I would say that they are motivated by the same basic idea, but are applied to different problems. The MDL (or the closely-related BIC) is a method for model selection given a dataset, whereas surprise accounting is a method for evaluating heuristic explanations, which don’t necessarily involve model selection.
Take the Boolean circuit worked example: what is the relevant dataset? Perhaps it is the 256 (input, TRUE) pairs. But the MDL would select a much simpler model, namely the circuit that ignores the input and outputs TRUE (or “x_1 OR (NOT x_1)” if it has to consist of AND, OR and NOT gates). On the other hand, a heuristic explanation is not interested in choosing a simpler model, but is instead interested in explaining why the model we have been given behaves in the way it does.
The heuristic explanations in the post do use a single prior over the set of circuits, which we also call a “reference class”. But we wish to allow explanations that use other reference classes, as well as explanations that combine multiple reference classes, and perhaps even explanations that use “subjective” reference classes that do not seem to correspond to any precise prior. These are the sorts of issues explored in the upcoming paper. Ultimately, though, a lot of our heuristic arguments and the surprise accounting for them remain somewhat ambiguous or informal.
Yes, the cost of 1 bit for the OR gate was based on the somewhat arbitrary choice to consider only OR and AND gates. A bit more formally, the heuristic explanations in the post implicitly use a “reference class” of circuits where each binary gate was randomly chosen to be either an OR or an AND, and each input wire to a binary gate was randomly chosen to have a NOT or not. The arbitrariness of this choice of reference class is one obstruction to formalizing heuristic explanations and surprise accounting. We are currently preparing a paper that explores this and related topics, but unfortunately the core issue remains unresolved.
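To make the surprise cost under that reference class concrete (a minimal sketch; the accounting below is my own paraphrase, and the counts are illustrative rather than a reproduction of the post’s worked example):

```python
def surprise_bits(num_gate_types_specified, num_not_choices_specified):
    """Under the reference class described above, each binary gate is OR or AND with
    probability 1/2 and each input wire to a binary gate independently has a NOT or
    not, so pinning down either choice costs exactly log2(2) = 1 bit of surprise."""
    return num_gate_types_specified + num_not_choices_specified

# E.g. asserting that one particular gate is an OR costs 1 bit; asserting that both
# of its input wires are un-negated costs 2 more.
print(surprise_bits(1, 2))  # 3 bits
```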
Formal verification, heuristic explanations and surprise accounting
See the statement from OpenAI in this article:
We’re removing nondisparagement clauses from our standard departure paperwork, and we’re releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We’ll communicate this message to former employees.
They have communicated this to me and I believe I was in the same category as most former employees.
I think the main reasons so few people have mentioned this are:
As I mentioned, there is still some legal ambiguity and additional avenues for retaliation
Some people are taking their time over what they want to say
Most people don’t want to publicly associate themselves with a controversial situation
Most people aren’t inclined to disparage their former employer anyway, and so they may not think of their own situation as that big of a deal
Yes, by “unconditionally” I meant “without an additional assumption”. I don’t currently see why the Reduction-Regularity assumption ought to be true (I may need to think about it more).