I think this is a really good and well-thought-out explanation of the agenda.
I do still think that it’s missing a big piece: namely, in your diagram, the lowest-tier dot (heuristic explanations) is carrying a lot of weight, and needs more support and better messaging. Specifically, my understanding, having read this and interacted with ARC’s agenda, is that “heuristic arguments” as a direction is highly useful. But while it seems to me that the placement of heuristic arguments at the root of this ambitious diagram is core to the agenda, I haven’t been convinced that this placement is supported by any results beyond somewhat vague associative arguments.
As an extreme example of this, Stephen Wolfram believes he has a collection of ideas, building on some thinking about cellular automata, that will describe all of physics. He can write down all kinds of causal diagrams with this node at the root, leading to great strides in our understanding of science and the cosmos and so on. But ultimately, such a diagram would be making the statement that “there exists a productive way to build a theory of everything which is based on cellular automata in a particular way similar to how he thinks about this theory”. Note that this is different from saying that cellular automata are interesting, or even that a better theory of cellular automata would be useful for physics, and it requires a lot more motivation and scientific falsification to justify.
The idea of heuristic arguments is, at its core, a way of generalizing the notion of independence in statistical systems and models of statistical systems. It’s discussing a way to point at a part of the system and say “we are treating this as noise” or “we are treating these two parts as statistically independent”, or “we are treating these components of the system as independently as we can, given the following set of observations about our system” (with a lot of the theory of HA asking how to make the last of these statements explicit/computable). I think this is a productive class of questions to think about, both theoretically and empirically. It’s related to a lot of other research in the field (on causality, independence and so on). I conceptually vibe with ARC’s approach from what I’ve seen of the org. (Modulo the corrigible fact that I think there should be a lot more empirical work on what kinds of heuristic arguments work in practice. For example, what’s the right independence assumption on components of an image classifier/generator NN that notices/generates the kind of textural randomness seen in a cat’s fur? So far there is no HA guess about this question, and I think there should be at least some ideas on this level for the field to have a healthy amount of empiricism.)
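To make the second and third kinds of statement concrete, here is a minimal toy sketch. It is not taken from ARC's work, and every function name in it is hypothetical. It estimates the mean of a product first by treating the two factors as independent, and then by treating them "as independently as we can" given one extra observation, namely their covariance.

```python
import random

def mean(values):
    return sum(values) / len(values)

def independence_estimate(xs, ys):
    """Guess E[X*Y] as E[X]*E[Y]: treat the two parts as independent."""
    return mean(xs) * mean(ys)

def covariance_corrected_estimate(xs, ys):
    """Revise the independence guess using one observation: Cov(X, Y)."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return mx * my + cov

def empirical_mean_product(xs, ys):
    """The quantity both estimates are trying to predict."""
    return mean([x * y for x, y in zip(xs, ys)])

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0, 1) for _ in range(10_000)]
    # ys is strongly correlated with xs, so the independence guess is poor.
    ys = [x + 0.5 * rng.gauss(0, 1) for x in xs]
    print("independence guess: ", independence_estimate(xs, ys))
    print("covariance-corrected:", covariance_corrected_estimate(xs, ys))
    print("empirical value:     ", empirical_mean_product(xs, ys))
```

In this toy case a single extra observation closes the gap entirely, because the mean of a product equals the product of the means plus the covariance. The open question raised above is what the analogous observations and default assumptions should be for the components of a real network.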
I think that what ARC is doing is useful and productive. However, I don’t see strong evidence that this particular kind of analysis is a principled thing to put at the root of a diagram of this shape. The statement that we should think about and understand independence is a priori not the same as the idea that we should have a more principled way of deciding when one interpretation of a neural net is more correct than another, which is also separate from (though plausibly related to) the (I think also good) idea in MAD/ELK that it might be useful to flag NNs that are behaving “unusually” without having a complete story of the unusual behavior.
I think there’s an issue with building such a big structure on top of an undefended assumption, which is that it creates some immiscibility (i.e., difficulty of mixing) with other ideas in interpretability, which are “story-centric”. The phenomena that happen in neural nets (same as phenomena in brains, same as phenomena in realistic physical systems) are probably special: they depend on some particular aspects of the world, of reasoning, and of learning that have some sophisticated moving parts that aren’t yet understood (some standard guesses are shallow and hierarchical dependence graphs, abundance of rough symmetries, separation of scale-specific behaviors, and so on). Our understanding will grow by capturing these ideas in terms of suitably natural language and sophistication for each phenomenon.
[added in edit] In particular (to point at a particular formalization of the general critique), I don’t think that there currently exists a defensible link between Heuristic Arguments and proof verification as in Jason Gross’s excellent paper. The specific weakening of the notion of proof verification is more general interpretability. Your post on surprise accounting is also excellent, but it doesn’t explain how heuristic arguments would lead to understanding systems better; rather, it shows that if we had ways of making better independence assumptions about systems with an existing interpretation, we would get a useful way of measuring surprise and explanatory robustness (with proof as a maximally robust limit). But I think that drawing the line from seeking explanations with some nice properties/measurements to the statement that a formal theory of such properties would lead to an immediate generalization of proof/interpretability which is strictly better than the existing “story-centric” methods is currently undefended. (This is similar to the story in some early work on causality in interp, that a good attempt to formalize and validate causal interpretations would lead to better foundations of interp. Those techniques are currently used productively, e.g. here, but as an ingredient of an interpretation analysis rather than the core of the story.) I think similar critiques hold for other sufficiently strong interpretations of the other arrows in this post. Note that while I would support a weaker meaning of arrows here (as you suggest in a footnote), there is nevertheless a core implicit assumption that the diagram exists as a part of a coherent agenda that deduces ambitious conclusions from a quite specific approach to interpretability. I could see any of the nodes here as being a part of a reasonable agenda that integrates with mechanistic interpretability more generally, but this is not the approach that ARC has followed.
I think that the issue with the approach sketched here is that it overindexes on a particular shape of explanation: namely, that the most natural way to describe the relevant details inherent in principled interpretability work will factorize through a language that grows out of better understanding independence assumptions in statistical modeling. I don’t see much evidence for this being the case, any more than I see evidence that the best theory of physics should grow out of a particular way of seeing cellular automata (and I’d in fact bet with some confidence that this is not true in either of these cases). At the same time I think that ARC ideas are good, and that trying to relate them to other work in interp is productive (I’m excited about the VAE draft in particular). I just would like to see a less ambitious, more collaboratively motivated version of this, which is working on improving and better validating the assumptions one could make as part of mechanistic/statistical analysis of a model (with new interpretability/MAD ideas as a plausible side-effect) rather than orienting towards a world where this particular direction is in some sense foundational for a “universal theory of interpretability”.