That counterargument does at least typecheck, so we’re not talking past each other. Yay!
In the context of neurosymbolic methods, I’d phrase my argument like this: in order for the symbols in the symbolic-reasoning parts to robustly mean what we intended them to mean (e.g. standard semantics in the case of natural language), we need to pick the right neural structures to “hook them up to”. We can’t just train a net to spit out certain symbols given certain inputs and then use those symbols as though they actually correspond to the intended meaning, because <all the usual reasons why maximizing a training objective does not do the thing we intended>.
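To make that failure mode concrete, here is a minimal sketch of the pattern being criticized: a net trained to emit discrete symbols, with hand-coded logic on top that treats those symbols as if they carried the intended meaning. This is purely illustrative, assuming PyTorch; the names (`symbol_extractor`, the `obstacle_ahead`/`path_clear` symbols) are hypothetical and not taken from any actual system.

```python
# Minimal sketch (hypothetical, not any real system) of the neurosymbolic
# pattern under discussion: a neural net emits a discrete symbol, and a
# hand-written symbolic rule then treats that symbol as if it carried the
# intended meaning. Nothing in the pipeline guarantees the grounding.

import torch
import torch.nn as nn

# Hypothetical "symbol extractor": maps raw observations to a symbol index.
# Training (omitted) only optimizes agreement with labels on the training
# distribution -- it does not certify that index 0 *means* "obstacle ahead".
symbol_extractor = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 2),  # logits over {0: "obstacle_ahead", 1: "path_clear"}
)

SYMBOLS = {0: "obstacle_ahead", 1: "path_clear"}

def symbolic_policy(symbol: str) -> str:
    """Hand-coded 'interpretable' logic layered on top of the net's symbols."""
    if symbol == "obstacle_ahead":
        return "stop"
    return "proceed"

observation = torch.randn(16)  # stand-in for a real sensor reading
with torch.no_grad():
    logits = symbol_extractor(observation)
symbol = SYMBOLS[int(logits.argmax())]

# The symbolic layer is only as trustworthy as the grounding of `symbol`:
# off-distribution, the net may emit "path_clear" for reasons unrelated to
# the path actually being clear, and the logic above inherits that failure.
print(symbol, "->", symbolic_policy(symbol))
```

The hand-coded layer here is exactly as reliable as the grounding of `symbol`, which is the part the training objective does not certify.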
Now, I’m totally on board with the general idea of using neural nets for symbol grounding and then building interpretable logic-style stuff on top of that. (Retargeting the Search is an instance of that general strategy, especially if we use a human-coded search algorithm.) But interpretability is a necessary step to do that, if we want the symbols to be robustly correctly grounded.
On to the specifics:
A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn’t apply there.
I partially buy that. It does seem to me that a lot of people doing interpretability don’t really seem to have a particular goal in mind, and are just generally trying to understand what’s going on. Which is not necessarily bad; understanding basically anything in neural nets (including higher-level algorithms) will probably help us narrow in on the answers to the key questions. But it means that a lot of work is not narrowly focused on the key hard parts (i.e. how to assign external meaning to internal structures).
One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity.
Insofar as the things passing between modules are symbols whose meaning we don’t robustly know, the same problem comes up. The usefulness of structural/algorithmic properties is pretty limited, if we don’t have a way to robustly assign meaning to the things passing between the parts.
Hmm, I feel a bit damned by faint praise here… it seems like you’re doing more than type-checking: you are agreeing substantively with my points (or at least, I fail to find any substantive disagreement in your response).
Perhaps the main disagreement is about the definition of interpretability, where it seems like the goalposts are moving… you say (paraphrasing) “interpretability is a necessary step to robustly/correctly grounding symbols”. I can interpret that in a few ways:
“interpretability := mechanistic interpretability (as it is currently practiced)”: seems false.
“interpretability := understanding symbol grounding well enough to have justified confidence that it is working as expected”: also seems false; we could get good grounding without justified confidence, although it is certainly much better to have the justified confidence.
“interpretability := having good symbol grounding”: a mere tautology.
A potential substantive disagreement: I think we could get high levels of justified confidence via means that look very different from (what I’d consider any sensible notion of) “interpretability”, e.g. via:
A principled understanding of how to train or otherwise develop systems that ground symbols in the way we want/expect/etc.
Empirical work
A combination of either/both of the above with mechanistic interpretability
It’s not clear that any of these, or their combination, will give us as high a level of justified confidence as we would like, but that’s just the nature of the beast (and a good argument for pursuing governance solutions).
A few more points regarding symbol grounding:
I think it’s not a great framing… I’m struggling to articulate why, but it’s maybe something like: “There is no clear boundary between symbols and non-symbols”.
I think the argument I’m making in the original post applies equally well to grounding… There is some difficult work to be done and it is not clear that reverse engineering is a better approach than engineering.