Now we’re going to build a new model constructed from the description of the original model. Each component in the new model will be a small model trained to imitate a human computing the function that the component’s description specifies.
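As a very rough sketch of what training one such component-imitating model might look like (everything here is hypothetical: `spec_fn` stands in for a human, or a program written from the human-readable description, computing the component’s specified function):

```python
import torch
import torch.nn as nn

# Hypothetical: spec_fn(x) stands in for "a human computing the function
# that the component's description specifies" -- in practice this would be
# human labels, or a program written from the description.
def spec_fn(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x[:, :1]) + x[:, 1:2] ** 2  # toy example specification

# A deliberately small implementation model for one component.
component = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

opt = torch.optim.Adam(component.parameters(), lr=1e-3)
for step in range(2000):
    x = torch.rand(256, 2) * 4 - 2          # sample inputs from the component's expected domain
    with torch.no_grad():
        y = spec_fn(x)                       # spec-derived targets
    loss = nn.functional.mse_loss(component(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```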
Some of the recent advances in symbolic regression and equation learning might be useful during this step to help generate functions describing component behavior, if what a component in the model is doing is moderately complicated. (E.g. A Mechanistic Interpretability Analysis of Grokking found that a model trained to do modular arithmetic ended up implementing it using discrete Fourier transforms and trig identities, which sounds like the sort of thing that might be a lot easier to figure out from a learned equation describing the component’s behavior.) Being able to reduce a neural circuit to an equation or a Bayes net or whatever would help a lot with interpretability, and at that point you might not even need to train an implementation model: we could perhaps just use the symbolic form directly, as a more compact and more efficiently computable representation.
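As a sketch of how that might work in practice, here is a minimal example using the PySR symbolic regression library; the component inputs and outputs below are made-up stand-ins for activations recorded from the component being described:

```python
import numpy as np
from pysr import PySRRegressor  # off-the-shelf symbolic regression library

# Hypothetical data: inputs fed to the component and the activations it produced,
# recorded while running the original network on its training set.
component_input = np.random.uniform(-3, 3, size=(1000, 2))
component_output = np.sin(component_input[:, 0]) * np.cos(component_input[:, 1])

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos", "exp"],
)
model.fit(component_input, component_output)

print(model.sympy())  # best candidate equation describing the component
```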
At the end of this process, you might even end up with something symbolic that looked a lot like a “Good Old Fashioned AI” (GOFAI) model, but a “Bitter Lesson compatible” one: first learned by a neural net and then reverse-engineered using interpretability. Obviously doing this would place high demands on our interpretability tools.
If I had such a function describing a neural net component, one of my first questions would be: which portions of the function’s domain are well covered by the training set that the initial neural net was trained on, or at least close enough to items in that training set that interpolating the function to them seems likely to be safe (given its local first, second, third… partial derivatives), and which portions are untested extrapolations? Did the symbolic regression/function learning process give us multiple candidate functions, and if so, how much do they differ outside that well-tested region of the domain?
This seems like it would give us some useful intuition for when the model might be unsafely extrapolating outside the training distribution, and when we need to be particularly cautious.
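A minimal sketch of one way such a coverage-and-disagreement check might be operationalized (all the data and candidate functions below are hypothetical placeholders):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical arrays: inputs the component saw during training, and new
# inputs we want to evaluate the learned equation on.
train_inputs = np.random.uniform(-2, 2, size=(5000, 2))
query_inputs = np.random.uniform(-4, 4, size=(100, 2))

# Distance from each query point to its nearest training-set neighbour,
# as a crude proxy for "well covered" vs "untested extrapolation".
tree = cKDTree(train_inputs)
nn_dist, _ = tree.query(query_inputs, k=1)
# Typical spacing between training points, used as a coverage threshold.
coverage_radius = np.percentile(tree.query(train_inputs, k=2)[0][:, 1], 95)
is_extrapolating = nn_dist > coverage_radius

# If symbolic regression returned several candidate functions, check how much
# they disagree outside the well-covered region.
candidates = [
    lambda x: np.sin(x[:, 0]) * np.cos(x[:, 1]),
    lambda x: x[:, 0] - x[:, 0] ** 3 / 6,  # hypothetical alternative fit
]
preds = np.stack([f(query_inputs) for f in candidates])
disagreement = preds.std(axis=0)

print("flagged as extrapolation:", is_extrapolating.sum(), "of", len(query_inputs))
print("mean candidate disagreement (in-coverage):", disagreement[~is_extrapolating].mean())
print("mean candidate disagreement (extrapolating):", disagreement[is_extrapolating].mean())
```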
Some portions of the neural net may turn out to be irreducibly complex — I suspect it would be good to be able to identify when something genuinely is complex, and when we’re just looking at a big tangled-up blob of memorized instances from the training set (e.g. by somehow localizing sources of loss on the test set).
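One very rough way to probe that distinction might be to localize per-example test loss to the component (e.g. by ablating it) and then ask whether the examples it matters for are essentially near-duplicates of training points. A hypothetical sketch:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical per-example test losses, measured once with the component intact
# and once with it ablated (e.g. replaced by its mean activation).
loss_with_component = np.random.rand(1000) * 0.5
loss_with_ablation = loss_with_component + np.random.rand(1000)

# How much each test example depends on this component.
loss_attribution = loss_with_ablation - loss_with_component

# Hypothetical embeddings of test and training examples in some feature space.
test_embeddings = np.random.randn(1000, 16)
train_embeddings = np.random.randn(20000, 16)
nn_dist, _ = cKDTree(train_embeddings).query(test_embeddings, k=1)

# If the component mainly helps on test examples that sit right next to training
# points, that's evidence for a memorization blob rather than a general algorithm.
helped = loss_attribution > np.percentile(loss_attribution, 90)
print("median NN distance, examples the component helps most:", np.median(nn_dist[helped]))
print("median NN distance, all test examples:", np.median(nn_dist))
```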