Very impressive! At least on a first read, to me this felt closer than any past work to realizing the SAE dream of actually understanding the computations occurring in the model, as opposed to just describing various “cool-looking patterns” that can be extracted from activations.
I’m curious about what would happen if you studied cases similar to the examples you present, except that the recruitment of a particular capability (such as arithmetic or knowledge about a named entity) occurs through in-context learning.
For example, you discuss an “obscured arithmetic” task involving publication dates. In that case, the model seems to have learned in training that the correct prediction can be done by doing arithmetic. But we could imagine obscured arithmetic tasks that are novel to the model, in which the mapping between the text and a “latent arithmetic problem” has to be learned in-context[1].
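For concreteness, here is one hypothetical way such a task could be constructed; the digit cipher, the “~”/“->” prompt format, and the `make_task` helper are all invented for illustration, not taken from the paper:

```python
# Hypothetical sketch of an "obscured arithmetic" task whose text-to-arithmetic
# mapping is novel: digits are enciphered with a freshly shuffled symbol table,
# so the mapping can only be inferred from the in-context examples.
import random

def make_task(n_examples: int, seed: int = 0):
    """Return (prompt, gold_answer) for one obscured-addition episode."""
    rng = random.Random(seed)
    symbols = list("@#$%&*+=?!")
    rng.shuffle(symbols)                      # fresh digit -> symbol cipher
    enc = lambda n: "".join(symbols[int(d)] for d in str(n))

    pairs = [(rng.randint(10, 99), rng.randint(10, 99)) for _ in range(n_examples + 1)]
    shown = [f"{enc(a)} ~ {enc(b)} -> {enc(a + b)}" for a, b in pairs[:-1]]
    a, b = pairs[-1]                          # final problem left unanswered
    prompt = "\n".join(shown + [f"{enc(a)} ~ {enc(b)} ->"])
    return prompt, enc(a + b)

prompt, gold = make_task(n_examples=8)
print(prompt)   # the model must infer "symbols are digits, '~' means add"
```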
We might then ask ourselves: how does the model’s approach to these problems relate to its approach to problems which it “can immediately tell” are arithmetic problems?
A naively obvious “algorithm” would look like the following (sketched in code after the list):

1. Try out various mappings between the observed text and (among other things) arithmetic problems
2. Notice that one particular mapping to arithmetic always yields the right answer on previous example cases
3. Based on the observation in (2), map the current example to arithmetic, solve the arithmetic problem, and map back to predict the answer
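Spelled out as ordinary sequential code, the sketch might look something like the following; the mapping objects, their encode/decode methods, and the `solve` helper are hypothetical placeholders rather than anything from the paper:

```python
# Hypothetical sketch of the naive "try, verify, apply" algorithm above.
# Each candidate mapping is assumed to expose decode (text -> arithmetic
# problem) and encode (arithmetic result -> text) methods.

def solve(problem):
    # Placeholder arithmetic solver; problem is assumed to be (a, op, b).
    a, op, b = problem
    return a + b if op == "+" else a - b

def naive_algorithm(examples, query, candidate_mappings):
    """examples: list of (obscured_text, obscured_answer) pairs."""
    # Steps 1-2: try each candidate mapping, keep one that reproduces
    # every example answer.
    verified = None
    for mapping in candidate_mappings:
        if all(
            mapping.encode(solve(mapping.decode(text))) == answer
            for text, answer in examples
        ):
            verified = mapping
            break
    if verified is None:
        return None  # no arithmetic interpretation was verified

    # Step 3: apply the verified mapping to the current example.
    problem = verified.decode(query)   # text -> arithmetic problem
    result = solve(problem)            # do the arithmetic (again!)
    return verified.encode(result)     # arithmetic -> predicted text
```

Note that `solve` is called both inside the verification loop and again in step 3, which is exactly the kind of re-use the next paragraph points out a transformer cannot get by literally running the same layers twice.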
However, due to the feedforward and causal structure of transformer LMs, they can’t re-use the same mechanism twice to “verify that arithmetic works” in steps 1 and 2 and then “do arithmetic” in step 3.[2]
It’s possible that LLMs actually solve cases like this in some qualitatively different way than the “algorithm” above, in which case it would be interesting to learn what that is[3].
Alternatively, if the model is doing something like this “algorithm,” it must be recruiting multiple “copies” of the same capability, and we could study how many “copies” exist and to what extent they use identical albeit duplicated circuitry. (See fn2 of this comment for more)
It would be particularly interesting if feature circuit analysis could be used to make quantitative predictions about things like “the model can perform computations of depth D or lower when not obscured in a novel way, but this depth drops to some D’ < D when it must identify the required computation through few-shot learning.”
(A related line of investigation would be looking into how the model solves problems that are obscured by transformations like base64, where the model has learned the mapping in training, yet the mapping is sufficiently complicated that its capabilities typically degrade significantly relative to those it displays on “plaintext” problems.)
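For instance, one could take the same few-shot arithmetic task and obscure it behind base64 (a minimal sketch; the prompt format is made up):

```python
# Sketch: the same addition task in "plaintext" vs. obscured by base64,
# a mapping the model has seen in training but which typically degrades
# its arithmetic performance.
import base64

def b64(s: str) -> str:
    return base64.b64encode(s.encode()).decode()

plain = ["23 + 45 = 68", "17 + 9 = 26", "31 + 48 ="]
obscured = [b64(line) for line in plain]

print("\n".join(plain))      # plaintext version
print("\n".join(obscured))   # base64-obscured version
```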
[1] One could quantify the extent to which the mapping really does have to be learned in-context by looking at how much the model benefits from examples. In an “ideal” case of this kind, the model would do very poorly when given no examples (equivalently, when predicting the first answer in a few-shot sequence), yet it would do perfectly when given many examples.
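A rough way to operationalize this, reusing the `make_task` helper from the earlier sketch; `query_model` is a placeholder for however one actually samples from the model, not an API from the paper:

```python
# Hypothetical sketch: accuracy as a function of the number of in-context
# examples k, as a proxy for how much the model benefits from examples.

def accuracy_vs_shots(query_model, shot_counts=(0, 1, 2, 4, 8, 16), n_trials=50):
    results = {}
    for k in shot_counts:
        hits = 0
        for seed in range(n_trials):
            prompt, gold = make_task(n_examples=k, seed=seed)
            hits += query_model(prompt).strip() == gold
        results[k] = hits / n_trials
    return results

# An "ideal" novel task would show ~chance accuracy at k = 0 and near-perfect
# accuracy at large k.
```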
[2] For instance, suppose that the current example maps to an addition problem where one operand has 9 in the ones place. So we might imagine that an “add _9” add function feature is involved in successfully computing the answer here.
But for this feature to be active at all, the model needs to know (by this point in the list of layers) that it should do addition with such an operand in the first place. If it’s figuring that out by trying mappings to arithmetic and noticing that they work, the implementations of arithmetic used to “try and verify” must appear in layers before the one in which the “add _9” feature under discussion occurs, since the final outputs of the entire “try and verify” process are responsible for activating that feature. And then we could ask: how does this earlier implementation of arithmetic work? And how many times does the model “re-implement” a capability across the layer list?
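One crude way to phrase that layer-ordering prediction, assuming one already has, for each feature of interest, the layer it lives in and its activation on the prompt (everything here, including the feature names, is hypothetical):

```python
# Hypothetical check: every active "verify that arithmetic works" feature
# should sit at a strictly earlier layer than the "add _9" feature it is
# supposed to be responsible for activating.

def downstream_of_verification(features):
    """features: dict mapping feature name -> (layer_index, activation)."""
    add9_layer, add9_act = features["add _9"]
    verify_layers = [
        layer
        for name, (layer, act) in features.items()
        if name.startswith("verify:") and act > 0
    ]
    return add9_act > 0 and all(layer < add9_layer for layer in verify_layers)
```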
[3] Perhaps it is something like “try-and-check many different possible approaches at every in-context example’s answer position, then use induction heads to move info about which try-and-check outputs matched at the associated answer position to later positions, and finally use this info to amplify the output of the ‘right’ computation and suppress everything else.”
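Very roughly, and purely as pseudocode for the verbal hypothesis above (all of the objects and helpers here are invented stand-ins, phrased per token position rather than per layer or head):

```python
# Rough pseudocode for the hypothesized mechanism: try-and-check candidate
# approaches at each example's answer position, carry "which approaches
# matched" forward (the induction-head-like step), then amplify the surviving
# approach at the final position and suppress the rest.

def hypothesized_mechanism(positions, approaches):
    # positions: objects with .is_example_answer / .is_query flags, plus
    # .example and .observed_answer; approaches: callables mapping an
    # example to a predicted answer. All hypothetical.
    matches = {
        pos: {a for a in approaches if a(pos.example) == pos.observed_answer}
        for pos in positions
        if pos.is_example_answer
    }
    # Induction-head-like step: move the match info to later positions.
    surviving = set.intersection(*matches.values()) if matches else set(approaches)

    # Amplify the surviving approach, suppress everything else, and apply it
    # to the final (to-be-predicted) example. Assumes at least one survives.
    query = next(pos for pos in positions if pos.is_query)
    chosen = next(iter(surviving))
    return chosen(query.example)
```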