Grammars, subgrammars, and combinatorics of generalization in transformers

Introduction

This is the first installment of my January writing project. We will look at generative neural networks from the framework of (probabilistic) “formal grammars”, specifically focusing on building a complex grammar out of simple “rule grammars”. This turns out to lead to a nice, and relatively non-technical way of discussing how complex systems like language models can be built out of “heuristics”. Thinking more about how these blocks are combined (and focussing on the difference between combining rules via “AND” vs. “OR”) leads to some new insights on generalization, which are sometimes lost in the fuzzy language of heuristics and circuits. In a follow-up to this post we’ll apply the tools introduced here in a more concrete context, of the recent paper on an in-context variant of modular addition, which provides a particularly nice example of the distinctions inherent in the “subgrammar” picture. I’ll also sketch some preliminary results from ongoing research about how neural networks might learn through analogy, a mechanism that dovetails in an interesting way with the formal grammar discussions here, but implies a new and interesting class of mechanisms that goes beyond these ideas.

Grammars, subgrammars, and circuits

A grammar (also known as “language” or “syntax”) is an “accept/​reject” machine on strings. It accepts a finite string of tokens (these can be bits, words, letters, numbers, LLM tokens, etc.) and outputs a “grammar check” which is “True” if the string is grammatical and “False” if it is not. Technically, any boolean classifier on strings is a grammar. But in practice, the term “grammar” denotes a particular way of generating and analyzing such classifiers. In particular, this term tends to be used for classifiers which decompose into formal and (in a suitable sense) local rules. We’ll revisit all of this later (focusing in particular on locality). But to keep things concrete, let’s introduce a toy grammar that we’ll be carrying around as a simple example. We define G_0 to be the grammar on all lowercase English letters, a-z, which accepts a string if and only if it satisfies the following two rules:

  • R1: “q” must be followed by “u”

  • R2: “i” before “e” except after “c”. In other words, this grammar rejects the substring “ei” unless it is part of the substring “cei”. Note that this grammar is particularly simple (in particular it is what’s called “regular”). But it will do as a first example. In particular, note that we have implicitly defined not one but three grammars: namely, we have the grammar G_0, and also we have the ‘rules’ R1 and R2, which themselves are grammars. For example let’s look at the five strings: “abc”, “quie”, “ei”, “qie”, “qei”. Then both R1 and R2 accept “abc” and “quie”. Both R1 and R2 reject “qei”. But the strings “ei” and “qie” are both accepted by exactly one of R1 and R2.

Subgrammars.

Since R1 and R2 are rules, they must be applied jointly. Thus the truth value G(string) is equivalent to the AND expression: R1(string) AND R2(string). In particular, G will accept only the strings “abc” and “quie” of the above five, since at least one of R1 or R2 fails on the others. Put in another way, we can define Strings_G to be the subset of all strings on which G evaluates to True. Then we have an intersection expression Strings_G = Strings_{R1} \cap Strings_{R2}. Note in particular that the grammar defined by G is a subgrammar of both R1 and R2. So the set of all strings that G accepts is a subset of strings that R1 accepts, and similarly with R2. So in this case, the property of being more complex than R1 and R2 makes G more restrictive. However, there is another, less common, way to make grammars more complex that goes in the opposite direction. Namely, let E be the grammar of all grammatical English sentences and let F be the grammar of all French sentences. Let FE be the grammar of all sentences which are either in French or in English. Then FE is evidently more complex (harder to specify) than either F or E, but it is less restrictive than either, i.e., contains both F and E as subgrammars.

Probabilistic grammars and a tiny bit of linguistics

This is a post on interpretability, not linguistics. So I’m not going to discuss the more complicated ways that linguists (or even computer scientists) think of grammars, such as defining what it means for a grammar to be context-free. However, it will be helpful to have a tiny bit of extra context (no pun intended) before moving on. First, a grammar can be probabilistic. We can think of this as a function that assigns each possible string a probability p(string) of being accepted, similar to how a language model assigns probabilities to different possible completions. We then obtain a stochastic string generator by taking a fully random string s_{prior} (with a tiny bit of worry to avoid infinite strings – for instance, we can start by drawing an integer length parameter n, say with probability ½ of length n = 0, ¼ of length n = 1, ⅛ of length n = 2, etc., then take a random length-n string – we will generally ignore this, by focussing on strings of a particular length), and we accept and print s_{prior} with probability , and, if rejected, we repeat the process until we finally accept a string. Of course, this process is equivalent to defining a “posterior” probability distribution P(s) on the set of all strings, and drawing from this probability distribution. This type of object (a probability distribution on all strings) is the definition of a probabilistic grammar. Any grammar is also a probabilistic grammar (with, once again, the small print of having to choose a preliminary probability distribution on length): namely, for each fixed length n, the probabilistic grammar associated to a “deterministic” grammar G is the grammar that randomly samples length-n strings s that are accepted by G (and assigns 0 probability to any string that is rejected by G). We can more generally “soften” any grammatical rule to a probabilistic rule by quantitatively “disincentivising” mistakes in a string. For instance we soften our “R1” grammar above (`“q” must be followed by “u”’) to by assigning each string s an acceptance probability of , where the function returns the number of instances in s of a letter “q” which is not followed by “u”. Note that it is natural to penalize mistakes multiplicatively, as above – this is standard in probability theory and statistical physics. Similarly, given two probabilistic grammars R1 and R2, a typical way to “combine” the two is to write Note that if we have previously normalized to genuine probability distributions – i.e., made them sum to one, the process of combining rules will generally require “re-normalizing” (i.e., dividing through by a factor to make the sum of probabilities add up to 1). This can be viewed as a technicality, at least for now. Similarly, we can take the “union” of two probabilistic grammars by averaging the two corresponding probability distributions. E.g. given two probabilistic grammars A and B, we can define a probabilistic grammar Avg(A,B) that first chooses the grammar A or B, each with probability ½, and then generates a phrase according to the rule it chose. Note that we can understand the function implemented by a transformer as a probabilistic grammar: given an empty input, the transformer has some probability P(s) of generating any given string s (usually parsed in terms of tokens, though by expanding tokens, this also induces a grammar on characters). Conversely, any (suitably finite) probabilistic grammar is implementable by a transformer. We will often blur the line between probabilistic and discrete grammars. I’ve already explained how to go from a discrete grammar to a probabilistic one. There is no “canonical” way to go in the other direction, but roughly given a probabilistic grammar G_{prob} we can (for a given length of string) simply choose some suitable cutoff probability and define the discrete grammar which accepts s if and only if . Finally, a note on locality: any checkable binary function on strings is a grammar. For instance we can define a grammar which accepts a string s if and only if it gives a formally correct proof of the Riemann hypothesis. However in practice, most grammars that are studied are checkable locally. This will for us be an informal notion. Informally, a local grammar is a “sequential parser”. It is a program that recursively tracks a hidden state with (on the empty string ) and for a string with letters, , where is an “update” function that takes in a hidden state and a character and outputs a new hidden state. Generally, the more complicated the update function is (and the more global variables it tracks), the more complicated we think of the grammar as being.

Heuristics

By now a common point of view (that I share) is that algorithms learned by LLMs (and other AI) mostly work in parallel, by combining a bunch of “heuristics”. One way to conceptualize a heuristic is as a “rule”: a sentence is “accepted” by an LLM (i.e., “plausible” as an output) if it jointly passes a large collection of heuristic checks: things like ‘“i” before “e” except after “c”’, but also more qualitative/​subtle things like ‘if the text seems happy so far, words like “bright” and “smile” are more likely to occur’. In a sense somewhat complicated by its probabilistic nature, LLM’s will combine heuristics using “AND”: so sentences should pass all checks to be plausible for generation (or more cleanly, accept/​reject probabilities from different rules should multiply). On the other hand, LLMs certainly have something that implements subgrammars in an OR fashion. For instance chatgpt (and relatives) almost certainly have a feature tracking language, and a sentence is plausible to be generated either if it passes a French parser or an English parser, despite very few sentences passing both at the same time. This can be conceptualized as a hidden variable “switch” circuit: if at a given moment of generation the switch is set to “French”, the LLM will produce French text, and if it is set to “English” it will produce English text. However, because of their stochastic and highly parallel nature, if we actually look at internals, the way LLMs work is more complicated. We see that despite the context being clearly English, LLMs will often execute many “obviously” French behaviors in their internals; at some point, these should get filtered out, but this shouldn’t be thought of as a “hard” parameter of the computation. While viewing French/​English as a relatively Boolean switch is mostly reasonable, there are many contexts where two alternative contextual computations will start happening in parallel (this is seen here, for example), and through some downstream process, one of the generation processes “gets the message” and stops. This is especially prevalent if the contexts are less clean-cut: say, if the “switch” is between “happy” and “sad” text. Thus we have identified two different types of “gadgets” which both often go under the name of heuristics. On the one hand, “AND” or “rule heuristics”, which typically get combined with an “AND” mechanism and individually (approximately) define “larger” grammars (e.g. the ‘“i” before “e” except …’ grammar is one heuristic component of chatgpt); on the other hand, “OR” or “context heuristics”, which typically get combined with an “OR” mechanism and individually (approximately) define subgrammars (such as “French text”). However, I want to flag (and this is very important) that in the parallel and probabilistic context of machine learning, subgrammars will sometimes (at least partially) be internally implemented not as a true programmatic switch, but as a collection of parallel circuits with some mechanism for distilling the “correct” context from “incorrect” contexts.

Rules, subgrammars and logits

It will be useful for us to digest a bit further the complicated relationships between “rules” (combined with AND) and “contextual subgrammars” (combined with OR) that can exist in a transformer. As I’m aiming for most of this post to be readable by a non-technical (or at least a non-ML) audience, I’ll avoid discussing the details of the transformer architecture. Instead I’ll point out that, by and large, it can be conceptualized as a suitably “local” grammar. Namely, the internals of a transformer track a hidden state h (where, for experts, I’m dumping “all hidden activation data in the context”) which evolves in some way from token to token. Presented with an n-token string of tokens, it outputs a next token s_n according to a probabilistic rule which is equivalent to implementing a “parser” which iteratively

  • picks a random “next token candidate” s_{candidate}

  • accepts it, i.e., sets , with probability (depending on the hidden state). (Iterating through rejections until an acceptance is reached.)

However, before computing any probabilities, what the transformer outputs is a logit value which is a real number for each potential completion token , and it assigns an acceptance probability of to the “next token” being .[1] The “logit” framework works excellently with the practice of combining “rule” grammars into a more complex grammar. Indeed, if we have two “rule” grammars R1 and R2, with corresponding logits and (which are both “parser” functions of the hidden state h and the next token t to be accepted/​rejected), then the “combined” grammar that imposes both rules is quite simply the sum In this case, the probability of accepting a token t combines multiplicatively: So in particular, if either one of the rules R1 or R2 rejects a completion t (corresponding to the acceptance probability being small, i.e., the corresponding logit being very negative) then the combined program is also likely to reject it. Indeed, in experiments, it does seem to often be the case than when a text generation process is decomposable into two or more simpler rules (applied with an AND), the algorithm being implemented will often be well approximated by a sum (or more generally, linear combination) on the level of logits. Here, since we are in probability land, there is a lot more freedom and flexibility, and (perhaps surprisingly), “OR”-style rules (so, splitting a grammar into subgrammars) can also sometimes be implemented by summing logits (⇔ multiplying probabilities). For instance let’s move from the French/​English example (which is complicated in this context by the fact that some short sentences parse in both languages) to the Russian/​English example, where because of different alphabets, it is (to a first approximation) impossible to encounter both Russian and English text in a sentence. In this case, we can form two subgrammars R and E, which separately parse Russian, resp., English. Suppose furthermore that R (resp., E) assigns some very low fixed probability – say 1E(-20) \approx exp(-46) – to any completion that has non-cyrillic (resp., non-Latin) characters. Then we can once again formally write down a grammar whose logits are the sum The combined rule has the funny property that, from an “accept/​reject” viewpoint, any letter starts out with a very low probability of acceptance. Indeed, any Cyrillic letter has probability 1E(-20) due to rule E since it’s not Latin and any Latin letter has probability 1E(-20) due to rule R since it’s not Cyrillic. However, because of the nature of “acceptance probabilities”, when modeling the behavior of such a combination, we can subtract away the constant term ln(1E(-20)) (= −46) from both logits, and we obtain what is effectively the “OR” circuit which accepts either English or Russian text (note that this combination “rejects” text that has both types of characters, since the constant “rejection” offset of −46 appears twice). In practice, LLMs probably don’t directly “add” a Russian text completion algorithm and an English one except in very early layers, but rather use the nonlinearity inherent in AI calculations to “switch” to either a Russian or English context at some early point in the computation, in order to be able to share some circuits between the languages (e.g., it wouldn’t be efficient to separately build a “math” circuit in English and in Russian, since the two text generation problems have a lot in common). However, for certain simple or shallow transformer algorithms (of which we will see an example soon), the “hack” of simply adding logits, thereby converting an AND between “rule” circuits to an OR between subgrammar circuits, is indeed the most efficient, or at least most easily learned process.

Subgrammars in mechinterp

This process of “parallel contextual subgrammars” is perhaps more prevalent than one might expect, at least when appropriately understood. One unexpected place where it shows up is modular addition.

Modular addition: definition

The problem of modular addition can be defined as a very simple formal language, i.e., grammar. The “complexity” of the modular addition problem is controlled by an integer p (usually assumed prime, for some number-theoretic simplifications). For instance in Neel Nanda et al.’s initial work on the subject, p=113. The resulting language, has a p-token dictionary: {0, 1, 2, .., p-1}. Given a string s,

  • ModAdd will reject any string s unless it has 3 letters (i.e., it’s strictly a grammar on 3-letter strings)

  • ModAdd will accept a string of three numbers “abc” if and only if we have an equality mod p, . (In other words, iff a+b-c is a multiple of p). Note that since it only accepts finite strings, this grammar is finite. The total number of strings it accepts is p^2 (since for any choice of there is a unique continuation abc which is accepted). Thus when training it, one passes some number of “valid” 3-token training sentences that satisfy the modular addition requirement with the number of examples n_train In practice, it is possible to train an algorithm with 100% accuracy with a large range of training data size between and (note that there is a simple reason why just O(p) examples aren’t enough to learn anything: indeed, if our set of examples never repeats a token, or if each token is only repeated one or two times, there is simply not enough information to generalize to more complicated relationships between tokens. However, with a suitably well-designed architecture, it seems that a logarithmic multiple of p examples is enough to guarantee generalization).

Modular addition: circuits as subgrammars

The way simple transformers (as well as simple MLPs) perform modular addition is relatively well-understood (and this is one of the only learned algorithms which is, at least in a suitably satisfying way, “fully interpreted”). I won’t get into the details since they are irrelevant, but instead I will gather together some relevant points. First, depending on architecture, modular addition neural nets tend to learn one of two types of algorithms, called “pizza” and “clock” (which you don’t have to know to understand the following).[2] While a fully analogous story is possible for the (more elegant and efficient) “clock” algorithm, it is conceptually easier to describe for a “pizza” circuit. In a “pizza-preferring” architecture, the key points are as follows for a fully trained generalizing algorithm:

  • The algorithm learned operates in parallel, meaning that multiple independent “circuits” simultaneously evaluate the input. In other words, the (learned) logits logit(ab, c) determining the accept/​reject probabilities of c as a completion to the string “ab” are in fact a sum of “circuit” logits, $\sum_{i=1}^k circuit_i(“ab”, c). $(Here the number of circuits, k, is in practice between 3 and 10 or so, depending on architecture, training, and various other choices.)

  • For each circuit circuit_i viewed as a standalone algorithm, it incorrectly classifies about ⅓ of inputs.

  • The misclassified inputs can be successfully conceptualized as random (i.e., given by a hash). In fact, the last bullet can be further improved (not only can the “wrong” inputs be understood as random but the corresponding “errors” are significantly smaller in expectation than the “correct” logits). At the end of the day (and injecting some extra context that one can get from analyzing the algorithm), we see that when we add several circuits together, the resulting output logit (“ab”, c) is

  • a sum of (pseudo) random values of arbitrary sign, if

  • a sum of positive values, if In particular, if there are enough circuits then we see that (assuming the pseudorandom choices aren’t extremely correlated – which they are discouraged to be by the training process), the correct logit outweighs each of the “wrong” logits, by some minimal offset value K. Since the value of c can be made arbitrarily large (by modifying the weights to rescale the output logit by some scale factor), the “acceptance probability” of an incorrect completion, , becomes effectively zero (specifically, at least less likely than the correct completion). While the exact math is not worth going into, a “morally correct” simplification of the picture is that the different circuits (contributing the linearly combined logit-functions logit_i) can be viewed as consultants or witnesses who “vote” on how likely any output is. With probability about ⅓, each of them will vote incorrectly and with probability ⅔ they vote correctly; and if there are enough “consultants” and they are suitably uncorrelated (something you can mathematically prove in this context), the law of large numbers implies that their “verdict” is correct. Now since the difference between correct and incorrect verdicts gets weighted by some free parameter, then exponentiated, this means that the acceptance probability of the “rejected” answer can become arbitrarily close to zero: something we expect from a good probabilistic approximation to a discrete grammar.

Upshots so far

Let’s review what we’ve covered so far.

  • “Probabilistic grammar” is a lens for looking at any textual generation or prediction task.

    • In particular, even toy tasks like modular addition can be productively viewed as a grammar.

  • It points at a particular way of conceptualizing text, via simpler rules or “heuristics” which are:

    • local and

    • combinatorial (in the sense of “combined in some appropriate formal way”)

  • Rules that are combined with AND (like “i before e” AND “u after q”) are easy to combine in standard architectures (like transformers) by adding logits, corresponding to multiplying probabilities.

  • Rules that are “contextual”, i.e., combined with OR (like “writing in Russian OR writing in English”) are treated in a more complicated way and have more diversity in implementation choices, but can also sometimes just correspond to adding logits. (And will tend to be learned in this way in sufficiently shallow toy examples.)

  • “Noisy rules”, like predicting modular addition from circuits with uncorrelated errors, can be treated as contextual rules, and can often be obtained by adding logits: this is often the same as circuits “voting” on the correct answer (“do what circuit #1 says, UNLESS the majority of other circuits disagree”). A higher-level upshot is that thinking of circuits as grammatical rules is productive, and points at the importance of keeping track of how simple, noisy and heuristic rules can combine to form more sophisticated processes.

Memorization and beyond

There is a particularly stupid, but popular (among learning algorithms) way to decompose a grammar into subgrammars, which is memorization. In this context, each allowable phrase is a subgrammar (i.e., a grammar that accepts only the phrase itself), and these are combined with an OR. This can either be done globally (by memorizing each training phrase, and then learning the “combined” grammar which is “accept a phrase if it is exactly one of phrase1 OR phrase2 OR …), or it can be done locally, as an n-gram (accept a phrase if each consecutive pair of tokens is bigram1 OR bigram2 OR …). While each of the constituent subgrammars is quite simple, this is an expensive way to learn a grammar: each of the allowed sentences must be kept somewhere in the algorithm’s global memory (i.e., its weights). However, when LLMs are trained in an overparameterized regime (i.e., there are more parameters than training examples), it is certainly possible to do this, and often (though far from always!) when you interpret how an overparameterized learning algorithm works, this is exactly what happens. For example, certain real-life architectures for learning modular addition will prefer memorization (memorizing a + b \equiv c for every pair of residues (a,b)), or learn a memorizing solution first and only “grok” a generalizing solution much later (this in particular happens in the experiments in Neel Nanda’s seminal paper on the topic. Note that efficiently designed overparameterized algorithms will generalize immediately before learning a memorizing solution). Thus the dichotomy between “AND” and “OR” methods to combine simpler grammars into more complicated ones is related to the dichotomy between generalization and memorization. In some sense, the less context-dependent “OR” phenomena an algorithm exhibits, the more you expect it to generalize. One way to easily see this correspondence is to remember that “AND” rules are always on (each example must obey all AND rules, thus they are maximally “general”) whereas “OR” rules and subgrammars are by their very nature contextual, i.e., specific to some examples but not others. Note that algorithms that “totally memorize”, i.e., learn each example as an independent subgrammar, are easy to spot: these are algorithms that exhibit no generalization (i.e., will not behave better than random chance on held-out examples outside the training set). In the next post in this series, we’ll see that there is a lot of room on the spectrum between generalization and memorization, as conceptualized by AND vs. OR behavior for how components combine (and this is sometimes, but not always, visible by looking at the behavior on held-out examples). A beautiful intermediate example, where “full generalization” (in a suitable sense) can occur either via “partially-memorizing” subgrammars or via “fully general” methods, is provided by the very elegant recent paper) by He et al. on “in-context modular linear algebra” that I encountered in the latest NeuRIPS, which we’ll look at more carefully next time. We will treat this paper as a jumping-off board to explore more sophisticated ways to operationalize the spectrum between memorization and generalization: in particular, we will see that sometimes we can conceptualize AND vs. OR as an adjustable “slider”. Even when it’s possible to fully generalize, there can exist algorithms that have excellent generalization behavior, but work by memorizing some large number of “context-dependent” subgrammars, each of which behaves correctly only on a small batch of data (but is generalizing on this batch). This “AND vs. OR” slider is related to some other recent work on rethinking generalization. A reasonable question one can now ask is whether this “AND vs. OR” slider is sufficient to explain the different generalizing behaviors of algorithms learned by real-life neural nets (whether on toy data settings, or real-life LLM data). One obvious issue is that, like everything to do with algorithms, the occurrences of AND vs. OR can be stacked to exhibit a fractal nature: the subgrammars you combine into a grammar via OR can themselves be composed of simpler grammars via AND, which can be composed of simpler grammars via OR, which can … While seemingly complicated, this might not be too bad a complication: similarly stacked structures occur all the time in CS and math (see e.g. the polynomial hierarchy), and might be relatively easy to incorporate as a tractable source of complexity in an appropriately generalized version of the AND vs. OR “slider” idea. Indeed, I think that carefully formalizing and measuring this kind of nested structure and thus bridging the gap between “total memorization” vs. “total generalization” is a very valuable approach to interpretability, and in the next post, I will use the He et al. paper as a jumping off point to think about it further in empiricist-friendly ways. However, as we’ll revisit next time, even the “fractal” AND vs. OR picture can be further complicated (this is going to be a motif of my series of posts, since I’m a theory person – luckily, not the only motif). There are two directions to go further frmo here, one of which I’m particularly excited about.

Future directions

This “AND vs. OR” framework for understanding how neural networks combine simple rules into complex behaviors is powerful, but not complete. There are at least two important directions that pull us beyond this basic picture. The first direction leads us toward causality. While stacks of AND/​OR operations can capture many algorithmic behaviors, we’re finding that more fine-grained, graph-based processes may be needed to fully describe how neural networks generalize. While I don’t understand this theory well, this connects to ideas from causality theory, including natural latents and finite factored sets. While our framework of analyzing algorithms as shallow stacks of ANDs and ORs provides a useful experimental approach, it may ultimately need to be integrated into these broader causal frameworks. The second direction is more surprising. In our recent research with Louis Jaburi, we’re finding evidence that neural networks can learn in ways that aren’t captured by any combination of rule-based circuits—what we might call learning by analogy. While we currently have only preliminary examples of this behavior, they suggest something fascinating: neural networks can develop “meta-rules” that operate without ever formalizing their underlying principles. This isn’t just another way of combining circuits—it appears to be a fundamentally different kind of learning that breaks out of the rule-based paradigm entirely (and can hopefully lead to significantly simplifying causal analysis in certain contexts).


  1. ↩︎

    Technically, the logit value can be positive, corresponding to an impossible value . However, it’s easy to see that the distribution of text doesn’t change if the same value c is subtracted from all logits, so it’s possible to make all of them negative by subtracting a sufficiently large fixed value. In practice, transformers don’t execute the inefficient “acceptance” algorithm at all, but rather immediately normalize the “acceptance” probabilities to add up to one, and draw in a single step from the resulting probability distribution on all logits – if you’re used to seeing a “softmax” function in your output, this is the function that outputs the normalized probabilities.

  2. ↩︎

    Note: adding this for context. It’s mathy and not needed for the rest. Both the clock and pizza circuit are best abstractd out via complex numbers, and both embed and unembed the residue via for The value “k” here corresponds to the Fourier mode. The difference is in the nonlinearity: the clock circuit sends the pair to and the pizza circuit (preferred in certain “tied” contexts) first adds the residues, and sends to One can check that for a fixed value of the nonlinear postactivation complex number associated to follows the shape of a cardioid as b varies. In particular, for the “concave” part of the cardioid, which corresponds to about 13 of values of b, the maximal output logit is incorrect due to the concavity.