Thanks to @Linda Linsefors and @Kola Ayonrinde for reviewing the draft.

tl;dr:

I built a toy model of the Universal-AND Problem described in Toward A Mathematical Framework for Computation in Superposition (CiS).

It successfully learnt a solution to the problem, providing evidence that such computational superposition^[1] can occur in the wild. In this post, I’ll describe the circuits I’ve found that model learns to use, and explain why they work.

The learned circuit is different from the construction given in CiS. The paper gives a sparse construction, while the learned model is dense—every neuron is useful for computing every possible $v_{i} \land v_{j}$ output.

I sketch out how this works and why this is plausible and superior to the hypothesized construction.

This result has implications for other computational superposition problems. In particular, it provides a new answer for why XOR so commonly appears in linear representations.

The Universal-AND Problem

The Universal-AND problem considers a set of boolean inputs $v_{1}, \dots, v_{m}$ taking values in ${0, 1}$ . The inputs are $s$ -sparse, i.e. at most $s$ are active at once, $Σ v_{i} \leq s$ . The problem is to create a one layer model that computes a vector of size $d$ , called the neuron activations, such that it’s possible to read out every $v_{i} \land v_{j}$ using an appropriate linear transform.

The problem has elaborations where $v_{1}, \dots, v_{m}$ are stored in superposition in an activation vector of size $d_{0}$ , or noise is introduced, but I have not directly experimented on this.

The idea of this problem is that for values of $d ≪ m^{2}$ there is not enough compute or output bandwidth to compute every possible AND pair separately. But if we assume $s ≪ d$ , then the model can take advantage of the sparsity of the inputs to re-use the same model weights for unrelated calculations. This phenomenon is called Computational Superposition^[1]. This is thought to occur in real-life models, in analogy to Activations in Superposition where sparsity is leveraged to store many more pieces of data in a vector than could be otherwise.

I model this as:

y = ReLU (W v + b)

z = R y + c

Where $W \in R^{m \times d}, b \in R^{d}, R \in R^{d \times m^{2}}, c \in R^{m^{2}}$ . In other words, $W$ and $b$ describe the “compute layer”, while $R$ and $c$ describe the “readout layer”.

In this formulation, the output has dimension $m^{2}$ corresponding to ordered pairs selecting pairs of inputs. So the target output is

^zim+j={vi∧vji≠j0i=j

The diagonals are therefore unused and only exist to make indexing convenient.

Related Work

Toward A Mathematical Framework for Computation in Superposition supplies the original statement of the U-AND problem, and a sparse construction that solves it with provable bounds on the error term. But I find in practice the learned model differs from the sparse construction in some key ways.

Circuits in Superposition: Compressing many small neural networks into one also looks at a problem statement that forces computational superposition and also uses a sparse construction to calculate theoretical bounds.

Superposition is not “just” neuron polysemanticity discusses the distinction between superposition and other possible reasons neurons might be polysemantic. The circuit found in this article is a good example of Example 1 - non-neuron aligned features.

What’s up with LLMs representing XORs of arbitrary features? and comments go into some depth on linear readout of Boolean circuits as observed in the wild. I suggest an alternative explanation for these observations.

Training

I trained the above two layer model on synthetic data where randomly exactly $s$ values of $v_{i}$ are active (i.e. take value $1$ ). I used RMS loss.

I sample $v_{i}$ uniformly, so test cases of the form $0 \land 0$ where much more common than $1 \land 0$ and $1 \land 1$ . I up-weighted the loss so that the expected contribution from each of those three cases is equal. This encourages the model to focus on the few active results rather than the vast sea of inactive results. We can justify this because the second layer is intended as a “readout” layer. For training purposes, we read out every possible pair every time, but if the first layer was actually embedded in part of a larger network, presumably not all pairs are actually desired.

Some weight decay is used to regularize the network. This matches real-world training runs, and encourages the model to focus on the optimal circuits. I trained each model with 6000 epochs of 10k batches to encourage this.

I typically ran with $m = 100, d = 1000, s = 3$ , but experimented with other values too.

The full details can be found in the corresponding notebook.

Results

This section contains empirical results, and can be skipped if you just want the crisp formula I extract from the behaviour.

We find that at low $s$ values, the model does find solutions that are capable of solving the U-AND problem, even extending to extremely low values of $d$ . The model weights take on a simple pattern of binary weights described below.

At higher values of $s$ (starting at 10 for $m = 100$ ), the model starts to prefer more degenerate solutions, particularly for $d \leq m$ . It either learns pure-additive circuits that roughly correspond to $z_{i m + j} = 0.4 (v_{i} + v_{j})$ or it fails to update the layer one weights at all, and relies entirely on readout weights.

The Binary Weighted Circuit

The model generally tends towards neuron weights that are binary, i.e. takes on only one of two different values. As far as I can tell, the choice is randomized, with a roughly even balance between them.

Expressed mathematically

W_{i j} = {\begin{matrix} u_{i} & with probability p_{i} l_{i} & with probability 1 - p_{i} \end{matrix}

I discuss why this is an effective choice in Circuit Analysis.

Binary Weight Charts

We established that with sufficient overtraining, the model trends towards binary weights with just three parameters $u_{i}$ , $l_{i}$ , $p_{i}$ . That means we can a scatter plot with one point per neuron. Error bars show 90th-percentile deviation from the weight being near to either the upper/lower bound.

Interactive Chart

Readout Charts

Another way of viewing the neurons is in terms of how they are read out by matrix $R$ .

I pick two arbitrary inputs (say $v_{1}$ and $v_{2}$ ), and plot each neuron in a scatter chart based on their weights (i.e. $x = W_{i 1}$ , $y = W_{i 2}$ ). I then color the neurons based on their readout weight for $v_{1} \land v_{2}$ (i.e. $R_{(1 m + 2), i}$ ).

The four corners of this chart correspond to classes A (top right), C (bottom left), B1 and B2 as described in Circuit Analysis. Class C has higher weights per-neuron as there are fewer neurons in that class than class A.

Even in cases where the weights don’t form a Binary Weighted Circuit, these charts give a clear indication of how readout works.

You can tell this is a different regime / circuit to as there is no neurons with negative input weights but positive readout weights (yellow dots in the bottom left).

Variant Experiments

I also evaluated the same models with randomized weights, then optimizing only the readout matrix. This variant exhibits similar loss, just worse by a constant factor. This matches observations from comments on the CiS post that random weights already have a number of desirable properties.

I changed the synthetic input to a mix of sparsity values. In this variant, you no longer see such a neat binary in the weights, but this is not that surprising at different weight values will work best at different sparsity levels. I did not spot any surprising differences here.

I tried using a model that forces the weights matrix to match the binary weights pattern, with some smoothing to allow for gradients. This model has very few free parameters, so I thought it would train very well. But it converged slower than simply training the whole matrix, so I found nothing valuable here.

Analysis

Circuit Analysis

I found for a wide range of parameters, the model tends towards binary weights where for any given neuron, the weights it uses only take on two different values, with no pattern. Expressed mathematically

W_{i j} = {\begin{matrix} u_{i} & with probability p_{i} l_{i} & with probability 1 - p_{i} \end{matrix}

This is very close to the CiS Construction, which followed this exact pattern, with values

u_{i} = 1 l_{i} = 0 p_{i} = {log}^{2} m / \sqrt{d}

In the CiS Construction $p_{i}$ is small, and $l_{i}$ is zero, meaning the neurons were only sparsely connected. This property was key in proving an upper bound on the loss of the model.

But in the learned Binary Weighted Circuit we see quite different values. I found these values representative^[2]:

u_{i} = 0.1 l_{i} = - 0.25 p_{i} = 0.75

Notation: When the values are a constant, I’ll drop the $i$ subscript.

Unlike the original construction, this is dense. Every neuron reads a significant value for every possible input.

Let’s explain exactly how neurons in this structure can be used to read out a good approximation of AND. Without loss of generality, let’s consider the first two inputs, i.e. we want to read out $v_{1} \land v_{2}$ from the activations of $d$ neurons that all share the same values of $u$ // $l$ $p$ , and differ only by the random choice in the weights.

We can subdivide the neurons into 4 classes based on the random choice of weight for the first two inputs.

\begin{matrix} Class A & y_{i} = ReLU ( & u v_{1} + u v_{2} + X_{i} + b) & occurs with proportion p^{2} Class B1 & y_{i} = ReLU ( & u v_{1} + l v_{2} + X_{i} + b) & occurs with proportion p (1 - p) Class B2 & y_{i} = ReLU ( & l v_{1} + u v_{2} + X_{i} + b) & occurs with proportion p (1 - p) Class C & y_{i} = ReLU ( & l v_{1} + l v_{2} + X_{i} + b) & occurs with proportion (1 - p)^{2} \end{matrix}

For simplicity of analysis, let’s ignore the other inputs shown as $X_{i}$ (called the interference terms in the CiS paper), and set $b = 0.05$ , though you’ll see these details won’t be too critical.

We can then draw up a truth table for the results of each class of neuron for the 4 possible values of $v_{0}, v_{1}$ .

$v_{1}$	$v_{2}$	A	B1	B2	C	$4 (A + C - B 1 - B 2)$
0	0	0.05	0.05	0.05	0.05	0
0	1	0.15	0	0.15	0	0
1	0	0.15	0.15	0	0	0
1	1	0.25	0	0	0	1

Taking the right linear combination of the 4 truth tables, we can recreate the AND truth table. This linear combination is a close match for the ones seen in Readout Charts.

Taking a linear combination will always be possible if the 4 classes are linearly independent, which is true for quite a wide range of choices for the key parameters^[3].

Of course, the interference term means that these truth tables are not accurate. But in this toy model, the different inputs are almost completely independent. So the interference term can be modeled as noise. This will pull the 4 classes towards colinearity, but for appropriate bias there will always be a difference that can be exploited.

As there are many neurons in each class, the readout matrix can average over them, reducing the noise to reasonable levels. This is similar to the proofs on noise bounds in CiS, and I discuss a bit further in Circuit Efficiency.

XOR Circuits

Note that linear combinations of these classes can recreate any desired truth table^[4] - we only got AND because of our choice of readout matrix. This may go to some extent to explain this observation that is often possible to readout XOR.

I explain the XOR readout phenomenon in two parts. Firstly, it seems $a \oplus b$ will be readoutable whenever there is a demand for boolean logic involving $a$ and $b$ , regardless of whether XOR is specifically useful or used. This is because the circuit for learning any truth table is the same, only differing by readout.

But secondly, $a$ and $b$ might not even be need to be related logically. The circuit described above involves all neurons as that allows errors to average out. In the toy example, we actually are testing all possible pairs. But if we were only interested in pairs drawn from a subset, they would still want to use all the neurons to maximize error averaging. So if there were two disjoint AND problems that shared the same model, the model might find that it is efficient to learn one big binary weighted circuit that covers inputs from both problems. It would only readout what is actually needed, but it would be possible to find readouts that connect both sets of inputs.

Generalizability of Linear Probes

Sam Marks makes the argument that if XOR feature directions are readily available, then a linear probe that learns $v_{1}$ only on data where $v_{2} = 0$ , then the probe will fail to generalize to $v_{2} = 1$ because it’s just as likely to identify $v_{1} \oplus v_{2}$ as it is $v_{1}$ , but they give opposite answers when out of distribution.

For the parameters I used above, we find the following readout weights (for the total of each class):

v_{1} \sim 4 A - 4 B 1 + \frac{8}{3} B 2 - \frac{8}{3} C

v_{1} \oplus v_{2} \sim \frac{20}{3} B 1 + \frac{20}{3} B 2 - \frac{40}{3} C

So although both formulas are learnable, XOR requires larger weights, so will be disfavoured by regularization^[5]. So I predict probes would still work when generalized, despite XOR being readoutable.

Circuit Efficiency

So why is this dense Binary Weighted Circuit preferred over the CiS Construction described in CiS? Possibly it’s just a natural effect of weight decay, which tends to encourage circuits to spread out to keep individual weights low in magnitude.

I think that this circuit produces lower loss values (absolutely and asymptotically). I’ll give an approximating argument.

To recap, the CiS Construction fills the weight matrix with zeros and ones, with the ones being at probability $p ≪ 1$ . While the Binary Weighted Circuit fills the weight matrix with $u$ and $l$ , with $l < 0 < u$ and $p \approx 0.5$ .

Let’s return to the the view of a neuron from the previous section where we have two distinguished inputs, w.l.og.

y_{i} = ReLU (l v_{1} + u v_{2} + X_{i} + b)

Here I’ve explicitly labelled the interference term $X_{i}$

X_{i} = m \sum j = 3 W_{i j} v_{j}

Most of the entries of $X$ don’t contribute, as $v_{j}$ will be zero. There will be at most $s$ entries, each randomly contributing $u$ or $l$ depending on $W_{i j}$ . So we can approximate $X$ as binomially distributed^[6].

X_{i} \sim (u - l) Binom (s, p) + s l

Var (X_{i}) = (u - l)^{2} s p (1 - p)

We can treat the collection of $X_{i}$ as independent as the randomness comes from the independent choices of $W_{i j}$ .

It’s hard to characterize exactly how the ReLU function interacts with $X$ , but I’m going to assume it affects variance by a constant factor, $β$ , giving:

Var (y_{i}) = β (u - l)^{2} s p (1 - p)

Now we can use this estimate to compute the variance of the readout values.

In the CiS Construction, the readout is the mean value of all neurons that are in class A (i.e. have a weight of 1 for $v_{1}$ and $v_{2}$ ). There are $d p^{2}$ such neurons (in expectation) so we can divide by that to get the variance of the mean.

{Var}_{CiS} (z_{i}) = β s p (1 - p) / d p^{2}

Meanwhile in the Binary Weighted Circuit, we pick readout weights based on the inverse of the $4 \times 4$ truth table matrix. It can be shown that the total weight for each class is approximately $\pm 4 \sqrt{s}$ ^[7].

So we have $d p^{2}$ class A neurons, each with readout weight $\pm 4 \sqrt{s} / d p^{2}$ . This gives total variance

{Var}_{Binary} (z_{i}) = β (0.1 - - 0.25)^{2} s p (1 - p) ⎛ ⎝ d p^{2} {(\frac{4 \sqrt{s}}{d p^{2}})}^{2} + . . . ⎞ ⎠

= {0.35}^{2} 4^{2} s β s p (1 - p) (\frac{1}{p^{2}} + \frac{2}{p (1 - p)} + \frac{1}{(1 - p)^{2}}) / d

Both variance formulas are pretty similar in terms of constants and asymptotic behaviour, which is unsurprising as so far we’ve just evaluated different choices of $u$ and $l$ . The dense case is worse by a factor of $s$ , and includes some extra terms that were previously ignored because $p$ is small in the CiS Construction.

To compare these properly, we need to consider the different values of $p$ . For the CiS Construction, this was asymptotically small $p = {log}^{2} m / \sqrt{d}$ , while in the Binary Weighted Circuit $p$ was a constant $0.75$ .

When we work out the asymptotic behaviour we get

{Var}_{CiS} (z_{i}) = O (s / d p) = O (s / \sqrt{d} / {log}^{2} m)

{Var}_{Binary} (z_{i}) = O (s^{2} / d)

So the choice of circuit depends on $s$ , with the Binary Weighted circuit preferred when $s$ grows slower than $\sqrt{d} / {log}^{2} m$ .

In retrospect, I think this result makes sense. By increasing $p$ , you are “using” more neurons. This increases the variance of individual neurons but gives you many more to average over. $s$ determines that per-neuron noise, so determines the trade-off. The CiS Construction merely aimed to minimize the interference term of a single neuron, as this was critical for establishing provable error bounds.

Additionally, aside from accuracy considerations, dense circuits are more likely to result from training than equivalent sparse ones as they score better for regularization loss. Or in other words, adding weight decay adds an inductive bias towards “spreading out” circuits as much as possible.

Is This Just Feature Superposition?

In a sense, the circuit found here bears a lot of resemblance to simply randomly embedding the $m$ inputs in $d$ dimensional space. In both cases, you get a mix of neurons with different response patterns, and you can approximate the AND operation by taking a dense linear combination of the neurons.

I re-ran my code with frozen weights for the calculation layer to test this, and you do get a similar pattern of readouts.

The main difference between a randomized model and the one I found learnt was that the weights take on specific binary values. But this is unlikely to replicate outside of toy models. Using binary weights simplifies analysis and has an improved loss, but does not appear to be better than random uniform values except by a constant factor.

Viewed in this paradigm, the key takeaways of the results section are slightly different:

Mixing / superposing input variables is useful for computation, even when there is no pressure to store activations in a narrow dimension.
If you have a random embedding of sparse inputs and sum and take ReLU then any boolean logic can be readout with the appropriate linear combination.

Limitations and Future Work

Toy models by their nature are not very representative of real situations. That said I think there are some ways this model could be expanded that:

We should explore input in superposition or at least some input noise
Haven’t explored loss that uses a subset of all AND pairings, or has an uneven weighting for input co-occurence
Is RMS a realistic way to score this?

I would also like to see the model run for a sufficient range of parameters so that scaling laws can be established. Does the loss actually correspond to the variance formula I derive? What is the tradeoff of sparsity and number of neurons?

I’ve established that for this simple problem, the it is better to use a dense circuit, than a sparse one, even though the input features are sparse. That naturally suggests that there may be dense constructions for other problems, like sparse boolean circuits or circuit compression. There may also be implications for interpretability techniques which depend on assumptions of circuit sparsity to work.

Finally, I think a logical step would be to see if a similar sort of circuitry can be found in a real LLM. There’s indirect evidence, but no solid example.

Conclusion

This work provides some empirical evidence for Toward A Mathematical Framework for Computation in Superposition. It’s clear that many boolean boolean logic equations can be computed from from one set of weights, even for small dimension sizes.

We have solidified the understanding of how basic boolean logic is performed, specifically despite sparse features, there is a tendency for dense circuitry to be used. The asymptotic error is characterized in terms of sparsity, and hidden dimension size.

We’ve discussed applicability to XOR representations and linear probes.

Dense circuits like these challenge the common assumption that circuits can be found by finding a sparse subset of connections inside a larger model. If it’s better to have many shared noisy calculations than smaller set of isolated reliable ones, then a different set of techniques is needed to detect them.

^
I draw a distinction between a model’s ability to work directly on activations in superposition (computation in superposition) and the ability to share weights between multiple overlapping circuits (computational superposition).
Though closely related, I think these can be investigated separately.
^
As the readout matrix $R$ can supply an arbitrary positive scaling, the only important details of $u$ / $l$ are their signs and ratio.
The magnitude tends to be around $1 / \sqrt{d}$ as this minimizes regularization loss (weight decay) in a two layer network.
^
@jake_mendel uses a very similar argument in this comment.
^
The design also extends to truth tables of more than two inputs without much difficulty.
^
Actually when using $L_{1}$ norm there are some choices of $p$ where this result reverses. But the point still holds that one of the two possible probes will be favoured.
^
This approximation elides the correlation between $X$ and $v_{1}$ , $v_{2}$ , which for high values of $p$ can be important, but I don’t think effects the circuits examined significantly.
^
This comes from $X_{i}$ having variance proportional to $s$ , but the neuron classes only differ from each other by a constant translation of at most $2 (u - l)$ . As $s$ increases, the classes ReLU zero-points are at increasingly similar points on the probability distribution.
I also verified this empirically.
^
Even at $d = m^{2}$ I found no evidence of the model learning a naive 1:1 neuron:output strategy. I believe that the presense of weight decay makes the binary weighted circuit preferable even though it is a bit more noisy.