I’d be especially interested in exploring either the universality of universal calculation
Do you mean the thing we call genericity in the further work section? If so, we have some preliminary theoretical and experimental evidence that genericity of U-AND holds. We trained networks on the U-AND task and on the analogous U-XOR task with a narrow 1-layer MLP, and looked at the size of the interference terms after training with a suitable loss function. Then we reinitialised and froze the first layer of weights and biases, allowed the network to learn only the linear readoff directions, and found that the error terms were comparably small in both cases.
[Figure: sizes of the errors for $d = d_0 = 100$ (which is pretty small), with readoffs that should be zero in blue and readoffs that should be one in yellow; we want all of these errors to be close to zero.]
This suggests that the AND/XOR directions were $\epsilon$-linearly readoffable at initialisation, but the evidence at this stage is weak because we don’t yet have a good sense of what a reasonable value of $\epsilon$ is for considering the task to have been learned correctly: to answer this we want to fiddle around with loss functions and train for longer. For context, an affine readoff (linear + bias) directly on the inputs can read off $f_1 \wedge f_2$ with $(\vec{f_1} + \vec{f_2})/2 - 1/4$, which has an error of $\epsilon = 1/4$. This is larger than all but the largest errors here, and you can’t do anything like this for XOR with an affine readoff.
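To make the frozen-first-layer version of this concrete, here is a minimal sketch (not our exact setup): it freezes a random ReLU layer and fits the readoff for every pairwise AND by least squares over all two-hot inputs, rather than by SGD with the weighted loss discussed further down. The sizes, the fixed-compositeness inputs, and the least-squares fit are simplifying choices for illustration.

```python
# Sketch: freeze a random first layer and ask how well all pairwise ANDs of the
# inputs can be read off linearly from its ReLU activations.
import itertools
import numpy as np

rng = np.random.default_rng(0)
d0 = d = 100

W = rng.standard_normal((d0, d))                 # frozen random first layer, zero bias
pairs = list(itertools.combinations(range(d0), 2))

X = np.zeros((len(pairs), d0))                   # one two-hot input per pair (i, j)
for n, (i, j) in enumerate(pairs):
    X[n, [i, j]] = 1.0
Y = np.eye(len(pairs))                           # AND_{i,j} should be 1 only on its own input

A = np.maximum(X @ W, 0.0)                       # ReLU activations of the frozen layer
A1 = np.concatenate([A, np.ones((len(pairs), 1))], axis=1)   # allow a bias in the readoff

R, *_ = np.linalg.lstsq(A1, Y, rcond=None)       # learn only the linear readoff
errs = A1 @ R - Y

print("rms error where target is 1:", np.sqrt(np.mean(errs[Y == 1] ** 2)))
print("rms error where target is 0:", np.sqrt(np.mean(errs[Y == 0] ** 2)))
print("max |error|:", np.abs(errs).max())
# For comparison, the affine readoff (f_i + f_j)/2 - 1/4 directly on the inputs
# has error exactly 1/4 on every two-hot input.
```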
After we did this, Kaarel came up with an argument that, for suitable readoffs, networks randomly initialised with standard Gaussian weights and zero biases solve U-AND with inputs not in superposition (the argument can probably be generalised to the superposition case). To sketch the idea:
Let $W_i$ be the vector of weights from the $i$th input to the neurons. Then consider the linear readoff vector with $k$th component given by:
$$(\overrightarrow{\mathrm{AND}}_{i,j})_k = \alpha\,(\mathbb{1}_{W_{ik}>0} \wedge \mathbb{1}_{W_{jk}>0}) + \beta\,\mathbb{1}_{W_{ik}>0} + \gamma\,\mathbb{1}_{W_{jk}>0} + \delta$$
where $\mathbb{1}$ is the indicator function. There are 4 free parameters here, which are set by 4 constraints: we require that the expectation of this vector dotted with the activation vector has the correct value in the 4 cases $f_i, f_j \in \{0,1\}$. In the limit of large $d$ the value of the dot product will be very close to its expectation, and we are done. There are a bunch of details to work out here and, as with the experiments, we aren’t 100% sure the details all work out, but we wanted to share these new results since you asked.
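Here is a numerical sketch of that argument, with the details filled in one particular way: the per-neuron expectations are estimated by Monte Carlo, the 4 constraints are solved by least squares, and the resulting readoff is checked on a fresh random network. The function names, sample sizes, and choices of $d$ are illustrative, and whether the errors actually shrink nicely at these finite sizes is exactly the kind of question we haven’t pinned down yet.

```python
# Only features i and j are taken to be active; with zero bias and inputs not in
# superposition, inactive features contribute nothing to the preactivations.
import numpy as np

rng = np.random.default_rng(0)

def per_neuron_features(u, v, fi, fj):
    """For weights (u, v) from inputs i and j to a neuron, return the 4 quantities
    that multiply alpha, beta, gamma, delta in the dot product, on input (fi, fj)."""
    a = np.maximum(fi * u + fj * v, 0.0)                   # ReLU activation
    p, q = (u > 0).astype(float), (v > 0).astype(float)    # the indicator functions
    return np.stack([p * q * a, p * a, q * a, a], axis=-1)

cases = [(0, 0), (1, 0), (0, 1), (1, 1)]
targets = np.array([0.0, 0.0, 0.0, 1.0])                   # desired value of AND(f_i, f_j)

# Per-neuron expectations of the 4 quantities in each case, over the Gaussian weights.
u, v = rng.standard_normal(1_000_000), rng.standard_normal(1_000_000)
M = np.stack([per_neuron_features(u, v, fi, fj).mean(axis=0) for fi, fj in cases])

for d in [100, 1000, 10000]:
    # Solve the constraints d * E[per-neuron contribution] = target for the coefficients.
    coeffs, *_ = np.linalg.lstsq(d * M, targets, rcond=None)
    U, V = rng.standard_normal(d), rng.standard_normal(d)  # a fresh random network
    errs = [abs((per_neuron_features(U, V, fi, fj) @ coeffs).sum() - t)
            for (fi, fj), t in zip(cases, targets)]
    print(f"d={d}: worst-case readoff error over the 4 cases = {max(errs):.4f}")
```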
A big reason to use MSE as opposed to eps-accuracy in the Anthropic model is for optimization purposes (you can’t gradient descent cleanly through eps-accuracy).
We’ve suggested that perhaps it would be more principled to use something like an $L^p$ loss with $p$ larger than 2, as this is closer to $\epsilon$-accuracy. It’s worth mentioning that we are currently finding that the best loss function for the task seems to be something like $L^p$ with extra weighting on the target values that should be 1. We do this to avoid the problem that if the inputs are sparse, then the ANDs are sparse too, and the model can get good loss on $L^p$ (for low $p$) by sending all inputs to the zero vector. Once we weight the ones appropriately, we find that lower values of $p$ may be better for training dynamics.
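A minimal sketch of the kind of loss we mean, written for a PyTorch-style setup; the value of $p$, the weight on the ones, and the function name are illustrative placeholders rather than the settings we ended up using.

```python
# An L^p loss with extra weight on the target entries that should be 1, so that
# "output zero everywhere" is no longer a cheap way to get good loss on sparse targets.
import torch

def weighted_lp_loss(pred: torch.Tensor, target: torch.Tensor,
                     p: float = 4.0, one_weight: float = 20.0) -> torch.Tensor:
    """Mean of w * |pred - target|^p, with w = one_weight where target == 1, else 1."""
    weights = 1.0 + (one_weight - 1.0) * (target == 1).float()
    return (weights * (pred - target).abs().pow(p)).mean()

# Example: with sparse targets, the all-zeros prediction looks good under plain L^p,
# but is penalised much more once the ones are upweighted.
target = torch.zeros(8, 4950)
target[torch.arange(8), torch.randint(4950, (8,))] = 1.0
zeros = torch.zeros_like(target)
print(weighted_lp_loss(zeros, target, one_weight=1.0))   # unweighted
print(weighted_lp_loss(zeros, target))                   # weighted
```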
or the extension to arithmetic circuits (or other continuous/more continuous models of computation in superposition)
How are you setting $p$ when $d_0 = 100$? I might be totally misunderstanding something but $\log^2(d_0)/\sqrt{d} \approx 2.12$ at $d_0 = d = 100$ - feels like you need to push $d$ up towards like 2k to get something reasonable? (And the argument in 1.4 for using $\frac{1}{\log^2 d_0}$ clearly doesn’t hold here because it’s not greater than $\frac{\log^2 d_0}{d^{1/k}}$ for this range of values.)
So, all our algorithms in the post are hand-constructed with their asymptotic efficiency in mind, but without any guarantees that they will perform well at finite $d$. They haven’t even really been optimised hard for asymptotic efficiency: we think the important point is demonstrating that there are algorithms which work in the large-$d$ limit at all, rather than finding the best algorithms at any particular $d$ or in the limit. Also, all the quantities we talk about are at best correct up to constant factors, which would be important to track at finite $d$. We certainly don’t expect that real neural networks implement our constructions with weights that are exactly 0 or 1. Rather, neural networks probably do a messier thing which is (potentially substantially) more efficient, and we are not making predictions about the quantitative sizes of errors at a fixed $d$.
In the experiment in my comment, we randomly initialised a weight matrix with each entry drawn from $\mathcal{N}(0,1)$, set the bias to zero, and then tried to learn the readoff matrix $R$, in order to test whether U-AND is generic. This is a different setup to the U-AND construction in the post, and I offered a suggestion of readoff vectors for this setup in the comment, although that construction is also asymptotic: for finite $d$ and a particular random seed, there are almost certainly choices of readoff vectors that achieve lower error.
FWIW, the average error in this random construction (for fixed compositeness; a different construction would be required for inputs with varying compositeness) is, we think, $\Theta(1/\sqrt{d})$, with a constant that can be found by solving some ugly Gaussian integrals but which I would guess is less than 10; the max error is $\Theta(\log d/\sqrt{d})$ with high probability, with a constant that involves some even uglier Gaussian integrals.
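A rough heuristic for where the $1/\sqrt{d}$ comes from (this skips the integrals and is only meant to locate the scaling, so treat it as a sketch): once the constraints are normalised so that $d$ times the expected per-neuron contribution hits the target, the coefficients $\alpha, \beta, \gamma, \delta$ are $\Theta(1/d)$, so each per-neuron term $(\overrightarrow{\mathrm{AND}}_{i,j})_k a_k$ has standard deviation $\Theta(1/d)$. Summing $d$ independent neurons then gives a fluctuation of size $\sqrt{d} \cdot \Theta(1/d) = \Theta(1/\sqrt{d})$ around the target value; the Gaussian integrals are only needed to pin down the constant.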
Thanks for the kind feedback!
We agree and are keen to look into that!
Thanks—fixed.
Fascinating, thanks for the update!