How are you setting p when d0 = 100? I might be totally misunderstanding something, but log²(d0)/√d ≈ 2.12 at d0 = d = 100 - feels like you need to push d up towards like 2k to get something reasonable? (And the argument in 1.4 for using 1/log²(d0) clearly doesn't hold here, because it's not greater than log²(d0)/d^(1/k) for this range of values.)
So, all our algorithms in the post are hand-constructed with their asymptotic efficiency in mind, but without any guarantees that they will perform well at finite d. They haven't even really been optimised hard for asymptotic efficiency: we think the important point is demonstrating that there are algorithms which work in the large-d limit at all, rather than finding the best algorithm at any particular d or in the limit. Also, all the quantities we talk about are at best correct up to constant factors, which would be important to track at finite d. We certainly don't expect that real neural networks implement our constructions with weights that are exactly 0 or 1. Rather, neural networks probably do a messier thing which is (potentially substantially) more efficient, and we are not making predictions about the quantitative sizes of errors at a fixed d.
In the experiment in my comment, we randomly initialised a weight matrix with each entry drawn i.i.d. from N(0,1), set the bias to zero, and then tried to learn the readoff matrix R, in order to test whether U-AND is generic. This is a different setup from the U-AND construction in the post. I offered a suggestion of readoff vectors for this setup in the comment, although that construction is also asymptotic: at finite d and for a particular random seed, there are almost certainly choices of readoff vectors that achieve lower error.
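For concreteness, here is a minimal NumPy sketch of that setup, not the exact code from the comment: the sizes d0 = 100, d = 1000, k = 3, the sample counts, the random subset of pairs, and the least-squares fit standing in for the training procedure are all illustrative assumptions.

```python
# Minimal sketch of the setup described above (illustrative, not the exact
# code from the comment): W has i.i.d. N(0,1) entries, the bias is zero, and
# a linear readoff R from ReLU(W x) to pairwise ANDs is fit by least squares
# on training inputs and evaluated on held-out inputs.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d0, d, k = 100, 1000, 3            # input features, neurons, active features per input
n_train, n_test = 3000, 1000       # illustrative sample sizes

def sparse_boolean_inputs(n):
    """n boolean inputs, each with exactly k features on (fixed compositeness)."""
    X = np.zeros((n, d0))
    for row in X:
        row[rng.choice(d0, size=k, replace=False)] = 1.0
    return X

# A random subset of the C(d0, 2) pairs keeps the example light; nothing
# changes conceptually if all pairs are used.
all_pairs = np.array(list(combinations(range(d0), 2)))
pairs = all_pairs[rng.choice(len(all_pairs), size=500, replace=False)]
I, J = pairs[:, 0], pairs[:, 1]

W = rng.normal(size=(d, d0))       # random weight matrix, zero bias

X_tr, X_te = sparse_boolean_inputs(n_train), sparse_boolean_inputs(n_test)
H_tr, H_te = np.maximum(X_tr @ W.T, 0.0), np.maximum(X_te @ W.T, 0.0)
Y_tr, Y_te = X_tr[:, I] * X_tr[:, J], X_te[:, I] * X_te[:, J]   # targets: x_i AND x_j

R, *_ = np.linalg.lstsq(H_tr, Y_tr, rcond=None)   # "learn" the readoff matrix
err = np.abs(H_te @ R - Y_te)
print(f"mean |error| = {err.mean():.4f}, max |error| = {err.max():.4f}")
```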
FWIW, the average error in this random construction (for fixed compositeness; a different construction would be required for inputs with varying compositeness) is, we think, Θ(1/√d), with a constant that can be found by solving some ugly Gaussian integrals but which I would guess is less than 10; the max error is Θ(log d/√d) with high probability, with a constant that involves some even uglier Gaussian integrals.
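One way to eyeball those rates is to rerun the learned-readoff setup above at several values of d and check that the appropriately rescaled errors stay roughly flat. This uses a learned readoff rather than the readoff vectors suggested in the comment, plus finite training and test sets, so it is a rough consistency check rather than a verification, and the constants it produces are not meaningful; all sizes below are illustrative assumptions.

```python
# Rough numerical consistency check of the claimed rates, using a learned
# readoff (which should do at least as well as the suggested readoff
# vectors). If mean error is Θ(1/√d) and max error is Θ(log d/√d), the two
# rescaled columns printed below should stay roughly O(1) as d grows.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d0, k, n_test = 100, 3, 1000

all_pairs = np.array(list(combinations(range(d0), 2)))
pairs = all_pairs[rng.choice(len(all_pairs), size=300, replace=False)]
I, J = pairs[:, 0], pairs[:, 1]

def sparse_boolean_inputs(n):
    """n boolean inputs, each with exactly k features on (fixed compositeness)."""
    X = np.zeros((n, d0))
    for row in X:
        row[rng.choice(d0, size=k, replace=False)] = 1.0
    return X

for d in (250, 500, 1000, 2000):
    W = rng.normal(size=(d, d0))                      # random weights, zero bias
    X_tr, X_te = sparse_boolean_inputs(3 * d), sparse_boolean_inputs(n_test)
    H_tr, H_te = np.maximum(X_tr @ W.T, 0.0), np.maximum(X_te @ W.T, 0.0)
    R, *_ = np.linalg.lstsq(H_tr, X_tr[:, I] * X_tr[:, J], rcond=None)
    err = np.abs(H_te @ R - X_te[:, I] * X_te[:, J])
    print(f"d={d:5d}  mean·√d = {err.mean() * np.sqrt(d):6.3f}  "
          f"max·√d/log d = {err.max() * np.sqrt(d) / np.log(d):6.3f}")
```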