How are you setting p when d0 = 100? I might be totally misunderstanding something, but log²(d0)/√d ≈ 2.12 at d0 = d = 100 - feels like you need to push d up towards like 2k to get something reasonable? (And the argument in 1.4 for using 1/log²(d0) clearly doesn't hold here, because it's not greater than log²(d0)/d^(1/k) for this range of values.)
So, all our algorithms in the post are hand-constructed with their asymptotic efficiency in mind, but without any guarantees that they will perform well at finite d. They haven't even really been optimised hard for asymptotic efficiency: we think the important point is demonstrating that there are algorithms which work in the large-d limit at all, rather than finding the best algorithm at any particular d or in the limit. Also, all the quantities we talk about are at best correct up to constant factors, which would be important to track at finite d. We certainly don't expect that real neural networks implement our constructions with weights that are exactly 0 or 1. Rather, neural networks probably do a messier thing which is (potentially substantially) more efficient, and we are not making predictions about the quantitative sizes of errors at a fixed d.
In the experiment in my comment, we randomly initialised a weight matrix with each entry drawn i.i.d. from N(0,1), set the bias to zero, and then tried to learn the readoff matrix R, in order to test whether U-AND is generic. This is a different setup from the U-AND construction in the post. I offered a suggestion of readoff vectors for this setup in the comment, although that construction is also asymptotic: at finite d and for a particular random seed, there are almost certainly choices of readoff vectors that achieve lower error.
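For concreteness, here is a minimal NumPy sketch of that setup, not the exact code from the comment: the sizes d0 = 100, d = 1000, k = 3, the sample counts, the random subset of pairs, and the least-squares fit standing in for the training procedure are all illustrative assumptions.

```python
# Minimal sketch of the setup described above (illustrative, not the exact
# code from the comment): W has i.i.d. N(0,1) entries, the bias is zero, and
# a linear readoff R from ReLU(W x) to pairwise ANDs is fit by least squares
# on training inputs and evaluated on held-out inputs.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d0, d, k = 100, 1000, 3            # input features, neurons, active features per input
n_train, n_test = 3000, 1000       # illustrative sample sizes

def sparse_boolean_inputs(n):
    """n boolean inputs, each with exactly k features on (fixed compositeness)."""
    X = np.zeros((n, d0))
    for row in X:
        row[rng.choice(d0, size=k, replace=False)] = 1.0
    return X

# A random subset of the C(d0, 2) pairs keeps the example light; nothing
# changes conceptually if all pairs are used.
all_pairs = np.array(list(combinations(range(d0), 2)))
pairs = all_pairs[rng.choice(len(all_pairs), size=500, replace=False)]
I, J = pairs[:, 0], pairs[:, 1]

W = rng.normal(size=(d, d0))       # random weight matrix, zero bias

X_tr, X_te = sparse_boolean_inputs(n_train), sparse_boolean_inputs(n_test)
H_tr, H_te = np.maximum(X_tr @ W.T, 0.0), np.maximum(X_te @ W.T, 0.0)
Y_tr, Y_te = X_tr[:, I] * X_tr[:, J], X_te[:, I] * X_te[:, J]   # targets: x_i AND x_j

R, *_ = np.linalg.lstsq(H_tr, Y_tr, rcond=None)   # "learn" the readoff matrix
err = np.abs(H_te @ R - Y_te)
print(f"mean |error| = {err.mean():.4f}, max |error| = {err.max():.4f}")
```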
FWIW, the average error in this random construction (for fixed compositeness; a different construction would be required for inputs with varying compositeness) is, we think, Θ(1/√d), with a constant that can be found by solving some ugly Gaussian integrals but which I would guess is less than 10; the max error is Θ(log d/√d) with high probability, with a constant that involves some even uglier Gaussian integrals.
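One way to eyeball those rates is to rerun the learned-readoff setup above at several values of d and check that the appropriately rescaled errors stay roughly flat. This uses a learned readoff rather than the readoff vectors suggested in the comment, plus finite training and test sets, so it is a rough consistency check rather than a verification, and the constants it produces are not meaningful; all sizes below are illustrative assumptions.

```python
# Rough numerical consistency check of the claimed rates, using a learned
# readoff (which should do at least as well as the suggested readoff
# vectors). If mean error is Θ(1/√d) and max error is Θ(log d/√d), the two
# rescaled columns printed below should stay roughly O(1) as d grows.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d0, k, n_test = 100, 3, 1000

all_pairs = np.array(list(combinations(range(d0), 2)))
pairs = all_pairs[rng.choice(len(all_pairs), size=300, replace=False)]
I, J = pairs[:, 0], pairs[:, 1]

def sparse_boolean_inputs(n):
    """n boolean inputs, each with exactly k features on (fixed compositeness)."""
    X = np.zeros((n, d0))
    for row in X:
        row[rng.choice(d0, size=k, replace=False)] = 1.0
    return X

for d in (250, 500, 1000, 2000):
    W = rng.normal(size=(d, d0))                      # random weights, zero bias
    X_tr, X_te = sparse_boolean_inputs(3 * d), sparse_boolean_inputs(n_test)
    H_tr, H_te = np.maximum(X_tr @ W.T, 0.0), np.maximum(X_te @ W.T, 0.0)
    R, *_ = np.linalg.lstsq(H_tr, X_tr[:, I] * X_tr[:, J], rcond=None)
    err = np.abs(H_te @ R - X_te[:, I] * X_te[:, J])
    print(f"d={d:5d}  mean·√d = {err.mean() * np.sqrt(d):6.3f}  "
          f"max·√d/log d = {err.max() * np.sqrt(d) / np.log(d):6.3f}")
```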