It’s really interesting that (so far it seems) the quadratic activation can achieve the universal AND almost exponentially more efficiently than the ReLU function.
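(The identity I have in mind here, which may or may not match the post's construction exactly, is that for boolean inputs a quadratic of the sum already gives the AND:

$$x \wedge y \;=\; xy \;=\; \tfrac{1}{2}\big((x+y)^2 - x - y\big) \quad \text{for } x, y \in \{0,1\},$$

so one quadratic neuron applied to $x+y$, plus a linear correction, reads it off. Presumably the efficiency gap only shows up once $x$ and $y$ are noisy readouts from superposition rather than clean booleans, but that's exactly where I'd want to check the details.)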
It seems plausible to me that the ReLU activation could achieve the same effect by approximating the quadratic function in a piecewise way. From the construction, it seems that each component of the r2 space is the sum of at most 2 random variables, and when you add up a sparse combination of a large number of nearly orthogonal vectors, each component of the output should be approximately normally distributed. So each component could be bounded with high probability, and within those bounds the quadratic can be approximated by a piecewise linear function. I’m not sure how the math on the error bounds works out, but it sounds plausible that the error of the piecewise approximation is low enough to still compute the boolean AND accurately.
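To make the piecewise idea concrete, here's a rough numpy sketch (my own toy illustration, not the post's construction; the bound B and the number of segments are arbitrary choices) of building x^2 out of ReLU kinks on a bounded interval:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def piecewise_square(x, B=3.0, k=16):
    """Piecewise-linear interpolation of x**2 on [-B, B] using k segments.

    The affine baseline can be folded into a network's linear layers; each
    interior knot then costs one ReLU whose weight is the change in slope.
    (Toy illustration only; B and k are arbitrary.)
    """
    knots = np.linspace(-B, B, k + 1)
    y = knots ** 2
    slopes = np.diff(y) / np.diff(knots)        # slope of each linear segment
    out = y[0] + slopes[0] * (x - knots[0])     # affine part matching the first segment
    for t, ds in zip(knots[1:-1], np.diff(slopes)):
        out = out + ds * relu(x - t)            # add a kink at each interior knot
    return out

xs = np.linspace(-3.0, 3.0, 2001)
err = np.max(np.abs(piecewise_square(xs) - xs ** 2))
print(f"max error on [-3, 3] with 16 segments: {err:.4f}")  # about (2*B/k)**2 / 4
```

Since the interpolation error scales like 1/k², the number of ReLU kinks needed only grows like the inverse square root of the target error, at least in this one-dimensional toy.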
Also, I wonder if it’s possible to find experimental evidence that trained ReLU networks implement the x^2 function in a piecewise way like this. Basically, after training a ReLU network to compute the AND of all pairs of inputs, we check whether the network contains repeated implementations of a circuit that flips the sign of negative inputs and increases the output with a steeper slope once the input has large enough magnitude. Though the duplication of inputs before the ReLU and the recombination afterwards are linear operations that get mixed into the linear embedding and readout matrices, making it hard to tell which entries correspond to which… Maybe it can be done by probing one boolean variable at a time; a rough sketch of what that probe could look like is below. If a trained network is doing this, it would be evidence that this is how ReLU networks achieve exponential-size circuits.
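Here's roughly what I imagine that probe looking like (a hypothetical sketch; the network size, training setup, and sweep range are all made up, and real networks in superposition would be messier): train a small one-hidden-layer ReLU net on all pairwise ANDs, then sweep one input continuously while fixing a partner input to 1, and check whether the corresponding readout is convex with multiple kinks (the signature of a piecewise x^2) rather than a single kink at zero.

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical probe sketch: train a one-hidden-layer ReLU net to compute the
# AND of all input pairs, then look at its response to a one-dimensional sweep.
n, d_hidden = 8, 512
pairs = list(itertools.combinations(range(n), 2))
model = nn.Sequential(nn.Linear(n, d_hidden), nn.ReLU(), nn.Linear(d_hidden, len(pairs)))

# Train on random boolean inputs with pairwise-AND targets.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(3000):
    x = torch.randint(0, 2, (256, n)).float()
    y = torch.stack([x[:, i] * x[:, j] for i, j in pairs], dim=1)
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sweep x_0 with x_1 fixed to 1 and everything else 0, and read out the (0, 1) unit.
# The sweep deliberately goes outside the boolean training range, since the
# sign-flip / steeper-slope signature would only show up off-distribution.
xs = torch.linspace(-1.0, 2.0, 121)
probe = torch.zeros(len(xs), n)
probe[:, 1] = 1.0
probe[:, 0] = xs
with torch.no_grad():
    response = model(probe)[:, pairs.index((0, 1))]
print(torch.stack([xs, response], dim=1)[::10])
```

Sweeping one variable at a time also sidesteps the problem that the duplication and recombination get absorbed into the embedding and readout matrices, since it only looks at the end-to-end response.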