Having digested this a bit more, I’ve got a question regarding the noise terms, particularly for section 1.3 that deals with constructing general programs over sparse superposed variables.
> Unfortunately, since the $\vec{f}_1, \ldots, \vec{f}_m$ are random vectors, their inner product will have a typical size of $1/\sqrt{d_0}$. So, on an input which has no features connected to neuron $i$, the preactivation for that neuron will not be zero: it will be a sum of these interference terms, one for each feature that is connected to the neuron. Since the interference terms are uncorrelated and mean zero, they start to cause neurons to fire incorrectly when $\Theta(d_0)$ features are connected to each neuron. Since each feature is connected to each neuron with probability $p = \log^2(d_0)/\sqrt{d}$, this means neurons start to misfire when $m = \tilde{\Theta}(d_0\sqrt{d})$.[13]
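To make sure I’m reading that scaling right, here is a quick numpy sketch of the uncorrelated case (toy sizes of my own choosing, not from the post): a neuron’s weight vector is the sum of $k$ random unit feature vectors, and we measure the interference it picks up from one active feature it is not connected to.

```python
import numpy as np

rng = np.random.default_rng(0)
d0 = 512                               # toy embedding dimension
for k in [16, 64, 256, 1024]:          # number of features connected to one neuron
    vals = []
    for _ in range(200):
        f = rng.standard_normal((k + 1, d0))
        f /= np.linalg.norm(f, axis=1, keepdims=True)  # random unit feature vectors
        w = f[:k].sum(axis=0)          # neuron weight: sum of its k connected features
        x = f[k]                       # one active feature that is NOT connected
        vals.append(w @ x)             # pure interference at this neuron
    print(k, float(np.std(vals)), np.sqrt(k / d0))     # measured spread vs sqrt(k/d0)
```

If the interference really is uncorrelated, the measured spread should track $\sqrt{k}/\sqrt{d_0}$, which is what lets the per-neuron budget stretch to $\Theta(d_0)$ connected features before the noise reaches the firing threshold.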
It seems to me that the assumption of uncorrelated errors here is rather load-bearing. If you don’t get uncorrelated errors over the inputs you actually care about, you are forced to scale back to connecting only $\sqrt{d_0}$ features to every neuron, correct? And the same holds for the construction right after this one, and probably most of the other constructions shown here?
And if you only get $\sqrt{d_0}$ connected features per neuron, you scale back to only being able to compute $|E| = \tilde{\Theta}(\sqrt{d_0}\,d)$ arbitrary AND gates per layer, correct?
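Spelling out the arithmetic I’m relying on for that $\sqrt{d_0}$ figure (assuming a $\Theta(1)$ firing threshold and $k$ interference terms of typical size $1/\sqrt{d_0}$ arriving at each neuron):

```latex
\[
\begin{aligned}
\text{uncorrelated, mean zero:}&\quad
  \Big|\textstyle\sum_{j=1}^{k}\varepsilon_j\Big| \sim \frac{\sqrt{k}}{\sqrt{d_0}}
  \;=\;\Theta(1)\ \text{ when } k = \Theta(d_0),\\[4pt]
\text{systematically same-signed:}&\quad
  \Big|\textstyle\sum_{j=1}^{k}\varepsilon_j\Big| \sim \frac{k}{\sqrt{d_0}}
  \;=\;\Theta(1)\ \text{ when } k = \Theta\!\big(\sqrt{d_0}\big).
\end{aligned}
\]
```

If I’m reading the construction right, the number of ANDs per layer then scales down proportionally with the per-neuron budget, from $\tilde{\Theta}(d_0 d)$ to $\tilde{\Theta}(\sqrt{d_0}\,d)$.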
Now, the reason these errors are ‘uncorrelated’ is that the features were embedded as random vectors in our layer space. In other words, the distribution over which they are uncorrelated is the distribution of feature embeddings and of the sets of neurons chosen to connect to particular features. So for any given network, we draw from this distribution only once, when the weights of the network are set, and then we are locked into it.
So this noise will affect particular sets of inputs strongly, systematically, in the same direction every time. If I divide the set of features into two halves, where the features in each half are embedded along directions that have a positive inner product with each other[1], I can’t connect more than $\sqrt{d_0}$ features from the same half to the same neuron without making it misfire, right? So if I want to implement a layer that performs $|E| = \tilde{\Theta}(d_0 d)$ ANDs on exactly those features that happen to be embedded within the same half, I can’t really do that. Now, for any given embedding, that’s maybe only some particular sets of features which might not have much significance to each other. But then the embedding directions of features in later layers depend on what was computed, and how, in the earlier layers, and the limitations on what I can wire together apply every time.
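Here is a rough numpy check of that worry, using the recipe from footnote [1] to build the ‘same half’ (toy sizes, my own sketch, not the post’s construction): keep only features on the positive side of a few random hyperplanes, connect $k$ of them to one neuron, and measure the interference from one more same-side feature that is not connected. The printed mean is the systematic component of that interference and the standard deviation is the part that averages out; the mean should grow roughly linearly in $k$ while the spread grows only like $\sqrt{k}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d0, n_planes = 128, 5                           # toy sizes
normals = rng.standard_normal((n_planes, d0))   # the hyperplanes from footnote [1]

def sample_same_side(n):
    """Random unit feature vectors lying on the positive side of every hyperplane."""
    kept = []
    while len(kept) < n:
        batch = rng.standard_normal((4096, d0))
        kept.extend(batch[np.all(batch @ normals.T > 0, axis=1)])
    feats = np.array(kept[:n])
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

for k in [4, 16, 64]:                           # same-side features connected to one neuron
    sums = []
    for _ in range(100):
        f = sample_same_side(k + 1)
        w = f[:k].sum(axis=0)                   # neuron weight: sum of its connected features
        x = f[k]                                # active same-side feature, not connected
        sums.append(w @ x)
    print(k, float(np.mean(sums)), float(np.std(sums)))  # systematic vs averaged-out interference
```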
I am a bit worried that this and similar assumptions about stochasticity might turn out to prevent you from wiring together the features you need to construct arbitrary programs in superposition, with ‘noise’ from multiple layers turning out to interact systematically in exactly such a way as to prevent you from computing too much general stuff. Not because I see a gears-level way this could happen right now, but because rounding off things to ‘noise’ that are actually systematic is one of the ways an exciting new theory can go wrong and see structure that isn’t there: you aren’t tracking the parts of the system that you have labelled noise, so you don’t see how the systematics of their interactions constrain the rest of the system.
Like making what seems like a blueprint for a perpetual motion machine because you’re neglecting to model some small interactions with the environment that seem like they ought not to affect the energy balance on average, missing how the energy losses/gains in these interactions are correlated with each other such that a gain at one step immediately implies a loss at another.
Aside from looking at error propagation more, maybe a way to resolve this would be to switch over to thinking about one particular set of weights instead of reasoning about the distribution the weights are drawn from?
Thinking the example through a bit further: in a ReLU layer, the feature vectors are all confined to the positive orthant. So superposed features computed in a ReLU layer all have positive inner product with each other. So if I send the output of one ReLU layer implementing $n^2$ AND gates in superposition directly to another ReLU layer implementing another $n^2$ ANDs on a subset of the outputs of that previous layer[2], the assumption that input directions are equally likely to have positive and negative inner products is not satisfied.
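A quick numpy check of how big that effect is (my own toy numbers, not the post’s): the cosine similarity between two independent post-ReLU Gaussian vectors concentrates around a positive constant, roughly $1/\pi$, instead of around $0$ with spread $\sim 1/\sqrt{d}$ as for raw random directions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, trials = 1024, 2000
a, b = rng.standard_normal((2, trials, d))      # pairs of independent random vectors

def mean_cos(x, y):
    """Average cosine similarity over paired rows of x and y."""
    return float(np.mean(np.sum(x * y, axis=1) /
                         (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))))

print("raw Gaussian vectors:", mean_cos(a, b))                                 # ~ 0
print("after ReLU:          ", mean_cos(np.maximum(a, 0), np.maximum(b, 0)))   # ~ 1/pi
```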
Maybe you can fix this with bias offsets somehow? Not sure at the moment. But as currently written, it doesn’t seem like I can use the outputs of one layer performing a subset of ANDs as the inputs of another layer performing another subset of ANDs.
EDIT: Talked it through with Jake. Bias offsets can help, but it currently looks to us like you still end up with AND gates that share a variable having a systematically positive inner product, which might make it difficult to implement a valid general recipe for multi-step computation once you try to work out the details.
[1] E.g. pick some hyperplanes and declare everything on one side of all of them to be the first set.
[2] A very central use case for a superposed boolean general computer. Otherwise you don’t actually get to implement any serial computation.