The typical noise on feature f1 caused by 1 unit of activation from feature f2, for any pair of features f1, f2, is (derived from the Johnson–Lindenstrauss lemma) roughly √(8 ln(m)/n).
… This is a worst-case scenario. I have not calculated the typical case, but I expect it to be somewhat less, though still the same order of magnitude.
Perhaps I’m misunderstanding your claim here, but the “typical” (i.e. RMS) inner product between two independently random unit vectors in R^n is n^(-1/2). So I think the √(8 ln m) shouldn’t be there, and the rest of your estimates are incorrect.
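(A quick numerical check of the n^(-1/2) scaling, in case it’s useful; the dimensions and sample count below are arbitrary choices for the demo.)

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_inner_product(n, trials=20_000):
    """RMS inner product between independently random unit vectors in R^n."""
    u = rng.normal(size=(trials, n))
    v = rng.normal(size=(trials, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.sqrt(np.mean(np.sum(u * v, axis=1) ** 2))

for n in (64, 256, 1024):
    print(f"n={n:5d}  RMS inner product={rms_inner_product(n):.4f}  n^(-1/2)={n**-0.5:.4f}")
```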
This means that we can have at most l < n/(8 ln(m)) simultaneously active features
This conclusion gets changed to l < n.
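Roughly, the arithmetic behind both bounds (a sketch, assuming the l interference terms are independent, so the noise adds in quadrature rather than coherently):

```latex
% Per-pair interference eps, l simultaneously active unit-activation features,
% noise terms assumed independent => total noise ~ sqrt(l) * eps.
\sqrt{l}\,\epsilon \lesssim 1 \quad\Longrightarrow\quad l \lesssim \epsilon^{-2}

% Worst-case (JL) interference:
\epsilon \approx \sqrt{8\ln(m)/n} \quad\Longrightarrow\quad l \lesssim \frac{n}{8\ln(m)}

% Typical (RMS) interference:
\epsilon \approx n^{-1/2} \quad\Longrightarrow\quad l \lesssim n
```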
Good point. I need to think about this a bit more. Thanks.
Just quickly writing up my thoughts for now...
What I think is going on here is that the Johnson–Lindenstrauss lemma gives a bound on how well you can do, so it’s more like a worst-case scenario. I.e., the Johnson–Lindenstrauss lemma gives you the worst-case error for the best possible feature embedding.
I’ve assumed that the typical noise would be of the same order of magnitude as the worst case, but now I think I was wrong about this for large m.
I’ll have to think about which is more important, the worst case or the typical case. When adding up noise, one should probably use the typical case. But when calculating how many features can fit in, one should probably use the worst case.
I think it’s pretty tricky, because what matters to real networks is the cost difference between storing features pseudo-linearly (in superposition), versus storing them nonlinearly (in one of a host of ways that take multiple nn layers to decode), versus not storing them at all. Calculating such a cost function seems like it has details that depend on the particulars of the network and training process, making it a total pain to try to mathematize (but maybe amenable to toy models).
I think it’s reasonable to think about what can be stored in a way that can be read off linearly (by the next layer), since those are the features that can be directly used in the next layer.
storing them nonlinearly (in one of a host of ways that take multiple nn layers to decode)
If it takes multiple nn layers to decode, then the nn needs to unpack it before using it, and represent it as a linearly readable feature later.
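As a concrete toy picture of what “linearly readable” means here, a minimal sketch with random feature directions and a plain dot-product readout; all numbers are arbitrary choices for illustration. The stored activation comes back as roughly 1 plus interference noise of order √(l/n).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, l = 256, 4000, 20   # neurons, features, simultaneously active features (arbitrary)

# m feature directions stored in superposition as random unit vectors in R^n
features = rng.normal(size=(m, n))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# activate l features with unit activation each
active = rng.choice(m, size=l, replace=False)
hidden = features[active].sum(axis=0)

# linear readout: just a dot product with a feature's stored direction
inactive = next(i for i in range(m) if i not in set(active))
print("readout of an active feature:  ", hidden @ features[active[0]])  # ~1 + noise
print("readout of an inactive feature:", hidden @ features[inactive])   # ~0 + noise
print("expected noise scale sqrt(l/n):", np.sqrt(l / n))
```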
I think the √(8 ln m) may be in there because JL is putting an upper bound on the interference, rather than describing the typical interference between two features. As you increase m (more features), it becomes more difficult to choose feature embeddings such that no feature has high interference with any other feature.
So it’s not really the ‘typical’ noise between any two given features, but it might be the relevant bound for the noise anyway? Not sure right now which one matters more for practical purposes.
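FWIW, a quick numerical illustration of that distinction, using random (not optimized) feature directions, so the worst pair here is if anything pessimistic compared to a good JL-style embedding: the typical overlap stays near 1/√n, while the worst pair grows with m, roughly like √(ln m)/√n.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256   # embedding dimension (arbitrary choice for the demo)

for m in (300, 1000, 3000):
    # m random unit vectors as stand-in feature embeddings
    f = rng.normal(size=(m, n))
    f /= np.linalg.norm(f, axis=1, keepdims=True)

    overlap = np.abs(f @ f.T)
    np.fill_diagonal(overlap, 0.0)   # ignore each feature's overlap with itself

    rms = np.sqrt((overlap ** 2).sum() / (m * (m - 1)))
    print(f"m={m:5d}  typical overlap={rms:.3f}  worst pair={overlap.max():.3f}  "
          f"1/sqrt(n)={n ** -0.5:.3f}  sqrt(8 ln m / n)={np.sqrt(8 * np.log(m) / n):.3f}")
```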