Quadratic complexity isn’t that bad, if this turns out to be useful. If your feature vectors are normalized you can do it faster by taking the matrix product of the weights “the big way” (i.e. the full Gram matrix) and just penalizing its entries for being far from one on the diagonal and zero off it. I think?
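Roughly what I have in mind, as a PyTorch sketch (the matrix name, shapes, and the squared-error form of the penalty are placeholder assumptions on my part, not anything pinned down here):

```python
import torch
import torch.nn.functional as F

def gram_orthogonality_penalty(W: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise similarity between feature vectors.

    W: hypothetical feature/decoder matrix of shape (num_features, d_model).
    With rows normalized, G = W @ W.T has ones on the diagonal automatically,
    so the penalty only needs to push the off-diagonal entries toward zero.
    """
    Wn = F.normalize(W, dim=-1)                     # unit-norm rows
    G = Wn @ Wn.T                                   # (F, F) Gram matrix -- the quadratic part
    off_diag = G - torch.diag(torch.diagonal(G))    # zero out the diagonal
    return off_diag.pow(2).mean()
```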
Feature vector normalization is itself an example of a quadratic thing that makes it in.
Even for a fairly small target model we might want to discover e.g. 100K features, and the input vectors might be 768-dimensional. That’s a lot of work to compute that matrix!
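Back-of-the-envelope: a full 100K × 100K similarity matrix has ~10^10 entries (~40 GB in fp32), and computing it costs roughly 100,000 × 100,000 × 768 ≈ 7.7 × 10^12 multiply-adds per evaluation.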
Hm. Okay, I remembered a better way to improve efficiency: neighbor lists. For each feature, keep a list of its closest neighbors, and compute your “closeness loss” only from dot products against the features in that list.
The neighbor list itself can either be recomputed once in a while using the naive method, or you can accelerate the neighbor list recomputation by keeping more coarse-grained track of where features are in activation-space.
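A rough sketch of the neighbor-list version, again in PyTorch with made-up names and an arbitrary choice of k; the point is just that the per-step loss touches F × k dot products instead of F × F:

```python
import torch
import torch.nn.functional as F

def build_neighbor_lists(W: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Naive O(F^2) rebuild: for each feature, record its k most similar features.
    Only run this occasionally, not every training step."""
    Wn = F.normalize(W, dim=-1)
    sims = Wn @ Wn.T                                  # (F, F) similarities
    sims.fill_diagonal_(-float("inf"))                # a feature is not its own neighbor
    return sims.topk(k, dim=-1).indices               # (F, k) neighbor indices

def neighbor_closeness_loss(W: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """Closeness loss computed only against each feature's cached neighbors: O(F * k)."""
    Wn = F.normalize(W, dim=-1)
    neighbor_vecs = Wn[neighbors]                     # (F, k, d)
    dots = (Wn.unsqueeze(1) * neighbor_vecs).sum(-1)  # (F, k) dot products
    return dots.pow(2).mean()
```

The naive rebuild is still quadratic, but you only pay for it once in a while rather than on every step.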
Thanks, I mentioned this as a potential way forward for tackling quadratic complexity in my edit at the end of the post.