Even for a fairly small target model we might want to discover e.g. 100K features, and the input vectors might be e.g. 768D. That’s a lot of work to compute that matrix!
Hm. Okay, I remembered a better way to improve efficiency: neighbor lists. For each feature, remember a list of its closest neighbors, and compute your “closeness loss” using only the dot products within that list.
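To make that concrete, here is a minimal sketch in PyTorch (not anyone's actual code: it assumes the features are rows of a `(num_features, d)` tensor, that `neighbors` holds each feature's k nearest-neighbor indices, and the squared-cosine penalty is just one plausible choice of loss):

```python
import torch
import torch.nn.functional as F

def neighbor_closeness_loss(features: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """Closeness loss over stored neighbor lists only: O(N*k*d) instead of O(N^2*d).

    features:  (N, d) feature direction vectors
    neighbors: (N, k) integer indices of each feature's current nearest neighbors
    """
    f = F.normalize(features, dim=-1)             # unit-normalize so dot products are cosines
    neighbor_vecs = f[neighbors]                  # (N, k, d): gather each feature's neighbors
    sims = torch.einsum("nd,nkd->nk", f, neighbor_vecs)  # dot products within each list
    # Penalize features for sitting too close to their listed neighbors.
    return sims.pow(2).mean()
```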
The neighbor lists themselves can either be recomputed once in a while using the naive method, or the recomputation can be accelerated by keeping more coarse-grained track of where features sit in activation-space.
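And a similarly hedged sketch of the occasional brute-force refresh, chunked so the full N×N similarity matrix never has to be held in memory at once (the values of `k` and `chunk_size` are arbitrary assumptions, and the coarser-grained spatial tracking is left out):

```python
import torch
import torch.nn.functional as F

def recompute_neighbor_lists(features: torch.Tensor, k: int = 16,
                             chunk_size: int = 4096) -> torch.Tensor:
    """Brute-force top-k neighbor refresh, run only once in a while."""
    f = F.normalize(features, dim=-1)
    chunks = []
    for start in range(0, f.shape[0], chunk_size):
        block = f[start:start + chunk_size]       # (c, d) slice of features
        sims = block @ f.T                        # (c, N) similarities for this chunk
        # Mask self-similarity so a feature is never its own neighbor.
        rows = torch.arange(block.shape[0])
        sims[rows, rows + start] = float("-inf")
        chunks.append(sims.topk(k, dim=-1).indices)
    return torch.cat(chunks)                      # (N, k) neighbor indices
```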
Thanks, I mentioned this as a potential way forward for tackling quadratic complexity in my edit at the end of the post.