This is cool! These sparse features should be easily “extractable” by the transformer’s key, query, and value weights in a single layer. Therefore, I’m wondering if these weights can somehow make it easier to “discover” the sparse features?
This is something we’re planning to look into! From the paper:
“Future efforts could also try to improve feature dictionary discovery by incorporating information about the weights of the model or dictionary features found in adjacent layers into the training process.”
Exactly how to use them is something we’re still working on...