research_prime_space comments on Comparing Anthropic’s Dictionary Learning to Ours

research_prime_space 20 Oct 2023 18:06 UTC
LW: 2 AF: 2
This is cool! These sparse features should be easily “extractable” by the transformer’s key, query, and value weights in a single layer. Therefore, I’m wondering if these weights can somehow make it easier to “discover” the sparse features?
- Robert_AIZI 21 Oct 2023 13:08 UTC
  1 point
  This is something we’re planning to look into! From the paper:
  “Future efforts could also try to improve feature dictionary discovery by incorporating information about the weights of the model or dictionary features found in adjacent layers into the training process.”
  Exactly how to use them is something we’re still working on...