David Udell comments on Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell 26 Sep 2023 0:47 UTC
2 points
0
I agree that stronger, more nuanced interpretability techniques should tell you more. But, when you see something like, e.g.,
25132 ▁vs, ▁differently, ▁compared, ▁greater, all, ▁per
25134 ▁I, ▁My, I, ▁personally
isn’t it pretty obvious what those two autoencoder neurons were each doing?
- Charlie Steiner 26 Sep 2023 1:23 UTC
  4 points
  0
  Parent
  It does seem obvious^[1], but I think this can easily be misleading. Are these activation directions always looking for these tokens regardless of context, or are they detecting the human-obvious theme they seem to be gesturing towards, or are they playing a more complicated functional role that merely happens to be activated by those tokens in the first position?
  E.g. Is the “▁vs, ▁differently, ▁compared” direction just a brute detector for those tokens? Or is it a more general detector for comparison and counting that would have rich but still human-obvious behavior on longer snippets? Or is it part of a circuit that needs to detect comparison words but is actually doing something totally different like completing discussions about shopping lists?
  1. ^
    certainly more so than
    31892 ▁she, bian, ▁recently, ▁means, ▁Because, ▁experienced