eggsyntax comments on Self-explaining SAE features

eggsyntax 10 Sep 2024 14:44 UTC
1 point
0
Thanks! I think the post (or later work) might benefit from a discussion of using your judgment as a proxy for accuracy: its strengths and weaknesses, and maybe a worked example. I’m somewhat skeptical of human judgment because I’ve seen a fair number of cases where a feature seemed (to me) to represent one thing, and that turned out to be incorrect on further examination (e.g. when an LLM uses my explanation to score whether a particular piece of text should trigger the feature, and the explanation turns out not to do a good job of that).
- eggsyntax 10 Sep 2024 14:47 UTC
  1 point
  0
  One common type of example is a feature that clearly activates in the presence of a particular word, but when I try to use that to predict whether the feature will activate on a particular text, it turns out that the feature activates on that word in some cases but not all, sometimes with no pattern I can discover that distinguishes when it does and doesn’t.
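
To make that failure mode concrete, here is a minimal sketch of how one might check a "fires on word X" explanation against recorded activations. Everything in it is an assumption for illustration: the function name `score_word_explanation`, the threshold, and the toy tokens and activation values are invented, not taken from the post or from any real SAE.

```python
from typing import List, Tuple

Example = Tuple[List[str], List[float]]  # (tokens, per-token feature activations)


def score_word_explanation(
    examples: List[Example], word: str, threshold: float = 0.0
) -> Tuple[float, float]:
    """Score the explanation 'the feature fires on `word`' token by token.

    Returns (precision, recall) of the explanation's predictions against
    whether the recorded activation exceeds `threshold`.
    """
    tp = fp = fn = 0
    for tokens, activations in examples:
        for token, activation in zip(tokens, activations):
            predicted = token.lower() == word.lower()
            actual = activation > threshold
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual and not predicted:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented toy data: the feature fires on "bank" in only some contexts.
examples = [
    (["the", "bank", "of", "the", "river"], [0.0, 0.0, 0.0, 0.0, 0.0]),
    (["the", "bank", "approved", "the", "loan"], [0.0, 2.1, 0.0, 0.0, 0.0]),
    (["a", "bank", "holiday"], [0.0, 0.0, 0.0]),
    (["my", "bank", "account"], [0.0, 1.7, 0.0]),
]

precision, recall = score_word_explanation(examples, "bank")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the explanation has perfect recall (every activation is on the word) but only half of its predictions are correct, which is the pattern described above: the word looks like the whole story when browsing top activations, yet it is a poor predictor of when the feature fires.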