Interesting work, thank you! A couple of thoughts:
In your evaluation section, when you classify cases as, for example, “Self-explanation is correct, auto-interp is not”, it’s not entirely clear to me how you’re identifying ground truth there. Human analysis of max activating examples, maybe? It doesn’t seem clear to me that human analysis of a feature is always accurate either, especially at lower magnitudes of activation, where the meaning may shift somewhat.
In your example graphs of self-similarity, I notice a consistent pattern: the graph is a very smooth, concave function, looking something like the top half of a sigmoid or a logarithmic curve, but interrupted in the middle by a convex or near-convex section. In the examples I looked at, it seems plausible that the boundaries of that convex section are the boundaries of the activation-magnitude range where the correct explanation applies. That could very well be an artifact of the particular examples, but I thought I’d toss it out for consideration.
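In case it’s useful, here’s roughly how one might check for that convex section programmatically. The curve below is synthetic (a concave log-like curve with a dip carved out), and the second-difference test is just the simplest convexity proxy I could think of; none of this is from the post.

```python
import numpy as np

def convex_region(curve):
    """Indices of the longest contiguous run where the curve is locally convex
    (positive discrete second difference)."""
    convex_idx = np.flatnonzero(np.diff(curve, n=2) > 0) + 1  # +1 centres the 2nd diff
    if convex_idx.size == 0:
        return None
    runs = np.split(convex_idx, np.flatnonzero(np.diff(convex_idx) != 1) + 1)
    longest = max(runs, key=len)
    return longest[0], longest[-1]

# Synthetic stand-in for a self-similarity curve: smooth and concave overall,
# with a convex dip carved out in the middle.
scales = np.linspace(0.1, 10, 200)
curve = np.log1p(scales)
curve[80:120] -= 0.15 * np.hanning(40)

lo, hi = convex_region(curve)
print(f"convex section spans magnitudes {scales[lo]:.2f} to {scales[hi]:.2f}")
```

If the pattern is real, the magnitudes it reports should roughly bracket the range over which the explanation holds; on noisy real curves you’d presumably want some smoothing first.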
We use our own judgement as a (potentially very inaccurate) proxy for explanation accuracy, and let readers check for themselves in the feature dashboard interface. We judge based on a random sample of examples at different levels of activation. We had an automatic interpretation scoring pipeline that used Llama 3 70B, but we didn’t use it here because (IIRC) it was too slow to run with multiple explanations per feature. Perhaps a method like that is now practical.
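For concreteness, a scoring loop along those lines might look something like the sketch below. The prompt wording, the yes/no parsing, the firing threshold, and the `judge` callable (which would wrap whatever model endpoint you have, e.g. Llama 3 70B) are all placeholders, not our actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    text: str
    activation: float  # the feature's actual activation on this snippet

def score_explanation(explanation: str,
                      examples: list[Example],
                      judge: Callable[[str], str],
                      fire_threshold: float = 0.0) -> float:
    """Fraction of examples where the judge's yes/no prediction matches
    whether the feature actually fired (activation above the threshold)."""
    correct = 0
    for ex in examples:
        prompt = (
            f"Feature explanation: {explanation}\n"
            f"Text: {ex.text}\n"
            "Should this feature activate on the text? Answer yes or no."
        )
        predicted = judge(prompt).strip().lower().startswith("yes")
        actual = ex.activation > fire_threshold
        correct += predicted == actual
    return correct / len(examples)
```

The cost scales with (features × explanations per feature × examples per explanation), which is why it was too slow for us at the time.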
That pattern does show up frequently, but we’re not confident enough to propose a particular functional form. It is sometimes thrown off by random spikes, by self-similarity gradually rising at larger scales, or by entropy peaking at the beginning. Because of this, there is still a lot of room for improvement in cases where a human (or maybe a peak-finding algorithm) could do better than our linear metric.
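If it helps, here is roughly the kind of thing we mean by a peak-finding alternative; the curve and the prominence threshold below are synthetic and illustrative, not something we actually ran.

```python
# Sketch: pick the scale at the most prominent peak of a self-similarity curve,
# rather than fitting a linear metric. Curve and parameters are made up.
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
scales = np.linspace(0.1, 10, 200)
# gradual rise plus a bump around scale ~4, plus a little noise
self_similarity = 0.05 * scales + 0.6 * np.exp(-((scales - 4.0) ** 2) / 0.5)
self_similarity += rng.normal(0, 0.01, scales.shape)

peaks, props = find_peaks(self_similarity, prominence=0.1)
if peaks.size:
    best = peaks[np.argmax(props["prominences"])]
    print(f"most prominent peak at scale {scales[best]:.2f}")
```

A prominence filter like this would ignore small random spikes, though it would still be confused by curves that rise gradually without a clear peak.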
Thanks! I think the post (or later work) might benefit from a discussion of using your judgement as a proxy for accuracy: its strengths and weaknesses, maybe a worked example. I’m somewhat skeptical of human judgement because I’ve seen a fair number of cases where a feature seemed (to me) to represent one thing, and that turned out to be incorrect on further examination (e.g. my explanation, when used by an LLM to score whether a particular piece of text should trigger the feature, turned out not to do a good job of it).
One common type of example: a feature clearly activates in the presence of a particular word, but when I try to use that to predict whether the feature will activate on new text, it turns out the feature activates on that word in some cases but not others, sometimes with no pattern I can discover to distinguish when it does and doesn’t.
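As a toy illustration of what I mean (the snippets and activations below are made up; in practice they would be examples pulled from the dashboard), the check is just “how often does the word’s presence actually coincide with the feature firing”:

```python
# Made-up (text, activation) pairs for a feature that seems tied to one word.
examples = [
    ("the bank raised interest rates", 3.2),
    ("my bank froze the account", 0.0),
    ("we walked along the river bank", 2.1),
    ("the bank approved the loan", 0.0),
    ("no mention of the word at all", 0.0),
]
word = "bank"

with_word = [(t, a) for t, a in examples if word in t.split()]
fired = [a > 0 for _, a in with_word]
print(f"'{word}' present in {len(with_word)} snippets; "
      f"feature fired on {sum(fired)} of them "
      f"({sum(fired) / len(with_word):.0%})")
```

When that fraction lands well away from 0% or 100% and I can’t find a secondary condition that explains it, I stop trusting my one-line explanation of the feature.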