I’m confused about SAE feature descriptions. In both Anthropic’s and Google’s demos, there are a lot of descriptions that don’t seem to match a naked-eye reading of the top activations. (E.g., “Slurs targeting sexual orientation” also has a number of racial slurs in its top activations; the top activations for “Korean text, Chinese name yunfan, Unicode characters” are almost all the word “fused” in a metal-related context; etc.) I’m not sure whether these short names are the automated Claude descriptions or whether there are longer, more accurate real descriptions somewhere; and if these are the automated descriptions, I’m not sure whether there’s some reason to think they’re more accurate than they look, whether it doesn’t matter that they’re slightly off, or some third thing?
These are LLM-generated labels; there are no “real” labels (because those would be expensive!). In our demo especially, Neuronpedia made them with GPT-3.5, which is kinda dumb.
I mostly think they’re much better than nothing but shouldn’t be trusted, and I’m glad our demo makes this apparent to people! I’m excited about work to improve autointerp, though unfortunately the easiest way is to use a better model, which gets expensive.
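For anyone unfamiliar with how these labels get made: here is a rough sketch of the kind of call an autointerp pipeline makes, assuming the OpenAI Python client. The prompt wording, the marker convention, and the parameters are illustrative guesses on my part, not Neuronpedia’s actual code.

```python
# Illustrative autointerp-style labeling call (not Neuronpedia's actual pipeline).
# Assumes top-activating examples were collected elsewhere, with the activating
# token marked (here with << >>), and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def label_feature(top_examples: list[str], model: str = "gpt-3.5-turbo") -> str:
    """Ask an LLM for a short description of what a feature fires on."""
    examples = "\n".join(f"- {ex}" for ex in top_examples)
    prompt = (
        "Below are text snippets on which one SAE feature activates most "
        "strongly; the activating token is marked with << >>.\n"
        f"{examples}\n\n"
        "In one short phrase, describe what this feature responds to."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=30,
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()

# Example usage (hypothetical snippets):
# label_feature(["the metal <<fused>> to the frame", "wires <<fused>> together"])
```

The cheapness of the judge model is the whole story here: a single short completion per feature is affordable across millions of features, but a weak model will happily overfit its phrase to a few salient examples.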
One can think of these as cases where auto-interp exhibits a precision–recall trade-off. At one extreme, you can generate super broad descriptions like “all English text”, which capture a lot but are overkill; at the other, you can generate very specific ones like “Slurs targeting sexual orientation”, which risk missing other things the feature fires on, such as racial slurs.
Section 4.3 of the OpenAI SAE paper also discusses this point.
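To make the trade-off concrete, here’s a toy sketch (my own framing, not from the paper): treat the description as a classifier over text snippets and score it against activating vs. non-activating examples. The `matches` helper below is a stand-in keyword judge; real autointerp scoring typically uses an LLM to simulate activations from the description.

```python
# Toy precision/recall scoring of a feature description against a held-out
# set of snippets. All names and data here are hypothetical illustrations.
def matches(label_keywords: set[str], text: str) -> bool:
    """Stand-in judge: does the description 'predict' this snippet fires?"""
    return any(k in text.lower() for k in label_keywords)

def score_label(label_keywords: set[str], activating: list[str], non_activating: list[str]):
    tp = sum(matches(label_keywords, t) for t in activating)      # correctly predicted firings
    fp = sum(matches(label_keywords, t) for t in non_activating)  # predicted firings that don't happen
    fn = len(activating) - tp                                      # real firings the label misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A very narrow label ("slurs targeting sexual orientation") misses the
# racial-slur examples -> high precision, low recall. A very broad label
# ("all English text") matches nearly everything -> high recall, low precision.
```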