Forgive me if the answer to this would be obvious given more familiarity with natural abstractions, but is your claim that interpretability research should identify mathematically defined high-level features rather than fuzzily defined features? Suppose that, in an optimistic version of interpretability, we're able to say that this neuron corresponds to this one concept and that this one circuit in the network is responsible for this one task (and we don't have to worry about polysemanticity). How do we define concepts like "trees" and "summarizing text in a way that labelers like" in a mathematical way?
So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc.) are naturally expressible in an AI's internal language (which itself probably includes a lot of mathematics) is an empirical question, and that's the main question which determines what we should target.
Do you expect that the network will have an accurate understanding of its goals? I'd expect that we could train an agentic language model which is still quite messy and isn't able to reliably report information about itself, and even if it could, it probably wouldn't know how to express that information mathematically. I think a model could be able to write a lot of text about human values and corrigibility and yet fail to have a crisp or mathematical concept for either of them.