I want to register a weak but nonzero prediction that Anthropic's interpretability publication, A Mathematical Framework for Transformer Circuits, will turn out to lead to large capabilities gains, and that in hindsight its publication will be regarded as a rather bad move.
Something like: we'll see capabilities-advancing papers citing it and using its framework to justify architecture improvements.
Agreed, though I don't think this is bad, nor that they did anything other than become the people who implemented what the zeitgeist demanded. It was the obvious next step; if they hadn't done it, someone who cared less about using it to make systems actually do what humans want would have. So the question is: are they going to release their work for others to use, or hoard it until someone less scrupulous releases their models? It looks like they're trying to keep it "in the family" so only corporations can use it, which is concerning.
If the push for human understandability hadn't happened, the next step might have been entirely automated sparsification, which doesn't necessarily produce anything humans can use to understand the model. Distillation into understandable models is an extremely powerful trajectory.
Not the same paper, but related: https://twitter.com/jamespayor/status/1634447672303304705